GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning
Summary
GRZO is a novel zeroth-order optimization method for fine-tuning large language models that reduces variance by using group-relative normalization, achieving better accuracy and memory efficiency compared to MeZO.
# GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning
Source: [https://arxiv.org/html/2606.02857](https://arxiv.org/html/2606.02857)
Liyan Tan, Yequan Zhao, Yifan Yang, Ruijie Zhang, Xinling Yu, Zheng Zhang

University of California, Santa Barbara

{liyan\_tan, yequan\_zhao, ruijiezhang, xyu644}@ucsb.edu, yifanyang@cs.ucsb.edu, zhengzhang@ece.ucsb.edu
###### Abstract
Zeroth-order (ZO) optimization is a memory-efficient alternative to backpropagation for fine-tuning large language models, but its deployment is limited by the high variance of gradient estimation. We propose GRZO, a Group-Relative Zeroth-Order optimizer that draws one pseudo-independent perturbation per mini-batch example and aggregates the per-example losses through group-relative normalization, raising the effective gradient-direction count from one to the batch size at no additional forward cost while preserving inference-level memory. We prove that GRZO is directionally unbiased with variance shrinking in inverse proportion to the batch size, yielding a tighter nonconvex convergence bound than MeZO. Across RoBERTa-large, Llama3-8B, and OPT-13B over multiple tasks, GRZO improves average accuracy on Llama3-8B by $+3.0$ over MeZO at $23\%$ lower peak GPU memory; as a drop-in replacement for the MeZO core, it lifts sparse, low-rank, and quantized ZO variants by $+6.0$ on average.
## 1 Introduction

Figure 1: GRZO at a glance on RTE (Llama3-8B). (a) Efficiency and accuracy comparison (left): lowest peak memory ($16.0$ GB), highest accuracy ($81.6\%$), and MeZO-comparable per-step time. (b) Loss convergence (right): fastest convergence in both training steps and wall-clock time.

Fine-tuning large language models (LLMs) for downstream tasks remains essential, yet first-order fine-tuning is expensive: backpropagation requires storing activations, gradients, and optimizer states, and these costs grow linearly with model scale. Memory-efficient methods such as LoRA (Hu et al., [2022](https://arxiv.org/html/2606.02857#bib.bib13)), Adapter (Houlsby et al., [2019](https://arxiv.org/html/2606.02857#bib.bib43)), Prefix-Tuning (Li and Liang, [2021](https://arxiv.org/html/2606.02857#bib.bib42)), Prompt-Tuning (Lester et al., [2021](https://arxiv.org/html/2606.02857#bib.bib44)), GaLore (Zhao et al., [2024](https://arxiv.org/html/2606.02857#bib.bib18)), CoLA (Liu et al., [2025b](https://arxiv.org/html/2606.02857#bib.bib39)), and Lax (Zhang et al., [2026](https://arxiv.org/html/2606.02857#bib.bib40)) reduce certain memory footprints but still rely on backpropagation, inheriting most of its activation-storage cost. Moreover, many practical objectives (accuracy, F1, reward signals) are non-differentiable and lie outside the first-order pipeline. These considerations motivate zeroth-order (ZO) fine-tuning as a forward-only alternative.
The canonical ZO method for LLM fine-tuning is MeZO (Malladi et al., [2023a](https://arxiv.org/html/2606.02857#bib.bib2)), a two-point estimator that approximates the gradient from the loss difference between two perturbed forward passes. Malladi et al. ([2023a](https://arxiv.org/html/2606.02857#bib.bib2)) report up to a $12\times$ memory reduction over SGD (Amari, [1993](https://arxiv.org/html/2606.02857#bib.bib3)) and AdamW (Loshchilov and Hutter, [2019](https://arxiv.org/html/2606.02857#bib.bib4)) fine-tuning, keeping training memory near inference levels while remaining compatible with non-differentiable objectives. The catch is that MeZO uses a single random perturbation direction per step; the variance of this estimator grows with model dimension, producing slow descent and brittle optimization once the backbone reaches the multi-billion-parameter range.
This high variance makes MeZO prone to slow or sub-optimal convergence. A growing literature addresses it by reducing the dimension via low-rank (Chen et al., [2025](https://arxiv.org/html/2606.02857#bib.bib1)) or sparse (Liu et al., [2025a](https://arxiv.org/html/2606.02857#bib.bib10); Zhang et al., [2025](https://arxiv.org/html/2606.02857#bib.bib8)) perturbations; in LLM fine-tuning, however, the loss landscape exhibits low effective rank (Aghajanyan et al., [2021](https://arxiv.org/html/2606.02857#bib.bib32); Malladi et al., [2023b](https://arxiv.org/html/2606.02857#bib.bib31)), so the convergence rate can be independent of the parameter count. Another direction designs lower-variance ZO gradient estimators via control variates (Gautam et al., [2024](https://arxiv.org/html/2606.02857#bib.bib5)), Hessian curvature (Zhao et al., [2025a](https://arxiv.org/html/2606.02857#bib.bib6)), or minimum-variance two-point estimators (Ma and Huang, [2025](https://arxiv.org/html/2606.02857#bib.bib22)). The drawback is that these estimators introduce additional computation or memory overhead, eroding the system benefits that motivated BP-free LLM fine-tuning.
These approaches achieve variance reduction at an additional cost (a narrowed update space, extra forward passes, or extra persistent memory), eroding the inference-level efficiency that motivated ZO in the first place. We identify a third, long-overlooked axis for variance reduction: *the mini-batch itself*. Existing ZO methods reuse a single perturbation direction across all $B$ examples of a step, even though the loss is evaluated per example. Drawing $B$ pseudo-independent directions instead, one per example, would by standard Monte Carlo reduce the variance of the SPSA (simultaneous perturbation stochastic approximation) (Spall, [2002](https://arxiv.org/html/2606.02857#bib.bib25)) estimator by a factor of $B$, at no additional forward cost, no parameter-space restriction, and no extra persistent memory. Realizing this axis efficiently, while preserving MeZO's two-forward-pass budget and inference-level memory footprint, is the central design problem of this paper.
We propose GRZO (Group-Relative Zeroth-Order Optimization), which realizes this axis by drawing $B$ pseudo-independent perturbations via Flipout-style sign factorization (Wen et al., [2018](https://arxiv.org/html/2606.02857#bib.bib11)) and aggregating the resulting per-example loss differences through GRPO-style group-relative normalization (Shao et al., [2024](https://arxiv.org/html/2606.02857#bib.bib26)), all within a single two-forward-pass step. Our contributions are:
- Algorithm. A ZO optimizer that turns the mini-batch into pseudo-independent perturbation directions while preserving MeZO's two-forward-pass budget and inference-level memory.
- Theory. We show the directional unbiasedness and batch-size-scaled variance reduction of GRZO, yielding a strictly tighter nonconvex convergence bound than single-direction ZO.
- Experimental results. We show that GRZO outperforms MeZO and its variants on multiple language models. We further show that GRZO is complementary to sparse, low-rank, and quantized ZO variants and that they can be combined for further performance gains.
A mechanism-by-mechanism comparison of representative ZO methods is in Table [4](https://arxiv.org/html/2606.02857#A2.T4) (Appendix [C](https://arxiv.org/html/2606.02857#A3)); these approaches are largely orthogonal to GRZO and compose with it.

Figure 2: Side-by-side pipeline comparison of MeZO (left) and GRZO (right). By constructing pseudo-independent perturbations and applying group-relative normalization, GRZO achieves $B$ effective perturbation directions and $1/B$ of the gradient variance under the same forward-pass budget as MeZO.
## 2 Background and Related Work

### 2.1 Zeroth-Order (ZO) Optimization

Zeroth-order (ZO) optimization (Nesterov and Spokoiny, [2017](https://arxiv.org/html/2606.02857#bib.bib29); Ghadimi and Lan, [2013](https://arxiv.org/html/2606.02857#bib.bib24)) adjusts model parameters $\boldsymbol{\theta}\in\mathbb{R}^{d}$ using only forward queries of the loss $\mathcal{L}(\boldsymbol{\theta})$, avoiding the memory overhead of the activation and gradient buffers used in backpropagation. A ZO optimizer still performs the gradient-descent update $\boldsymbol{\theta}_{t}\leftarrow\boldsymbol{\theta}_{t-1}-\alpha\,\mathbf{g}$, but it approximates the gradient $\mathbf{g}$ via $N$ perturbed forward passes

$$\mathbf{g}\approx\widehat{\nabla}_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})=\sum_{i=1}^{N}\frac{1}{N\mu}\left[\mathcal{L}(\boldsymbol{\theta}+\mu\boldsymbol{\xi}_{i})-\mathcal{L}(\boldsymbol{\theta})\right]\boldsymbol{\xi}_{i},\tag{1}$$

with $\{\boldsymbol{\xi}_{i}\}_{i=1}^{N}$ drawn i.i.d. from an isotropic distribution $\rho(\boldsymbol{\xi})$ (e.g., $\mathcal{N}(\mathbf{0},\mathbf{I})$ or Rademacher) and $\mu>0$ a small sampling radius. The estimator $\widehat{\nabla}_{\boldsymbol{\theta}}\mathcal{L}$ is unbiased w.r.t. the gradient of the smoothed surrogate $f_{\mu}(\boldsymbol{\theta}):=\mathbb{E}_{\boldsymbol{\xi}\sim\rho}[\mathcal{L}(\boldsymbol{\theta}+\mu\boldsymbol{\xi})]$, but biased w.r.t. the true gradient $\nabla_{\boldsymbol{\theta}}\mathcal{L}$ (Berahas et al., [2022](https://arxiv.org/html/2606.02857#bib.bib33)), and its variance carries a dimension-dependent factor $O(d/N)$ at $\mu=O(1/\sqrt{N})$ (Liu et al., [2020](https://arxiv.org/html/2606.02857#bib.bib28); Duchi et al., [2015](https://arxiv.org/html/2606.02857#bib.bib23); Gao and Sener, [2022](https://arxiv.org/html/2606.02857#bib.bib30)).
MeZO, one of the most popular ZO optimizers for LLM fine-tuning, is the $N=1$ two-sided instantiation of Eq. ([1](https://arxiv.org/html/2606.02857#S2.E1)): a seed-regenerated direction $\boldsymbol{\xi}\in\mathbb{R}^{d}$ drives two symmetric forward passes $\ell^{\pm}=\mathcal{L}(\boldsymbol{\theta}\pm\mu\boldsymbol{\xi};\mathcal{B})$, and parameters are updated in place along $-\alpha\,\frac{\ell^{+}-\ell^{-}}{2\mu}\,\boldsymbol{\xi}$ without materializing the perturbation tensor. Two knobs govern the estimator's quality: the parameter dimension $d$ over which its variance scales, and the SPSA construction itself. The ZO fine-tuning literature addresses these knobs separately.
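The seed-regeneration idea can be illustrated on a toy problem. The following NumPy sketch is a minimal illustration under assumptions that are not taken from the MeZO code base: the names `mezo_step` and `toy_loss`, the flat parameter vector, the quadratic loss, and the hyperparameter values are all hypothetical. The direction is re-drawn from the seed each time it is needed, so no copy of it is kept between uses.

```python
import numpy as np

def mezo_step(theta, loss_fn, lr, mu, seed):
    """One two-sided SPSA step on a flat parameter vector (hypothetical helper)."""
    def direction():
        # Re-create the same Gaussian direction from the seed on every call.
        return np.random.default_rng(seed).standard_normal(theta.shape)

    theta[...] += mu * direction()         # move to theta + mu * xi
    loss_plus = loss_fn(theta)
    theta[...] -= 2.0 * mu * direction()   # move to theta - mu * xi
    loss_minus = loss_fn(theta)
    theta[...] += mu * direction()         # restore theta
    projected_grad = (loss_plus - loss_minus) / (2.0 * mu)
    theta[...] -= lr * projected_grad * direction()   # in-place update
    return projected_grad

# Toy quadratic objective (an assumption made only for this example).
target = np.linspace(-1.0, 1.0, 8)

def toy_loss(theta):
    return 0.5 * float(np.sum((theta - target) ** 2))

theta = np.zeros(8)
initial = toy_loss(theta)
for step in range(500):
    mezo_step(theta, toy_loss, lr=0.05, mu=1e-3, seed=step)
final = toy_loss(theta)
print(f"loss: {initial:.3f} -> {final:.3e}")
```

In a real model the direction would be regenerated tensor by tensor rather than as one flat vector; the sketch only shows that the scalar `projected_grad` and the seed suffice to reproduce the update.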
### 2.2 Reducing the Effective Dimension

One family of approaches shrinks the update space to mitigate the $O(d)$ variance scaling. DeepZero (Chen et al., [2024](https://arxiv.org/html/2606.02857#bib.bib34)) and Sparse-MeZO (Liu et al., [2025a](https://arxiv.org/html/2606.02857#bib.bib10)) restrict updates to a sparsity mask, and MaZO (Zhang et al., [2025](https://arxiv.org/html/2606.02857#bib.bib8)) extends masking to multi-task fine-tuning. Low-rank methods reparameterize the perturbation through low-rank matrices (Chen et al., [2025](https://arxiv.org/html/2606.02857#bib.bib1)) or tensors (Zhao et al., [2023](https://arxiv.org/html/2606.02857#bib.bib35); Yang et al., [2024](https://arxiv.org/html/2606.02857#bib.bib7)). These methods exploit the low effective rank of LLM fine-tuning at the cost of full-parameter expressivity. Along an orthogonal axis, QuZO (Zhou et al., [2025](https://arxiv.org/html/2606.02857#bib.bib9)) and Poor-Man's Training (Zhao et al., [2025b](https://arxiv.org/html/2606.02857#bib.bib27)) reduce memory via low-bit forward passes without altering the SPSA estimator.
### 2.3 Improving the Estimator Construction

A second family of research keeps the full-parameter update space and injects additional information into the estimator. MeZO-SVRG (Gautam et al., [2024](https://arxiv.org/html/2606.02857#bib.bib5)) pairs each probe with a periodic full-batch reference for an SVRG-style control variate, at the cost of doubled persistent memory; HiZOO (Zhao et al., [2025a](https://arxiv.org/html/2606.02857#bib.bib6)) adds a diagonal-Hessian preconditioner estimated from one extra forward pass per step; FZOO (Dang et al., [2026](https://arxiv.org/html/2606.02857#bib.bib14)) samples $N$ parallel directions for $1/N$-scale variance reduction at $N+1$ forwards per step; subspace-orthogonalization (Lang et al., [2026](https://arxiv.org/html/2606.02857#bib.bib21)) decorrelates the direction sequence across steps; Ma and Huang ([2025](https://arxiv.org/html/2606.02857#bib.bib22)) revisit minimum-variance two-point estimator design; and SharpZO (Yang et al., [2026](https://arxiv.org/html/2606.02857#bib.bib41)) extends the forward-only paradigm to sharpness-aware VLM prompt tuning. In every case, the variance reduction is achieved with extra forwards, extra persistent memory, or both.
## 3 Method

Both ZO families reviewed in Section [2](https://arxiv.org/html/2606.02857#S2) reduce variance at a cost: shrinking the effective parameter dimension sacrifices full-parameter expressivity, while enriching the SPSA estimator adds forward passes or persistent memory. GRZO instead turns the mini-batch dimension itself into the variance-reduction lever: a single two-forward-pass step yields $B$ pseudo-independent gradient directions, preserving the inference-level memory and two-forward-pass envelope of MeZO. Section [3.1](https://arxiv.org/html/2606.02857#S3.SS1) constructs the per-example perturbation directions via a sign factorization of a shared base perturbation; Section [3.2](https://arxiv.org/html/2606.02857#S3.SS2) then converts the resulting $B$ per-example loss signals into group-relative weights inspired by GRPO advantages. Figure [2](https://arxiv.org/html/2606.02857#S1.F2) contrasts the resulting pipeline with MeZO.
### 3.1 Per-Example Perturbations via Structured Injection

GRZO constructs per-example perturbations through a sign factorization originally proposed for Bayesian weight sampling (Wen et al., [2018](https://arxiv.org/html/2606.02857#bib.bib11)). For a linear layer with weight $\mathbf{W}\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$, we first generate a shared base perturbation matrix $\mathbf{U}\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$ in each step from a symmetric isotropic distribution (e.g., $\mathcal{N}(0,1)$ or Rademacher entries). Then for each data sample in $\{\mathbf{x}_{i},\mathbf{y}_{i}\}_{i=1}^{B}$, we modulate $\mathbf{U}$ with per-example sign vectors $\mathbf{r}_{i}\in\{\pm 1\}^{d_{\text{out}}}$ and $\mathbf{s}_{i}\in\{\pm 1\}^{d_{\text{in}}}$ (independent Rademacher pairs) to obtain

$$\Delta\mathbf{W}_{i}\;=\;\mathbf{U}\odot(\mathbf{r}_{i}\mathbf{s}_{i}^{\top}),\tag{2}$$

where $\odot$ denotes the element-wise (Hadamard) product. This construction yields $B$ pseudo-independent per-example perturbations from a single shared base $\mathbf{U}$, without materializing $B$ separate weight copies; see Figure [2](https://arxiv.org/html/2606.02857#S1.F2) for the pipeline and Appendix [B](https://arxiv.org/html/2606.02857#A2) for the vectorized form. The overhead is a small constant per linear layer (one extra matrix multiplication plus elementwise sign modulations).
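To make the "one extra matrix multiplication" concrete, the sketch below checks the standard Flipout identity $(\mathbf{W}+\sigma\,\mathbf{U}\odot\mathbf{r}_i\mathbf{s}_i^{\top})\mathbf{x}_i=\mathbf{W}\mathbf{x}_i+\sigma\,\mathbf{r}_i\odot\bigl(\mathbf{U}(\mathbf{s}_i\odot\mathbf{x}_i)\bigr)$ against a naive per-example loop. Appendix B is not reproduced here, so treating this identity as the vectorized form is an assumption; the function names, shapes, and the perturbation scale `sigma` are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out, sigma = 4, 5, 3, 1e-3

W = rng.standard_normal((d_out, d_in))        # base weight
U = rng.standard_normal((d_out, d_in))        # shared base perturbation
X = rng.standard_normal((B, d_in))            # one input row per example
R = rng.choice([-1.0, 1.0], size=(B, d_out))  # per-example output signs r_i
S = rng.choice([-1.0, 1.0], size=(B, d_in))   # per-example input signs s_i

def perturbed_forward_naive(W, U, X, R, S, sigma):
    """Materialize Delta W_i = U * outer(r_i, s_i) for every example."""
    outputs = []
    for x, r, s in zip(X, R, S):
        delta_w = U * np.outer(r, s)
        outputs.append((W + sigma * delta_w) @ x)
    return np.stack(outputs)

def perturbed_forward_vectorized(W, U, X, R, S, sigma):
    """Same result with one extra matmul and two elementwise sign products."""
    return X @ W.T + sigma * R * ((X * S) @ U.T)

naive = perturbed_forward_naive(W, U, X, R, S, sigma)
fast = perturbed_forward_vectorized(W, U, X, R, S, sigma)
max_abs_diff = float(np.max(np.abs(naive - fast)))
print(max_abs_diff)
```

The vectorized path never forms a per-example weight matrix, which is why the batch can carry $B$ distinct perturbations at roughly the cost of a single shared one.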
###### Lemma 1 (Isotropy and conditional decorrelation).

Let $\mathbf{z}_{i}:=\mathrm{vec}(\Delta\mathbf{W}_{i})=\mathrm{vec}(\mathbf{U}\odot(\mathbf{r}_{i}\mathbf{s}_{i}^{\top}))$ with $\mathbf{U}$ and $\{(\mathbf{r}_{i},\mathbf{s}_{i})\}_{i=1}^{B}$ as defined above. Then $\mathbb{E}[\mathbf{z}_{i}\mathbf{z}_{i}^{\top}]=\mathbf{I}_{d}$ marginally and $\mathbb{E}[\mathbf{z}_{i}\mathbf{z}_{j}^{\top}\mid\mathbf{U}]=0$ for $i\neq j$ (Appendix [H.6.1](https://arxiv.org/html/2606.02857#A8.SS6.SSS1)).
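Lemma 1 can be checked exactly on a tiny layer by enumerating every sign pattern. The sketch below is an illustrative check, not the proof in the appendix; the layer size and helper name `all_sign_vectors` are hypothetical. Because every $(\mathbf{r},\mathbf{s})$ pair is equally likely, averaging over the enumeration gives the exact conditional expectations for the fixed $\mathbf{U}$ drawn here.

```python
import itertools
import numpy as np

d_out, d_in = 2, 3
rng = np.random.default_rng(0)
U = rng.standard_normal((d_out, d_in))   # one fixed base perturbation

def all_sign_vectors(n):
    """All 2**n vectors with entries in {-1, +1}."""
    return [np.array(signs) for signs in itertools.product([-1.0, 1.0], repeat=n)]

# Row k holds z = vec(U * outer(r, s)) for the k-th (r, s) combination.
directions = np.array([
    (U * np.outer(r, s)).ravel()
    for r in all_sign_vectors(d_out)
    for s in all_sign_vectors(d_in)
])

cond_mean = directions.mean(axis=0)                                  # E[z_i | U]
cond_second_moment = directions.T @ directions / len(directions)     # E[z_i z_i^T | U]
# Sign pairs of different examples are independent given U, so
# E[z_i z_j^T | U] = E[z_i | U] E[z_j | U]^T for i != j.
cond_cross_moment = np.outer(cond_mean, cond_mean)

print(np.abs(cond_cross_moment).max())
print(np.allclose(cond_second_moment, np.diag(U.ravel() ** 2)))
```

Given $\mathbf{U}$, the second moment is $\mathrm{diag}(\mathrm{vec}(\mathbf{U})^{2})$; it averages to $\mathbf{I}_d$ once the unit-variance entries of $\mathbf{U}$ are integrated out, which is the marginal statement of the lemma.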
### 3.2 Group-Relative Aggregation for Zeroth-Order Updates

Given the per-example perturbations $\Delta\mathbf{W}_{i}$ in Eq. ([2](https://arxiv.org/html/2606.02857#S3.E2)) and a perturbation scale $\sigma>0$, example $i$ contributes the two-sided perturbed losses $\ell_{i}^{\pm}=L(\boldsymbol{\theta}\pm\sigma\Delta\mathbf{W}_{i};\mathbf{x}_{i},\mathbf{y}_{i})$ and the perturbation-induced loss difference

$$\delta_{i}:=\ell_{i}^{+}-\ell_{i}^{-}.\tag{3}$$

GRZO converts $\{\delta_{i}\}_{i=1}^{B}$ into advantage-like weights via group-relative normalization. We compute the within-batch standard deviation

$$s=\sqrt{\frac{1}{B}\sum_{i=1}^{B}\bigl(\delta_{i}-\bar{\delta}\bigr)^{2}}\quad\text{with}\quad\bar{\delta}=\frac{1}{B}\sum_{i=1}^{B}\delta_{i},\tag{4}$$

and define the group-relative weights

$$a_{i}=\frac{\delta_{i}}{s+\epsilon}.\tag{5}$$

Here $\epsilon>0$ is a small constant to ensure numerical stability. These weights are scale-invariant under rescaling of $\{\delta_{i}\}$, decoupling the update from loss-magnitude drift; two-sided differences also give $\mathbb{E}[\delta_{i}]=0$ by the $\mathbf{z}_{i}\leftrightarrow-\mathbf{z}_{i}$ symmetry, so the numerator needs no explicit mean-centering. The update direction is

$$\hat{\mathbf{g}}\;=\;\frac{1}{2\sigma B}\sum_{i=1}^{B}a_{i}\,\mathbf{z}_{i},\tag{6}$$

where $\mathbf{z}_{i}$ is the per-example perturbation direction from Section [3.1](https://arxiv.org/html/2606.02857#S3.SS1); full pseudocode is in Algorithm [1](https://arxiv.org/html/2606.02857#alg1) (Appendix [A](https://arxiv.org/html/2606.02857#A1)). The conditional decorrelation in Lemma [1](https://arxiv.org/html/2606.02857#Thmlemma1) eliminates the cross-example covariance terms in $\mathrm{Var}(\hat{\mathbf{g}})$; combined with $\sum_{i}a_{i}^{2}\approx B$ for two-sided differences, this controls the diagonal of the variance bound (Appendix [H.6](https://arxiv.org/html/2606.02857#A8.SS6)).
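Equations (3) to (6) can be traced end to end on a single linear layer. The sketch below is an illustration under stated assumptions rather than the paper's Algorithm 1: the squared-error loss, the layer sizes, the value of `eps`, and all variable names are hypothetical, and the perturbed forward uses the standard Flipout identity instead of materializing each $\Delta\mathbf{W}_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 16, 5, 3
sigma, eps = 1e-3, 1e-8

W = rng.standard_normal((d_out, d_in))        # layer weight (toy model)
X = rng.standard_normal((B, d_in))            # inputs, one row per example
Y = rng.standard_normal((B, d_out))           # regression targets (toy data)

U = rng.standard_normal((d_out, d_in))        # shared base perturbation
R = rng.choice([-1.0, 1.0], size=(B, d_out))  # sign vectors r_i
S = rng.choice([-1.0, 1.0], size=(B, d_in))   # sign vectors s_i

def per_example_losses(sign):
    """Losses at W + sign * sigma * (U * outer(r_i, s_i)), one value per example."""
    outputs = X @ W.T + sign * sigma * R * ((X * S) @ U.T)
    return 0.5 * np.sum((outputs - Y) ** 2, axis=1)

delta = per_example_losses(+1.0) - per_example_losses(-1.0)   # Eq. (3)
s = np.sqrt(np.mean((delta - delta.mean()) ** 2))             # Eq. (4)
a = delta / (s + eps)                                         # Eq. (5)

# Eq. (6) in matrix form: sum_i a_i * U * outer(r_i, s_i) = U * (R^T diag(a) S).
g_hat = U * ((R * a[:, None]).T @ S) / (2.0 * sigma * B)

# Reference: explicit sum over examples.
g_hat_naive = sum(a[i] * U * np.outer(R[i], S[i]) for i in range(B)) / (2.0 * sigma * B)

print(np.allclose(g_hat, g_hat_naive), float(np.sum(a ** 2)))
```

The matrix form reduces the aggregation to one product of sign matrices followed by an elementwise multiply with $\mathbf{U}$. Since $\sum_i\delta_i^2\ge Bs^2$, the printed $\sum_i a_i^2$ is at least $B$ up to the effect of `eps`.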
## 4 Theoretical Analysis

The construction in Section [3](https://arxiv.org/html/2606.02857#S3) promises $B$ pseudo-independent gradient directions per step at no extra forward cost. We now state the formal guarantees. The GRZO estimator (i) is directionally unbiased for the gradient of the smoothed objective $F_{\sigma}(\boldsymbol{\theta})=\mathbb{E}_{\mathbf{z}}[F(\boldsymbol{\theta}+\sigma\mathbf{z})]$ up to a positive scaling absorbed into the effective learning rate, (ii) has roughly $1/B_{\mathrm{eff}}$ of MeZO's variance under the same forward budget, where $B_{\mathrm{eff}}$ is the number of effectively independent perturbation directions ($B_{\mathrm{eff}}\approx B$ at typical batch sizes; detailed below), and (iii) improves the MeZO convergence bound by a factor of $\sqrt{B_{\mathrm{eff}}}$. Full proofs are in Appendix [H](https://arxiv.org/html/2606.02857#A8).
###### Theorem 1 (Directional Unbiasedness (Informal)).

Under standard smoothness assumptions, the GRZO estimator satisfies

$$\mathbb{E}[\hat{\mathbf{g}}_{t}\mid\boldsymbol{\theta}_{t}]\;=\;c_{t}\cdot\nabla F_{\sigma}(\boldsymbol{\theta}_{t})+O(\sigma^{2}),$$

where $c_{t}>0$ is a positive scalar absorbed into the effective learning rate, and $O(\sigma^{2})$ is the standard ZO smoothing bias, which vanishes as $\sigma\to 0$.
###### Theorem 2 (Variance Bound).

Under standard smoothness assumptions, the GRZO estimator satisfies

$$\mathrm{Var}\bigl(\hat{\mathbf{g}}_{\mathrm{GRZO}}(\boldsymbol{\theta})\bigr)\;\leq\;\frac{d-1}{B}\bigl(\|\nabla F(\boldsymbol{\theta})\|^{2}+\nu^{2}\bigr)+O(\rho^{2}\sigma^{4}d^{4}),$$

where $d=d_{\text{out}}d_{\text{in}}$. (Proof: Appendix [H.6](https://arxiv.org/html/2606.02857#A8.SS6).)
Building on Theorems [1](https://arxiv.org/html/2606.02857#Thmtheorem1)–[2](https://arxiv.org/html/2606.02857#Thmtheorem2), we state the full nonconvex convergence guarantee. Let $\hat{\mathbf{g}}_{t}$ denote the GRZO estimator ([6](https://arxiv.org/html/2606.02857#S3.E6)).

###### Assumption 1.

$F_{\sigma}$ is $\mathcal{L}$-smooth and lower bounded by $F_{\sigma}^{\star}$. The step size $\eta$ satisfies the stability condition in Appendix [H.7](https://arxiv.org/html/2606.02857#A8.SS7). Per-example gradients satisfy $\mathbb{E}\|\nabla\ell(\boldsymbol{\theta};\xi)-\nabla F(\boldsymbol{\theta})\|^{2}\leq\nu^{2}$.
###### Theorem 3 (Nonconvex Convergence).

Under Assumption [1](https://arxiv.org/html/2606.02857#Thmassumption1), GRZO iterations satisfy

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl\|\nabla F_{\sigma}(\boldsymbol{\theta}_{t})\bigr\|^{2}\;\leq\;\frac{4\bigl(F_{\sigma}(\boldsymbol{\theta}_{0})-F_{\sigma}^{\star}\bigr)}{\eta T}+2\mathcal{L}\eta\,\overline{\mathcal{V}}_{\mathrm{GRZO}},\tag{7}$$

where $\overline{\mathcal{V}}_{\mathrm{GRZO}}$ is the per-step variance bounded in Appendix [H.6](https://arxiv.org/html/2606.02857#A8.SS6). (Formal statement: Theorem [4](https://arxiv.org/html/2606.02857#Thmtheorem4) in Appendix [H.7](https://arxiv.org/html/2606.02857#A8.SS7).)
Comparing the leading $\frac{d-1}{B}$ scaling in Theorem [2](https://arxiv.org/html/2606.02857#Thmtheorem2) to MeZO's analogous bound (which lacks the $1/B$ factor, as MeZO uses a single perturbation per step) gives $\overline{\mathcal{V}}_{\mathrm{GRZO}}\approx\overline{\mathcal{V}}_{\mathrm{MeZO}}/B_{\mathrm{eff}}$ at matched forward budget, so with $\eta\propto 1/\sqrt{T}$ the stationarity bound improves by a factor of $\sqrt{B_{\mathrm{eff}}}$ relative to MeZO. In practice, $B\geq 16$ is needed for stable group-relative normalization (Section [5.4](https://arxiv.org/html/2606.02857#S5.SS4.SSS0.Px3)).
##### Interpreting the Variance Reduction\.
$B_{\mathrm{eff}}$ captures how many effectively independent perturbation directions GRZO extracts from one mini-batch. The per-example perturbations are only *conditionally* independent given the shared base $\mathbf{U}$, so in principle $B_{\mathrm{eff}}$ is slightly below $B$. Lemma [1](https://arxiv.org/html/2606.02857#Thmlemma1), however, shows that the cross-example covariance vanishes in expectation, so $B_{\mathrm{eff}}\approx B$ at typical batch sizes. At the default $B=16$, GRZO achieves a $\sim 4\times$ reduction in the per-step gradient-estimate standard deviation. The faster training-loss descent visible in Figure [3](https://arxiv.org/html/2606.02857#S5.F3) is a direct consequence.
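The $\sqrt{B}$ scaling of the standard deviation can be reproduced in an idealized Monte Carlo experiment. The sketch below is not the GRZO estimator: it assumes fully independent Gaussian directions (rather than the sign-factorized construction), identical per-example gradients ($\nu=0$), and a linear loss so that the two-sided difference equals the exact directional derivative. Dimension, trial count, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, B, trials = 16, 16, 10_000
g = rng.standard_normal(d)            # true gradient, shared by every example

# (a) One shared direction per step: g_hat = (z . g) z.
z = rng.standard_normal((trials, d))
shared = (z @ g)[:, None] * z

# (b) B independent directions per step, averaged over the batch.
zb = rng.standard_normal((trials, B, d))
per_example = ((zb @ g)[..., None] * zb).mean(axis=1)

def rms_error(estimates):
    """Root-mean-square distance between the estimates and the true gradient."""
    return float(np.sqrt(np.mean(np.sum((estimates - g) ** 2, axis=1))))

std_ratio = rms_error(shared) / rms_error(per_example)
print(f"std ratio at B={B}: {std_ratio:.2f} (sqrt(B) = {np.sqrt(B):.1f})")
```

Under these assumptions the ratio concentrates near $\sqrt{16}=4$; any gap between $B_{\mathrm{eff}}$ and $B$ in the actual method comes from the shared base $\mathbf{U}$, which this idealized setup deliberately leaves out.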
##### Convergence Rate Comparison with MeZO\.
Setting $\eta=c/\sqrt{T}$ in ([7](https://arxiv.org/html/2606.02857#S4.E7)), the dominant stationarity term becomes $O(\overline{\mathcal{V}}_{\mathrm{GRZO}}/\sqrt{T})$. Since $\overline{\mathcal{V}}_{\mathrm{GRZO}}\approx\overline{\mathcal{V}}_{\mathrm{MeZO}}/B_{\mathrm{eff}}$, GRZO achieves an $\epsilon$-stationary point in $B_{\mathrm{eff}}\times$ fewer steps than MeZO under the same per-step forward budget. Crucially, this improvement is *free* in terms of forward evaluations: both methods perform exactly two forward passes per step, but GRZO amortizes its variance reduction over the batch dimension rather than requiring additional perturbation queries.
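The $\sqrt{B_{\mathrm{eff}}}$ and $B_{\mathrm{eff}}\times$ factors can be recovered from (7) with a short calculation. This is a sketch under two assumptions not stated in the main text: the constant in $\eta\propto 1/\sqrt{T}$ is tuned to balance the two terms, and the resulting step size still satisfies the stability condition of Assumption 1. The shorthand $\Delta_{0}:=F_{\sigma}(\boldsymbol{\theta}_{0})-F_{\sigma}^{\star}$ is introduced here for brevity.

$$
\begin{aligned}
\text{RHS of (7)} &= \frac{4\Delta_{0}}{\eta T}+2\mathcal{L}\eta\,\overline{\mathcal{V}}
\quad\Longrightarrow\quad
\eta^{\star}=\sqrt{\frac{2\Delta_{0}}{\mathcal{L}\,\overline{\mathcal{V}}\,T}},\\
\min_{\eta}\ \text{RHS} &= 2\sqrt{\frac{4\Delta_{0}}{T}\cdot 2\mathcal{L}\,\overline{\mathcal{V}}}
=4\sqrt{2}\,\sqrt{\frac{\Delta_{0}\,\mathcal{L}\,\overline{\mathcal{V}}}{T}},\\
\text{RHS}\le\epsilon &\iff T\ \ge\ \frac{32\,\Delta_{0}\,\mathcal{L}\,\overline{\mathcal{V}}}{\epsilon^{2}}.
\end{aligned}
$$

The balanced bound scales as $\sqrt{\overline{\mathcal{V}}}$, so substituting $\overline{\mathcal{V}}_{\mathrm{GRZO}}\approx\overline{\mathcal{V}}_{\mathrm{MeZO}}/B_{\mathrm{eff}}$ shrinks it by $\sqrt{B_{\mathrm{eff}}}$ at fixed $T$. The required number of steps is linear in $\overline{\mathcal{V}}$, so it drops by a factor of $B_{\mathrm{eff}}$ at fixed $\epsilon$.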
## 5 Experiments

We consider two architectural families. Masked language models such as BERT (Devlin et al., [2019](https://arxiv.org/html/2606.02857#bib.bib19)) and RoBERTa (Liu et al., [2019](https://arxiv.org/html/2606.02857#bib.bib20)) learn bidirectional representations under a masked-token objective; auto-regressive language models such as Llama (Grattafiori et al., [2024](https://arxiv.org/html/2606.02857#bib.bib16)) and OPT (Zhang et al., [2022](https://arxiv.org/html/2606.02857#bib.bib17)) predict the next token. We benchmark GRZO under full-parameter fine-tuning (Section [5.1](https://arxiv.org/html/2606.02857#S5.SS1)), report memory and per-step time on Llama3-8B (Section [5.2](https://arxiv.org/html/2606.02857#S5.SS2)), show that GRZO composes with parameter-efficient ZO variants such as Sparse-MeZO, LOZO, and QuZO (Section [5.3](https://arxiv.org/html/2606.02857#S5.SS3)), and ablate the components of GRZO (Section [5.4](https://arxiv.org/html/2606.02857#S5.SS4)).

Setup. We compare against first-order baselines (Adam (Kingma and Ba, [2015](https://arxiv.org/html/2606.02857#bib.bib12)), LoRA (Hu et al., [2022](https://arxiv.org/html/2606.02857#bib.bib13))) and zeroth-order baselines (MeZO (Malladi et al., [2023a](https://arxiv.org/html/2606.02857#bib.bib2)), FZOO (Dang et al., [2026](https://arxiv.org/html/2606.02857#bib.bib14))) on classification tasks from GLUE (Wang et al., [2018](https://arxiv.org/html/2606.02857#bib.bib36)) (SST-2) and SuperGLUE (Wang et al., [2019](https://arxiv.org/html/2606.02857#bib.bib15)) (RTE, CB, BoolQ, WiC, MultiRC, COPA), and the QA tasks SQuAD (Rajpurkar et al., [2016](https://arxiv.org/html/2606.02857#bib.bib37)) and DROP (Dua et al., [2019](https://arxiv.org/html/2606.02857#bib.bib38)). All methods train for $20$k steps at batch size 16 in FP16; ZO methods use perturbation scale $\sigma=10^{-3}$. For RoBERTa-large we follow the $k=512$ few-shot protocol of Malladi et al. ([2023a](https://arxiv.org/html/2606.02857#bib.bib2)). Hyperparameter details are in Appendix [D](https://arxiv.org/html/2606.02857#A4).
### 5.1 Accuracy and Convergence

Table 1: Results on RoBERTa-large (350M, $k=512$). FO methods are marked with orange bullets. Among ZO methods, the highest accuracy is highlighted in bold.

##### Masked Language Models.

We evaluate GRZO on RoBERTa-large (350M) under the $k=512$ few-shot setting. Table [1](https://arxiv.org/html/2606.02857#S5.T1) shows GRZO outperforms MeZO on all six tasks and FZOO on four of six, with the largest gains on SNLI ($+0.8$) and RTE ($+0.9$). Averaged across tasks, GRZO reaches 81.1, vs. 80.9 for FZOO and 80.0 for MeZO.

Table 2: Results on Llama3-8B across SuperGLUE and QA tasks. FO methods are marked with orange bullets. Among ZO methods, the highest accuracy is highlighted in bold.
##### Auto\-Regressive Language Models\.
We expand to Llama3-8B and OPT-13B on the same SuperGLUE+QA suite. Table [2](https://arxiv.org/html/2606.02857#S5.T2) shows GRZO is the best ZO method on 7/9 Llama3-8B tasks (average 78.6 vs. 76.5 for FZOO and 75.6 for MeZO, i.e., $+2.1$ over FZOO under the same two-forward-pass budget and $+3.0$ over MeZO) and on 6/9 OPT-13B tasks (average 70.5 vs. 70.0 for FZOO and 67.9 for MeZO). The largest gains over FZOO appear on tasks requiring deeper understanding: $+3.4$ on CB, $+5.0$ on RTE, and $+7.6$ on DROP for Llama3-8B; $+4.1$ on DROP, $+1.2$ on RTE, and $+1.0$ on COPA for OPT-13B.
Figure 3:Training\-loss curves on Llama3\-8B \(RTE, MultiRC\) and OPT\-13B \(SQuAD, DROP\) plotted against training steps and wall\-clock time\.
##### Convergence and Wall\-Clock Efficiency\.
Figure [3](https://arxiv.org/html/2606.02857#S5.F3) plots training-loss curves on Llama3-8B (RTE, MultiRC) and OPT-13B (SQuAD, DROP) against both training steps and wall-clock seconds. GRZO descends faster than FZOO and MeZO across all four panels and reaches a lower final loss on Llama-RTE, Llama-MultiRC, and OPT-SQuAD; on Llama-MultiRC in particular, GRZO matches the final loss of FZOO in roughly half the wall-clock time. MeZO diverges on OPT-DROP and makes no measurable progress on Llama-RTE or Llama-MultiRC.
### 5.2 Memory and Time Analysis

GRZO offers a strong memory advantage at a modest per-step time cost. Figure [5](https://arxiv.org/html/2606.02857#S5.F5) reports a production profile on Llama3-8B; Figure [4](https://arxiv.org/html/2606.02857#S5.F4) extends the memory picture across model sizes.
Figure 4: Peak GPU memory (GB) vs. model size for OPT (1.3B–30B). GRZO matches the inference footprint, consuming even less memory than MeZO.

Figure 5: Production profile on Llama3-8B (RTE, fp16, $B=16$, 4$\times$A100, mean over 20 steps $\times$ 4 ranks). (a) Peak GPU memory breakdown (left): peak GPU memory per step (model footprint vs. transient overhead). (b) Per-step time breakdown (right): per-step time decomposition.

##### Memory.
Fig. [5(a)](https://arxiv.org/html/2606.02857#S5.F5.sf1) shows that vanilla GRZO holds peak GPU memory at $16.02$ GB, essentially the bare model footprint (16.0 GB of fp16 weights, $+0.02$ GB of transient buffers), while MeZO peaks at $20.84$ GB ($+4.84$ GB during the update step); GRZO's peak is thus $23\%$ lower. The advantage carries over to combined variants (Section [5.3](https://arxiv.org/html/2606.02857#S5.SS3)): GRZO+X consistently uses less peak memory than MeZO+X across all four pairings, with full numbers in Appendix [E](https://arxiv.org/html/2606.02857#A5). Figure [4](https://arxiv.org/html/2606.02857#S5.F4) shows that this inference-level footprint scales smoothly across OPT 1.3B–30B, with GRZO using $1.3$–$1.4\times$ less memory than MeZO and $8$–$11\times$ less than full fine-tuning.
Table 3: GRZO with orthogonal ZO baselines on Llama3-8B. Parentheses: ($\Delta$ vs. paired baseline / $\Delta$ vs. vanilla GRZO). Best per task in bold. QuZO and Qu-GRZO both use 8-bit quantization for weights and perturbations.
##### Per\-Step Time\.
Figure [5(b)](https://arxiv.org/html/2606.02857#S5.F5.sf2) decomposes the per-step wall-clock cost. MeZO completes in $805$ ms; GRZO fuses the per-example perturbation into the forward via per-Linear pre-hooks, making this fused forward only $\sim$24% slower than MeZO's forward-plus-in-place-perturbation. The update shrinks to $82$ ms via sign-vector products, and GRZO totals $973$ ms ($+21\%$ over MeZO). This per-step cost is more than offset by the variance reduction (factor of $1/B_{\mathrm{eff}}$, Theorem [2](https://arxiv.org/html/2606.02857#Thmtheorem2)), which translates into proportionally fewer optimization steps to a target loss (Figure [3](https://arxiv.org/html/2606.02857#S5.F3)). The same trade applies to combined variants; full per-method breakdowns are in Appendix [E](https://arxiv.org/html/2606.02857#A5).
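The reported percentages, and the point at which the slower step pays for itself, follow from the numbers above by simple arithmetic. The snippet below only recomputes quantities from the reported figures (805 ms, 973 ms, 16.02 GB, 20.84 GB); the helper `wall_clock_speedup` and its `step_reduction` argument are hypothetical and do not correspond to any measured value.

```python
# Reported per-step times (ms) and peak memory (GB) on Llama3-8B.
mezo_ms, grzo_ms = 805.0, 973.0
mezo_peak_gb, grzo_peak_gb = 20.84, 16.02

time_overhead = grzo_ms / mezo_ms - 1.0               # extra time per GRZO step
memory_saving = 1.0 - grzo_peak_gb / mezo_peak_gb     # peak-memory reduction
break_even_step_ratio = mezo_ms / grzo_ms             # GRZO wins below this step ratio

def wall_clock_speedup(step_reduction):
    """Speedup if GRZO needs `step_reduction` times fewer steps (hypothetical input)."""
    return step_reduction * mezo_ms / grzo_ms

print(f"per-step overhead: {time_overhead:.1%}")
print(f"peak-memory saving: {memory_saving:.1%}")
print(f"break-even step ratio: {break_even_step_ratio:.3f}")
```

GRZO is faster in wall-clock terms whenever it reaches the target loss in fewer than about 83% of MeZO's steps; the actual step reduction on each task is what Figure 3 reports.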
Figure 6:Training loss curves on Llama3\-8B comparing vanilla GRZO with the three GRZO\-combined variants\.
##### Why GRZO Uses Less Memory\.
GRZO and MeZO both avoid the backward pass and optimizer state; their gap comes from the perturbation steps. MeZO mutates the weight tensor in place ($\mathbf{W}\leftarrow\mathbf{W}\pm\sigma\mathbf{z}$) and must keep a parameter-aligned noise tensor live during the update step. GRZO never modifies the base weight: each layer's perturbed weight is built transiently inside a forward pre-hook and freed before the next layer runs, so the perturbation overhead is at most one layer's worth at any moment. The sign factorization $\Delta\mathbf{W}_{i}=\mathbf{U}\odot(\mathbf{r}_{i}\mathbf{s}_{i}^{\top})$ further compresses per-example variation into $\pm 1$ sign vectors per layer, keeping multi-direction GRZO at the same per-layer overhead as a single-direction estimator.
### 5.3 Combination with ZO Variants
GRZO as a drop-in base for other ZO variants. We frame GRZO not only as a stronger ZO estimator than MeZO, but as a base on which existing MeZO variants, namely sparse (Sparse-MeZO), low-rank (LOZO), and quantized (QuZO) perturbations, can be dropped in for further gains. GRZO intervenes only at the MeZO core, swapping the single-direction perturbation and scalar update for multi-directional per-example perturbation and group-relative normalization; sparsity, low-rank, and quantization act on orthogonal axes of the perturbation and remain complementary to GRZO.
Table [3](https://arxiv.org/html/2606.02857#S5.T3) shows the composition is Pareto-favorable along two directions. First, every GRZO-combined variant outperforms its baseline on every shared task except one (LO-GRZO at $-0.6$ F1 on SQuAD); the most striking case is Qu-GRZO on DROP, where $52.3\to 63.9$ F1 ($+11.6$) brings the low-bit baseline within 1.1 F1 of vanilla GRZO (65.0), showing that GRZO offers a variance-reduction alternative that efficiency-axis techniques cannot reach on their own. Second, the combinations also beat *vanilla* GRZO on most tasks, showing that these ZO variants complement, rather than compete with, GRZO. Figure [6](https://arxiv.org/html/2606.02857#S5.F6) confirms both effects in the training dynamics; two per-variant comparison examples on SQuAD and BoolQ are in Figure [8](https://arxiv.org/html/2606.02857#A5.F8) (Appendix [F](https://arxiv.org/html/2606.02857#A6)). The standard ZO taxonomy (sparsity, low-rank, quantization, and variance reduction) thus reads as independent axes that can be stacked, with GRZO supplying the variance-reduction axis that has no native solution at MeZO’s forward budget.
##### Efficiency Benefit of the GRZO Core\.
Swapping the MeZO core for GRZO inside any ZO variant improves not only convergence quality but also resource efficiency. Across the three fp16 variant pairings (vanilla, LOZO, Sparse), GRZO+X reduces peak GPU memory by ~4.8 GB (18–23%) over MeZO+X. GRZO+X is 3–21% slower per step than MeZO+X because each forward fuses the per-example perturbation, but the variance reduction greatly reduces the number of optimization steps to a target loss (Figure [6](https://arxiv.org/html/2606.02857#S5.F6)). We exclude QuZO and Qu-GRZO from this efficiency claim because our fp16 implementations measure fake-quant overhead rather than the low-bit deployment regime targeted by Zhou et al. ([2025](https://arxiv.org/html/2606.02857#bib.bib9)); per-method numbers and a full discussion are in Appendix [E](https://arxiv.org/html/2606.02857#A5).
### 5.4 Ablation Study
Two ablations isolate GRZO’s design: which component—per\-example \(PE\) perturbation vs\. group\-relative normalization \(GN\)—drives convergence, and whether the choice of perturbation noise distribution materially affects performance\.
##### Component Ablation\.
Figure [7](https://arxiv.org/html/2606.02857#S5.F7) (left) compares full GRZO, GRZO without GN, and GRZO without both PE and GN (a MeZO-style estimator) on SST-2. Removing GN alone slows descent, confirming GN dominates convergence; further removing PE degrades it more, showing the two components are complementary: PE enables efficient batched estimation, and GN stabilizes the gradient signal. GRZO beats FZOO by $+0.4$ on SST-2 ($+2.1$ on Llama3-8B average; Section [5.1](https://arxiv.org/html/2606.02857#S5.SS1)), ruling out a normalization-only explanation.
Figure 7: Left: GRZO components on SST-2. Right: Perturbation ablation on DROP.
##### Perturbation Type\.
Figure [7](https://arxiv.org/html/2606.02857#S5.F7) (right) compares Gaussian and Rademacher perturbations on DROP; both converge similarly (61.4 vs. 61.8), confirming robustness. We default to Rademacher for its lower memory overhead.
##### Batch Size Sensitivity\.
Figure [9](https://arxiv.org/html/2606.02857#A7.F9) (Appendix [G](https://arxiv.org/html/2606.02857#A7)) shows $B=4$ diverges and $B=8$ is unstable, while $B\geq 16$ converges smoothly. This is consistent with group-relative normalization requiring a stable within-batch variance estimate; we recommend $B\geq 16$.
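To illustrate why a small batch gives an unreliable normalizer, the toy simulation below measures how much the within-batch standard deviation $s$ fluctuates from batch to batch at different batch sizes. It is only a sketch: the loss differences are assumed to be i.i.d. standard normal, which is a simplification rather than a model of real per-example losses, and the helper names are hypothetical.

```python
import random
import statistics


def group_relative_scale(deltas, eps=1e-8):
    """Divide each loss difference by the within-batch standard deviation."""
    s = statistics.pstdev(deltas)  # population std, i.e. the 1/B form
    return [d / (s + eps) for d in deltas]


def relative_spread_of_std(batch_size, trials=2000, seed=0):
    """Coefficient of variation of the within-batch std estimate.

    Toy assumption: loss differences are i.i.d. standard normal draws.
    """
    rng = random.Random(seed)
    estimates = [
        statistics.pstdev([rng.gauss(0.0, 1.0) for _ in range(batch_size)])
        for _ in range(trials)
    ]
    return statistics.pstdev(estimates) / statistics.mean(estimates)


spreads = {b: relative_spread_of_std(b) for b in (4, 8, 16, 64)}
for b, cv in spreads.items():
    print(f"B={b:>2}: relative spread of s = {cv:.2f}")
```

In this toy setting the normalizer becomes steadily less noisy as the batch grows, which matches the direction of the batch-size trend reported above; it does not by itself explain the specific threshold of 16.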
## 6 Conclusion
We have presented GRZO, a zeroth\-order optimizer that treats the mini\-batch as a source of perturbation directions rather than only loss\-averaging samples\. By combining pseudo\-independent per\-example perturbations with group\-relative normalization, GRZO extracts a gradient direction from every mini\-batch example under MeZO’s two\-forward\-pass budget, greatly mitigating the variance bottleneck of single\-direction ZO methods\. We have provided theoretical guarantees on directional unbiasedness, variance reduction, and nonconvex convergence\. Extensive experiments on multiple models have shown that GRZO consistently outperforms state\-of\-the\-art ZO baselines at inference\-level memory, and serves as a drop\-in replacement for the MeZO core that composes with sparsity, low\-rank, and quantization variants\. Extending GRZO to further MeZO variants and full pre\-training of large\-scale models remains open\.
## Limitations
GRZO’s main per-step cost comes from running two perturbed forwards ($\ell^{+}$ and $\ell^{-}$) to form a directionally unbiased two-sided estimate. A one-sided variant comparing a single perturbed forward to an unperturbed forward would be faster than MeZO but at the cost of a biased gradient estimate and slower descent; building a low-bias one-sided design is open future work.
Our empirical study covers four ZO families \(vanilla, Sparse\-MeZO, LOZO, QuZO\) up to 13B parameters\. Other MeZO variants—curvature\-preconditioned \(HiZOO\), control\-variate \(MeZO\-SVRG\), and subspace\-orthogonalization—and 70B\+ scaling are the most immediate empirical extensions: the GRZO core swap is mechanically straightforward in each case, but the variance\-versus\-stability trade\-off and downstream quality at those settings remain to be verified\.
## Ethical Considerations
This work uses publicly available pre\-trained models and benchmarks for research purposes, follows their respective licenses and terms of use, and does not involve the collection or release of personally identifiable information\. We do not foresee specific ethical concerns beyond the general risks associated with large language models\. In particular, GRZO may inherit biases, hallucinations, or misleading patterns from the underlying models and data\. We therefore do not recommend deploying it as a standalone decision\-making system, especially in high\-stakes settings\. Any practical use should include human oversight and task\-specific safety evaluation\.
## References
- Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 7319–7328.
- S. Amari (1993) Backpropagation and stochastic gradient descent method. Neurocomputing 5(4-5), pp. 185–196.
- A. S. Berahas, L. Cao, K. Choromanski, and K. Scheinberg (2022) A theoretical and empirical comparison of gradient approximations in derivative-free optimization. Foundations of Computational Mathematics 22(2), pp. 507–560.
- A. Chen, Y. Zhang, J. Jia, J. Diffenderfer, K. Parasyris, J. Liu, Y. Zhang, Z. Zhang, B. Kailkhura, and S. Liu (2024) DeepZero: scaling up zeroth-order optimization for deep model training. In International Conference on Learning Representations, Vol. 2024, pp. 50185–50206.
- Y. Chen, Y. Zhang, L. Cao, K. Yuan, and Z. Wen (2025) Enhancing zeroth-order fine-tuning for language models with low-rank structures. In International Conference on Learning Representations, Vol. 2025, pp. 62581–62607.
- S. Dang, Y. Guo, Y. Zhao, H. Ye, X. Zheng, G. Dai, and I. W. Tsang (2026) FZOO: fast zeroth-order optimizer for fine-tuning large language models towards Adam-scale speed. In International Conference on Learning Representations.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
- D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2368–2378.
- J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono (2015) Optimal rates for zero-order convex optimization: the power of two function evaluations. IEEE Transactions on Information Theory 61(5), pp. 2788–2806.
- K. Gao and O. Sener (2022) Generalizing Gaussian smoothing for random search. In International Conference on Machine Learning, pp. 7077–7101.
- T. Gautam, Y. Park, H. Zhou, P. Raman, and W. Ha (2024) Variance-reduced zeroth-order methods for fine-tuning language models. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 235, pp. 15180–15208.
- S. Ghadimi and G. Lan (2013) Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4), pp. 2341–2368.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pp. 2790–2799.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: Low-Rank adaptation of large language models. In International Conference on Learning Representations.
- D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- Y. Lang, C. Wang, Y. Zhang, M. Hong, Z. Zhang, W. Yin, and S. Liu (2026) Powering up zeroth-order training via subspace gradient orthogonalization. arXiv preprint arXiv:2602.17155.
- B. Lester, R. Al-Rfou, and N. Constant (2021) The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045–3059.
- X. L. Li and P. Liang (2021) Prefix-tuning: optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4582–4597.
- S. Liu, P. Chen, B. Kailkhura, G. Zhang, A. O. Hero III, and P. K. Varshney (2020) A primer on zeroth-order optimization in signal processing and machine learning: principals, recent advances, and applications. IEEE Signal Processing Magazine 37(5), pp. 43–54.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Y. Liu, Z. Zhu, C. Gong, M. Cheng, C. Hsieh, and Y. You (2025a) Sparse MeZO: less parameters for better performance in zeroth-order LLM fine-tuning. In Advances in Neural Information Processing Systems.
- Z. Liu, R. Zhang, Z. Wang, M. Yan, Z. Yang, P. D. Hovland, B. Nicolae, F. Cappello, S. Tang, and Z. Zhang (2025b) CoLA: compute-efficient pre-training of LLMs via low-rank activation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 4627–4645.
- I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations.
- S. Ma and H. Huang (2025) Revisiting zeroth-order optimization: minimum-variance two-point estimators and directionally aligned perturbations. arXiv preprint arXiv:2510.19975.
- S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora (2023a) Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems 36, pp. 53038–53075.
- S. Malladi, A. Wettig, D. Yu, D. Chen, and S. Arora (2023b) A kernel-based view of language model fine-tuning. In International Conference on Machine Learning, pp. 23610–23641.
- Y. Nesterov and V. Spokoiny (2017) Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17(2), pp. 527–566.
- P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- J. C. Spall (2002) Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control 37(3), pp. 332–341.
- A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2019) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems 32.
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355.
- Y. Wen, P. Vicol, J. Ba, D. Tran, and R. Grosse (2018) Flipout: efficient pseudo-independent weight perturbations on mini-batches. In International Conference on Learning Representations.
- Y. Yang, Z. Zhang, R. V. Swaminathan, J. Liu, N. Susanj, and Z. Zhang (2026) SharpZO: hybrid sharpness-aware vision language model prompt tuning via forward-only passes. Advances in Neural Information Processing Systems 38, pp. 143695–143721.
- Y. Yang, K. Zhen, E. Banijamali, A. Mouchtaris, and Z. Zhang (2024) AdaZeta: adaptive zeroth-order tensor-train adaption for memory-efficient large language models fine-tuning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 977–995.
- R. R. Zhang, Z. A. Liu, Z. Wang, and Z. Zhang (2026) Lax: boosting low-rank training of foundation models via latent crossing. Advances in Neural Information Processing Systems 38, pp. 142920–142948.
- S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al. (2022) OPT: Open Pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- Z. Zhang, Y. Yang, K. Zhen, N. Susanj, A. Mouchtaris, S. Kunzmann, and Z. Zhang (2025) MaZO: masked zeroth-order optimization for multi-task fine-tuning of large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 18537–18554.
- J. Zhao, Z. Zhang, B. Chen, Z. Wang, A. Anandkumar, and Y. Tian (2024) GaLore: memory-efficient LLM training by gradient low-rank projection. In International Conference on Machine Learning.
- Y. Zhao, S. Dang, H. Ye, G. Dai, Y. Qian, and I. W. Tsang (2025a) Second-order fine-tuning without pain for LLMs: a Hessian informed zeroth-order optimizer. In International Conference on Learning Representations.
- Y. Zhao, H. Li, I. Young, and Z. Zhang (2025b) Poor man’s training on MCUs: a memory-efficient quantized back-propagation-free approach. ACM Transactions on Design Automation of Electronic Systems 30(5), pp. 1–33.
- Y. Zhao, X. Yu, Z. Chen, Z. Liu, S. Liu, and Z. Zhang (2023) Tensor-compressed back-propagation-free training for (physics-informed) neural networks. arXiv preprint arXiv:2308.09858.
- J. Zhou, Y. Yang, K. Zhen, Z. Liu, Y. Zhao, E. Banijamali, A. Mouchtaris, N. Wong, and Z. Zhang (2025) QuZO: quantized zeroth-order fine-tuning for large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 5341–5359. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.271)
## Appendix A GRZO Algorithm
Algorithm 1: GRZO (Group-Relative Zeroth-Order Optimization)

1. Parameters $\bm{\theta}=\{\mathbf{W}^{(\ell)}\}$; scale $\sigma$; batch size $B$; steps $T$; learning rates $\{\eta_{t}\}$; $\epsilon>0$
2. for $t=1,\ldots,T$ do
3. Sample $\mathcal{B}_{t}$; draw per-example sign vectors $\{(\mathbf{r}_{i}^{(\ell)},\mathbf{s}_{i}^{(\ell)})\}$ and layer seeds $\{\textit{seed}^{(\ell)}\}$
4. for each layer $\ell$ do $\triangleright$ Fused forward: two forward passes, per-example perturbations
5. Regenerate $\mathbf{U}^{(\ell)}$ from $\textit{seed}^{(\ell)}$; $\mathbf{P}_{i}^{(\ell)}\leftarrow((\mathbf{X}_{i}\odot\mathbf{S}_{i}^{(\ell)})\,\mathbf{U}^{(\ell)})\odot\mathbf{R}_{i}^{(\ell)}$
6. $\mathbf{X}_{i}\leftarrow\mathrm{Concat}\bigl(\phi(\mathbf{X}_{i}\mathbf{W}^{(\ell)}+\sigma\mathbf{P}_{i}^{(\ell)}),\;\phi(\mathbf{X}_{i}\mathbf{W}^{(\ell)}-\sigma\mathbf{P}_{i}^{(\ell)})\bigr)$
7. end for
8. $\delta_{i}\leftarrow\ell_{i}^{+}-\ell_{i}^{-}$ for all $i=1,\ldots,B$ $\triangleright$ Group-relative normalization
9. $s\leftarrow\sqrt{\tfrac{1}{B}\sum_{i}(\delta_{i}-\bar{\delta})^{2}}$ with $\bar{\delta}=\tfrac{1}{B}\sum_{i}\delta_{i}$; $a_{i}\leftarrow\delta_{i}/(s+\epsilon)$
10. for each layer $\ell$ do $\triangleright$ Seed-regenerated weight update
11. Regenerate $\mathbf{U}^{(\ell)}$ from $\textit{seed}^{(\ell)}$; $\bar{\mathbf{M}}^{(\ell)}\leftarrow\tfrac{1}{B}\sum_{i}a_{i}\,\mathbf{r}_{i}^{(\ell)}(\mathbf{s}_{i}^{(\ell)})^{\top}$
12. $\mathbf{W}^{(\ell)}\leftarrow\mathbf{W}^{(\ell)}-\dfrac{\eta_{t}}{2\sigma}\,\mathbf{U}^{(\ell)}\odot\bar{\mathbf{M}}^{(\ell)}$
13. end for
14. end for
Algorithm [1](https://arxiv.org/html/2606.02857#alg1) gives the per-step pseudocode for GRZO, fusing the per-example sign-factorized perturbation construction with group-relative normalization. The same random seeds are used for the perturbation in the forward pass and the seed-regenerated weight update, so the per-example perturbations $\mathbf{P}_{i}^{(\ell)}$ never have to be materialized between the two phases.
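The per-step logic of Algorithm 1 can be sketched for a single linear layer as follows. This is a minimal NumPy illustration under stated assumptions, not the authors’ implementation: the model is one linear layer with a squared-error loss, the weight is stored as `(d_out, d_in)` and applied as `X @ W.T`, the base noise `U` is Rademacher, and the names `base_noise` and `grzo_step` are hypothetical.

```python
import numpy as np


def base_noise(layer_seed, shape):
    """Regenerate the shared Rademacher base perturbation U from its seed."""
    return np.random.default_rng(layer_seed).choice([-1.0, 1.0], size=shape)


def grzo_step(W, X, Y, lr, sigma=1e-3, eps=1e-8, layer_seed=0, rng=None):
    """One GRZO-style step for a single linear layer (toy squared-error loss).

    W: (d_out, d_in) weights, X: (B, d_in) inputs, Y: (B, d_out) targets.
    Returns the updated weights; W itself is never modified.
    """
    rng = np.random.default_rng() if rng is None else rng
    B = X.shape[0]
    d_out, d_in = W.shape
    # Per-example sign vectors, stored as int8 (rows are r_i and s_i).
    signs = np.array([-1, 1], dtype=np.int8)
    R = rng.choice(signs, size=(B, d_out))
    S = rng.choice(signs, size=(B, d_in))

    # Forward phase: the perturbation is applied to activations, not to W.
    U = base_noise(layer_seed, W.shape)
    P = ((X * S) @ U.T) * R
    base = X @ W.T
    loss_plus = 0.5 * np.sum((base + sigma * P - Y) ** 2, axis=1)
    loss_minus = 0.5 * np.sum((base - sigma * P - Y) ** 2, axis=1)

    # Group-relative normalization of the per-example loss differences.
    delta = loss_plus - loss_minus
    a = delta / (delta.std() + eps)  # ndarray.std defaults to the 1/B form

    # Update phase: U is regenerated from the seed rather than kept alive.
    U = base_noise(layer_seed, W.shape)
    M_bar = (R * a[:, None]).T @ S / B
    return W - lr / (2.0 * sigma) * U * M_bar


data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(16, 4))
Y = X @ data_rng.normal(size=(2, 4)).T
W0 = np.zeros((2, 4))
W1 = grzo_step(W0, X, Y, lr=1e-3, rng=np.random.default_rng(1))
```

Only `R`, `S`, and the layer seed persist between the forward and update phases, which mirrors the point above that the per-example perturbations are never stored. The learning rate in the usage line is arbitrary and chosen only to make the toy run.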
## Appendix B Additional Background on MeZO and Flipout
### B.1 MeZO as In-Place Two-Point Zeroth-Order Optimization
MeZO (Malladi et al., [2023a](https://arxiv.org/html/2606.02857#bib.bib2)) adapts the classical two-point SPSA estimator to LLM fine-tuning with inference-level memory by perturbing parameters in place and regenerating the same noise direction from a random seed. Given parameters $\bm{\theta}\in\mathbb{R}^{d}$, mini-batch $\mathcal{B}$, perturbation scale $\sigma$, and seed-generated direction $\mathbf{z}(s)$, it evaluates

$$\widehat{\mathbf{g}}(\bm{\theta};\mathcal{B})=\frac{L(\bm{\theta}+\sigma\mathbf{z};\mathcal{B})-L(\bm{\theta}-\sigma\mathbf{z};\mathcal{B})}{2\sigma}\,\mathbf{z},\tag{8}$$

or equivalently the projected scalar

$$g_{\mathrm{proj}}=\frac{\ell^{+}-\ell^{-}}{2\sigma},\qquad \ell^{+}=L(\bm{\theta}+\sigma\mathbf{z}(s);\mathcal{B}),\qquad \ell^{-}=L(\bm{\theta}-\sigma\mathbf{z}(s);\mathcal{B}),\tag{9}$$

followed by the in-place update $\bm{\theta}\leftarrow\bm{\theta}-\eta\,g_{\mathrm{proj}}\,\mathbf{z}(s)$. This design avoids storing activations or full perturbation tensors, but because each step uses only one direction, reducing estimator variance by averaging more directions increases forward cost linearly.
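The perturb, evaluate, restore, and update cycle described above can be sketched as follows. This is a minimal NumPy illustration of the two-point estimator with seed regeneration, assuming a Gaussian direction and a flat parameter vector; `mezo_step` and `quadratic_loss` are hypothetical names, not MeZO’s actual code.

```python
import numpy as np


def mezo_step(theta, loss_fn, lr, sigma=1e-3, seed=0):
    """One in-place MeZO-style step; returns the projected scalar g_proj.

    theta is a float NumPy array that is modified in place. The direction z
    is never stored across phases: it is regenerated from `seed` each time.
    """
    def z():
        return np.random.default_rng(seed).standard_normal(theta.shape)

    theta += sigma * z()          # move to theta + sigma * z
    loss_plus = loss_fn(theta)
    theta -= 2.0 * sigma * z()    # move to theta - sigma * z
    loss_minus = loss_fn(theta)
    theta += sigma * z()          # restore the original theta
    g_proj = (loss_plus - loss_minus) / (2.0 * sigma)
    theta -= lr * g_proj * z()    # in-place update along the same direction
    return g_proj


def quadratic_loss(theta):
    """Toy loss 0.5 * ||theta||^2, used only for illustration."""
    return 0.5 * float(theta @ theta)


theta = np.array([1.0, -2.0, 0.5])
g = mezo_step(theta, quadratic_loss, lr=0.1, seed=7)
```

For this quadratic toy loss the two-point difference is exact, so `g` equals the directional derivative of the loss along `z` at the starting point up to floating-point error.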
Table 4: Comparison of representative ZO fine-tuning methods, listing the qualitative source of each method’s extra memory; quantitative measurements appear in Appendix [E](https://arxiv.org/html/2606.02857#A5). ∗ MeZO’s persistent noise is seed-regenerated, but the in-place perturb–restore cycle keeps a parameter-aligned buffer live during the update step. † HiZOO uses one extra forward per step and stores an fp32 diagonal Hessian. ‡ MeZO-SVRG additionally performs periodic full-dataset sweeps. § Parallel FZOO runs $N$ perturbed forwards concurrently; a sequential variant trades activation memory for $N\times$ wall-clock. ∗∗ GRZO applies perturbations via forward pre-hooks, so the base weight is never modified and no restore buffer is held.
### B.2 Flipout for Pseudo-Independent Per-Example Perturbations
Flipout (Wen et al., [2018](https://arxiv.org/html/2606.02857#bib.bib11)) addresses the inefficiency of sharing one weight perturbation across an entire mini-batch. For a linear layer with weight matrix $\mathbf{W}\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$, it samples a shared base perturbation $\mathbf{U}$ and constructs an effective perturbation for example $n$ as

$$\Delta\mathbf{W}_{n}=\mathbf{U}\odot(\mathbf{r}_{n}\mathbf{s}_{n}^{\top}),\tag{10}$$

where $\mathbf{r}_{n}\in\{\pm 1\}^{d_{\text{out}}}$ and $\mathbf{s}_{n}\in\{\pm 1\}^{d_{\text{in}}}$ are independent sign vectors. Stacking a mini-batch of activations into $X$, and the sign vectors into matrices $R$ and $S$, yields the vectorized form

$$Y=\phi\!\left(XW+\big((X\odot S)\mathbf{U}\big)\odot R\right),\tag{11}$$

which avoids materializing $N$ separately perturbed weight matrices while still producing pseudo-independent example-level perturbations. GRZO uses this construction to obtain $B$ perturbation-induced loss signals within the same two-forward-pass budget used by MeZO.
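The equivalence between the per-example form in Eq. (10) and the vectorized form in Eq. (11) can be checked numerically. The sketch below assumes the weight is stored as `(d_out, d_in)` and applied as `X @ W.T` (one reading of the notation, so transposes appear explicitly), omits the activation $\phi$, and adds a scale `sigma` as GRZO does; all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_out, sigma = 5, 4, 3, 0.1

X = rng.normal(size=(N, d_in))                 # stacked activations
W = rng.normal(size=(d_out, d_in))             # base weight
U = rng.normal(size=(d_out, d_in))             # shared base perturbation
R = rng.choice([-1.0, 1.0], size=(N, d_out))   # rows are r_n
S = rng.choice([-1.0, 1.0], size=(N, d_in))    # rows are s_n

# Naive form: materialise one perturbed weight matrix per example.
naive = np.stack([
    (W + sigma * U * np.outer(R[n], S[n])) @ X[n] for n in range(N)
])

# Vectorised form: two matmuls plus elementwise sign flips.
vectorised = X @ W.T + sigma * ((X * S) @ U.T) * R

max_abs_diff = float(np.max(np.abs(naive - vectorised)))
print(f"max |naive - vectorised| = {max_abs_diff:.2e}")
```

The two results agree to floating-point precision, while the vectorized path never builds the `N` separate `(d_out, d_in)` matrices.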
## Appendix C Comparison of ZO Fine-Tuning Methods
Table [4](https://arxiv.org/html/2606.02857#A2.T4) compares representative zeroth-order fine-tuning methods across four dimensions: per-step forward-pass count, extra memory relative to MeZO, variance reduction mechanism, and whether backpropagation is required.
## Appendix D Detailed Experimental Settings
##### Hardware.
All experiments are conducted on servers equipped with 8× NVIDIA A100 (40 GB) or 8× NVIDIA A6000 (48 GB) GPUs. Each Llama3-8B or OPT-13B fine-tuning run on a single task takes approximately 5–8 GPU-hours.
##### Aggregation\.
All accuracy and loss numbers reported in this paper are means across 3 runs \(different random seeds\) per task\-method configuration\.
##### Implementation\.
All experiments use PyTorch with the Hugging Face transformers library and standard model checkpoints (roberta-large, meta-llama/Llama-3-8B, facebook/opt-13b). LoRA adapters are implemented via the peft library; AdamW is from torch.optim. F1 scorers for SQuAD and DROP follow the official scripts from the respective dataset releases.
##### RoBERTa\-large\.
We follow the experimental protocol of Malladi et al. ([2023a](https://arxiv.org/html/2606.02857#bib.bib2)) exactly, including data sampling, evaluation splits, and prompt templates. Each task uses $k=512$ labeled examples per class.
##### Llama3\-8B and OPT\-13B\.
Each task is trained on 1,000 examples (200 for CB, 350 for COPA), with a held-out development set of 500 examples used for learning rate selection and a test set of up to 1,000 examples. We train for 20,000 steps with a linear warmup over 500 steps and a constant learning rate thereafter. Per-device batch size is 16 and we use FP16 precision. The perturbation scale is $\sigma=10^{-3}$. Task-specific learning rates are selected from $\{1\text{e-}7,\,2\text{e-}7,\,3\text{e-}7,\,4\text{e-}7,\,5\text{e-}7\}$ based on development set accuracy.
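The schedule described above (linear warmup over 500 steps, then constant) can be written as a small function. This is a sketch under one assumption not stated in the text: the warmup starts from zero at step 0 and reaches the peak value at step 500. The function name is illustrative.

```python
def learning_rate(step: int, peak_lr: float, warmup_steps: int = 500) -> float:
    """Linear warmup from 0 to peak_lr over warmup_steps, then constant."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, step / warmup_steps)


TOTAL_STEPS = 20_000
# 1e-7 is the smallest value in the learning-rate grid quoted above.
schedule = [learning_rate(t, peak_lr=1e-7) for t in range(TOTAL_STEPS)]
```

With these settings the warmup covers 2.5% of training, so almost all updates run at the selected peak learning rate.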
##### Perturbed Parameters\.
We apply full\-parameter zeroth\-order fine\-tuning: all learnable parameters are perturbed, including linear projection weights \(sign\-factorized\), embedding matrices \(sparse row\-wise perturbation indexed by active tokens\), and normalization layer parameters \(LayerNorm/RMSNorm\)\. This matches the full\-parameter MeZO setting\.
##### Prompts and Task Formulation\.
We adopt the same prompt templates as Malladi et al. ([2023a](https://arxiv.org/html/2606.02857#bib.bib2)). For multiple-choice tasks (CB, COPA, WiC, BoolQ, MultiRC), inference uses candidate log-likelihood scoring: each candidate is appended to the prompt and the candidate with the highest mean per-token log-likelihood is selected. During training, we apply teacher forcing on the correct candidate only, computing the loss solely on candidate tokens while excluding prompt tokens.
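The candidate-selection rule can be illustrated with a few lines of Python. This sketch assumes the per-token log-probabilities of each candidate’s tokens (prompt tokens excluded) have already been computed by the model; `select_candidate` is a hypothetical helper and the numbers are made up.

```python
def select_candidate(candidate_logprobs):
    """Return the index of the candidate with the highest mean per-token
    log-likelihood. Each entry holds the log-probs of one candidate's tokens."""
    means = [sum(lp) / len(lp) for lp in candidate_logprobs]
    return max(range(len(means)), key=means.__getitem__)


# Two candidates of different length (illustrative values only).
example = [[-1.0, -1.0], [-0.5, -2.0, -0.1]]
chosen = select_candidate(example)
```

Here the second candidate is selected because its mean log-likelihood is higher, even though its summed log-likelihood is lower; averaging per token avoids penalizing longer candidates.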
##### Hyperparameters\.
Tables [5](https://arxiv.org/html/2606.02857#A4.T5) and [6](https://arxiv.org/html/2606.02857#A4.T6) summarize the hyperparameter settings for all methods. For all methods, learning rates are selected by grid search on the development set; the best value per task is reported. Adam and LoRA use AdamW with $\beta_{1}=0.9$, $\beta_{2}=0.999$.
Table 5: Hyperparameter settings for RoBERTa-large ($k=512$). Adam (FO) and LoRA (FO) hyperparameters follow Malladi et al. ([2023a](https://arxiv.org/html/2606.02857#bib.bib2)) and prior work.
Table 6: Hyperparameter settings for Llama3-8B and OPT-13B.
## Appendix E Detailed Time and Memory Breakdown
This appendix tabulates the per-step wall-clock decomposition and peak GPU memory measurements that underlie Section [5.2](https://arxiv.org/html/2606.02857#S5.SS2) and Figure [5](https://arxiv.org/html/2606.02857#S5.F5). All numbers are from the same production profile: Llama-3-8B (fp16), RTE, batch size $B=16$, 4× A100-40GB, mean of 80 samples (20 steps × 4 ranks). In Table [7](https://arxiv.org/html/2606.02857#A5.T7), Forward sums the per-step forward passes (two standard for MeZO-family; two fused-perturbation for GRZO-family); Perturb/Setup covers perturbation handling (three in-place operations for MeZO-family; per-Linear sign-vector and base-noise setup for GRZO-family); Update is the weight-update step; Other is the residual (loss reduce + HF Trainer/DDP overhead).
Table 7:Per\-step wall\-clock breakdown \(ms\) across all profiled methods\.Totals are production measurements; column definitions are given in the text above\.Table 8:Peak GPU memory per step \(GB\), same setup as Table[7](https://arxiv.org/html/2606.02857#A5.T7)\. Model footprint is16\.016\.0GB;Transient= peak−\-footprint;vs MeZO= multiplier vs MeZO\.

Figure 8: Per-variant GRZO+X vs MeZO+X training-loss curves on Llama3-8B. Left: SQuAD. Right: BoolQ.
##### Observations.
Two patterns dominate. First, MeZO-family in-place perturbation cost scales steeply with the variant's per-parameter work (LOZO $1.34\times$, Sparse-MeZO $1.57\times$, QuZO $3.80\times$ over plain MeZO), because each variant traverses all $\sim$8B parameters during perturb, restore, and update. GRZO removes this scaling by representing each per-example perturbation as $B(d_{\text{out}} + d_{\text{in}})$ int-8 sign vectors per layer rather than a per-parameter modification; consequently the GRZO-family update step is uniformly 82 ms across all four pairings, and combined-variant perturb/setup cost is $\leq 270$ ms. Second, on memory, GRZO-family transient overhead is dominated by the variant's own scratch storage (sparse mask, quant buffers), not by GRZO itself: GRZO+X uses 4.8–6.4 GB less peak memory than MeZO+X on every pairing, with the largest absolute saving on QuZO ($-6.4$ GB), where per-parameter quant intermediates inflate MeZO's peak.
##### Why GRZO is uniquely at inference level\.
The MeZO-family memory overheads in Table [8](https://arxiv.org/html/2606.02857#A5.T8) share a single mechanism: every MeZO variant applies perturbations by mutating the weight tensor in place and must keep a parameter-aligned buffer live during the update step to apply $-\eta\,\widehat{g}\,\mathbf{z}$. This costs ${\sim}0.3\times$ the trainable-weight footprint regardless of the underlying variant. Sparse-MeZO compounds it with a per-parameter boolean mask (an additional ${\sim}0.5\times W$, raising the total to ${\sim}0.7\times W$); LOZO's low-rank factors are themselves tiny (a few MB total) but the method inherits MeZO's mutation cost; QuZO adds quantization auxiliary buffers on top of mutation. GRZO sidesteps this pattern entirely: each layer's perturbed weight is constructed transiently inside a forward pre-hook and freed before the next layer runs, so the base weight is never modified and no restore buffer is held. Consequently, vanilla GRZO and LO-GRZO are the only configurations whose peak memory equals the inference footprint; Sparse-GRZO and Qu-GRZO retain the variant-specific scratch storage but still avoid the MeZO mutation cost, which is why Table [8](https://arxiv.org/html/2606.02857#A5.T8) shows GRZO+X consistently below MeZO+X on every pairing rather than only on the unmodified core.
##### Caveat: FZOO Variant Ambiguity\.
FZOO (Dang et al., [2026](https://arxiv.org/html/2606.02857#bib.bib14)) admits both sequential and batched-parallel perturbation variants with substantially different wall-clock and memory profiles (the latter trading $O(N)$ activation memory for near-MeZO wall-clock). Reporting either single number would mischaracterize the method, and the choice is implementation-dependent rather than algorithmic. We therefore restrict the FZOO comparison to accuracy, which is independent of the parallel/sequential choice.
##### Caveat: QuZO and Qu\-GRZO Profile Numbers\.
The peak-memory and per-step wall-clock numbers reported by Zhou et al. ([2025](https://arxiv.org/html/2606.02857#bib.bib9)) for QuZO (their Tables 4 and 5; Appendix C) come from a Cutlass INT8 kernel that stores weights in packed int4/int8 format and dispatches GEMMs onto integer Tensor Core paths; this kernel is not part of the public release. The released code (qft mode) simulates low-bit fine-tuning by fake-quantizing weights inside each forward pass while retaining fp16 storage and fp16 compute, and our QuZO and Qu-GRZO implementations follow the same paradigm. Profiling either method on A100 in fp16 therefore captures fake-quant overhead, not the weight footprint and INT8-GEMM throughput that drive the deployment numbers of Zhou et al. ([2025](https://arxiv.org/html/2606.02857#bib.bib9)); the QuZO and Qu-GRZO rows in Tables [7](https://arxiv.org/html/2606.02857#A5.T7)–[8](https://arxiv.org/html/2606.02857#A5.T8) should be read as fake-quant simulation, not as comparable to QuZO's Table 5. We therefore omit QuZO and Qu-GRZO from the cross-variant memory and time claim in Section [5.3](https://arxiv.org/html/2606.02857#S5.SS3). Their accuracy entries in Table [3](https://arxiv.org/html/2606.02857#S5.T3) apply paper Algorithm 1 with identical W8 weight and perturbation quantization to both methods and remain apples-to-apples. Within the fake-quant regime on A100, Qu-GRZO is still empirically more memory-efficient than QuZO (though slightly slower per step, consistent with the GRZO–MeZO pattern in other pairings); this memory advantage stems from forward-hook streaming rather than storage precision, and would persist or amplify on true low-bit hardware.
## Appendix F GRZO-combined Variants Convergence Examples
Figure [8](https://arxiv.org/html/2606.02857#A5.F8) shows per-task training dynamics of GRZO+X versus its paired MeZO variants and the vanilla GRZO baseline on SQuAD and BoolQ (Llama3-8B).
## Appendix G Batch Size Sensitivity
Figure [9](https://arxiv.org/html/2606.02857#A7.F9) shows training loss curves for GRZO across batch sizes $B \in \{4, 8, 16, 32\}$ on two tasks: SST-2 (Llama3-8B) and COPA (OPT-13B). The results corroborate the theoretical prediction in Section [5.4](https://arxiv.org/html/2606.02857#S5.SS4.SSS0.Px3): the group-relative normalizer requires a stable within-batch loss standard deviation $s$ to produce reliable advantage weights, and this stability breaks down at very small batch sizes.
Figure 9: Batch size sensitivity of GRZO. Left: Llama3-8B/SST-2. Right: OPT-13B/COPA. $B = 4$ diverges; $B = 8$ is unstable; $B \geq 16$ converges stably.
## Appendix H Unbiasedness and Smoothing Bias of GRZO
### H.1 Estimator Definition (Canonical vs. Implementation)
We analyze a single linear layer weight $\mathbf{W} \in \mathbb{R}^{D_{\text{out}} \times D_{\text{in}}}$. Let $d = D_{\text{out}} D_{\text{in}}$ and denote $\bm{\theta} = \mathrm{vec}(\mathbf{W}) \in \mathbb{R}^{d}$. For each flattened example $i \in \{1, \dots, B\}$ we sample a Flipout perturbation
$$\Delta\mathbf{W}_{i} = \mathbf{U} \odot (\mathbf{r}_{i}\mathbf{s}_{i}^{\top}), \qquad \mathbf{z}_{i} := \mathrm{vec}(\Delta\mathbf{W}_{i}) \in \mathbb{R}^{d}, \tag{12}$$
and evaluate two-sided losses
$$\ell_{i}^{\pm} := \ell(\bm{\theta} \pm \sigma\mathbf{z}_{i}; \xi_{i}), \qquad \delta_{i} := \ell_{i}^{+} - \ell_{i}^{-}. \tag{13}$$
##### Canonical Two\-Sided ZO Estimator\.
The standard (canonical) two-sided estimator is
$$\widehat{\mathbf{g}}_{\text{can}}(\bm{\theta}) := \frac{1}{2\sigma B}\sum_{i=1}^{B}\delta_{i}\,\mathbf{z}_{i}. \tag{14}$$
This serves as the theoretical baseline from which GRZO departs.
##### GRZO Estimator \(Group\-Relative Normalization\)\.
GRZO computes the within-batch standard deviation
$$s = \sqrt{\frac{1}{B}\sum_{i=1}^{B}(\delta_{i} - \bar{\delta})^{2}}, \qquad \bar{\delta} = \frac{1}{B}\sum_{i=1}^{B}\delta_{i}, \tag{15}$$
defines group-relative weights $a_{i} := \delta_{i}\,/\,(s + \epsilon)$, and forms
$$\widehat{\mathbf{g}}_{\text{GRZO}}(\bm{\theta}) := \frac{1}{2\sigma B}\sum_{i=1}^{B}a_{i}\,\mathbf{z}_{i}. \tag{16}$$
The scaling by $1/(s + \epsilon)$ makes the update invariant to the magnitude of $\{\delta_{i}\}$, analogous to advantage normalization in GRPO (Shao et al., [2024](https://arxiv.org/html/2606.02857#bib.bib26)). Because two-sided finite differences satisfy $\mathbb{E}[\delta_{i}] = 0$ by symmetry of $\mathbf{z}_{i} \leftrightarrow -\mathbf{z}_{i}$, no explicit mean baseline is subtracted from the numerator.
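The two estimators above can be written out in a few lines. The sketch below is illustrative only: the function names, the toy loss differences, and the sign directions are assumptions made for this example, and plain Python lists stand in for tensors.

```python
import math
from typing import List, Sequence, Tuple


def canonical_estimator(deltas: Sequence[float], zs: Sequence[Sequence[float]],
                        sigma: float) -> List[float]:
    """Eq. (14): (1 / (2 sigma B)) * sum_i delta_i * z_i."""
    B, d = len(deltas), len(zs[0])
    return [sum(deltas[i] * zs[i][k] for i in range(B)) / (2.0 * sigma * B)
            for k in range(d)]


def grzo_estimator(deltas: Sequence[float], zs: Sequence[Sequence[float]],
                   sigma: float, eps: float = 1e-8) -> Tuple[List[float], float]:
    """Eqs. (15)-(16): the same sum with weights a_i = delta_i / (s + eps)."""
    B = len(deltas)
    mean = sum(deltas) / B
    s = math.sqrt(sum((x - mean) ** 2 for x in deltas) / B)  # within-batch std
    weights = [x / (s + eps) for x in deltas]
    return canonical_estimator(weights, zs, sigma), s


# Toy batch (B = 4, d = 3) with hypothetical loss differences and sign directions.
sigma = 1e-3
deltas = [0.4, -0.2, 0.1, -0.3]
zs = [[1, -1, 1], [-1, -1, 1], [1, 1, -1], [-1, 1, 1]]

g_can = canonical_estimator(deltas, zs, sigma)
g_grzo, s = grzo_estimator(deltas, zs, sigma)
g_grzo_scaled, _ = grzo_estimator([10.0 * x for x in deltas], zs, sigma)
print(g_can, g_grzo, s)
```

Rescaling every loss difference by a constant leaves the GRZO direction and magnitude essentially unchanged (up to the effect of `eps`), whereas the canonical estimator scales linearly with it.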
### H.2 Marginal Distribution of Flipout Perturbations
We only need the *marginal* distribution of $\mathbf{z}_{i}$ for unbiasedness. Let $\mathbf{U}$ have i.i.d. entries $\mathbf{U}_{jk}$ and let $\mathbf{r}_{i} \in \{\pm 1\}^{D_{\text{out}}}$, $\mathbf{s}_{i} \in \{\pm 1\}^{D_{\text{in}}}$ be independent Rademacher vectors. Then each coordinate of $\mathbf{z}_{i}$ is of the form
$$(\mathbf{z}_{i})_{(j,k)} = \mathbf{U}_{jk} \cdot (\mathbf{r}_{i})_{j} \cdot (\mathbf{s}_{i})_{k}, \tag{17}$$
i.e., a sign flip of $\mathbf{U}_{jk}$.
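A small sketch of the construction in (12) and (17) follows. The helper names and the integer stand-in for the base noise are assumptions chosen so the checks are exact. The last lines illustrate the standard Flipout identity $(\mathbf{U} \odot \mathbf{r}\mathbf{s}^{\top})\mathbf{x} = \mathbf{r} \odot (\mathbf{U}(\mathbf{s} \odot \mathbf{x}))$, which lets a per-example perturbed product be formed from the shared $\mathbf{U}$ and two sign vectors; whether the GRZO implementation evaluates it in exactly this form is not stated here.

```python
import random
from typing import List


def flipout_perturbation(U: List[List[int]], r: List[int], s: List[int]) -> List[List[int]]:
    """Eq. (12)/(17): DeltaW[j][k] = U[j][k] * r[j] * s[k]."""
    return [[U[j][k] * r[j] * s[k] for k in range(len(s))] for j in range(len(r))]


def matvec(M: List[List[int]], x: List[int]) -> List[int]:
    return [sum(M[j][k] * x[k] for k in range(len(x))) for j in range(len(M))]


rng = random.Random(0)
D_out, D_in = 3, 4
# Integer stand-in for the shared base noise U (Gaussian or Rademacher in the text).
U = [[rng.choice([-1, 1]) * rng.randint(1, 5) for _ in range(D_in)] for _ in range(D_out)]
r = [rng.choice([-1, 1]) for _ in range(D_out)]  # per-example output signs
s = [rng.choice([-1, 1]) for _ in range(D_in)]   # per-example input signs
x = [rng.randint(-3, 3) for _ in range(D_in)]    # hypothetical layer input

dW = flipout_perturbation(U, r, s)
direct = matvec(dW, x)  # materializes the full D_out x D_in perturbation
factored = [r[j] * y for j, y in enumerate(matvec(U, [s[k] * x[k] for k in range(D_in)]))]
print(direct, factored)
```

Each entry of `dW` has the same magnitude as the matching entry of `U`, and the two ways of computing the perturbed product agree.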
##### Gaussian Base Noise \(Exact\)\.
If $\mathbf{U}_{jk} \sim \mathcal{N}(0,1)$, then $(\mathbf{z}_{i})_{(j,k)} \sim \mathcal{N}(0,1)$ as well, because multiplying a standard normal by an independent $\pm 1$ does not change its distribution. Hence $\mathbf{z}_{i} \sim \mathcal{N}(0, \mathbf{I}_{d})$ marginally.
##### Rademacher Base Noise (Isotropic, Symmetric).
If $\mathbf{U}_{jk} \sim \mathrm{Rad}(\pm 1)$, then $(\mathbf{z}_{i})_{(j,k)} \sim \mathrm{Rad}(\pm 1)$ marginally. In either case, $\mathbf{z}_{i}$ is symmetric ($\mathbf{z}_{i} \stackrel{d}{=} -\mathbf{z}_{i}$) and isotropic ($\mathbb{E}[\mathbf{z}_{i}] = 0$, $\mathbb{E}[\mathbf{z}_{i}\mathbf{z}_{i}^{\top}] = \mathbf{I}_{d}$).
### H.3 Unbiasedness w.r.t. a Smoothed Objective (Gaussian Case: Exact)
In the Gaussian case, define the Gaussian-smoothed population objective
$$F_{\sigma}(\bm{\theta}) := \mathbb{E}_{u \sim \mathcal{N}(0, \mathbf{I}_{d})}\big[F(\bm{\theta} + \sigma\mathbf{u})\big]. \tag{18}$$
By Stein's identity, we have
$$\nabla F_{\sigma}(\bm{\theta}) = \frac{1}{\sigma}\,\mathbb{E}_{u}\big[F(\bm{\theta} + \sigma\mathbf{u})\,\mathbf{u}\big]. \tag{19}$$
Using symmetry of the Gaussian, $\mathbf{u} \stackrel{d}{=} -\mathbf{u}$, we obtain the antithetic form:
$$\nabla F_{\sigma}(\bm{\theta}) = \frac{1}{2\sigma}\,\mathbb{E}_{u}\Big[\big(F(\bm{\theta} + \sigma\mathbf{u}) - F(\bm{\theta} - \sigma\mathbf{u})\big)\,\mathbf{u}\Big]. \tag{20}$$
Now consider a single-example objective $f_{i}(\bm{\theta}) := \mathbb{E}_{\xi_{i}}[\ell(\bm{\theta}; \xi_{i})]$. If $\mathbf{z}_{i} \sim \mathcal{N}(0, \mathbf{I}_{d})$ marginally (Appendix [H.2](https://arxiv.org/html/2606.02857#A8.SS2)), then applying ([20](https://arxiv.org/html/2606.02857#A8.E20)) to $f_{i}$ yields
$$\mathbb{E}_{z_{i}}\left[\frac{f_{i}(\bm{\theta} + \sigma\mathbf{z}_{i}) - f_{i}(\bm{\theta} - \sigma\mathbf{z}_{i})}{2\sigma}\,\mathbf{z}_{i}\right] = \nabla f_{i,\sigma}(\bm{\theta}), \tag{21}$$
where $f_{i,\sigma}$ is the Gaussian smoothing of $f_{i}$.
Taking expectation over the mini-batch sampling and averaging over $i$, linearity of expectation gives the unbiasedness of the canonical estimator:
$$\mathbb{E}\big[\widehat{\mathbf{g}}_{\text{can}}(\bm{\theta})\big] = \nabla F_{\sigma}(\bm{\theta}). \tag{22}$$
##### GRZO as Self\-Normalized Estimator\.
GRZO applies a positive self-normalization to the canonical estimator. Since $a_{i} = \delta_{i}/(s + \epsilon)$, the estimator factors as
$$\widehat{\mathbf{g}}_{\text{GRZO}}(\bm{\theta}) = \frac{1}{s + \epsilon} \cdot \frac{1}{2\sigma B}\sum_{i=1}^{B}\delta_{i}\,\mathbf{z}_{i} = \frac{1}{s + \epsilon} \cdot \widehat{\mathbf{g}}_{\text{can}}(\bm{\theta}). \tag{23}$$
The scalar $1/(s + \epsilon)$ depends on $\{\mathbf{z}_{i}\}_{i=1}^{B}$ through $\{\delta_{i}\}$, so it cannot be factored out of the outer expectation in closed form. Equivalently, the GRZO update can be written as $\bm{\theta}_{t+1} = \bm{\theta}_{t} - \tilde{\eta}_{t}\,\widehat{\mathbf{g}}_{\text{can}}(\bm{\theta}_{t})$ with adaptive effective step size $\tilde{\eta}_{t} := \eta_{t}/(s_{t} + \epsilon)$, paralleling the FZOO (Dang et al., [2026](https://arxiv.org/html/2606.02857#bib.bib14)) convergence framing. When the within-batch standard deviation $s$ concentrates around a positive deterministic scalar $s_{\star}(\bm{\theta})$ at typical batch sizes,
$$\mathbb{E}\big[\widehat{\mathbf{g}}_{\text{GRZO}}(\bm{\theta})\big] \;\approx\; \frac{1}{s_{\star}(\bm{\theta}) + \epsilon}\,\nabla F_{\sigma}(\bm{\theta}) + O(\sigma^{2}), \tag{24}$$
so GRZO is approximately direction-preserving. The convergence analysis (Appendix [H.7](https://arxiv.org/html/2606.02857#A8.SS7)) treats the self-normalization $1/(s_{t} + \epsilon)$ as an adaptive step size rather than relying on exact direction-preservation.
### H.4 Taylor Expansion View (General Symmetric Isotropic Directions)
This subsection provides an alternative derivation via Taylor expansion, which applies to any symmetric isotropic direction distribution (including Rademacher).
Assume $f(\bm{\theta})$ is three-times differentiable and its third derivative tensor is bounded:
$$\|\nabla^{3}f(\bm{\theta})\|_{\mathrm{op}} \leq \rho, \quad \forall\bm{\theta}. \tag{25}$$
For a fixed direction $\mathbf{z}$, a third-order Taylor expansion gives
$$f(\bm{\theta} + \sigma\mathbf{z}) = f(\bm{\theta}) + \sigma\langle\nabla f(\bm{\theta}), \mathbf{z}\rangle + \frac{\sigma^{2}}{2}\mathbf{z}^{\top}\nabla^{2}f(\bm{\theta})\,\mathbf{z} + \frac{\sigma^{3}}{6}\nabla^{3}f(\bm{\theta})[\mathbf{z},\mathbf{z},\mathbf{z}] + O(\sigma^{4}\|\mathbf{z}\|^{4}), \tag{26}$$
$$f(\bm{\theta} - \sigma\mathbf{z}) = f(\bm{\theta}) - \sigma\langle\nabla f(\bm{\theta}), \mathbf{z}\rangle + \frac{\sigma^{2}}{2}\mathbf{z}^{\top}\nabla^{2}f(\bm{\theta})\,\mathbf{z} - \frac{\sigma^{3}}{6}\nabla^{3}f(\bm{\theta})[\mathbf{z},\mathbf{z},\mathbf{z}] + O(\sigma^{4}\|\mathbf{z}\|^{4}). \tag{27}$$
Subtracting,
$$f(\bm{\theta} + \sigma\mathbf{z}) - f(\bm{\theta} - \sigma\mathbf{z}) = 2\sigma\langle\nabla f(\bm{\theta}), \mathbf{z}\rangle + \frac{\sigma^{3}}{3}\nabla^{3}f(\bm{\theta})[\mathbf{z},\mathbf{z},\mathbf{z}] + O(\sigma^{5}\|\mathbf{z}\|^{5}). \tag{28}$$
Plugging ([28](https://arxiv.org/html/2606.02857#A8.E28)) into the two-sided estimator for a single direction,
$$\frac{f(\bm{\theta} + \sigma\mathbf{z}) - f(\bm{\theta} - \sigma\mathbf{z})}{2\sigma}\,\mathbf{z} = \langle\nabla f(\bm{\theta}), \mathbf{z}\rangle\mathbf{z} + \frac{\sigma^{2}}{6}\nabla^{3}f(\bm{\theta})[\mathbf{z},\mathbf{z},\mathbf{z}]\,\mathbf{z} + O(\sigma^{4}\|\mathbf{z}\|^{6}). \tag{29}$$
Taking expectation over a symmetric isotropic $\mathbf{z}$ with $\mathbb{E}[\mathbf{z}\mathbf{z}^{\top}] = \mathbf{I}$ yields
$$\mathbb{E}\big[\langle\nabla f(\bm{\theta}), \mathbf{z}\rangle\mathbf{z}\big] = \mathbb{E}[\mathbf{z}\mathbf{z}^{\top}]\nabla f(\bm{\theta}) = \nabla f(\bm{\theta}). \tag{30}$$
For the remainder, symmetry implies cancellation of odd moments and ([25](https://arxiv.org/html/2606.02857#A8.E25)) implies
$$\left\|\mathbb{E}\left[\frac{\sigma^{2}}{6}\nabla^{3}f(\bm{\theta})[\mathbf{z},\mathbf{z},\mathbf{z}]\,\mathbf{z}\right]\right\| \leq \frac{\sigma^{2}}{6}\rho\,\mathbb{E}\|\mathbf{z}\|^{4} = O(\sigma^{2}). \tag{31}$$
Therefore,
$$\mathbb{E}\left[\frac{f(\bm{\theta} + \sigma\mathbf{z}) - f(\bm{\theta} - \sigma\mathbf{z})}{2\sigma}\,\mathbf{z}\right] = \nabla f(\bm{\theta}) + O(\sigma^{2}). \tag{32}$$
Averaging over examples yields $\mathbb{E}[\widehat{\mathbf{g}}_{\text{can}}(\bm{\theta})] = \nabla F(\bm{\theta}) + O(\sigma^{2})$.
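As a sanity check on (29)–(32): for a quadratic objective the third-derivative term vanishes, so the two-sided estimator should average exactly to the gradient under any isotropic sign distribution. The sketch below enumerates all $2^{d}$ Rademacher directions for a hypothetical quadratic; the matrix `A`, vector `b`, point `theta`, and function names are assumptions for illustration only.

```python
from itertools import product
from typing import List, Sequence

d = 3
A = [[2, 1, 0], [1, 3, 1], [0, 1, 1]]  # symmetric, hypothetical
b = [1, -2, 3]
theta = [1.0, 0.0, -1.0]
sigma = 0.5


def f(t: Sequence[float]) -> float:
    """Quadratic objective f(t) = t^T A t + b^T t."""
    quad = sum(t[j] * A[j][k] * t[k] for j in range(d) for k in range(d))
    return quad + sum(b[k] * t[k] for k in range(d))


def grad_f(t: Sequence[float]) -> List[float]:
    """Exact gradient 2 A t + b (A is symmetric)."""
    return [2 * sum(A[j][k] * t[k] for k in range(d)) + b[j] for j in range(d)]


def two_sided(t: Sequence[float], z: Sequence[int], sig: float) -> List[float]:
    """Single-direction two-sided estimator (f(t + sig z) - f(t - sig z)) / (2 sig) * z."""
    plus = f([t[k] + sig * z[k] for k in range(d)])
    minus = f([t[k] - sig * z[k] for k in range(d)])
    scale = (plus - minus) / (2 * sig)
    return [scale * zk for zk in z]


# Exact expectation over the uniform distribution on {-1, +1}^d.
directions = list(product([-1, 1], repeat=d))
mean_est = [sum(two_sided(theta, z, sigma)[k] for z in directions) / len(directions)
            for k in range(d)]
print(mean_est, grad_f(theta))
```

The enumerated mean matches the exact gradient; for non-quadratic objectives the gap would be the $O(\sigma^{2})$ term in (32).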
### H.5 Smoothing Bias: Smoothed vs. Original Gradient
We bound $\|\nabla F_{\sigma}(\bm{\theta}) - \nabla F(\bm{\theta})\|$. Assume $\nabla^{2}F$ is $\rho$-Lipschitz (equivalently, ([25](https://arxiv.org/html/2606.02857#A8.E25)) holds for $F$). Then for any $\mathbf{u}$,
$$\nabla F(\bm{\theta} + \sigma\mathbf{u}) = \nabla F(\bm{\theta}) + \nabla^{2}F(\bm{\theta})\,\sigma\mathbf{u} + R(\bm{\theta}, \mathbf{u}), \qquad \|R(\bm{\theta}, \mathbf{u})\| \leq \frac{\rho}{2}\sigma^{2}\|\mathbf{u}\|^{2}. \tag{33}$$
Taking expectation over $\mathbf{u} \sim \mathcal{N}(0, \mathbf{I}_{d})$ (or any symmetric isotropic distribution), the linear term vanishes since $\mathbb{E}[\mathbf{u}] = 0$, giving
$$\|\nabla F_{\sigma}(\bm{\theta}) - \nabla F(\bm{\theta})\| = \left\|\mathbb{E}_{u}[\nabla F(\bm{\theta} + \sigma\mathbf{u})] - \nabla F(\bm{\theta})\right\| = \left\|\mathbb{E}_{u}[R(\bm{\theta}, \mathbf{u})]\right\| \leq \frac{\rho}{2}\sigma^{2}\,\mathbb{E}\|\mathbf{u}\|^{2}. \tag{34}$$
For Gaussian $\mathbf{u} \sim \mathcal{N}(0, \mathbf{I}_{d})$, $\mathbb{E}\|\mathbf{u}\|^{2} = d$, hence
$$\|\nabla F_{\sigma}(\bm{\theta}) - \nabla F(\bm{\theta})\| \leq \frac{\rho}{2}\sigma^{2}d. \tag{35}$$
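As an illustration of the bound (35), consider a hypothetical separable cubic objective, chosen here only because its smoothing can be computed in closed form; it is not an objective from the paper. Take

$$F(\bm{\theta}) = \frac{\rho}{6}\sum_{k=1}^{d}\theta_{k}^{3}, \qquad \big(\nabla F(\bm{\theta})\big)_{k} = \frac{\rho}{2}\theta_{k}^{2}, \qquad \nabla^{2}F(\bm{\theta}) = \rho\,\mathrm{diag}(\bm{\theta}),$$

so $\nabla^{2}F$ is $\rho$-Lipschitz. Using $\nabla F_{\sigma}(\bm{\theta}) = \mathbb{E}_{u}[\nabla F(\bm{\theta} + \sigma\mathbf{u})]$ with $\mathbf{u} \sim \mathcal{N}(0, \mathbf{I}_{d})$ and $\mathbb{E}[(\theta_{k} + \sigma u_{k})^{2}] = \theta_{k}^{2} + \sigma^{2}$,

$$\big(\nabla F_{\sigma}(\bm{\theta}) - \nabla F(\bm{\theta})\big)_{k} = \frac{\rho}{2}\sigma^{2}, \qquad \|\nabla F_{\sigma}(\bm{\theta}) - \nabla F(\bm{\theta})\| = \frac{\rho}{2}\sigma^{2}\sqrt{d} \;\leq\; \frac{\rho}{2}\sigma^{2}d.$$

In this example the bias has the predicted $\sigma^{2}$ scaling, and the bound is loose by a factor of $\sqrt{d}$.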
##### Remark \(Weaker Assumption\)\.
If we only assume $F$ is $L$-smooth (i.e., $\nabla F$ is $L$-Lipschitz), then one can obtain the weaker bound $\|\nabla F_{\sigma}(\bm{\theta}) - \nabla F(\bm{\theta})\| \leq L\sigma\,\mathbb{E}\|\mathbf{u}\| = O(\sigma\sqrt{d})$ for Gaussian $\mathbf{u}$.
### H.6 Second Moment and Variance Bound of GRZO
##### Setup.
Let $\bm{\theta} = \mathrm{vec}(\mathbf{W}) \in \mathbb{R}^{d}$ with $d = D_{\text{out}}D_{\text{in}}$. For each flattened example $i \in \{1, \dots, B\}$, define the Flipout perturbation $\mathbf{z}_{i} = \mathrm{vec}(\Delta\mathbf{W}_{i})$ with $\Delta\mathbf{W}_{i} = \mathbf{U} \odot (\mathbf{r}_{i}\mathbf{s}_{i}^{\top})$ as in ([12](https://arxiv.org/html/2606.02857#A8.E12)). Consider the two-sided loss difference $\delta_{i} = \ell(\bm{\theta} + \sigma\mathbf{z}_{i}; \xi_{i}) - \ell(\bm{\theta} - \sigma\mathbf{z}_{i}; \xi_{i})$ and the canonical estimator
$$\widehat{g}(\bm{\theta}) := \widehat{g}_{\text{can}}(\bm{\theta}) = \frac{1}{2\sigma B}\sum_{i=1}^{B}\delta_{i}\,\mathbf{z}_{i}. \tag{36}$$
##### Assumptions for This Subsection\.
We assume: (i) $\ell(\cdot; \xi)$ is three-times differentiable and $\|\nabla^{3}\ell(\bm{\theta}; \xi)\|_{\mathrm{op}} \leq \rho$; (ii) the per-example gradients are bounded in second moment: $\mathbb{E}\|\nabla\ell(\bm{\theta}; \xi)\|^{2} \leq \|\nabla F(\bm{\theta})\|^{2} + \nu^{2}$; (iii) the base noise has i.i.d. entries with $\mathbb{E}[\mathbf{U}_{jk}^{2}] = 1$ and fourth moment $m_{4} := \mathbb{E}[\mathbf{U}_{jk}^{4}]$ (Gaussian: $m_{4} = 3$, Rademacher: $m_{4} = 1$).
#### H.6.1 Key Conditional Independence
Let $\mathbf{u} = \mathrm{vec}(\mathbf{U}) \in \mathbb{R}^{d}$ and define the sign vector $\mathbf{v}_{i} = \mathrm{vec}(\mathbf{r}_{i}\mathbf{s}_{i}^{\top}) \in \{\pm 1\}^{d}$. Then
$$\mathbf{z}_{i} = \mathbf{u} \odot \mathbf{v}_{i}. \tag{37}$$
Since $\{(\mathbf{r}_{i}, \mathbf{s}_{i})\}_{i=1}^{B}$ are independent across $i$, the sign vectors $\{\mathbf{v}_{i}\}_{i=1}^{B}$ are independent across $i$. Therefore:
###### Lemma 2 (Conditional independence and moments).
Conditioned on $\mathbf{U}$ (equivalently, on $\mathbf{u}$), the perturbations $\{\mathbf{z}_{i}\}_{i=1}^{B}$ are independent. Moreover,
$$\mathbb{E}[\mathbf{z}_{i} \mid \mathbf{U}] = 0, \qquad \mathbb{E}[\mathbf{z}_{i}\mathbf{z}_{i}^{\top} \mid \mathbf{U}] = \mathrm{diag}(\mathbf{u}^{2}) =: D(\mathbf{U}), \qquad \mathbb{E}[\mathbf{z}_{i}\mathbf{z}_{j}^{\top} \mid \mathbf{U}] = 0\ (i \neq j).$$
Unconditionally, $\mathbb{E}[\mathbf{z}_{i}\mathbf{z}_{i}^{\top}] = \mathbf{I}_{d}$.
*Proof.* Given $\mathbf{U}$, each coordinate satisfies $(\mathbf{z}_{i})_{k} = u_{k}(\mathbf{v}_{i})_{k}$ with $\mathbb{E}[(\mathbf{v}_{i})_{k}] = 0$ and $\mathbb{E}[(\mathbf{v}_{i})_{k}(\mathbf{v}_{i})_{k'}] = 0$ for $k \neq k'$ (factorized Rademacher signs). Independence across $i$ follows from independence of $\{\mathbf{v}_{i}\}$. $\square$
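The conditional moments in Lemma 2 can be confirmed by brute force on a tiny layer. The sketch below fixes a hypothetical $2 \times 2$ base-noise matrix (integer entries chosen for exact arithmetic) and enumerates every sign pair $(\mathbf{r}, \mathbf{s})$; the row-major flattening is an assumption of this example and does not affect the result.

```python
from itertools import product

D_out, D_in = 2, 2
U = [[2, -1], [3, 5]]  # fixed (conditioned-on) base noise, hypothetical values
u = [U[j][k] for j in range(D_out) for k in range(D_in)]  # vec(U), row-major
d = len(u)

# All sign pairs (r, s) are equally likely, so averaging over them is the
# exact conditional expectation given U.
zs = [[U[j][k] * r[j] * s[k] for j in range(D_out) for k in range(D_in)]
      for r in product([-1, 1], repeat=D_out)
      for s in product([-1, 1], repeat=D_in)]
n = len(zs)

cond_mean = [sum(z[a] for z in zs) / n for a in range(d)]
cond_second = [[sum(z[a] * z[c] for z in zs) / n for c in range(d)] for a in range(d)]
print(cond_mean)
print(cond_second)
```

The off-diagonal entries vanish even for coordinates that share a row or column sign, because $\mathbb{E}[r_{j}s_{k}r_{j'}s_{k'}] = \mathbb{E}[r_{j}r_{j'}]\,\mathbb{E}[s_{k}s_{k'}]$ is zero unless $(j,k) = (j',k')$.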
#### H.6.2 Taylor Reduction from Perturbation Products to Quadratic Forms
Fix $\xi_{i}$ and denote $\mathbf{g}_{i} := \nabla_{\bm{\theta}}\ell(\bm{\theta}; \xi_{i})$. By the third-order Taylor expansion (see Appendix [H.4](https://arxiv.org/html/2606.02857#A8.SS4)), for each $i$ there exists a remainder $R_{i}$ such that
$$\delta_{i} = 2\sigma\langle\mathbf{g}_{i}, \mathbf{z}_{i}\rangle + R_{i}, \qquad |R_{i}| \leq \frac{\rho}{3}\sigma^{3}\|\mathbf{z}_{i}\|^{3}. \tag{38}$$
Plugging into $\delta_{i}\mathbf{z}_{i}$ gives the decomposition
$$\delta_{i}\mathbf{z}_{i} = 2\sigma\,\underbrace{\big(\langle\mathbf{g}_{i}, \mathbf{z}_{i}\rangle\mathbf{z}_{i}\big)}_{:=\mathbf{q}_{i}} \;+\; \underbrace{R_{i}\mathbf{z}_{i}}_{:=\mathbf{e}_{i}}. \tag{39}$$
Hence
$$\widehat{g}(\bm{\theta}) = \frac{1}{B}\sum_{i=1}^{B}\mathbf{q}_{i} \;+\; \frac{1}{2\sigma B}\sum_{i=1}^{B}\mathbf{e}_{i}. \tag{40}$$
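To make the remainder in (38) concrete, the sketch below uses a hypothetical separable cubic loss whose third-derivative tensor has operator norm $\rho$. For this loss the remainder has the closed form $R = \frac{\rho}{3}\sigma^{3}\sum_{k}z_{k}^{3}$, which can be compared against the bound. The loss, the constants, and the random draws are assumptions for illustration.

```python
import math
import random

rho, sigma, d = 2.0, 0.1, 5


def loss(theta):
    """Hypothetical cubic loss (rho / 6) * sum_k theta_k^3."""
    return rho / 6.0 * sum(t ** 3 for t in theta)


def grad(theta):
    """Its exact gradient, (rho / 2) * theta_k^2 per coordinate."""
    return [rho / 2.0 * t * t for t in theta]


rng = random.Random(0)
theta = [rng.uniform(-1.0, 1.0) for _ in range(d)]
z = [rng.gauss(0.0, 1.0) for _ in range(d)]

# Two-sided loss difference delta and its linear part 2 * sigma * <g, z>.
delta = (loss([t + sigma * zk for t, zk in zip(theta, z)])
         - loss([t - sigma * zk for t, zk in zip(theta, z)]))
linear = 2.0 * sigma * sum(g * zk for g, zk in zip(grad(theta), z))

remainder = delta - linear
closed_form = rho / 3.0 * sigma ** 3 * sum(zk ** 3 for zk in z)
bound = rho / 3.0 * sigma ** 3 * math.sqrt(sum(zk * zk for zk in z)) ** 3
print(remainder, closed_form, bound)
```

The computed remainder agrees with the closed form and stays below the bound, since $\sum_{k}|z_{k}|^{3} \leq \|\mathbf{z}\|^{3}$.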
#### H.6.3 Expanding the Estimator Second Moment: Diagonal and Cross Terms
##### Conditioning convention.
Throughout this subsubsection and the next, we compute second-moment expressions treating the per-example gradients $\mathbf{g}_{i} = \nabla_{\bm{\theta}}\ell(\bm{\theta}; \xi_{i})$ as held fixed, i.e. all expectations are conditioned implicitly on $\{\xi_{i}\}_{i=1}^{B}$. The outer data expectation $\mathbb{E}_{\xi}$ is applied at the final substitution step ([54](https://arxiv.org/html/2606.02857#A8.E54)) via the standard bounded-variance bound $\sum_{i}\mathbb{E}_{\xi_{i}}\|\mathbf{g}_{i}\|^{2} \leq B(\|\nabla F(\bm{\theta})\|^{2} + \nu^{2})$ from assumption (ii). Inner conditional expectations $\mathbb{E}[\cdot \mid \mathbf{U}]$ average over the per-example sign vectors $\{\mathbf{v}_{i}\}$ only; the outer expectations $\mathbb{E}[\cdot]$ average additionally over $\mathbf{U}$.
We first analyze the leading term $\frac{1}{B}\sum_{i}\mathbf{q}_{i}$. Using $\|\sum_{i}a_{i}\|^{2} = \sum_{i}\|a_{i}\|^{2} + \sum_{i \neq j}a_{i}^{\top}a_{j}$,
$$\mathbb{E}\left\|\frac{1}{B}\sum_{i=1}^{B}\mathbf{q}_{i}\right\|^{2} = \frac{1}{B^{2}}\sum_{i=1}^{B}\mathbb{E}\|\mathbf{q}_{i}\|^{2} \;+\; \frac{1}{B^{2}}\sum_{i \neq j}\mathbb{E}\big[\mathbf{q}_{i}^{\top}\mathbf{q}_{j}\big]. \tag{41}$$
##### Diagonal Term\.
Condition on $\mathbf{U}$ and hold $\mathbf{g}_{i}$ fixed. Note that $\|\mathbf{z}_{i}\|^{2} = \|\mathbf{u}\|^{2}$ does not depend on $\mathbf{v}_{i}$. Moreover, $\mathbf{q}_{i} = \langle\mathbf{g}_{i}, \mathbf{z}_{i}\rangle\mathbf{z}_{i}$ implies
$$\|\mathbf{q}_{i}\|^{2} = (\langle\mathbf{g}_{i}, \mathbf{z}_{i}\rangle)^{2}\|\mathbf{z}_{i}\|^{2}.$$
Using Lemma [2](https://arxiv.org/html/2606.02857#Thmlemma2) and $\mathbb{E}[(\langle\mathbf{g}_{i}, \mathbf{z}_{i}\rangle)^{2} \mid \mathbf{U}] = \mathbf{g}_{i}^{\top}D(\mathbf{U})\mathbf{g}_{i} = \sum_{k=1}^{d}g_{i,k}^{2}u_{k}^{2}$, we get
$$\mathbb{E}\|\mathbf{q}_{i}\|^{2} = \mathbb{E}\Big[\|\mathbf{u}\|^{2} \cdot \mathbf{g}_{i}^{\top}D(\mathbf{U})\mathbf{g}_{i}\Big] = \sum_{k=1}^{d}g_{i,k}^{2}\,\mathbb{E}\big[\|\mathbf{u}\|^{2}u_{k}^{2}\big]. \tag{42}$$
For i.i.d. coordinates with $\mathbb{E}[u_{k}^{2}] = 1$ and $\mathbb{E}[u_{k}^{4}] = m_{4}$,
$$\mathbb{E}\big[\|\mathbf{u}\|^{2}u_{k}^{2}\big] = \mathbb{E}[u_{k}^{4}] + \sum_{\ell \neq k}\mathbb{E}[u_{\ell}^{2}]\mathbb{E}[u_{k}^{2}] = m_{4} + (d - 1). \tag{43}$$
Thus
$$\mathbb{E}\|\mathbf{q}_{i}\|^{2} = (d - 1 + m_{4})\,\|\mathbf{g}_{i}\|^{2}. \tag{44}$$
##### Cross Term \(Second Moment\)\.
For $i \neq j$, conditioned on $\mathbf{U}$ the vectors $\mathbf{z}_{i}$ and $\mathbf{z}_{j}$ are independent (Lemma [2](https://arxiv.org/html/2606.02857#Thmlemma2)), hence $\mathbf{q}_{i}$ and $\mathbf{q}_{j}$ are conditionally independent. Therefore
$$\mathbb{E}[\mathbf{q}_{i}^{\top}\mathbf{q}_{j} \mid \mathbf{U}] = \mathbb{E}[\mathbf{q}_{i} \mid \mathbf{U}]^{\top}\mathbb{E}[\mathbf{q}_{j} \mid \mathbf{U}]. \tag{45}$$
Moreover, $\mathbb{E}[\mathbf{q}_{i} \mid \mathbf{U}] = \mathbb{E}[\langle\mathbf{g}_{i}, \mathbf{z}_{i}\rangle\mathbf{z}_{i} \mid \mathbf{U}] = \mathbb{E}[\mathbf{z}_{i}\mathbf{z}_{i}^{\top} \mid \mathbf{U}]\mathbf{g}_{i} = D(\mathbf{U})\mathbf{g}_{i}$. Thus, per the conditioning convention above (gradients $\mathbf{g}_{i}$ held fixed in expectation),
$$\mathbb{E}[\mathbf{q}_{i}^{\top}\mathbf{q}_{j}] = \mathbb{E}\big[\mathbf{g}_{i}^{\top}D(\mathbf{U})^{2}\mathbf{g}_{j}\big] = \sum_{k=1}^{d}g_{i,k}g_{j,k}\,\mathbb{E}[u_{k}^{4}] = m_{4}\,\langle\mathbf{g}_{i}, \mathbf{g}_{j}\rangle. \tag{46}$$
Under i.i.d. data sampling, the subsequent outer data expectation gives $\mathbb{E}_{\xi}[\langle\mathbf{g}_{i}, \mathbf{g}_{j}\rangle] = \|\nabla F(\bm{\theta})\|^{2}$ for $i \neq j$; the resulting contribution cancels exactly in the centered *variance* bound ([51](https://arxiv.org/html/2606.02857#A8.E51)) (cf. Subsubsection [H.6.4](https://arxiv.org/html/2606.02857#A8.SS6.SSS4)).
##### Putting Diagonal \+ Cross Together\.
Combining \([41](https://arxiv.org/html/2606.02857#A8.E41)\), \([44](https://arxiv.org/html/2606.02857#A8.E44)\), and \([46](https://arxiv.org/html/2606.02857#A8.E46)\), we obtain
𝔼‖1B∑i=1B𝐪i‖2\\displaystyle\\mathbb\{E\}\\left\\\|\\frac\{1\}\{B\}\\sum\_\{i=1\}^\{B\}\\mathbf\{q\}\_\{i\}\\right\\\|^\{2\}=d−1\+m4B2∑i=1B‖𝐠i‖2\+m4B2∑i≠j⟨𝐠i,𝐠j⟩\\displaystyle=\\frac\{d\-1\+m\_\{4\}\}\{B^\{2\}\}\\sum\_\{i=1\}^\{B\}\\\|\\mathbf\{g\}\_\{i\}\\\|^\{2\}\\;\+\\;\\frac\{m\_\{4\}\}\{B^\{2\}\}\\sum\_\{i\\neq j\}\\langle\\mathbf\{g\}\_\{i\},\\mathbf\{g\}\_\{j\}\\rangle=d−1B2∑i=1B‖𝐠i‖2\+m4B2‖∑i=1B𝐠i‖2\.\\displaystyle=\\frac\{d\-1\}\{B^\{2\}\}\\sum\_\{i=1\}^\{B\}\\\|\\mathbf\{g\}\_\{i\}\\\|^\{2\}\\;\+\\;\\frac\{m\_\{4\}\}\{B^\{2\}\}\\left\\\|\\sum\_\{i=1\}^\{B\}\\mathbf\{g\}\_\{i\}\\right\\\|^\{2\}\.\(47\)
##### Remark \(Gaussian vs\. Rademacher\)\.
If $\mathbf{U}$ is Rademacher, then $m_{4} = 1$ and the cross term becomes $\frac{1}{B^{2}}\|\sum_{i}\mathbf{g}_{i}\|^{2}$. If $\mathbf{U}$ is Gaussian, then $m_{4} = 3$ and the same term is scaled by $3$.
#### H.6.4 Variance Bound: Cross Terms Vanish After Centering
The quantity above is a second moment. For a *variance* bound around the mean, cross terms vanish due to Flipout decorrelation.
Let $\bar{\mathbf{q}}_{i} := \mathbf{q}_{i} - \mathbb{E}[\mathbf{q}_{i} \mid \mathbf{U}]$. Conditioned on $\mathbf{U}$, the $\{\bar{\mathbf{q}}_{i}\}$ are independent and zero-mean, so for $i \neq j$,
$$\mathbb{E}\big[\bar{\mathbf{q}}_{i}^{\top}\bar{\mathbf{q}}_{j} \mid \mathbf{U}\big] = 0. \tag{48}$$
Therefore,
$$\mathbb{E}\left\|\frac{1}{B}\sum_{i=1}^{B}\bar{\mathbf{q}}_{i}\right\|^{2} = \frac{1}{B^{2}}\sum_{i=1}^{B}\mathbb{E}\|\bar{\mathbf{q}}_{i}\|^{2}. \tag{49}$$
Next,
$$\begin{aligned}\mathbb{E}\|\bar{\mathbf{q}}_{i}\|^{2} &= \mathbb{E}\|\mathbf{q}_{i}\|^{2} - \mathbb{E}\|\mathbb{E}[\mathbf{q}_{i} \mid \mathbf{U}]\|^{2} \\ &= (d - 1 + m_{4})\|\mathbf{g}_{i}\|^{2} - \mathbb{E}\|D(\mathbf{U})\mathbf{g}_{i}\|^{2} = (d - 1 + m_{4})\|\mathbf{g}_{i}\|^{2} - m_{4}\|\mathbf{g}_{i}\|^{2} \\ &= (d - 1)\|\mathbf{g}_{i}\|^{2}.\end{aligned} \tag{50}$$
Combining ([49](https://arxiv.org/html/2606.02857#A8.E49))–([50](https://arxiv.org/html/2606.02857#A8.E50)) yields the variance-type bound
$$\mathbb{E}\left\|\frac{1}{B}\sum_{i=1}^{B}\mathbf{q}_{i} - \mathbb{E}\left[\frac{1}{B}\sum_{i=1}^{B}\mathbf{q}_{i}\right]\right\|^{2} \leq \frac{d - 1}{B^{2}}\sum_{i=1}^{B}\|\mathbf{g}_{i}\|^{2}. \tag{51}$$
This is exactly where Flipout decorrelation removes the cross-example covariance terms.
#### H.6.5 Remainder Control and Final Bound with Explicit Dimensions
We now incorporate the Taylor remainder $\mathbf{e}_{i}=R_{i}\mathbf{z}_{i}$ from ([39](https://arxiv.org/html/2606.02857#A8.E39)). Using $|R_{i}|\leq\frac{\rho}{3}\sigma^{3}\|\mathbf{z}_{i}\|^{3}$, we have
$$\|\mathbf{e}_{i}\|^{2}\leq\frac{\rho^{2}}{9}\sigma^{6}\|\mathbf{z}_{i}\|^{8}.\tag{52}$$
Hence
$$\mathbb{E}\left\|\frac{1}{2\sigma B}\sum_{i=1}^{B}\mathbf{e}_{i}\right\|^{2}\leq\frac{1}{4\sigma^{2}B}\sum_{i=1}^{B}\mathbb{E}\|\mathbf{e}_{i}\|^{2}\leq\frac{1}{4\sigma^{2}}\cdot\frac{\rho^{2}}{9}\sigma^{6}\,\mathbb{E}\|\mathbf{z}\|^{8}=O\!\left(\rho^{2}\sigma^{4}\,\mathbb{E}\|\mathbf{z}\|^{8}\right).\tag{53}$$
For Rademacher $\mathbf{U}$ (hence $\|\mathbf{z}\|^{2}=\|\mathbf{u}\|^{2}=d$ deterministically), $\mathbb{E}\|\mathbf{z}\|^{8}=d^{4}$. For Gaussian $\mathbf{U}$, $\|\mathbf{u}\|^{2}$ is $\chi^{2}(d)$ and $\mathbb{E}\|\mathbf{z}\|^{8}=\mathbb{E}\|\mathbf{u}\|^{8}=O(d^{4})$. Thus the remainder contributes $O\!\left(\rho^{2}\sigma^{4}d^{4}\right)$.
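To make the two eighth-moment cases concrete, the following sketch computes $\mathbb{E}\|\mathbf{z}\|^{8}$ in closed form and plugs it into the right-hand side of (53). It relies on the standard chi-square moment formula $\mathbb{E}[(\chi^2_d)^4]=d(d+2)(d+4)(d+6)$, which is not stated in the text above; the function names are hypothetical.

```python
def eighth_moment_rademacher(d: int) -> int:
    """E||z||^8 when ||z||^2 = d deterministically (Rademacher base noise)."""
    return d ** 4

def eighth_moment_gaussian(d: int) -> int:
    """E||z||^8 = E[(chi^2_d)^4] = d (d+2) (d+4) (d+6) for Gaussian base noise."""
    return d * (d + 2) * (d + 4) * (d + 6)

def remainder_second_moment_bound(rho: float, sigma: float, m8: float) -> float:
    """Right-hand side of (53): (1 / (4 sigma^2)) * (rho^2 / 9) * sigma^6 * E||z||^8."""
    return (rho ** 2) * (sigma ** 4) * m8 / 36.0

for d in (4, 64, 1024):
    ratio = eighth_moment_gaussian(d) / eighth_moment_rademacher(d)
    print(d, ratio)  # the ratio approaches 1 as d grows, consistent with O(d^4)
```

Both cases therefore share the same $d^{4}$ leading order, and the Gaussian case differs only by a dimension-dependent constant that shrinks toward one.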
##### Final Variance Bound with Explicit Dimensions\.
Let $d=D_{\text{out}}D_{\text{in}}$. Writing $\widehat{g}=\tfrac{1}{B}\sum_{i}\mathbf{q}_{i}+\tfrac{1}{2\sigma B}\sum_{i}\mathbf{e}_{i}$ via ([40](https://arxiv.org/html/2606.02857#A8.E40)), Young's inequality $\|a+b\|^{2}\leq(1+\tau)\|a\|^{2}+(1+1/\tau)\|b\|^{2}$ with any $\tau>0$, together with $\mathbb{E}\|X-\mathbb{E}X\|^{2}\leq\mathbb{E}\|X\|^{2}$, controls $\mathrm{Var}(\widehat{g})$ by $(1+\tau)$ times the leading-term variance ([51](https://arxiv.org/html/2606.02857#A8.E51)) plus $(1+1/\tau)$ times the remainder second moment ([53](https://arxiv.org/html/2606.02857#A8.E53)). Choosing any fixed $\tau\in(0,1]$ (e.g. $\tau=1$) keeps the leading coefficient at $O(1)$, and the resulting $(1+1/\tau)$ constant is absorbed into the $O(\rho^{2}\sigma^{4}d^{4})$ remainder. Using $\sum_{i=1}^{B}\|\mathbf{g}_{i}\|^{2}\leq B(\|\nabla F(\bm{\theta})\|^{2}+\nu^{2})$ in expectation, we obtain
$$\mathbb{E}\big\|\widehat{g}(\bm{\theta})-\mathbb{E}[\widehat{g}(\bm{\theta})]\big\|^{2}\;\leq\;\frac{d-1}{B}\big(\|\nabla F(\bm{\theta})\|^{2}+\nu^{2}\big)\;+\;O\!\left(\rho^{2}\sigma^{4}d^{4}\right),\qquad d=D_{\text{out}}D_{\text{in}}.\tag{54}$$
##### Variance Bound for the GRZO Estimator\.
The GRZO estimator factors as
$$\widehat{g}_{\mathrm{GRZO}}(\bm{\theta})\;=\;\frac{1}{2\sigma B}\sum_{i=1}^{B}\frac{\delta_{i}}{s+\epsilon}\,\mathbf{z}_{i}\;=\;\frac{1}{s+\epsilon}\cdot\widehat{g}_{\mathrm{can}}(\bm{\theta}),\tag{55}$$
where $\widehat{g}_{\mathrm{can}}=\frac{1}{2\sigma B}\sum_{i}\delta_{i}\mathbf{z}_{i}$ is the canonical estimator bounded by ([54](https://arxiv.org/html/2606.02857#A8.E54)). The self-normalization $1/(s+\epsilon)$ is a *global* scalar shared across $i$ (although $s$ itself depends on all $\{\delta_{j}\}$), so it acts as a multiplicative prefactor on $\widehat{g}_{\mathrm{can}}$ rather than introducing per-example weights; the cross-term decorrelation already verified for $\widehat{g}_{\mathrm{can}}$ via Lemma [2](https://arxiv.org/html/2606.02857#Thmlemma2) therefore carries through up to this prefactor.
Under the concentration assumption of ([24](https://arxiv.org/html/2606.02857#A8.E24)), namely that $s_{t}$ concentrates around a positive deterministic scalar $s_{\star}(\bm{\theta}_{t})$ at typical batch sizes, the GRZO update $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\,\widehat{g}_{\mathrm{GRZO}}$ is equivalent to a canonical SPSA update with adaptive effective step size $\tilde{\eta}_{t}=\eta/(s_{t}+\epsilon)$ (treated as the operative step size throughout Appendix [H.7](https://arxiv.org/html/2606.02857#A8.SS7)). The variance relevant to the convergence rate is then that of $\widehat{g}_{\mathrm{can}}$ in the effective-step-size frame:
$$\mathrm{Var}\bigl(\widehat{g}_{\mathrm{can}}(\bm{\theta})\bigr)\;\leq\;\frac{d-1}{B}\bigl(\|\nabla F(\bm{\theta})\|^{2}+\nu^{2}\bigr)\;+\;O\!\bigl(\rho^{2}\sigma^{4}d^{4}\bigr),\tag{56}$$
matching ([54](https://arxiv.org/html/2606.02857#A8.E54)) and exhibiting the $1/B$ scaling responsible for the $\sqrt{B_{\mathrm{eff}}}$ convergence improvement. Theorem [2](https://arxiv.org/html/2606.02857#Thmtheorem2) in the main body states this bound in the GRZO frame, with the self-normalization $c_{t}^{2}:=1/(s_{\star}+\epsilon)^{2}$ absorbed into the effective step size used in Appendix [H.7](https://arxiv.org/html/2606.02857#A8.SS7).
### H.7 Nonconvex Convergence of GRZO
##### Objective and Update.
Let $F(\bm{\theta})=\mathbb{E}_{\xi}[\ell(\bm{\theta};\xi)]$ be the population objective and $F_{\sigma}$ be the smoothed objective induced by the (Flipout) perturbation distribution, as defined in Appendix [H.3](https://arxiv.org/html/2606.02857#A8.SS3) (Gaussian case) or in the Taylor-based view (Appendix [H.4](https://arxiv.org/html/2606.02857#A8.SS4)). We analyze the GRZO update with group-relative normalization:
$$\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\,\widehat{\mathbf{g}}_{t},\qquad\widehat{\mathbf{g}}_{t}=\frac{1}{2\sigma B}\sum_{i=1}^{B}a_{t,i}\,\mathbf{z}_{t,i},\tag{57}$$
where $\delta_{t,i}=\ell(\bm{\theta}_{t}+\sigma\mathbf{z}_{t,i};\xi_{t,i})-\ell(\bm{\theta}_{t}-\sigma\mathbf{z}_{t,i};\xi_{t,i})$, $\bar{\delta}_{t}=\frac{1}{B}\sum_{i}\delta_{t,i}$, $s_{t}=\sqrt{\frac{1}{B}\sum_{i}(\delta_{t,i}-\bar{\delta}_{t})^{2}}$, and $a_{t,i}=\delta_{t,i}/(s_{t}+\epsilon)$.
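The update (57) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the quadratic toy loss, the use of independent Rademacher directions in place of Flipout, and all function names are hypothetical. It also checks the factorization of the GRZO estimator into $1/(s+\epsilon)$ times the canonical estimator.

```python
import math
import random

def loss(theta, xi):
    """Toy per-example loss 0.5 * ||theta - xi||^2 (an assumption for illustration)."""
    return 0.5 * sum((t - x) ** 2 for t, x in zip(theta, xi))

def loss_differences(theta, zs, xis, sigma):
    """Two-sided differences delta_i = l(theta + sigma z_i; xi_i) - l(theta - sigma z_i; xi_i)."""
    deltas = []
    for z, xi in zip(zs, xis):
        plus = [t + sigma * v for t, v in zip(theta, z)]
        minus = [t - sigma * v for t, v in zip(theta, z)]
        deltas.append(loss(plus, xi) - loss(minus, xi))
    return deltas

def grzo_estimate(deltas, zs, sigma, eps=1e-8):
    """Estimator in (57): a_i = delta_i / (s + eps), g = (1 / (2 sigma B)) sum_i a_i z_i."""
    B, d = len(deltas), len(zs[0])
    mean = sum(deltas) / B
    s = math.sqrt(sum((x - mean) ** 2 for x in deltas) / B)  # within-batch SD, 1/B normalization
    a = [x / (s + eps) for x in deltas]                      # deltas are scaled, not mean-centered
    g = [sum(a[i] * zs[i][k] for i in range(B)) / (2 * sigma * B) for k in range(d)]
    return g, s

rng = random.Random(0)
d, B, sigma, eps = 6, 8, 1e-2, 1e-8
theta = [rng.gauss(0, 1) for _ in range(d)]
xis = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(B)]
zs = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(B)]  # plain Rademacher, not Flipout

deltas = loss_differences(theta, zs, xis, sigma)
g_grzo, s = grzo_estimate(deltas, zs, sigma, eps)
g_can = [sum(deltas[i] * zs[i][k] for i in range(B)) / (2 * sigma * B) for k in range(d)]
print(max(abs(g_grzo[k] * (s + eps) - g_can[k]) for k in range(d)))  # floating-point error only
```

Because $s_t$ is a single scalar per step, rescaling by $1/(s_t+\epsilon)$ changes the length of the update but not its direction, which is why the analysis treats it as an adaptive step size.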
##### Assumptions\.
We assume: (A1) $F_{\sigma}$ is lower bounded by $F_{\sigma}^{\star}$. (A2) $F_{\sigma}$ is $\mathcal{L}$-smooth: $\|\nabla F_{\sigma}(\bm{\theta})-\nabla F_{\sigma}(\bm{\theta}^{\prime})\|\leq\mathcal{L}\|\bm{\theta}-\bm{\theta}^{\prime}\|$. (A3) Data noise: $\mathbb{E}\|\nabla\ell(\bm{\theta};\xi)-\nabla F(\bm{\theta})\|^{2}\leq\nu^{2}$. (A4) Approximate direction-preservation (from ([24](https://arxiv.org/html/2606.02857#A8.E24))): under concentration of $s_{t}$ around a positive deterministic scalar $s_{\star}(\bm{\theta}_{t})$, the GRZO estimator satisfies
$$\mathbb{E}\big[\widehat{\mathbf{g}}_{t}\mid\bm{\theta}_{t}\big]=c_{t}\,\nabla F_{\sigma}(\bm{\theta}_{t})+O(\sigma^{2}),\tag{58}$$
where $c_{t}\approx 1/(s_{\star}(\bm{\theta}_{t})+\epsilon)>0$ is a positive scalar absorbed into the effective step size $\tilde{\eta}:=\eta\,c_{t}$. For the remainder of this section we treat $\eta$ as the effective step size, so that $\mathbb{E}[\widehat{\mathbf{g}}_{t}/c_{t}\mid\bm{\theta}_{t}]=\nabla F_{\sigma}(\bm{\theta}_{t})+O(\sigma^{2})$; the residual concentration error of $s_{t}$ is empirically negligible at the batch sizes used and folded into the $O(\sigma^{2})$ remainder. (A5) Second-moment / variance bound (from Appendix [H.6](https://arxiv.org/html/2606.02857#A8.SS6)): there exist explicit constants $A_{\text{var}}(\bm{\theta}_{t})$ such that
$$\mathbb{E}\big\|\widehat{\mathbf{g}}_{t}-\nabla F_{\sigma}(\bm{\theta}_{t})\big\|^{2}\leq A_{\text{var}}(\bm{\theta}_{t}),\tag{59}$$
with $A_{\text{var}}(\bm{\theta}_{t})$ depending on $(B,D_{\text{in}},D_{\text{out}})$ through $d=D_{\text{out}}D_{\text{in}}$.
#### H.7.1 One-Step Descent via Smoothness
By $\mathcal{L}$-smoothness of $F_{\sigma}$, for any random direction $\mathbf{u}$,
$$F_{\sigma}(\bm{\theta}_{t+1})\leq F_{\sigma}(\bm{\theta}_{t})+\langle\nabla F_{\sigma}(\bm{\theta}_{t}),\bm{\theta}_{t+1}-\bm{\theta}_{t}\rangle+\frac{\mathcal{L}}{2}\|\bm{\theta}_{t+1}-\bm{\theta}_{t}\|^{2}.\tag{60}$$
Substitute $\bm{\theta}_{t+1}-\bm{\theta}_{t}=-\eta\widehat{\mathbf{g}}_{t}$:
$$F_{\sigma}(\bm{\theta}_{t+1})\leq F_{\sigma}(\bm{\theta}_{t})-\eta\langle\nabla F_{\sigma}(\bm{\theta}_{t}),\widehat{\mathbf{g}}_{t}\rangle+\frac{\mathcal{L}\eta^{2}}{2}\|\widehat{\mathbf{g}}_{t}\|^{2}.\tag{61}$$
Take conditional expectation given $\bm{\theta}_{t}$. After absorbing the positive scalar $c_{t}$ into the effective step size as in (A4), we may write $\mathbb{E}[\widehat{\mathbf{g}}_{t}\mid\bm{\theta}_{t}]=\nabla F_{\sigma}(\bm{\theta}_{t})+b_{t}$ with $\|b_{t}\|=O(\sigma^{2})$ (smoothing bias only).
$$\mathbb{E}\big[F_{\sigma}(\bm{\theta}_{t+1})\mid\bm{\theta}_{t}\big]\leq F_{\sigma}(\bm{\theta}_{t})-\eta\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}-\eta\langle\nabla F_{\sigma}(\bm{\theta}_{t}),b_{t}\rangle+\frac{\mathcal{L}\eta^{2}}{2}\mathbb{E}\big[\|\widehat{\mathbf{g}}_{t}\|^{2}\mid\bm{\theta}_{t}\big].\tag{62}$$
By Cauchy–Schwarz and Young's inequality, the smoothing-bias cross term satisfies
$$|\eta\langle\nabla F_{\sigma}(\bm{\theta}_{t}),b_{t}\rangle|\;\leq\;\tfrac{\eta}{2}\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}+\tfrac{\eta}{2}\|b_{t}\|^{2}\;=\;\tfrac{\eta}{2}\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}+O(\eta\sigma^{4}),$$
whose first term is absorbed into the leading $-\eta\|\nabla F_{\sigma}\|^{2}$ term and whose second term is kept as an $O(\eta\sigma^{4})$ remainder. Therefore, with effective gradient coefficient $\tfrac{\eta}{2}$:
$$\mathbb{E}\big[F_{\sigma}(\bm{\theta}_{t+1})\mid\bm{\theta}_{t}\big]\leq F_{\sigma}(\bm{\theta}_{t})-\tfrac{\eta}{2}\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}+\frac{\mathcal{L}\eta^{2}}{2}\mathbb{E}\big[\|\widehat{\mathbf{g}}_{t}\|^{2}\mid\bm{\theta}_{t}\big]+O(\eta\sigma^{4}).\tag{63}$$
Next decompose the second moment:
$$\mathbb{E}\big[\|\widehat{\mathbf{g}}_{t}\|^{2}\mid\bm{\theta}_{t}\big]=\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}+\mathbb{E}\big[\|\widehat{\mathbf{g}}_{t}-\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}\mid\bm{\theta}_{t}\big].\tag{64}$$
Plugging ([64](https://arxiv.org/html/2606.02857#A8.E64)) and ([59](https://arxiv.org/html/2606.02857#A8.E59)) into ([63](https://arxiv.org/html/2606.02857#A8.E63)) gives
$$\mathbb{E}\big[F_{\sigma}(\bm{\theta}_{t+1})\mid\bm{\theta}_{t}\big]\leq F_{\sigma}(\bm{\theta}_{t})-\frac{\eta}{2}\bigl(1-\mathcal{L}\eta\bigr)\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}+\frac{\mathcal{L}\eta^{2}}{2}\,A_{\text{var}}(\bm{\theta}_{t}).\tag{65}$$
Assuming $\eta\leq 1/(2\mathcal{L})$, we have $1-\mathcal{L}\eta\geq\frac{1}{2}$ and thus
$$\mathbb{E}\big[F_{\sigma}(\bm{\theta}_{t+1})\mid\bm{\theta}_{t}\big]\leq F_{\sigma}(\bm{\theta}_{t})-\frac{\eta}{4}\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}+\frac{\mathcal{L}\eta^{2}}{2}\,A_{\text{var}}(\bm{\theta}_{t}).\tag{66}$$
#### H.7.2 Handling Randomness: Data Noise + ZO Noise
We now instantiate $A_{\text{var}}(\bm{\theta}_{t})$ using Appendix [H.6](https://arxiv.org/html/2606.02857#A8.SS6). Let $d=D_{\text{out}}D_{\text{in}}$. From ([54](https://arxiv.org/html/2606.02857#A8.E54)) (Appendix [H.6](https://arxiv.org/html/2606.02857#A8.SS6)), we have
$$A_{\text{var}}(\bm{\theta}_{t})\;\leq\;\frac{d-1}{B}\Big(\|\nabla F(\bm{\theta}_{t})\|^{2}+\nu^{2}\Big)\;+\;C_{\text{ZO}}(d,B,\sigma),\tag{67}$$
where $C_{\text{ZO}}$ collects the finite-difference remainder (ZO noise) terms, e.g.,
$$C_{\text{ZO}}(d,B,\sigma)=O\!\left(\rho^{2}\sigma^{4}d^{4}\right).\tag{68}$$
To express everything in terms of $\nabla F_{\sigma}(\bm{\theta}_{t})$, we use the smoothing bias bound (Appendix [H.5](https://arxiv.org/html/2606.02857#A8.SS5)): assuming $\|\nabla F_{\sigma}(\bm{\theta})-\nabla F(\bm{\theta})\|\leq c_{\text{bias}}\sigma^{2}d$ for all $\bm{\theta}$, we have
$$\|\nabla F(\bm{\theta}_{t})\|^{2}\leq 2\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}+2c_{\text{bias}}^{2}\sigma^{4}d^{2}.\tag{69}$$
Plug ([69](https://arxiv.org/html/2606.02857#A8.E69)) into ([67](https://arxiv.org/html/2606.02857#A8.E67)):
$$A_{\text{var}}(\bm{\theta}_{t})\leq\underbrace{\frac{2(d-1)}{B}}_{:=\alpha}\,\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}\;+\;\underbrace{\frac{d-1}{B}\Big(\nu^{2}+2c_{\text{bias}}^{2}\sigma^{4}d^{2}\Big)+C_{\text{ZO}}(d,B,\sigma)}_{:=\beta}.\tag{70}$$
#### H.7.3 Telescoping and Average Gradient Norm Bound
Take full expectation of ([66](https://arxiv.org/html/2606.02857#A8.E66)) and sum from $t=0$ to $T-1$:
$$\mathbb{E}[F_{\sigma}(\bm{\theta}_{T})]\leq F_{\sigma}(\bm{\theta}_{0})-\frac{\eta}{4}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}+\frac{\mathcal{L}\eta^{2}}{2}\sum_{t=0}^{T-1}\mathbb{E}\big[A_{\text{var}}(\bm{\theta}_{t})\big].\tag{71}$$
Using the bound ([70](https://arxiv.org/html/2606.02857#A8.E70)) and rearranging gives
$$
\begin{aligned}
\frac{\eta}{4}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}
&\leq F_{\sigma}(\bm{\theta}_{0})-\mathbb{E}[F_{\sigma}(\bm{\theta}_{T})]+\frac{\mathcal{L}\eta^{2}}{2}\sum_{t=0}^{T-1}\mathbb{E}\left[\alpha\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}+\beta\right]\\
&\leq F_{\sigma}(\bm{\theta}_{0})-F_{\sigma}^{\star}+\frac{\mathcal{L}\eta^{2}\alpha}{2}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}+\frac{\mathcal{L}\eta^{2}T}{2}\beta.
\end{aligned}
\tag{72}
$$
Move the gradient-sum term to the left:
$$\left(\frac{\eta}{4}-\frac{\mathcal{L}\eta^{2}\alpha}{2}\right)\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}\leq F_{\sigma}(\bm{\theta}_{0})-F_{\sigma}^{\star}+\frac{\mathcal{L}\eta^{2}T}{2}\beta.\tag{73}$$
Choose step size satisfying
$$\eta\leq\min\left\{\frac{1}{2\mathcal{L}},\frac{1}{4\mathcal{L}\alpha}\right\}=\min\left\{\frac{1}{2\mathcal{L}},\frac{B}{8\mathcal{L}(d-1)}\right\}.\tag{74}$$
With $2\mathcal{L}\eta\alpha\leq\frac{1}{2}$, we have $1-2\mathcal{L}\eta\alpha\geq\frac{1}{2}$, hence $\frac{\eta}{4}(1-2\mathcal{L}\eta\alpha)\geq\frac{\eta}{8}$, and ([73](https://arxiv.org/html/2606.02857#A8.E73)) yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F_{\sigma}(\bm{\theta}_{t})\|^{2}\leq\frac{8\big(F_{\sigma}(\bm{\theta}_{0})-F_{\sigma}^{\star}\big)}{\eta T}+4\mathcal{L}\eta\,\beta.\tag{75}$$
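To see how the constants in (70), (74), and (75) interact, the sketch below computes $\alpha$, $\beta$, the largest admissible step size, and the resulting bound for given problem constants. All numerical inputs are placeholders chosen for illustration, `c_zo` stands in for the unspecified constant hidden in $C_{\text{ZO}}$, and the function names are hypothetical.

```python
def rate_constants(d, B, nu2, sigma, c_bias, c_zo, L):
    """alpha and beta from (70) and the largest step size allowed by (74). Assumes d >= 2."""
    alpha = 2.0 * (d - 1) / B
    beta = (d - 1) / B * (nu2 + 2.0 * c_bias ** 2 * sigma ** 4 * d ** 2) + c_zo
    eta_max = min(1.0 / (2.0 * L), 1.0 / (4.0 * L * alpha))
    return alpha, beta, eta_max

def average_gradient_bound(gap, eta, T, L, beta):
    """Right-hand side of (75): 8 * gap / (eta * T) + 4 * L * eta * beta."""
    return 8.0 * gap / (eta * T) + 4.0 * L * eta * beta

# Placeholder constants, not values from the paper.
alpha, beta, eta_max = rate_constants(d=101, B=50, nu2=1.0, sigma=1e-3,
                                      c_bias=1.0, c_zo=0.0, L=1.0)
bound = average_gradient_bound(gap=1.0, eta=eta_max, T=10_000, L=1.0, beta=beta)
print(alpha, eta_max, bound)
```

Since $\alpha$ scales as $(d-1)/B$, a larger batch both relaxes the step-size cap $B/(8\mathcal{L}(d-1))$ and shrinks the leading part of $\beta$.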
##### Making $\beta$ explicit in $(B,D_{\text{in}},D_{\text{out}})$.
Recall $d=D_{\text{out}}D_{\text{in}}$. From ([70](https://arxiv.org/html/2606.02857#A8.E70)),
$$\beta=\frac{d-1}{B}\Big(\nu^{2}+2c_{\text{bias}}^{2}\sigma^{4}d^{2}\Big)+O\!\left(\rho^{2}\sigma^{4}d^{4}\right),\qquad d=D_{\text{out}}D_{\text{in}}.\tag{76}$$
Plugging ([76](https://arxiv.org/html/2606.02857#A8.E76)) into ([75](https://arxiv.org/html/2606.02857#A8.E75)) gives an explicit bound that cleanly separates the *data noise* term $\nu^{2}$ and the *ZO noise* terms (proportional to $\sigma^{4}$).
##### GRZO Normalization Effect on Convergence\.
Direct algebra gives $\sum_{i}a_{i}^{2}=B\bigl(1+(\bar{\delta}/s)^{2}\bigr)\bigl(s/(s+\epsilon)\bigr)^{2}$, so $\frac{1}{B}\sum_{i}a_{i}^{2}\approx 1$ at typical batch sizes *provided*: (i) $\mathbb{E}[s_{t}]\gg\epsilon$ (the within-batch SD dominates the numerical-stability constant, so $s/(s+\epsilon)\approx 1$); and (ii) the batch is large enough that $\mathbb{E}[(\bar{\delta}/s)^{2}]=O(1/B)$ (for two-sided ZO, $\mathbb{E}[\bar{\delta}]=0$ by symmetry, so this follows from a standard $O(1/\sqrt{B})$ CLT-type estimate on $\bar{\delta}$). Under (i)–(ii), the variance of $\widehat{g}_{\mathrm{GRZO}}$ is self-normalized and does not grow with the magnitude of $\{\delta_{i}\}$. The same convergence bound holds with the rescaled step size $\eta$ that absorbs the positive scalar $c_{t}=1/(s_{\star}(\bm{\theta})+\epsilon)$ from directional unbiasedness (A4).
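The identity for $\sum_i a_i^2$ can be checked numerically. The sketch below compares the direct sum with the closed form on synthetic loss differences; the Gaussian draws and the name `advantage_energy` are assumptions made only for this illustration.

```python
import math
import random

def advantage_energy(deltas, eps):
    """Return (sum_i a_i^2 computed directly, the closed form B (1 + (mean/s)^2) (s/(s+eps))^2)."""
    B = len(deltas)
    mean = sum(deltas) / B
    s = math.sqrt(sum((x - mean) ** 2 for x in deltas) / B)  # within-batch SD
    direct = sum((x / (s + eps)) ** 2 for x in deltas)       # a_i = delta_i / (s + eps)
    closed = B * (1.0 + (mean / s) ** 2) * (s / (s + eps)) ** 2
    return direct, closed

rng = random.Random(1)
deltas = [rng.gauss(0.0, 0.05) for _ in range(64)]  # synthetic, roughly zero-mean differences
direct, closed = advantage_energy(deltas, eps=1e-8)
print(direct / len(deltas))  # near 1 when s >> eps and the batch mean is small relative to s
```

The identity follows from $\sum_i\delta_i^2=B(s^2+\bar{\delta}^2)$, which holds because $s$ uses the $1/B$ normalization.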
### H.8 From a Single Layer to the Full Network (Block-Wise Aggregation)
##### Block-Wise Parameterization.
Let the full parameter vector be a concatenation of $L$ blocks
$$\bm{\theta}=\big(\bm{\theta}^{(1)},\bm{\theta}^{(2)},\dots,\bm{\theta}^{(L)}\big)\in\mathbb{R}^{d_{\text{tot}}},\qquad d_{\text{tot}}=\sum_{\ell=1}^{L}d_{\ell},$$
where each block corresponds to a linear layer weight $\mathbf{W}^{(\ell)}\in\mathbb{R}^{D_{\text{out}}^{(\ell)}\times D_{\text{in}}^{(\ell)}}$ with $d_{\ell}=D_{\text{out}}^{(\ell)}D_{\text{in}}^{(\ell)}$. Define $F(\bm{\theta})=\mathbb{E}_{\xi}[\ell(\bm{\theta};\xi)]$ and the two-sided per-example loss differences $\delta_{t,i}=\ell(\bm{\theta}_{t}+\sigma\mathbf{z}_{t,i};\xi_{t,i})-\ell(\bm{\theta}_{t}-\sigma\mathbf{z}_{t,i};\xi_{t,i})$.
##### Per\-Block Flipout Perturbations\.
At each optimizer step $t$, for each block $\ell$ we sample an independent base noise $\mathbf{U}_{t}^{(\ell)}$ and independent sign vectors $(\mathbf{r}_{t,i}^{(\ell)},\mathbf{s}_{t,i}^{(\ell)})$ for each flattened example $i$. Let
$$\mathbf{z}_{t,i}^{(\ell)}:=\mathrm{vec}\!\left(\mathbf{U}_{t}^{(\ell)}\odot\big(\mathbf{r}_{t,i}^{(\ell)}(\mathbf{s}_{t,i}^{(\ell)})^{\top}\big)\right)\in\mathbb{R}^{d_{\ell}},\qquad\mathbf{z}_{t,i}:=\big(\mathbf{z}_{t,i}^{(1)},\dots,\mathbf{z}_{t,i}^{(L)}\big)\in\mathbb{R}^{d_{\text{tot}}}.$$
By construction and Lemma [2](https://arxiv.org/html/2606.02857#Thmlemma2), each $\mathbf{z}_{t,i}^{(\ell)}$ is symmetric and isotropic, and the blocks are independent across $\ell$.
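The per-block construction above can be sketched as follows. This is an illustrative reading of the formula rather than the authors' code: it uses Rademacher base noise, a column-stacking `vec`, and hypothetical helper names and block shapes.

```python
import random

def flipout_direction(U, r, s):
    """z = vec(U ⊙ (r s^T)) with column-stacking vec.

    U is a D_out x D_in base-noise matrix shared by the whole batch,
    r holds D_out signs and s holds D_in signs drawn per example.
    """
    d_out, d_in = len(U), len(U[0])
    return [U[i][j] * r[i] * s[j] for j in range(d_in) for i in range(d_out)]

def network_direction(base_noises, rng):
    """Concatenate per-block Flipout directions into one vector in R^{d_tot}."""
    z = []
    for U in base_noises:
        r = [rng.choice((-1.0, 1.0)) for _ in range(len(U))]
        s = [rng.choice((-1.0, 1.0)) for _ in range(len(U[0]))]
        z.extend(flipout_direction(U, r, s))
    return z

rng = random.Random(0)
shapes = [(3, 4), (2, 2)]  # hypothetical (D_out, D_in) per block
base_noises = [[[rng.choice((-1.0, 1.0)) for _ in range(d_in)] for _ in range(d_out)]
               for d_out, d_in in shapes]                      # one shared U per block
zs = [network_direction(base_noises, rng) for _ in range(5)]  # one direction per example
d_tot = sum(d_out * d_in for d_out, d_in in shapes)
print(len(zs[0]), d_tot)
```

Only the sign vectors differ across examples, so the memory cost per example is $D_{\text{out}}+D_{\text{in}}$ signs per block instead of a full $D_{\text{out}}\times D_{\text{in}}$ perturbation.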
##### Network\-Level GRZO Estimator\.
Compute batch statistics $\bar{\delta}_{t}=\frac{1}{B}\sum_{i}\delta_{t,i}$, $s_{t}=\sqrt{\frac{1}{B}\sum_{i}(\delta_{t,i}-\bar{\delta}_{t})^{2}}$, and $a_{t,i}=\delta_{t,i}/(s_{t}+\epsilon)$. Define the GRZO estimator on the full parameter vector:
$$\widehat{\mathbf{g}}_{t}:=\frac{1}{2\sigma B}\sum_{i=1}^{B}a_{t,i}\,\mathbf{z}_{t,i}\in\mathbb{R}^{d_{\text{tot}}},\qquad\widehat{\mathbf{g}}_{t}^{(\ell)}\text{ is the $\ell$-th block of }\widehat{\mathbf{g}}_{t}.\tag{77}$$
###### Proposition 1 (Canonical unbiasedness w.r.t. a smoothed objective (full network)).
Let $F_{\sigma}(\bm{\theta})=\mathbb{E}_{z}[F(\bm{\theta}+\sigma\mathbf{z})]$ be the smoothed objective induced by the joint perturbation $\mathbf{z}=(\mathbf{z}^{(1)},\dots,\mathbf{z}^{(L)})$ (product across blocks). Define the unnormalized network-level canonical estimator
$$\widehat{\mathbf{g}}_{\mathrm{can},t}\;:=\;\frac{1}{2\sigma B}\sum_{i=1}^{B}\delta_{t,i}\,\mathbf{z}_{t,i}.$$
Under the conditions of Appendix [H.3](https://arxiv.org/html/2606.02857#A8.SS3) (Gaussian, exact) or Appendix [H.4](https://arxiv.org/html/2606.02857#A8.SS4) (general symmetric isotropic, $O(\sigma^{2})$-accurate),
$$\mathbb{E}\big[\widehat{\mathbf{g}}_{\mathrm{can},t}\mid\bm{\theta}_{t}\big]=\nabla F_{\sigma}(\bm{\theta}_{t}),\qquad\text{and hence}\qquad\mathbb{E}\big[\widehat{\mathbf{g}}_{\mathrm{can},t}^{(\ell)}\mid\bm{\theta}_{t}\big]=\nabla_{\bm{\theta}^{(\ell)}}F_{\sigma}(\bm{\theta}_{t}).$$
The GRZO estimator factors as $\widehat{\mathbf{g}}_{t}=(s_{t}+\epsilon)^{-1}\,\widehat{\mathbf{g}}_{\mathrm{can},t}$ ([77](https://arxiv.org/html/2606.02857#A8.E77)); the self-normalization is analyzed as an adaptive step-size rescaling and is approximately direction-preserving under concentration of $s_{t}$ (cf. ([24](https://arxiv.org/html/2606.02857#A8.E24)) and Appendix [H.7](https://arxiv.org/html/2606.02857#A8.SS7)).
###### Proof.
The proof is identical to the single-block case after viewing $\mathbf{z}_{t,i}\in\mathbb{R}^{d_{\text{tot}}}$ as the perturbation direction: by Lemma [2](https://arxiv.org/html/2606.02857#Thmlemma2) applied block-wise and independence across blocks, $\mathbf{z}_{t,i}$ is symmetric with $\mathbb{E}[\mathbf{z}_{t,i}]=0$ and $\mathbb{E}[\mathbf{z}_{t,i}\mathbf{z}_{t,i}^{\top}]=\mathbf{I}_{d_{\text{tot}}}$. Therefore the two-sided finite-difference identity applied to $\widehat{\mathbf{g}}_{\mathrm{can},t}$ yields unbiasedness for $\nabla F_{\sigma}(\bm{\theta}_{t})$ (exact for Gaussian; $O(\sigma^{2})$-accurate via Taylor expansion otherwise). ∎
###### Proposition 2 (Variance decomposition across blocks).
Assume that the random seeds/noises used in different blocks are independent across $\ell$. Then the centered second moment decomposes as
$$\mathbb{E}\big\|\widehat{\mathbf{g}}_{t}-\mathbb{E}[\widehat{\mathbf{g}}_{t}\mid\bm{\theta}_{t}]\big\|^{2}=\sum_{\ell=1}^{L}\mathbb{E}\big\|\widehat{\mathbf{g}}_{t}^{(\ell)}-\mathbb{E}[\widehat{\mathbf{g}}_{t}^{(\ell)}\mid\bm{\theta}_{t}]\big\|^{2}.$$
Consequently, any per-block variance bound can be summed to yield a network-level bound.
###### Proof.
Because $\|\cdot\|^{2}$ on a concatenated vector is the sum of squared norms of its blocks,
$$\big\|\widehat{\mathbf{g}}_{t}-\mathbb{E}[\widehat{\mathbf{g}}_{t}\mid\bm{\theta}_{t}]\big\|^{2}=\sum_{\ell=1}^{L}\big\|\widehat{\mathbf{g}}_{t}^{(\ell)}-\mathbb{E}[\widehat{\mathbf{g}}_{t}^{(\ell)}\mid\bm{\theta}_{t}]\big\|^{2}.$$
Taking expectations gives the identity (no cross terms appear because the blocks live in disjoint coordinates). ∎
##### Plugging in explicit $(B,D_{\text{in}},D_{\text{out}})$ dependence.
Applying the single-block bound ([54](https://arxiv.org/html/2606.02857#A8.E54)) (Appendix [H.6](https://arxiv.org/html/2606.02857#A8.SS6)) to each block $\ell$ with $d_{\ell}=D_{\text{out}}^{(\ell)}D_{\text{in}}^{(\ell)}$ yields
$$
\begin{aligned}
\mathbb{E}\big\|\widehat{\mathbf{g}}_{t}-\mathbb{E}[\widehat{\mathbf{g}}_{t}\mid\bm{\theta}_{t}]\big\|^{2}
&\leq\sum_{\ell=1}^{L}\left[\frac{d_{\ell}-1}{B}\Big(\|\nabla_{\bm{\theta}^{(\ell)}}F(\bm{\theta}_{t})\|^{2}+\nu^{2}\Big)+O\!\left(\rho_{\ell}^{2}\sigma^{4}d_{\ell}^{4}\right)\right]\\
&=\frac{1}{B}\sum_{\ell=1}^{L}(d_{\ell}-1)\Big(\|\nabla_{\bm{\theta}^{(\ell)}}F(\bm{\theta}_{t})\|^{2}+\nu^{2}\Big)+O\!\left(\sigma^{4}\sum_{\ell=1}^{L}\rho_{\ell}^{2}d_{\ell}^{4}\right).
\end{aligned}
\tag{78}
$$
Here $d_{\ell}=D_{\text{out}}^{(\ell)}D_{\text{in}}^{(\ell)}$, making the dependence on $(D_{\text{in}},D_{\text{out}})$ explicit block-wise.
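As a small worked instance of (78), the sketch below sums the leading (data-noise) term over blocks; the $O(\sigma^4)$ remainder is omitted because its constant is not given. The block shapes, gradient norms, and the function name are hypothetical inputs chosen for illustration.

```python
def network_variance_leading_term(block_shapes, block_grad_sq_norms, nu2, B):
    """Leading term of (78): (1/B) * sum_l (d_l - 1) * (||grad_l F||^2 + nu^2).

    block_shapes lists (D_out, D_in) per block, so d_l = D_out * D_in.
    The O(sigma^4 * sum_l rho_l^2 d_l^4) remainder is not included.
    """
    total = 0.0
    for (d_out, d_in), grad_sq in zip(block_shapes, block_grad_sq_norms):
        d_l = d_out * d_in
        total += (d_l - 1) * (grad_sq + nu2)
    return total / B

shapes = [(2, 3), (4, 1)]   # hypothetical blocks with d_1 = 6 and d_2 = 4
grad_sq = [1.0, 2.0]        # hypothetical per-block squared gradient norms
small_batch = network_variance_leading_term(shapes, grad_sq, nu2=0.5, B=5)
large_batch = network_variance_leading_term(shapes, grad_sq, nu2=0.5, B=10)
print(small_batch, large_batch)  # doubling B halves the leading term
```

Each block contributes in proportion to its own dimension $d_\ell-1$, so wide layers dominate the network-level bound.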
##### GRZO Normalization\.
Because $a_{t,i}$ has approximate unit empirical variance by construction ($\frac{1}{B}\sum_{i}a_{t,i}^{2}\approx 1$), the variance bound above applies directly to the GRZO estimator ([16](https://arxiv.org/html/2606.02857#A8.E16)), with the positive scale factor $c_{t}=1/(s_{\star}(\bm{\theta}_{t})+\epsilon)$ from ([24](https://arxiv.org/html/2606.02857#A8.E24)) absorbed into the effective learning rate.
###### Theorem 4\(Network\-level nonconvex convergence of GRZO\)\.
Let the full parameter vector be𝛉=\(𝛉\(1\),…,𝛉\(L\)\)∈ℝdtot\\bm\{\\theta\}=\(\\bm\{\\theta\}^\{\(1\)\},\\dots,\\bm\{\\theta\}^\{\(L\)\}\)\\in\\mathbb\{R\}^\{d\_\{\\mathrm\{tot\}\}\}with block dimensionsdℓ=Dout\(ℓ\)Din\(ℓ\)d\_\{\\ell\}=D\_\{\\mathrm\{out\}\}^\{\(\\ell\)\}D\_\{\\mathrm\{in\}\}^\{\(\\ell\)\}anddtot=∑ℓ=1Ldℓd\_\{\\mathrm\{tot\}\}=\\sum\_\{\\ell=1\}^\{L\}d\_\{\\ell\}\. LetF\(𝛉\)=𝔼ξ\[ℓ\(𝛉;ξ\)\]F\(\\bm\{\\theta\}\)=\\mathbb\{E\}\_\{\\xi\}\[\\ell\(\\bm\{\\theta\};\\xi\)\]be the population objective andFσF\_\{\\sigma\}be the smoothed objective induced by the joint Flipout perturbation distribution\.
Consider the GRZO update with group\-relative normalization: letδ¯t=1B∑iδt,i\\bar\{\\delta\}\_\{t\}=\\frac\{1\}\{B\}\\sum\_\{i\}\\delta\_\{t,i\},st=1B∑i\(δt,i−δ¯t\)2s\_\{t\}=\\sqrt\{\\frac\{1\}\{B\}\\sum\_\{i\}\(\\delta\_\{t,i\}\-\\bar\{\\delta\}\_\{t\}\)^\{2\}\},at,i=δt,i/\(st\+ϵ\)a\_\{t,i\}=\\delta\_\{t,i\}/\(s\_\{t\}\+\\epsilon\), and
𝜽t\+1=𝜽t−η𝐠^t,𝐠^t=12σB∑i=1Bat,i𝐳t,i,\\bm\{\\theta\}\_\{t\+1\}=\\bm\{\\theta\}\_\{t\}\-\\eta\\,\\widehat\{\\mathbf\{g\}\}\_\{t\},\\qquad\\widehat\{\\mathbf\{g\}\}\_\{t\}=\\frac\{1\}\{2\\sigma B\}\\sum\_\{i=1\}^\{B\}a\_\{t,i\}\\,\\mathbf\{z\}\_\{t,i\},whereBBis the batch size andδt,i=ℓ\(𝛉t\+σ𝐳t,i;ξt,i\)−ℓ\(𝛉t−σ𝐳t,i;ξt,i\)\\delta\_\{t,i\}=\\ell\(\\bm\{\\theta\}\_\{t\}\+\\sigma\\mathbf\{z\}\_\{t,i\};\\xi\_\{t,i\}\)\-\\ell\(\\bm\{\\theta\}\_\{t\}\-\\sigma\\mathbf\{z\}\_\{t,i\};\\xi\_\{t,i\}\)\. Assume:
1. (A1) $F_{\sigma}$ is lower bounded: $F_{\sigma}(\bm{\theta})\geq F_{\sigma}^{\star}$ for all $\bm{\theta}$.
2. (A2) $F_{\sigma}$ is $\mathcal{L}$-smooth.
3. (A3) Data noise: $\mathbb{E}\|\nabla\ell(\bm{\theta};\xi)-\nabla F(\bm{\theta})\|^{2}\leq\nu^{2}$.
4. (A4) Directional unbiasedness: with a rescaled step size $\eta$ absorbing the positive scale $c_{t}=1/(s_{\star}(\bm{\theta}_{t})+\epsilon)$ (cf. ([24](https://arxiv.org/html/2606.02857#A8.E24))), $\mathbb{E}[\widehat{\mathbf{g}}_{t}\mid\bm{\theta}_{t}]=\nabla F_{\sigma}(\bm{\theta}_{t})+O(\sigma^{2})$.
5. (A5) Block-wise Flipout construction and independence across blocks, so that the network-level variance admits the explicit bound (cf. ([78](https://arxiv.org/html/2606.02857#A8.E78))):
   $$\mathbb{E}\big\|\widehat{\mathbf{g}}_{t}-\mathbb{E}[\widehat{\mathbf{g}}_{t}\mid\bm{\theta}_{t}]\big\|^{2}\leq\frac{1}{B}\sum_{\ell=1}^{L}(d_{\ell}-1)\Big(\|\nabla_{\bm{\theta}^{(\ell)}}F(\bm{\theta}_{t})\|^{2}+\nu^{2}\Big)+C_{\mathrm{ZO}},$$
   where $C_{\mathrm{ZO}}=O\!\left(\sigma^{4}\sum_{\ell=1}^{L}\rho_{\ell}^{2}d_{\ell}^{4}\right)$ collects the finite-difference/Taylor remainder terms (with $\rho_{\ell}$ bounding the third derivative).
Assume additionally a smoothing-bias bound per block: $\|\nabla_{\bm{\theta}^{(\ell)}}F_{\sigma}(\bm{\theta})-\nabla_{\bm{\theta}^{(\ell)}}F(\bm{\theta})\|\leq c_{\mathrm{bias},\ell}\,\sigma^{2}d_{\ell}$ for all $\bm{\theta}$. Define
$$\alpha_{\mathrm{net}}:=\frac{2}{B}\sum_{\ell=1}^{L}(d_{\ell}-1),\qquad\beta_{\mathrm{net}}:=\frac{1}{B}\sum_{\ell=1}^{L}(d_{\ell}-1)\Big(\nu^{2}+2c_{\mathrm{bias},\ell}^{2}\sigma^{4}d_{\ell}^{2}\Big)+C_{\mathrm{ZO}}.$$
Choose a constant step size
$$\eta\;\leq\;\min\left\{\frac{1}{2\mathcal{L}},\ \frac{1}{4\mathcal{L}\alpha_{\mathrm{net}}}\right\}=\min\left\{\frac{1}{2\mathcal{L}},\ \frac{B}{8\mathcal{L}\sum_{\ell=1}^{L}(d_{\ell}-1)}\right\}.$$
Then for any $T\geq 1$,
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla F_{\sigma}(\bm{\theta}_{t})\big\|^{2}\;\leq\;\frac{8\big(F_{\sigma}(\bm{\theta}_{0})-F_{\sigma}^{\star}\big)}{\eta T}\;+\;4\mathcal{L}\eta\,\beta_{\mathrm{net}}.$$
Equivalently, if $R\sim\mathrm{Unif}\{0,\dots,T-1\}$, then
$$\mathbb{E}\big\|\nabla F_{\sigma}(\bm{\theta}_{R})\big\|^{2}\;\leq\;\frac{8\big(F_{\sigma}(\bm{\theta}_{0})-F_{\sigma}^{\star}\big)}{\eta T}\;+\;4\mathcal{L}\eta\,\beta_{\mathrm{net}}.$$
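The quantities in Theorem 4 are simple to evaluate once the constants are fixed. The sketch below computes $\alpha_{\mathrm{net}}$, $\beta_{\mathrm{net}}$, the largest admissible step size, and the right-hand side of the bound. The function names are hypothetical, and every number in the usage lines (block sizes, batch size, smoothness constant, noise level, bias constants, $C_{\mathrm{ZO}}$, initial gap) is a made-up placeholder rather than a value from the paper.

```python
def alpha_net(block_dims, batch_size):
    """alpha_net = (2 / B) * sum_l (d_l - 1)."""
    return 2.0 / batch_size * sum(d - 1 for d in block_dims)


def beta_net(block_dims, batch_size, nu2, c_bias, sigma, c_zo):
    """beta_net = (1/B) * sum_l (d_l - 1) * (nu^2 + 2 c_l^2 sigma^4 d_l^2) + C_ZO."""
    total = sum(
        (d - 1) * (nu2 + 2.0 * c ** 2 * sigma ** 4 * d ** 2)
        for d, c in zip(block_dims, c_bias)
    )
    return total / batch_size + c_zo


def max_step_size(block_dims, batch_size, smoothness):
    """Largest eta allowed by the theorem: min{1/(2L), 1/(4 L alpha_net)}."""
    a_net = alpha_net(block_dims, batch_size)
    return min(1.0 / (2.0 * smoothness), 1.0 / (4.0 * smoothness * a_net))


def bound_rhs(eta, num_steps, init_gap, smoothness, beta):
    """8 (F_sigma(theta_0) - F_sigma^*) / (eta T) + 4 L eta beta_net."""
    return 8.0 * init_gap / (eta * num_steps) + 4.0 * smoothness * eta * beta


# Placeholder constants (assumptions for illustration only).
block_dims = [5, 9]   # d_l for two tiny blocks, so sum_l (d_l - 1) = 12
batch_size = 4
smoothness = 1.0      # the constant called L-smoothness in (A2)

eta = max_step_size(block_dims, batch_size, smoothness)
beta = beta_net(block_dims, batch_size, nu2=1.0, c_bias=[0.0, 0.0],
                sigma=0.1, c_zo=0.0)
print("eta_max:", eta)
print("bound  :", bound_rhs(eta, num_steps=1000, init_gap=1.0,
                            smoothness=smoothness, beta=beta))
```

With these placeholders the second branch of the minimum is active, and both expressions for it agree: $1/(4\mathcal{L}\alpha_{\mathrm{net}}) = B/(8\mathcal{L}\cdot 12) = 1/24$. Doubling $B$ doubles this cap and halves the $\nu^{2}$ part of $\beta_{\mathrm{net}}$, which is how the batch size enters the bound.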
##### GRZO Normalization and Effective Step Size\.
The group-relative weights $a_{t,i}=\delta_{t,i}/(s_{t}+\epsilon)$ have approximate unit empirical variance ($\frac{1}{B}\sum_{i}a_{t,i}^{2}\approx 1$), so the convergence bound holds with the same $\eta$ after absorbing the positive scale $c_{t}=1/(s_{\star}(\bm{\theta}_{t})+\epsilon)$ from (A4) into the learning rate.
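To make the normalization concrete, here is a minimal NumPy sketch of one GRZO-style estimate following the definitions of $\delta_{t,i}$, $s_{t}$, $a_{t,i}$, and $\widehat{\mathbf{g}}_{t}$ in Theorem 4. Several pieces are assumptions and not the paper's implementation: the directions are i.i.d. standard Gaussian vectors standing in for the Flipout perturbations, the per-example loop does not reproduce the batched forward pass, and `grzo_estimate`, `toy_loss`, and all numeric values are hypothetical.

```python
import numpy as np


def grzo_estimate(loss_fn, theta, batch, sigma, eps, rng):
    """Return (g_hat, a) for one GRZO-style step.

    Assumption: directions z_{t,i} are i.i.d. standard Gaussian vectors,
    used here in place of the paper's Flipout perturbations.
    """
    B = len(batch)
    z = rng.standard_normal((B, theta.size))  # one direction per example
    # Two-point loss differences delta_{t,i}, each on its own example.
    delta = np.array([
        loss_fn(theta + sigma * z[i], batch[i])
        - loss_fn(theta - sigma * z[i], batch[i])
        for i in range(B)
    ])
    s = np.sqrt(np.mean((delta - delta.mean()) ** 2))  # group std s_t
    a = delta / (s + eps)                               # weights a_{t,i}
    g_hat = (a[:, None] * z).sum(axis=0) / (2.0 * sigma * B)
    return g_hat, a


def toy_loss(theta, xi):
    """Hypothetical per-example loss: 0.5 * ||theta - xi||^2."""
    return 0.5 * np.sum((theta - xi) ** 2)


rng = np.random.default_rng(0)
theta = np.full(8, 3.0)
batch = rng.standard_normal((256, 8))   # 256 toy "examples"
g_hat, a = grzo_estimate(toy_loss, theta, batch, sigma=1e-3, eps=1e-8, rng=rng)
theta_next = theta - 1e-2 * g_hat       # step with an arbitrary eta
print("empirical variance of a:", np.var(a))
```

In this sketch `np.var(a)` equals $s_{t}^{2}/(s_{t}+\epsilon)^{2}$, so it sits just below 1 whenever $\epsilon\ll s_{t}$. The estimate `g_hat` is therefore the unnormalized two-point estimate divided by $s_{t}+\epsilon$, which is the positive rescaling that the text absorbs into the learning rate.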
##### Remark \(Scope of Theorem\)\.
The theorem covers the GRZO estimator with group-relative normalization $\widehat{\mathbf{g}}_{t}=\frac{1}{2\sigma B}\sum_{i}a_{t,i}\,\mathbf{z}_{t,i}$, $a_{t,i}=\delta_{t,i}/(s_{t}+\epsilon)$, which is the estimator used throughout the paper. The positive scale $c_{t}=1/(s_{\star}(\bm{\theta}_{t})+\epsilon)$ from (A4) is absorbed into the effective learning rate; the $O(\sigma^{2})$ smoothing remainder in (A4) is the only residual bias.
##### Remark \(Comparison with MeZO\)\.
MeZO uses a single shared perturbation direction per update, giving a gradient estimator second moment $O(d\|\nabla F\|^{2}+d\nu^{2})$, where $d$ is the full parameter dimension. GRZO’s estimator achieves $\frac{1}{B}\sum_{\ell=1}^{L}(d_{\ell}-1)\bigl(\|\nabla_{\bm{\theta}^{(\ell)}}F\|^{2}+\nu^{2}\bigr)$, a $B$-fold reduction.
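The $1/B$ factor can be checked numerically in a stripped-down setting. The sketch below is a Monte Carlo comparison on a noiseless linear objective, where the two-point difference is exact and the estimator reduces to $(\mathbf{z}^{\top}\nabla F)\,\mathbf{z}$. It assumes Rademacher ($\pm 1$) directions, chosen because they give variance $(d-1)\|\nabla F\|^{2}$ for a single direction; it leaves out data noise, group-relative normalization, block structure, and the Flipout construction, so it illustrates only the effect of averaging $B$ independent directions. The function name and all constants are hypothetical.

```python
import numpy as np


def zo_estimator_variance(grad, num_directions, num_trials, rng):
    """Monte Carlo estimate of E||g_hat - grad||^2 on a linear objective.

    g_hat averages (z_i . grad) z_i over `num_directions` independent
    Rademacher directions; num_directions=1 mimics a single shared direction.
    """
    d = grad.size
    z = rng.choice([-1.0, 1.0], size=(num_trials, num_directions, d))
    proj = z @ grad                              # (trials, B): z_i . grad
    g_hat = (proj[..., None] * z).mean(axis=1)   # (trials, d)
    return float(np.mean(np.sum((g_hat - grad) ** 2, axis=1)))


rng = np.random.default_rng(0)
d, B = 10, 8
grad = np.ones(d) / np.sqrt(d)                   # unit-norm toy gradient

var_single = zo_estimator_variance(grad, 1, 20000, rng)
var_group = zo_estimator_variance(grad, B, 20000, rng)
print("single direction:", var_single)
print("B directions    :", var_group)
print("ratio           :", var_single / var_group)
```

Under these assumptions the single-direction variance should come out close to $d-1=9$ and the $B$-direction variance close to $(d-1)/B$, so the printed ratio should be near $B=8$ up to Monte Carlo error.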
## Appendix I LLM Usage
Large language models were used only as writing assistants for minor grammar and phrasing polish on author-drafted text. They played no role in research conception, experimental design, or interpretation of results. All technical content and claims are the authors’ own and were independently verified.