# Gradient Extrapolation-Based Policy Optimization
Source: [https://arxiv.org/html/2605.06755](https://arxiv.org/html/2605.06755)
Ismam Nur Swapnil¹, Aranya Saha², Tanvir Ahmed Khan³, Mohammad Ariful Haque¹, Ser-Nam Lim⁴
¹Bangladesh University of Engineering and Technology, ²University of Maryland, College Park, ³Illinois Institute of Technology, ⁴University of Central Florida
###### Abstract
Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step lookahead can give a better update direction but is too expensive because it needs many backward passes. We propose Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible policy-update rule for GRPO-style reasoning RL. GXPO approximates a longer local lookahead using only three backward passes during an active phase. It reuses the same batch of rollouts, rewards, advantages, and GRPO loss, so it does not require new rollouts or reward computation at the lookahead points. GXPO takes two fast optimizer steps, measures how the gradients change, predicts a virtual $K$-step lookahead point, moves the policy partway toward that point, and then applies a corrective update using the true gradient at the new position. When the lookahead signal becomes unstable, GXPO automatically switches back to standard single-pass GRPO. We also give a plain-gradient-descent surrogate analysis that explains when the extrapolation is exact and where its local errors come from. Across Qwen2.5 and Llama math-reasoning experiments, GXPO improves the average sampled pass@1 by **+1.65** to **+5.00** points over GRPO and by **+0.14** to **+1.28** points over the strongest SFPO setting, while keeping the active-phase cost fixed at three backward passes. It also achieves up to **4.00×** step speedup, **2.33×** wall-clock speedup, and **1.33×** backward-pass speedup in reaching GRPO's peak accuracy.
## 1 Introduction
Policy-gradient reinforcement learning underlies modern language-model alignment and reasoning (Sutton et al., [1999](https://arxiv.org/html/2605.06755#bib.bib32); Williams, [1992](https://arxiv.org/html/2605.06755#bib.bib36); Schulman et al., [2017](https://arxiv.org/html/2605.06755#bib.bib27)). In reinforcement learning with verifiable rewards (RLVR), models generate candidate solutions, receive verifiable rewards, and update the policy with first-order gradients (Shao et al., [2024](https://arxiv.org/html/2605.06755#bib.bib28); Yu et al., [2025](https://arxiv.org/html/2605.06755#bib.bib41)). This setting is central to MATH and GSM8K-style reasoning benchmarks, where correctness can be checked automatically after long-form generation (Hendrycks et al., [2021](https://arxiv.org/html/2605.06755#bib.bib12); Cobbe et al., [2021](https://arxiv.org/html/2605.06755#bib.bib3)). At LLM scale, each added backward pass multiplies training time and memory, putting update quality in tension with per-step cost during repeated post-training iterations.
Figure 1: Overall GXPO training framework. Each active step performs three backward passes: two probe gradients during gradient extrapolation and one corrective gradient $g_{\text{slow}}$ at the repositioned point $\tilde{\theta}^{t}$. The slow correction is always applied at the current step. After the update, the $z$-score gate checks $\|g_{\text{slow}}\|$; if $Z>\tau$, all subsequent steps permanently fall back to single-pass GRPO.

Single-step methods such as PPO and GRPO avoid this cost but use only the current-policy gradient (Schulman et al., [2017](https://arxiv.org/html/2605.06755#bib.bib27); Shao et al., [2024](https://arxiv.org/html/2605.06755#bib.bib28)). Explicit lookahead recovers trajectory information by paying for it: $h$-step policy mirror descent improves regularized policy iteration (Protopapas and Barakat, [2024](https://arxiv.org/html/2605.06755#bib.bib24)), tree-expansion policies reduce policy-gradient variance (Dalal et al., [2025](https://arxiv.org/html/2605.06755#bib.bib6)), and online or adaptive planning uses learned-model rollouts for action selection (Sikchi et al., [2021](https://arxiv.org/html/2605.06755#bib.bib31); Rosenberg et al., [2023](https://arxiv.org/html/2605.06755#bib.bib26)). These methods rely on planning, tree expansion, or auxiliary value/model estimates rather than the rollout batch already in hand, making them difficult to drop into standard GRPO-style reasoning pipelines without changing the data path. Recent LLM reasoning-RL work improves stability and efficiency through objectives, values, entropy control, filtering, rewards, recipes, or off-policy reuse (Liu et al., [2025](https://arxiv.org/html/2605.06755#bib.bib17); Dai et al., [2025](https://arxiv.org/html/2605.06755#bib.bib5); Yue et al., [2025](https://arxiv.org/html/2605.06755#bib.bib42); Xiong et al., [2025](https://arxiv.org/html/2605.06755#bib.bib38); Cui et al., [2025](https://arxiv.org/html/2605.06755#bib.bib4); Wen et al., [2025](https://arxiv.org/html/2605.06755#bib.bib35); Fatemi et al., [2025](https://arxiv.org/html/2605.06755#bib.bib9); Shen et al., [2025](https://arxiv.org/html/2605.06755#bib.bib29); Mroueh, [2025](https://arxiv.org/html/2605.06755#bib.bib21); Mroueh et al., [2025](https://arxiv.org/html/2605.06755#bib.bib22)), and through sample selection, down-sampling, selective rollouts, few-example training, test-time scaling, or RLHF/generation-system acceleration (Chen et al., [2024](https://arxiv.org/html/2605.06755#bib.bib2); Xia et al., [2024](https://arxiv.org/html/2605.06755#bib.bib37); Ye et al., [2025](https://arxiv.org/html/2605.06755#bib.bib40); Li et al., [2025](https://arxiv.org/html/2605.06755#bib.bib16); Xu et al., [2026](https://arxiv.org/html/2605.06755#bib.bib39); Wang et al., [2025](https://arxiv.org/html/2605.06755#bib.bib34); Zheng et al., [2025](https://arxiv.org/html/2605.06755#bib.bib45); Muennighoff et al., [2025](https://arxiv.org/html/2605.06755#bib.bib23); Sheng et al., [2025](https://arxiv.org/html/2605.06755#bib.bib30); Kwon et al., [2023](https://arxiv.org/html/2605.06755#bib.bib14); He et al., [2025](https://arxiv.org/html/2605.06755#bib.bib11)). GXPO is orthogonal: it keeps the rollout batch, rewards, advantages, and GRPO loss unchanged, and only changes the policy update. This makes the update rule a narrow intervention, not a new reasoning-RL training recipe or reward pipeline component.
Optimizer-side lookahead is closer to our setting. The Lookahead optimizer interleaves fast inner steps with a slow update (Zhang et al., [2019](https://arxiv.org/html/2605.06755#bib.bib43); Zhou et al., [2021](https://arxiv.org/html/2605.06755#bib.bib44)), and SFPO ports this idea to policy optimization with $K$ fast inner steps before a slow correction (Wang et al., [2026](https://arxiv.org/html/2605.06755#bib.bib33)), paying $K{+}1$ backward passes per update. The question we address is whether a policy update can gain similar gradient-trajectory information to explicit lookahead methods without increasing the number of backward passes.
We introduce Gradient Extrapolation-Based Policy Optimization (GXPO), a GRPO-compatible update that approximates $K$ local policy-gradient steps with three backward passes, independent of the virtual depth $K$ (see Figure [1](https://arxiv.org/html/2605.06755#S1.F1)). GXPO reuses the rollout batch, rewards, advantages, and regularization; estimates a per-coordinate retention ratio $r_{i}=g_{1,i}/g_{0,i}$ from two probe gradients; moves toward the geometric displacement; and applies a corrective gradient at the repositioned policy, so the final step remains anchored to the true objective rather than the extrapolated prediction. A rolling z-score gate falls back to a single-pass update when the corrective-gradient norm becomes unstable during training. Our contributions are:
- A GRPO-compatible update with three active-phase backward passes for any virtual lookahead depth $K$, using two probe gradients, geometric extrapolation, and one corrective gradient;
- A fixed-batch implementation that reuses rollouts, rewards, advantages, loss, and regularization, with a gate that reverts to the base single-pass update when the local trajectory signal becomes unreliable;
- A local quadratic surrogate analysis, plus benchmark, budget, ablation, and diagnostic evidence across two model families.
## 2 Method: Gradient Extrapolation-Based Policy Optimization
GXPO replaces a single GRPO update with a three-step update, while reusing the same rollouts, rewards, advantages, and objective, so no additional data or reward computation is required. During training, it first takes two quick optimization steps using the base actor optimizer (AdamW in our experiments; Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.06755#bib.bib18)) and observes how the parameters change. It then uses this change to estimate the direction of the update, moves partway in that direction, and applies one final corrective step. The theory studies a simplified version to clarify this behavior, while the experiments evaluate the method under the chosen optimizer.
##### Setup and notation.
Let $\theta\in\mathbb{R}^{d}$ parameterise the policy and let $\mathcal{L}(\theta)$ be the GRPO loss. Write $g(\theta)=\nabla_{\theta}\mathcal{L}(\theta)$, $H(\theta)=\nabla_{\theta}^{2}\mathcal{L}(\theta)$, and $g_{n}=g(\theta_{n})$. Let $\eta>0$ be the learning rate and $K$ the virtual depth.
### 2.1 From Taylor Expansion to Geometric Scaling
GXPO needs a cheap way to estimate how the gradient would change over several nearby steps. The local intuition is simple: if the loss curvature is approximately stable near $\theta_{0}$, then the gradient evolution can be approximated by a predictable local recurrence. Under a coordinate-wise surrogate, this means that each gradient coordinate retains a measurable fraction of its previous value. A Taylor expansion makes this intuition precise. Around $\theta_{0}$,

$$g(\theta_{0}+\Delta)=g(\theta_{0})+H(\theta_{0})\,\Delta+R_{2}(\Delta),\qquad\|R_{2}(\Delta)\|\leq\tfrac{M_{3}}{2}\|\Delta\|^{2},\tag{1}$$

where $M_{3}=\sup_{\xi}\|\nabla^{3}\mathcal{L}(\xi)\|$. Dropping the remainder gives the local quadratic model $g(\theta_{0}+\Delta)\approx g_{0}+H_{0}\Delta$.
###### Assumption 1 (Local quadratic model).
The Hessian is fixed at $H_{0}$ within the local extrapolation region.
This assumption is used only for extrapolation: small learning rates keep probes local, and the final update still uses the true gradient at the repositioned point.
In the plain-GD surrogate, for one gradient step $\theta_{1}=\theta_{0}-\eta g_{0}$, the model gives

$$g_{1}=g_{0}-\eta\,H_{0}\,g_{0}=(I-\eta\,H_{0})\,g_{0}.\tag{2}$$

Repeating this recurrence yields:
###### Theorem 1 (Gradient evolution under the local quadratic model).
Under Assumption [1](https://arxiv.org/html/2605.06755#Thmassumption1), the gradient at the $n$-th gradient-descent iterate $\theta_{n}=\theta_{n-1}-\eta\,g_{n-1}$ satisfies

$$g_{n}=(I-\eta\,H_{0})^{n}\,g_{0}.\tag{3}$$

See Appendix [A.1](https://arxiv.org/html/2605.06755#A1.SS1) for the proof.
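Since the theorem is exact when the loss is truly quadratic, it can be checked numerically. The NumPy sketch below runs plain GD on a random synthetic quadratic and compares the observed gradient with the closed form; all constants are illustrative, not from the paper's code.

```python
import numpy as np

# Numeric check of Theorem 1: for L(theta) = 0.5 * theta^T H0 theta,
# the gradient is g(theta) = H0 theta and the recurrence is exact.
rng = np.random.default_rng(0)
d, eta, n_steps = 5, 0.1, 7
A = rng.standard_normal((d, d))
H0 = A @ A.T + np.eye(d)            # symmetric positive-definite Hessian
theta = rng.standard_normal(d)
g0 = H0 @ theta

# Run plain gradient descent for n_steps iterations.
for _ in range(n_steps):
    theta = theta - eta * (H0 @ theta)
g_n = H0 @ theta

# Theorem 1 predicts g_n = (I - eta*H0)^n g0.
pred = np.linalg.matrix_power(np.eye(d) - eta * H0, n_steps) @ g0
print(np.max(np.abs(g_n - pred)))   # ~1e-15, i.e. exact up to float error
```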
### 2.2 The Per-Parameter Retention Ratio
The full local recurrence involves the Hessian, which is impossible to form or multiply at LLM scale. GXPO therefore measures gradient evolution directly from two nearby gradients. For each sufficiently active coordinate, the ratio $g_{1,i}/g_{0,i}$ estimates how much of that coordinate's gradient is retained after one local step. This ratio is not treated as an exact Hessian quantity in the implemented optimizer update; it is an empirical signal measured along the realized fast optimizer trajectory. Formally, for active coordinates, GXPO defines

$$r_{i}\equiv\frac{g_{1,i}}{g_{0,i}}.\tag{4}$$

In the plain-GD surrogate, where $\theta_{1}=\theta_{0}-\eta g_{0}$, the local quadratic model gives

$$r_{i}=1-\eta\frac{[H_{0}g_{0}]_{i}}{g_{0,i}}.$$

In finite precision, Algorithm [1](https://arxiv.org/html/2605.06755#alg1) evaluates this ratio only on the active set $\mathcal{A}_{t}=\{i:|g_{i}^{t,0}|>\delta\}$. For $i\notin\mathcal{A}_{t}$, no ratio is formed and the observed two-probe displacement is kept. On active coordinates, $r_{i}$ measures local gradient retention: $r_{i}\approx 1$ is nearly flat, $0<r_{i}<1$ is contraction, and $r_{i}<0$ indicates overshoot.
The per-parameter gradient dynamics are approximated by a coordinate-wise geometric sequence:

$$g_{n,i}\approx r_{i}^{n}\,g_{0,i}.\tag{5}$$

In the plain-GD surrogate, this is exact for diagonal Hessians. For general Hessians, Appendix [A.3](https://arxiv.org/html/2605.06755#A1.SS3) and Appendix [A.5](https://arxiv.org/html/2605.06755#A1.SS5) bound the effects of off-diagonal coupling and the Taylor remainder.
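In code, the ratio and active-set logic of Eq. (4) amount to a few lines. The sketch below is a minimal NumPy rendering with an assumed flat-gradient interface, not the authors' implementation.

```python
import numpy as np

def retention_ratios(g0, g1, delta=1e-8):
    """Per-coordinate retention ratios r_i = g1_i / g0_i (Eq. 4).

    g0, g1: flat gradient vectors from the two probe backward passes.
    Coordinates outside the active set A_t keep a neutral ratio of 1,
    so they later bypass the ratio and keep the observed displacement.
    """
    active = np.abs(g0) > delta        # active set A_t = {i : |g0_i| > delta}
    r = np.ones_like(g0)
    r[active] = g1[active] / g0[active]
    return r, active
```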
### 2.3 Scaled $K$-Step Extrapolation and Repositioning
The two fast steps give a short observed displacement, but GXPO wants to approximate a longer $K$-step lookahead without taking all $K$ steps. If a coordinate's gradient follows the measured retention pattern, then the displacement over $K$ steps is a scaled version of the observed two-step displacement. GXPO uses this geometric scale to predict a longer lookahead point, but moves only partway toward it using $\alpha$ to reduce the effect of extrapolation error. In the GD surrogate, the $K$-step displacement is

$$\theta_{K}-\theta_{0}=-\eta\sum_{n=0}^{K-1}g_{n}.$$

GXPO uses this identity only to motivate the geometric scaling rule. In the implemented optimizer-state-aware version, GXPO instead observes the actual two-step optimizer displacement $\theta_{2}-\theta_{0}$ and scales this measured displacement. Therefore, the method does not require the implemented optimizer displacement to equal a sum of raw gradients.
Using the geometric model in ([5](https://arxiv.org/html/2605.06755#S2.E5)), coordinate $i$ moves approximately as

$$[\theta_{K}-\theta_{0}]_{i}\approx-\eta\,g_{0,i}\sum_{n=0}^{K-1}r_{i}^{n}=-\eta\,g_{0,i}\,\frac{1-r_{i}^{K}}{1-r_{i}}\qquad(r_{i}\neq 1).\tag{6}$$

GXPO already observes the two-step displacement $\theta_{2}-\theta_{0}$. It converts this into a $K$-step estimate by multiplying each coordinate by the ratio of the $K$-step geometric sum to the two-step geometric sum:

$$\mathrm{scale}_{i}=\frac{S_{i}(K)}{S_{i}(2)},\qquad S_{i}(n)=\frac{1-r_{i}^{n}}{1-r_{i}},\tag{7}$$

which simplifies to $(1-r_{i}^{K})/(1-r_{i}^{2})$ for $r_{i}\neq 1$. The predicted $K$-step point is then

$$\theta_{K}=\theta_{0}+(\theta_{2}-\theta_{0})\odot\mathrm{scale}.\tag{8}$$
GXPO then moves only partway toward this prediction:
$$\tilde{\theta}=\theta_{0}+\alpha(\theta_{K}-\theta_{0}).\tag{9}$$

Here $\alpha\in[0,1]$ controls the reposition strength: $\alpha=0$ ignores the prediction entirely, while $\alpha=1$ adopts it fully. The small denominator terms in ([7](https://arxiv.org/html/2605.06755#S2.E7)) serve as numerical stabilizers. The theory in §[2.4](https://arxiv.org/html/2605.06755#S2.SS4) and Appendix [A](https://arxiv.org/html/2605.06755#A1) analyzes the clean version of this rule.
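Equations (7)–(9) combine into a short coordinate-wise computation. The NumPy sketch below is illustrative; the `eps` stabilizer stands in for the paper's unspecified small denominator terms.

```python
import numpy as np

def reposition(theta0, theta2, r, active, K, alpha, eps=1e-12):
    """Geometric K-step scale of the observed 2-step displacement (Eqs. 7-9)."""
    scale = np.ones_like(theta0)       # inactive coords keep the raw displacement
    ri = r[active]
    scale[active] = (1.0 - ri**K) / (1.0 - ri**2 + eps)   # S(K)/S(2), Eq. (7)
    theta_K = theta0 + (theta2 - theta0) * scale          # predicted point, Eq. (8)
    return theta0 + alpha * (theta_K - theta0)            # partial move, Eq. (9)
```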
Since the extrapolated point is only a prediction, GXPO does not directly trust the extrapolated gradient. Instead, it evaluates the true loss gradient at the repositioned point, $g_{\mathrm{slow}}=\nabla_{\theta}\mathcal{L}(\tilde{\theta})$. This corrective gradient anchors the update back to the actual objective and reduces the risk of following an inaccurate geometric prediction. The final optimizer step uses $g_{\mathrm{slow}}$ rather than the extrapolated prediction. For stateful optimizers such as AdamW, the two fast steps update the optimizer state, GXPO manually repositions the parameters to $\tilde{\theta}$, and the third step is taken using $g_{\mathrm{slow}}$. The analysis below studies the corresponding gradient-descent surrogate.
##### Adaptive rule in practice.
GXPO uses geometric extrapolation only when the local training behavior appears stable. Since gradient norms naturally change during training, a fixed absolute threshold is unreliable. Instead, GXPO maintains a rolling buffer of recent corrective-gradient norms and uses it as a local baseline for the current training stage. If the corrective gradient at the repositioned point becomes unusually large relative to this recent history, GXPO treats the extrapolation as unreliable and disables it; training then continues with the same single-pass GRPO update as the base trainer.
Formally, before inserting the current norm into the buffer, let $\mu_{t}$ and $\sigma_{t}$ be the mean and standard deviation of the past $w$ corrective-gradient norms, and define

$$Z_{t}=\frac{\|g_{\mathrm{slow}}^{(t)}\|_{2}-\mu_{t}}{\sigma_{t}+\epsilon},\tag{10}$$

where $\epsilon>0$. During warm-up, GXPO only fills the buffer. If $Z_{t}\geq\tau$, it sets $s^{\star}=t+1$ and reverts to single-backward-pass GRPO for all subsequent steps. Thus, extrapolation is disabled whenever the corrective-gradient norm exhibits a sharp upward shift relative to its recent baseline. Over $T$ steps, the backward-pass budget becomes $3s^{\star}+(T-s^{\star})$ instead of $3T$.
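A minimal sketch of this gate is given below; the rolling-window size, the warm-up length, and the class interface are illustrative assumptions, not the authors' implementation.

```python
from collections import deque
import numpy as np

class ZScoreGate:
    """Rolling z-score gate of Eq. (10)."""
    def __init__(self, window=20, warmup=5, tau=0.5, eps=1e-8):
        self.buf = deque(maxlen=window)   # past corrective-gradient norms
        self.warmup, self.tau, self.eps = warmup, tau, eps
        self.disabled = False

    def update(self, g_slow_norm: float) -> bool:
        """Feed the current norm; returns True once extrapolation is disabled."""
        if not self.disabled and len(self.buf) >= self.warmup:
            mu, sigma = np.mean(self.buf), np.std(self.buf)
            z = (g_slow_norm - mu) / (sigma + self.eps)
            if z >= self.tau:
                self.disabled = True      # corresponds to s* = t + 1 in Algorithm 1
        self.buf.append(g_slow_norm)      # insert only after computing Z_t
        return self.disabled
```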
**Algorithm 1** GXPO: Gradient Extrapolation-Based Policy Optimization

1: **Parameters:** $\theta^{0}\in\mathbb{R}^{d}$; learning rate $\eta$; virtual steps $K$; blend $\alpha$; stability $\delta$; trigger threshold $\tau$
2: **Initialize:** rolling buffer $\mathcal{B}\leftarrow\emptyset$, $s^{\star}\leftarrow+\infty$
3: **for** each training step $t=0,1,2,\dots$ **do**
4:   $g^{t,0}\leftarrow\nabla_{\theta}\mathcal{L}(\theta^{t})\in\mathbb{R}^{d}$ ▷ Backward pass 1
5:   **if** $t<s^{\star}$ **then** ▷ Active phase: coordinate-wise geometric extrapolation
6:     $\theta^{t,1}\leftarrow\mathrm{OptimStep}(\theta^{t},g^{t,0})$ ▷ First fast optimizer step
7:     $g^{t,1}\leftarrow\nabla_{\theta}\mathcal{L}(\theta^{t,1})\in\mathbb{R}^{d}$ ▷ Backward pass 2
8:     $\theta^{t,2}\leftarrow\mathrm{OptimStep}(\theta^{t,1},g^{t,1})$ ▷ Second fast optimizer step
9:     $\mathcal{A}_{t}\leftarrow\{i\in[d]:|g_{i}^{t,0}|>\delta\}$ ▷ Active coordinates
10:    $r_{i}^{t}\leftarrow g_{i}^{t,1}/g_{i}^{t,0},\ \forall i\in\mathcal{A}_{t}$ ▷ Retention ratio
11:    $S_{i}^{t}(n)\leftarrow\dfrac{1-(r_{i}^{t})^{n}}{1-r_{i}^{t}},\ \forall i\in\mathcal{A}_{t}$ ▷ Geometric sum
12:    $\mathrm{scale}_{i}^{t}\leftarrow S_{i}^{t}(K)/S_{i}^{t}(2)$ if $i\in\mathcal{A}_{t}$, else $1$ ▷ Extrapolation factor
13:    $\theta^{t,K}\leftarrow\theta^{t}+(\theta^{t,2}-\theta^{t})\odot\mathrm{scale}^{t}$ ▷ Extrapolate
14:    $\widetilde{\theta}^{t}\leftarrow\theta^{t}+\alpha(\theta^{t,K}-\theta^{t})$ ▷ Reposition
15:    $g_{\mathrm{slow}}^{t}\leftarrow\nabla_{\theta}\mathcal{L}(\widetilde{\theta}^{t})\in\mathbb{R}^{d}$ ▷ Backward pass 3
16:    $\theta^{t+1}\leftarrow\mathrm{OptimStep}(\widetilde{\theta}^{t},g_{\mathrm{slow}}^{t})$ ▷ Slow correction
17:    **if** $\mathrm{len}(\mathcal{B})>1$ **then**
18:      compute rolling statistics $\mu_{t}$ and $\sigma_{t}$ from the current buffer $\mathcal{B}$
19:      $Z_{t}\leftarrow\bigl(\|g_{\mathrm{slow}}^{t}\|_{2}-\mu_{t}\bigr)/(\sigma_{t}+\epsilon)$ ▷ Adaptive rule in §[2.3](https://arxiv.org/html/2605.06755#S2.SS3.SSS0.Px1)
20:      **if** $Z_{t}\geq\tau$ **then**
21:        $s^{\star}\leftarrow t+1$ ▷ Permanently disable extrapolation
22:      **end if**
23:    **end if**
24:    update rolling buffer $\mathcal{B}$ with $\|g_{\mathrm{slow}}^{t}\|_{2}$
25:  **else**
26:    $\theta^{t+1}\leftarrow\mathrm{OptimStep}(\theta^{t},g^{t,0})$ ▷ Fallback phase: single-step GRPO
27:  **end if**
28: **end for**
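For concreteness, one active step (lines 4–16) can be sketched in PyTorch. This is a minimal single-device sketch under simplifying assumptions: a flat parameter vector, a `compute_loss` closure that re-evaluates the GRPO loss on the *same* rollout batch, and no distributed or mixed-precision plumbing. It is not the authors' implementation.

```python
import torch

def gxpo_active_step(params, optimizer, compute_loss, K=10, alpha=0.5, delta=1e-8):
    """One GXPO active step: two fast probes, extrapolate, reposition, correct."""
    def flat(tensors):
        return torch.cat([t.detach().reshape(-1) for t in tensors])

    def grad_of_loss():
        optimizer.zero_grad()
        compute_loss().backward()            # same rollouts, rewards, advantages
        return flat([p.grad for p in params])

    theta0 = flat(params)
    g0 = grad_of_loss()                      # backward pass 1
    optimizer.step()                         # first fast optimizer step
    g1 = grad_of_loss()                      # backward pass 2
    optimizer.step()                         # second fast optimizer step
    theta2 = flat(params)

    # Retention ratios on the active set and geometric scale (Eqs. 4, 7).
    active = g0.abs() > delta
    scale = torch.ones_like(theta0)
    r = g1[active] / g0[active]
    scale[active] = (1 - r**K) / (1 - r**2 + 1e-12)

    # Extrapolate, blend, and manually reposition the parameters (Eqs. 8-9).
    theta_tilde = theta0 + alpha * (theta2 - theta0) * scale
    offset = 0
    with torch.no_grad():
        for p in params:
            n = p.numel()
            p.copy_(theta_tilde[offset:offset + n].view_as(p))
            offset += n

    g_slow_norm = grad_of_loss().norm()      # backward pass 3: corrective gradient
    optimizer.step()                         # slow correction from the repositioned point
    return g_slow_norm.item()                # fed to the z-score gate
```

In a full loop, the returned norm would feed the `ZScoreGate` sketch above, and the fallback branch (line 26) would be taken once the gate fires.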
### 2.4 Surrogate Analysis
The analysis below explains the parameter-space extrapolation used by GXPO through a plain-gradient-descent surrogate. In this surrogate, we can show when the geometric scaling is exact, where its errors come from, and which diagnostics should be checked in practice. The implemented method still uses the chosen actor optimizer, AdamW in our experiments, so the surrogate should be read as a model of the extrapolation geometry rather than the full stateful optimizer dynamics of LLM training.
We first consider the clean case where the coordinate-wise geometric model is exact.
###### Corollary 2 (Diagonal-quadratic GD-surrogate sanity check).
Consider the global diagonal quadratic loss

$$\mathcal{L}(\theta)=\frac{1}{2}\theta^{\top}H_{0}\theta,\qquad H_{0}=\mathrm{diag}(h_{1},\dots,h_{d}),\qquad h_{i}>0,\qquad\eta h_{i}\leq 1.$$

Assume all nonzero-gradient coordinates are active, finite-precision stabilizers are omitted, and $\alpha=1$. Let

$$\mu:=\min_{i}h_{i}>0,\qquad\rho:=(1-\eta\mu)^{2}\in[0,1).$$

Then one clean GXPO outer step with three backward passes reaches the same point as $K+1$ plain-GD steps:

$$\theta_{\text{new}}^{\mathrm{GXPO}}=\theta_{K+1}^{\mathrm{GD}},$$

and consequently, after $B\in 3\mathbb{N}$ backward passes,

$$\mathcal{L}\!\left(\theta_{B/3}^{\mathrm{GXPO}}\right)\leq\rho^{(K+1)B/3}\,\mathcal{L}(\theta_{0}),$$

hence, if $0<\rho<1$,

$$B_{\mathrm{GXPO}}=O\!\left(\frac{3}{K+1}\log\frac{1}{\varepsilon}\right).$$

This is only an algebraic sanity check: in the easiest case, the extrapolated point lands exactly where multiple GD steps would land. Appendix [A.6](https://arxiv.org/html/2605.06755#A1.SS6) proves Corollary [2](https://arxiv.org/html/2605.06755#Thmtheorem2).
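The equivalence in the corollary is easy to verify numerically. The sketch below runs one clean GXPO outer step with $\alpha=1$ on a random diagonal quadratic and compares it with $K+1$ plain-GD steps; all constants are illustrative.

```python
import numpy as np

# Numeric check of Corollary 2 on a diagonal quadratic (alpha = 1, plain GD).
rng = np.random.default_rng(1)
h = rng.uniform(0.5, 2.0, size=6)          # diagonal Hessian entries h_i > 0
eta, K = 0.3, 10                            # eta * h_i <= 1 holds for these values
theta0 = rng.standard_normal(6)
grad = lambda th: h * th                    # g(theta) = H0 theta, H0 diagonal

# One clean GXPO outer step: two probe GD steps, geometric scale, reposition,
# then one corrective GD step from the repositioned point.
g0 = grad(theta0)
theta1 = theta0 - eta * g0
g1 = grad(theta1)
theta2 = theta1 - eta * g1
r = g1 / g0                                 # all coordinates active here
scale = (1 - r**K) / (1 - r**2)
theta_tilde = theta0 + (theta2 - theta0) * scale    # = theta_K under the model
theta_gxpo = theta_tilde - eta * grad(theta_tilde)  # slow correction

# K+1 plain-GD steps from theta0 for comparison.
th = theta0.copy()
for _ in range(K + 1):
    th = th - eta * grad(th)
print(np.max(np.abs(theta_gxpo - th)))      # ~1e-16: the two points coincide
```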
Real losses are not diagonal quadratics, so the next result bounds the local error of the GD surrogate.
###### Theorem 3 (Local displacement-error bound for the GD surrogate).
Suppose $K\geq 2$, $\mathcal{L}\in C^{3}$, $\sup_{\xi}\|\nabla^{3}\mathcal{L}(\xi)\|\leq M_{3}$, and the true GD trajectory satisfies

$$\sup_{0\leq n<K}\|g(\theta_{n}^{\mathrm{true}})\|\leq G.$$

Let $\rho_{\star}\geq 1$ and $\rho_{\star}\geq\|I-\eta H_{0}\|$. Split coordinates into

$$\mathcal{A}=\{i:|g_{0,i}|>\delta\},\qquad\mathcal{S}=\mathcal{A}^{c}.$$

Consider the clean active-set surrogate that uses empirical ratios on $\mathcal{A}$ and the observed two-probe displacement on $\mathcal{S}$. If the active empirical ratios and diagonal surrogate rates are bounded by $R$, then

$$\|\theta_{K}^{\mathrm{emp}}-\theta_{K}^{\mathrm{true}}\|\leq E_{\mathrm{off}}+E_{\mathrm{ratio}}+E_{\mathrm{nonquad}},$$

where $E_{\mathrm{off}}$ comes from off-diagonal Hessian coupling, $E_{\mathrm{ratio}}$ from empirical-ratio error and inactive-coordinate fallback, and $E_{\mathrm{nonquad}}$ from the Taylor remainder. The explicit constants are given in Theorem [8](https://arxiv.org/html/2605.06755#Thmtheorem8), Lemma [9](https://arxiv.org/html/2605.06755#Thmtheorem9), and Corollary [10](https://arxiv.org/html/2605.06755#Thmtheorem10) in Appendix [A.5](https://arxiv.org/html/2605.06755#A1.SS5); together they prove

$$E_{\mathrm{off}}=O\!\left(K^{2}\eta^{2}\|H_{0}^{\mathrm{off}}\|\,\|g_{0}\|\,\rho_{\star}^{K-2}\right),\qquad E_{\mathrm{ratio}}=O(\eta^{2}/\delta)+O(\eta\|g_{0,\mathcal{S}}\|_{1})+O\!\left(\eta^{2}\|(H_{0}g_{0})_{\mathcal{S}}\|_{1}\right),\qquad E_{\mathrm{nonquad}}=O\!\left(K^{3}\eta^{3}M_{3}G^{2}\rho_{\star}^{K-1}\right).$$

This bound gives simple checks for whether GXPO is operating in its intended local regime. The extrapolated displacement should have small error, the error may grow with $K$ but should remain controlled, interpolation should make $\tilde{\theta}$ closer than the full extrapolated point $\theta_{K}$, inactive-coordinate fallback should be limited, and the corrective gradient should remain aligned with the initial gradient. Proposition [4](https://arxiv.org/html/2605.06755#Thmtheorem4) in Appendix [A.2](https://arxiv.org/html/2605.06755#A1.SS2) gives the local-quadratic alignment check for the repositioning step. Table [6](https://arxiv.org/html/2605.06755#A3.T6) and Table [7](https://arxiv.org/html/2605.06755#A3.T7) show this pattern: median displacement errors stay around $10^{-10}$–$10^{-9}$, $\tilde{\theta}$ is consistently closer than $\theta_{K}$, and $\cos(g_{0},g_{\mathrm{slow}})$ stays near $0.97$. These measurements support the local geometric approximation, but Theorem [3](https://arxiv.org/html/2605.06755#Thmtheorem3) remains a conservative surrogate bound rather than a numerical prediction of the exact training trajectory.
## 3 Experiments & Results
### 3.1 Experimental Settings
##### Models and data.
We evaluate GRPO-family reasoning RL on Qwen2.5 and Llama3.2 instruction models (Qwen et al., [2024](https://arxiv.org/html/2605.06755#bib.bib25); Grattafiori et al., [2024](https://arxiv.org/html/2605.06755#bib.bib8); Meta, [2024](https://arxiv.org/html/2605.06755#bib.bib20)). Training uses the Hendrycks MATH Level 3–5 split (Hendrycks et al., [2021](https://arxiv.org/html/2605.06755#bib.bib12)). Evaluation is conducted on Math-500 (Hendrycks et al., [2021](https://arxiv.org/html/2605.06755#bib.bib12)), AMC23 (AoPS, [2024](https://arxiv.org/html/2605.06755#bib.bib19)), GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.06755#bib.bib3)), Minerva Math (Lewkowycz et al., [2022](https://arxiv.org/html/2605.06755#bib.bib15)), and OlympiadBench (He et al., [2024](https://arxiv.org/html/2605.06755#bib.bib10)).
##### Training protocol.
We use the same prompts, rewards, decoding setup, KL penalty, and GRPO loss for all methods, changing only the policy-update rule. We train Qwen2.5-7B with LoRA on the attention projections $(q,k,v,o)$, using LoRA rank $128$ and LoRA alpha $256$ (Hu et al., [2021](https://arxiv.org/html/2605.06755#bib.bib13)). All runs use bf16 precision, learning rate $10^{-7}$, gradient clipping at $1.0$, PPO clipping $\epsilon=0.2$, and KL coefficient $\beta=0.001$. Each batch contains $128$ questions with $5$ responses per question, a maximum of $3072$ generated tokens, and a context window of $4096$. We train for $300$ steps and evaluate with $16$ responses per prompt on Math500, GSM8K, AMC23, MinervaMath, and OlympiadBench. For GXPO, we use $\alpha_{0}=0.5$, $\delta=10^{-8}$, $\tau=0.5$, and trajectory-aware shutoff. For SFPO, we use $\alpha_{0}=0.5$, $\tau=2.0$. All experiments were run on 4 NVIDIA H100 GPUs, and wall-clock efficiency was measured under this setup.
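For reference, the protocol above can be collected into a single configuration object; the key names below follow common GRPO trainer conventions and are assumptions, not the authors' configuration schema.

```python
# Hyperparameters from §3.1, gathered as a hedged config sketch.
gxpo_config = {
    "precision": "bf16",
    "learning_rate": 1e-7,
    "grad_clip": 1.0,
    "ppo_clip_eps": 0.2,
    "kl_coef": 0.001,
    "questions_per_batch": 128,
    "responses_per_question": 5,
    "max_new_tokens": 3072,
    "context_window": 4096,
    "train_steps": 300,
    "eval_responses_per_prompt": 16,
    "lora": {"targets": ["q", "k", "v", "o"], "rank": 128, "alpha": 256},
    "gxpo": {"alpha0": 0.5, "delta": 1e-8, "tau": 0.5},
    "sfpo": {"alpha0": 0.5, "tau": 2.0},
}
```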
##### Methods and budgets.
We compare GRPO, SFPO, and GXPO under the same training and evaluation budget. SFPO and GXPO are run with $K\in\{3,5,10\}$, using $\alpha_{0}=0.5$ unless varied in ablations. GRPO uses one backward pass per step, SFPO uses $K+1$ backward passes, and GXPO uses three backward passes during its active extrapolation phase before falling back to one pass after adaptive shutoff. We report accuracy alongside step efficiency, wall-clock efficiency, and backward-pass efficiency.
##### Evaluation metric.
Following the pass@k evaluation convention (Chen et al., [2021](https://arxiv.org/html/2605.06755#bib.bib1)) and recent reasoning-RL evaluations that report pass@1 from multiple non-greedy samples (DeepSeek-AI, [2025a](https://arxiv.org/html/2605.06755#bib.bib7); Zuo et al., [2025](https://arxiv.org/html/2605.06755#bib.bib46)), each benchmark is evaluated multiple times with rollout temperature 1, and we report the average pass@1 accuracy by default.
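Concretely, average sampled pass@1 reduces to the mean per-prompt fraction of verifier-correct samples; a minimal sketch (the array layout is an assumption) is:

```python
import numpy as np

def avg_pass_at_1(correct: np.ndarray) -> float:
    """Average sampled pass@1.

    `correct` is a (num_prompts, num_samples) boolean array of verifier
    outcomes; the mean over samples, then over prompts, is the reported score.
    """
    return float(correct.mean(axis=1).mean())
```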
### 3.2 Main Results
#### 3.2.1 Math Reasoning Benchmarks
Table 1: Performance on math reasoning benchmarks after training on the Hendrycks MATH dataset. SFPO and GXPO use the same reposition strength, $\alpha=0.5$, across all reported settings.

| Model | Method | $k$ | BP | Math-500 | AMC23 | GSM8k | Minerva | Olympiad | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B | GRPO | – | 1 | 24.18 | 2.99 | 43.53 | 5.49 | 6.10 | 16.46 |
| | SFPO | 3 | 4 | 25.71 | 4.43 | 45.29 | 5.91 | 6.78 | 17.62 |
| | SFPO | 5 | 6 | 29.98 | 4.43 | 50.43 | 6.59 | 7.45 | 19.78 |
| | SFPO | 10 | 11 | 30.00 | 5.08 | 52.51 | 6.17 | 7.15 | 20.18 |
| | GXPO | 3 | 3 | 30.75 | 6.38 | 51.49 | 6.53 | 7.64 | 20.56 |
| | GXPO | 5 | 3 | 31.80 | 3.26 | 52.46 | 7.50 | 7.60 | 20.52 |
| | GXPO | 10 | 3 | 32.31 | 5.08 | 54.45 | 7.29 | 8.19 | 21.46 |
| Qwen2.5-3B | GRPO | – | 1 | 56.74 | 30.13 | 78.73 | 16.91 | 14.91 | 39.48 |
| | SFPO | 3 | 4 | 56.44 | 31.07 | 78.94 | 17.52 | 15.25 | 39.84 |
| | SFPO | 5 | 6 | 57.02 | 30.94 | 79.02 | 17.23 | 15.42 | 39.93 |
| | SFPO | 10 | 11 | 57.70 | 30.89 | 79.46 | 16.95 | 15.29 | 40.06 |
| | GXPO | 3 | 3 | 57.59 | 30.65 | 79.47 | 17.35 | 15.42 | 40.10 |
| | GXPO | 5 | 3 | 58.34 | 30.89 | 80.21 | 17.27 | 15.39 | 40.42 |
| | GXPO | 10 | 3 | 59.36 | 32.73 | 80.98 | 17.39 | 15.64 | 41.22 |
| Qwen2.5-7B | GRPO | – | 1 | 66.56 | 48.78 | 88.43 | 20.91 | 19.48 | 48.83 |
| | SFPO | 3 | 4 | 66.65 | 49.45 | 88.52 | 20.64 | 19.11 | 48.87 |
| | SFPO | 5 | 6 | 71.75 | 49.87 | 88.49 | 23.43 | 20.57 | 50.82 |
| | GXPO | 3 | 3 | 71.70 | 47.60 | 88.60 | 23.66 | 20.76 | 50.46 |
| | GXPO | 5 | 3 | 71.80 | 50.00 | 88.63 | 23.58 | 20.79 | 50.96 |
| Llama3.2-3B | GRPO | – | 1 | 33.46 | 10.57 | 67.04 | 10.98 | 5.76 | 25.56 |
| | SFPO | 3 | 4 | 33.51 | 11.28 | 66.99 | 11.12 | 5.98 | 25.77 |
| | SFPO | 5 | 6 | 34.15 | 11.74 | 67.44 | 11.44 | 5.91 | 26.14 |
| | GXPO | 3 | 3 | 34.85 | 11.09 | 68.14 | 11.17 | 6.19 | 26.29 |
| | GXPO | 5 | 3 | 36.09 | 12.92 | 68.68 | 12.05 | 6.31 | 27.21 |
As shown in Table [1](https://arxiv.org/html/2605.06755#S3.T1), GXPO consistently outperforms GRPO and SFPO across all four models, from 1.5B to 7B parameters. On Qwen2.5-1.5B and Qwen2.5-3B, GXPO with $k\in\{3,5,10\}$ achieves the highest average accuracy by a clear margin over both baselines. On Qwen2.5-7B, the best performance is obtained at $k=5$, and on Llama3.2-3B, GXPO with $k=5$ performs best across benchmarks. All gains are achieved without increasing the active-phase backward-pass cost with $k$.
#### 3.2.2 Training Dynamics
GXPO improves not only final accuracy but also the efficiency of reaching strong policies. On Llama3.2-3B, GXPO $k{=}10$ reaches GRPO's peak-accuracy threshold in 60 steps and 180 backward passes, compared with 240 steps and 240 backward passes for GRPO, while also attaining the highest peak accuracy among the compared methods (Table [2](https://arxiv.org/html/2605.06755#S3.T2)). The wall-clock and hyperparameter views show the same trend: GXPO improves steadily during its active phase across $\alpha$ and $k$ (Fig. [2](https://arxiv.org/html/2605.06755#S3.F2)) and reaches higher Pass@16 in less time than both GRPO and SFPO (Fig. [3](https://arxiv.org/html/2605.06755#S3.F3)), while keeping the active-phase cost fixed at three backward passes regardless of $k$. This fixed-cost lookahead gives GXPO a better accuracy–compute trade-off at larger $k$, and the KL/clip diagnostics indicate that repositioning does not substantially destabilize the policy update (Table [8](https://arxiv.org/html/2605.06755#A3.T8); Appendices [B](https://arxiv.org/html/2605.06755#A2) and [C.4](https://arxiv.org/html/2605.06755#A3.SS4)).

Table 2: Convergence efficiency comparison on Hendrycks MATH. Speedup is relative to GRPO.

| Model | Method | $k$ | Peak Acc. | Steps to match GRPO | Step ↑ | Hours to match GRPO | Time ↑ | BPs to match GRPO | BP ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Llama3.2-3B | GRPO | – | 0.3410 | 240 | 1.00× | 14.14 h | 1.00× | 240 | 1.00× |
| | SFPO | 3 | 0.3465 | 80 | 3.00× | 9.07 h | 1.56× | 320 | 0.75× |
| | SFPO | 10 | 0.3525 | 80 | 3.00× | 18.68 h | 0.76× | 870 | 0.28× |
| | GXPO | 3 | 0.3619 | 65 | 3.69× | 6.72 h | 2.10× | 195 | 1.23× |
| | GXPO | 10 | 0.3670 | 60 | 4.00× | 6.06 h | 2.33× | 180 | 1.33× |
Figure 2: Pass@16 accuracy versus training steps across $\alpha\in\{0.1,0.5,1.0\}$ and $k\in\{3,5,10\}$. Solid lines denote smoothed curves.

Figure 3: Training efficiency across GRPO, GXPO, and SFPO, with results reported up to 300 training steps. Left: Pass@16 vs. wall-clock time, where GXPO ($k{=}10$) leads throughout. Right: peak Pass@16 vs. time-to-peak, where GXPO achieves a better efficiency frontier.

##### Ablation Studies.
On Qwen2.5-1.5B, sweeping $\alpha\in\{0.1,0.5,1.0\}$ and $k\in\{3,5,10\}$ reveals that larger values of both consistently improve Math-500 pass@16, with the advantage persisting under backward-pass normalization, confirming that gains reflect optimization quality rather than added compute. Varying the stability threshold $\tau$ shows consistent improvements across all efficiency views (Tables [3](https://arxiv.org/html/2605.06755#A3.T3) and [4](https://arxiv.org/html/2605.06755#A3.T4)). On Llama3.2-3B, iso-backward-pass and iso-wall-clock comparisons confirm that GXPO's gains over GRPO and SFPO hold under compute-controlled conditions. Full results and surrogate displacement diagnostics supporting the local geometric approximation are provided in Appendix [C](https://arxiv.org/html/2605.06755#A3).
## 4 Conclusion
GXPO is a GRPO-compatible policy-update rule that approximates local $K$-step lookahead using two probe gradients and one corrective gradient, while keeping the active-phase backward-pass count fixed at three. By extrapolating short-horizon gradient changes, GXPO captures useful lookahead information without requiring the backward-pass cost to grow with $K$. Across Qwen2.5 and Llama3.2 math-reasoning experiments, GXPO improves sampled pass@1 over GRPO and matches or exceeds SFPO with fewer backward passes and faster time-to-target performance. Since GXPO reuses the same rollouts, rewards, advantages, and GRPO loss, it can be added to existing RLVR pipelines with minimal changes. Overall, GXPO offers a practical middle ground between efficient single-step GRPO and more expensive multi-step lookahead methods.
## 5 Limitations
Our analysis is a surrogate analysis under clean GD-style assumptions, while the implementation uses AdamW with stateful moments and adaptive preconditioning. Although diagnostics support the intended local regime, a full theory for stateful optimizers remains future work. Our experiments also focus on math RLVR with verifiable rewards, so broader tasks and larger-scale settings should be tested.
## 6 Broader Impact
GXPO aims to improve both compute and time efficiency in RLVR training by reducing the backward-pass cost of lookahead-style updates. This can lower training cost and make reasoning-RL research more accessible. However, more efficient training may also accelerate stronger reasoning models, so models trained with GXPO should undergo standard safety, misuse, and reliability evaluations before deployment.
## References
- Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. *arXiv preprint arXiv:2107.03374*, 2021.
- Chen et al. (2024) Chen, L., Li, S., Yan, J., Wang, H., Gunaratna, K., Yadav, V., Tang, Z., Srinivasan, V., Zhou, T., Huang, H., and Jin, H. AlpaGasus: Training a better Alpaca with fewer data. In *ICLR*, 2024.
- Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. *arXiv preprint arXiv:2110.14168*, 2021.
- Cui et al. (2025) Cui, G., Zhang, Y., Chen, J., Yuan, L., Wang, Z., Zuo, Y., Li, H., Fan, Y., Chen, H., Chen, W., Liu, Z., Peng, H., Bai, L., Ouyang, W., Cheng, Y., Zhou, B., and Ding, N. The entropy mechanism of reinforcement learning for reasoning language models. *arXiv preprint arXiv:2505.22617*, 2025.
- Dai et al. (2025) Dai, M., Liu, S., and Si, Q. Stable reinforcement learning for efficient reasoning. *arXiv preprint arXiv:2505.18086*, 2025.
- Dalal et al. (2025) Dalal, G., Hallak, A., Thoppe, G., Mannor, S., and Chechik, G. Policy gradient with tree expansion. In *ICML*, 2025.
- DeepSeek-AI (2025a) DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. *arXiv preprint arXiv:2501.12948*, 2025.
- Grattafiori et al. (2024) Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. *arXiv preprint arXiv:2407.21783*, 2024.
- Fatemi et al. (2025) Fatemi, M., Rafiee, B., Tang, M., and Talamadupula, K. Concise reasoning via reinforcement learning. *arXiv preprint arXiv:2504.05185*, 2025.
- He et al. (2024) He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. *arXiv preprint arXiv:2402.14008*, 2024.
- He et al. (2025) He, J., Li, T., Feng, E., Du, D., Liu, Q., Liu, T., Xia, Y., and Chen, H. History rhymes: Accelerating LLM reinforcement learning with RhymeRL. *arXiv preprint arXiv:2508.18588*, 2025.
- Hendrycks et al. (2021) Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. In *NeurIPS*, 2021.
- Hu et al. (2021) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. *arXiv preprint arXiv:2106.09685*, 2021.
- Kwon et al. (2023) Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In *Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23)*, pages 611–626, 2023. doi:10.1145/3600006.3613165.
- Lewkowycz et al. (2022) Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. In *NeurIPS*, 2022.
- Li et al. (2025) Li, X., Zou, H., and Liu, P. LIMR: Less is more for RL scaling. *arXiv preprint arXiv:2502.11886*, 2025.
- Liu et al. (2025) Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. Understanding R1-Zero-like training: A critical perspective. *arXiv preprint arXiv:2503.20783*, 2025.
- Loshchilov and Hutter (2019) Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In *ICLR*, 2019.
- AoPS (2024) Art of Problem Solving. AMC Problems and Solutions. [https://artofproblemsolving.com/wiki/index.php?title=AMC_Problems_and_Solutions](https://artofproblemsolving.com/wiki/index.php?title=AMC_Problems_and_Solutions), 2024.
- Meta (2024) Meta. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. Technical blog, 2024. [https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
- Mroueh (2025) Mroueh, Y. Reinforcement learning with verifiable rewards: GRPO's effective loss, dynamics, and success amplification. *arXiv preprint arXiv:2503.06639*, 2025.
- Mroueh et al. (2025) Mroueh, Y., Dupuis, N., Belgodere, B., Nitsure, A., Rigotti, M., Greenewald, K., Navratil, J., Ross, J., and Rios, J. Revisiting group relative policy optimization: Insights into on-policy and off-policy training. *arXiv preprint arXiv:2505.22257*, 2025.
- Muennighoff et al. (2025) Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Li, F.-F., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple test-time scaling. *arXiv preprint arXiv:2501.19393*, 2025.
- Protopapas and Barakat (2024) Protopapas, K. and Barakat, A. Policy mirror descent with lookahead. In *NeurIPS*, 2024.
- Qwen et al. (2024) Qwen, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. *arXiv preprint arXiv:2412.15115*, 2024.
- Rosenberg et al. (2023) Rosenberg, A., Hallak, A., Mannor, S., Chechik, G., and Dalal, G. Planning and learning with adaptive lookahead. In *AAAI*, 2023.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. *arXiv preprint arXiv:1707.06347*, 2017.
- Shao et al. (2024) Shao, Z., Wang, P., Zhu, Q., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. *arXiv preprint arXiv:2402.03300*, 2024.
- Shen et al. (2025) Shen, S., Shen, P., Zhao, W., and Zhu, D. Mitigating think-answer mismatch in LLM reasoning through noise-aware advantage reweighting. *arXiv preprint arXiv:2508.05928*, 2025.
- Sheng et al. (2025) Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. HybridFlow: A flexible and efficient RLHF framework. In *Proceedings of the Twentieth European Conference on Computer Systems (EuroSys '25)*, 2025. doi:10.1145/3689031.3696075.
- Sikchi et al. (2021) Sikchi, H., Zhou, W., and Held, D. Learning off-policy with online planning. In *CoRL*, 2021.
- Sutton et al. (1999) Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In *NeurIPS*, 1999.
- Wang et al. (2026) Wang, Z., Wang, Z., Fu, J., Qu, X., Cheng, Q., Tang, S., Zhang, M., and Huo, X. Slow-Fast Policy Optimization: Reposition-before-update for LLM reasoning. In *ICLR*, 2026.
- Wang et al. (2025) Wang, Y., Yang, Q., Zeng, Z., Ren, L., Liu, L., Peng, B., Cheng, H., He, X., Wang, K., Gao, J., et al. Reinforcement learning for reasoning in large language models with one training example. *arXiv preprint arXiv:2504.20571*, 2025.
- Wen et al. (2025) Wen, X., Liu, Z., Zheng, S., Ye, S., Wu, Z., Wang, Y., Xu, Z., Liang, X., Li, J., Miao, Z., Bian, J., and Yang, M. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. *arXiv preprint arXiv:2506.14245*, 2025.
- Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. *Machine Learning*, 8(3–4):229–256, 1992.
- Xia et al. (2024) Xia, M., Malladi, S., Gururangan, S., Arora, S., and Chen, D. LESS: Selecting influential data for targeted instruction tuning. In *Proceedings of the 41st International Conference on Machine Learning*, PMLR 235:54104–54132, 2024.
- Xiong et al. (2025) Xiong, W., Yao, J., Xu, Y., Pang, B., Wang, L., Sahoo, D., Li, J., Jiang, N., Zhang, T., Xiong, C., and Dong, H. A minimalist approach to LLM reasoning: From rejection sampling to reinforce. *arXiv preprint arXiv:2504.11343*, 2025.
- Xu et al. (2026) Xu, Y. E., Savani, Y., Fang, F., and Kolter, J. Z. Not all rollouts are useful: Down-sampling rollouts in LLM reinforcement learning. *Transactions on Machine Learning Research*, 2026.
- Ye et al. (2025) Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., and Liu, P. LIMO: Less is more for reasoning. In *COLM*, 2025.
- Yu et al. (2025) Yu, Q., Zhang, Z., Zhu, R., et al. DAPO: An open-source LLM reinforcement learning system at scale. In *NeurIPS*, 2025.
- Yue et al. (2025) Yue, Y., Yuan, Y., Yu, Q., Zuo, X., Zhu, R., Xu, W., Chen, J., Wang, C., Fan, T., Du, Z., Wei, X., Yu, X., Liu, G., Liu, J., Liu, L., Lin, H., Lin, Z., Ma, B., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhang, R., Liu, X., Wang, M., Wu, Y., and Yan, L. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. *arXiv preprint arXiv:2504.05118*, 2025.
- Zhang et al. (2019) Zhang, M., Lucas, J., Ba, J., and Hinton, G. E. Lookahead optimizer: $k$ steps forward, 1 step back. In *NeurIPS*, 2019.
- Zhou et al. (2021) Zhou, P., Yan, H., Yuan, X., Feng, J., and Yan, S. Towards understanding why lookahead generalizes better than SGD and beyond. In *NeurIPS*, 2021.
- Zheng et al. (2025) Zheng, H., Zhou, Y., Bartoldson, B. R., Kailkhura, B., Lai, F., Zhao, J., and Chen, B. Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts. *arXiv preprint arXiv:2506.02177*, 2025.
- Zuo et al. (2025) Zuo, Y., Zhang, K., Sheng, L., Qu, S., Cui, G., Zhu, X., Li, H., Zhang, Y., Long, X., Hua, E., Qi, B., Sun, Y., Ma, Z., Yuan, L., Ding, N., and Zhou, B. TTRL: Test-time reinforcement learning. *arXiv preprint arXiv:2504.16084*, 2025.
## Appendix A Proofs
This appendix proves the claims used in §[2](https://arxiv.org/html/2605.06755#S2). The first group of results justifies the local geometric extrapolation used by GXPO; the second group bounds the gap between that surrogate and the true gradient-descent trajectory; the final group gives an idealized backward-pass budget check under a global diagonal quadratic model. Throughout the appendix, the analysis uses plain gradient descent to isolate the extrapolation mechanism from optimizer-state effects.
### A.1 Proof of Theorem [1](https://arxiv.org/html/2605.06755#Thmtheorem1)
###### Proof.
Let $A=I-\eta H_{0}$. We prove the claim by induction.

*Step 1: base cases.* For $n=0$ and $n=1$, respectively, $g_{0}=A^{0}g_{0}$ and, by ([2](https://arxiv.org/html/2605.06755#S2.E2)), $g_{1}=(I-\eta H_{0})g_{0}=Ag_{0}$. Thus the formula holds for the first two iterates.

*Step 2: induction assumption.* Assume that the formula holds up to step $n$, i.e.,

$$g_{k}=A^{k}g_{0}\qquad\text{for all }0\leq k\leq n.$$

We show that it also holds at step $n+1$.

*Step 3: write the next gradient using the local quadratic model.* Under Assumption [1](https://arxiv.org/html/2605.06755#Thmassumption1), the local model and the plain-GD displacement give

$$g_{n+1}=g_{0}+H_{0}(\theta_{n+1}-\theta_{0}),\qquad\theta_{n+1}-\theta_{0}=-\eta\sum_{k=0}^{n}g_{k},$$

so

$$g_{n+1}=g_{0}-\eta H_{0}\sum_{k=0}^{n}g_{k}=g_{0}-\eta H_{0}\sum_{k=0}^{n}A^{k}g_{0},$$

where the last equality uses the induction assumption.

*Step 4: simplify the geometric sum.* Since $A=I-\eta H_{0}$, we have $I-A=\eta H_{0}$. Therefore,

$$(I-A)\sum_{k=0}^{n}A^{k}=\sum_{k=0}^{n}(A^{k}-A^{k+1})=A^{0}-A^{n+1}=I-A^{n+1},$$

where the middle terms cancel telescopically. This identity does not require $A$ to be invertible. Multiplying by $g_{0}$ gives

$$\eta H_{0}\sum_{k=0}^{n}A^{k}g_{0}=(I-A)\sum_{k=0}^{n}A^{k}g_{0}=(I-A^{n+1})g_{0}.$$

*Step 5: conclude the induction.* Substituting the last identity into the expression for $g_{n+1}$ yields

$$g_{n+1}=g_{0}-(I-A^{n+1})g_{0}=A^{n+1}g_{0}.$$

Thus the formula holds at step $n+1$, completing the induction. ∎
### A.2 Gradient alignment under the local quadratic model
###### Proposition 4 (Gradient alignment under the local quadratic model).
Under Assumption [1](https://arxiv.org/html/2605.06755#Thmassumption1), let $\tilde{\theta}$ be defined by ([9](https://arxiv.org/html/2605.06755#S2.E9)). If

$$\alpha\,\|H_{0}\|\,\|\theta_{K}-\theta_{0}\|<\|g_{0}\|,$$

then under the local quadratic model the *modelled* corrective gradient satisfies $\langle g_{0},\,g_{0}+H_{0}(\tilde{\theta}-\theta_{0})\rangle>0$. This only concerns the gradient predicted by the local quadratic model. Algorithm [1](https://arxiv.org/html/2605.06755#alg1) uses the true gradient $g_{\mathrm{slow}}=\nabla\mathcal{L}(\tilde{\theta})$, so the proposition should be read as a local geometric sanity check for the repositioning step rather than as a descent or convergence guarantee.
###### Proof.
By definition,

$$\tilde{\theta}-\theta_{0}=\alpha(\theta_{K}-\theta_{0}),\qquad\langle g_{0},\,g_{0}+H_{0}(\tilde{\theta}-\theta_{0})\rangle=\|g_{0}\|^{2}+\alpha\langle g_{0},\,H_{0}(\theta_{K}-\theta_{0})\rangle.$$

By Cauchy–Schwarz and submultiplicativity of the operator norm,

$$|\langle g_{0},H_{0}(\theta_{K}-\theta_{0})\rangle|\leq\|g_{0}\|\cdot\|H_{0}(\theta_{K}-\theta_{0})\|\leq\|g_{0}\|\cdot\|H_{0}\|\cdot\|\theta_{K}-\theta_{0}\|.$$

Hence

$$\langle g_{0},\,g_{0}+H_{0}(\tilde{\theta}-\theta_{0})\rangle\geq\|g_{0}\|\bigl(\|g_{0}\|-\alpha\,\|H_{0}\|\,\|\theta_{K}-\theta_{0}\|\bigr),$$

which is strictly positive under the stated condition. ∎
### A.3 Derivation of the Geometric Decay Approximation
The analysis throughout this appendix is conducted under plain gradient descent,

$$\theta_{n+1}=\theta_{n}-\eta\,g_{n},$$

as a surrogate for the practical first-order-optimizer implementation in Algorithm [1](https://arxiv.org/html/2605.06755#alg1). This surrogate isolates the extrapolation mechanism from optimizer-specific state, such as momentum and adaptive preconditioning. The results below therefore describe the GD surrogate only; they should be read as an explanation of the extrapolation geometry rather than as a direct model of the full optimizer dynamics used in implementation.

By Theorem [1](https://arxiv.org/html/2605.06755#Thmtheorem1), the exact gradient evolution under Assumption [1](https://arxiv.org/html/2605.06755#Thmassumption1) is $g_{n}=(I-\eta H_{0})^{n}g_{0}$. Direct evaluation of this expression requires $O(d^{2})$ matrix arithmetic, which is infeasible at large model scales. The remainder of this section derives the coordinate-wise geometric approximation that renders the computation tractable.
##### Diagonal Hessian case.
Suppose first that $H_{0}$ is diagonal:

$$H_{0}=\mathrm{diag}(h_{11},h_{22},\dots,h_{dd}).$$

Then

$$I-\eta H_{0}=\mathrm{diag}(1-\eta h_{11},\,1-\eta h_{22},\,\dots,\,1-\eta h_{dd}),$$

and therefore

$$(I-\eta H_{0})^{n}=\mathrm{diag}\!\big((1-\eta h_{11})^{n},\dots,(1-\eta h_{dd})^{n}\big).$$

Multiplying by $g_{0}$ gives, for every coordinate $i$,

$$g_{n,i}=(1-\eta h_{ii})^{n}g_{0,i}.\tag{11}$$
##### Diagonal surrogate rate.
Equation ([11](https://arxiv.org/html/2605.06755#A1.E11)) motivates the definition of the *diagonal surrogate rate*

$$\bar{r}_{i}\equiv 1-\eta H_{0,ii}.\tag{12}$$

When $H_{0}$ is diagonal, ([11](https://arxiv.org/html/2605.06755#A1.E11)) takes the form

$$g_{n,i}=\bar{r}_{i}^{n}g_{0,i},$$

and, in the GD surrogate, the geometric decay model is exact.
###### Lemma 5 (Exactness of geometric decay for diagonal $H_{0}$ in the GD surrogate).
If $H_{0}$ is diagonal, then for every $n\geq 0$ and every coordinate $i$,

$$g_{n,i}=\bar{r}_{i}^{n}g_{0,i},\qquad\bar{r}_{i}=1-\eta H_{0,ii}.$$
###### Proof.
By Theorem [1](https://arxiv.org/html/2605.06755#Thmtheorem1), $g_{n}=(I-\eta H_{0})^{n}g_{0}$. When $H_{0}=\mathrm{diag}(h_{11},\dots,h_{dd})$, the matrix $(I-\eta H_{0})$ is also diagonal with $i$-th entry $1-\eta h_{ii}$, so $(I-\eta H_{0})^{n}$ is diagonal with $i$-th entry $(1-\eta h_{ii})^{n}$. Extracting the $i$-th coordinate gives

$$g_{n,i}=(1-\eta h_{ii})^{n}g_{0,i}=\bar{r}_{i}^{n}g_{0,i},$$

where the last equality uses $\bar{r}_{i}=1-\eta H_{0,ii}$ and $h_{ii}=H_{0,ii}$. ∎
##### Empirical retention ratio.
The diagonal surrogate rate $\bar{r}_{i}$ is not accessible directly. GXPO therefore uses the empirical retention ratio measured from the first two computed gradients:

$$r_{i}\equiv\frac{g_{1,i}}{g_{0,i}},\qquad\text{for coordinates with }g_{0,i}\neq 0.\tag{13}$$

In the plain-GD surrogate, this quantity coincides with $\bar{r}_{i}$ when $H_{0}$ is diagonal, but differs in general. In the implemented method, $r_{i}$ is treated as an empirical coordinate-wise retention ratio measured along the realized fast optimizer trajectory.
###### Lemma 6 (Bias of the empirical retention ratio in the GD surrogate).
Under the plain-GD surrogate and the local quadratic model, for every coordinate $i$ with $g_{0,i}\neq 0$,

$$r_{i}=\bar{r}_{i}-\eta\sum_{j\neq i}H_{0,ij}\,\frac{g_{0,j}}{g_{0,i}}.\tag{14}$$

Equivalently,

$$r_{i}-\bar{r}_{i}=-\eta\sum_{j\neq i}H_{0,ij}\,\frac{g_{0,j}}{g_{0,i}}.\tag{15}$$
###### Proof.

Under the plain-GD surrogate, ([2](https://arxiv.org/html/2605.06755#S2.E2)) gives

$$g_{1}=(I-\eta H_{0})g_{0},$$

so the $i$-th coordinate is

$$g_{1,i}=g_{0,i}-\eta[H_{0}g_{0}]_{i}.$$

Expanding the matrix-vector product,

$$[H_{0}g_{0}]_{i}=\sum_{j=1}^{d}H_{0,ij}g_{0,j}=H_{0,ii}g_{0,i}+\sum_{j\neq i}H_{0,ij}g_{0,j}.$$

Substituting into the expression for $g_{1,i}$ gives

$$g_{1,i}=g_{0,i}-\eta H_{0,ii}g_{0,i}-\eta\sum_{j\neq i}H_{0,ij}g_{0,j}.$$

Dividing both sides by $g_{0,i}$,

$$\frac{g_{1,i}}{g_{0,i}}=1-\eta H_{0,ii}-\eta\sum_{j\neq i}H_{0,ij}\frac{g_{0,j}}{g_{0,i}}.$$

Using ([12](https://arxiv.org/html/2605.06755#A1.E12)) and ([13](https://arxiv.org/html/2605.06755#A1.E13)),

$$r_{i}=\bar{r}_{i}-\eta\sum_{j\neq i}H_{0,ij}\frac{g_{0,j}}{g_{0,i}},$$

which is ([14](https://arxiv.org/html/2605.06755#A1.E14)). Rearranging gives ([15](https://arxiv.org/html/2605.06755#A1.E15)). ∎
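The bias identity is easy to verify numerically. The following sketch instantiates the quadratic model with an arbitrary non-diagonal Hessian (the values of $H_0$, $g_0$, and $\eta$ are illustrative assumptions, not quantities from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 4, 0.1
M = rng.normal(size=(d, d))
H0 = M @ M.T                         # symmetric Hessian with off-diagonal coupling
g0 = rng.normal(size=d)

g1 = g0 - eta * H0 @ g0              # one GD probe under the quadratic model
r = g1 / g0                          # empirical retention ratios, Eq. (13)
r_bar = 1.0 - eta * np.diag(H0)      # diagonal surrogate rates, Eq. (12)

off = H0 - np.diag(np.diag(H0))      # off-diagonal Hessian part
bias = -eta * (off @ g0) / g0        # right-hand side of Eq. (15)
assert np.allclose(r - r_bar, bias)  # the identity holds exactly
```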
##### Sources of approximation error.

Within the GD surrogate, Lemma [6](https://arxiv.org/html/2605.06755#Thmtheorem6) identifies the off-diagonal Hessian coupling as the discrepancy between $r_{i}$ and $\bar{r}_{i}$. Two distinct approximation layers therefore arise:

1. replacing the full quadratic dynamics $(I-\eta H_{0})^{n}g_{0}$ by the diagonal surrogate $(I-\eta\,\mathrm{diag}(H_{0}))^{n}g_{0}$;
2. replacing the diagonal surrogate rate $\bar{r}_{i}$ by the empirical estimator $r_{i}=g_{1,i}/g_{0,i}$.

The same empirical quantity $r_{i}$ is employed in Algorithm [1](https://arxiv.org/html/2605.06755#alg1) because it is computable from two backward passes. Concretely, the per-coordinate bias from Lemma [6](https://arxiv.org/html/2605.06755#Thmtheorem6) is

$$|r_{i}-\bar{r}_{i}|\leq\eta\sum_{j\neq i}|H_{0,ij}|\cdot\frac{|g_{0,j}|}{|g_{0,i}|},$$

which is $O(\eta)$ when the off-diagonal Hessian entries and gradient-component ratios are bounded. The error analysis in Appendix [A.5](https://arxiv.org/html/2605.06755#A1.SS5) quantifies the displacement consequences of both approximation layers: Theorem [8](https://arxiv.org/html/2605.06755#Thmtheorem8) bounds the diagonalization and non-quadratic errors, and Lemma [9](https://arxiv.org/html/2605.06755#Thmtheorem9) bounds the additional cost of using empirical ratios.
### A.4 GD-Surrogate Displacement Identity
Let $\rho=(\rho_{1},\dots,\rho_{d})\in\mathbb{R}^{d}$ be an arbitrary coordinate-wise retention-rate vector. Here $\rho$ is a generic rate vector: later we instantiate this identity with $\rho=\bar{r}$ for the diagonal surrogate and with $\rho=r$ for the empirical-ratio surrogate. The associated coordinate-wise geometric surrogate gradient is defined by

$$\hat{g}_{n,i}^{(\rho)}\equiv\rho_{i}^{\,n}\,g_{0,i},\qquad n\geq 0. \tag{16}$$

Let $\hat{\theta}_{n}^{(\rho)}$ be the plain-GD surrogate trajectory generated by these gradients:

$$\hat{\theta}_{n+1}^{(\rho)}=\hat{\theta}_{n}^{(\rho)}-\eta\,\hat{g}_{n}^{(\rho)},\qquad\hat{\theta}_{0}^{(\rho)}=\theta_{0}. \tag{17}$$
###### Proposition 7 (Total displacement under geometric decay).

For every coordinate $i$, every rate vector $\rho$, and every $K\geq 1$,

$$[\hat{\theta}_{K}^{(\rho)}-\theta_{0}]_{i}=-\eta\,g_{0,i}\,S_{K}(\rho_{i}), \tag{18}$$

where

$$S_{K}(x)\equiv\sum_{n=0}^{K-1}x^{n}=\begin{cases}\dfrac{1-x^{K}}{1-x},&x\neq 1,\\ K,&x=1.\end{cases} \tag{19}$$
###### Proof.

Summing the plain-GD surrogate updates gives

$$\hat{\theta}_{K}^{(\rho)}-\theta_{0}=-\eta\sum_{n=0}^{K-1}\hat{g}_{n}^{(\rho)}.$$

Taking coordinate $i$ and substituting $\hat{g}_{n,i}^{(\rho)}=\rho_{i}^{\,n}g_{0,i}$ yields

$$[\hat{\theta}_{K}^{(\rho)}-\theta_{0}]_{i}=-\eta\,g_{0,i}\sum_{n=0}^{K-1}\rho_{i}^{\,n}=-\eta\,g_{0,i}\,S_{K}(\rho_{i}).\qquad∎$$
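Since Proposition [7](https://arxiv.org/html/2605.06755#Thmtheorem7) reduces the $K$-step displacement to the scalar sum $S_{K}$, the extrapolation target is cheap to evaluate. Below is a vectorized sketch with the $x\to 1$ limit of ([19](https://arxiv.org/html/2605.06755#A1.E19)) handled explicitly; the helper names are illustrative assumptions, not the paper's reference code.

```python
import numpy as np

def S_K(x: np.ndarray, K: int) -> np.ndarray:
    """Geometric partial sum S_K(x) = sum_{n=0}^{K-1} x^n from Eq. (19)."""
    x = np.asarray(x, dtype=float)
    near_one = np.abs(x - 1.0) < 1e-12
    safe = np.where(near_one, 0.0, x)        # avoid 0/0 at x = 1
    closed = (1.0 - safe**K) / (1.0 - safe)  # (1 - x^K) / (1 - x) for x != 1
    return np.where(near_one, float(K), closed)

def surrogate_displacement(g0, rho, eta: float, K: int):
    """Eq. (18): [theta_K - theta_0]_i = -eta * g0_i * S_K(rho_i)."""
    return -eta * g0 * S_K(rho, K)
```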
### A.5 Displacement Error Bound for the GD Surrogate
Three sources of approximation error arise in the derivation:

1. *Diagonalization error:* replacing the full quadratic dynamics by the diagonal surrogate with rates $\bar{r}_{i}=1-\eta H_{0,ii}$.
2. *Ratio-estimation error:* replacing the diagonal surrogate rates $\bar{r}_{i}$ by the empirical ratios $r_{i}=g_{1,i}/g_{0,i}$.
3. *Non-quadratic error:* replacing the true loss by its local quadratic model around $\theta_{0}$.
*Norm conventions.* Throughout this subsection, $\|\cdot\|$ denotes the spectral (operator) norm for matrices and the Euclidean norm for vectors, except where explicitly subscripted (e.g. $\|\cdot\|_{\infty}$ for the row-sum norm in Lemma [9](https://arxiv.org/html/2605.06755#Thmtheorem9)). Let

$$H_{0}^{\mathrm{off}}=H_{0}-\mathrm{diag}(H_{0})$$

denote the off-diagonal part of the Hessian.

Define the exact quadratic-model GD trajectory $\theta_{n}^{\mathrm{quad}}$ by

$$\theta_{n+1}^{\mathrm{quad}}=\theta_{n}^{\mathrm{quad}}-\eta g_{n}^{\mathrm{quad}},\qquad g_{n}^{\mathrm{quad}}=(I-\eta H_{0})^{n}g_{0},\qquad\theta_{0}^{\mathrm{quad}}=\theta_{0},$$

where $g_{n}^{\mathrm{quad}}$ is the gradient at $\theta_{n}^{\mathrm{quad}}$ under the quadratic model (equivalently, the closed form from Theorem [1](https://arxiv.org/html/2605.06755#Thmtheorem1)). Define the true GD trajectory by

$$\theta_{n+1}^{\mathrm{true}}=\theta_{n}^{\mathrm{true}}-\eta g(\theta_{n}^{\mathrm{true}}),\qquad\theta_{0}^{\mathrm{true}}=\theta_{0}.$$

Define the diagonal-surrogate point by

$$\theta_{K}^{\mathrm{diag}}\equiv\hat{\theta}_{K}^{(\bar{r})},\qquad\bar{r}_{i}=1-\eta H_{0,ii}.$$

The active-set empirical-ratio surrogate $\theta_{K}^{\mathrm{emp}}$ is defined in Lemma [9](https://arxiv.org/html/2605.06755#Thmtheorem9): it uses empirical ratios on active coordinates and the observed two-probe displacement on inactive coordinates.
###### Theorem 8 (Error bound for the diagonal surrogate).

Assume $K\geq 2$, $\mathcal{L}\in C^{3}$, and that there exists a neighborhood $\mathcal{U}$, star-shaped with respect to $\theta_{0}$, containing the trajectories

$$\{\theta_{n}^{\mathrm{true}}\}_{n=0}^{K}\cup\{\theta_{n}^{\mathrm{quad}}\}_{n=0}^{K}$$

such that

$$\sup_{\xi\in\mathcal{U}}\|\nabla^{3}\mathcal{L}(\xi)\|\leq M_{3}.$$

Let $\gamma:=\|I-\eta H_{0}\|$ and $\rho_{\max}:=\max(1,\gamma)$. Assume the true GD trajectory satisfies the uniform gradient bound

$$\sup_{0\leq n<K}\|g(\theta_{n}^{\mathrm{true}})\|\leq G.$$

Then the diagonal-surrogate displacement error satisfies

$$\|\theta_{K}^{\mathrm{diag}}-\theta_{K}^{\mathrm{true}}\|\leq\underbrace{\frac{K(K-1)}{2}\,\eta^{2}\,\|H_{0}^{\mathrm{off}}\|\,\|g_{0}\|\,\rho_{\max}^{K-2}}_{\epsilon_{\mathrm{diag}}}+\underbrace{\frac{K(K-1)(2K-1)}{12}\,\eta^{3}\,M_{3}\,G^{2}\,\rho_{\max}^{K-1}}_{\epsilon_{\mathrm{nonquad}}}. \tag{20}$$
###### Proof.

Introduce the exact quadratic-model point $\theta_{K}^{\mathrm{quad}}$ and write

$$\|\theta_{K}^{\mathrm{diag}}-\theta_{K}^{\mathrm{true}}\|\leq\underbrace{\|\theta_{K}^{\mathrm{diag}}-\theta_{K}^{\mathrm{quad}}\|}_{\epsilon_{\mathrm{diag}}}+\underbrace{\|\theta_{K}^{\mathrm{quad}}-\theta_{K}^{\mathrm{true}}\|}_{\epsilon_{\mathrm{nonquad}}}. \tag{21}$$

##### Diagonalization error.

Let

$$A=I-\eta H_{0},\qquad D=I-\eta\,\mathrm{diag}(H_{0}),\qquad E=A-D=-\eta H_{0}^{\mathrm{off}}.$$

Then

$$g_{n}^{\mathrm{quad}}=A^{n}g_{0},\qquad g_{n}^{\mathrm{diag}}=D^{n}g_{0}.$$

We first prove the telescoping identity

$$A^{n}-D^{n}=\sum_{k=0}^{n-1}A^{n-1-k}ED^{k}. \tag{22}$$

Indeed,

$$\sum_{k=0}^{n-1}A^{n-1-k}ED^{k}=\sum_{k=0}^{n-1}A^{n-1-k}(A-D)D^{k}=\sum_{k=0}^{n-1}\big(A^{n-k}D^{k}-A^{n-1-k}D^{k+1}\big). \tag{23}$$

Writing out the first few and last few terms,

$$k=0:\ A^{n}D^{0}-A^{n-1}D^{1},\qquad k=1:\ A^{n-1}D^{1}-A^{n-2}D^{2},\qquad\dots,\qquad k=n-1:\ A^{1}D^{n-1}-A^{0}D^{n},$$

all intermediate terms cancel, leaving

$$A^{n}D^{0}-A^{0}D^{n}=A^{n}-D^{n}.$$

This proves ([22](https://arxiv.org/html/2605.06755#A1.E22)).
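As an aside, ([22](https://arxiv.org/html/2605.06755#A1.E22)) holds for any square matrices $A$ and $D$ with $E=A-D$, so it can be spot-checked numerically (an illustrative sketch with arbitrary matrices, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 7
A = rng.normal(size=(d, d))
D = np.diag(np.diag(A))   # in the proof, D is the diagonal part of A
E = A - D

mp = np.linalg.matrix_power
lhs = mp(A, n) - mp(D, n)
rhs = sum(mp(A, n - 1 - k) @ E @ mp(D, k) for k in range(n))
assert np.allclose(lhs, rhs)  # telescoping identity (22)
```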
Taking norms in ([22](https://arxiv.org/html/2605.06755#A1.E22)) gives

$$\|A^{n}-D^{n}\|\leq\sum_{k=0}^{n-1}\|A\|^{n-1-k}\,\|E\|\,\|D\|^{k}. \tag{24}$$

Recall $\gamma:=\|A\|$ and $\rho_{\max}:=\max(1,\gamma)$. Since $D$ is diagonal and shares the diagonal entries of $A$, $\|D\|=\max_{i}|A_{ii}|\leq\|A\|=\gamma\leq\rho_{\max}$. Therefore

$$\|A^{n}-D^{n}\|\leq\sum_{k=0}^{n-1}\rho_{\max}^{\,n-1}\|E\|=n\,\rho_{\max}^{\,n-1}\|E\|=n\,\eta\,\|H_{0}^{\mathrm{off}}\|\,\rho_{\max}^{\,n-1}.$$

Multiplying by $\|g_{0}\|$ gives the per-step gradient error

$$\|A^{n}g_{0}-D^{n}g_{0}\|\leq n\,\eta\,\|H_{0}^{\mathrm{off}}\|\,\|g_{0}\|\,\rho_{\max}^{\,n-1}. \tag{25}$$

Accumulating the displacement difference over all steps,

$$\epsilon_{\mathrm{diag}}=\bigg\|\eta\sum_{n=0}^{K-1}\big(A^{n}g_{0}-D^{n}g_{0}\big)\bigg\|\leq\eta\sum_{n=0}^{K-1}\|A^{n}g_{0}-D^{n}g_{0}\|\leq\eta\sum_{n=1}^{K-1}n\,\eta\,\|H_{0}^{\mathrm{off}}\|\,\|g_{0}\|\,\rho_{\max}^{\,n-1}\leq\eta^{2}\,\|H_{0}^{\mathrm{off}}\|\,\|g_{0}\|\,\rho_{\max}^{\,K-2}\sum_{n=0}^{K-1}n\leq\frac{K(K-1)}{2}\,\eta^{2}\,\|H_{0}^{\mathrm{off}}\|\,\|g_{0}\|\,\rho_{\max}^{\,K-2}. \tag{26}$$
##### Non-quadratic error.

Let

$$e_{n}\equiv g(\theta_{n}^{\mathrm{true}})-\big(g_{0}+H_{0}(\theta_{n}^{\mathrm{true}}-\theta_{0})\big).$$

By the Taylor remainder bound ([1](https://arxiv.org/html/2605.06755#S2.E1)),

$$\|e_{n}\|\leq\frac{M_{3}}{2}\,\|\theta_{n}^{\mathrm{true}}-\theta_{0}\|^{2}. \tag{27}$$

The displacement of the true trajectory from $\theta_{0}$ is bounded as follows. Since

$$\theta_{m+1}^{\mathrm{true}}-\theta_{m}^{\mathrm{true}}=-\eta\,g(\theta_{m}^{\mathrm{true}}),$$

the triangle inequality yields

$$\|\theta_{n}^{\mathrm{true}}-\theta_{0}\|\leq\sum_{m=0}^{n-1}\|\theta_{m+1}^{\mathrm{true}}-\theta_{m}^{\mathrm{true}}\|=\eta\sum_{m=0}^{n-1}\|g(\theta_{m}^{\mathrm{true}})\|\leq n\,\eta\,G. \tag{28}$$

Substituting ([28](https://arxiv.org/html/2605.06755#A1.E28)) into ([27](https://arxiv.org/html/2605.06755#A1.E27)) gives

$$\|e_{n}\|\leq\frac{M_{3}}{2}\,n^{2}\,\eta^{2}\,G^{2}. \tag{29}$$

Define the trajectory difference

$$\xi_{n}\equiv\theta_{n}^{\mathrm{true}}-\theta_{n}^{\mathrm{quad}}.$$

Because both trajectories start at $\theta_{0}$, we have $\xi_{0}=0$. Their update difference satisfies

$$\xi_{n+1}=\xi_{n}-\eta\Big(g(\theta_{n}^{\mathrm{true}})-\big(g_{0}+H_{0}(\theta_{n}^{\mathrm{quad}}-\theta_{0})\big)\Big)=\xi_{n}-\eta\Big(g_{0}+H_{0}(\theta_{n}^{\mathrm{true}}-\theta_{0})+e_{n}-g_{0}-H_{0}(\theta_{n}^{\mathrm{quad}}-\theta_{0})\Big)=(I-\eta H_{0})\xi_{n}-\eta e_{n}=A\xi_{n}-\eta e_{n}. \tag{30}$$

Unrolling the recurrence from $\xi_{0}=0$ gives

$$\xi_{K}=-\eta\sum_{n=0}^{K-1}A^{K-1-n}e_{n}. \tag{31}$$

Taking norms and using $\|A^{K-1-n}\|\leq\rho_{\max}^{\,K-1-n}$,

$$\epsilon_{\mathrm{nonquad}}=\|\xi_{K}\|\leq\eta\sum_{n=0}^{K-1}\|A^{K-1-n}\|\,\|e_{n}\|\leq\eta\sum_{n=0}^{K-1}\rho_{\max}^{\,K-1-n}\|e_{n}\|\leq\eta\,\rho_{\max}^{\,K-1}\sum_{n=0}^{K-1}\|e_{n}\|. \tag{32}$$

Substituting ([29](https://arxiv.org/html/2605.06755#A1.E29)),

$$\epsilon_{\mathrm{nonquad}}\leq\eta\,\rho_{\max}^{\,K-1}\sum_{n=0}^{K-1}\frac{M_{3}}{2}\,n^{2}\,\eta^{2}\,G^{2}\leq\frac{M_{3}\,\eta^{3}\,G^{2}}{2}\,\rho_{\max}^{\,K-1}\cdot\frac{K(K-1)(2K-1)}{6}\leq\frac{K(K-1)(2K-1)}{12}\,\eta^{3}\,M_{3}\,G^{2}\,\rho_{\max}^{\,K-1}. \tag{33}$$

Combining ([21](https://arxiv.org/html/2605.06755#A1.E21)), ([26](https://arxiv.org/html/2605.06755#A1.E26)), and ([33](https://arxiv.org/html/2605.06755#A1.E33)) yields ([20](https://arxiv.org/html/2605.06755#A1.E20)). ∎
###### Lemma 9 (Additional error from active-set empirical ratios in the GD surrogate).

Under the plain-GD surrogate and the local quadratic model, consider the clean active-set surrogate corresponding to GXPO. Let $\|H_{0}^{\mathrm{off}}\|_{\infty}\equiv\max_{i}\sum_{j\neq i}|H_{0,ij}|$. Let $\delta>0$ be the active-set threshold in Algorithm [1](https://arxiv.org/html/2605.06755#alg1). Partition the coordinates into $\mathcal{A}=\{i:|g_{0,i}|>\delta\}$ and $\mathcal{S}=\mathcal{A}^{c}$. For $i\in\mathcal{A}$, let $r_{i}=g_{1,i}/g_{0,i}$ be the empirical active-set ratio, and assume $|r_{i}|\leq R$ on $\mathcal{A}$ and $|\bar{r}_{i}|\leq R$ on all coordinates. Define

$$C_{K,R}=\sum_{n=1}^{K-1}nR^{n-1},\qquad D_{K,R}=2+\sum_{n=0}^{K-1}R^{n}.$$

Let $\theta_{K}^{\mathrm{emp}}$ be the active-set surrogate that applies $-\eta g_{0,i}S_{K}(r_{i})$ on $\mathcal{A}$ and keeps the observed two-probe quadratic displacement $-\eta(g_{0,i}+g_{1,i})$ on $\mathcal{S}$. Then

$$\|\theta_{K}^{\mathrm{emp}}-\theta_{K}^{\mathrm{diag}}\|\leq\eta^{2}C_{K,R}\frac{\|H_{0}^{\mathrm{off}}\|_{\infty}}{\delta}\,\|g_{0}\|_{\infty}\,\|g_{0,\mathcal{A}}\|+\eta D_{K,R}\|g_{0,\mathcal{S}}\|_{1}+\eta^{2}\|(H_{0}g_{0})_{\mathcal{S}}\|_{1}, \tag{34}$$

where $\|g_{0,\mathcal{A}}\|$ is the Euclidean norm restricted to $\mathcal{A}$ and $\|g_{0,\mathcal{S}}\|_{1}=\sum_{i\in\mathcal{S}}|g_{0,i}|$.
###### Proof.

For any scalar $x$, write $S_{K}(x)=\sum_{n=0}^{K-1}x^{n}$. On the active set, the empirical and diagonal surrogate displacements differ coordinate-wise by

$$[\theta_{K}^{\mathrm{emp}}-\theta_{K}^{\mathrm{diag}}]_{i}=-\eta g_{0,i}\big(S_{K}(r_{i})-S_{K}(\bar{r}_{i})\big).$$

The polynomial $S_{K}$ is Lipschitz on $[-R,R]$ with constant

$$\sup_{|x|\leq R}|S_{K}'(x)|=\sup_{|x|\leq R}\bigg|\sum_{n=1}^{K-1}nx^{n-1}\bigg|\leq C_{K,R}.$$

Lemma [6](https://arxiv.org/html/2605.06755#Thmtheorem6) gives, for $i\in\mathcal{A}$,

$$|r_{i}-\bar{r}_{i}|\leq\eta\sum_{j\neq i}|H_{0,ij}|\frac{|g_{0,j}|}{|g_{0,i}|}\leq\eta\,\frac{\|H_{0}^{\mathrm{off}}\|_{\infty}\|g_{0}\|_{\infty}}{\delta}.$$

Therefore

$$\|(\theta_{K}^{\mathrm{emp}}-\theta_{K}^{\mathrm{diag}})_{\mathcal{A}}\|\leq\eta^{2}C_{K,R}\,\frac{\|H_{0}^{\mathrm{off}}\|_{\infty}\|g_{0}\|_{\infty}}{\delta}\,\|g_{0,\mathcal{A}}\|.$$

For $i\in\mathcal{S}$, the clean active-set surrogate corresponding to GXPO forms no ratio and keeps the observed two-probe displacement. Under the quadratic model, $g_{1}=g_{0}-\eta H_{0}g_{0}$, so

$$[\theta_{K}^{\mathrm{emp}}-\theta_{K}^{\mathrm{diag}}]_{i}=-\eta(g_{0,i}+g_{1,i})+\eta g_{0,i}S_{K}(\bar{r}_{i})=\eta g_{0,i}\big(S_{K}(\bar{r}_{i})-2\big)+\eta^{2}[H_{0}g_{0}]_{i}.$$

Since $|\bar{r}_{i}|\leq R$, $|S_{K}(\bar{r}_{i})-2|\leq 2+\sum_{n=0}^{K-1}R^{n}=D_{K,R}$. Thus

$$\|(\theta_{K}^{\mathrm{emp}}-\theta_{K}^{\mathrm{diag}})_{\mathcal{S}}\|\leq\eta D_{K,R}\|g_{0,\mathcal{S}}\|_{1}+\eta^{2}\|(H_{0}g_{0})_{\mathcal{S}}\|_{1}.$$

Adding the two coordinate groups proves the claim. ∎
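The surrogate analyzed in Lemma [9](https://arxiv.org/html/2605.06755#Thmtheorem9) combines the geometric extrapolation on $\mathcal{A}$ with the two-probe fallback on $\mathcal{S}$. A minimal sketch, reusing `retention_ratios` and `S_K` from the sketches above (again an illustrative assumption, not the reference implementation):

```python
import numpy as np

def active_set_displacement(g0, g1, eta: float, K: int, delta: float = 1e-8):
    """Clean active-set surrogate displacement from Lemma 9."""
    r, active = retention_ratios(g0, g1, delta)
    disp = np.empty_like(g0)
    # Active coordinates: geometric extrapolation -eta * g0_i * S_K(r_i).
    disp[active] = -eta * g0[active] * S_K(r[active], K)
    # Inactive coordinates: observed two-probe displacement -eta * (g0_i + g1_i).
    disp[~active] = -eta * (g0[~active] + g1[~active])
    return disp
```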
###### Corollary 10 (Combined bound for the empirical-ratio surrogate).

Under Theorem [8](https://arxiv.org/html/2605.06755#Thmtheorem8) and Lemma [9](https://arxiv.org/html/2605.06755#Thmtheorem9),

$$\|\theta_{K}^{\mathrm{emp}}-\theta_{K}^{\mathrm{true}}\|\leq\frac{K(K-1)}{2}\,\eta^{2}\,\|H_{0}^{\mathrm{off}}\|\,\|g_{0}\|\,\rho_{\max}^{K-2}+\eta^{2}C_{K,R}\frac{\|H_{0}^{\mathrm{off}}\|_{\infty}}{\delta}\|g_{0}\|_{\infty}\|g_{0,\mathcal{A}}\|+\eta D_{K,R}\|g_{0,\mathcal{S}}\|_{1}+\eta^{2}\|(H_{0}g_{0})_{\mathcal{S}}\|_{1}+\frac{K(K-1)(2K-1)}{12}\,\eta^{3}\,M_{3}\,G^{2}\,\rho_{\max}^{K-1}. \tag{35}$$
###### Proof.

By the triangle inequality,

$$\|\theta_{K}^{\mathrm{emp}}-\theta_{K}^{\mathrm{true}}\|\leq\|\theta_{K}^{\mathrm{emp}}-\theta_{K}^{\mathrm{diag}}\|+\|\theta_{K}^{\mathrm{diag}}-\theta_{K}^{\mathrm{true}}\|.$$

Applying Lemma [9](https://arxiv.org/html/2605.06755#Thmtheorem9) to the first term and Theorem [8](https://arxiv.org/html/2605.06755#Thmtheorem8) to the second gives ([35](https://arxiv.org/html/2605.06755#A1.E35)). ∎
### A.6 Idealized Diagonal-Quadratic GD-Surrogate Budget Check

This subsection gives an algebraic sanity check for the extrapolation rule in the cleanest possible setting. It does not model the full stateful optimizer dynamics used in the implemented method; instead, it studies a deterministic plain-GD surrogate under a global diagonal quadratic loss, where the coordinate-wise geometric model is exact.
Consider

$$\mathcal{L}(\theta)=\frac{1}{2}\theta^{\top}H_{0}\theta,\qquad H_{0}=\mathrm{diag}(h_{1},\dots,h_{d}),\qquad h_{i}>0,$$

and assume the GD step size satisfies $\eta h_{i}\leq 1$ for all $i$. Let

$$r_{i}=1-\eta h_{i}.$$

Under plain GD,

$$\theta_{n+1}=\theta_{n}-\eta g_{n},\qquad g_{n}=\nabla\mathcal{L}(\theta_{n}),$$

each coordinate evolves independently:

$$g_{n,i}=r_{i}^{\,n}g_{0,i}.$$

By Proposition [7](https://arxiv.org/html/2605.06755#Thmtheorem7), the $K$-step GD displacement is

$$[\theta_{K}-\theta_{0}]_{i}=-\eta g_{0,i}S_{K}(r_{i}),\qquad S_{K}(r_{i})=\sum_{n=0}^{K-1}r_{i}^{\,n}.$$
In the clean GXPO surrogate, the first two GD probe steps produce

$$[\theta_{2}-\theta_{0}]_{i}=-\eta g_{0,i}S_{2}(r_{i}).$$

If finite-precision stabilizers and the inactive-coordinate fallback are omitted, GXPO scales this observed two-step displacement by

$$\frac{S_{K}(r_{i})}{S_{2}(r_{i})}.$$

Hence the extrapolated point satisfies

$$[\theta_{K}^{\mathrm{GXPO}}-\theta_{0}]_{i}=[\theta_{2}-\theta_{0}]_{i}\,\frac{S_{K}(r_{i})}{S_{2}(r_{i})}=-\eta g_{0,i}S_{K}(r_{i}),$$

which is exactly the $K$-th plain-GD iterate.

With full extrapolation $\alpha=1$, the corrective gradient is then evaluated at this exact $K$-step point, and the final correction gives

$$\theta_{\mathrm{new}}^{\mathrm{GXPO}}=\theta_{K}-\eta g_{K}=\theta_{K+1}^{\mathrm{GD}}.$$

Thus, in this idealized diagonal-quadratic GD surrogate, one active GXPO outer step using three backward passes lands at the same point as $K+1$ plain-GD steps.
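This identity can be checked end to end in a few lines. The sketch below (the dimension, step size, and $K$ are arbitrary illustrative choices, not the paper's training configuration) runs the clean three-pass GXPO step on a diagonal quadratic and compares it against $K+1$ explicit GD steps:

```python
import numpy as np

rng = np.random.default_rng(2)
d, eta, K = 6, 0.05, 5
h = rng.uniform(0.5, 1.0 / eta, size=d)   # curvatures with eta * h_i <= 1
theta0 = rng.normal(size=d)
grad = lambda th: h * th                   # gradient of (1/2) theta^T diag(h) theta

# Reference: K+1 plain-GD steps.
theta_gd = theta0.copy()
for _ in range(K + 1):
    theta_gd = theta_gd - eta * grad(theta_gd)

# Clean GXPO step: two probes, geometric extrapolation, one correction.
g0 = grad(theta0)                          # backward pass 1
theta1 = theta0 - eta * g0
g1 = grad(theta1)                          # backward pass 2
theta2 = theta1 - eta * g1
r = g1 / g0                                # equals 1 - eta * h here
S = lambda x, n: (1.0 - x**n) / (1.0 - x)  # geometric sum (r_i != 1 in this setup)
theta_K = theta0 + (theta2 - theta0) * S(r, K) / S(r, 2)
theta_new = theta_K - eta * grad(theta_K)  # backward pass 3: corrective step

assert np.allclose(theta_new, theta_gd)    # lands exactly on the (K+1)-th GD iterate
```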
Let

$$\mu=\min_{i}h_{i}>0,\qquad\rho=(1-\eta\mu)^{2}\in[0,1).$$

Since $0\leq r_{i}\leq 1-\eta\mu$, plain GD satisfies

$$\mathcal{L}(\theta_{n}^{\mathrm{GD}})=\frac{1}{2}\sum_{i}h_{i}r_{i}^{2n}\theta_{0,i}^{2}\leq\rho^{n}\mathcal{L}(\theta_{0}).$$

After $m$ active GXPO outer steps, the surrogate reaches the same point as $(K+1)m$ GD steps, so

$$\mathcal{L}(\theta_{m}^{\mathrm{GXPO}})\leq\rho^{(K+1)m}\mathcal{L}(\theta_{0}).$$

Since each active GXPO step uses three backward passes, for $B=3m$ backward passes,

$$\mathcal{L}(\theta_{B/3}^{\mathrm{GXPO}})\leq\rho^{(K+1)B/3}\mathcal{L}(\theta_{0}).$$

If $0<\rho<1$, then to reach $\mathcal{L}(\theta)\leq\varepsilon$, this idealized surrogate requires

$$B\geq\frac{3}{K+1}\,\frac{\log(\mathcal{L}(\theta_{0})/\varepsilon)}{\log(1/\rho)}.$$

This proves the diagonal-quadratic GD-surrogate sanity check stated in Corollary [2](https://arxiv.org/html/2605.06755#Thmtheorem2).
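To make the budget concrete, a quick back-of-the-envelope computation; the values of $\rho$, $K$, and the target ratio $\mathcal{L}(\theta_{0})/\varepsilon$ below are arbitrary assumptions chosen only for illustration:

```python
import math

rho, K, ratio = 0.96, 5, 1e3  # illustrative contraction, lookahead, and L(theta0)/eps
gd_steps = math.log(ratio) / math.log(1.0 / rho)   # plain-GD steps (one BP each)
gxpo_bp = 3.0 / (K + 1) * gd_steps                 # GXPO backward-pass budget bound
print(f"plain GD: ~{gd_steps:.0f} backward passes; GXPO: ~{gxpo_bp:.0f}")
```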
This result should be read only as a sanity check for the extrapolation formula. The exact identity relies on a global diagonal quadratic loss, deterministic plain GD, full extrapolation $\alpha=1$, exact geometric sums, and no active-set fallback or finite-precision stabilization. The implemented GXPO update uses the chosen actor optimizer and may use partial extrapolation, so the practical experiments should not be interpreted as being governed by this idealized rate. The purpose of the calculation is only to show that, when the coordinate-wise geometric model is exact, the GXPO extrapolation has the intended multi-step GD interpretation.
## Appendix B Geometric Extrapolation Diagnostics
We analyze GXPO behavior across training dynamics, compute normalization, and hyperparameter settings. Larger $k$ and $\alpha$ yield improved optimization efficiency, and the gains persist when measured against backward passes (Fig. [5](https://arxiv.org/html/2605.06755#A2.F5)), so they do not merely reflect increased compute. Peak performance exhibits a trade-off with time-to-peak and a stable high-performing region in the $\alpha$–$k$ landscape (Fig. [4](https://arxiv.org/html/2605.06755#A2.F4)), while varying $\tau$ (for $k=5$) shows consistent gains across training steps, wall-clock time, and backward passes (Fig. [6](https://arxiv.org/html/2605.06755#A2.F6)). Auxiliary metrics further reveal that larger $k$ and $\alpha$ lead to longer and more variable responses in tokens (Fig. [7](https://arxiv.org/html/2605.06755#A2.F7)), and GXPO diagnostics peak early and collapse after the GXPO-to-GRPO transition (Fig. [8](https://arxiv.org/html/2605.06755#A2.F8)), indicating that GXPO primarily affects early training dynamics. Retention ratios follow a similar pattern, increasing with $k$ and $\alpha$ during the active phase and dropping sharply at shutoff (Fig. [9](https://arxiv.org/html/2605.06755#A2.F9)).


Figure 4: GXPO ablations on Math-500 with Qwen2.5-1.5B. Left: peak Pass@16 versus time-to-peak across $(k,\alpha)$. Right: peak Pass@16 over the $\alpha$–$k$ grid.

Figure 5: Pass@16 (EMA) versus backward passes across $\alpha\in\{0.1,0.5,1\}$ and $k\in\{3,5,10\}$. Larger $k$ maintains a consistent advantage under compute normalization.

Figure 6: Pass@16 (EMA) for $k=5$ under $\tau\in\{0.7,1,1.5,2\}$ versus training steps (left), wall-clock time (center), and backward passes (right). Larger $\tau$ achieves higher accuracy across all views.

Figure 7: Mean response length (in tokens) versus training steps for $\alpha\in\{0.1,0.5,1\}$ and $k\in\{3,5,10\}$. Larger values of $k$ and $\alpha$ lead to longer responses and increased variability.

Figure 8: GXPO diagnostic metrics versus training steps for $k\in\{3,5,10\}$. Metrics peak during the GXPO-active phase and collapse after the GXPO-to-GRPO transition.

Figure 9: Retention ratio versus training steps across $\alpha\in\{0.1,0.5,1\}$ and $k\in\{3,5,10\}$. Retention increases during GXPO and drops sharply at shutoff.
## Appendix C Ablation Tables
This appendix provides compact ablation summaries for two model families. The first two tables use Llama3.2-3B and compare sampled pass@1 accuracy under matched backward-pass and wall-clock budgets. The next tables use Qwen2.5-1.5B on Math-500 and report the $\alpha$ sweep, surrogate-displacement diagnostics, and active-phase GXPO mechanism metrics. The final table reports KL and clipping diagnostics for the Llama3.2-3B runs.
### C.1 Iso-Backward-Pass Benchmark Comparison

Table [3](https://arxiv.org/html/2605.06755#A3.T3) compares Llama3.2-3B methods at matched total backward-pass budgets. The selective view reports sampled pass@1 accuracy on Math-500, GSM8K, and Minerva, where higher values indicate better budget-normalized reasoning performance.

Table 3: Sampled pass@1 accuracy at matched total backward-pass budgets for Llama3.2-3B. Selective view over Math-500, GSM8K, and Minerva.

| Method | $k$ | M-500 (BP=108) | GSM8K (BP=108) | Minerva (BP=108) | M-500 (BP=204) | GSM8K (BP=204) | Minerva (BP=204) | M-500 (BP=300) | GSM8K (BP=300) | Minerva (BP=300) | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GRPO | – | 32.91 | 66.84 | 12.50 | 33.91 | 67.06 | 8.18 | 33.85 | 66.84 | 9.77 | 36.87 |
| SFPO | 3 | 32.82 | 67.13 | 9.66 | 32.94 | 67.21 | 11.14 | 32.61 | 66.99 | 10.34 | 36.76 |
| SFPO | 5 | 32.24 | 66.68 | 11.25 | 33.05 | 66.47 | 11.48 | 32.90 | 67.02 | 10.91 | 36.89 |
| GXPO | 3 | 32.48 | 67.44 | 12.05 | 33.85 | 67.33 | 10.34 | 34.76 | 67.34 | 12.73 | 37.59 |
| GXPO | 5 | 33.38 | 67.22 | 10.80 | 34.23 | 67.35 | 11.25 | 35.54 | 67.24 | 11.02 | 37.56 |
### C.2 Iso-Wall-Clock Benchmark Comparison

Table [4](https://arxiv.org/html/2605.06755#A3.T4) repeats the Llama3.2-3B comparison at matched wall-clock checkpoints. This complements the backward-pass view by accounting for end-to-end runtime differences between GRPO, SFPO, and GXPO.

Table 4: Sampled pass@1 accuracy at matched wall-clock times for Llama3.2-3B. Selective view; GRPO ran for 14.5 hours and therefore has no 16-hour evaluation.

| Method | $k$ | M-500 (4h) | GSM8K (4h) | Minerva (4h) | M-500 (8h) | GSM8K (8h) | Minerva (8h) | M-500 (12h) | GSM8K (12h) | Minerva (12h) | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GRPO | – | 33.24 | 66.46 | 10.91 | 32.73 | 67.03 | 12.16 | 33.94 | 67.49 | 11.02 | 36.00 |
| SFPO | 3 | 32.99 | 67.13 | 10.80 | 32.61 | 66.99 | 10.34 | 34.19 | 67.20 | 11.93 | 36.58 |
| SFPO | 5 | 32.15 | 67.09 | 11.70 | 33.38 | 67.12 | 13.30 | 33.41 | 67.00 | 10.80 | 36.77 |
| GXPO | 3 | 33.69 | 67.22 | 12.73 | 34.67 | 67.50 | 12.39 | 35.30 | 67.48 | 11.59 | 38.06 |
| GXPO | 5 | 33.71 | 67.30 | 10.45 | 35.00 | 67.74 | 11.82 | 35.25 | 68.03 | 11.14 | 37.72 |
### C.3 Qwen2.5-1.5B on Math-500 Ablations

Tables [5](https://arxiv.org/html/2605.06755#A3.T5)–[7](https://arxiv.org/html/2605.06755#A3.T7) summarize Qwen2.5-1.5B ablations on Math-500 through step 300. The $\alpha$ table reports the best pass@16 checkpoint for each configuration. The diagnostic tables connect the method to Theorem [3](https://arxiv.org/html/2605.06755#Thmtheorem3): Table [6](https://arxiv.org/html/2605.06755#A3.T6) measures the displacement error of the geometric surrogate, and Table [7](https://arxiv.org/html/2605.06755#A3.T7) checks the active-phase conditions under which the approximation should hold. The results match the theory's expected regime: larger $k$ increases extrapolation and measured error, but the errors remain small; interpolation lowers the realized error at $\tilde{\theta}$; inactive fallback stays limited; and $\cos(g_{0},g_{\mathrm{slow}})$ remains close to $0.97$.
Table 5: Math-500 pass@16 sensitivity of GXPO to $\alpha$ on Qwen2.5-1.5B, computed from checkpoints up to step 300. Total BP and total hours are measured at step 300; best-pass columns record the earliest lowest-BP maximizer of pass@16.

| $k$ | $\alpha$ | Shutoff step | Total BP | Total hours | Step to best | BP to best | Hours to best | Math-500 pass@16 |
|---|---|---|---|---|---|---|---|---|
| 3 | 0.1 | 100 | 500 | 11.89 | 280 | 480 | 11.19 | 68.40 |
| 3 | 0.5 | 123 | 546 | 12.43 | 210 | 456 | 9.39 | 69.80 |
| 3 | 1.0 | 106 | 512 | 12.24 | 200 | 412 | 8.80 | 71.40 |
| 5 | 0.1 | 100 | 500 | 12.01 | 160 | 360 | 7.26 | 68.40 |
| 5 | 0.5 | 92 | 484 | 11.89 | 300 | 484 | 11.89 | 70.00 |
| 5 | 1.0 | 123 | 546 | 12.71 | 150 | 396 | 7.43 | 73.80 |
| 10 | 0.1 | 100 | 500 | 11.96 | 190 | 390 | 8.21 | 69.80 |
| 10 | 0.5 | 123 | 546 | 12.77 | 260 | 506 | 11.32 | 73.00 |
| 10 | 1.0 | 106 | 512 | 13.50 | 240 | 452 | 11.11 | 78.20 |
Table 6: Absolute surrogate-displacement diagnostics for GXPO on Qwen2.5-1.5B over checkpoints up to step 300. The first two columns measure the main quantities in Theorem [3](https://arxiv.org/html/2605.06755#Thmtheorem3): the error of the extrapolated point $\theta_{K}$ and the error after interpolation to $\tilde{\theta}$. Errors increase with $k$ but remain small, and interpolation consistently reduces the realized error. The displacement-cosine error is a directional diagnostic, not a term in the bound.

| $k$ | Med. $\theta_{K}$ abs. error | Med. $\tilde{\theta}$ abs. error | Med. disp. cosine err. | Med. active $\cos(g_{0},g_{\mathrm{slow}})$ |
|---|---|---|---|---|
| 3 | 8.386e-10 | 5.084e-10 | 6.245e-06 | 0.970 |
| 5 | 2.124e-09 | 1.189e-09 | 1.360e-05 | 0.970 |
| 10 | 5.789e-09 | 2.933e-09 | 5.610e-05 | 0.969 |
Table 7: Active-phase GXPO diagnostics from the Qwen2.5-1.5B $\alpha$ sweep. Medians are computed only over checkpoints where GXPO extrapolation is enabled. These statistics check whether the approximation assumptions are present in practice: gradient norms stay stable, active-set retention ratios remain bounded, scale grows with $k$, inactive-coordinate fallback is small, and the backward-pass cost remains fixed at three.

| $k$ | Policy passes | Med. active $\Vert g_{0}\Vert$ | Med. active $\Vert g_{1}\Vert$ | Med. active $\Vert g_{\mathrm{slow}}\Vert$ | Med. $\cos(g_{0},g_{\mathrm{slow}})$ | Retention ratio | $\Vert\Delta_{K}\Vert/\Vert\Delta_{2}\Vert$ | Scale mean | Inactive frac. |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 3 | 1.770e-02 | 1.772e-02 | 1.761e-02 | 0.971 | 0.519 ± 0.687 | 1.528 | 1.271 | 0.028 |
| 5 | 3 | 1.880e-02 | 1.886e-02 | 1.881e-02 | 0.970 | 0.535 ± 0.682 | 2.477 | 1.696 | 0.028 |
| 10 | 3 | 1.702e-02 | 1.693e-02 | 1.685e-02 | 0.971 | 0.627 ± 0.675 | 4.540 | 2.884 | 0.028 |
### C.4 KL and Clip Diagnostics

Table [8](https://arxiv.org/html/2605.06755#A3.T8) reports PPO/GRPO clipping and KL diagnostics over 300 Llama3.2-3B training steps. These diagnostics check whether GXPO repositioning causes unusually frequent clipping or large KL activation.

Table 8: KL and clipping diagnostics over 300 Llama3.2-3B training steps. Clip fractions remain close across GRPO, SFPO, and GXPO, indicating that GXPO repositioning does not substantially increase PPO/GRPO clipping. KL penalties are higher for GXPO at larger $k$ but remain small in absolute value.

| Method | $k$ | Mean clip fraction ↓ | Max clip fraction ↓ | Mean KL penalty ↓ | Max KL penalty ↓ |
|---|---|---|---|---|---|
| GRPO | – | 7.02e-4 | 9.15e-4 | 2.47e-7 | 4.63e-7 |
| SFPO | 3 | 6.93e-4 | 8.43e-4 | 1.88e-7 | 6.10e-7 |
| SFPO | 5 | 6.89e-4 | 8.67e-4 | 2.74e-7 | 9.03e-7 |
| GXPO | 3 | 6.85e-4 | 8.73e-4 | 5.41e-7 | 2.53e-6 |
| GXPO | 5 | 6.91e-4 | 8.74e-4 | 1.05e-6 | 4.81e-6 |