Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

arXiv cs.LG 07/03/26, 04:00 AM Papers
Summary
This paper introduces FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage function that dynamically schedules gradient weights during RL post-training of LLMs, achieving faster convergence and better accuracy-diversity trade-offs compared to static baselines.
arXiv:2607.01490v1 Announce Type: new Abstract: Reinforcement learning post-training dramatically improves LLM reasoning, but suffers from training instability and diversity collapse. Advantage functions offer an appealing fix: they reshape the training objective, reweight which rollouts drive learning, and are trivial to implement. Yet a proliferation of methods makes it unclear which advantage to use and when. We cut through the confusion with a unifying framework that decomposes any advantage into its positive and negative gradient mass along two orthogonal axes. On the sign axis, imbalanced updates collapse either entropy or weight geometry. On the difficulty axis, hard-problem focus sharpens signal but costs sample size. Both trade-offs shift during training: exploration favors balance and hard focus; exploitation favors suppression and medium focus. This motivates FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage that reads training dynamics to schedule the gradient weight automatically. FADE reaches peak pass@1 20k steps earlier than the best static baseline at the 7B scale and 2k steps earlier at the 32B scale, while achieving the best accuracy-diversity trade-off across all pass@k on LiveCodeBench and AIME.
# Don’t Let Gains FADE: Breaking Down Policy Gradient Weights in RL
Source: [https://arxiv.org/html/2607.01490](https://arxiv.org/html/2607.01490)
1 FAIR at Meta; 2 Inria, Ecole Normale Supérieure; \* Work done at Meta, now working at the University of California, San Diego

Sean O’Brien, Francis Bach, Gabriel Synnaeve, Taco Cohen. Contact: jdecugis@meta.com

\(July 1, 2026\)

###### Abstract

Reinforcement learning post-training dramatically improves LLM reasoning, but suffers from training instability and diversity collapse. Advantage functions offer an appealing fix: they reshape the training objective, reweight which rollouts drive learning, and are trivial to implement. Yet a proliferation of methods makes it unclear which advantage to use and when. We cut through the confusion with a unifying framework that decomposes any advantage into its positive and negative gradient mass ($m_S$, $m_F$) along two orthogonal axes. On the sign axis, imbalanced updates collapse either entropy or weight geometry. On the difficulty axis, hard-problem focus sharpens signal but costs sample size. Both trade-offs shift during training: exploration favors balance and hard focus; exploitation favors suppression and medium focus. This motivates FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage that reads training dynamics to schedule the gradient weight automatically. FADE reaches peak pass@1 20k steps earlier than the best static baseline at the 7B scale and 2k steps earlier at the 32B scale, while achieving the best accuracy-diversity trade-off across all pass@$k$ on LiveCodeBench and AIME.

Correspondence: Juliette Decugis at jdecugis@meta.com

![Refer to caption](https://arxiv.org/html/2607.01490v1/x1.png)
Figure 1: FADE learns faster and better for all pass@k on LiveCodeBench v6 when compared to GRPO and the best static advantages Power $\alpha$ and Asymmetric GRPO with the best $\delta$ per model.

## 1 Introduction

Recent developments in reinforcement learning with verifiable rewards (RLVR) have unlocked rapid gains in language models’ capabilities, especially in easily verifiable domains like code generation and mathematics (OpenAI, [2024](https://arxiv.org/html/2607.01490#bib.bib49); Guo et al., [2025](https://arxiv.org/html/2607.01490#bib.bib22); Shao et al., [2024](https://arxiv.org/html/2607.01490#bib.bib57); Liu et al., [2025a](https://arxiv.org/html/2607.01490#bib.bib41)). Although sparse rewards over long sequences make credit assignment challenging (Minsky, [1961](https://arxiv.org/html/2607.01490#bib.bib44); Sutton, [1988](https://arxiv.org/html/2607.01490#bib.bib60); Zhang, [2026](https://arxiv.org/html/2607.01490#bib.bib75)), pretrained LLMs provide strong behavioral priors (Gan and Isola, [2026](https://arxiv.org/html/2607.01490#bib.bib18); Yan et al., [2025](https://arxiv.org/html/2607.01490#bib.bib71)) and fully resettable environments enable parallel rollout collection. These methods therefore follow a common recipe: sample multiple rollouts per problem, score them with a binary verifier, and update the policy via a weighted policy gradient (Williams, [1992](https://arxiv.org/html/2607.01490#bib.bib70); Schulman et al., [2015](https://arxiv.org/html/2607.01490#bib.bib55)). The weights, commonly called “advantage functions,” rarely correspond to the classical advantage (value of an action minus the value of the average action in a state); they simply determine how much each rollout contributes to the gradient. To avoid this overloading, we use the term *policy weights* throughout this paper.

Since GRPO (Shao et al., [2024](https://arxiv.org/html/2607.01490#bib.bib57)), which uses the mean reward as a baseline for updates, a host of alternative policy weights have appeared: DAPO (Yu et al., [2025](https://arxiv.org/html/2607.01490#bib.bib74)), DR-GRPO (Liu et al., [2025a](https://arxiv.org/html/2607.01490#bib.bib41)), pass@$k$-based objectives (Tang et al., [2025](https://arxiv.org/html/2607.01490#bib.bib64); Chen et al., [2025](https://arxiv.org/html/2607.01490#bib.bib11)), log-mean-exp weighting (Jiang et al., [2025](https://arxiv.org/html/2607.01490#bib.bib27)), and more. Each claims improvements, yet comparing them is difficult because they differ along multiple axes simultaneously. Consider for example the pass@8 normalization (Tang et al., [2025](https://arxiv.org/html/2607.01490#bib.bib64)), which only upweights a correct rollout when it is the sole success in a batch of eight. This simultaneously shifts gradient mass toward hard problems, drops all negative gradient signal since incorrect rollouts receive zero weight, and reduces the overall gradient magnitude because most batches contain either zero or more than one success. Other methods such as Skew-R (Thrampoulidis et al., [2025](https://arxiv.org/html/2607.01490#bib.bib66)) keep the sign balance of GRPO ($\mathbb{E}[A]=0$) but emphasize high-variance samples. When these methods underperform or outperform GRPO, it is unclear which change is responsible.

We argue that the confusion stems from conflating three orthogonal design axes. Similar to Thrampoulidis et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib66)), we decompose policy weights into positive mass $m_S$ and negative mass $m_F$ on the gradient (Section [2](https://arxiv.org/html/2607.01490#S2)), both of which depend on a prompt’s solve rate $p$. We show policy weights can differ along:

1. Difficulty axis: whether gradient mass peaks on easy prompts (high $p$) or hard ones (low $p$);
2. Sign axis: whether the positive and negative masses are equal or not;
3. Scale axis: the overall magnitude of the gradient, which implicitly rescales the learning rate.

We identify three trade\-offs driven by the representational asymmetry between correct and incorrect trajectories:

- Reinforcing successes collapses entropy. Because correct solutions cluster tightly, amplifying them concentrates the policy onto a narrow mode, with the drift rate predictable from the sign ratio alone (Section [4.1](https://arxiv.org/html/2607.01490#S4.SS1)).
- Suppressing failures induces rank-1 update collapse. Because failures are diverse and decorrelated, amplifying them drives the weight update toward a single suppression direction, progressively blocking multi-dimensional learning (Section [4.2](https://arxiv.org/html/2607.01490#S4.SS2)).
- Harder problems trade information for variance. Focusing gradient mass on low-solve-rate prompts yields more informative updates, but at the cost of more variance (Section [4.3](https://arxiv.org/html/2607.01490#S4.SS3)).

Since a fixed advantage cannot adapt to all three trade-offs during training, we propose FADE (Focal Advantage with Dynamic Entropy), which shapes its gradient weight based on the policy’s past entropies and solve rates. It achieves fast early learning with sustained diversity and accuracy across model scales (7B, 32B) (Section [5](https://arxiv.org/html/2607.01490#S5)).

## 2 Framework for Policy Weight Analysis

We view the LLM as a policy $\pi_\theta$ that generates a trajectory of tokens $\tau := (a_1, \ldots, a_T)$ given a prompt $q$, with $\log \pi_\theta(\tau) = \sum_{t=1}^{T} \log \pi_\theta(a_t \mid q, a_{<t})$. In LLM post-training, the reward $r(\tau)$ is typically a single scalar assigned at the end of the trajectory (e.g., correctness of the final answer), which we maximize using the policy gradient estimator (Williams, [1992](https://arxiv.org/html/2607.01490#bib.bib70)): $\nabla_\theta \mathbb{E}_{\tau\sim\pi_\theta}[r(\tau)] = \mathbb{E}_{\tau\sim\pi_\theta}[r(\tau)\,\nabla_\theta \log \pi_\theta(\tau)]$.

Compared to supervised learning, the model learns from its own generations reweighted by a verifier, which can introduce noise. To reduce variance (Schulman et al., [2015](https://arxiv.org/html/2607.01490#bib.bib55)), the reward is replaced by a weight function $W(\tau)$, originally the advantage $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ (Baird, [1993](https://arxiv.org/html/2607.01490#bib.bib6)), and more recently a variety of alternatives based on the average solve rate per problem (Table [1](https://arxiv.org/html/2607.01490#S2.T1)). We will show that balancing between positive and negative gradient mass is the key to stable, efficient and diverse RL at scale.

### 2.1 Positive and Negative Gradient Mass

We will assume that rewards are binary, with $r(\tau) \in \{0,1\}$, and that the policy weight is a function of success/failure only. In this setting, we can consider the sets of successful and unsuccessful trajectories $S = r^{-1}(1)$ and $F = r^{-1}(0)$. Since $r$ is deterministic and binary-valued, the joint distribution is $\mathbb{P}(r=1, \tau) = \mathbb{P}(r=1 \mid \tau)\,\pi_\theta(\tau) = \mathbb{I}[\tau \in S]\,\pi_\theta(\tau)$, and similarly for $r=0$. Hence the probability of success is given by $p = \sum_{\tau \in S} \pi_\theta(\tau)$ and that of failure by $q = 1-p = \sum_{\tau \in F} \pi_\theta(\tau)$, and we write the policy weight as:

$$W(\tau) = w_S \cdot \mathbb{I}[\tau \in S] - w_F \cdot \mathbb{I}[\tau \in F]. \tag{1}$$

Adapting the notation from previous work (Thrampoulidis et al., [2025](https://arxiv.org/html/2607.01490#bib.bib66)), we can write the policy gradient as

$$\nabla_\theta J = w_S \cdot p \cdot \mathbb{E}\left[\nabla_\theta \log \pi_\theta(\tau) \;\middle|\; \tau \in S\right] - w_F \cdot q \cdot \mathbb{E}\left[\nabla_\theta \log \pi_\theta(\tau) \;\middle|\; \tau \in F\right]. \tag{2}$$

In practice, when doing online reinforcement learning we don’t have access to the true solve rate $p$ but estimate it via $G$ Monte Carlo rollouts from our policy $\pi_\theta$:

$$\nabla_\theta J \approx \underbrace{w_S \cdot \bar{p}}_{\bar{m}_S}\; \underbrace{\frac{1}{|\bar{S}|} \sum_{\tau \in \bar{S}} \nabla_\theta \log \pi_\theta(\tau)}_{\bar{\nabla}_S} - \underbrace{w_F \cdot \bar{q}}_{\bar{m}_F}\; \underbrace{\frac{1}{|\bar{F}|} \sum_{\tau \in \bar{F}} \nabla_\theta \log \pi_\theta(\tau)}_{\bar{\nabla}_F}, \tag{3}$$

where $\bar{S}$ and $\bar{F}$ are the positive and negative trajectories in $G$, and $\bar{p} = |\bar{S}|/|G|$ and $\bar{q} = |\bar{F}|/|G|$ are the empirical estimates of the success and failure probability, respectively. Here $\bar{\nabla}_S$, $\bar{\nabla}_F$ are the average log-probability gradients over successful and failed trajectories in the batch. For notational convenience, we will often leave out the bar when discussing estimates when it can be inferred from context.

![Refer to caption](https://arxiv.org/html/2607.01490v1/x2.png)
Figure 2: Effect of Policy Weights on Batch Level Gradients based on the estimated solve rate $p$. (Left) Existing methods: mixing pass@$k$ (Chen et al., [2025](https://arxiv.org/html/2607.01490#bib.bib11)) and W-REINFORCE (Zhu et al., [2025](https://arxiv.org/html/2607.01490#bib.bib77)) modify the sign, difficulty focus and scale of advantages. We design weights that isolate (Middle) the difficulty axis with the Power $\alpha$ series $p(1-p)^{\alpha}$ and (Right) the sign axis with the AsymGRPO series, where the mass is $\frac{p(1-p)}{\delta}$ for failed trajectories.

### 2.2 GRPO as an Example

Consider a group of $G$ rollouts per prompt, and take as policy weight the reward minus the mean (mean-only GRPO; Liu et al., [2025a](https://arxiv.org/html/2607.01490#bib.bib41)). With $r \in \{0,1\}$, we have $\mathbb{E}[r] = \bar{p}$, so successful trajectories receive $A_s = 1 - \bar{p}$ and failed ones $A_f = 0 - \bar{p}$, yielding

$$
\begin{aligned}
\nabla_{\mathrm{GRPO}} &= \sum_{\tau \in \bar{S}} A_s \nabla \log \pi_\theta(\tau) + \sum_{\tau \in \bar{F}} A_f \nabla \log \pi_\theta(\tau) \\
&= (1-p) \sum_{\tau \in \bar{S}} \nabla \log \pi_\theta(\tau) - p \sum_{\tau \in \bar{F}} \nabla \log \pi_\theta(\tau) && \text{(4)} \\
&= |\bar{S}|(1-p)\, \underbrace{\tfrac{1}{|\bar{S}|} \sum_{\tau \in \bar{S}} \nabla \log \pi_\theta(\tau)}_{\bar{\nabla}_S} - |\bar{F}|\, p\, \underbrace{\tfrac{1}{|\bar{F}|} \sum_{\tau \in \bar{F}} \nabla \log \pi_\theta(\tau)}_{\bar{\nabla}_F} && \text{(5)} \\
&\approx G p (1-p)\, \bar{\nabla}_S - G (1-p)\, p\, \bar{\nabla}_F \\
&= G p (1-p) \bigl[\bar{\nabla}_S - \bar{\nabla}_F\bigr]. && \text{(6)}
\end{aligned}
$$

Mean-based GRPO reweights gradients by the estimated variance of rewards (Suk and Duan, [2025](https://arxiv.org/html/2607.01490#bib.bib59)). It focuses on medium-difficulty problems since $\arg\max_{p} p(1-p) = \frac{1}{2}$ (Figure [2](https://arxiv.org/html/2607.01490#S2.F2)). Similarly, many recent policy weights depend only on the average solve rate per batch, so we can define their positive and negative mass as functions of $\bar{p}$ (see examples in Table [1](https://arxiv.org/html/2607.01490#S2.T1), complete derivations in Appendix [8](https://arxiv.org/html/2607.01490#S8)). We distinguish sign-balanced advantages ($m_S = m_F$) and sign-biased methods ($m_S \neq m_F$).
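
As a sanity check on Eqs. (4) to (6), the following minimal sketch compares the direct sum of advantage-weighted gradients with the factored form $G\,p(1-p)\,[\bar{\nabla}_S - \bar{\nabla}_F]$ for one group. The function names and the random vectors standing in for $\nabla \log \pi_\theta(\tau)$ are illustrative assumptions, not part of the paper's code. When $p$ is taken to be the empirical solve rate $\bar{p}$, the two forms agree exactly.

```python
import random

def grpo_gradient(rewards, grads):
    """Eq. (4): sum_i (r_i - mean(r)) * g_i over one group of rollouts."""
    p_bar = sum(rewards) / len(rewards)
    dim = len(grads[0])
    return [sum((r - p_bar) * g[d] for r, g in zip(rewards, grads)) for d in range(dim)]

def mass_form_gradient(rewards, grads):
    """Eq. (6): G * p(1-p) * (mean success gradient - mean failure gradient)."""
    G = len(rewards)
    p_bar = sum(rewards) / G
    dim = len(grads[0])
    succ = [g for r, g in zip(rewards, grads) if r == 1]
    fail = [g for r, g in zip(rewards, grads) if r == 0]
    if not succ or not fail:
        # All rollouts share the same reward, so every advantage is zero.
        return [0.0] * dim
    mean_s = [sum(g[d] for g in succ) / len(succ) for d in range(dim)]
    mean_f = [sum(g[d] for g in fail) / len(fail) for d in range(dim)]
    scale = G * p_bar * (1.0 - p_bar)
    return [scale * (mean_s[d] - mean_f[d]) for d in range(dim)]

random.seed(0)
rewards = [1, 0, 0, 1, 0, 0, 0, 1]  # hypothetical group with G = 8
# Random vectors stand in for the per-rollout gradients of log pi_theta.
grads = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in rewards]

direct = grpo_gradient(rewards, grads)
factored = mass_form_gradient(rewards, grads)
max_gap = max(abs(a - b) for a, b in zip(direct, factored))
print("forms agree:", max_gap < 1e-9)
```

Groups in which every rollout succeeds or every rollout fails contribute nothing under this weight, which is why both functions return a zero vector in that case.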

Table 1: Weight functions as positive mass $m_S$ vs. negative mass $m_F$ on our policy gradients, where our final update consists of $m_S \cdot \bar{\nabla}_S - m_F \cdot \bar{\nabla}_F$. We assume binary rewards for successful vs. failed trajectories and use $r_i \in \{0,1\}$ for simplicity (one can generalize to other ranges with a multiplier), so $\mathbb{E}[r] = \hat{p}$. Notation: expected number of correct $p$ and incorrect $q := 1-p$ samples per batch.

## 3 Experimental Setup

The framework above shows that each advantage function induces a different balance of positive and negative gradient mass, but it does not predict which balance leads to the best policies. To answer this, we compare advantage functions along different weight axes within the PPO clipping framework (Schulman et al., [2017](https://arxiv.org/html/2607.01490#bib.bib56)) using two model sizes: the Qwen 2.5 7B (Hui et al., [2024](https://arxiv.org/html/2607.01490#bib.bib26)), and the Code World Model 32B SFT checkpoint (FAIR CodeGen team et al., [2025](https://arxiv.org/html/2607.01490#bib.bib17)) (CWM 32B). All evaluations use temperature 1.0 and top-p 1.0 to promote sampling diversity, at reasoning budgets from 8k to 30k tokens.

Reasoning SFT. The Qwen 2.5 7B model lacks chain-of-thought capability, so we first fine-tune it on a mix of reasoning chains including OpenCodeReasoning-2 (Ahmad et al., [2025](https://arxiv.org/html/2607.01490#bib.bib1)) and OpenMathReasoning (Moshkov et al., [2025](https://arxiv.org/html/2607.01490#bib.bib47)) generated by DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2607.01490#bib.bib22)) using the same <think\> tag formatting. CWM 32B is already trained to produce long chain-of-thought responses, so we skip the supervised fine-tuning (SFT).

RL training. We train with binary rewards (format and answer correctness) on 25,000 competitive programming problems including the CodeContest (Li et al., [2022](https://arxiv.org/html/2607.01490#bib.bib37)) and TACO (Li et al., [2023](https://arxiv.org/html/2607.01490#bib.bib36)) training sets. The dataset is fixed throughout training and epoched over: for Qwen 2.5 7B we use the full problem mix (initial solve rate $\approx 0.3$), while for CWM 32B we filter out easy problems to start at a similar difficulty frontier ($\approx 0.5$ solve rate). In this work we do not modify the training data distribution and instead focus on maximizing gradient learning through the policy weight given a fixed dataset. See Appendix [11.1](https://arxiv.org/html/2607.01490#S11.SS1) for infrastructure details.

Starting from the same SFT checkpoint, we train models with different advantage functions and analyze policies along four complementary axes: accuracy (pass@1 (Chen et al., [2021](https://arxiv.org/html/2607.01490#bib.bib10))), diversity (pass@100), reasoning generalization on AIME 2024/2025 math competitions (OpenAI, [2024](https://arxiv.org/html/2607.01490#bib.bib49)), a task unseen during RL training, and learning speed. Full results across all methods, models, and benchmarks are in Tables [4](https://arxiv.org/html/2607.01490#S11.T4) and [5](https://arxiv.org/html/2607.01490#S11.T5) in Appendix [11.2](https://arxiv.org/html/2607.01490#S11.SS2).

## 4 Where to learn from? Balancing gradient signs and problem difficulty

Should we focus on reinforcing successes or suppressing failures within a batch? And on easy, medium, or hard problems across batches? During online RL, we are simultaneously doing gradient descent to downweight failed trajectories and gradient ascent to upweight successful trajectories. We analyze how to balance reinforcing successes (Section [4.1](https://arxiv.org/html/2607.01490#S4.SS1)), suppressing failures (Section [4.2](https://arxiv.org/html/2607.01490#S4.SS2)), and adjusting the focus per difficulty (Section [4.3](https://arxiv.org/html/2607.01490#S4.SS3)) to reach $+14\%$ pass@1 in $2\times$ fewer training steps than the default reward weight (REINFORCE; Sutton, [1988](https://arxiv.org/html/2607.01490#bib.bib60)).

### 4.1 Reinforcing Successes Collapses Entropy

Takeaway: Entropy collapse is proportional to the sign ratio and success rates $\Rightarrow$ Bias toward successes only at low solve rates.

We introduce AsymGRPO, a single-parameter variant of GRPO that keeps the same positive mass $m_S = p(1-p)$ and rescales the negative mass by $\delta$, $m_F = \frac{p(1-p)}{\delta}$. With this $\delta$ knob we can either amplify or downweight failures; with $\delta = 1$ we recover standard mean-based GRPO. Because correct solutions are few and similar, amplifying them ($\delta > 1$) rapidly concentrates the policy onto a narrow set of actions. Adapting the analysis of Cui et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib14)) (Appendix [13](https://arxiv.org/html/2607.01490#S13)), a first-order Taylor expansion of the entropy after a gradient step with learning rate $\eta$ is:

$$\Delta\mathcal{H} \approx \eta \left[ \underbrace{(m_S - m_F)\, \mathcal{H}}_{\text{entropy drift}} - \operatorname{Cov}(A, \log \pi_\theta) \right] + O(\eta^{2}). \tag{7}$$

The covariance term drives entropy loss for all methods; it is the standard mechanism studied in prior work (Cui et al., [2025](https://arxiv.org/html/2607.01490#bib.bib14)). The drift term, however, is unique to sign-imbalanced advantages where $m_S \neq m_F$ and introduces entropy-proportional feedback. Under AsymGRPO this feedback is controlled entirely by $\delta$:

- $\delta = 1$: drift is zero, entropy collapses through the covariance term alone, with no mechanism to recover.
- $\delta > 1$: drift is positive ($m_S > m_F$), accelerating collapse beyond what the covariance predicts. Entropy loss feeds into itself since the drift is proportional to $\mathcal{H}$.
- $\delta < 1$: drift is negative ($m_S < m_F$), providing a restoring force that slows entropy loss.

Empirically we find entropy correlates tightly with the advantage ratio $\frac{A_s}{A_f} := \frac{m_S \times p}{m_F \times q}$ (Figure [4](https://arxiv.org/html/2607.01490#S4.F4)), so it can be steered by changing the training data (adjusting average solve rates $p$) or by rebalancing gradient mass between reinforcing and suppressing ($m_S$ vs. $m_F$). Since LLMs are prone to entropy collapse in online RL (Park et al., [2025](https://arxiv.org/html/2607.01490#bib.bib50); Khatri et al., [2025](https://arxiv.org/html/2607.01490#bib.bib30)), $\delta < 1$ delays overfitting and accelerates exploration at the start of training (Figure [3](https://arxiv.org/html/2607.01490#S4.F3)). However, as we show next, pushing $\delta$ too far below $1$ trades one pathology for another.
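
The AsymGRPO quantities used in this section can be tabulated with a few lines of Python. This is a minimal sketch under the definitions above ($m_S = p(1-p)$, $m_F = p(1-p)/\delta$, advantage ratio $\frac{m_S \times p}{m_F \times q}$); the function names are hypothetical and it only evaluates the closed-form expressions, not any training dynamics.

```python
def asym_grpo_masses(p, delta):
    """Positive and negative gradient mass of AsymGRPO at solve rate p."""
    m_s = p * (1.0 - p)
    m_f = p * (1.0 - p) / delta
    return m_s, m_f

def drift_coefficient(p, delta):
    """m_S - m_F, the factor multiplying the entropy in the drift term of Eq. (7)."""
    m_s, m_f = asym_grpo_masses(p, delta)
    return m_s - m_f

def advantage_ratio(p, delta):
    """(m_S * p) / (m_F * q) with q = 1 - p; requires 0 < p < 1."""
    m_s, m_f = asym_grpo_masses(p, delta)
    return (m_s * p) / (m_f * (1.0 - p))

# Hypothetical sweep at a solve rate of 0.3.
for delta in (0.5, 1.0, 2.0):
    print(delta, drift_coefficient(0.3, delta), advantage_ratio(0.3, delta))
```

For AsymGRPO the ratio simplifies to $\delta \cdot p / (1-p)$, so it can be moved either through $\delta$ or through the solve rate of the training data, matching the two levers named in the text.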

![Refer to caption](https://arxiv.org/html/2607.01490v1/x3.png)
Figure 3: Scaling negative gradients of mean GRPO ($\delta < 1$: up, $\delta > 1$: down) on Qwen 2.5 7B. (Left) For AsymGRPO with $\delta = 0.5$, correct samples are far more correlated than failures, whose pairwise residual correlation $\rho_{\perp}$ is $2\times$ below the value predicted by the higher-rank residual $R = \mathbb{E}[A_i v_i \otimes h_i^{\perp}]$ (Appendix [15](https://arxiv.org/html/2607.01490#S15)). (Middle) Over-reinforcing ($\delta > 1$) causes entropy collapse; (Right) over-suppressing ($\delta < 1$) causes rank-1 update collapse.

![Refer to caption](https://arxiv.org/html/2607.01490v1/x4.png)
Figure 4: The policy’s entropy is not correlated with pass@100 or learning speed; instead, it is proportional to the advantage sign: $\frac{m_S \cdot p}{m_F \cdot (1-p)}$, with $m_S$, $m_F$ the mass of successful and failed trajectories respectively, and $p$ our solve rate.

### 4.2 Suppressing failures induces rank-1 update collapse

Takeaway: Failure bias learns fast but collapses the update to rank-1 $\Rightarrow$ Reserve failure bias for late-stage exploitation.

Failure-biased methods (our own AsymGRPO with $\delta < 1$, Asymmetric Power $\alpha$, and existing methods such as AsymNorm (Arnal et al., [2026](https://arxiv.org/html/2607.01490#bib.bib5))) maintain high entropy and show fast early reward and pass@1. Yet the gains are fragile: reward eventually degrades, answer diversity (pass@100) drops, and GRPO catches up (Tables [4](https://arxiv.org/html/2607.01490#S11.T4) and [5](https://arxiv.org/html/2607.01490#S11.T5)). What makes learning from failures unreliable?

Analyzing the weight change $W_{\Delta} = W_{\mathrm{rl}} - W_{\mathrm{sft}}$ throughout training (Appendix [15](https://arxiv.org/html/2607.01490#S15)), we find that all methods start rank-1 dominant in the output weight change (using SVD analysis on $W_{\Delta}$). Sign-balanced and success-biased methods gradually escape (Figure [5](https://arxiv.org/html/2607.01490#S4.F5)), while failure-biased methods ($\delta < 1$) remain locked in rank-1, with RL change concentrating almost entirely in the output head (up to 90% of $\|W_{\Delta}\|_{2}$ at 7B, Table [9](https://arxiv.org/html/2607.01490#S15.T9), Figure [3](https://arxiv.org/html/2607.01490#S4.F3)). We call this the *rank-1 funnel*: it enables fast early exploitation but progressively blocks further learning as the model can only update along one axis, ultimately reducing diversity (pass@100) and out-of-distribution generalization on AIME 2024/2025 (Tables [4](https://arxiv.org/html/2607.01490#S11.T4) and [5](https://arxiv.org/html/2607.01490#S11.T5)).

What causes this rank-1 funnel? We formalize it by decomposing the output-head gradient of the RL change $W_{\Delta}$ into a rank-1 signal and a higher-rank residual (see details in Appendix [15.1](https://arxiv.org/html/2607.01490#S15.SS1)):

$$W_{\Delta} \;=\; \sum_{i=1}^{N} A_i\, v_i \otimes h_i \;=\; \underbrace{\Bigl(\sum_{i=1}^{N} A_i \alpha_i\, v_i\Bigr) \otimes u_1}_{M_1\;\text{(rank 1)}} \;+\; \underbrace{\sum_{i=1}^{N} A_i\, v_i \otimes h_i^{\perp}}_{M_2\;\text{(higher rank)}}, \tag{8}$$

where $\alpha_i = h_i^{\top} u_1$ is the coefficient of $h_i$ along $u_1$, so that $h_i = \alpha_i u_1 + h_i^{\perp}$. We measure collapse via $r_1 = \frac{\sigma_1^{2}(W_{\Delta})}{\|W_{\Delta}\|_{F}^{2}}$, the fraction of the update’s energy in its leading singular direction. Collapse arises from two conditions:

1. Per-step: We project each hidden state $h_i$ onto the leading shared direction $u_1$ and measure how correlated the residuals $h_i^{\perp} = h_i - (h_i^{\top} u_1) u_1$ are across samples. The average pairwise correlation of these residuals, $\rho_{\perp}$, controls the higher-rank term $M_2$: when $\rho_{\perp} \to 0$ the residuals are mutually uncorrelated and their weighted sum cancels out, so $M_2$ vanishes and $r_1 \to 1$ (Appendix [15.1](https://arxiv.org/html/2607.01490#S15.SS1)). Empirically, failure hidden states are far more diverse than correct ones ($\rho_{\perp,\text{fail}} \ll \rho_{\perp,\text{correct}}$, Appendix [15](https://arxiv.org/html/2607.01490#S15)): summing over many uncorrelated failure residuals leaves only the shared “suppress non-code tokens” direction $u_1$. Setting $\delta < 1$ amplifies this noisy failure mass while downweighting the correlated successes, pushing $M_2$ toward zero and $r_1$ toward $1$ (Figure [3](https://arxiv.org/html/2607.01490#S4.F3)).
2. Across steps: If each per-step gradient aligns in the same direction, the accumulated $W_{\Delta}$ is itself rank-1. Empirically, cosine similarity between successive batch gradients converges to $\approx 1$ from step 600 on for negatively biased methods, turning per-step dominance into cumulative collapse. In contrast, sign-balanced methods have lower alignment or switch rank-1 direction (Appendix [15.3](https://arxiv.org/html/2607.01490#S15.SS3)).
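
The collapse metric $r_1 = \sigma_1^{2}(W_{\Delta}) / \|W_{\Delta}\|_{F}^{2}$ can be illustrated on small matrices. The sketch below is an assumption-laden stand-in for the SVD analysis described above: it estimates $\sigma_1^{2}$ by power iteration on $W^{\top}W$ using only the standard library, the helper names are hypothetical, and the fixed start vector is assumed not to be orthogonal to the leading right singular vector.

```python
import math

def matvec(W, v):
    """Return W @ v for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rmatvec(W, u):
    """Return W^T @ u."""
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(W[0]))]

def rank1_energy_fraction(W, iters=200):
    """Estimate r_1 = sigma_1^2 / ||W||_F^2 via power iteration on W^T W."""
    frob_sq = sum(x * x for row in W for x in row)
    if frob_sq == 0.0:
        return 0.0
    # Assumption: this start vector has a nonzero component along the
    # leading right singular vector of W.
    v = [float(j + 1) for j in range(len(W[0]))]
    for _ in range(iters):
        nxt = rmatvec(W, matvec(W, v))
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        v = [x / norm for x in nxt]
    # Rayleigh quotient of W^T W at v approximates sigma_1^2.
    v_norm_sq = sum(x * x for x in v)
    sigma1_sq = sum(x * x for x in matvec(W, v)) / v_norm_sq
    return sigma1_sq / frob_sq

rank_one = [[1.0, 0.5], [2.0, 1.0], [3.0, 1.5]]  # outer product of two vectors
spread = [[3.0, 0.0], [0.0, 1.0]]                # energy split over two directions
print(rank1_energy_fraction(rank_one), rank1_energy_fraction(spread))
```

A value near 1 means almost all of the update's energy lies in one singular direction, which is the regime the text calls the rank-1 funnel. For real weight matrices a library SVD would be the practical choice; power iteration is used here only to keep the example dependency-free.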

![Refer to caption](https://arxiv.org/html/2607.01490v1/x5.png)

![Refer to caption](https://arxiv.org/html/2607.01490v1/x6.png)

Figure 5: (Left) Failure-biased methods converge to a lower-rank weight update than sign-balanced methods, as shown by the singular-value ratio $s_1/s_2$ of the output head of $W_{\Delta} = W_{\mathrm{rl}} - W_{\mathrm{sft}}$ and its $\mathrm{srank}_{\delta} = \min\{k : \sum_{i=1}^{k} s_i \,/\, \sum_{i=1}^{d} s_i \geq 1 - \delta\}$ (Kumar et al., [2020](https://arxiv.org/html/2607.01490#bib.bib34)). (Right) Pass@100 on LCB v6 as a function of the difficulty focus in policy weights.
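
The $\mathrm{srank}_{\delta}$ definition in the caption of Figure 5 can be read as a cumulative-share threshold on the singular values. A minimal sketch, assuming the singular values are already available as a list; the function names and the default threshold of 0.01 are illustrative assumptions, and this $\delta$ is the srank threshold, not the AsymGRPO parameter.

```python
def srank(singular_values, delta=0.01):
    """Smallest k such that the top-k singular values hold at least a
    (1 - delta) share of the sum of all singular values."""
    s = sorted(singular_values, reverse=True)
    total = sum(s)
    if total == 0:
        return 0
    running = 0.0
    for k, value in enumerate(s, start=1):
        running += value
        if running / total >= 1.0 - delta:
            return k
    return len(s)

def leading_ratio(singular_values):
    """Ratio s_1 / s_2 of the two largest singular values (needs s_2 > 0)."""
    s = sorted(singular_values, reverse=True)
    return s[0] / s[1]

collapsed = [10.0, 0.05, 0.03]      # hypothetical near rank-1 spectrum
balanced = [1.0, 1.0, 1.0, 1.0]     # hypothetical flat spectrum
print(srank(collapsed), leading_ratio(collapsed))
print(srank(balanced), leading_ratio(balanced))
```

A low srank together with a large $s_1/s_2$ indicates that the update is dominated by a single direction.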
### 4.3 Harder problems trade information for variance

Takeaway: Hard problems yield more informative gradients but fewer usable batches $\Rightarrow$ Focus on hard problems early, relax toward medium difficulty as solve rates rise.

We showed that biasing batch updates towards reinforcing or suppressing trajectories can lead to overexploitation in either the policy or the weight space. So far, we have only reweighted successes and failures based on the weight sign, but in practice some batches deserve more gradient attention than others. For example, our policy should explore new ways of reasoning to solve harder problems rather than overfit to the easy ones. Approximating the $k$ vs. pass@$k$ curve as $\text{pass@}k = \exp(-a(k+c)^{-b})$ (Brown et al., [2024](https://arxiv.org/html/2607.01490#bib.bib8); Schaeffer et al., [2025](https://arxiv.org/html/2607.01490#bib.bib54), see also Appendix [12](https://arxiv.org/html/2607.01490#S12)), we also notice that the difficulty axis is relevant to preserving diversity. All methods lose scaling slope $b$ during training except pass@$k$-driven weights (evaluations done on LCB v6, Figure [6](https://arxiv.org/html/2607.01490#S4.F6), Figure [13](https://arxiv.org/html/2607.01490#S12.F13)), which also achieve the highest final pass@100 for competitive programming and math at both model scales (Tables [3](https://arxiv.org/html/2607.01490#S11.T3), [5](https://arxiv.org/html/2607.01490#S11.T5)). To understand how this difficulty axis shapes learning, we introduce Power $\alpha$: $m_S = m_F = C \cdot p q^{\alpha}$ ($C$ normalizes the peak to match the GRPO scale), a smooth approximation of pass@$k$ advantages (Chen et al., [2025](https://arxiv.org/html/2607.01490#bib.bib11)) (Figure [2](https://arxiv.org/html/2607.01490#S2.F2)) inspired by the focal loss of Lin et al. ([2017](https://arxiv.org/html/2607.01490#bib.bib39)). Higher $\alpha$ biases the gradient towards harder problems. Its easy counterpart uses $C \cdot p^{\alpha} q$.
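
The Power $\alpha$ weight can be sketched in a few lines. The text states only that $C$ normalizes the peak to match the GRPO scale; the code below assumes this means the maximum of $C \cdot p(1-p)^{\alpha}$ equals GRPO's peak value of $0.25$, and uses the fact that $p(1-p)^{\alpha}$ is maximized at $p^{\star} = 1/(1+\alpha)$. The function names are hypothetical.

```python
GRPO_PEAK = 0.25  # max of p(1-p), reached at p = 0.5

def power_alpha_mass(p, alpha):
    """m_S = m_F = C * p * (1-p)**alpha for alpha > 0.

    Assumption: C is chosen so the peak value equals GRPO_PEAK.
    """
    p_star = 1.0 / (1.0 + alpha)             # argmax of p(1-p)**alpha
    peak = p_star * (1.0 - p_star) ** alpha
    C = GRPO_PEAK / peak
    return C * p * (1.0 - p) ** alpha

def easy_power_alpha_mass(p, alpha):
    """Easy counterpart C * p**alpha * (1-p), the mirror image around p = 0.5."""
    return power_alpha_mass(1.0 - p, alpha)

# Hypothetical sweep: where each weight puts its mass across solve rates.
for alpha in (1.0, 2.0, 3.0):
    row = [round(power_alpha_mass(p / 10.0, alpha), 3) for p in range(1, 10)]
    print(alpha, row)
```

With $\alpha = 1$ the weight coincides with the mean GRPO mass $p(1-p)$; larger $\alpha$ moves the peak toward lower solve rates.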

![Refer to caption](https://arxiv.org/html/2607.01490v1/x7.png)
Figure 6: Accuracy vs. Diversity per Policy Weight. We estimate the $k$ vs. pass@$k$ curve by a shifted power law $G(k) = \exp(-a(k+c)^{-b})$, where $a$ controls the uniform level and $b$ controls the steepness. (Left) Evolution during training of these coefficients, where diverse RL should increase $b$ or at least maintain $b$ and lower $a$. (Right) pass@100 vs. pass@1, where diverse accurate advantages maximize both.

![Refer to caption](https://arxiv.org/html/2607.01490v1/x8.png)
Figure 7: Variance of the gradient weight $p(1-p)^{\alpha}$ at different solve rates for 16 MC rollouts.

Sweeping $\alpha$ values, we find that focusing on solve rates $\hat{p} \in [0.3, 0.5]$ (Figure [5](https://arxiv.org/html/2607.01490#S4.F5)) maximizes both pass@1 and pass@100. We analyze why this optimal difficulty exists. Across batches, the most valuable gradients at a solve rate $p$ are the ones with high signal $s(p)$, low variance $\mathrm{Var}(w(\hat{p})\hat{g})$, and high frequency. Since we estimate the empirical success rate $\bar{p}$ via $G$ Monte Carlo rollouts, the weight $\bar{w} = w(\hat{p})$ is itself a random variable. Using the Delta method, we can estimate the total per-update noise as:

$$\mathrm{Var}(w(\hat{p})\hat{g})\approx w(p)^{2}v_{0}(p)+s(p)^{2}\,\mathrm{Var}(w(\hat{p})) \tag{9}$$

where $v_{0}(p)\propto\frac{1}{G\,p(1-p)}$ is the raw gradient variance (which blows up when successes or failures are rare; see Figure [7](https://arxiv.org/html/2607.01490#S4.F7)), and $\mathrm{Var}(w(\hat{p}))\approx[w'(p)]^{2}\frac{p(1-p)}{G}$ is the variance introduced by weight estimation (Appendix [14](https://arxiv.org/html/2607.01490#S14)). Treating updates as independent, we define the per-prompt learning quality $q(p,w)=s(p)^{2}/\mathrm{Var}(w(\hat{p}))$, the signal-to-noise ratio of the weighted gradient at solve rate $p$. The total update quality over the difficulty distribution $f(p)$ is:

$$\mathcal{Q}(w)=\underbrace{N_{\mathrm{eff}}}_{\text{effective count}}\times\underbrace{\mathbb{E}_{p\sim f}\!\left[q(p,w)\right]}_{\text{per-sample quality}}=N_{\mathrm{eff}}\times\mathbb{E}_{p\sim f}\!\left[\frac{s(p)}{\mathrm{Var}(w(\hat{p})\hat{g})}\right] \tag{10}$$

where $N_{\mathrm{eff}}=M\,(\mathbb{E}[w])^{2}/\mathbb{E}[w^{2}]\leq M$ is Kish's effective sample size. Increasing $\alpha$ selectively filters out easy prompts, which can raise $q(p,w)$ at the mode but reduces $N_{\mathrm{eff}}$.
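The two ingredients of this trade-off, the per-update noise of Eq. (9) and Kish's effective sample size, can be sketched numerically. The code below is a minimal illustration under stated assumptions: the weight is the unnormalized $w(p)=p(1-p)^{\alpha}$, the proportionality constant in $v_{0}$ is set to 1, the signal $s$ is a constant, and the uniform sample of solve rates is hypothetical. All function names are invented for this example.

```python
import numpy as np

def weight(p, alpha):
    """Unnormalized Power-alpha weight w(p) = p * (1 - p)**alpha."""
    return p * (1.0 - p) ** alpha

def weight_grad(p, alpha):
    """w'(p) = (1 - p)**(alpha - 1) * (1 - (1 + alpha) * p)."""
    return (1.0 - p) ** (alpha - 1.0) * (1.0 - (1.0 + alpha) * p)

def per_update_noise(p, alpha, G, s=1.0):
    """Eq. (9), with the proportionality constant of v0 set to 1 (assumption)."""
    v0 = 1.0 / (G * p * (1.0 - p))                          # raw gradient variance
    var_w = weight_grad(p, alpha) ** 2 * p * (1.0 - p) / G  # Delta-method term
    return weight(p, alpha) ** 2 * v0 + s ** 2 * var_w

def kish_neff(w):
    """Kish's effective sample size: (sum w)^2 / sum w^2 = M * E[w]^2 / E[w^2]."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.square(w).sum()

rng = np.random.default_rng(0)
solve_rates = rng.uniform(0.05, 0.95, size=256)  # hypothetical batch of prompts
for alpha in (1, 2, 3):
    n_eff = kish_neff(weight(solve_rates, alpha))
    noise = per_update_noise(0.25, alpha, G=16)
    print(f"alpha={alpha}: N_eff={n_eff:.1f}, noise at p=0.25: {noise:.5f}")
```

Comparing $N_{\mathrm{eff}}$ across $\alpha$ on a batch's actual solve-rate distribution gives a quick read on how much of the batch a sharper weight effectively discards.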

Since $w$ must serve all difficulties simultaneously, we seek $w^{\star}=\arg\max_{w}\mathcal{Q}(w)$ over the current distribution $f(p)$, which may shift over the course of training. In practice, our gradient signal $s(p)$ is unknown, so we consider two hypotheses. If gradient updates carry the same signal regardless of difficulty ($s(p)\equiv s$), the only lever is variance reduction: the optimal strategy assigns weight proportional to the inverse noise, $w^{\star}(p)\propto 1/v_{0}(p)\propto p(1-p)$, concentrating on medium-difficulty prompts where both successes and failures are frequent enough to yield low-variance gradient estimates. This corresponds to $\alpha=1$ (mean GRPO) and is the case for larger models (CWM 32B) (Figure [1](https://arxiv.org/html/2607.01490#S0.F1)).

If instead harder problems carry a stronger learning signal ($s'(p)<0$), the optimal filter tilts toward low $p$ ($\alpha>1$), placing the weight mode at $\frac{1}{1+\alpha}$ (see Figure [7](https://arxiv.org/html/2607.01490#S4.F7)). At a solve rate of $0.5$, $\alpha=3$ is justified only if hard problems carry at least $2.25\times$ more signal than the average prompt (Table [8](https://arxiv.org/html/2607.01490#S14.T8)). This matches the 7B regime, where Power $\alpha=2$ gives $+5\%$ pass@$k$ over GRPO (Tables [3](https://arxiv.org/html/2607.01490#S11.T3), [4](https://arxiv.org/html/2607.01490#S11.T4)), suggesting smaller models have more to learn from hard problems. However, as the policy improves, problems that once provided signal become easy and the pool of hard problems shrinks, reducing $N_{\mathrm{eff}}$ and degrading update quality for large $\alpha$. Methods like pass@$k$ advantages that target tail-end difficulties are especially vulnerable, waiting for signal on unsolvable problems while ignoring steady progress on medium-difficulty prompts (Figure [5](https://arxiv.org/html/2607.01490#S4.F5)).
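The location of the weight mode follows from a short calculation. The worked step below takes the Power $\alpha$ weight with $q=1-p$, so that $w(p)\propto p(1-p)^{\alpha}$, and drops the constant $C$ since it does not affect the maximizer.

```latex
\begin{aligned}
\frac{d}{dp}\Bigl[p\,(1-p)^{\alpha}\Bigr]
  &= (1-p)^{\alpha-1}\bigl[(1-p)-\alpha p\bigr] = 0
  \;\Longrightarrow\; p^{\star}=\frac{1}{1+\alpha}, \\
\alpha=1 &\;\Rightarrow\; p^{\star}=\tfrac{1}{2}, \qquad
\alpha=2 \;\Rightarrow\; p^{\star}=\tfrac{1}{3}, \qquad
\alpha=3 \;\Rightarrow\; p^{\star}=\tfrac{1}{4}.
\end{aligned}
```

Each unit increase in $\alpha$ therefore moves the most heavily weighted solve rate further toward the hard end of the batch.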

## 5 The FADE Scheduler

The previous sections revealed two trade\-offs that a fixed policy weight cannot satisfy simultaneously:

- Sign axis (§[4.1](https://arxiv.org/html/2607.01490#S4.SS1), §[4.2](https://arxiv.org/html/2607.01490#S4.SS2)): failure-biased methods ($\delta<1$) learn faster but induce rank-1 update collapse; sign-balanced methods ($\delta=1$) preserve multi-dimensional learning but converge slowly.
- Difficulty axis (§[4.3](https://arxiv.org/html/2607.01490#S4.SS3)): focusing on hard problems maximizes gradient informativeness but at the cost of higher variance and less relevant batches.

To maximize gradient efficiency, we propose to adapt the policy weight to online learning dynamics. Combining Asym GRPO and Power $\alpha$, we introduce Focal Advantage with Dynamic Entropy (FADE), which adapts the policy weight to the moving averages of the solve rate $\hat{p}$ and the entropy $\hat{H}$:

$$A_{i}=(1-\bar{r})^{\alpha-1}\cdot\begin{cases}r_{i}-\bar{r}&\text{if }r_{i}\geq\bar{r},\\[4pt] \dfrac{r_{i}-\bar{r}}{\delta}&\text{if }r_{i}<\bar{r},\end{cases}\quad\alpha=\operatorname{clip}\Bigl(\frac{3(1-\hat{p})}{2\hat{p}},\,1,\,\alpha_{\max}\Bigr),\quad\delta=\operatorname{clip}\bigl(1+\hat{H}-H^{*},\,0.3,\,1\bigr). \tag{11}$$

Algorithm [2](https://arxiv.org/html/2607.01490#alg2) shows how FADE can be used in practice. We found that $\alpha_{\max}=3$ and a target entropy $H^{*}$ of half the initial entropy worked best (Appendix [10](https://arxiv.org/html/2607.01490#S10)). FADE can be seen as delayed exploitation. In a typical training run (Figure [8](https://arxiv.org/html/2607.01490#S5.F8)), $\alpha$ starts high, making the policy focus on frontier problems and develop diverse reasoning strategies. When the entropy drops below $H^{*}$, we force exploitation by making the gradients failure-biased with $\delta<1$. The initial sign-balanced phase ($\delta=1$) prevents premature rank-1 collapse in the update weights. On the other hand, the decaying $\alpha$ power avoids the gradient allocation trap where we overfit to hard problems. Figure [1](https://arxiv.org/html/2607.01490#S0.F1) shows that FADE achieves the best performance across all pass@$k$ metrics while reaching peak pass@1 20k steps earlier than the best static baseline (Power $\alpha=2$) at the 7B scale and 2k steps earlier at the 32B scale.
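Eq. (11) is compact enough to transcribe directly. The following is a minimal sketch rather than the authors' Algorithm 2: it assumes $\bar{r}$ is the mean reward of the rollouts sampled for one prompt, and that the moving averages $\hat{p}$ and $\hat{H}$ are tracked elsewhere and passed in. The function name `fade_advantages` and the `eps` guard against $\hat{p}=0$ are additions for this example.

```python
import numpy as np

def fade_advantages(rewards, p_hat, h_hat, h_star, alpha_max=3.0, eps=1e-8):
    """Transcription of Eq. (11) for one prompt's group of rollouts.

    rewards : rewards r_i of the rollouts sampled for one prompt
    p_hat   : moving average of the solve rate (tracked by the caller)
    h_hat   : moving average of the policy entropy (tracked by the caller)
    h_star  : target entropy H*
    eps     : guard against division by zero when p_hat == 0 (assumption)
    """
    r = np.asarray(rewards, dtype=float)
    r_bar = r.mean()
    alpha = float(np.clip(3.0 * (1.0 - p_hat) / (2.0 * max(p_hat, eps)), 1.0, alpha_max))
    delta = float(np.clip(1.0 + h_hat - h_star, 0.3, 1.0))
    centered = r - r_bar
    # Below-mean rollouts are divided by delta <= 1, which amplifies failures.
    signed = np.where(centered >= 0.0, centered, centered / delta)
    return (1.0 - r_bar) ** (alpha - 1.0) * signed, alpha, delta

# Early training: low solve rate, entropy above target -> alpha = 3, delta = 1.
adv, alpha, delta = fade_advantages([1, 0, 0, 0], p_hat=0.25, h_hat=1.0, h_star=0.5)
print(alpha, delta, adv)

# Later: higher solve rate, entropy below target -> alpha = 1, delta = 0.7.
adv, alpha, delta = fade_advantages([1, 1, 0, 0], p_hat=0.75, h_hat=0.2, h_star=0.5)
print(alpha, delta, adv)
```

With $\alpha_{\max}=3$, the clip in Eq. (11) holds $\alpha$ at 3 while $\hat{p}\leq 1/3$ and brings it down to 1 once $\hat{p}\geq 0.6$, so the difficulty focus relaxes smoothly between those two solve rates.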

We ablate each component independently (Figure [9](https://arxiv.org/html/2607.01490#S10.F9)): removing $\delta$ causes entropy collapse while removing $\alpha$ loses diversity. We also test two simplifications: (1) replacing both signals with a deterministic logarithmic schedule $f(t)\propto\log(1+t/\tau)$, and (2) driving both $\alpha$ and $\delta$ from entropy alone. Both degrade diversity (pass@10, pass@100), confirming that $\alpha$ and $\delta$ must respond to distinct signals to balance exploration and exploitation effectively.

FADE's entropy target provides a single interpretable knob controlling the exploration-exploitation trade-off in the learned update weights $W_{\Delta}$ (see Figure [6](https://arxiv.org/html/2607.01490#S4.F6)). We track the SVD of the output-head weight change $W_{\Delta}$ (singular-value ratio $s_{1}/s_{2}$ and rank-1 fraction) alongside its $L_{2}$ norm across three FADE runs with $H^{*}\in\{0.5,1.0,1.3\}$ on Qwen 2.5 7B (Table [8](https://arxiv.org/html/2607.01490#S5.F8)). The result is a clean, continuous transition: $H^{*}=0.5$ keeps $\|W_{\Delta}\|$ small and the update full-rank; $H^{*}=1.0$ grows $\|W_{\Delta}\|$ to 67% with moderate rank concentration; $H^{*}=1.3$ reproduces the rank-1 collapse seen in static failure-biased methods ($s_{1}/s_{2}>5$, rank-1 fraction 91%).
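Spectral diagnostics of this kind are straightforward to compute from two checkpoints. The sketch below is an illustration under assumptions that the text does not confirm: the rank-1 fraction is taken to be the share of squared singular values carried by the leading one, and the norm is taken to be the Frobenius norm of $W_{\Delta}$. The function name `update_spectrum` is hypothetical.

```python
import numpy as np

def update_spectrum(w_before, w_after):
    """Spectral summary of the weight change W_delta = w_after - w_before."""
    w_delta = np.asarray(w_after, dtype=float) - np.asarray(w_before, dtype=float)
    s = np.linalg.svd(w_delta, compute_uv=False)  # singular values, descending
    energy = np.sum(s**2)
    return {
        "norm": float(np.sqrt(energy)),           # Frobenius norm (assumed definition)
        "s1_over_s2": float(s[0] / s[1]),
        "rank1_fraction": float(s[0]**2 / energy),  # assumed definition
    }

# Toy example: a dominant rank-1 direction plus small isotropic noise.
rng = np.random.default_rng(0)
u, v = rng.normal(size=(64, 1)), rng.normal(size=(1, 32))
w0 = rng.normal(size=(64, 32))
w1 = w0 + 0.1 * (u @ v) + 0.01 * rng.normal(size=(64, 32))
print(update_spectrum(w0, w1))
```

Only the singular values are requested (`compute_uv=False`), which avoids materializing the singular vectors for a large output head.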

![Refer to caption](https://arxiv.org/html/2607.01490v1/x9.png)

| $H^{*}$ | Step | OH $L_{2}$ | $s_{1}/s_{2}$ | rank-1 % | Inner |
| --- | --- | --- | --- | --- | --- |
| 0.5 | 10k | 6.1% | 4.10 | 38.7% | 3.32% |
| 0.5 | 33k | 6.3% | 3.84 | 54.4% | 3.32% |
| 1.0 | 10k | 13.2% | 3.51 | 55.9% | 3.07% |
| 1.0 | 40k | 66.9% | 5.34 | 88.2% | 1.17% |
| 1.3 | 10k | 23.9% | 3.25 | 68.3% | 2.69% |
| 1.3 | 30k | 72.7% | 5.77 | 91.2% | 0.96% |
| *Asym.* | | $\sim$90% | 6.96–8.51 | 78–96% | $\sim$0.3% |
| *Sym.* | | <10% | 3.61–3.98 | 61–62% | $\sim$3.2% |

Figure 8: (Left) FADE controller dynamics: $\alpha$ and $\delta$ adapt the policy gradient weight $p(1-p)^{\alpha}/\delta$; $\delta$ decays once entropy hits $H^{*}$ and $\alpha$ decays as solve rates rise. (Right) Output-head SVD metrics at different entropy targets confirm that $H^{*}$ continuously controls the sign-balanced-to-failure-biased transition (Table [8](https://arxiv.org/html/2607.01490#S5.F8)).

## 6 Related Works

Test-time scaling and diversity collapse. Test-time compute scaling, sampling $k$ candidate solutions and selecting the best, has become a standard strategy for improving LLM reasoning, through self-consistency (Wang et al., [2023](https://arxiv.org/html/2607.01490#bib.bib68)), multi-turn verification (Lightman et al., [2024](https://arxiv.org/html/2607.01490#bib.bib38)), or simply majority voting. Brown et al. ([2024](https://arxiv.org/html/2607.01490#bib.bib8)) and Schaeffer et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib54)) characterized the resulting pass@$k$ curves as a power law governed by the distribution of per-problem solve rates. Training-time scaling laws for RL have been studied along complementary axes: environment interactions (Hilton et al., [2023](https://arxiv.org/html/2607.01490#bib.bib24)), entropy budgets (Cui et al., [2025](https://arxiv.org/html/2607.01490#bib.bib14)), and algorithmic choices (Khatri et al., [2025](https://arxiv.org/html/2607.01490#bib.bib30); Kaplan et al., [2020](https://arxiv.org/html/2607.01490#bib.bib28); Tan et al., [2025](https://arxiv.org/html/2607.01490#bib.bib63)). At the intersection of training and inference, several works observed that RL fine-tuning “sharpens” the policy: pass@1 improves while pass@$k$ degrades as the model collapses onto a narrow set of solutions (Huang et al., [2025](https://arxiv.org/html/2607.01490#bib.bib25); Karan and Du, [2025](https://arxiv.org/html/2607.01490#bib.bib29); Levi, [2026](https://arxiv.org/html/2607.01490#bib.bib35)). Barakat et al. ([2026](https://arxiv.org/html/2607.01490#bib.bib7)) formalized this as a gradient-level trade-off, showing that pass@1 and pass@$k$ optimization apply opposing forces on the policy. Focusing on GRPO, Cheng et al. ([2026](https://arxiv.org/html/2607.01490#bib.bib12)) showed that failed trajectories serve as credit assignment for positive tokens.

Policy weights and advantage functions. The advantage function (Baird, [1993](https://arxiv.org/html/2607.01490#bib.bib6)) separates the effect of an action from the quality of a state and can be estimated via batch-average baselines (Marbach and Tsitsiklis, [2001](https://arxiv.org/html/2607.01490#bib.bib43); Kool et al., [2019](https://arxiv.org/html/2607.01490#bib.bib33); Mnih and Rezende, [2016](https://arxiv.org/html/2607.01490#bib.bib45)), learned value networks (Wang et al., [2016](https://arxiv.org/html/2607.01490#bib.bib69)), or GAE (Schulman et al., [2015](https://arxiv.org/html/2607.01490#bib.bib55)). Using Monte Carlo rollouts to estimate the value function was popularized by GRPO (Shao et al., [2024](https://arxiv.org/html/2607.01490#bib.bib57); Guo et al., [2025](https://arxiv.org/html/2607.01490#bib.bib22)). To preserve test-time diversity, Amini et al. ([2024](https://arxiv.org/html/2607.01490#bib.bib3)), Chow et al. ([2024](https://arxiv.org/html/2607.01490#bib.bib13)), Walder and Karkhanis ([2025](https://arxiv.org/html/2607.01490#bib.bib67)), and Tang et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib64)) propose pass@$k$ reweighting. Other works design policy weights based on likelihood (Tajwar et al., [2026](https://arxiv.org/html/2607.01490#bib.bib62)), sign reweighting (Zhu et al., [2025](https://arxiv.org/html/2607.01490#bib.bib77); He et al., [2025](https://arxiv.org/html/2607.01490#bib.bib23)), or more complex proxies such as top-$k$ redistribution (Peng et al., [2025](https://arxiv.org/html/2607.01490#bib.bib51)) or in-context guidance (Qu et al., [2026](https://arxiv.org/html/2607.01490#bib.bib53)). Beyond fixed policy weights, The Microsoft AI Team ([2026](https://arxiv.org/html/2607.01490#bib.bib65)) adjust the upper clip bound to maintain a target policy entropy, similar to our $\delta$ scheduler in FADE but at the PPO clipping level. Zhao et al. ([2026](https://arxiv.org/html/2607.01490#bib.bib76)) directly rescale the GRPO weight based on the average entropy. Our framework is inspired by Thrampoulidis et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib66)), Chen et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib11)) and Davis and Recht ([2025](https://arxiv.org/html/2607.01490#bib.bib16)), who analyze the reward shaping of advantage functions and their gradient weights. We extend this line of work to a broader range of advantages, identifying general structural properties validated via large-scale RL experiments.

Weight-space analysis of RL training. Kumar et al. ([2020](https://arxiv.org/html/2607.01490#bib.bib34)) studied rank collapse in Q-learning with gradient descent, attributing it to bootstrapping. In contrast, we are interested in the rank of the update weight, since RL with LLMs changes the model weights relatively little. Cai et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib9)) observed rank-1 dominance of the parameter update matrix, more pronounced in RL than in SFT or distillation. Ye et al. ([2026](https://arxiv.org/html/2607.01490#bib.bib73)) showed that non-rank-1 directions encode out-of-domain abilities beyond reasoning, framing rank-1 collapse as a form of overfitting. Moalla et al. ([2024](https://arxiv.org/html/2607.01490#bib.bib46)) showed that update weights become lower rank in high-entropy regimes, correlating with our observations for $\delta>1$.

## 7 Conclusion

By decomposing policy weights into their positive and negative gradient masses ($m_{S}$, $m_{F}$), we revealed two orthogonal axes that jointly govern training dynamics. Skewing the sign balance between $m_{S}$ and $m_{F}$ can push the policy to entropy collapse if $m_{S}\gg m_{F}$ or to rank-1 update collapse if $m_{F}\gg m_{S}$, accelerating training at the cost of less diverse features. On the other hand, viewing the policy gradient masses $m_{S}(p)$, $m_{F}(p)$ as functions of the batch's solve rate $p$, we showed another trade-off between signal and noise per batch. Combining these two directions, we identify two regimes: the explorative mode, where we should have a balanced sign ratio and focus on hard problems; and the exploitative mode, where we want fast learning and low variance.

Building on this analysis, we introduced FADE, which adapts both axes to online training dynamics. It reaches peak pass@1 20k steps earlier than the best static baseline at 7B and 2k steps earlier at 32B, while maintaining the best diversity-accuracy trade-off across models and benchmarks (LiveCodeBench, AIME).

Several directions remain open. Our framework assumes binary, terminal rewards; extending the $m_{S}/m_{F}$ decomposition to process reward models with per-step credit assignment would clarify whether the sign and difficulty trade-offs persist at finer granularity. Similarly, when only a few Monte Carlo rollouts per prompt are available, the estimate $\bar{p}$ becomes noisy. Understanding which policy weights are robust to this estimation error, and how the optimal choice shifts in the low-rollout regime, deserves further study. Finally, all experiments use single-turn code and math generation; multi-turn and agentic settings, where intermediate feedback is available, may shift the exploration-exploitation balance in ways our trajectory-level analysis does not capture.

## References

- Ahmad et al\. \(2025\)Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, and Boris Ginsburg\.Opencodereasoning\-ii: A simple test time scaling approach via self\-critique\.*arXiv preprint arXiv:2507\.09075*, 2025\.
- Ahmadian et al\. \(2024\)Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker\.Back to basics: Revisiting reinforce\-style optimization for learning from human feedback in llms\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 12248–12267, 2024\.
- Amini et al\. \(2024\)Afra Amini, Tim Vieira, Elliott Ash, and Ryan Cotterell\.Variational best\-of\-n alignment\.*arXiv preprint arXiv:2407\.06057*, 2024\.
- Andrychowicz et al\. \(2021\)Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al\.What matters in on\-policy reinforcement learning? a large\-scale empirical study\.In*ICLR 2021\-Ninth International Conference on Learning Representations*, 2021\.
- Arnal et al\. \(2026\)Charles Arnal, Gaëtan Narozniak, Vivien Cabannes, Yunhao Tang, Julia Kempe, and Remi Munos\.Asymmetric reinforce for off\-policy reinforcement learning: Balancing positive and negative rewards\.*Advances in Neural Information Processing Systems*, 38:9640–9664, 2026\.
- Baird \(1993\)Leemon C Baird\.Advantage updating\.Technical report, Wright Laboratory, 1993\.
- Barakat et al\. \(2026\)Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, and Amrit Singh Bedi\.Why pass@ k optimization can degrade pass@ 1: Prompt interference in llm post\-training\.*arXiv preprint arXiv:2602\.21189*, 2026\.
- Brown et al\. \(2024\)Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini\.Large language monkeys: Scaling inference compute with repeated sampling\.*arXiv preprint arXiv:2407\.21787*, 2024\.
- Cai et al\. \(2025\)Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guangzhong Sun, Guiquan Liu, and Junfeng Fang\.On predictability of reinforcement learning dynamics for large language models\.*arXiv preprint arXiv:2510\.00553*, 2025\.
- Chen et al\. \(2021\)Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al\.Evaluating large language models trained on code\.*arXiv preprint arXiv:2107\.03374*, 2021\.
- Chen et al\. \(2025\)Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi\.Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models\.*arXiv preprint arXiv:2508\.10751*, 2025\.
- Cheng et al\. \(2026\)Tianhao Cheng, Zeyu Huang, Zihan Qiu, Yu Cheng, Edoardo Ponti, Yinghui Xu, Ivan Titov, and Zenglin Xu\.The cancellation hypothesis in critic\-free rl: From outcome rewards to token credits\.*arXiv preprint arXiv:2605\.08666*, 2026\.
- Chow et al\. \(2024\)Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, and Aleksandra Faust\.Inference\-aware fine\-tuning for best\-of\-n sampling in large language models\.*arXiv preprint arXiv:2412\.15287*, 2024\.
- Cui et al\. \(2025\)Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al\.The entropy mechanism of reinforcement learning for reasoning language models\.*arXiv preprint arXiv:2505\.22617*, 2025\.
- Dabney et al\. \(2018\)Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos\.Distributional reinforcement learning with quantile regression\.In*Proceedings of the AAAI conference on artificial intelligence*, volume 32, 2018\.
- Davis and Recht \(2025\)Damek Davis and Benjamin Recht\.What is the objective of reasoning with reinforcement learning?*arXiv preprint arXiv:2510\.13651*, 2025\.
- FAIR CodeGen team et al\. \(2025\)FAIR CodeGen team, :, Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol\-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V\. Fisches, François Fleuret, Fabian Gloeckle, Alex Gu, Michael Hassid, Daniel Haziza, Badr Youbi Idrissi, Christian Keller, Rahul Kindi, Hugh Leather, Gallil Maimon, Aram Markosyan, Francisco Massa, Pierre\-Emmanuel Mazaré, Vegard Mella, Naila Murray, Keyur Muzumdar, Peter O’Hearn, Matteo Pagliardini, Dmitrii Pedchenko, Tal Remez, Volker Seeker, Marco Selvi, Oren Sultan, Sida Wang, Luca Wehrstedt, Ori Yoran, Lingming Zhang, Taco Cohen, Yossi Adi, and Gabriel Synnaeve\.Cwm: An open\-weights llm for research on code generation with world models\.*arXiv preprint arXiv:2510\.02387*, 2025\.
- Gan and Isola \(2026\)Yulu Gan and Phillip Isola\.Neural thickets: Diverse task experts are dense around pretrained weights\.*arXiv preprint arXiv:2603\.12228*, 2026\.
- Garg and Venkatesh \(2025\)Anisha Garg and Ganesh Venkatesh\.The peril of preference: Why grpo fails on ordinal rewards\.*arXiv preprint arXiv:2511\.04439*, 2025\.
- Gehring et al\. \(2025\)Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Synnaeve\.Rlef: Grounding code llms in execution feedback with reinforcement learning\.In*International Conference on Machine Learning*, pages 19034–19055\. PMLR, 2025\.
- Greensmith et al\. \(2004\)Evan Greensmith, Peter L Bartlett, and Jonathan Baxter\.Variance reduction techniques for gradient estimates in reinforcement learning\.*Journal of Machine Learning Research*, 5\(Nov\):1471–1530, 2004\.
- Guo et al\. \(2025\)Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al\.Deepseek\-r1 incentivizes reasoning in llms through reinforcement learning\.*Nature*, 645\(8081\):633–638, 2025\.
- He et al\. \(2025\)Andre Wang He, Daniel Fried, and Sean Welleck\.Rewarding the unlikely: Lifting grpo beyond distribution sharpening\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 25559–25571, 2025\.
- Hilton et al\. \(2023\)Jacob Hilton, Jie Tang, and John Schulman\.Scaling laws for single\-agent reinforcement learning\.*arXiv preprint arXiv:2301\.13442*, 2023\.
- Huang et al\. \(2025\)Audrey Huang, Adam Block, Dylan Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan Ash, and Akshay Krishnamurthy\.Self\-improvement in language models: The sharpening mechanism\.In*International Conference on Learning Representations*, volume 2025, pages 76687–76739, 2025\.
- Hui et al\. \(2024\)Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al\.Qwen2\. 5\-coder technical report\.*arXiv preprint arXiv:2409\.12186*, 2024\.
- Jiang et al\. \(2025\)Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, Yu Yue, Qianchuan Zhao, and Lin Yan\.Risk\-sensitive rl for alleviating exploration dilemmas in large language models\.*arXiv preprint arXiv:2509\.24261*, 2025\.
- Kaplan et al\. \(2020\)Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B\. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei\.Scaling laws for neural language models\.*arXiv preprint arXiv:2001\.08361*, 2020\.
- Karan and Du \(2025\)Aayush Karan and Yilun Du\.Reasoning with sampling: Your base model is smarter than you think\.*arXiv preprint arXiv:2510\.14901*, 2025\.
- Khatri et al\. \(2025\)Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, and Rishabh Agarwal\.The art of scaling reinforcement learning compute for llms\.*arXiv preprint arXiv:2510\.13786*, 2025\.
- Kim \(2026\)Youngeun Kim\.Mc\-grpo: Median\-centered group relative policy optimization for small\-rollout reinforcement learning\.*arXiv preprint arXiv:2601\.22582*, 2026\.
- Kimi Team et al\. \(2025\)Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al\.Kimi k1\. 5: Scaling reinforcement learning with llms\.*arXiv preprint arXiv:2501\.12599*, 2025\.
- Kool et al\. \(2019\)Wouter Kool, Herke van Hoof, and Max Welling\.Buy 4 reinforce samples, get a baseline for free\!*arXiv preprint arXiv:1901\.10280*, 2019\.
- Kumar et al\. \(2020\)Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine\.Implicit under\-parameterization inhibits data\-efficient deep reinforcement learning\.*arXiv preprint arXiv:2010\.14498*, 2020\.
- Levi \(2026\)Noam Levi\.Learning shrinks the hard tail: Training\-dependent inference scaling in a solvable linear model\.*arXiv preprint*, 2026\.
- Li et al\. \(2023\)Rongao Li, Jie Fu, Bo\-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li\.Taco: Topics in algorithmic code generation dataset\.*arXiv preprint arXiv:2312\.14852*, 2023\.
- Li et al\. \(2022\)Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al\.Competition\-level code generation with alphacode\.*Science*, 378\(6624\):1092–1097, 2022\.
- Lightman et al\. \(2024\)Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe\.Let’s verify step by step\.In*The Twelfth International Conference on Learning Representations*, 2024\.
- Lin et al\. \(2017\)Tsung\-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár\.Focal loss for dense object detection\.In*Proceedings of the IEEE international conference on computer vision*, pages 2980–2988, 2017\.
- Lin et al\. \(2026\)Wenze Lin, Zhen Yang, Xitai Jiang, Pony Ma, and Gao Huang\.Thickening\-to\-thinning: Reward shaping via human\-inspired learning dynamics for llm reasoning\.*arXiv preprint arXiv:2602\.04265*, 2026\.
- Liu et al\. \(2025a\)Zi\-Yan Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin\.Understanding r1\-zero\-like training: A critical perspective\.*ArXiv*, abs/2503\.20783, 2025a\.
- Liu et al\. \(2025b\)Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin\.Understanding r1\-zero\-like training: A critical perspective\.*arXiv preprint arXiv:2503\.20783*, 2025b\.
- Marbach and Tsitsiklis \(2001\)Peter Marbach and John N Tsitsiklis\.Simulation\-based optimization of markov reward processes\.*IEEE Transactions on Automatic Control*, 46\(2\):191–209, 2001\.
- Minsky \(1961\)Marvin Minsky\.Steps toward artificial intelligence\.*Proceedings of the IRE*, 49\(1\):8–30, 1961\.
- Mnih and Rezende \(2016\)Andriy Mnih and Danilo Rezende\.Variational inference for monte carlo objectives\.In*International Conference on Machine Learning*, pages 2188–2196\. PMLR, 2016\.
- Moalla et al\. \(2024\)Skander Moalla, Andrea Miele, Daniil Pyatko, Razvan Pascanu, and Caglar Gulcehre\.No representation, no trust: Connecting representation, collapse, and trust issues in ppo\.*Advances in Neural Information Processing Systems*, 37:69652–69699, 2024\.
- Moshkov et al\. \(2025\)Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman\.Aimo\-2 winning solution: Building state\-of\-the\-art mathematical reasoning models with openmathreasoning dataset\.*arXiv preprint arXiv:2504\.16891*, 2025\.
- Noukhovitch et al\. \(2025\)Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville\.Asynchronous rlhf: Faster and more efficient off\-policy rl for language models\.In*International Conference on Learning Representations*, volume 2025, pages 4003–4029, 2025\.
- OpenAI \(2024\)OpenAI\.Learning to reason with LLMs\.[https://openai\.com/index/learning\-to\-reason\-with\-llms/](https://openai.com/index/learning-to-reason-with-llms/), September 2024\.
- Park et al\. \(2025\)Jaesung R Park, Junsu Kim, Gyeongman Kim, Jinyoung Jo, Sean Choi, Jaewoong Cho, and Ernest K Ryu\.Clip\-low increases entropy and clip\-high decreases entropy in reinforcement learning of large language models\.*arXiv preprint arXiv:2509\.26114*, 2025\.
- Peng et al\. \(2025\)Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, and Yandong Wen\.Simko: Simple pass@ k policy optimization\.*arXiv preprint arXiv:2510\.14807*, 2025\.
- Plyusov et al\. \(2026\)Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, and Daniil Gavrilov\.F\-grpo: Don’t let your policy learn the obvious and forget the rare\.*arXiv preprint arXiv:2602\.06717*, 2026\.
- Qu et al\. \(2026\)Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar\.Pope: Learning to reason on hard problems via privileged on\-policy exploration\.*arXiv preprint arXiv:2601\.18779*, 2026\.
- Schaeffer et al\. \(2025\)Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, and Sanmi Koyejo\.How do large language monkeys get their power \(laws\)?In*International Conference on Machine Learning*, pages 53132–53176\. PMLR, 2025\.
- Schulman et al\. \(2015\)John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel\.High\-dimensional continuous control using generalized advantage estimation\.*arXiv preprint arXiv:1506\.02438*, 2015\.
- Schulman et al\. \(2017\)John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov\.Proximal policy optimization algorithms\.*arXiv preprint arXiv:1707\.06347*, 2017\.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. *arXiv preprint arXiv:2402.03300*, 2024.
- Srinivasan et al. (2018) Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Pérolat, Karl Tuyls, Rémi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments. *Advances in Neural Information Processing Systems*, 31, 2018.
- Suk and Duan (2025) Joe Suk and Yaqi Duan. On the optimization dynamics of rlvr: Gradient gap and step size thresholds. *arXiv preprint arXiv:2510.08539*, 2025.
- Sutton (1988) Richard S Sutton. Learning to predict by the methods of temporal differences. *Machine learning*, 3(1):9–44, 1988.
- Sutton et al. (1999) Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In *Proceedings of the 13th International Conference on Neural Information Processing Systems (NIPS)*, pages 1057–1063, Cambridge, MA, USA, 1999. MIT Press.
- Tajwar et al. (2026) Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, and Andrea Zanette. Maximum likelihood reinforcement learning. *arXiv preprint arXiv:2602.02710*, 2026.
- Tan et al. (2025) Zelin Tan, Hejia Geng, Xiaohang Yu, Mulei Zhang, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, et al. Scaling behaviors of llm reinforcement learning post-training: An empirical study in mathematical reasoning. *arXiv preprint arXiv:2509.25300*, 2025.
- Tang et al. (2025) Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Rémi Munos. Optimizing language models for inference time objectives using reinforcement learning. *arXiv preprint arXiv:2503.19595*, 2025.
- The Microsoft AI Team (2026) The Microsoft AI Team. Mai-thinking-1: Building a hill-climbing machine. Technical report, Microsoft AI, 2026. [https://microsoft.ai/pdf/mai-thinking-1.pdf](https://microsoft.ai/pdf/mai-thinking-1.pdf).
- Thrampoulidis et al. (2025) Christos Thrampoulidis, Sadegh Mahdavi, and Wenlong Deng. Advantage shaping as surrogate reward maximization: Unifying pass@k policy gradients. *arXiv preprint arXiv:2510.23049*, 2025.
- Walder and Karkhanis (2025) Christian Walder and Deep Karkhanis. Pass@k policy optimization: Solving harder reinforcement learning problems. *arXiv preprint arXiv:2505.15201*, 2025.
- Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In *The Eleventh International Conference on Learning Representations*, 2023.
- Wang et al. (2016) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In *International conference on machine learning*, pages 1995–2003. PMLR, 2016.
- Williams (1992) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. *Machine Learning*, 8(3–4):229–256, 1992. [10.1007/BF00992696](https://arxiv.org/doi.org/10.1007/BF00992696).
- Yan et al. (2025) Xue Yan, Yan Song, Xidong Feng, Mengyue Yang, Haifeng Zhang, Haitham Bou Ammar, and Jun Wang. Efficient reinforcement learning with large language model priors. In *International Conference on Learning Representations*, volume 2025, pages 48691–48715, 2025.
- Yang et al. (2026) Fengkai Yang, Zherui Chen, Xiaohan Wang, Xiaodong Lu, Jiajun Chai, Guojun Yin, Wei Lin, Shuai Ma, Fuzhen Zhuang, Deqing Wang, et al. Your group-relative advantage is biased. *arXiv preprint arXiv:2601.08521*, 2026.
- Ye et al. (2026) Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, and Tat-Seng Chua. On the implicit reward overfitting and the low-rank dynamics in rlvr. *arXiv preprint arXiv:2605.06523*, 2026.
- Yu et al. (2025) Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. *arXiv preprint arXiv:2503.14476*, 2025.
- Zhang (2026) Chenchen Zhang. From reasoning to agentic: Credit assignment in reinforcement learning for large language models. *arXiv preprint arXiv:2604.09459*, 2026.
- Zhao et al. (2026) Haotian Zhao, Songlin Zhou, Yuxin Zhang, Stephen S-T Yau, Wenyu Zhang, Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, et al. Aem: Adaptive entropy modulation for multi-turn agentic reinforcement learning. *arXiv preprint arXiv:2605.00425*, 2026.
- Zhu et al. (2025) Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in llm reasoning. *arXiv preprint arXiv:2506.01347*, 2025.


## 8 Deriving Positive and Negative Weights

Throughout, $\tau_S$ and $\tau_F$ denote successful and failed trajectories (carrying positive and negative weights, respectively), $\bar{r} := \frac{1}{B}\sum_i r_i$ is the batch reward mean, $\sigma_r = \sqrt{\frac{1}{B}\sum_i (r_i - \bar{r})^2}$ is the reward standard deviation, and all gradients are estimated over $B$ independent rollouts from $\pi_\theta$.

### 8.1 GRPO

The GRPO advantage is $A_i = (r_i - \bar{r})/\sigma_r$. For binary rewards $r_i \in \{0,1\}$ with $\bar{r} = \hat{p}$ (abbreviated $p$ below, with $q := 1 - p$), the empirical standard deviation reduces to the Bernoulli form $\sigma_r = \sqrt{\hat{p}(1-\hat{p})} = \sqrt{pq}$, since:

$$\sigma_r^2 = \tfrac{1}{B}\sum_{i=1}^{B}(r_i - \hat{p})^2 = \tfrac{1}{B}\bigl[Bp\,(1-p)^2 + Bq\,p^2\bigr] = pq(q+p) = pq.$$

The per-trajectory advantages are:

$$A_s = \frac{q}{\sqrt{pq}} = \sqrt{\frac{q}{p}}, \qquad A_f = \frac{-p}{\sqrt{pq}} = -\sqrt{\frac{p}{q}}.$$

With $|S| = Bp$ and $|F| = Bq$ in expectation:

$$\nabla_{\mathrm{GRPO}} = \tfrac{q}{\sqrt{pq}} \cdot Bp \cdot \mathbf{m}_S - \tfrac{p}{\sqrt{pq}} \cdot Bq \cdot \mathbf{m}_F = \tfrac{Bpq}{\sqrt{pq}}\left[\mathbf{m}_S - \mathbf{m}_F\right] = B\sqrt{pq}\left[\mathbf{m}_S - \mathbf{m}_F\right].$$

Hence $m_S = m_F = Bpq/\sigma_r = B\sqrt{pq}$, confirming sign balance. Skew-R, introduced by Thrampoulidis et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib66)), combines regular and mean-based GRPO by taking the product of both, giving $m_S = m_F = pq \times \sqrt{pq}$.
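
The sign balance of GRPO can be checked numerically on a single binary reward vector. The following is a minimal sketch, not the authors' implementation: the helper names `grpo_advantages` and `gradient_masses` are hypothetical, and the "mass" is taken to be the sum of absolute advantages on each side, which is what multiplies the mean success and failure gradients in the derivation above.

```python
import math


def grpo_advantages(rewards):
    """Standardize rewards by the batch mean and population std, as in GRPO."""
    b = len(rewards)
    mean = sum(rewards) / b
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / b)
    if std == 0.0:
        # All rollouts agree, so there is no learning signal (assumed convention).
        return [0.0] * b
    return [(r - mean) / std for r in rewards]


def gradient_masses(rewards, advantages):
    """Sum |A_i| separately over successes (r = 1) and failures (r = 0)."""
    m_s = sum(a for r, a in zip(rewards, advantages) if r == 1)
    m_f = sum(-a for r, a in zip(rewards, advantages) if r == 0)
    return m_s, m_f


rewards = [1, 1, 0, 0, 0, 0, 0, 0]  # B = 8, p = 0.25, q = 0.75
m_s, m_f = gradient_masses(rewards, grpo_advantages(rewards))
# Compare both masses against the closed form B * sqrt(p * q).
print(m_s, m_f, len(rewards) * math.sqrt(0.25 * 0.75))
```

The three printed values should agree up to floating-point rounding, matching $m_S = m_F = B\sqrt{pq}$.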

### 8.2 Multiplier Rescaled GRPO

The $\frac{1}{\sigma_r}$ multiplier in GRPO is unstable (Liu et al., [2025b](https://arxiv.org/html/2607.01490#bib.bib42); Yu et al., [2025](https://arxiv.org/html/2607.01490#bib.bib74)), as it can lead to exploding or noisy gradient updates. Instead, we can keep the mean normalization and introduce other multipliers:

$$\nabla_{\text{rescale}} = \sum_{i \in S} C_s(\mathbf{r}, i) \cdot (r_i - \bar{r})\,\nabla \log \pi_\theta(\tau_i) - \sum_{i \in F} C_f(\mathbf{r}, i) \cdot (\bar{r} - r_i)\,\nabla \log \pi_\theta(\tau_i),$$

where $C_s(\mathbf{r}, i)$ and $C_f(\mathbf{r}, i)$ are arbitrary non-negative functions that may depend on the full reward vector $\mathbf{r} = (r_1, \dots, r_B)$ and the sample index $i$. Mirroring the logic behind the focal loss (Lin et al., [2017](https://arxiv.org/html/2607.01490#bib.bib39)), we introduce the Power-$\alpha$ series (PA), where the functions are powers of $|r_i - \bar{r}|$.

Our proposed methods:

1. Power-$\alpha$ (sign-balanced): The intuition is to reduce the gradient magnitude when the model already performs well ($\bar{r}$ high): samples that are easy to solve should contribute less to the update. We set $C_s(\mathbf{r}, i) = C_f(\mathbf{r}, i) = (1 - \bar{r})^{\alpha - 1}$ for all $i$, so the advantage becomes $\hat{A}_i = (r_i - \bar{r})(1 - \bar{r})^{\alpha - 1}$. For binary rewards where $\bar{r} = p$, this simplifies to $C_s = C_f = q^{\alpha - 1}$, giving $m_S = m_F = Bpq \cdot q^{\alpha - 1} = Bpq^{\alpha}$. As $\alpha$ increases, the masses decay faster as $q$ shrinks: gradients are suppressed on prompts the model has mostly solved, focusing optimization on harder problems.
2. Asymmetric Power-$\alpha$: We decouple the exponents: $C_s(\mathbf{r}, i) = (1 - \bar{r})^{\alpha_s - 1}$ for $i \in S$ and $C_f(\mathbf{r}, i) = \bar{r}^{\alpha_f - 1}$ for $i \in F$. The positive mass derivation is identical to the sign-balanced case: $m_S = Bpq^{\alpha_s}$. For the negative side, with binary rewards $\bar{r} = p$: $m_F = Bp^{\alpha_f - 1} \cdot p \cdot q = Bp^{\alpha_f} q$. If $\alpha_s > \alpha_f$ we suppress the positive gradient more than the negative one (or vice versa), giving explicit control over the diversity–performance tradeoff. Note that $\alpha_s = \alpha_f = 1$ recovers the mean-only GRPO advantage $r_i - \bar{r}$ (Dr. GRPO), with $m_S = m_F = Bpq$.
3. Asymmetric GRPO: keeps the mean-normalized GRPO advantage intact on the positive side and rescales only the negative side by a fixed scalar $\frac{1}{\delta}$: $C_s = 1$ and $C_f = \frac{1}{\delta}$. This gives $m_S = Bpq$ and $m_F = \frac{Bpq}{\delta}$. When $\delta < 1$, the negative gradient is amplified relative to the positive one ($m_F > m_S$), pushing the model more aggressively away from failed trajectories. When $\delta > 1$, the negative gradient is suppressed ($m_F < m_S$), preserving diversity by reducing the penalty on incorrect solutions. At $\delta = 1$ we recover the mean-normalized GRPO update. Unlike Power-$\alpha$, this method does not adapt to the difficulty of the prompt ($p$); it modifies only the strength of negative advantages.
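
The three rescaling rules above differ only in the multipliers $C_s$ and $C_f$, which makes them easy to express with one shared helper. The sketch below is illustrative only: all function names are hypothetical, rewards are assumed binary, and the exponents are assumed to satisfy $\alpha \geq 1$ so that the powers are defined when $p$ or $q$ is zero.

```python
def rescaled_advantages(rewards, c_s, c_f):
    """A_i = C * (r_i - mean), with C = c_s on successes and c_f on failures."""
    mean = sum(rewards) / len(rewards)
    return [(c_s if r == 1 else c_f) * (r - mean) for r in rewards]


def power_alpha(rewards, alpha):
    # Sign-balanced: the same multiplier (1 - mean)^(alpha - 1) on both sides.
    q = 1.0 - sum(rewards) / len(rewards)
    c = q ** (alpha - 1)
    return rescaled_advantages(rewards, c, c)


def asymmetric_power_alpha(rewards, alpha_s, alpha_f):
    # Separate exponents for the success and failure multipliers.
    p = sum(rewards) / len(rewards)
    return rescaled_advantages(rewards, (1.0 - p) ** (alpha_s - 1), p ** (alpha_f - 1))


def asymmetric_grpo(rewards, delta):
    # Only the failure side is rescaled, by the fixed scalar 1 / delta.
    return rescaled_advantages(rewards, 1.0, 1.0 / delta)


def masses(rewards, advantages):
    """Total positive mass on successes and total negative mass on failures."""
    m_s = sum(a for r, a in zip(rewards, advantages) if r == 1)
    m_f = sum(-a for r, a in zip(rewards, advantages) if r == 0)
    return m_s, m_f


rewards = [1, 1, 0, 0, 0, 0, 0, 0]  # B = 8, p = 0.25, q = 0.75
print(masses(rewards, power_alpha(rewards, alpha=2)))          # compare with B*p*q**2 on both sides
print(masses(rewards, asymmetric_power_alpha(rewards, 2, 1)))  # compare with B*p*q**2 and B*p*q
print(masses(rewards, asymmetric_grpo(rewards, delta=0.5)))    # compare with B*p*q and B*p*q/delta
```

Keeping the multiplier separate from the centered reward mirrors the $C \cdot (r_i - \bar{r})$ structure of the rescaled gradient, so new variants only need a new pair $(C_s, C_f)$.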

Thickening to Thinning (T2T) proposes another type of $C_s$ and $C_f$ multiplier for GRPO based on the average normalized lengths of successful and failed responses, $\bar{L}_S$ and $\bar{L}_F$. They define the reward as

$$R_i = \begin{cases} 1 - \alpha p L_i & \text{if } \mathcal{V}_i = 1 \\ \alpha(1-p) L_i & \text{if } \mathcal{V}_i = 0 \end{cases}$$

within a GRPO normalization. The batch mean is:

$$\mu = \tfrac{1}{B}\Bigl(\textstyle\sum_{i \in S}(1 - \alpha p L_i) + \textstyle\sum_{j \in F}\alpha(1-p) L_j\Bigr) = p - \alpha p^2 \bar{L}_S + \alpha(1-p)^2 \bar{L}_F. \tag{12}$$

Defining $C := 1 - \alpha p \bar{L}_S - \alpha(1-p)\bar{L}_F$, the positive and negative advantages are:

$$\bar{A}_S = \tfrac{1}{\sigma_R}\bigl((1 - \alpha p \bar{L}_S) - \mu\bigr) = \tfrac{(1-p)}{\sigma_R}\bigl[1 - \alpha p \bar{L}_S - \alpha(1-p)\bar{L}_F\bigr] = \tfrac{(1-p)\,C}{\sigma_R}, \tag{13}$$

$$\bar{A}_F = \tfrac{1}{\sigma_R}\bigl(\alpha(1-p)\bar{L}_F - \mu\bigr) = \tfrac{-p}{\sigma_R}\bigl[1 - \alpha p \bar{L}_S - \alpha(1-p)\bar{L}_F\bigr] = \tfrac{-p\,C}{\sigma_R}. \tag{14}$$

The gradient takes the form:

$$\nabla_{\text{GRPO-T2T}} \approx \frac{Bp(1-p)C}{\sigma_R}\left[\overline{\nabla}_S - \overline{\nabla}_F\right]$$

### 8.3 Maximum Likelihood RL

As introduced in Tajwar et al. ([2026](https://arxiv.org/html/2607.01490#bib.bib62)), the MaxRL advantage is $A_i = \frac{r_i - \bar{r}}{\bar{r}}$. With $r_i \in \{0,1\}$, we have $A_s = \frac{1-p}{p}$ and $A_f = \frac{0-p}{p} = -1$. Similar to the GRPO gradient we can write:

$$\begin{aligned}
\nabla_{\mathrm{MaxRL}} &= \sum_{\tau_s \in S} A_s \nabla \log \pi_\theta(\tau_s) + \sum_{\tau_f \in F} A_f \nabla \log \pi_\theta(\tau_f) \\
&= \frac{1-p}{p}\sum_{\tau_s \in S} \nabla \log \pi_\theta(\tau_s) - \sum_{\tau_f \in F} \nabla \log \pi_\theta(\tau_f) \\
&= \frac{(1-p)}{p} \cdot Bp \cdot \mathbf{m}_S - B(1-p) \cdot \mathbf{m}_F = B(1-p)\left[\mathbf{m}_S - \mathbf{m}_F\right].
\end{aligned}$$

Hence $m_S = m_F = B(1-p)$, which is sign-balanced.
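
As a quick numerical illustration of the MaxRL masses, the sketch below evaluates the advantage on one binary batch and compares both sides with $B(1-p)$. The function name is hypothetical, and returning zeros when no rollout succeeds is an assumption made here to avoid dividing by zero; the source does not specify that case.

```python
def maxrl_advantages(rewards):
    """A_i = (r_i - mean) / mean for binary rewards."""
    mean = sum(rewards) / len(rewards)
    if mean == 0:
        # Assumed convention: no successes means no update for this prompt.
        return [0.0] * len(rewards)
    return [(r - mean) / mean for r in rewards]


rewards = [1, 1, 0, 0, 0, 0, 0, 0]  # B = 8, p = 0.25
adv = maxrl_advantages(rewards)
m_s = sum(a for r, a in zip(rewards, adv) if r == 1)
m_f = sum(-a for r, a in zip(rewards, adv) if r == 0)
b, p = len(rewards), sum(rewards) / len(rewards)
# Both masses should match the closed form B * (1 - p).
print(m_s, m_f, b * (1 - p))
```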

### 8.4 Pass@$k$ Based Methods

#### 8.4.1 Pass@$k$ (Tang et al.)

The leave-one-out advantage of Tang et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib64)) is $A_i = \max_k(\mathbf{r}) - \max_k(\mathbf{r}_{-i})$, where $\max_k(\mathbf{r}) = \mathbf{1}[\exists\, j : r_j = 1]$ is the pass@$k$ indicator over a group of $k$ samples and $\mathbf{r}_{-i}$ excludes sample $i$.

##### Failed samples ($r_i = 0$).

Removing a zero does not change the maximum: $\max_k(\mathbf{r}) = \max_k(\mathbf{r}_{-i})$ regardless of the other samples, so $A_i = 0$ for all $i \in F$.

##### Successful samples ($r_i = 1$).

We have $\max_k(\mathbf{r}) = 1$. The leave-one-out is $\max_k(\mathbf{r}_{-i}) = \mathbf{1}[\exists\, j \neq i : r_j = 1]$, so:

$$A_i = 1 - \mathbf{1}[\exists\, j \neq i : r_j = 1] = \mathbf{1}[\text{sample } i \text{ is the only success}].$$

Hence $A_i = 1$ when $|S| = 1$ and $i$ is the unique success, and $A_i = 0$ when $|S| \geq 2$.

##### Gradient.

Since only the unique-success case contributes:

$$\nabla = \sum_{i=1}^{k} A_i \nabla \log \pi_\theta(\tau_i) = \begin{cases} \nabla \log \pi_\theta(\tau_s) = \mathbf{m}_S & \text{if } |S| = 1, \\ 0 & \text{otherwise.} \end{cases}$$

Taking expectations, $\Pr[|S| = 1] = \binom{k}{1} p\, q^{k-1} = kpq^{k-1}$, so:

$$\mathbb{E}[\nabla] = kpq^{k-1}\,\mathbf{m}_S - 0 \cdot \mathbf{m}_F,$$

giving $m_S = kpq^{k-1}$ and $m_F = 0$.

Unlike the sign-balanced methods (where the masses are deterministic for any fixed batch), here the gradient is zero in most batches and only fires when exactly one of $k$ samples succeeds.
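
The expected positive mass $kpq^{k-1}$ can be confirmed by brute force for small $k$. The sketch below enumerates every binary reward vector of length $k$, weights it by its probability under i.i.d. Bernoulli($p$) rewards (an assumption matching the independent-rollout setting of this appendix), and sums the leave-one-out advantages. Function names are hypothetical.

```python
from itertools import product


def loo_passk_advantages(rewards):
    """A_i = max(r) - max(r without sample i), for binary rewards."""
    advantages = []
    for i in range(len(rewards)):
        rest = rewards[:i] + rewards[i + 1:]
        # default=0 covers k = 1, where removing the only sample leaves nothing.
        advantages.append(max(rewards) - max(rest, default=0))
    return advantages


def expected_positive_mass(p, k):
    """Exact expectation of sum_i A_i over all 2^k reward vectors."""
    total = 0.0
    for rewards in product((0, 1), repeat=k):
        prob = 1.0
        for r in rewards:
            prob *= p if r == 1 else 1.0 - p
        total += prob * sum(loo_passk_advantages(rewards))
    return total


p, k = 0.3, 4
# Compare the enumerated expectation with the closed form k * p * q^(k-1).
print(expected_positive_mass(p, k), k * p * (1 - p) ** (k - 1))
```

Because failures always receive zero advantage, the enumerated sum is entirely positive mass, consistent with $m_F = 0$.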

#### 8.4.2 Analytical Pass@$k$ (Chen et al.)

Chen et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib11)) define the group pass@$k$ estimator $R_k = 1 - \binom{F}{k} / \binom{G}{k}$ over $G$ rollouts with $F$ failures and $S = G - F$ successes, with $\sigma_k = \sqrt{R_k(1 - R_k)}$. The per-sample advantages are:

$$A_s = \frac{1 - R_k}{\sigma_k}, \qquad A_f = \frac{1}{\sigma_k}\left(1 - R_k - \frac{\binom{F-1}{k-1}}{\binom{G-1}{k-1}}\right).$$

##### Simplification.

Define $Q_{k-1}' := \binom{F-1}{k-1} / \binom{G-1}{k-1}$. The identity $\binom{F}{k} / \binom{G}{k} = (F/G)\,Q_{k-1}'$ gives $1 - R_k = \hat{q}\,Q_{k-1}'$, so:

$$A_s = \frac{\hat{q}\,Q_{k-1}'}{\sigma_k}, \qquad A_f = \frac{Q_{k-1}'(\hat{q} - 1)}{\sigma_k} = \frac{-\hat{p}\,Q_{k-1}'}{\sigma_k}.$$

Both cases unify into $A_i = Q_{k-1}'(r_i - \hat{p})/\sigma_k$, which is exact for any finite $G$ and has the $C \cdot (r_i - \hat{p})$ structure guaranteeing sign balance. Following the GRPO pattern with $S = G\hat{p}$ successes and $F = G\hat{q}$ failures:

$$\nabla = \frac{\hat{q}\,Q_{k-1}'}{\sigma_k} \cdot S \cdot \mathbf{m}_S - \frac{\hat{p}\,Q_{k-1}'}{\sigma_k} \cdot F \cdot \mathbf{m}_F = \frac{G\hat{p}\hat{q}\,Q_{k-1}'}{\sigma_k}\left[\mathbf{m}_S - \mathbf{m}_F\right].$$

Substituting $Q_{k-1}' = (1 - R_k)/\hat{q}$ and $\sigma_k = \sqrt{R_k(1 - R_k)}$:

$$m_S = m_F = \hat{p}\,\frac{1 - R_k}{\sigma_k} = \hat{p}\sqrt{\frac{1 - R_k}{R_k}}.$$

At $k = 1$, $R_1 = S/G = \hat{p}$, so $m_S = m_F = \sqrt{\hat{p}\hat{q}}$, recovering GRPO. In the population limit ($G \to \infty$), $R_k \to 1 - q^k$ and $Q_{k-1}' \to q^{k-1}$, giving $A_i \to q^{k-1}(r_i - p)/\sigma_k$ and $m_S = m_F = p\sqrt{q^k/(1 - q^k)}$.

Remark. If one were to naively set the baseline to $R_k$ instead of $\hat{p}$, i.e. $A_i = (r_i - R_k)/\sigma_k$, one would obtain $m_S = p(1 - R_k)/\sigma_k \neq m_F = qR_k/\sigma_k$ for $k > 1$. The per-sample baseline must remain at $\hat{p}$ for sign symmetry to hold.
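
The binomial identity behind the simplification, and the claim that both per-sample advantages collapse to $Q_{k-1}'(r_i - \hat{p})/\sigma_k$, can be checked exactly with rational arithmetic. The sketch below compares only the numerators (the common factor $1/\sigma_k$ is dropped, which also sidesteps $\sigma_k = 0$ at $R_k \in \{0, 1\}$) and assumes $F \geq 1$ and $1 \leq k \leq G$. Function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def one_minus_pass_at_k(G, F, k):
    """1 - R_k = C(F, k) / C(G, k), as an exact fraction."""
    return Fraction(comb(F, k), comb(G, k))


def q_prime(G, F, k):
    """Q'_{k-1} = C(F-1, k-1) / C(G-1, k-1), as an exact fraction."""
    return Fraction(comb(F - 1, k - 1), comb(G - 1, k - 1))


def check_unified_form(G, F, k):
    """True if both advantage numerators equal Q'_{k-1} * (r_i - p_hat)."""
    p_hat = Fraction(G - F, G)
    q_hat = Fraction(F, G)
    miss = one_minus_pass_at_k(G, F, k)
    qp = q_prime(G, F, k)
    identity = miss == q_hat * qp              # 1 - R_k = q_hat * Q'
    success = miss == qp * (1 - p_hat)         # numerator of A_s
    failure = miss - qp == qp * (0 - p_hat)    # numerator of A_f
    return identity and success and failure


G = 8
print(all(check_unified_form(G, F, k)
          for F in range(1, G + 1)
          for k in range(1, G + 1)))
```

Using `Fraction` avoids floating-point tolerance choices, so the check is an exact confirmation for the enumerated group size rather than an approximate one.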

#### 8.4.3 Mix Pass@1/Pass@$k$ (Chen et al.)

Chen et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib11)) also propose a convex combination of pass@1 and pass@$k$ advantages:

$$A_i^{\mathrm{mix}} = p \cdot \hat{A}_{\mathrm{pass@}k} + q \cdot \hat{A}_{\mathrm{pass@}1}.$$

Both components have the mean-centered structure $C \cdot (r_i - p)$:

$$\hat{A}_{\mathrm{pass@}1} = \frac{r_i - p}{\sigma_1}, \qquad \hat{A}_{\mathrm{pass@}k} = \frac{q^{k-1}(r_i - p)}{\sigma_k}.$$

A linear combination of mean-centered terms remains mean-centered:

$$A_i^{\mathrm{mix}} = \left(\frac{pq^{k-1}}{\sigma_k} + \frac{q}{\sigma_1}\right)(r_i - p) =: C_{\mathrm{mix}}(r_i - p).$$

Since this is of the form $C \cdot (r_i - p)$, it is automatically sign-balanced. Following the GRPO pattern:

$$\nabla = C_{\mathrm{mix}}\left[q\sum_{i \in S}\nabla \log \pi_\theta(\tau_i) - p\sum_{i \in F}\nabla \log \pi_\theta(\tau_i)\right] = C_{\mathrm{mix}} \cdot Bpq\left[\mathbf{m}_S - \mathbf{m}_F\right].$$

So $m_S = m_F = C_{\mathrm{mix}} \cdot pq$. Writing $C_k := q^{k-1}/\sigma_k$ and $C_1 := 1/\sigma_1$, the success-side advantages are $\hat{A}_s^{(k)} = C_k q = q^k/\sigma_k$ and $\hat{A}_s^{(1)} = C_1 q = q/\sigma_1$, so:

$$C_{\mathrm{mix}} \cdot pq = p\left(p \cdot \underbrace{C_k q}_{\hat{A}_s^{(k)}} + q \cdot \underbrace{C_1 q}_{\hat{A}_s^{(1)}}\right) = p\left(p\,\hat{A}_{\mathrm{pass@}k} + q\,\hat{A}_{\mathrm{pass@}1}\right),$$

confirming $m_S = m_F = p\left(p\,\hat{A}_{\mathrm{pass@}k} + q\,\hat{A}_{\mathrm{pass@}1}\right)$, with both advantages evaluated on the success side ($r_i = 1$).

### 8.5 Biasing towards Negatives with a Shifted Baseline

Several policy weights shift the positive weights up or down by adding a fixed offset to GRPO (Arnal et al., [2026](https://arxiv.org/html/2607.01490#bib.bib5)) or to REINFORCE (Sutton, [1988](https://arxiv.org/html/2607.01490#bib.bib60)), or by using the minimum reward over the batch as an offset (Garg and Venkatesh, [2025](https://arxiv.org/html/2607.01490#bib.bib19)). We derive the advantage weights of AsymRL; the derivation generalizes to other types of offsets. The AsymRL advantage is:

$$A_i = r_i - (\bar{r} + \delta).$$

With binary rewards $r_i \in \{0,1\}$ and $\bar{r} = p$, the per-trajectory weights are:

$$A_s = 1 - p - \delta = q - \delta, \qquad A_f = -p - \delta = -(p + \delta).$$

Splitting the gradient over successful and failed trajectories:

$$\begin{aligned}
\nabla_{\mathrm{AsymRL}} &= \sum_{i \in S}(q - \delta)\,\nabla \log \pi_\theta(\tau_i) - \sum_{i \in F}(p + \delta)\,\nabla \log \pi_\theta(\tau_i) \\
&= (q - \delta) \cdot |S| \cdot \mathbf{m}_S - (p + \delta) \cdot |F| \cdot \mathbf{m}_F \\
&= Bp(q - \delta)\,\mathbf{m}_S - Bq(p + \delta)\,\mathbf{m}_F,
\end{aligned}$$

so $m_S = Bp(q - \delta)$ and $m_F = Bq(p + \delta)$. At $\delta = 0$ we recover mean GRPO, but in practice Arnal et al. ([2026](https://arxiv.org/html/2607.01490#bib.bib5)) recommend $\delta = 0.01$.
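
One consequence of the AsymRL coefficients is that the gap between the two masses does not depend on the solve rate: $Bq(p+\delta) - Bp(q-\delta) = B\delta(p+q) = B\delta$. The sketch below, with hypothetical function names and binary rewards assumed, computes the advantages on one batch and prints the masses next to their closed forms.

```python
def asymrl_advantages(rewards, delta):
    """A_i = r_i - (mean + delta): the mean baseline shifted up by delta."""
    mean = sum(rewards) / len(rewards)
    return [r - (mean + delta) for r in rewards]


def signed_masses(rewards, advantages):
    """Sum of A_i over successes, and sum of -A_i over failures."""
    m_s = sum(a for r, a in zip(rewards, advantages) if r == 1)
    m_f = sum(-a for r, a in zip(rewards, advantages) if r == 0)
    return m_s, m_f


rewards = [1, 1, 0, 0, 0, 0, 0, 0]  # B = 8, p = 0.25, q = 0.75
delta = 0.01
m_s, m_f = signed_masses(rewards, asymrl_advantages(rewards, delta))
b, p, q = 8, 0.25, 0.75
print(m_s, b * p * (q - delta))   # positive mass vs. B*p*(q - delta)
print(m_f, b * q * (p + delta))   # negative mass vs. B*q*(p + delta)
print(m_f - m_s, b * delta)       # gap vs. B*delta
```

Note that for $\delta > q$ the success-side coefficient $q - \delta$ becomes negative, so `m_s` is a signed quantity in this sketch.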

### 8.6 Quantile Baseline and MC-GRPO

The quantile baseline (Dabney et al., [2018](https://arxiv.org/html/2607.01490#bib.bib15)) replaces the mean baseline with the $\tau$-quantile of the batch rewards: $A_i = r_i - Q_\tau(\mathbf{r})$, where $Q_\tau$ is the $\tau$-th quantile (in this subsection $\tau$ without a subscript denotes the quantile level, not a trajectory). With binary rewards $r_i \in \{0,1\}$, the quantile is a step function of the solve rate:

$$Q_\tau(\mathbf{r}) = \begin{cases} 0 & \text{if } p \leq \tau, \\ 1 & \text{if } p > \tau. \end{cases}$$

##### Case 1: $p \leq \tau$ (hard problems).

The baseline is $0$, so $A_s = 1$, $A_f = 0$. Only successes contribute:

$$\nabla = \sum_{i \in S}\nabla \log \pi_\theta(\tau_i) = Bp\,\bar{\nabla}_S.$$

Hence $m_S = p$ and $m_F = 0$.

##### Case 2: $p > \tau$ (easy problems).

The baseline is $1$, so $A_s = 0$, $A_f = -1$. Only failures contribute:

$$\nabla = -\sum_{i \in F}\nabla \log \pi_\theta(\tau_i) = -Bq\,\bar{\nabla}_F.$$

Hence $m_S = 0$ and $m_F = q$.

Combining both cases: $m_S = p\,\mathbf{1}[p \leq \tau]$ and $m_F = q\,\mathbf{1}[p > \tau]$. The quantile baseline is always sign-biased for binary rewards: it learns only from successes on hard problems and only from failures on easy ones, never both simultaneously.

MC-GRPO (Kim, [2026](https://arxiv.org/html/2607.01490#bib.bib31)) uses the median as the baseline, $A_i = r_i - \mathrm{med}(\mathbf{r})$, which is the special case $\tau = \tfrac{1}{2}$. With binary rewards, $\mathrm{med}(\mathbf{r}) = \mathbf{1}[p > \tfrac{1}{2}]$, so the masses reduce to $m_S = p\,\mathbf{1}[p \leq \tfrac{1}{2}]$ and $m_F = q\,\mathbf{1}[p > \tfrac{1}{2}]$: the method switches abruptly from pure positive reinforcement to pure negative reinforcement at $p = \tfrac{1}{2}$.
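
The abrupt switch of the median baseline is easy to see on two small batches. The sketch below is an illustration under stated assumptions: the function name is hypothetical, rewards are binary, the median is implemented directly as the indicator $\mathbf{1}[p > \tfrac{1}{2}]$ used in the text (so an exact tie at $p = \tfrac{1}{2}$ resolves to baseline 0), and masses are reported per rollout, i.e. divided by the batch size.

```python
def mc_grpo_masses(rewards):
    """Per-rollout masses (m_S, m_F) under the median baseline for binary rewards."""
    b = len(rewards)
    p = sum(rewards) / b
    baseline = 1 if p > 0.5 else 0  # med(r) = 1[p > 1/2], as stated in the text
    advantages = [r - baseline for r in rewards]
    m_s = sum(a for r, a in zip(rewards, advantages) if r == 1) / b
    m_f = sum(-a for r, a in zip(rewards, advantages) if r == 0) / b
    return m_s, m_f


print(mc_grpo_masses([1, 0, 0, 0]))  # hard prompt (p = 0.25): only successes carry mass
print(mc_grpo_masses([1, 1, 1, 0]))  # easy prompt (p = 0.75): only failures carry mass
```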

### 8.7 Binary Contrastive

The binary contrastive weight (Greensmith et al., [2004](https://arxiv.org/html/2607.01490#bib.bib21)) is $W(\tau) = \mathbf{1}_{r=1} - \frac{\bar{p}}{1 - \bar{p}}\,\mathbf{1}_{r=0}$, so we have $m_S = p \times 1$ and $m_F = (1 - p) \times \frac{p}{1 - p} = p$.

### 8.8 Power Norm

The Power Norm advantage (Andrychowicz et al., [2021](https://arxiv.org/html/2607.01490#bib.bib4)) is $A_i = \frac{r_i - \bar{r}}{[\bar{r}(1 - \bar{r})]^{\gamma}}$. With binary rewards and $\bar{r} = p$:

$$A_s = \frac{q}{(pq)^{\gamma}}, \qquad A_f = \frac{-p}{(pq)^{\gamma}}.$$

The gradient is:

$$\nabla = \frac{q}{(pq)^{\gamma}} \cdot Bp\,\mathbf{m}_S - \frac{p}{(pq)^{\gamma}} \cdot Bq\,\mathbf{m}_F = \frac{Bpq}{(pq)^{\gamma}}\left[\mathbf{m}_S - \mathbf{m}_F\right].$$

Hence $m_S = m_F = pq/(pq)^{\gamma} = p^{1-\gamma}q^{1-\gamma}$. At $\gamma = 1/2$ this recovers GRPO; at $\gamma = 0$ it recovers Dr. GRPO.
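
The two special cases of Power Norm can be checked directly from the per-trajectory advantages. This is a small sketch with a hypothetical function name; it assumes binary rewards with $0 < p < 1$ and reports per-rollout masses (the batch size $B$ is dropped, as in the closing formula above).

```python
import math


def power_norm_masses(p, gamma):
    """Per-rollout masses p*A_s and q*|A_f| for A_i = (r_i - p) / (p*q)**gamma."""
    q = 1.0 - p
    scale = (p * q) ** gamma
    a_s = q / scale    # advantage of a success
    a_f = -p / scale   # advantage of a failure (negative)
    return p * a_s, q * -a_f


p = 0.25
print(power_norm_masses(p, 0.5), math.sqrt(p * (1 - p)))  # gamma = 1/2 vs. sqrt(pq), the GRPO mass
print(power_norm_masses(p, 0.0), p * (1 - p))             # gamma = 0 vs. pq, the Dr. GRPO mass
```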

### 8.9 Function-Based Policy Weights

Both Softmax and Logmeanexp aggregate rewards through the exponential partition function. With $|S| = Gp$ successes and $|F| = Gq$ failures in a batch of $G$ rollouts, the partition function is:

$$Z := |S|\,e^{\beta} + |F| = G(pe^{\beta} + q) = G\bigl[1 + p(e^{\beta} - 1)\bigr].$$

The per-sample softmax weights are $\mathrm{softmax}_\beta(1) = e^{\beta}/Z$ and $\mathrm{softmax}_\beta(0) = 1/Z$.

##### Softmax (Shao et al., [2024](https://arxiv.org/html/2607.01490#bib.bib57)).

The advantage $A_i = \mathrm{softmax}_\beta(r_i) - 1/G$ subtracts a uniform baseline. Substituting:

$$A_s = \frac{e^{\beta}}{Z} - \frac{1}{G} = \frac{(e^{\beta} - 1)q}{Z}, \qquad A_f = \frac{1}{Z} - \frac{1}{G} = \frac{-p(e^{\beta} - 1)}{Z}.$$

The gradient is:

$$\nabla = A_s \cdot Gp\,\mathbf{m}_S - |A_f| \cdot Gq\,\mathbf{m}_F = \frac{(e^{\beta} - 1)pq}{pe^{\beta} + q}\left[\mathbf{m}_S - \mathbf{m}_F\right].$$

Hence $m_S = m_F = \frac{(e^{\beta} - 1)pq}{1 + p(e^{\beta} - 1)}$, which is sign-balanced. As $\beta \to 0$, this recovers Dr. GRPO; as $\beta \to \infty$, it converges to the pass@$k$ advantage.

##### Logmeanexp (Jiang et al., [2025](https://arxiv.org/html/2607.01490#bib.bib27)).

The advantage $A_i = \mathrm{lme}_\beta(\mathbf{r}) - \mathrm{lme}_\beta(\mathbf{r}_{-i})$ uses the same partition function through a leave-one-out log-difference, where $\mathrm{lme}_\beta(\mathbf{r}) = \frac{1}{\beta}\log(Z/G)$. Removing a success shifts $Z \to Z - e^{\beta}$ while removing a failure shifts $Z \to Z - 1$. Taking the log breaks the sign symmetry: removing a high-weight success changes $\log Z$ more than removing a failure. In expectation:

$$m_S = \frac{pe^{\beta}}{pe^{\beta} + q}, \qquad m_F = \frac{q}{pe^{\beta} + q}.$$

The ratio $m_S/m_F = pe^{\beta}/q$ is sign-biased: successes are exponentially upweighted. As $\beta \to 0$, $m_S \to p$ and $m_F \to q$ (REINFORCE). As $\beta \to \infty$, $m_S \to 1$ and $m_F \to 0$ (pure positive reinforcement).
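
For the Softmax weight, the closed-form mass can be compared against a direct evaluation on a reward vector. The sketch below covers only the Softmax case, not Logmeanexp; the function name is hypothetical and rewards are assumed binary.

```python
import math


def softmax_advantages(rewards, beta):
    """A_i = softmax_beta(r_i) - 1/G over a group of G rollouts."""
    weights = [math.exp(beta * r) for r in rewards]
    z = sum(weights)  # partition function Z = |S| e^beta + |F|
    g = len(rewards)
    return [w / z - 1.0 / g for w in weights]


rewards = [1, 1, 0, 0, 0, 0, 0, 0]  # G = 8, p = 0.25, q = 0.75
beta = 1.0
adv = softmax_advantages(rewards, beta)
m_s = sum(a for r, a in zip(rewards, adv) if r == 1)
m_f = sum(-a for r, a in zip(rewards, adv) if r == 0)
p, q = 0.25, 0.75
closed_form = (math.exp(beta) - 1) * p * q / (p * math.exp(beta) + q)
# Both masses should agree with the closed form up to rounding.
print(m_s, m_f, closed_form)
```

Since the softmax weights sum to one, the advantages sum to zero, which is why the positive and negative masses coincide.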

### 8.10 HA-DW

HA-DW (Yang et al., [2026](https://arxiv.org/html/2607.01490#bib.bib72)) uses a hardness-aware dynamic weight: $W(\tau) = A \cdot \lambda \exp\bigl(-\mathrm{sgn}(\hat{A})\,\mathrm{sgn}(\hat{r} - C_t)\,|\hat{r} - C_t|\bigr)$, where $C_t$ is a running baseline. With binary rewards, $A$ denotes the base advantage and the exponential factor modulates it according to whether the reward exceeds the baseline, giving:

$$m_S = f(C_t, p)\,qp, \qquad m_F = qp,$$

where $f(C_t, p)$ depends on the exponential modulation. The negative mass matches Dr. GRPO while the positive mass is scaled by a hardness-dependent factor.

### 8.11 ReLU Advantage

The ReLU advantage (Srinivasan et al., [2018](https://arxiv.org/html/2607.01490#bib.bib58)) is $A_i = \max(0, r_i - \bar{r})$. With binary rewards, $A_s = \max(0, q) = q$ and $A_f = \max(0, -p) = 0$, which gives a purely positive gradient: $\nabla = q \cdot Bp\,\mathbf{m}_S = Bpq\,\mathbf{m}_S$. Hence $m_S = pq$ and $m_F = 0$. ReLU is sign-biased: it reinforces successes with the same magnitude as Dr. GRPO but completely ignores failures.

## 9 Extension to Multi-Turn and Connection to Resampling

### 9.1 Generalizing the framework to multi-turn rollouts

The decomposition $\nabla_\theta J = m_S \bar{\nabla}_S - m_F \bar{\nabla}_F$ relies only on linearity of expectation over per-token weights and is therefore agnostic to whether a trajectory is generated in a single turn or interleaved with environment / tool feedback across multiple turns. We make the extension explicit here.

##### Setup.

A multi-turn rollout decomposes a trajectory as $\tau = (\tau_1, \tau_2, \dots, \tau_T)$, where turn $t$ is generated from a state $s_t$ that depends on $(\tau_1, \dots, \tau_{t-1})$ and any intervening environment observations $o_{<t}$. Tokens generated by the policy carry a per-token weight $w_t$; environment-supplied tokens are masked from the gradient. The policy gradient is:

$$\nabla_\theta J = \mathbb{E}_\tau\!\left[\sum_{t=1}^{T}\sum_{a \in \tau_t} w_{t,a}\,\nabla_\theta \log \pi_\theta(a \mid s_t)\right]. \tag{15}$$

Splitting $w_{t,a} = w_{t,a}^{+} - w_{t,a}^{-}$ and summing over turns yields

$$\nabla_\theta J = \sum_{t=1}^{T}\big[\,m_S^{(t)}\,\bar{\nabla}_S^{(t)} - m_F^{(t)}\,\bar{\nabla}_F^{(t)}\,\big], \tag{16}$$

with per-turn masses

$$m_S^{(t)} = \rho_t \cdot \mathbb{E}\!\left[\,w_t^{+}\,\big|\,\text{turn } t \text{ reached}\,\right], \qquad m_F^{(t)} = \rho_t \cdot \mathbb{E}\!\left[\,w_t^{-}\,\big|\,\text{turn } t \text{ reached}\,\right],$$

where $\rho_t := \Pr_{\pi_\theta}(\text{turn } t \text{ is reached})$ is the *reachability factor* of turn $t$. Single-turn is the special case $T = 1$, $\rho_1 = 1$, recovering Section [2](https://arxiv.org/html/2607.01490#S2) exactly.

### 9.2 Resampling failures with fixed context $\equiv$ Power-$\alpha$ with $\alpha = 2$

A natural multi-turn schedule is to retry only failed prompts. We show that when the retry uses the same prompt context (no information added between attempts), one round of GRPO-on-failures produces an update with the same gradient mass as Power-$\alpha$ with $\alpha = 2$ (Table [1](https://arxiv.org/html/2607.01490#S2.T1)).

##### Setup.

Fix a prompt with success rate $p$ and let $q := 1 - p$. Round 1 samples a batch of $B$ rollouts and applies GRPO. Round 2 only fires for prompts that failed in round 1 (probability $q$); it samples a fresh batch of $B$ rollouts from the same prompt context and applies GRPO again.

For the round 1 mass we have the standard GRPO $m_S^{(1)} = m_F^{(1)} = Bpq$. The reachability factor for round 2 is $\rho_2 = q$. Because the resample shares the same context, the within-round success rate is again $p$ and within-round GRPO yields the same closed form $Bpq$. Multiplying by reachability: $m_S^{(2)} = m_F^{(2)} = \rho_2 \cdot Bpq = q \cdot Bpq = Bpq^2$.

##### Interpretation.

The $q^{\alpha - 1}$ multiplier in the Power-$\alpha$ family is therefore not arbitrary: for integer $\alpha$, it is exactly the marginal contribution of one extra GRPO retry on the failed prompts. It can be implemented either as a within-batch reweighting (cheap, no extra rollouts) or as an explicit resample loop.
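
The correspondence between retry rounds and the Power-$\alpha$ mass can be written out as two one-line formulas. The sketch below uses hypothetical function names. The round-2 case is the one derived above; extending the reachability factor to $q^{n-1}$ for round $n$ is an assumption made here, namely that each retry fires only after every earlier attempt failed, each with probability $q$.

```python
def retry_round_mass(p, batch_size, round_index):
    """Mass of retry round n (1-based): reachability q^(n-1) times the in-round mass B*p*q."""
    q = 1.0 - p
    reachability = q ** (round_index - 1)  # assumed: all earlier rounds failed
    return reachability * batch_size * p * q


def power_alpha_mass(p, batch_size, alpha):
    """Closed-form Power-alpha mass B * p * q^alpha."""
    return batch_size * p * (1.0 - p) ** alpha


p, b = 0.25, 8
for n in (1, 2, 3):
    # Round n of resampling lines up with Power-alpha at alpha = n.
    print(n, retry_round_mass(p, b, n), power_alpha_mass(p, b, n))
```

The within-batch reweighting reaches the same mass without drawing the extra rollouts that the explicit retry loop would need.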

## 10 FADE Scheduler Additional Results

Algorithm [2](https://arxiv.org/html/2607.01490#alg2) details the full FADE procedure. Two exponential moving averages track the online solve rate $\hat{p}$ and policy entropy $\hat{H}$, which respectively control the difficulty focus $\alpha$ and the sign bias $\delta$. Crucially, $\alpha$ and $\delta$ are updated from the smoothed EMAs, while the per-sample advantage uses the batch-level mean $\bar{r}$. For binary rewards, the resulting per-prompt gradient masses are $m_S = pq^{\alpha}$ and $m_F = pq^{\alpha}/\delta$, so the sign ratio is $m_S/m_F = \delta$: when entropy exceeds $H^{*}$, $\delta = 1$ (sign-balanced); as entropy drops, $\delta$ decreases and the negative mass dominates, triggering exploitation.
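
The scheduling part of FADE, the EMA updates and the two clipping rules listed as steps 5 and 6 of Algorithm 2, can be sketched as a single pure function. This is a hedged illustration rather than the authors' code: the function and argument names are hypothetical, the EMA update is transcribed as it appears in the algorithm listing, and falling back to `alpha_max` when the smoothed solve rate is zero is an assumption added to avoid a division by zero.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def fade_schedule(p_hat, h_hat, batch_mean, batch_entropy,
                  target_entropy, alpha_max, beta=0.02):
    """One scheduler step: update both EMAs, then derive alpha and delta."""
    # EMA updates, written as in the algorithm listing.
    p_hat = beta * p_hat + (1 - beta) * batch_mean
    h_hat = beta * h_hat + (1 - beta) * batch_entropy
    # Difficulty focus: alpha = clip(3 (1 - p_hat) / (2 p_hat), 1, alpha_max).
    if p_hat == 0:
        alpha = alpha_max  # assumed limit as p_hat -> 0
    else:
        alpha = clip(3 * (1 - p_hat) / (2 * p_hat), 1.0, alpha_max)
    # Sign bias: delta = clip(1 + h_hat - H*, 0.3, 1).
    delta = clip(1 + h_hat - target_entropy, 0.3, 1.0)
    return p_hat, h_hat, alpha, delta


# Early training: low solve rate and entropy at target, so hard focus and sign balance.
print(fade_schedule(0.2, 1.0, 0.2, 1.0, target_entropy=1.0, alpha_max=4.0))
# Late training: high solve rate and collapsed entropy, so alpha and delta hit their floors.
print(fade_schedule(0.9, 0.0, 0.9, 0.0, target_entropy=1.0, alpha_max=4.0))
```

Returning the updated EMAs alongside $\alpha$ and $\delta$ keeps the scheduler stateless, so the training loop owns the running statistics.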

Algorithm 1 Power $\alpha$ (static)

Require: Policy $\pi_\theta$, prompts $\mathcal{Q}$, unit tests $\mathcal{T}_q$, power $\alpha \geq 1$

1. for step $t = 1, \ldots, T$ do
2. Sample $G$ rollouts per prompt; score $r_i \leftarrow \textsc{Exec}(\tau_i, \mathcal{T}_q)$
3. $\bar{r} \leftarrow \frac{1}{|\mathcal{B}|}\sum_i r_i$
4. for each rollout $i \in \mathcal{B}$ do
5. $A_i \leftarrow (1-\bar{r})^{\alpha-1}(r_i - \bar{r})$
6. end for
7. Update $\theta$ via PPO clipping ($\varepsilon = 0.2$) using advantages $\{A_i\}$
8. end for
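Lines 3 to 6 of Algorithm 1 can be sketched in plain Python as follows. The function name and the list-based interface are illustrative choices, not the authors' implementation; rollout sampling and the PPO update are omitted.

```python
def power_alpha_advantages(rewards, alpha=1.0):
    """Power-alpha advantages A_i = (1 - r_bar)**(alpha - 1) * (r_i - r_bar).

    `rewards` holds the scores r_i of the rollouts in the batch; alpha = 1
    reduces to the mean-centered reward.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if alpha < 1.0:
        raise ValueError("Algorithm 1 requires alpha >= 1")
    r_bar = sum(rewards) / len(rewards)          # batch mean reward
    focus = (1.0 - r_bar) ** (alpha - 1.0)       # difficulty-focus factor
    return [focus * (r - r_bar) for r in rewards]


if __name__ == "__main__":
    # One success out of four rollouts: a hard batch keeps most of its weight.
    print(power_alpha_advantages([1.0, 0.0, 0.0, 0.0], alpha=2.0))
    # Three successes out of four: an easy batch is strongly down-weighted.
    print(power_alpha_advantages([1.0, 1.0, 1.0, 0.0], alpha=2.0))
```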

Algorithm 2 FADE: Focal Advantage with Dynamic Entropy (additions over Alg. [1](https://arxiv.org/html/2607.01490#alg1))

Require: Policy $\pi_\theta$, prompts $\mathcal{Q}$, unit tests $\mathcal{T}_q$, target entropy $H^*$, max power $\alpha_{\max}$, EMA coefficient $\beta = 0.02$

1. Initialize $\hat{p} \leftarrow 0.5$, $\hat{H} \leftarrow H_0(\pi_\theta)$
2. for step $t = 1, \ldots, T$ do
3. Sample $G$ rollouts per prompt; score $r_i \leftarrow \textsc{Exec}(\tau_i, \mathcal{T}_q)$
4. $\bar{r} \leftarrow \frac{1}{|\mathcal{B}|}\sum_i r_i$, $H_t \leftarrow \frac{1}{|\mathcal{B}|}\sum_i \frac{1}{T_i}\sum_t -\log \pi_\theta(a_t^{(i)} \mid q, a_{<t}^{(i)})$
5. $\hat{p} \leftarrow \beta\,\hat{p} + (1-\beta)\,\bar{r}$, $\hat{H} \leftarrow \beta\,\hat{H} + (1-\beta)\,H_t$
6. $\alpha \leftarrow \operatorname{clip}\bigl(\tfrac{3(1-\hat{p})}{2\hat{p}},\; 1,\; \alpha_{\max}\bigr)$, $\delta \leftarrow \operatorname{clip}\bigl(1 + \hat{H} - H^*,\; 0.3,\; 1\bigr)$
7. for each rollout $i \in \mathcal{B}$ do
8. $A_i \leftarrow (1-\bar{r})^{\alpha-1}(r_i - \bar{r}) \;/\; \begin{cases} 1 & r_i \geq \bar{r} \\ \delta & r_i < \bar{r} \end{cases}$
9. end for
10. Update $\theta$ via PPO clipping ($\varepsilon = 0.2$) using advantages $\{A_i\}$
11. end for
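The scheduler part of Algorithm 2 (lines 1 and 5 to 8) can be sketched as below. The class and function names are hypothetical, the EMA update is transcribed exactly as written in line 5, and the guard for $\hat{p} = 0$ is an added assumption to avoid a division by zero. Sampling, entropy measurement, and the PPO update are left out.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


class FadeScheduler:
    """Tracks EMAs of solve rate and entropy and maps them to (alpha, delta)."""

    def __init__(self, target_entropy, alpha_max, initial_entropy, beta=0.02):
        self.target_entropy = target_entropy
        self.alpha_max = alpha_max
        self.beta = beta
        self.p_hat = 0.5                 # line 1: initial solve-rate estimate
        self.h_hat = initial_entropy     # line 1: entropy of the initial policy

    def update(self, mean_reward, batch_entropy):
        # Line 5, as written in the listing.
        self.p_hat = self.beta * self.p_hat + (1.0 - self.beta) * mean_reward
        self.h_hat = self.beta * self.h_hat + (1.0 - self.beta) * batch_entropy
        # Line 6. Assumption: fall back to alpha_max when p_hat is zero.
        if self.p_hat <= 0.0:
            alpha = self.alpha_max
        else:
            alpha = clip(3.0 * (1.0 - self.p_hat) / (2.0 * self.p_hat),
                         1.0, self.alpha_max)
        delta = clip(1.0 + self.h_hat - self.target_entropy, 0.3, 1.0)
        return alpha, delta


def fade_advantages(rewards, alpha, delta):
    """Line 8: Power-alpha advantage, with below-mean rollouts divided by delta."""
    r_bar = sum(rewards) / len(rewards)
    focus = (1.0 - r_bar) ** (alpha - 1.0)
    return [focus * (r - r_bar) / (1.0 if r >= r_bar else delta)
            for r in rewards]


if __name__ == "__main__":
    sched = FadeScheduler(target_entropy=0.6, alpha_max=3.0, initial_entropy=0.6)
    alpha, delta = sched.update(mean_reward=0.5, batch_entropy=0.4)
    print(alpha, delta, fade_advantages([1.0, 0.0, 0.0, 0.0], alpha, delta))
```

With $\delta < 1$ the failure advantages grow in magnitude, which is how the negative mass comes to dominate once entropy falls below $H^*$.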

Figure [9](https://arxiv.org/html/2607.01490#S10.F9) summarizes our ablation study on each FADE component.

$$\delta(t) = \delta_0 - (\delta_0 - \delta_{\min})\,\frac{\log(1 + t/\tau_\delta)}{\log(1 + T/\tau_\delta)}, \quad \delta_0 = 1.0,\; \delta_{\min} = 0.802,\; \tau_\delta = 12264, \tag{17}$$

$$\alpha(t) = \alpha_0 - (\alpha_0 - \alpha_{\min})\,\frac{\log(1 + t/\tau_\alpha)}{\log(1 + T/\tau_\alpha)}, \quad \alpha_0 = 2.0,\; \alpha_{\min} = 1.917,\; \tau_\alpha = 8500, \tag{18}$$

where $T$ is the total number of training steps. We estimated these parameters by fitting a logarithmic decay curve to the per-step evolution of $\alpha$ and $\delta$ in the Qwen 2.5 7B runs.
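Equations (17) and (18) share one functional form, which the following sketch evaluates with the fitted constants quoted above. The helper names are illustrative, and the total step count in the usage example is a placeholder rather than a value from the paper.

```python
import math


def log_decay(t, total_steps, start, end, tau):
    """Logarithmic decay from `start` at t=0 to `end` at t=total_steps."""
    progress = math.log1p(t / tau) / math.log1p(total_steps / tau)
    return start - (start - end) * progress


def delta_schedule(t, total_steps):
    """Eq. (17) with the fitted Qwen 2.5 7B constants."""
    return log_decay(t, total_steps, start=1.0, end=0.802, tau=12264.0)


def alpha_schedule(t, total_steps):
    """Eq. (18) with the fitted Qwen 2.5 7B constants."""
    return log_decay(t, total_steps, start=2.0, end=1.917, tau=8500.0)


if __name__ == "__main__":
    total = 20000  # placeholder training length
    for step in (0, 5000, 10000, 20000):
        print(step, delta_schedule(step, total), alpha_schedule(step, total))
```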

![Refer to caption](https://arxiv.org/html/2607.01490v1/x10.png)

Figure 9: Ablation study on FADE reveals that adaptive $\alpha$ only collapses, whereas scheduling both on entropy and ablating $\alpha$ worsens pass@$100$ diversity with Qwen 2.5 7B on LiveCodeBench at 8k reasoning.

## 11 Training and Evaluation Details

### 11.1 Infrastructure and Hyperparameters

Similar to Gehring et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib20)); Noukhovitch et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib48)), we use an asynchronous distributed RL framework and a separate CPU cluster for code evaluations. Concretely, we have a set of GPUs and CPUs that are divided into:

- samplers (H-100): produce code generations and send them to the CPUs,
- evaluators (CPU): evaluate the generated code against unit tests,
- trainers (H-100): receive the code tokens, their log probabilities and rewards to perform a backward update on our model.

The model travels between samplers and trainers so we always have the most updated model for generations. In practice, sampling is much slower than training so we have a trainer/sampler ratio of roughly $0.14$ for Qwen 2.5 7B and $0.375$ for CWM 32B. For Qwen 2.5 7B runs, we use $8$ nodes ($64$ GPUs) with 1 trainer node and 7 sampler nodes. For CWM 32B runs, we use 32 nodes with 12 trainer nodes and 20 sampler nodes. Since we train with reasoning, sequence lengths vary, so we batch our answers by tokens rather than per sample, with a maximum of $32768$ tokens per batch. Since we generate at most $8192$ tokens for the Qwen 2.5 7B and $32768$ for the CWM 32B, we have at least $4$ and $1$ samples per batch, respectively.

Table 2: Qwen 2.5 7B: gradient ratio vs. baseline and learning rate required to equalize gradient magnitude across normalization schemes (baseline LR $= 1\mathrm{e}{-7}$).

We use all advantages within the PPO clipping framework

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \tag{19}$$

$$L_t(\theta) = -\min\Big(r_t(\theta)\,\hat{A}_i,\; \text{clip}\big(r_t(\theta),\, 1-\varepsilon_{\text{low}},\, 1+\varepsilon_{\text{high}}\big)\,\hat{A}_i\Big) \tag{20}$$

with $\varepsilon_{\text{low}} = \varepsilon_{\text{high}} = 0.2$ for Qwen 2.5 7B and $\varepsilon_{\text{low}} = 0.2$, $\varepsilon_{\text{high}} = 0.25$ for CWM 32B. The aggregated loss is $\mathcal{L}(\theta) = \frac{1}{T_{\max}}\sum_i \sum_{t \in \mathcal{A}_i} L_t(\theta)$ where $\mathcal{A}_i$ is the set of agent (model-generated) token positions in rollout $i$, and $T_{\max}$ is the maximum token budget per microbatch (8192 $\times$ batch size for the 7B model and 32768 $\times$ batch size for the 32B model).
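A scalar sketch of Eqs. (19) and (20) and of the token-budget normalization is given below. It operates on precomputed probability ratios, and the function names and nested-list layout are illustrative assumptions rather than the training code.

```python
def ppo_token_loss(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Eq. (20): negative clipped surrogate for one token."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return -min(ratio * advantage, clipped_ratio * advantage)


def aggregated_loss(ratios_per_rollout, advantages, token_budget,
                    eps_low=0.2, eps_high=0.2):
    """Sum of per-token losses over agent tokens, divided by T_max.

    ratios_per_rollout[i] lists r_t(theta) for the model-generated tokens of
    rollout i; advantages[i] is that rollout's advantage.
    """
    total = 0.0
    for ratios, adv in zip(ratios_per_rollout, advantages):
        for ratio in ratios:
            total += ppo_token_loss(ratio, adv, eps_low, eps_high)
    return total / token_budget


if __name__ == "__main__":
    # Positive advantage: the gain is capped once the ratio exceeds 1 + eps_high.
    print(ppo_token_loss(1.5, 1.0))
    # Negative advantage: the loss is bounded below by the clipped term.
    print(ppo_token_loss(0.5, -1.0))
```

Dividing by a fixed token budget rather than by the number of tokens actually present means short rollouts contribute proportionally less to each update.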

### 11.2 Full Evaluation Results

We evaluate models trained with different advantage functions at their peak performance during training (typically around $20\times 10^{3}$ RL steps), using the same training setup across all runs and varying only the reward normalization. To reduce variance across checkpoints, we select evaluation points where pass@$k$ has stabilized, meaning at least $1000$ consecutive steps yield similar results. The number of training steps required for each advantage can be found in [Table 6](https://arxiv.org/html/2607.01490#S11.T6). For each pass@$k$ metric, we draw $2k$ samples and compute the mean over $400$ groups of size $2k$ to estimate the standard deviation. Results can be found for LiveCodeBench v6 in [Table 3](https://arxiv.org/html/2607.01490#S11.T3), v5 in [Table 6](https://arxiv.org/html/2607.01490#S11.T6) and for AIME benchmarks in [Table 4](https://arxiv.org/html/2607.01490#S11.T4) and [Table 5](https://arxiv.org/html/2607.01490#S11.T5).

Table 3: Pass@$k$ evaluation results with standard deviations on LiveCodeBench v6 (454 problems, Aug 2024-May 2025).

Table 4: Pass@$k$ evaluation on AIME benchmarks for Qwen 2.5 7B SFT (14k training budget) (%).

Table 5: Pass@$k$ evaluation on AIME benchmarks for CWM 32B (30k training budget) (%).

Table 6: Pass@$k$ evaluation results with standard deviations on LiveCodeBench v5 (879 problems: 279 easy, 330 medium, 270 hard). 7B at 8k token budget, CWM 32B at 30k. Steps is the RL step selected for evaluation when evaluation results plateau (SFT $= 0$, i.e. no RL).

### 11.3 Gradient Ratios in Practice

In Section [4.1](https://arxiv.org/html/2607.01490#S4.SS1), we claim the speed of entropy collapse depends on the positive-to-negative ratio $\rho(p) = \frac{m_S\,p}{m_F\,q}$ and show empirically in Figure [4](https://arxiv.org/html/2607.01490#S4.F4) a strong correlation between this ratio and the final policy entropy after a fixed number of training steps. Mid-way through training (at approximately 15k gradient steps), we observed an average solve rate of $0.3$ for Qwen 2.5 7B and $0.5$ for the CWM 32B. Using those values for $\bar{p}$, Table [7](https://arxiv.org/html/2607.01490#S12.T7) shows the positive-to-negative mass ratios used per policy weight. All sign-balanced methods share $\rho(p) = \frac{p}{1-p}$ since $m_S = m_F$.

To compare entropy collapse under equalized gradient scales, we adapt the learning rate $\eta$ for each method using their $\eta \cdot \max_i |A_i|$ magnitude. The reference $\eta$ is $1\mathrm{e}{-7}$ for the Qwen 2.5 7B and $1.8\mathrm{e}{-7}$ for CWM 32B (see [Table 2](https://arxiv.org/html/2607.01490#S11.T2)). Using GRPO as a reference, policy weights with smaller maximum values get larger learning rates and vice versa. This ensures differences are due to the policy weight function, not its overall scale. Figure [14](https://arxiv.org/html/2607.01490#S14.F14) shows how a non-equalized learning rate can speed up or slow down entropy collapse.
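The two quantities used in this subsection can be sketched as follows. The equalization rule shown, which keeps $\eta \cdot \max_i |A_i|$ equal to the reference value, is an assumed reading of the description above, and the function names are illustrative.

```python
def positive_to_negative_ratio(p, m_success, m_failure):
    """rho(p) = (m_S * p) / (m_F * q) with q = 1 - p."""
    return (m_success * p) / (m_failure * (1.0 - p))


def equalized_lr(reference_lr, reference_max_abs_adv, method_max_abs_adv):
    """Learning rate that matches the reference eta * max|A| product.

    Assumption: eta_method * max|A_method| = eta_ref * max|A_ref|, so methods
    with smaller maximum advantages receive larger learning rates.
    """
    return reference_lr * reference_max_abs_adv / method_max_abs_adv


if __name__ == "__main__":
    # Sign-balanced methods (m_S = m_F) at the two observed solve rates.
    print(positive_to_negative_ratio(0.3, 1.0, 1.0))
    print(positive_to_negative_ratio(0.5, 1.0, 1.0))
    # A method whose largest advantage is half the reference gets twice the LR.
    print(equalized_lr(1e-7, 1.0, 0.5))
```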

## 12 Pass@$k$ Scaling Law: Proof and Fitting Details

We frame diversity as: how fast does pass@$k$ increase with $k$? If we define $\text{pass@}k = \mathbb{E}_p[1-(1-p)^k] = 1 - \mathbb{E}_p[(1-p)^k]$ where $p$ is the per-problem pass@$1$ rate, the asymptotic behavior depends on the distribution of $p$ near $0$.

### 12.1 Theory of Scaling Laws of Pass@k

We adapt the proof from Schaeffer et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib54)). Let $p_D(p)$ be the density of pass@$1$ rates over a dataset $D$, and $\mathrm{pass}_{D@k} = \mathbb{E}_{p\sim D}[1-(1-p)^k]$.

###### Theorem 12.1 (Dichotomy of Scaling Behavior).

The asymptotic behavior of $\mathrm{pass}_{D@k}$ as $k \to \infty$ is determined by $p_D(p)$ near $p = 0$:

1. Power-law regime: If $p_D(p) = Cp^{b-1} + O(p^{b-1+\theta})$ as $p \to 0^{+}$ ($C, b, \theta > 0$), then $\mathrm{pass}_{D@k} = 1 - C\,\Gamma(b)\,k^{-b} + o(k^{-b})$.
2. Rapidly decaying regime: If $p_D(p) \leq C\exp(-c/p^{\alpha})$ as $p \to 0^{+}$ ($c, \alpha > 0$), then $1 - \mathrm{pass}_{D@k} = o(k^{-b})$ for all $b > 0$.

###### Proof.

Since $(1-p)^k \approx e^{-kp}$ for large $k$, we have $1 - \mathrm{pass}_{D@k} \approx \int_0^1 e^{-kp}\,p_D(p)\,dp$, a Laplace-type integral dominated by small $p$.

Case 1. With $p_D(p) \sim Cp^{b-1}$, the tail $[\epsilon, 1]$ is exponentially small in $k$, so $1 - \mathrm{pass}_{D@k} \approx C\int_0^{\infty} p^{b-1}e^{-kp}\,dp = C\,k^{-b}\,\Gamma(b)$, where the last step uses $u = kp$.

Case 2. The integrand $e^{-kp - c/p^{\alpha}}$ is maximized at $p^{*} \sim k^{-1/(1+\alpha)}$, giving $\int_0^1 e^{-kp}\,p_D(p)\,dp \leq \exp(-\Omega(k^{\alpha/(1+\alpha)}))$, which decays faster than any power of $k$. ∎
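The power-law regime can be checked numerically on a case with a closed form. The sketch below assumes the density $p_D(p) = b\,p^{b-1}$ on $[0,1]$ (a $\mathrm{Beta}(b,1)$ distribution, so $C = b$), chosen purely for illustration; for this density $\mathbb{E}[(1-p)^k] = b\,\mathrm{B}(b, k+1)$ exactly.

```python
import math


def miss_rate_exact(k, b):
    """E[(1-p)^k] for p ~ Beta(b, 1): b * Gamma(b) * Gamma(k+1) / Gamma(b+k+1)."""
    log_value = (math.log(b) + math.lgamma(b)
                 + math.lgamma(k + 1) - math.lgamma(b + k + 1))
    return math.exp(log_value)


def miss_rate_asymptotic(k, b):
    """Leading term C * Gamma(b) * k**(-b) of Theorem 12.1 with C = b."""
    return b * math.gamma(b) * k ** (-b)


if __name__ == "__main__":
    for b in (0.5, 2.0):
        for k in (10, 100, 10000):
            ratio = miss_rate_exact(k, b) / miss_rate_asymptotic(k, b)
            print(b, k, ratio)
```

The ratio approaches 1 as $k$ grows, so $1 - \mathrm{pass}_{D@k}$ is governed by the exponent $b$ of the density near zero.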

### 12.2 From Asymptotic to Empirical Fits

As shown above, the $k$ vs. pass@$k$ scaling law follows an exponential or polynomial growth as $k \to \infty$ depending on the distribution of solve rates near $p = 0$. We analyze the distribution of per-prompt solve rates $\hat{p}$ across the training set at fixed checkpoint steps for the Qwen 2.5 7B (Figure [10(a)](https://arxiv.org/html/2607.01490#S12.F10.sf1)) and CWM 32B models (Figure [10(b)](https://arxiv.org/html/2607.01490#S12.F10.sf2)). Rather than shifting smoothly from hard to easy, the solve-rate distribution is strongly bimodal at both scales, with most mass concentrated near $p = 0$ (unsolved) and $p = 1$ (fully solved) and relatively little weight in the intermediate range.

![Refer to caption](https://arxiv.org/html/2607.01490v1/figures/distribution_fits.png)
(a) 7B

![Refer to caption](https://arxiv.org/html/2607.01490v1/figures/distribution_evolution.png)
(b) 32B

Figure 10: Solve-rate distribution during training. Distribution of per-prompt solve rates at fixed checkpoint steps for (a) 7B and (b) 32B. We compare fits minimizing the mean squared error (MSE) with a Beta-Binomial distribution $\mathrm{BetaBin}(G, a, b)$ and a two-component Beta mixture $\pi\,\mathrm{Beta}(a_1, b_1) + (1-\pi)\,\mathrm{Beta}(a_2, b_2)$.

We found that both the power and exponential laws tended to overestimate pass@$k$ values for lower $k$, so for geometric fitting rather than asymptotic behavior prediction we opted for a shifted power law. We use $G(k) = \exp\bigl(-a(k+k_0)^{-b}\bigr)$ with three parameters: $a$ (scale), $b$ (exponent), and $k_0$ (horizontal shift) (see Figures [11](https://arxiv.org/html/2607.01490#S12.F11), [12](https://arxiv.org/html/2607.01490#S12.F12), [13](https://arxiv.org/html/2607.01490#S12.F13)), which in practice has the lowest MSE across problem difficulty splits.

![Refer to caption](https://arxiv.org/html/2607.01490v1/figures/summary_combined_budget10.png)

![Refer to caption](https://arxiv.org/html/2607.01490v1/figures/summary_combined_budget50.png)

Figure 11: Shifted power-law fits across budgets. Pass@$k$ curves and their fitted shifted power laws at budget 10 (left) and budget 50 (right).

![Refer to caption](https://arxiv.org/html/2607.01490v1/figures/summary_combined_figure.png)

Figure 12: Shifted Power Law Predicts Pass@k across different difficulty splits for an estimated pass@$1$ distribution (from 200 samples) with small (left), medium (middle) and high mass near 0.

Table 7: Positive-to-negative ratio $\rho(p) = \frac{m_S \cdot p}{m_F \cdot q}$ for each advantage function, evaluated at two training solve rates. Since Pass@$k$ only has positive weights we remove it from the calculation.

![Refer to caption](https://arxiv.org/html/2607.01490v1/x11.png)

Figure 13: Evolution of the k vs. pass@k curve during RL for different advantage functions with similar pass@100 performances. We shorten the Power $\alpha$ series to PA.

## 13 Proof Entropy Change

We adapt the proof of Cui et al. ([2025](https://arxiv.org/html/2607.01490#bib.bib14)) to the discrete update setting and generalize to advantages with $\mathbb{E}[A] \neq 0$.

###### Theorem 13.1 (Entropy change per update step).

Let $\pi_\theta(a|s)$ be a policy parameterized by $\theta$, with entropy $\mathcal{H}(\pi) = -\sum_a \pi(a|s)\log\pi(a|s)$. Suppose the parameters are updated via a policy gradient step $\theta_{t+1} = \theta_t + \eta\,\mathbb{E}_{a\sim\pi}[A\nabla_\theta \log\pi_\theta(a|s)]$ where $A$ is an advantage function with $\mathbb{E}[A] = m_S - m_F$. Then, under a local projection assumption, the per-step entropy change satisfies: $\Delta\mathcal{H} \approx \eta\left[(m_S - m_F)\,\mathcal{H} - \operatorname{Cov}(A, \log\pi_\theta)\right] + O(\eta^2)$, where the approximation is a first-order Taylor expansion in the learning rate $\eta$.

###### Proof.

Step 1: First-order approximation. The entropy after one update is $\mathcal{H}(\theta_{t+1}) = \mathcal{H}(\theta_t + \eta\,\mathbb{E}[A\nabla_\theta\log\pi])$. Expanding to first order in $\eta$:

$$\Delta\mathcal{H} := \mathcal{H}(\theta_{t+1}) - \mathcal{H}(\theta_t) = \eta\,\nabla_\theta\mathcal{H}^{\top}\,\mathbb{E}[A\nabla_\theta\log\pi] + O(\eta^2). \tag{21}$$

This linearization is the source of the approximation: higher-order terms in $\eta$ are neglected, so the result is accurate for small learning rates.

Step 2: Entropy gradient. Applying the product rule to $\nabla_\theta\mathcal{H}$:

$$\nabla_\theta\mathcal{H}(\pi) = -\sum_a \bigl(\nabla_\theta\pi(a|s)\log\pi(a|s) + \pi(a|s)\nabla_\theta\log\pi(a|s)\bigr). \tag{22}$$

Using the log-derivative trick $\nabla_\theta\pi = \pi\nabla_\theta\log\pi$ and the identity $\sum_a \pi(a|s)\nabla_\theta\log\pi(a|s) = \nabla_\theta\sum_a\pi(a|s) = 0$, this simplifies to $\nabla_\theta\mathcal{H}(\pi) = -\mathbb{E}_{a\sim\pi}\left[\log\pi\;\nabla_\theta\log\pi\right]$.

Step 3: Substitution. Inserting ([22](https://arxiv.org/html/2607.01490#S13.E22)) into ([21](https://arxiv.org/html/2607.01490#S13.E21)):

$$\Delta\mathcal{H} \approx -\eta\,\mathbb{E}[\log\pi\;\nabla_\theta\log\pi]^{\top}\,\mathbb{E}[A\;\nabla_\theta\log\pi]. \tag{23}$$

Step 4: Local projection assumption. When the score vectors $\nabla_\theta\log\pi$ span the relevant variation space, as holds exactly under natural policy gradients with the Fisher information metric, the inner product in parameter space reduces to an expectation in action space:

$$\mathbb{E}[\log\pi\;\nabla_\theta\log\pi]^{\top}\,\mathbb{E}[A\;\nabla_\theta\log\pi] \approx \mathbb{E}[A\log\pi]. \tag{24}$$

This gives $\Delta\mathcal{H} \approx -\eta\,\mathbb{E}[A\log\pi]$.

Step 5: Mean-covariance decomposition. Decomposing $\mathbb{E}[A\log\pi]$ using $\mathbb{E}[A] = m_S - m_F$ and $\mathbb{E}_{a\sim\pi}[\log\pi] = -\mathcal{H}$:

$$\mathbb{E}[A\log\pi] = \mathbb{E}[A]\,\mathbb{E}[\log\pi] + \operatorname{Cov}(A, \log\pi) = -(m_S - m_F)\,\mathcal{H} + \operatorname{Cov}(A, \log\pi). \tag{25}$$

Substituting:

$$\Delta\mathcal{H} \approx \eta\left[(m_S - m_F)\,\mathcal{H} - \operatorname{Cov}(A, \log\pi)\right]. \tag{26}$$

When $m_S = m_F$ (sign-balanced advantages), the drift term vanishes and entropy dynamics are governed by the covariance alone: $\Delta\mathcal{H} \approx -\eta\,\operatorname{Cov}(A, \log\pi)$. ∎
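The mean-covariance decomposition of Step 5 can be verified on a small discrete policy. This sketch evaluates only the predicted first-order change of Eq. (26); it does not simulate a parameter update, and the example distribution and advantages are made up for illustration.

```python
import math


def entropy(pi):
    """Shannon entropy of a strictly positive probability vector."""
    return -sum(p * math.log(p) for p in pi)


def predicted_entropy_change(pi, advantages, eta):
    """Eq. (26): eta * [E[A] * H - Cov(A, log pi)] with expectations over a ~ pi."""
    mean_adv = sum(p * a for p, a in zip(pi, advantages))
    mean_logp = sum(p * math.log(p) for p in pi)          # equals -H
    cov = sum(p * (a - mean_adv) * (math.log(p) - mean_logp)
              for p, a in zip(pi, advantages))
    return eta * (mean_adv * entropy(pi) - cov)


def direct_entropy_change(pi, advantages, eta):
    """Step 4 form: -eta * E[A log pi]."""
    return -eta * sum(p * a * math.log(p) for p, a in zip(pi, advantages))


if __name__ == "__main__":
    pi = [0.7, 0.2, 0.1]
    balanced = [0.3, -0.7, -0.7]   # E[A] = 0 under pi, rewards the likely action
    print(predicted_entropy_change(pi, balanced, eta=0.1))
    print(direct_entropy_change(pi, balanced, eta=0.1))
```

Both forms agree, and the value is negative here because the advantage is positively correlated with $\log\pi$: rewarding an already likely action lowers entropy.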

## 14 Weight-Estimation Variance of Power $\alpha$

The Power $\alpha$ weight is $w(p) = C\,p\,(1-p)^{\alpha}$, where $C$ normalizes the peak to match GRPO:

$$\max_p\; C\,p(1-p)^{\alpha} = \max_p\; p(1-p) = \tfrac{1}{4}. \tag{27}$$

The maximum of $p(1-p)^{\alpha}$ is attained at $p^{\star} = 1/(1+\alpha)$, giving $C = \frac{(1+\alpha)^{1+\alpha}}{4\,\alpha^{\alpha}}$.

In practice $p$ is unknown; we estimate it from $G$ independent rollouts as $\hat{p} = k/G$, where $k \sim \mathrm{Binomial}(G, p)$, so $\mathrm{Var}(\hat{p}) = p(1-p)/G$. Because the weight is a nonlinear function of $\hat{p}$, it inherits estimation noise. By the delta method ($G$ reasonably large),

$$\mathrm{Var}\bigl(w(\hat{p})\bigr) \;\approx\; \bigl[w'(p)\bigr]^{2}\,\mathrm{Var}(\hat{p}). \tag{28}$$

Differentiating $w(p) = C\,p(1-p)^{\alpha}$:

$$w'(p) = C\,(1-p)^{\alpha-1}\bigl[1 - (1+\alpha)\,p\bigr], \tag{29}$$

so the absolute variance is

$$\mathrm{Var}\bigl(w(\hat{p})\bigr) \;=\; \frac{C^{2}\,p\,(1-p)^{2\alpha-1}\,\bigl[1 - (1+\alpha)\,p\bigr]^{2}}{G}, \tag{30}$$

and the relative variance (coefficient of variation squared) is

$$\frac{\mathrm{Var}(w)}{w^{2}} \;=\; \frac{\bigl[1 - (1+\alpha)\,p\bigr]^{2}}{G\,p\,(1-p)}. \tag{31}$$

This vanishes at the weight mode $p = 1/(1+\alpha)$ and grows quadratically in $\alpha$ away from it. Figure [7](https://arxiv.org/html/2607.01490#S4.F7) shows the variance per solve rate $p$ induced by different $\alpha$ powers. Larger $\alpha$ amplifies the weight-estimation noise for prompts away from the mode, adding a penalty $\frac{\mathrm{Var}(w)}{w^{2}}(s^{2}+v)$ to the effective per-prompt noise (see Eq. ([10](https://arxiv.org/html/2607.01490#S4.E10))). This is a further downward force on the optimal $\alpha$, beyond the $N_{\mathrm{eff}}$ trade-off discussed in Section [4.3](https://arxiv.org/html/2607.01490#S4.SS3). The effect scales as $O(1/G)$.
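The normalizer $C$, the weight, and the delta-method relative variance of Eq. (31) can be evaluated directly. The function names are illustrative.

```python
def peak_normalizer(alpha):
    """C = (1 + alpha)**(1 + alpha) / (4 * alpha**alpha), so max_p w(p) = 1/4."""
    return (1.0 + alpha) ** (1.0 + alpha) / (4.0 * alpha ** alpha)


def power_alpha_weight(p, alpha):
    """w(p) = C * p * (1 - p)**alpha."""
    return peak_normalizer(alpha) * p * (1.0 - p) ** alpha


def relative_weight_variance(p, alpha, num_rollouts):
    """Eq. (31): [1 - (1 + alpha) p]**2 / (G p (1 - p)), for 0 < p < 1."""
    return (1.0 - (1.0 + alpha) * p) ** 2 / (num_rollouts * p * (1.0 - p))


if __name__ == "__main__":
    alpha, group_size = 3.0, 8
    mode = 1.0 / (1.0 + alpha)
    print(power_alpha_weight(mode, alpha))                    # peak weight
    print(relative_weight_variance(mode, alpha, group_size))  # zero at the mode
    print(relative_weight_variance(0.5, alpha, group_size))   # away from the mode
```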

![Refer to caption](https://arxiv.org/html/2607.01490v1/x12.png)

Figure 14: Lower scale preserves entropy. Different advantages at the same learning rate (right) vs. at equalized learning rates (left). Higher gradient ratios correlate with faster entropy collapse.

At mean solve rate $\bar{p}$, replacing GRPO ($\alpha = 1$) with $\alpha = 3$ is justified only if harder prompts are more informative by $s(p)/s(\bar{p}) \geq [2(1-p)]^{2}$. Table [8](https://arxiv.org/html/2607.01490#S14.T8) gives the concrete multipliers.

Table 8: Minimum signal enrichment $s(p)/s(0.5)$ required for $\alpha = 3$ to outperform GRPO at mean solve rate $\bar{p} = 0.5$ (iid noise, $G = \infty$). At finite $G$ the multipliers ease by ${\leq}30\%$ because the $p = 0.5$ reference is itself penalized away from $\alpha = 3$'s weight mode at $0.25$.

### 14.1 Sweet spot and the $\alpha$–$p$ relationship

The most informative difficulty is where the per-prompt ratio peaks, $\mathrm{SNR}^{2}(p) = s(p)^{2}/v(p) \propto s(p)^{2}\,p(1-p)$. For $s = e^{-cp}$:

The flat case peaks at $p = \tfrac{1}{2}$; a mild-to-moderate slope ($c \approx 0.5$–$1$) moves the sweet spot into $[0.3, 0.5]$, with $\alpha \approx 1.2$–$1.45$ for known weights falling to $\approx 1.05$–$1.25$ once the weight-estimation correction is included ($G = 8$–$16$).
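The location of the sweet spot can be computed for $s(p) = e^{-cp}$. Setting the derivative of $e^{-2cp}\,p(1-p)$ to zero gives $2c\,p^{2} - (2c+2)\,p + 1 = 0$, whose root in $(0,1)$ is $p = \bigl[(c+1) - \sqrt{c^{2}+1}\bigr]/(2c)$. The sketch below uses this closed form and cross-checks it with a grid search; the function names are illustrative.

```python
import math


def snr_squared(p, c):
    """Unnormalized SNR^2(p) = exp(-2 c p) * p * (1 - p)."""
    return math.exp(-2.0 * c * p) * p * (1.0 - p)


def sweet_spot(c):
    """Maximizer of snr_squared over (0, 1) in closed form."""
    if c == 0.0:
        return 0.5                       # flat signal: peak of p (1 - p)
    return ((c + 1.0) - math.sqrt(c * c + 1.0)) / (2.0 * c)


def sweet_spot_grid(c, steps=10000):
    """Grid-search cross-check of the closed form."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: snr_squared(p, c))


if __name__ == "__main__":
    for c in (0.0, 0.5, 1.0):
        print(c, sweet_spot(c), sweet_spot_grid(c))
```

For $c = 0.5$ the peak sits near $0.38$ and for $c = 1$ near $0.29$, close to the range quoted above.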

## 15 Weight-Space Analysis Details

### 15.1 Rank-1 Collapse Theory

The per-token policy gradient with advantage $A_i$ is $g_i = A_i\,v_i \otimes h_i$, where $v_i = e_{y_i} - \pi_i \in \mathbb{R}^{V}$ and $h_i \in \mathbb{R}^{d}$ is the hidden state. Decomposing $h_i = \alpha_i u_1 + h_i^{\perp}$ along the top right singular vector $u_1$ of $G = \mathbb{E}[g_i]$ separates the batch gradient into a rank-1 signal and a higher-rank residual:

$$W_\Delta \;=\; \sum_{i=1}^{N} A_i\,v_i \otimes h_i \;=\; \underbrace{\Bigl(\sum_{i=1}^{N} A_i\alpha_i\,v_i\Bigr) \otimes u_1}_{M_1\;\text{(rank 1)}} \;+\; \underbrace{\sum_{i=1}^{N} A_i\,v_i \otimes h_i^{\perp}}_{M_2\;\text{(higher rank)}}. \tag{32}$$

We quantify the degree of rank-1 collapse with $r_1 = \sigma_1^{2}(W_\Delta)/\|W_\Delta\|_F^{2}$, the fraction of the update's Frobenius energy captured by its leading singular component: $r_1 = 1$ means $W_\Delta$ is exactly rank-1.
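The statistic $r_1$ can be computed without a full SVD. The dependency-free sketch below estimates $\sigma_1^2$ by power iteration on $W^\top W$, a substitute for the SVD chosen here for self-containment. It assumes the matrix is a list of rows and that the uniform start vector is not orthogonal to the top right singular vector.

```python
import math


def rank1_energy_fraction(matrix, iterations=500):
    """r_1 = sigma_1^2 / ||W||_F^2, with sigma_1^2 from power iteration."""
    rows, cols = len(matrix), len(matrix[0])
    frob_sq = sum(x * x for row in matrix for x in row)
    if frob_sq == 0.0:
        raise ValueError("r_1 is undefined for the zero matrix")
    v = [1.0 / math.sqrt(cols)] * cols            # unit-norm start vector
    sigma1_sq = 0.0
    for _ in range(iterations):
        wv = [sum(row[j] * v[j] for j in range(cols)) for row in matrix]
        sigma1_sq = sum(x * x for x in wv)        # ||W v||^2 for unit v
        wtwv = [sum(matrix[i][j] * wv[i] for i in range(rows))
                for j in range(cols)]             # W^T W v
        norm = math.sqrt(sum(x * x for x in wtwv))
        if norm == 0.0:
            break
        v = [x / norm for x in wtwv]
    return sigma1_sq / frob_sq


if __name__ == "__main__":
    print(rank1_energy_fraction([[3.0, 4.0], [6.0, 8.0]]))   # outer product
    print(rank1_energy_fraction([[3.0, 0.0], [0.0, 4.0]]))   # two directions
```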

###### Proposition 15.1.

Let $\{(A_i, v_i, h_i)\}_{i=1}^{N}$ be i.i.d. with $h_i = \alpha_i u_1 + h_i^{\perp}$, $h_i^{\perp} \perp u_1$, $u_1$ the top right singular vector of $G$, $\mathbb{E}[A_i^{2}\|v_i\|^{2}\|h_i\|^{2}] < \infty$, and $\mathbb{E}[A_i\alpha_i v_i] \neq 0$. Then

$$r_1 \;\xrightarrow{N\to\infty}\; \frac{\|\mathbb{E}[A_i\alpha_i v_i]\|^{2}}{\|\mathbb{E}[A_i\alpha_i v_i]\|^{2} + \|R\|_F^{2}}\,, \tag{33}$$

where $R = \mathbb{E}[A_i\,v_i \otimes h_i^{\perp}]$ is the higher-rank residual. In particular, $r_1 \to 1$ iff $R = 0$.

###### Proof.

*Orthogonality.* Since $h_i^{\perp} \perp u_1$ for all $i$, we have $M_2\,u_1 = \sum_i A_i v_i (h_i^{\perp\top} u_1) = 0$. Therefore $\langle M_1, M_2\rangle_F = w^{\top}(M_2\,u_1) = 0$, giving $\|W_\Delta\|_F^{2} = \|M_1\|_F^{2} + \|M_2\|_F^{2}$.

*Signal.* $M_1 = w \otimes u_1$ where $w = \sum_i A_i\alpha_i v_i$. Since $u_1$ is a fixed (non-random) direction, the summands $A_i\alpha_i v_i$ are i.i.d. with finite variance ($\mathbb{E}[A_i^{2}\alpha_i^{2}\|v_i\|^{2}] \leq \mathbb{E}[A_i^{2}\|v_i\|^{2}\|h_i\|^{2}] < \infty$). By Kolmogorov's strong law of large numbers, $w/N \to \mathbb{E}[A_i\alpha_i v_i]$ a.s., so $\|M_1\|_F^{2}/N^{2} = \|w/N\|^{2} \to \|\mathbb{E}[A_i\alpha_i v_i]\|^{2}$.

*Residual.* Each entry of $M_2/N = \frac{1}{N}\sum_i A_i v_i \otimes h_i^{\perp}$ converges a.s. to the corresponding entry of $R = \mathbb{E}[A_i v_i \otimes h_i^{\perp}]$ by the same SLLN argument (finite variance is guaranteed by the second-moment condition). By continuity of $\|\cdot\|_F$, $\|M_2\|_F^{2}/N^{2} \to \|R\|_F^{2}$ a.s.

*Combine.* By the SLLN, $W_\Delta/N \to G$ entry-wise a.s. Since $G\,u_1 = \mathbb{E}[A_i\alpha_i v_i]$ (the residual contributes nothing: $R\,u_1 = 0$), $u_1$ is the top right singular vector of $G$ with singular value $\sigma_1(G) = \|\mathbb{E}[A_i\alpha_i v_i]\|$. By continuity of singular values (Weyl's inequality: $|\sigma_1(W_\Delta/N) - \sigma_1(G)| \leq \|W_\Delta/N - G\|_F \to 0$), $\sigma_1(W_\Delta/N) \to \sigma_1(G)$. Therefore $r_1 = \sigma_1^{2}(W_\Delta/N)/\|W_\Delta/N\|_F^{2} \to \sigma_1^{2}(G)/\|G\|_F^{2}$, which equals Eq. ([33](https://arxiv.org/html/2607.01490#S15.E33)) by the orthogonal decomposition $\|G\|_F^{2} = \|\mathbb{E}[A_i\alpha_i v_i]\|^{2} + \|R\|_F^{2}$. ∎

###### Corollary 15.2.

Suppose failures arise from $K$ independent reasoning modes with equal probability $1/K$, and let $\mu_k = \mathbb{E}[h_i^{\perp} \mid \text{mode}\;k]$ denote the mean perpendicular hidden state of mode $k$. Split the residual by outcome: $R = p\,R_S + \frac{q}{\delta}\,R_F$. The failure residual decomposes as $R_F = \frac{1}{K}\sum_{k=1}^{K}\mathbb{E}[A_i\,v_i \mid k] \otimes \mu_k$. If the $\mu_k$ are mutually orthogonal with $\|\mu_k\| = \mu$ and $\|\mathbb{E}[A_i v_i \mid k]\| = c$ for all $k$, then $\|R_F\|_F^{2} = c^{2}\mu^{2}/K$, so $\|R_F\| \sim 1/\sqrt{K}$: more diverse failure modes yield a smaller residual. Setting $\delta < 1$ amplifies the weight $q/\delta$ on this shrinking $R_F$ relative to the surviving $R_S$ from correlated successes, accelerating the collapse $r_1 \to 1$.

Table [9](https://arxiv.org/html/2607.01490#S15.T9) highlights the rank-1 funnel effect.

Table 9: Rank-1 dominance of the output head $\Delta W$. (a) At 500 steps, all methods are rank-1 dominant. (b) Over longer training, sign-biased methods maintain rank-1 while sign-balanced methods lose it; the effect weakens with model scale.

(a) 500 steps (7B)

(b) Converged

We analyze how RL training with different advantage functions modifies the weights across three model scales:

- Qwen 2.5 7B: $d = 3584$, 28 layers, 28 heads, 4 KV heads, vocab $= 152{,}064$. 26 runs at 8k context, 8 runs at 16k.
- Qwen 2.5 14B: $d = 5120$, 48 layers, 40 heads, 8 KV heads, vocab $= 152{,}064$. 1 run (GRPO mean), 5 checkpoints.
- CWM 32B: $d = 6144$, 64 layers, 48 heads, vocab $= 128{,}256$. 6 runs.

##### Analysis techniques\.

For each run we compute $\Delta = W_{\mathrm{rl}} - W_{\mathrm{sft}}$ and measure global $L_2$ distance, per-layer $L_2$ decomposition, pairwise cosine similarity of weight-change vectors with hierarchical clustering, full SVD of the output head delta, rank-1 decomposition $\Delta \approx s_1\,u\,v^{\top}$ with cross-run alignment of $u/v$ vectors, and per-layer SVD across all weight matrices.

##### Hidden\-state analysis\.

To measure the geometric structure of correct vs. failed trajectories, we extract the last-layer hidden states $h_i \in \mathbb{R}^{d}$ for all tokens across a batch and project out the top right singular vector $u_1$ of the batch gradient, yielding the perpendicular component $h_i^{\perp} = h_i - (h_i^{\top}u_1)u_1$. We then compute the mean pairwise cosine similarity $\hat{\rho}$ of these $h^{\perp}$ vectors separately for correct and failed trajectory groups. This measures how much hidden-state structure survives beyond the dominant rank-1 direction. We also compute group-conditional residual norms $\|R_{\text{group}}\|_F = \|\frac{1}{N}\sum_{i\in\text{group}} v_i \otimes h_i^{\perp}\|_F$, where $v_i = e_{y_i} - \pi_i$ is the softmax error vector.¹

¹ These norms are unweighted by advantages ($\mathbb{E}[v \otimes h^{\perp} \mid \text{group}]$ rather than $\mathbb{E}[Av \otimes h^{\perp} \mid \text{group}]$), since advantages vary per-problem and across advantage functions. The projection direction is the top right singular vector of $H$, which may differ from $u_1$ in Proposition [15.1](https://arxiv.org/html/2607.01490#S15.Thmtheorem1).
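The projection and the correlation statistic $\hat{\rho}$ described above can be sketched in plain Python. The function names are illustrative, `u1` is assumed to be unit-norm, and every projected vector is assumed to be non-zero.

```python
import math


def project_out(h, u1):
    """Perpendicular component h - (h . u1) u1 for a unit-norm direction u1."""
    dot = sum(a * b for a, b in zip(h, u1))
    return [a - dot * b for a, b in zip(h, u1)]


def cosine(x, y):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two vectors")
    total = sum(cosine(vectors[i], vectors[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


if __name__ == "__main__":
    u1 = [1.0, 0.0, 0.0]
    group = [[2.0, 1.0, 0.0], [5.0, 2.0, 0.1], [-1.0, 0.5, -0.2]]
    perpendicular = [project_out(h, u1) for h in group]
    print(mean_pairwise_cosine(perpendicular))
```

Running this separately on the correct and failed token groups would give the two $\hat{\rho}$ values being compared.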

The rank-1 left singular vector $u \in \mathbb{R}^{|\mathcal{V}|}$ of the output head delta reveals a universal "suppress non-code" direction shared across sign-biased runs: 99.93–99.97% of tokens have negative $u$ values (Chinese, Russian, Thai, Arabic text; non-code identifiers), $u$ is not sparse (50% of energy in $\sim$22% of vocab), and the $u$ vectors are nearly identical across sign-biased runs ($\cos = 0.93$–$0.99$) and context lengths ($\cos = 0.82$–$0.96$).

### 15.2 Empirical Verification of Rank-1 Collapse

Correct trajectories are $3$–$4\times$ more correlated than failed ones in the perpendicular subspace, consistent across methods and training steps (Table [10](https://arxiv.org/html/2607.01490#S15.T10)).

Table 10: Perpendicular correlation $\hat{\rho}$ by trajectory outcome.

##### Residual norms and CLT baseline.

For each group $g \in \{\text{success}, \text{failure}\}$ with $N_g$ tokens, we compute the group-conditional residual

$$R_g = \frac{1}{N_g}\sum_{i \in g} v_i \otimes h_i^{\perp},$$

where $v_i = e_{y_i} - \pi_i$ and $h_i^{\perp}$ is the hidden state with the top SVD direction projected out.
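As a sketch of this computation, the sum of outer products $\sum_i v_i \otimes h_i^{\perp}$ can be written as a single matrix product. The function names, toy shapes, and random inputs below are assumptions for illustration only.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def residual_norm(logits, targets, h_perp):
    """Frobenius norm of R_g = (1/N_g) * sum_i (e_{y_i} - pi_i) outer h_perp_i."""
    probs = softmax(logits)
    v = -probs
    v[np.arange(len(targets)), targets] += 1.0    # v_i = e_{y_i} - pi_i
    residual = v.T @ h_perp / len(targets)        # shape (vocab, dim)
    return float(np.linalg.norm(residual))

# Toy inputs standing in for one group's tokens.
rng = np.random.default_rng(0)
n_tokens, vocab, dim = 64, 11, 5
logits = rng.normal(size=(n_tokens, vocab))
targets = rng.integers(0, vocab, size=n_tokens)
h_perp = rng.normal(size=(n_tokens, dim))

norm_fast = residual_norm(logits, targets, h_perp)
print(norm_fast)
```

Calling `residual_norm` once per group (success and failure) yields the two norms whose ratio is compared against the sample-size baseline.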

Under the null hypothesis that both groups have the same per-token covariance structure and differ only in sample size, each entry of $R_g$ is an average of $N_g$ i.i.d. terms with common variance $\sigma^{2}$. By the CLT, $\|R_g\|_F^{2} \approx Vd \cdot \sigma^{2}/N_g$ (where $V$ is the vocabulary size and $d$ the hidden dimension), so the ratio of residual norms scales as

$$\frac{\|R_F\|}{\|R_S\|} \approx \sqrt{\frac{N_S}{N_F}}.$$

With $N_S = 8{,}968$ and $N_F = 54{,}700$ in our batches during the AsymGRPO $\delta = 0.5$ training with Qwen 2.5 7B, this gives $\sqrt{N_S/N_F} \approx 0.40$. The measured ratio is $\|R_F\|/\|R_S\| = 0.73$–$0.79$ (Table [11](https://arxiv.org/html/2607.01490#S15.T11)), nearly twice the CLT prediction. This confirms that the gap is not an artifact of having more failure tokens: failures are genuinely more decorrelated in the perpendicular subspace, producing a smaller residual per token than successes do.
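The arithmetic behind the baseline can be checked directly. The snippet uses only the token counts and measured ratios quoted in the text; the variable names are illustrative.

```python
import math

# Token counts reported for the AsymGRPO (delta = 0.5) run on Qwen 2.5 7B.
n_success = 8968
n_failure = 54700

# CLT baseline: ||R_F|| / ||R_S|| ~ sqrt(N_S / N_F) if the groups differ only in size.
predicted_ratio = math.sqrt(n_success / n_failure)

# Measured range of ||R_F|| / ||R_S|| quoted in the text.
measured_low, measured_high = 0.73, 0.79

print(f"CLT prediction: {predicted_ratio:.3f}")
print(f"measured / predicted: {measured_low / predicted_ratio:.2f}"
      f" to {measured_high / predicted_ratio:.2f}")
```

The measured ratio comes out at roughly 1.8 to 1.95 times the sample-size baseline, which is the "nearly twice" quoted above.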

Table 11: Group-conditional residual norms for the AsymGRPO $\delta = 0.5$ run, where $S$ and $F$ are the sets of correct and incorrect trajectories at a given timestep over a subset of samples from the training set.

##### Causal verification.

Since the rank-1 signal originates at the output head and propagates backward (Figure [5](https://arxiv.org/html/2607.01490#S4.F5)), Proposition [15.1](https://arxiv.org/html/2607.01490#S15.Thmtheorem1) predicts that asymmetric learning should be concentrated in the last few layers. Periodically resetting the last $4$ layers toward SFT weights erases most of AsymGRPO's gains (with $\delta = 0.5$), with pass@$1$ dropping back to the starting policy performance, while symmetric GRPO, with more distributed learning, benefits from the reset. This trade-off is scale-dependent: at 32B the rank-1 fraction drops from 78–96% to 45.6% for AsymGRPO (Table [9](https://arxiv.org/html/2607.01490#S15.T9)), and performance gaps between symmetric and asymmetric methods shrink accordingly (Tables [4](https://arxiv.org/html/2607.01490#S11.T4) and [5](https://arxiv.org/html/2607.01490#S11.T5)).
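One way such a reset could look is sketched below. The text does not specify the interpolation rule or the reset schedule, so the linear blend, the `alpha` coefficient, and the dictionary-of-layers layout are all hypothetical choices for illustration.

```python
import numpy as np

def reset_last_layers(params, sft_params, num_layers, last_k=4, alpha=1.0):
    """Blend the last `last_k` layers toward SFT: w <- (1 - alpha) * w + alpha * w_sft.

    `params` and `sft_params` map a layer index to a weight array (assumed layout).
    alpha = 1.0 is a hard reset; smaller values move only part of the way.
    """
    out = dict(params)
    for layer in range(num_layers - last_k, num_layers):
        out[layer] = (1.0 - alpha) * params[layer] + alpha * sft_params[layer]
    return out

# Toy 8-layer model with one weight matrix per layer.
rng = np.random.default_rng(0)
num_layers = 8
sft = {i: rng.normal(size=(4, 4)) for i in range(num_layers)}
rl = {i: sft[i] + 0.1 * rng.normal(size=(4, 4)) for i in range(num_layers)}

hard = reset_last_layers(rl, sft, num_layers, last_k=4, alpha=1.0)
soft = reset_last_layers(rl, sft, num_layers, last_k=4, alpha=0.5)
```

In a training loop this would be called every fixed number of optimizer steps, with the period being another free parameter not given in the text.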

###### Corollary 15.3 (Cross-step accumulation).

If at each optimization step $t$ the batch gradient at the output head satisfies $g_t \approx w_t \otimes u$ for a shared direction $u \in \mathbb{R}^{d}$, then the accumulated weight update satisfies $W_{\Delta} = \sum_{t} \eta_t g_t \approx \left(\sum_{t} \eta_t w_t\right) \otimes u$, and the right-hand side is exactly rank-1.

The corollary gives a sufficient condition: if gradient directions align across steps, then rank-1 structure compounds. Empirically, cosine similarity between successive batch-gradient directions at the output head converges to $|\cos| \approx 1.0$ from step $\sim 600$ onward for asymmetric methods, confirming the condition during the exploitation phase. For symmetric methods the alignment remains lower, consistent with the higher effective rank of their accumulated updates.
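A small numerical illustration of the corollary on synthetic gradients: when every step shares the direction $u$, the accumulated update collapses to a single outer product, whereas independent directions give a higher-rank sum. The shapes, step sizes, and the energy-fraction diagnostic are assumptions chosen for the example.

```python
import numpy as np

def rank1_fraction(m):
    """Share of squared singular-value mass in the top singular component."""
    s = np.linalg.svd(m, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

rng = np.random.default_rng(0)
vocab, dim, steps = 40, 12, 25
etas = rng.uniform(0.01, 0.1, size=steps)   # per-step learning rates eta_t
ws = rng.normal(size=(steps, vocab))        # per-step left factors w_t

# Aligned case: every step uses the same unit direction u.
u = rng.normal(size=dim)
u /= np.linalg.norm(u)
aligned = sum(eta * np.outer(w, u) for eta, w in zip(etas, ws))
closed_form = np.outer(etas @ ws, u)        # (sum_t eta_t w_t) outer u

# Unaligned case: each step has its own random direction.
us = rng.normal(size=(steps, dim))
unaligned = sum(eta * np.outer(w, ui) for eta, w, ui in zip(etas, ws, us))

frac_aligned = rank1_fraction(aligned)
frac_unaligned = rank1_fraction(unaligned)
print(frac_aligned, frac_unaligned)
```

The aligned sum matches the closed form up to floating-point error, while the unaligned sum spreads its energy over many singular directions.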

### 15.3 Alignment Dynamics in GRPO

Tracking a GRPO run at Qwen 2.5 7B across 16 checkpoints (step 300 to 15,000), cross-checkpoint direction alignment (Table [12](https://arxiv.org/html/2607.01490#S15.T12)) reveals a clear exploration-to-exploitation transition. The output head direction converges by step 1,200, while inner layers remain exploratory until step $\sim 6{,}000$. After that, the model scales up the same directions with diminishing magnitude. All values are $|\cos|$ of the leading singular vector of each checkpoint's $W_{\Delta} = W_{\mathrm{rl}} - W_{\mathrm{sft}}$ versus that of the final (step 15,000) checkpoint: $u$ (left/output-token direction) for the output head, and the layer-averaged $|\cos(u)|$ for the transformer blocks grouped into early/mid/late thirds.

Table 12: Cross-checkpoint direction alignment for a GRPO run (Qwen 2.5 7B, no-KL, 8k). Each entry is $|\cos|$ between the leading singular vector of $\Delta W = W_t - W_{\text{SFT}}$ at step $t$ and at the final step ($15{,}000$). The output-head direction converges by step $1{,}200$ ($|\cos(u)| = 0.97$), while inner layers remain exploratory ($|\cos(u)| < 0.25$) until step $\sim 6{,}000$, after which the model rescales the same directions with diminishing magnitude.
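The alignment metric described in this section can be sketched as below for a single weight matrix. The checkpoints are synthetic (a fixed rank-1 direction growing in scale on top of small noise), and the function names are assumptions. Averaging the result over layers would give the per-block entries.

```python
import numpy as np

def leading_left_singular_vector(delta):
    """Unit-norm leading left singular vector of a weight delta."""
    u, _, _ = np.linalg.svd(delta, full_matrices=False)
    return u[:, 0]

def alignment_to_final(w_sft, checkpoints):
    """|cos| between each checkpoint's leading direction and the final checkpoint's."""
    final_u = leading_left_singular_vector(checkpoints[-1] - w_sft)
    # Singular vectors are unit norm, so the dot product is already the cosine.
    return [abs(float(leading_left_singular_vector(w - w_sft) @ final_u))
            for w in checkpoints]

# Synthetic run: the rank-1 component grows while per-checkpoint noise stays small.
rng = np.random.default_rng(0)
w_sft = rng.normal(size=(30, 10))
u_dir = rng.normal(size=30)
u_dir /= np.linalg.norm(u_dir)
v_dir = rng.normal(size=10)
v_dir /= np.linalg.norm(v_dir)
scales = [0.0, 0.5, 5.0, 10.0]
checkpoints = [
    w_sft + a * np.outer(u_dir, v_dir) + 0.05 * rng.normal(size=(30, 10))
    for a in scales
]

alignment = alignment_to_final(w_sft, checkpoints)
print([round(a, 3) for a in alignment])
```

The last entry is 1 by construction, since the final checkpoint is compared with itself. Earlier entries rise toward 1 as the shared direction starts to dominate the noise.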