# Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
Source: [https://arxiv.org/html/2605.11538](https://arxiv.org/html/2605.11538)
Cheng Wang† Qin Liu‡ Wenxuan Zhou§ Muhao Chen‡
†National University of Singapore ‡University of California, Davis §University of Southern California
wangcheng@u.nus.edu
###### Abstract
Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to effectively balance the trade-off between exploration and exploitation during training, often resulting in suboptimal performance. Motivated by the theoretical insight that changes in entropy are governed by the covariance between token probabilities and their corresponding advantages, we propose a hyperparameter-free, *covariance-weighted optimization* method that dynamically down-weights extreme token-level updates via a Gaussian kernel. This approach automatically reduces the instability caused by the exploration-exploitation trade-off while preserving informative learning signals. Extensive empirical evaluations show that our approach improves downstream performance across reasoning benchmarks compared with GRPO, and effectively stabilizes entropy as training progresses.
## 1 Introduction

Group Relative Policy Optimization (GRPO) (Shao et al., 2024a) has emerged as a promising approach for enhancing the reasoning capabilities of large language models (LLMs), particularly in complex mathematical and coding tasks. Despite its demonstrated effectiveness, GRPO faces a critical limitation in properly balancing exploitation and exploration during policy optimization, which can undermine its performance (Wang et al., 2025a).

Excessive exploitation can cause the model to become overconfident in suboptimal solutions, limiting its capacity to explore novel reasoning strategies and potentially overlooking more effective approaches. Conversely, while exploration is necessary for identifying better policies, excessive exploration may result in unstable training dynamics and hinder convergence to a stable, high-performing solution. These opposing risks highlight the importance of a principled mechanism for balancing exploration and exploitation so as to realize a more robust GRPO.
Figure 1: Policy Entropy During Training. Vanilla GRPO exhibits entropy instability, while our method keeps entropy at a reasonable level that effectively balances exploration and exploitation.

Figure 2: Illustration of Our Proposed Method. Compared with vanilla GRPO, our method reweights the advantages based on the covariance between token probabilities and advantages.

Specifically, the trade-off in GRPO is fundamentally tied to the policy's entropy dynamics during training. As established by Cui et al. (2025b), entropy changes under the natural policy gradient update are governed by the covariance between token log-probabilities and their corresponding advantage estimates. Based on this theoretical foundation and our empirical observations, we identify that a small fraction of tokens with extreme covariance values disproportionately dominate the policy updates, resulting in entropy instability and degraded training dynamics.
To mitigate this issue, we propose a covariance-aware variant of GRPO that attenuates extreme token-level updates through Gaussian-kernel weighting. Specifically, our approach computes the covariance between centered log-probabilities and centered advantages for each token, and applies a smooth down-weighting function to tokens with high-magnitude covariances while preserving the influence of those with moderate covariances. This mechanism effectively regulates the contribution of outlier tokens to the policy gradient, thereby improving the balance between exploration and exploitation in a hyperparameter-free way. Extensive experiments demonstrate that our approach improves substantially over vanilla GRPO, achieving better downstream performance and maintaining stable entropy dynamics, as illustrated in Figure 1.
## 2 Method

### 2.1 Preliminaries

GRPO (Shao et al., 2024b) extends Proximal Policy Optimization (Schulman et al., 2017) by removing the value network and using group-based rewards to estimate advantages. For each prompt $q$ sampled from $\mathcal{D}$, GRPO samples a group of $G$ responses $\{o_1, o_2, \ldots, o_G\}$ from the current policy $\pi_\theta$ and evaluates them using a reward model $r_\phi$, which is usually a rule-based verifier. GRPO computes the advantage for response $i$ as $\hat{A}_i = \frac{r_i - \bar{r}}{\sigma_r}$, where $r_i$ is the reward for response $i$, and $\bar{r}$ and $\sigma_r$ are the mean and standard deviation of rewards within the group. GRPO aims to maximize the following objective:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{D}}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\,\hat{A}_i\right] - \beta\,\mathbb{E}_{q \sim \mathcal{D}}\left[D_{\text{KL}}\left[\pi_\theta(\cdot \mid q)\,\|\,\pi_{\text{ref}}(\cdot \mid q)\right]\right],$$

where $\pi_{\theta_{\text{old}}}$ is the policy from the previous iteration, $\pi_{\text{ref}}$ is the reference policy, and $\beta$ is the KL penalty coefficient.
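As a concrete illustration, the sketch below computes the group-relative advantage $\hat{A}_i$ in PyTorch. It is a minimal reading of the definition above, not the authors' released code, and the 0/1 rewards in the usage example are hypothetical.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantage: (r_i - mean) / std over the G responses to one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical group of G = 4 responses: two pass the rule-based verifier, two fail.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # correct responses get positive advantage
```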
### 2.2 Motivation

To measure the exploration-exploitation trade-off in GRPO, we can use policy entropy as an indicator, defined as:

$$\mathcal{H}(\pi_\theta) = -\mathbb{E}_{q \sim \mathcal{D}}\left[\mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}\left[\log \pi_\theta(o \mid q)\right]\right] = -\frac{1}{|\mathcal{D}|}\sum_{q \in \mathcal{D}} \frac{1}{|o|}\sum_{t=1}^{|o|} \mathbb{E}_{o_t \sim \pi_\theta}\left[\log \pi_\theta\left(o_t \mid q, o_{<t}\right)\right].$$
Cui et al. (2025b) show that, under the Natural Policy Gradient (Kakade, 2001), the change in policy entropy is governed by the covariance between token log-probabilities and advantages:

$$\Delta\mathcal{H} \approx -\eta \cdot \mathrm{Cov}_t\left(\log \pi_\theta(o_t \mid q, o_{<t}),\, \hat{A}_i\right).$$

This relationship reveals that the covariance between log-probabilities and advantages directly drives entropy dynamics during training. As a preliminary experiment, we use GRPO to fine-tune DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek-AI, 2025) and track token-level covariance throughout 50 training steps. As shown in Figure 3 and Table 1, we observe that a small fraction of tokens possess large-magnitude covariance values, which disproportionately dominate the overall covariance and precipitate unstable entropy dynamics. These extreme values push the policy away from the ideal exploration-exploitation balance, leading to suboptimal performance. This observation directly motivates our method, which moderates extreme covariance updates through covariance-aware advantage reweighting, as suppressing these extreme updates is crucial for maintaining stable entropy.
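This diagnostic can be reproduced in a few lines: given the sampled tokens' log-probabilities and their group-relative advantages, one computes the centered products whose masked mean is $\mathrm{Cov}_t$ and inspects the tails. The sketch below is our reading of that measurement, assuming padded `(G, T)` tensors and a 0/1 validity mask; the tensor names are illustrative.

```python
import torch

def token_covariance_terms(logps: torch.Tensor, advantages: torch.Tensor,
                           mask: torch.Tensor) -> torch.Tensor:
    """Centered products whose masked mean is Cov_t(log pi, A).

    logps:      (G, T) per-token log-probabilities under the current policy
    advantages: (G,)   group-relative advantage per response
    mask:       (G, T) 1.0 for real tokens, 0.0 for padding
    """
    adv = advantages.unsqueeze(-1).expand_as(logps)   # broadcast A_i over tokens of o_i
    n = mask.sum()
    logp_mean = (logps * mask).sum() / n
    adv_mean = (adv * mask).sum() / n
    return (logps - logp_mean) * (adv - adv_mean) * mask

# Tail inspection in the spirit of Table 1: magnitude at the 99.99th percentile.
# c = token_covariance_terms(logps, advantages, mask)
# print(c[mask.bool()].abs().quantile(0.9999))
```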
Figure 3: Cumulative Contribution of Covariance Values. A small fraction of tokens with extreme covariance values disproportionately dominate policy updates.

| Percentile | Positive Covariance | Negative Covariance |
|---|---|---|
| 0.01% | 11.52 | -13.62 |
| 1.00% | 3.32 | -3.34 |
| 20.00% | 0.58 | -0.36 |
| 40.00% | 0.33 | -0.22 |
| 100.00% | 0.06 | -0.04 |

Table 1: Covariance Distribution Statistics. The table presents covariance magnitudes at specific percentile thresholds.
### 2.3 Covariance-Aware Advantage Reweighting

To address the issue of extreme covariance values destabilizing training, and thereby balance the exploration-exploitation trade-off, we propose a covariance-weighted GRPO variant (CW-GRPO) that automatically down-weights tokens with large-magnitude covariances while preserving learning signals from moderate-covariance tokens.
We adopt the standard GRPO setup in which a policy $\pi_\theta$ produces $G$ responses $o_i$ per prompt $q$. The vanilla GRPO objective is

$$\mathcal{L}_{\text{GRPO}} = \mathbb{E}_{q, \{o_i\}}\Bigl[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \ell_{i,t}\Bigr],$$

where $\ell_{i,t}$ is the token-level loss defined as:

$$\ell_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}\,\hat{A}_i - \beta\,\mathrm{KL}_{i,t}.$$
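A direct transcription of this token-level term (written as a quantity to maximize, matching the objective above) might look as follows. The per-token KL estimate is left abstract, the function name is ours, and the default $\beta$ is an assumed placeholder rather than a value from the paper.

```python
import torch

def grpo_token_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                    advantages: torch.Tensor, kl: torch.Tensor,
                    beta: float = 0.04) -> torch.Tensor:
    """Vanilla GRPO token-level term: ratio * A_i - beta * KL (to be maximized).

    logp_new, logp_old: (G, T) token log-probs under pi_theta and pi_theta_old
    advantages:         (G,)   A_i, shared by all tokens of response i
    kl:                 (G, T) per-token KL estimate against the reference policy
    beta:               KL coefficient (0.04 here is an assumed placeholder)
    """
    ratio = (logp_new - logp_old.detach()).exp()      # importance ratio per token
    return ratio * advantages.unsqueeze(-1) - beta * kl
```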
Based on the motivation from Section 2.2, we propose to reweight the advantage signal based on the covariance between log-probabilities and advantages. We compute the token-level covariance between centered log-probabilities and centered advantages:

$$c_{i,t} = \left(\log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) - \overline{\log \pi}\right) \cdot \left(\hat{A}_i - \overline{\hat{A}}\right),$$

where $\overline{\log \pi}$ and $\overline{\hat{A}}$ are the means of log-probabilities and advantages computed over responses.
We then apply a Gaussian kernel that exponentially suppresses only the magnitudes that exceed typical variability, by setting the bandwidth to the empirical standard deviation $\sigma$ of the covariances:

$$w_{i,t} = \exp\Bigl(-\frac{c_{i,t}^2}{2\sigma^2}\Bigr),$$

where $\sigma$ is the standard deviation of $\{c_{i,t}\}$. The Gaussian kernel softly filters out the handful of extreme-covariance tokens that would otherwise destabilize entropy, yet leaves informative moderate-covariance tokens intact.
To maintain the expected loss, we normalize the weights:

$$\tilde{w}_{i,t} = w_{i,t} \cdot \frac{N}{\sum_{j=1}^{G}\sum_{k=1}^{|o_j|} w_{j,k}},$$

where $N = \sum_{j=1}^{G} |o_j|$ is the total number of tokens across all responses.
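Putting the two previous equations together: the kernel bandwidth is the empirical standard deviation of the covariances, and the weights are rescaled so their mean over valid tokens is 1. A minimal masked sketch, under the same shape assumptions as above:

```python
import torch

def gaussian_advantage_weights(cov: torch.Tensor, mask: torch.Tensor,
                               eps: float = 1e-8) -> torch.Tensor:
    """Gaussian-kernel weights w = exp(-c^2 / 2 sigma^2), normalized to mean 1.

    cov:  (G, T) token-level covariance terms c_{i,t}
    mask: (G, T) validity mask
    sigma is the empirical std of {c_{i,t}}, so no bandwidth hyperparameter is needed.
    """
    n = mask.sum()
    mean = (cov * mask).sum() / n
    sigma = ((((cov - mean) ** 2) * mask).sum() / n).sqrt() + eps
    w = torch.exp(-cov ** 2 / (2 * sigma ** 2)) * mask
    return w * n / (w.sum() + eps)    # tilde{w}: average weight over valid tokens = 1
```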
The covariance-weighted advantage reweighting modifies the token-level loss as:

$$\tilde{\ell}_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})} \cdot \bigl(\tilde{w}_{i,t}\,\hat{A}_i\bigr) - \beta\,\mathrm{KL}_{i,t}.$$

Our advantage reweighting automatically maintains the policy entropy at a reasonable level by adaptively filtering extreme-covariance tokens that would otherwise degrade performance.
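Combining the helper sketches above, the full CW-GRPO token loss differs from the vanilla one only in the reweighted advantage. One plausible assembly is shown below; detaching the log-probabilities inside the covariance, so the weights act as constants in the gradient, is our implementation choice, not something the paper spells out.

```python
def cw_grpo_token_loss(logp_new, logp_old, advantages, kl, mask, beta=0.04):
    """CW-GRPO: vanilla GRPO token loss with Gaussian-reweighted advantages."""
    cov = token_covariance_terms(logp_new.detach(), advantages, mask)
    w = gaussian_advantage_weights(cov, mask)          # tilde{w}_{i,t}
    ratio = (logp_new - logp_old).exp()
    return (ratio * (w * advantages.unsqueeze(-1)) - beta * kl) * mask
```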
| Model | Fine-tuning Samples | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench | Avg. |
|---|---|---|---|---|---|---|---|
| **General Models** | | | | | | | |
| Llama-3.1-70B-Instruct | – | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
| o1-preview | – | 44.6 | 85.5 | – | – | – | – |
| **1.5B Models** | | | | | | | |
| Still-3-1.5B-Preview | 30,000 | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
| DeepScaleR-1.5B-Preview | 40,000 | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0 |
| Open-RS1 | 18,615 | 30.0 | 83.8 | 70.0 | 29.0 | 52.4 | 53.0 |
| **1.5B Model Experiments** | | | | | | | |
| DeepSeek-R1-Distill-Qwen-1.5B | Base Model | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| GRPO | 7000 | 33.3 | 85.0 | 67.5 | 27.2 | 49.9 | 52.6 (+3.7) |
| Clip-Cov (Cui et al., 2025a) | 7000 | 33.3 | 85.5 | 70.0 | 29.0 | 50.0 | 53.6 (+4.7) |
| GRPO + Gaussian Reweight (ours) | 7000 | 30.0 | 87.0 | 77.5 | 29.8 | 52.0 | 55.3 (+6.4) |
| **7B Models** | | | | | | | |
| rStar-Math-7B | – | 26.7 | 78.4 | 47.5 | – | 47.1 | – |
| Eurus-2-7B-PRIME | – | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL | – | 26.7 | 82.4 | 62.5 | 39.7 | 43.3 | 50.9 |
| **7B Model Experiments** | | | | | | | |
| Qwen-2.5-Math-7B-Instruct | Base Model | 3.3 | 82.6 | 47.5 | 33.1 | 40.4 | 41.4 |
| GRPO | 7000 | 10.0 | 82.2 | 55.0 | 33.1 | 40.3 | 44.1 (+2.7) |
| Clip-Cov (Cui et al., 2025a) | 7000 | 10.0 (3/30) | 82.4 (412/500) | 57.5 (23/40) | 32.4 (88/272) | 41.3 (279/675) | 44.7 (+3.3) |
| GRPO + Gaussian Reweight (ours) | 7000 | 13.3 | 82.8 | 62.5 | 32.0 | 42.7 | 46.7 (+4.3) |

Table 2: Main Experimental Results. Zero-shot pass@1 performance across mathematical reasoning benchmarks. Values in parentheses in the Avg. column indicate the improvement over the base model.
## 3 Experiments

### 3.1 Experimental Setup

#### Models.

We consider two different scales of models. For 1.5B models, we use DeepSeek-R1-Distill-Qwen-1.5B, which is Qwen2.5-Math-1.5B (Yang et al., 2024) fine-tuned on reasoning data from DeepSeek-R1 (Guo and DeepSeek-AI, 2025). For 7B models, we use Qwen2.5-Math-7B-Instruct (Yang et al., 2024).
#### Datasets.

For training, we use Open-RS (Dang and Ngo, 2025), which consists of 7k high-quality math questions. For evaluation, we use five widely used math benchmarks. More details about the datasets can be found in Appendix A.
#### Experimental Setup.

Following Dang and Ngo (2025), we set the sampling temperature to 0.7 during training. The prompt template used for training is detailed in Appendix B, and comprehensive implementation details are provided in Appendix C.
### 3.2 Results

#### Accuracy Results.

We present the experimental results in Table 2. Our covariance-weighted GRPO consistently outperforms vanilla GRPO across both model scales and all evaluation benchmarks. For the 1.5B DeepSeek-R1-Distill-Qwen model, our method achieves an average improvement of +2.7 points over vanilla GRPO, with particularly notable gains on AMC23 (Li et al., 2024) and Minerva (Lewkowycz et al., 2022). Similarly, for the 7B Qwen2.5-Math model, our approach delivers a +2.6-point average improvement over vanilla GRPO, with the most substantial gains observed on AMC23 and OlympiadBench (He et al., 2024). These consistent improvements across different model architectures and mathematical reasoning benchmarks demonstrate the effectiveness of our covariance-aware advantage reweighting in enhancing reasoning performance.
#### Policy Entropy Results.

Using DeepSeek-R1-Distill-Qwen-1.5B, we plot the entropy dynamics during training as introduced in Section 2.2. As shown in Figure 1, our proposed method maintains entropy at a reasonable level, effectively balancing exploration and exploitation, while vanilla GRPO exhibits significant entropy instability. To demonstrate that stable entropy correlates with better reasoning performance, we evaluate model performance at different training checkpoints on MATH-500 (Hendrycks et al., 2021; Lightman et al., 2023) and OlympiadBench (He et al., 2024) to ensure more reliable statistical measurements, as shown in Table 3.

The results confirm our hypothesis: vanilla GRPO's entropy instability leads to performance degradation (from 85.0% to 79.8% on MATH-500), while our method maintains consistent performance (86.2%-87.0%) across all checkpoints, validating the importance of controlling extreme covariance values.
| Method | Training Step | MATH-500 | OlympiadBench |
|---|---|---|---|
| GRPO | 100 | 85.0 | 49.9 |
| | 150 (entropy low) | 82.0 | 49.9 |
| | 200 (entropy high) | 79.8 | 47.8 |
| CW-GRPO | 100 | 87.0 | 52.0 |
| | 150 (entropy low) | 86.2 | 53.9 |
| | 200 (entropy high) | 86.4 | 53.5 |

Table 3: Performance Comparison at Different Checkpoints. Vanilla GRPO training exhibits entropy instability, resulting in degraded performance.
## 4 Conclusion

We present a covariance-aware extension to GRPO that uses Gaussian kernel weighting to moderate extreme token-level updates, automatically improving the exploration-exploitation trade-off without additional hyperparameters. Experimental results across multiple reasoning benchmarks demonstrate consistent improvements over vanilla GRPO, validating our principled approach to enhancing reasoning performance through more balanced gradient updates.
## Limitations
Our experiments are conducted on models up to 7B parameters, and while the results demonstrate consistent improvements, further validation on larger-scale models would strengthen the evidence for the method's broad applicability. Additionally, our evaluation focuses primarily on mathematical reasoning tasks, which provide clear correctness criteria ideal for testing our approach. Exploring the method's effectiveness across more diverse tasks would offer broader insights into its general utility.
## Acknowledgments
We appreciate the reviewers for their insightful comments and suggestions. Qin Liu and Muhao Chen were supported by an Amazon Nova Trusted AI Prize, and by grants OAC 2531126 and ITE 2333736 from the United States National Science Foundation.
## References
- G. Cui, Y. Zhang, J. Chen, L. Yuan, Z. Wang, Y. Zuo, H. Li, Y. Fan, H. Chen, W. Chen, et al. (2025a). The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617.
- G. Cui, Y. Zhang, J. Chen, L. Yuan, Z. Wang, Y. Zuo, H. Li, Y. Fan, H. Chen, W. Chen, et al. (2025b). The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617.
- G. Cui, Y. Zhang, J. Chen, L. Yuan, Z. Wang, Y. Zuo, H. Li, Y. Fan, H. Chen, W. Chen, et al. (2025c). The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617.
- Q. Dang and C. Ngo (2025). Reinforcement learning for reasoning in small LLMs: what works and what doesn't. arXiv preprint arXiv:2503.16219.
- DeepSeek-AI (2025). DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- B. Gao, F. Song, Z. Yang, Z. Cai, Y. Miao, Q. Dong, L. Li, C. Ma, L. Chen, R. Xu, Z. Tang, B. Wang, D. Zan, S. Quan, G. Zhang, L. Sha, Y. Zhang, X. Ren, T. Liu, and B. Chang (2024). Omni-MATH: a universal olympiad level mathematic benchmark for large language models. arXiv preprint arXiv:2410.07985.
- D. Guo and DeepSeek-AI (2025). DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- Z. Hao, H. Wang, H. Liu, J. Luo, J. Yu, H. Dong, Q. Lin, C. Wang, and J. Chen (2025). Rethinking entropy interventions in RLVR: an entropy change perspective. arXiv preprint arXiv:2510.10150.
- A. He, D. Fried, and S. Welleck (2025). Rewarding the unlikely: lifting GRPO beyond distribution sharpening. arXiv preprint arXiv:2506.02355.
- C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun (2024). OlympiadBench: a challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021). Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
- S. M. Kakade (2001). A natural policy gradient. In Advances in Neural Information Processing Systems, Vol. 14.
- A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. (2022). Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems 35, pp. 3843-3857.
- J. Li, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. C. Huang, K. Rasul, L. Yu, A. Jiang, Z. Shen, Z. Qin, B. Dong, L. Zhou, Y. Fleureau, G. Lample, and S. Polu (2024). NuminaMath. Numina. [https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf)
- J. Li, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. Huang, K. Rasul, L. Yu, A. Q. Jiang, Z. Shen, et al. (2024). NuminaMath: the largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions. Hugging Face repository 13, pp. 9.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023). Let's verify step by step. In The Twelfth International Conference on Learning Representations.
- N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto (2025). s1: simple test-time scaling. arXiv preprint arXiv:2501.19393.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- A. Setlur, C. Nagpal, A. Fisch, X. Geng, J. Eisenstein, R. Agarwal, A. Agarwal, J. Berant, and A. Kumar (2024). Rewarding progress: scaling automated process verifiers for LLM reasoning. arXiv preprint arXiv:2410.08146.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024a). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, and D. Guo (2024b). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- C. Snell, J. Lee, K. Xu, and A. Kumar (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.
- C. Wang, L. Wei, Y. Zhang, C. Shao, Z. Dan, W. Huang, Y. Zhang, and Y. Wang (2025a). EFRame: deeper reasoning via exploration-filter-replay reinforcement learning framework. arXiv preprint arXiv:2506.22200.
- Y. Wang, Q. Yang, Z. Zeng, L. Ren, L. Liu, B. Peng, H. Cheng, X. He, K. Wang, J. Gao, W. Chen, S. Wang, and S. S. Du (2025b). Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571.
- A. Yang, B. Zhang, B. Hui, B. Gao, B. Yu, C. Li, D. Liu, J. Tu, J. Zhou, J. Lin, K. Lu, M. Xue, R. Lin, T. Liu, X. Ren, and Z. Zhang (2024). Qwen2.5-Math technical report: toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122.
- J. Zhao, R. Liu, K. Zhang, Z. Zhou, J. Gao, D. Li, J. Lyu, Z. Qian, B. Qi, X. Li, and B. Zhou (2025). GenPRM: scaling test-time compute of process reward models via generative reasoning. arXiv preprint arXiv:2504.00891.
## Appendix A Datasets Information

We use the Open-RS dataset as the training set, curated by Dang and Ngo (2025), totaling 7,000 samples: 3,000 from the Open-s1 dataset (Dang and Ngo, 2025) (filtered mathematical problems from diverse sources such as NuminaMath (Li et al., 2024) and AIME), 3,000 from the Open-DeepScaleR dataset (Dang and Ngo, 2025) (mathematics problems from AIME, AMC, and Omni-MATH (Gao et al., 2024)), and 1,000 easier problems from the DeepScaleR dataset (Guo and DeepSeek-AI, 2025). Both models are trained on the Open-RS dataset. For evaluation, we select five datasets: AIME24, MATH-500 (Hendrycks et al., 2021; Lightman et al., 2023), AMC23, Minerva (Lewkowycz et al., 2022), and OlympiadBench (He et al., 2024). More information is presented in Table 4.
| Name | Hugging Face Path | Size |
|---|---|---|
| AIME24 | HuggingFaceH4/aime_2024 | 30 |
| AMC23 | knoveleng/AMC-23 | 40 |
| MATH-500 | HuggingFaceH4/MATH-500 | 500 |
| Minerva | knoveleng/Minerva-Math | 272 |
| OlympiadBench | knoveleng/OlympiadBench | 675 |

Table 4: Datasets Information.
## Appendix B Training Prompt

The prompt used during training is presented in Figure 4, in which we instruct the model to use English only, as we have observed some language-mixing issues.

> A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer, and put your final answer within \boxed{}. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. Note that respond by English, NOT use other languages.

Figure 4: Prompt used for training.
## Appendix C Implementation Details

We conduct all training and evaluation on four NVIDIA H200 GPUs. Our reward function combines accuracy and format metrics. For accuracy, we compare the parsed model output against the ground truth using a verification function implemented in LaTeX. The function assigns a reward of 1.0 for exact matches and 0.0 otherwise. For format compliance, we award a reward of 1.0 when the output contains properly matched `<think>` tags.
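A simplified version of such a reward could be written as below. The exact-match check is a crude stand-in for the LaTeX-based verifier described above, and the `<think>` pattern is our guess at what "properly matched" means; function names are illustrative.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion contains a matched <think>...</think> block."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, answer: str) -> float:
    """1.0 on exact match of the last \\boxed{...} content, else 0.0 (simplified)."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return 1.0 if boxed and boxed[-1].strip() == answer.strip() else 0.0

def total_reward(completion: str, answer: str) -> float:
    """Accuracy and format rewards, combined additively."""
    return accuracy_reward(completion, answer) + format_reward(completion)
```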
| Parameter | Value |
|---|---|
| Learning Rate | $1.0 \times 10^{-6}$ |
| Batch Size | 12 |
| Gradient Accumulation Steps | 4 |
| Training Steps | 100 |
| Warmup Ratio | 0.1 |
| Max Prompt Length | 512 |
| Max Completion Length | 4096 |
| Temperature | 0.7 |
| Number of Generations | 12 |

Table 5: Hyperparameter Configuration for Experiments.
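For reference, the same hyperparameters as a plain Python mapping; the key names are illustrative and do not correspond to any particular trainer's API.

```python
# Hyperparameters from Table 5; key names are illustrative, not a trainer API.
TRAINING_CONFIG = {
    "learning_rate": 1.0e-6,
    "batch_size": 12,
    "gradient_accumulation_steps": 4,
    "training_steps": 100,
    "warmup_ratio": 0.1,
    "max_prompt_length": 512,
    "max_completion_length": 4096,
    "temperature": 0.7,
    "num_generations": 12,   # G: responses sampled per prompt
}
```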
## Appendix D Related Work

#### Test-time Scaling for Reasoning Tasks.

Test-time scaling has emerged as a promising paradigm for improving LLM performance by allocating additional computational resources during inference. Snell et al. (2024) demonstrated that scaling test-time compute optimally can be more effective than scaling model parameters, showing over 4× efficiency gains through compute-optimal strategies. Muennighoff et al. (2025) introduced a simplified approach using "budget forcing" to control inference compute by appending "Wait" tokens, achieving strong reasoning with minimal training data. Zhao et al. (2025) advanced the field with GenPRM, a generative process reward model that scales test-time compute through explicit chain-of-thought reasoning. Setlur et al. (2024) proposed that effective process rewards should measure progress by evaluating likelihood changes before and after each reasoning step.
#### Reinforcement Learning for Reasoning Tasks.

Reinforcement Learning with Verifiable Rewards (RLVR) has rapidly become the dominant route for eliciting step-by-step reasoning in LLMs. Shao et al. (2024b) first showed that Group Relative Policy Optimization (GRPO) can improve reasoning while dispensing with a value network, and the recipe was later scaled in DeepSeek-R1 (DeepSeek-AI, 2025). Subsequent work diagnoses exploration bottlenecks: unlikeliness reward boosts low-probability but correct trajectories (He et al., 2025), whereas covariance-based clipping traces early saturation to entropy collapse (Cui et al., 2025c; Hao et al., 2025). Efficiency studies reveal that even a *single* worked example can unlock large gains through 1-shot RLVR (Wang et al., 2025b).