Rethinking Groups in Critic-Free RLVR
Summary
This paper rethinks the role of grouping in critic-free reinforcement learning for LLMs and proposes negative token filtering to enable stable training with a single rollout per prompt, achieving comparable or better performance on reasoning and agentic tasks.
# Rethinking Groups in Critic-Free RLVR
Source: [https://arxiv.org/html/2606.17250](https://arxiv.org/html/2606.17250)
Yihong Wu¹, Liheng Ma²,³, Lingfeng Xiao⁴, Muzhi Li⁵, Xinyu Wang², Yingxue Zhang⁶, Jian-Yun Nie¹
¹Université de Montréal, ²McGill University, ³Mila - Quebec AI Institute, ⁴University of Waterloo, ⁵The Chinese University of Hong Kong, ⁶Huawei Noah’s Ark Lab
###### Abstract
Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantage computation. However, this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts. In this work, we revisit the role of the “group” and show that its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this insight, we propose *negative token filtering*, a simple and effective strategy that enables stable single-rollout training. We apply it to two batch-level advantage methods, achieving comparable performance on reasoning tasks and stronger performance on agentic tasks relative to group-based RL techniques.
## 1 Introduction
Reinforcement learning (RL) has become the *de facto* paradigm for post-training Large Language Models (LLMs) to enhance their capabilities. To avoid the computational and memory overhead of a separate critic network, as in critic-based methods like PPO Schulman et al. ([2017](https://arxiv.org/html/2606.17250#bib.bib28)), critic-free methods have been widely adopted. Most of these, such as GRPO Shao et al. ([2024](https://arxiv.org/html/2606.17250#bib.bib1)), RLOO Ahmadian et al. ([2024](https://arxiv.org/html/2606.17250#bib.bib16)), and ReMax Li et al. ([2023](https://arxiv.org/html/2606.17250#bib.bib26)), generate multiple rollouts per prompt and use the resulting group to estimate value baselines for advantage computation.
Figure 1: Advantage computation in GRPO vs. REINFORCE++. GRPO (left) computes advantages at the group level, whereas REINFORCE++ (right) computes them at the generation-batch level.

Although critic-free methods are more efficient than critic-based alternatives, their reliance on generating and grouping multiple rollouts still incurs a cost. For example, rollout grouping can introduce synchronization barriers Xu and Ding ([2026](https://arxiv.org/html/2606.17250#bib.bib22)), force discarding groups whose rollouts receive identical rewards Yu et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib19)), and prove inflexible for structured rollouts in agentic RL Feng et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib32)). Therefore, recent works have explored critic-free methods that avoid multiple rollouts per prompt and group-based advantage computation Hu et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib21)); Xu and Ding ([2026](https://arxiv.org/html/2606.17250#bib.bib22)). REINFORCE++ Hu et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib21)) replaces group normalization with batch-level normalization, computing advantages from generation-batch statistics rather than group-level statistics (Fig. [1](https://arxiv.org/html/2606.17250#S1.F1)). However, for reasoning tasks it still typically uses a group size larger than one to improve training stability; with a single rollout, it suffers from severe training instability (Fig. [3](https://arxiv.org/html/2606.17250#S2.F3)). To fully enable single-rollout generation, SPO Xu and Ding ([2026](https://arxiv.org/html/2606.17250#bib.bib22)) builds on batch-level normalization and introduces an additional tracker that estimates the value baseline from historical information. However, this tracker requires extra sampling before training, adding computational overhead.
To uncover the role of grouping, we reverse-engineer the functional mechanism of rollout groups and use the resulting insight to develop an alternative single-rollout, critic-free policy optimization method. We make two observations: (1) the training instability of single-rollout RL originates from negative samples; and (2) using exactly one positive and one negative rollout per prompt restores stable training. We interpret these observations as follows. First, an incorrect reasoning trajectory is rarely entirely wrong: it still contains many useful token patterns, such as formatting, intermediate reasoning steps, and tool-use cues. Penalizing all of them equally therefore leads to harmful updates. Second, this harm is strongly alleviated when a positive trajectory for the same prompt is present: since the positive and negative rollouts typically share these functional tokens, the negative gradient on the shared tokens is partially cancelled out. In other words, the group does not merely estimate a baseline; it *de facto* protects shared useful tokens from being over-penalized. We confirm this cancellation effect through (1) a statistical analysis of token overlap and (2) gradient projection onto the top-$K$ subspace of the weight matrices. Stemming from this insight, we propose a simple filtering strategy for negative trajectories that retains only the $10\%$ lowest-probability tokens in the negative loss. We empirically verify this filtering on two batch-level advantage computation methods; the resulting group-free methods are comparable to their group-based counterparts on reasoning tasks and outperform them on agentic tasks.
Figure 2: Training curves of RF++[1], RF++[2], and RF++$_{\text{w/ Baseline}}$[2]. The curves indicate that multi-rollout generation alone cannot guarantee stable training; the grouping mechanism is more critical for training stability.
## 2 Analysis
Most critic\-free RL methods for LLMs introduce two mechanisms: multi\-rollout sampling per prompt, and group\-based advantage computation\. Note that a method with multi\-rollout sampling might not introduce group\-based advantage\. On the other hand, the group\-based advantage computation relies on multi\-rollout sampling\. In this section, we disentangle these two mechanisms and study their effects in isolation\.
### 2.1 The Effect of Grouping
We use REINFORCE++ (RF++) Hu et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib21)) as the base method, which enables us to isolate the effects of multi-rollout generation and grouping. Motivated by recent work Wu et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib18)), we focus on a group size $G$ of $2$ as the minimal multi-rollout setting, which allows a clean analysis while avoiding unnecessary experimental complexity.
Specifically, we compare three variants: RF++[1], RF++[2], and RF++$_{\text{w/ Baseline}}$[2], where [$n$] denotes a group size of $n$. RF++ performs reward normalization over the mini-batch of rollouts, rather than within each prompt-level group as in GRPO. RF++$_{\text{w/ Baseline}}$ further subtracts the group-wise sample mean before batch-level normalization, thereby incorporating a GRPO-style grouping mechanism prior to mini-batch normalization: a group whose rollouts are all correct or all incorrect yields zero advantages and thus contributes no gradient to the optimization.
As shown in Fig. [2](https://arxiv.org/html/2606.17250#S1.F2), although RF++[2] is more stable than RF++[1], it still eventually collapses. By contrast, RF++$_{\text{w/ Baseline}}$[2] remains stable throughout training, with steadily increasing rewards and no visible sign of collapse. This suggests that multi-rollout sampling can partially mitigate instability, but does not guarantee stable training. Compared with multi-rollout sampling alone, the grouping mechanism is more critical for training stability, because it ensures that each effective group contains both positive and negative samples.
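The difference between the batch-level and group-baseline variants can be illustrated with a small sketch. This is not the authors' implementation: the function names, the population standard deviation, and the `eps` stabilizer are assumptions made for illustration, and the rewards are toy values.

```python
import math

def zscore(values, eps=1e-8):
    """Normalize values to zero mean and unit (population) std; eps is an assumed stabilizer."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in values]

def rfpp_advantages(rewards_by_prompt):
    """RF++ style: z-score over all rollouts in the batch, ignoring prompt groups."""
    flat = [r for group in rewards_by_prompt for r in group]
    return zscore(flat)

def rfpp_baseline_advantages(rewards_by_prompt):
    """RF++ w/ Baseline style: subtract each group's mean, then normalize over the batch."""
    centered = []
    for group in rewards_by_prompt:
        g_mean = sum(group) / len(group)
        centered.extend(r - g_mean for r in group)
    return zscore(centered)

# Three prompts with G = 2 binary rewards each (illustrative values only).
rewards = [[1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
plain = rfpp_advantages(rewards)             # every rollout gets a nonzero advantage
grouped = rfpp_baseline_advantages(rewards)  # uniform groups get zero advantage
```

In this toy batch, the all-correct and all-incorrect groups receive nonzero advantages under plain batch normalization, but exactly zero once the group mean is subtracted, so only the mixed group contributes a gradient.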
Figure 3: Learning dynamics under varying negative coefficients $\beta_{k}$. We train the Qwen2.5-Math-1.5B model on a 7.5k-example subset of the DAPO-MATH dataset for 1k steps with a learning rate of $3\times 10^{-6}$. Each step contains $512$ prompts and uses a mini-batch size of $32$. We apply an in-reward KL penalty with a coefficient of $1\times 10^{-3}$.
### 2.2 The Impact of the Negative
In the previous section, we showed that the group-based advantage can effectively mitigate training instability. In this section, we investigate which component is primarily responsible for instability in the absence of grouping: positive or negative samples.
Trajectories with negative advantages induce updates that decrease the likelihood of the sampled tokens. However, especially in RLVR, a negative sample is labeled only by its incorrect final answer. As the old saying goes, *Bonum ex integra causa, malum ex quocumque defectu.* (Footnote 1: literal translation, *Good arises from an integral cause; evil from any defect whatsoever*.) An incorrect final answer does not imply that every token in the trajectory is erroneous or should be penalized. We therefore hypothesize that uniformly penalizing negative trajectories can falsely discourage useful reasoning patterns, grammatical tokens, and other functional components, which we call *supporting tokens* (an example is shown in Fig. [5](https://arxiv.org/html/2606.17250#S2.F5)), thereby inducing unstable and potentially destructive updates.
Figure 4: Hit rates in positive rollouts of high- and low-probability $n$-grams from negative rollouts. Results are computed with Qwen2.5-Math-1.5B on MATH500. For each prompt, we generate 10 responses and discard degenerated trajectories. Among 500 groups, 141 groups with all correct or all incorrect responses are excluded.

To verify our hypothesis, we conduct a controlled experiment on RF++[1] that varies the strength of the negative signal while keeping the positive signal fixed. Specifically, we scale the negative advantage by a coefficient $\beta_{-}$ and examine its effect on training dynamics, monitoring reward, entropy, gradient norm, and sequence length throughout training. As shown in Fig. [3](https://arxiv.org/html/2606.17250#S2.F3), collapse manifests as a sharp drop in reward, accompanied by simultaneous increases in entropy, gradient norm, and sequence length. Once the sequence length saturates at the preset maximum of 2048 tokens, training briefly stabilizes before collapsing again. During this transient phase, the KL penalty becomes the dominant term in the objective, driving the sequence length back down. By this point, the model has already suffered substantial degradation.
As we decrease $\beta_{-}$ from $1$ (the baseline) toward $0$, training becomes progressively more stable, and the onset of collapse is delayed. (Footnote 2: collapse is stochastic; even under the same configuration, it may occur at different training steps. However, the overall trend is clear.) At $\beta_{-}=0.25$, training remains stable throughout the full 1k-step training horizon: the reward increases steadily, with no abrupt drops, while entropy, gradient norm, and sequence length remain well-behaved. Since this horizon is substantially longer than the typical collapse time observed under larger negative coefficients, the stability is unlikely to be incidental. The opposite case further supports this conclusion: when $\beta_{-}=2$, corresponding to the strongest negative signal we test, collapse occurs earlier than in all other configurations. Together, these results indicate that training instability is primarily driven by negative samples rather than positive samples.
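The scaling intervention in this experiment amounts to a one-line transformation of the advantages. The sketch below is illustrative only; the function name `scale_negative` and the toy advantage values are assumptions, not taken from the paper's code.

```python
def scale_negative(advantages, beta_neg):
    """Multiply negative advantages by beta_neg; positive advantages are left unchanged."""
    return [a * beta_neg if a < 0 else a for a in advantages]

advantages = [1.0, -1.0, 0.5, -2.0]                    # toy batch of advantages
damped = scale_negative(advantages, beta_neg=0.25)     # weaker negative signal
amplified = scale_negative(advantages, beta_neg=2.0)   # stronger negative signal
```

Setting `beta_neg=1.0` recovers the unmodified baseline, while `beta_neg=0.0` removes the negative signal entirely.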
### 2.3 Supporting Tokens in Rollouts
Motivated by the previous findings (instability stems from negative samples, whereas positive rollouts from the same group can mitigate it), we hypothesize that: 1) in the absence of positive rollouts, negative samples lead to penalties on supporting tokens, inducing harmful updates; 2) when optimized jointly, positive samples can offset these harmful gradients through their shared supporting tokens Cheng et al. ([2026](https://arxiv.org/html/2606.17250#bib.bib24)).
We test this hypothesis empirically by measuring token-level overlap between positive and negative samples generated from the same prompt (Fig. [4](https://arxiv.org/html/2606.17250#S2.F4)). We further categorize the tokens into two types: high-probability tokens are those whose probabilities fall within the top $90\%$ in a sequence, and the remaining tokens are low-probability tokens. As shown in Fig. [4](https://arxiv.org/html/2606.17250#S2.F4), high-probability $n$-grams consistently achieve higher hit rates than low-probability $n$-grams across different values of $n$. This evidence supports two key findings: 1) positive samples within the same group protect training stability by offsetting harmful updates induced by negative samples on shared supporting tokens; 2) these supporting tokens, particularly at the $n$-gram level, are more likely to fall into the high-probability token category.
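One way to compute such hit rates is sketched below. The text does not specify the exact procedure, so several choices here are assumptions: an $n$-gram counts as low-probability if it contains at least one low-probability token, a "hit" means an exact match with any $n$-gram in a positive rollout for the same prompt, and all function names and toy tokens are hypothetical.

```python
def low_prob_flags(probs, low_frac=0.1):
    """Flag the lowest `low_frac` fraction of tokens in one sequence (rounding is assumed)."""
    n_low = int(round(low_frac * len(probs)))
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    low = set(order[:n_low])
    return [i in low for i in range(len(probs))]

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def hit_rates(neg_tokens, neg_probs, pos_rollouts, n):
    """Return (high_prob_hit_rate, low_prob_hit_rate) for one negative rollout."""
    pos_set = set()
    for pos in pos_rollouts:
        pos_set.update(ngrams(pos, n))
    flags = low_prob_flags(neg_probs)
    hits = {"high": [], "low": []}
    for i, gram in enumerate(ngrams(neg_tokens, n)):
        kind = "low" if any(flags[i:i + n]) else "high"
        hits[kind].append(gram in pos_set)
    rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return rate(hits["high"]), rate(hits["low"])

# Toy example: the negative rollout differs from the positive one near its low-probability token.
neg = ["so", "x", "=", "2", "and", "y", "=", "7", "thus", "9"]
neg_p = [0.9, 0.95, 0.99, 0.8, 0.9, 0.95, 0.99, 0.2, 0.9, 0.7]
pos = [["so", "x", "=", "2", "and", "y", "=", "5", "thus", "7"]]
high_rate, low_rate = hit_rates(neg, neg_p, pos, n=2)
```

In this toy case the high-probability bigrams are mostly found in the positive rollout while the low-probability ones are not, which mirrors the qualitative pattern described for Fig. 4 without reproducing its numbers.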
Figure 5: An example illustrating that positive and negative rollouts for the same prompt can share many *supporting tokens*, shown as uncolored text.

This observation aligns with a natural intuition: tokens assigned higher probabilities by the policy are more likely to be supporting tokens, as the pretrained base model already encodes strong language-modeling and reasoning priors. This insight also suggests a straightforward mitigation for single-rollout RL methods: suppress supporting tokens in negative samples, using prediction confidence as a proxy. This motivates our proposed technique, NTF, discussed in Section [3.2](https://arxiv.org/html/2606.17250#S3.SS2).
### 2.4 Top-$K$ Subspace Alignment
In the previous section, we showed statistically that positive and negative rollouts within a group share supporting tokens. The failure mode is not a gradual decline in reward but an abrupt collapse of the model’s language and reasoning ability, the signature of catastrophic forgetting Kirkpatrick et al. ([2017](https://arxiv.org/html/2606.17250#bib.bib30)). We hypothesize that a layer’s pretrained competence is carried mainly by the dominant directions of its weight matrix, i.e., its top singular subspaces, so that an update is harmful to the extent that its gradient concentrates there. This motivates a direct question: for each type of update, how much of the gradient’s energy falls within the top singular subspace of the weights? We answer it by projecting the gradient onto the top-$k$ singular subspaces, as described below.
For a weight matrix $\bm{W}\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$, we compute its singular value decomposition:

$$\bm{W}=\bm{U}\bm{\Sigma}\bm{V}^{\intercal}, \tag{1}$$

where $\bm{U}\in\mathbb{R}^{d_{\text{out}}\times r}$ and $\bm{V}\in\mathbb{R}^{d_{\text{in}}\times r}$ have orthonormal columns, $\bm{\Sigma}=\operatorname{diag}(\sigma_{1},\dots,\sigma_{r})$ with $\sigma_{1}\geq\cdots\geq\sigma_{r}\geq 0$, and $r=\min(d_{\text{out}},d_{\text{in}})$. Let $\bm{U}_{k}$ and $\bm{V}_{k}$ contain the leading $k$ left and right singular vectors, respectively; these vectors span the top-$k$ left and right singular subspaces of $\bm{W}$.

Given a gradient $\bm{G}\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$ with the same shape as $\bm{W}$, we project it onto the top-$k$ singular subspaces of $\bm{W}$:

$$\bm{P}_{k}=\bm{U}_{k}^{\intercal}\bm{G}\bm{V}_{k}\;\in\;\mathbb{R}^{k\times k}. \tag{2}$$

This projection is the leading $k\times k$ block of $\bm{G}$ expressed in the singular-vector basis of $\bm{W}$. We then define the normalized block energy as

$$\rho_{k}=\frac{\lVert\bm{P}_{k}\rVert_{F}^{2}}{\lVert\bm{G}\rVert_{F}^{2}}\;\in\;[0,1], \tag{3}$$

which measures the fraction of the gradient’s Frobenius energy captured by the top-$k$ singular subspaces. Since orthogonal transformations preserve the Frobenius norm, $\rho_{k}$ is non-decreasing in $k$ and reaches $1$ at $k=r$. A large $\rho_{k}$ for small $k$ therefore indicates that the gradient is concentrated along the dominant directions of $\bm{W}$.
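The block energy in Eq. (3) can be computed directly with NumPy. The sketch below is a minimal illustration on random matrices, not the authors' analysis code; the function name `block_energy` and the toy shapes are assumptions. A square matrix is used so that the singular bases are full orthogonal matrices and the value at $k=r$ is exactly $1$ up to rounding.

```python
import numpy as np

def block_energy(W, G, k):
    """Fraction of ||G||_F^2 captured by the top-k singular subspaces of W (Eq. 3)."""
    # np.linalg.svd returns singular values in descending order; rows of Vt are right singular vectors.
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    P_k = U[:, :k].T @ G @ Vt[:k, :].T   # k x k projected block (Eq. 2)
    return float(np.sum(P_k ** 2) / np.sum(G ** 2))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))   # toy square "weight" matrix
G = rng.standard_normal((8, 8))   # toy "gradient" with the same shape
rho = [block_energy(W, G, k) for k in range(1, 9)]

# A gradient aligned with the leading singular directions has all its energy at k = 1.
U, _, Vt = np.linalg.svd(W, full_matrices=False)
rho_aligned = block_energy(W, np.outer(U[:, 0], Vt[0]), k=1)
```

For a random gradient, `rho` grows gradually with `k`; a gradient concentrated on the dominant directions instead saturates at small `k`, which is the signature the analysis looks for.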
Figure 6: Block energy $\rho_{k}$ as a function of the subspace dimension $k$, computed with Qwen2.5-Math-1.5B on MATH500 over $1{,}380$ positive and $1{,}380$ negative samples drawn from $359$ prompts. Each curve is averaged over $197$ weight matrices.

We compute the gradient from the advantage-weighted log-likelihood loss

$$l=-\frac{A}{|o|}\sum_{t=1}^{|o|}m_{t}\,\log\pi_{\theta}\!\left(o_{t}\mid o_{<t},q\right),\qquad \bm{G}=\frac{\partial l}{\partial\bm{W}}, \tag{4}$$

where $o$ is the sampled trajectory, $q$ the question, $|o|$ the trajectory length, $A$ the advantage, and $m_{t}\in\{0,1\}$ a token mask that selects which tokens contribute to the gradient. We compare four settings. The first three isolate the effect of a token subset: we fix $A=1$, accumulate the gradient over both positive and negative samples, and vary only the mask: (1) high-prob. tokens, (2) low-prob. tokens, and (3) all tokens ($m_{t}=1$). The fourth, (4) policy gradient, sets $m_{t}=1$ and uses the group-relative advantage ($A=1$ for correct and $A=-1$ for incorrect trajectories), so that positive and negative trajectories enter with opposite signs.
Fig. [6](https://arxiv.org/html/2606.17250#S2.F6) reveals three patterns: (i) the all-token curve closely tracks the high-prob. token curve; (ii) for small $k$, high-prob. tokens attain a substantially larger $\rho_{k}$ than low-prob. tokens; and (iii) at $k=1,2$, the policy-gradient curve nearly coincides with the low-prob. token curve.
These patterns support our hypothesis. High-prob. tokens align far more strongly with the top singular directions than low-prob. tokens (observation (ii)), so updating on them perturbs precisely the directions we associate with the model’s core capability. Since the all-token gradient is dominated by these tokens (observation (i)), naive single-rollout training cannot avoid this. Under group-based optimization, however, the high-prob. gradients of positive and negative trajectories are top-$K$-aligned but carry opposite-sign advantages, and therefore largely cancel; the policy-gradient curve consequently collapses onto the low-prob. curve at $k=1,2$ (observation (iii)). Grouping thus shields the dominant directions from over-penalization, preventing collapse.
## 3 Methodology
### 3.1 Batch-level Advantage Computation
#### Notation\.
Let $\pi_{\theta}$ be the current policy with parameter $\theta$ and $\pi_{\theta_{\text{old}}}$ the behavior policy used to collect rollouts, and let $\rho(o_{t})=\pi_{\theta}(o_{t})/\pi_{\theta_{\text{old}}}(o_{t})$ denote the per-token importance ratio. For a query $q$, each sampled trajectory is labeled positive ($o^{+}$) or negative ($o^{-}$); in RLVR with binary verifiable rewards this label is simply the correctness of the trajectory. We write $N^{+}$ and $N^{-}$ for the number of positive and negative trajectories in a mini-batch. For continuous rewards, the positive and negative samples are determined by the value baseline estimated by the group or batch average.
#### Base RL techniques with batch\-level advantage estimation\.
We mainly verify our proposed techniques on two RL techniques with batch\-level advantage estimation\.
REINFORCE++ (RF++). RF++ Hu et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib21)) forms the advantage by z-score normalizing rewards within a generation batch $B$,

$$A=\frac{r-\mu_{B}}{\mathrm{std}_{B}}. \tag{5}$$
Contrastive-REINFORCE (C-RF). For binary rewards with batch success rate $p$, we have $\mu_{B}=p$ and $\mathrm{std}_{B}=\sqrt{p(1-p)}$. Wu et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib18)) and Li et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib31)) show that the surrogate with z-score normalization yields

$$\mathcal{J}=\sqrt{p(1-p)}\;\mathbb{E}_{q}\!\left[\,l^{+}(o^{+}\!\mid q)-l^{-}(o^{-}\!\mid q)\,\right], \tag{6}$$

where the scalar $\sqrt{p(1-p)}=\mathrm{std}_{B}$ is the *standard deviation* of the batch success indicator. (Footnote 3: conditioning on the outcome, $A^{+}=\sqrt{(1-p)/p}$ with probability $p$ and $A^{-}=-\sqrt{p/(1-p)}$ with probability $1-p$; multiplying by the respective outcome probabilities gives the shared coefficient $p\,A^{+}=-(1-p)\,A^{-}=\sqrt{p(1-p)}$.) This coefficient might lead to an undesired weighting effect in the RLVR setting: the signal is amplified at $p=\tfrac{1}{2}$ and attenuated as $p\to 0$ or $p\to 1$. This might be more pronounced in agentic settings, where the initial success rate is near zero.
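The shared coefficient $\sqrt{p(1-p)}$ can be checked numerically. The sketch below is an illustrative sanity check, assuming exact batch statistics for a binary reward with success rate $p$; the function name and the chosen values of $p$ are not from the paper.

```python
import math

def binary_zscore_advantages(p):
    """Z-scored advantages of correct and incorrect rollouts for binary rewards with success rate p."""
    std = math.sqrt(p * (1.0 - p))
    a_pos = (1.0 - p) / std   # reward 1 minus mean p, divided by std
    a_neg = (0.0 - p) / std   # reward 0 minus mean p, divided by std
    return a_pos, a_neg, std

for p in (0.5, 0.1, 0.01):
    a_pos, a_neg, std = binary_zscore_advantages(p)
    # Both outcome-weighted advantages equal std = sqrt(p (1 - p)).
    print(p, round(p * a_pos, 4), round(-(1 - p) * a_neg, 4), round(std, 4))
```

The printed coefficient shrinks as `p` moves away from one half, which is the attenuation effect the text attributes to low initial success rates.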
Therefore, we also verify on another objective, W-REINFORCE Zhu et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib27)), without the weighting coefficient:

$$\mathcal{L}=-\frac{1}{2}\left(\frac{1}{N^{+}}\sum_{i=1}^{N^{+}}l^{+}(o_{i}^{+})\;-\;\frac{1}{N^{-}}\sum_{j=1}^{N^{-}}l^{-}(o_{j}^{-})\right), \tag{7}$$

$$l^{+}(o)=\frac{1}{|o|}\sum_{t=1}^{|o|}\min\!\big(\rho(o_{t}),1+\epsilon\big),\qquad l^{-}(o)=\frac{1}{|o|}\sum_{t=1}^{|o|}\max\!\big(\rho(o_{t}),1-\epsilon\big).$$

We denote it as Contrastive-REINFORCE (C-RF), as the weighting coefficient is removed.
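The value of the objective in Eq. (7) can be sketched in plain Python as follows. This is a forward-pass illustration only: in training, the ratios would be differentiable tensors in an autodiff framework. The function names, the clipping value `eps=0.2`, and the toy ratios are assumptions, and the sketch assumes the mini-batch contains at least one positive and one negative trajectory.

```python
def l_pos(ratios, eps=0.2):
    """Positive term: mean over tokens of min(rho_t, 1 + eps)."""
    return sum(min(r, 1.0 + eps) for r in ratios) / len(ratios)

def l_neg(ratios, eps=0.2):
    """Negative term: mean over tokens of max(rho_t, 1 - eps)."""
    return sum(max(r, 1.0 - eps) for r in ratios) / len(ratios)

def crf_loss(pos_batch, neg_batch, eps=0.2):
    """Eq. (7): -1/2 * (mean l+ over positives - mean l- over negatives)."""
    pos_term = sum(l_pos(o, eps) for o in pos_batch) / len(pos_batch)
    neg_term = sum(l_neg(o, eps) for o in neg_batch) / len(neg_batch)
    return -0.5 * (pos_term - neg_term)

# Per-token importance ratios for one positive and one negative trajectory (toy values).
pos_batch = [[1.0, 1.5]]   # 1.5 is clipped to 1 + eps = 1.2
neg_batch = [[1.0, 0.5]]   # 0.5 is clipped to 1 - eps = 0.8
loss = crf_loss(pos_batch, neg_batch)
```

Because the positive and negative terms are averaged separately, the balance between them does not depend on the batch success rate, which is the property that distinguishes this objective from Eq. (6).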
### 3.2 Negative Token Filtering (NTF)
A negative trajectory is penalized token by token, but not every token is to blame for the failure\. Many of its high\-probability tokens are generic or are also produced during correct behavior\. Penalizing them indiscriminately injects noise into the update and destabilizes training\. We therefore concentrate the negative gradient on the tokens most plausibly responsible for the failure\.
Concretely, given a masking fraction $\tau\in[0,1]$, we rank the tokens of each negative trajectory by their probability under the current policy and mask the top-$\tau$ fraction (the highest-probability tokens). Let $\mathcal{K}(o)$ denote the set of tokens that remain. The negative term is still normalized by the full sequence length,

$$l^{-}(o)=\frac{1}{|o|}\sum_{t\in\mathcal{K}(o)}\max\!\big(\rho(o_{t}),\,1-\epsilon\big), \tag{8}$$

so masked tokens contribute nothing while still counting toward the denominator. This shields high-probability tokens from spurious penalties while keeping the update focused on the tokens that drive the negative outcome.
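A minimal sketch of the filtering step and the filtered negative term in Eq. (8) is given below. It is illustrative rather than the authors' implementation: the function names, the rounding rule for the number of masked tokens, the tie-breaking by position, and `eps=0.2` are all assumptions.

```python
def ntf_keep_mask(probs, tau=0.9):
    """Mask the top-tau fraction of highest-probability tokens; True marks a kept token."""
    n_mask = int(round(tau * len(probs)))   # assumed rounding rule
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    masked = set(order[:n_mask])
    return [i not in masked for i in range(len(probs))]

def l_neg_ntf(ratios, probs, tau=0.9, eps=0.2):
    """Eq. (8): sum over kept tokens only, normalized by the full length |o|."""
    keep = ntf_keep_mask(probs, tau)
    total = sum(max(r, 1.0 - eps) for r, k in zip(ratios, keep) if k)
    return total / len(ratios)

# Toy negative trajectory of 10 tokens; only the least likely token (index 4) survives tau = 0.9.
probs = [0.99, 0.95, 0.90, 0.97, 0.05, 0.98, 0.96, 0.93, 0.92, 0.91]
ratios = [1.0] * 10
keep = ntf_keep_mask(probs)
filtered = l_neg_ntf(ratios, probs)
```

With `tau=0.0` the sketch reduces to the unfiltered negative term, and with `tau=1.0` the negative term vanishes, which corresponds to the positive-only setting.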
## 4 Related Work
#### Group\-Based RL Algorithms
Most current RL algorithms for LLMs are group-based; they require multiple rollouts per prompt. GRPO and RLOO are two representative members of this family. ReMax uses two rollouts per prompt and can therefore also be regarded as group-based. We note that REINFORCE++ performs a single rollout in the RLHF setting but multiple rollouts in the RLVR setting. In all of these methods, the group of rollouts is used to estimate the value function, thereby reducing variance. Recent work offers an alternative interpretation in which group-based methods are viewed as a form of contrastive learning Wu et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib18)); Zhu et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib27)), where the primary role of grouping is to supply contrastive pairs.
#### Group\-Free RL Algorithms
PPO Schulman et al. ([2017](https://arxiv.org/html/2606.17250#bib.bib28)), the most widely used RL algorithm, requires only a single rollout per prompt. However, it must maintain a separate LLM as the critic. More recently, SPO has been proposed as a critic-free, group-free method. SPO leverages historical information to estimate the success rate of each prompt, which serves as the value function. To obtain accurate estimates, SPO requires a pre-rollout phase before training, incurring additional computational cost. In this paper, we instead focus on uncovering the hidden mechanism behind grouping. Our proposed negative token filtering enables stable single-rollout training without any additional computational overhead.
#### Token Masking in RL
Wang et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib20)) propose improving RL post-training by optimizing only high-entropy minority tokens. While this idea is philosophically related to our proposed NTF, the two methods differ in both motivation and implementation. NTF applies only to negative trajectories, whereas the method of Wang et al. ([2025](https://arxiv.org/html/2606.17250#bib.bib20)) is not restricted to negative samples and targets high-entropy tokens.
## 5 Experiment
In this section, we conduct empirical experiments to validate the efficacy of our proposed NTF on two RL base techniques with batch\-level advantage estimation: RF\+\+ and C\-RF\.
#### Experimental Setup
We evaluate on mathematical reasoning and agentic decision-making. The reasoning experiments train on a 7.5K-prompt subset of DAPO-Math (Yu et al., [2025](https://arxiv.org/html/2606.17250#bib.bib19); Wu et al., [2025](https://arxiv.org/html/2606.17250#bib.bib18)) and evaluate on five held-out math benchmarks; the agentic experiments follow He et al. ([2026](https://arxiv.org/html/2606.17250#bib.bib17)) on ALFWorld and WebShop. We use verl Sheng et al. ([2024](https://arxiv.org/html/2606.17250#bib.bib33)) as the experiment framework. Full baseline, dataset, metric, reward, and hyper-parameter details are provided in Appendix [A](https://arxiv.org/html/2606.17250#A1).
### 5.1 Reasoning Task
Table 1: Mathematical reasoning results under different rollout budgets. All methods are trained on the 7.5K-prompt DAPO-Math subset and evaluated with Mean@32 accuracy (%) on five held-out benchmarks. All $G=1$ rows use the same log-probability-based negative-token filtering ratio of $0.1$ (that is, the lowest-probability $10\%$ of negative tokens are retained, corresponding to a masking fraction $\tau=0.9$ in the notation of Section 3.2). “–” indicates a training crash.

Table [1](https://arxiv.org/html/2606.17250#S5.T1) reports mathematical reasoning results under different rollout budgets. The main observation is that single-rollout training becomes viable when negative trajectories are filtered. In our experiments, RF++ and C-RF with $G=1$ can be trained stably with our proposed NTF, while the same setting without filtering collapses during training. This suggests that directly penalizing all tokens in failed trajectories is harmful for long-form reasoning. Filtering mitigates this issue by selectively reducing negative updates on high-confidence tokens and thus preserves useful reasoning patterns.
Compared to RF++ w/ NTF, C-RF w/ NTF reaches better overall performance: on the 1.5B model it outperforms the low-budget GRPO baseline with $G=2$ and approaches GRPO with $G=16$, with a gap of only 0.91 in average accuracy. On the 7B model, group-based GRPO remains stronger, but our method still substantially stabilizes the training of RF++/C-RF while using only one rollout per prompt.
We attribute this advantage to a stronger and more reliable training signal in the single\-rollout setting: after filtering harmful negative updates on supporting tokens, our objective can make better use of the remaining sparse reward signal than vanilla batch\-level advantage computation\.
### 5.2 Agentic Task
Table 2: Agentic decision-making results on ALFWorld and WebShop. We compare prompting baselines and RL-trained methods on Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. ALFWorld reports in-distribution and out-of-distribution success rates, while WebShop reports task score and task success. “–” indicates a training crash. Baseline results are from He et al. ([2026](https://arxiv.org/html/2606.17250#bib.bib17)).

| Model | Type | Method | ALFWorld In-Success | ALFWorld Out-Success | WebShop Task Score | WebShop Task Success |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-1.5B-Instruct | Prompting | Qwen2.5 | 4.1 | – | 23.1 | 5.2 |
| Qwen2.5-1.5B-Instruct | Prompting | ReAct | 12.8 | – | 40.1 | 11.3 |
| Qwen2.5-1.5B-Instruct | Prompting | Reflexion | 21.8 | – | 55.8 | 21.9 |
| Qwen2.5-1.5B-Instruct | RL Training | PPO (with critic) | 54.4 | – | 73.8 | 51.5 |
| Qwen2.5-1.5B-Instruct | RL Training | RLOO | 69.7 | 68.7 | 73.9 | 52.1 |
| Qwen2.5-1.5B-Instruct | RL Training | GRPO | 72.8 | 70.1 | 75.8 | 56.8 |
| Qwen2.5-1.5B-Instruct | RL Training | C-RF | – | – | – | – |
| Qwen2.5-1.5B-Instruct | RL Training | C-RF w/ NTF | 87.4 | 77.9 | 72.4 | 65.7 |
| Qwen2.5-7B-Instruct | Prompting | Qwen2.5 | 14.8 | – | 26.4 | 7.8 |
| Qwen2.5-7B-Instruct | Prompting | ReAct | 31.2 | – | 46.2 | 19.5 |
| Qwen2.5-7B-Instruct | Prompting | Reflexion | 42.7 | – | 58.1 | 28.8 |
| Qwen2.5-7B-Instruct | RL Training | PPO (with critic) | 77.1 | 76.2 | 81.4 | 68.7 |
| Qwen2.5-7B-Instruct | RL Training | RLOO | 77.9 | 74.0 | 80.3 | 65.7 |
| Qwen2.5-7B-Instruct | RL Training | GRPO | 78.6 | 76.8 | 79.3 | 66.1 |
| Qwen2.5-7B-Instruct | RL Training | C-RF | – | – | – | – |
| Qwen2.5-7B-Instruct | RL Training | C-RF w/ NTF | 90.5 | 86.3 | 80.2 | 74.1 |

In agentic tasks, we evaluate C-RF with NTF only, as we were unable to successfully train RF++ with NTF in these settings. We attribute this to the coefficient $\sqrt{p(1-p)}$ discussed in Eq. [6](https://arxiv.org/html/2606.17250#S3.E6): because agentic tasks are substantially harder in the early stages of training, $p$ is small, so this coefficient sharply attenuates the learning signal and makes training difficult.
Table[2](https://arxiv.org/html/2606.17250#S5.T2)reports results on ALFWorld and WebShop, which test whether our single\-rollout training principle generalizes beyond math reasoning to agentic tasks under sparse, delayed rewards\. RL training substantially outperforms prompting\-based methods in both environments, confirming that environment feedback supplies useful supervision for agentic behavior\. We compare primarily against PPO, RLOO, and GRPO, representing critic\-based and group\-based critic\-free training\.
On ALFWorld, our method is strong at both model scales\. Qwen2\.5\-1\.5B\-Instruct reaches 87\.38 in\-distribution and 77\.86 out\-of\-distribution success, exceeding GRPO on both\. Qwen2\.5\-7B\-Instruct further improves these to 90\.48 and 86\.32, surpassing PPO, RLOO, and GRPO\.
WebShop shows a similar advantage on task success: our method raises success over GRPO from 56\.8 to 65\.73 \(1\.5B\) and from 66\.1 to 74\.07 \(7B\)\. Results on task score are more mixed—our method trails GRPO on the 1\.5B model and PPO/RLOO on the 7B model, though it remains competitive\.
The stronger performance of C-RF w/ NTF on agentic tasks suggests that group-based cancellation may be less reliable in large, complex search spaces, where positive and negative trajectories share relatively fewer supporting tokens. This further highlights the importance of selectively controlling negative-sample updates. Extending NTF to group-based RL algorithms is a promising direction for future work.
### 5.3 Sensitivity Study
Figure 7: Learning dynamics under varying negative masking fractions $\tau$. In each setting, Qwen2.5-Math-1.5B is trained with C-RF for 1k steps.

To study the effect of the masking fraction $\tau$ on training stability, we conduct a sensitivity analysis of this key hyperparameter in negative token filtering, which determines the proportion of negative tokens that are masked (the remaining $1-\tau$ fraction is retained). In the study, we follow the setting of the reasoning tasks: the base model, Qwen2.5-Math-1.5B, is trained on a DAPO subset with C-RF for 1k steps. Compared with the previous experiment (Fig. [3](https://arxiv.org/html/2606.17250#S2.F3)), we adopt a more benign learning rate ($1\times 10^{-6}$) with a cosine scheduler and no KL penalty. As shown in Fig. [7](https://arxiv.org/html/2606.17250#S5.F7), we report the curves of reward, validation accuracy, gradient norm, and response length over training; we omit entropy, as it provides little additional information. We observe that: (1) under this benign learning-rate setting, collapse manifests as a degeneration in reward and validation accuracy; and (2) response length is strongly correlated with this degeneration. For $\tau\leq 0.6$, the degeneration is pronounced, whereas for $\tau=0.9$ and $0.8$ training remains stable. Moreover, $\tau=0.9$, our default choice, achieves higher reward and shorter responses than $\tau=0.8$. However, we note that a larger $\tau$ does not always yield better performance. In the setting with $\tau=1$ (positive-only), a different type of collapse occurs: the response length drops rapidly, and the model forgets to reason, generating the answer directly instead. These results demonstrate the importance and necessity of appropriate negative token filtering.
## 6 Conclusion
We revisited the role of multi\-rollout sampling and group\-based advantage in critic\-free RLVR and showed that their function extends beyond value\-baseline estimation\. In particular, groups stabilize training by allowing positive and negative rollouts from the same prompt to cancel harmful negative updates on shared supporting tokens\. This perspective explains why single\-rollout training is especially vulnerable to negative samples, which can falsely penalize useful reasoning patterns, grammatical tokens, and structural components\.
Motivated by this insight, we proposed Negative Token Filtering \(NTF\), which masks high\-probability tokens in negative trajectories and concentrates the negative update on lower\-probability tokens\. NTF enables stable single\-rollout, critic\-free, and group\-free training\. Experiments on mathematical reasoning and agentic decision\-making show that our method avoids collapse and achieves competitive or stronger performance than representative group\-based baselines\. These results suggest that stable RLVR does not inherently require group\-based advantage computation, provided that negative\-token updates are handled selectively\.
## 7 Limitation
Our analysis is primarily static: we do not conduct dynamic experiments that track metrics throughout RL training\. This is because our weight analysis relies on SVD, which is computationally expensive even at small scale, making spectrum analysis over the course of training impractical\. In addition, our experiments focus on the RLVR \(binary reward\) setting and do not cover ordinal or continuous rewards\. We believe RLVR is representative enough to validate our findings, and we leave the exploration of continuous\-reward settings to future work\.
We used LLMs to polish the writing\.
## References
- A\. Ahmadian, C\. Cremer, M\. Gallé, M\. Fadaee, J\. Kreutzer, O\. Pietquin, A\. Üstün, and S\. Hooker \(2024\) Back to basics: revisiting reinforce\-style optimization for learning from human feedback in llms\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 12248–12267\. Cited by: [§A\.2](https://arxiv.org/html/2606.17250#A1.SS2.SSS0.Px2.p1.1), [§1](https://arxiv.org/html/2606.17250#S1.p1.1)\.
- T\. Cheng, Z\. Huang, Z\. Qiu, Y\. Cheng, E\. Ponti, Y\. Xu, I\. Titov, and Z\. Xu \(2026\) The cancellation hypothesis in critic\-free rl: from outcome rewards to token credits\. arXiv preprint arXiv:2605\.08666\. Cited by: [§2\.3](https://arxiv.org/html/2606.17250#S2.SS3.p1.1)\.
- L\. Feng, Z\. Xue, T\. Liu, and B\. An \(2025\) Group\-in\-group policy optimization for llm agent training\. Advances in Neural Information Processing Systems 38, pp\. 46375–46408\. Cited by: [§1](https://arxiv.org/html/2606.17250#S1.p2.1)\.
- C\. He, R\. Luo, Y\. Bai, S\. Hu, Z\. L\. Thai, J\. Shen, J\. Hu, X\. Han, Y\. Huang, Y\. Zhang, J\. Liu, L\. Qi, Z\. Liu, and M\. Sun \(2024\) OlympiadBench: a challenging benchmark for promoting agi with olympiad\-level bilingual multimodal scientific problems\. External Links: 2402\.14008, [Link](https://arxiv.org/abs/2402.14008)\. Cited by: [§A\.1](https://arxiv.org/html/2606.17250#A1.SS1.p1.1)\.
- S\. He, L\. Feng, Q\. Wei, X\. Cheng, L\. Feng, and B\. An \(2026\) Hierarchy\-of\-groups policy optimization for long\-horizon agentic tasks\. arXiv preprint arXiv:2602\.22817\. Cited by: [§A\.1](https://arxiv.org/html/2606.17250#A1.SS1.p2.1), [§A\.2](https://arxiv.org/html/2606.17250#A1.SS2.SSS0.Px2.p2.1), [§5](https://arxiv.org/html/2606.17250#S5.SS0.SSS0.Px1.p1.1), [Table 2](https://arxiv.org/html/2606.17250#S5.T2)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\) Measuring mathematical problem solving with the math dataset\. arXiv preprint arXiv:2103\.03874\. Cited by: [§A\.1](https://arxiv.org/html/2606.17250#A1.SS1.p1.1)\.
- J\. Hu, J\. K\. Liu, H\. Xu, and W\. Shen \(2025\) Reinforce\+\+: stabilizing critic\-free policy optimization with global advantage normalization\. arXiv preprint arXiv:2501\.03262\. Cited by: [§A\.2](https://arxiv.org/html/2606.17250#A1.SS2.p1.1), [§1](https://arxiv.org/html/2606.17250#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.17250#S2.SS1.p1.2), [§3\.1](https://arxiv.org/html/2606.17250#S3.SS1.SSS0.Px2.p2.1)\.
- J\. Kirkpatrick, R\. Pascanu, N\. Rabinowitz, J\. Veness, G\. Desjardins, A\. A\. Rusu, K\. Milan, J\. Quan, T\. Ramalho, A\. Grabska\-Barwinska, et al\. \(2017\) Overcoming catastrophic forgetting in neural networks\. Proceedings of the National Academy of Sciences 114 \(13\), pp\. 3521–3526\. Cited by: [§2\.4](https://arxiv.org/html/2606.17250#S2.SS4.p1.1)\.
- N\. Lambert, J\. Morrison, V\. Pyatkin, S\. Huang, H\. Ivison, F\. Brahman, L\. J\. V\. Miranda, A\. Liu, N\. Dziri, S\. Lyu, Y\. Gu, S\. Malik, V\. Graf, J\. D\. Hwang, J\. Yang, R\. L\. Bras, O\. Tafjord, C\. Wilhelm, L\. Soldaini, N\. A\. Smith, Y\. Wang, P\. Dasigi, and H\. Hajishirzi \(2025\) Tulu 3: pushing frontiers in open language model post\-training\. External Links: 2411\.15124, [Link](https://arxiv.org/abs/2411.15124)\. Cited by: [§A\.2](https://arxiv.org/html/2606.17250#A1.SS2.p1.1)\.
- A\. Lewkowycz, A\. Andreassen, D\. Dohan, E\. Dyer, H\. Michalewski, V\. Ramasesh, A\. Slone, C\. Anil, I\. Schlag, T\. Gutman\-Solo, Y\. Wu, B\. Neyshabur, G\. Gur\-Ari, and V\. Misra \(2022\) Solving quantitative reasoning problems with language models\. External Links: 2206\.14858, [Link](https://arxiv.org/abs/2206.14858)\. Cited by: [§A\.1](https://arxiv.org/html/2606.17250#A1.SS1.p1.1)\.
- G\. Li, M\. Lin, T\. Galanti, Z\. Tu, and T\. Yang \(2025\) Disco: reinforcing large reasoning models with discriminative constrained optimization\. Advances in Neural Information Processing Systems 38, pp\. 57304–57331\. Cited by: [§3\.1](https://arxiv.org/html/2606.17250#S3.SS1.SSS0.Px2.p3.3)\.
- Z\. Li, T\. Xu, Y\. Zhang, Z\. Lin, Y\. Yu, R\. Sun, and Z\. Luo \(2023\) Remax: a simple, effective, and efficient reinforcement learning method for aligning large language models\. arXiv preprint arXiv:2310\.10505\. Cited by: [§1](https://arxiv.org/html/2606.17250#S1.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\) Let’s verify step by step\. External Links: 2305\.20050, [Link](https://arxiv.org/abs/2305.20050)\. Cited by: [§A\.1](https://arxiv.org/html/2606.17250#A1.SS1.p1.1)\.
- Math\-AI Team \(2025\) AMC 2023\. Note: Hugging Face dataset\. Available at: [https://huggingface\.co/datasets/math\-ai/amc23](https://huggingface.co/datasets/math-ai/amc23)\. Cited by: [§A\.1](https://arxiv.org/html/2606.17250#A1.SS1.p1.1)\.
- MathArena \(2026\) AIME 2025\. Note: Hugging Face dataset\. Dataset constructed from the 2025 American Invitational Mathematics Examination\. External Links: [Link](https://huggingface.co/datasets/MathArena/aime_2025)\. Cited by: [§A\.1](https://arxiv.org/html/2606.17250#A1.SS1.p1.1)\.
- Qwen: A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\) Qwen2\.5 technical report\. External Links: 2412\.15115, [Link](https://arxiv.org/abs/2412.15115)\. Cited by: [§A\.2](https://arxiv.org/html/2606.17250#A1.SS2.SSS0.Px2.p1.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal policy optimization algorithms\. arXiv preprint arXiv:1707\.06347\. Cited by: [§A\.2](https://arxiv.org/html/2606.17250#A1.SS2.SSS0.Px2.p1.1), [§1](https://arxiv.org/html/2606.17250#S1.p1.1), [§4](https://arxiv.org/html/2606.17250#S4.SS0.SSS0.Px2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu, et al\. \(2024\) Deepseekmath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\. Cited by: [§A\.2](https://arxiv.org/html/2606.17250#A1.SS2.p1.1), [§1](https://arxiv.org/html/2606.17250#S1.p1.1)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2024\) HybridFlow: a flexible and efficient rlhf framework\. arXiv preprint arXiv:2409\.19256\. Cited by: [§5](https://arxiv.org/html/2606.17250#S5.SS0.SSS0.Px1.p1.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. External Links: 2303\.11366, [Link](https://arxiv.org/abs/2303.11366)\. Cited by: [§A\.2](https://arxiv.org/html/2606.17250#A1.SS2.SSS0.Px2.p1.1)\.
- M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht \(2021\) ALFWorld: aligning text and embodied environments for interactive learning\. External Links: 2010\.03768, [Link](https://arxiv.org/abs/2010.03768)\. Cited by: [§A\.1](https://arxiv.org/html/2606.17250#A1.SS1.p2.1)\.
- S\. Wang, L\. Yu, C\. Gao, C\. Zheng, S\. Liu, R\. Lu, K\. Dang, X\. Chen, J\. Yang, Z\. Zhang, et al\. \(2025\) Beyond the 80/20 rule: high\-entropy minority tokens drive effective reinforcement learning for llm reasoning\. In Advances in Neural Information Processing Systems\. Cited by: [§4](https://arxiv.org/html/2606.17250#S4.SS0.SSS0.Px3.p1.1)\.
- Y\. Wu, L\. Ma, L\. Ding, M\. Li, X\. Wang, K\. Chen, Z\. Su, Z\. Zhang, C\. Huang, Y\. Zhang, et al\. \(2025\) It takes two: your grpo is secretly dpo\. arXiv preprint arXiv:2510\.00977\. Cited by: [§2\.1](https://arxiv.org/html/2606.17250#S2.SS1.p1.2), [§3\.1](https://arxiv.org/html/2606.17250#S3.SS1.SSS0.Px2.p3.3), [§4](https://arxiv.org/html/2606.17250#S4.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.17250#S5.SS0.SSS0.Px1.p1.1)\.
- Z\. Xu and Z\. Ding \(2026\) Single\-stream policy optimization\. In Int\. Conf\. Learn\. Represent\. Cited by: [§1](https://arxiv.org/html/2606.17250#S1.p2.1)\.
- S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2023a\) WebShop: towards scalable real\-world web interaction with grounded language agents\. External Links: 2207\.01206, [Link](https://arxiv.org/abs/2207.01206)\. Cited by: [§A\.1](https://arxiv.org/html/2606.17250#A1.SS1.p2.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023b\) ReAct: synergizing reasoning and acting in language models\. External Links: 2210\.03629, [Link](https://arxiv.org/abs/2210.03629)\. Cited by: [§A\.2](https://arxiv.org/html/2606.17250#A1.SS2.SSS0.Px2.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu, et al\. \(2025\) Dapo: an open\-source llm reinforcement learning system at scale\. In Advances in Neural Information Processing Systems\. Cited by: [§A\.1](https://arxiv.org/html/2606.17250#A1.SS1.p1.1), [§1](https://arxiv.org/html/2606.17250#S1.p2.1), [§5](https://arxiv.org/html/2606.17250#S5.SS0.SSS0.Px1.p1.1)\.
- X\. Zhu, M\. Xia, Z\. Wei, W\. Chen, D\. Chen, and Y\. Meng \(2025\) The surprising effectiveness of negative reinforcement in llm reasoning\. Advances in Neural Information Processing Systems 38, pp\. 126546–126573\. Cited by: [§3\.1](https://arxiv.org/html/2606.17250#S3.SS1.SSS0.Px2.p4.1), [§4](https://arxiv.org/html/2606.17250#S4.SS0.SSS0.Px1.p1.1)\.
## Appendix A Experimental Details
### A\.1 Datasets
For mathematical reasoning, all RL methods are trained on the same 7\.5K\-prompt subset of DAPO\-Math \(Yu et al\., [2025](https://arxiv.org/html/2606.17250#bib.bib19)\) and evaluated on MATH\-500 Hendrycks et al\. \([2021](https://arxiv.org/html/2606.17250#bib.bib2)\); Lightman et al\. \([2023](https://arxiv.org/html/2606.17250#bib.bib10)\), AMC 2023 Math\-AI Team \([2025](https://arxiv.org/html/2606.17250#bib.bib3)\), Minerva Math Lewkowycz et al\. \([2022](https://arxiv.org/html/2606.17250#bib.bib6)\), AIME 2025 MathArena \([2026](https://arxiv.org/html/2606.17250#bib.bib4)\), and Olympiad Bench He et al\. \([2024](https://arxiv.org/html/2606.17250#bib.bib5)\)\. We use binary verifiable rewards based on extracted final\-answer correctness and report Mean@32 accuracy, computed from 32 sampled generations per problem\.
For agentic tasks, we follow the ALFWorld Shridhar et al\. \([2021](https://arxiv.org/html/2606.17250#bib.bib8)\) and WebShop Yao et al\. \([2023a](https://arxiv.org/html/2606.17250#bib.bib9)\) protocol of He et al\. \([2026](https://arxiv.org/html/2606.17250#bib.bib17)\)\. We train our method on a 1024\-task ALFWorld subset and a 1024\-instruction WebShop subset, and evaluate with the same metrics as the reported baselines: in\-/out\-of\-distribution success for ALFWorld, and task score/task success for WebShop\. Rewards are sparse and delayed environment feedback with no per\-step shaping; for WebShop, we use binary task success as the training reward\.
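As an illustration of the Mean@k metric, the sketch below averages binary correctness over the k sampled generations of each problem and then over problems\. The function name `mean_at_k` and the toy inputs are hypothetical; answer extraction and verification are assumed to have already produced the 0/1 flags\.

```python
def mean_at_k(correct_flags):
    """Mean@k accuracy.

    correct_flags: one list per problem, each holding k binary flags
    (1 = the sampled generation's extracted final answer is correct).
    Returns the per-problem mean accuracy averaged over all problems.
    """
    per_problem = [sum(flags) / len(flags) for flags in correct_flags]
    return sum(per_problem) / len(per_problem)


# Toy example with k = 4 samples per problem (Mean@32 would use k = 32).
toy_flags = [[1, 1, 0, 0], [1, 0, 0, 0]]
toy_score = mean_at_k(toy_flags)
```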
### A\.2 Baseline Details
We provide additional details about the baselines used in our experiments\. The baselines are selected to compare our method with representative RLVR Lambert et al\. \([2025](https://arxiv.org/html/2606.17250#bib.bib12)\) training paradigms under different rollout budgets\. In particular, we use REINFORCE\+\+ Hu et al\. \([2025](https://arxiv.org/html/2606.17250#bib.bib21)\) with G=1 as the closest single\-rollout baseline, GRPO Shao et al\. \([2024](https://arxiv.org/html/2606.17250#bib.bib1)\) with small and large group sizes as group\-based references, and standard prompting/RL baselines for agentic tasks\.
#### Mathematical reasoning\.
For mathematical reasoning, we compare with GRPO and REINFORCE\+\+\. GRPO is a representative group\-based critic\-free RLVR method, where multiple responses are sampled for the same prompt and relative advantages are estimated within each group\. We report GRPO with G=16 as a strong multi\-rollout baseline and GRPO with G=2 as a lower\-budget group\-based baseline\. These two settings allow us to examine how our single\-rollout method compares with group\-based training under different rollout costs\.
REINFORCE\+\+ is used as the main batch\-normalized critic\-free baseline\. We report REINFORCE\+\+ with G=8 as a multi\-rollout batch\-normalized baseline and REINFORCE\+\+ with G=1 as the closest comparison to our method\. The G=1 setting is especially important because it shares the same one\-rollout\-per\-prompt constraint as our method\. For all single\-rollout runs conducted by us, including REINFORCE\+\+ with G=1 and our method, we apply the same log\-probability\-based negative\-token filtering with ratio τ=0\.1\. Therefore, the comparison between REINFORCE\+\+ with G=1 and our method evaluates whether our single\-rollout training rule can make better use of the same filtered reward signal, rather than merely testing the effect of adding token filtering\.
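The difference between group\-level and batch\-level advantage estimation can be sketched as follows\. This is a simplified illustration, not the baselines' actual implementations: the function names, the use of the population standard deviation, and the `eps` constant are assumptions, and real implementations differ in such details\. The point it shows is that a within\-group baseline carries no signal when G=1, whereas a batch\-level baseline remains informative\.

```python
from statistics import mean, pstdev


def group_advantages(rewards_by_prompt, eps=1e-6):
    """Group-relative advantages: normalize rewards within each prompt's group."""
    out = []
    for rewards in rewards_by_prompt:
        mu, sigma = mean(rewards), pstdev(rewards)
        out.append([(r - mu) / (sigma + eps) for r in rewards])
    return out


def batch_advantages(rewards_by_prompt, eps=1e-6):
    """Batch-level advantages: normalize rewards over the whole batch."""
    flat = [r for rewards in rewards_by_prompt for r in rewards]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(r - mu) / (sigma + eps) for r in rewards]
            for rewards in rewards_by_prompt]


# Four prompts with a single rollout each (G = 1) and binary rewards.
single_rollout_batch = [[1], [0], [0], [1]]
per_group = group_advantages(single_rollout_batch)   # all zeros: no signal
per_batch = batch_advantages(single_rollout_batch)   # signed, non-zero values
```

With one rollout per prompt, every group\-relative advantage is zero because each reward equals its own group mean, so single\-rollout training has to rely on a batch\-level \(or otherwise shared\) baseline\.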
#### Agentic tasks\.
For agentic tasks, we compare with both prompting\-based and RL\-based baselines\. The prompting baselines include direct Qwen2\.5 Qwen et al\. \([2025](https://arxiv.org/html/2606.17250#bib.bib15)\) prompting, ReAct Yao et al\. \([2023b](https://arxiv.org/html/2606.17250#bib.bib13)\), and Reflexion Shinn et al\. \([2023](https://arxiv.org/html/2606.17250#bib.bib14)\), which measure the performance of the base model without RL training\. The RL baselines include PPO Schulman et al\. \([2017](https://arxiv.org/html/2606.17250#bib.bib28)\), RLOO Ahmadian et al\. \([2024](https://arxiv.org/html/2606.17250#bib.bib16)\), and GRPO\. PPO represents critic\-based RL training, while RLOO and GRPO represent critic\-free methods based on relative or grouped trajectory comparison\.
The baseline numbers for agentic tasks are taken from He et al\. \([2026](https://arxiv.org/html/2606.17250#bib.bib17)\), which reports results on ALFWorld and WebShop using Qwen2\.5\-Instruct backbones\. Specifically, ALFWorld is evaluated by in\-distribution and out\-of\-distribution success rates, while WebShop is evaluated by task score and task success rate\. Our method is evaluated under the same task protocol and metrics for comparison\. Unlike the group\-based baselines, our method uses only one rollout per prompt and does not require a critic, a value tracker, or prompt\-level rollout groups\.
### A\.3 Hyper\-parameters
See Table [3](https://arxiv.org/html/2606.17250#A1.T3) and Table [4](https://arxiv.org/html/2606.17250#A1.T4)\.
Table 3: Training hyper\-parameters of our method\. Values that are shared across all three tasks are merged into a single cell; columns differ only where shown\.

Table 4: Evaluation hyper\-parameters\.