Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

arXiv cs.LG 05/19/26, 04:00 AM Papers
Summary
Introduces Implicit Behavior Policy Optimization (IBPO), a counterfactual comparison-based credit assignment framework that improves training stability and performance in multi-step reasoning tasks for large language models by converting sparse terminal rewards into step-sensitive learning signals.
arXiv:2605.16302v1 Announce Type: new Abstract: Reinforcement learning for multi-step reasoning with large language models (LLMs) often relies on sparse terminal rewards, leading to poor credit assignment conditions where the final feedback is evenly propagated across all intermediate decisions. This results in high gradient variance, unstable training, and numerous ineffective updates, ultimately causing the model to fail and preventing sustained improvement. We introduce a counterfactual comparison-based credit assignment framework, which samples multiple reasoning trajectories under the same input. By treating their differences as an implicit approximation of alternative decisions, we construct an implicit process-level advantage estimator that transforms sparse terminal rewards into step-sensitive learning signals. Based on this, we propose Implicit Behavior Policy Optimization (IBPO), which significantly improves training stability and performance upper bounds on mathematical and code reasoning benchmarks, pointing to a promising direction for unlocking the performance potential of LLMs.
Cached at: 05/19/26, 06:40 AM
# Reducing Credit Assignment Variance via Counterfactual Reasoning Paths
Source: [https://arxiv.org/html/2605.16302](https://arxiv.org/html/2605.16302)
###### Abstract

Reinforcement learning for multi\-step reasoning with large language models \(LLMs\) typically relies on sparse terminal rewards, leading to poor credit assignment—the final feedback is uniformly propagated to all intermediate decision steps\. This produces high gradient variance, training instability, and numerous ineffective updates, ultimately preventing the model from achieving sustained improvement\. We propose a counterfactual comparison\-based credit assignment framework that samples multiple reasoning trajectories from the same input, treats inter\-trajectory differences as implicit approximations of alternative decisions, and thereby constructs an implicit process\-level advantage estimator that converts sparse terminal rewards into step\-sensitive learning signals\. Building on this, we introduce Implicit Behavior Policy Optimization \(IBPO\), which significantly improves training stability and performance ceilings on mathematical and code reasoning benchmarks, pointing toward a promising direction for unlocking the performance potential of LLMs\.

Machine Learning, ICML

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.16302v1/zhaiyaotu2.png)

Figure 1: Overview of IBPO: a counterfactual trajectory comparison framework for process-level credit assignment under sparse terminal rewards. By contrasting multiple reasoning paths sampled from the same input, IBPO derives implicit step-sensitive learning signals, improving optimization stability and sample efficiency in LLM reinforcement learning.

Recent advances in large language models (LLMs) have achieved remarkable breakthroughs on complex multi-step reasoning tasks, particularly after fine-tuning with reinforcement learning (RL). RL has become a key paradigm for scaling LLM capabilities, enabling models to solve increasingly complex problems through deeper and longer chains of reasoning, such as competition-level mathematics and program synthesis.

However, scaling RL for reasoning tasks requires maintaining training stability and sample efficiency under ever-increasing compute budgets. Despite this, mainstream RL methods such as Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2605.16302#bib.bib11)) still use sequence-level or trajectory-level rewards to optimize policies. This creates a fundamental mismatch between the learning signal and the inherently step-by-step nature of reasoning.

In multi-step reasoning, correctness depends on a sequence of intermediate decisions. Yet sequence-level supervision rewards entire trajectories based solely on the final answer: trajectories with flawed reasoning processes can receive positive rewards if they happen to produce the correct final output, while trajectories with largely correct reasoning but a single local error may be entirely discarded. This coarse-grained feedback undermines the model’s ability to distinguish early from late errors, disrupts credit assignment, destabilizes learning, and limits exploration of alternative reasoning paths. This problem is particularly pronounced in long-horizon or difficult tasks. Moreover, even a single local error may require extensive sampling and updates to be statistically corrected, introducing a significant efficiency bottleneck commonly referred to as the *learning tax*.

In this work, we propose a counterfactual learning approach to address the credit assignment problem under sparse terminal rewards\. Even without step\-level supervision, the differences among reasoning trajectories sampled from the same input naturally contain process\-level information\. The divergences between these trajectories implicitly reflect how alternative intermediate decisions might have led to different outcomes\. By systematically comparing these counterfactual paths and aligning their differences with final outcomes, we construct informative learning signals that are more sensitive to intermediate decisions\.

Building on this insight, we introduce *Implicit Behavior Policy Optimization (IBPO)*, a process-level credit assignment framework induced by counterfactual trajectory comparison. IBPO defines a general multi-trajectory comparison operator and uses it to construct an implicit advantage estimator. This estimator reweights terminal rewards based on trajectory-level differences, thereby reducing gradient variance and amplifying learning signals at points of frequent decision errors. IBPO does not rely on step-level annotations, external verifiers, or additional value networks, and can be seamlessly integrated with existing sequence-level RL optimizers while improving convergence stability and sample efficiency.

##### Contributions\.

Our main contributions are as follows:

- Counterfactual credit assignment formulation. We introduce a counterfactual learning perspective on credit assignment in LLM reinforcement learning, treating multiple reasoning trajectories from the same input as approximations of alternative decisions. We show that the inconsistencies among these trajectories contain key information for process-level learning, even without step-level rewards.
- Implicit process-level advantage and the IBPO framework. We formalize a general multi-trajectory comparison operator and use it to construct an implicit process-level advantage estimator, from which we derive the IBPO framework.
- Theoretical analysis of variance reduction and positive transfer. We analyze how counterfactual trajectory comparison reduces gradient variance and amplifies learning signals in high-error-rate regions. We prove that this mechanism induces backward transfer to underlying reasoning skills and mitigates the learning tax problem.
- Mechanism-driven empirical validation. We evaluate IBPO on multiple mathematical and code reasoning benchmarks. Experiments demonstrate that IBPO consistently improves convergence, sample efficiency, and early error correction ability over strong baselines under compute-matched conditions.

## 2 Related Work

Group Relative Policy Optimization (GRPO). GRPO (Shao et al., [2024](https://arxiv.org/html/2605.16302#bib.bib11)) is a recent reinforcement learning algorithm developed for fine-tuning large language models (LLMs) on reasoning tasks, achieving strong results in systems such as DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2605.16302#bib.bib2)). GRPO leverages within-group sampling to estimate group-relative advantages, replacing explicit value modeling in PPO, thereby enabling faster and more efficient training. However, GRPO suffers from entropy collapse, reward collapse, and unstable convergence (Yu et al., [2025](https://arxiv.org/html/2605.16302#bib.bib18)), largely stemming from its reliance on the assumption that *the terminal reward sufficiently characterizes the reasoning trajectory*. This assumption often fails in long-horizon reasoning, where the model’s success depends on a sequence of interdependent steps, leading to ill-posed credit assignment and inflated gradient variance. GSPO (Zheng et al., [2025](https://arxiv.org/html/2605.16302#bib.bib19)) is an improvement over GRPO that computes importance ratios at the sequence level.

Self-Correction Strategies. Self-correction has emerged as a promising direction for enhancing reasoning capabilities. For example, selective reflection fine-tuning (Li et al., [2024](https://arxiv.org/html/2605.16302#bib.bib6)) enables models to perform reflective evaluation over multiple candidate responses and fine-tune on the optimal response through supervised learning.

Reward Modeling. Reward models are crucial for achieving robust System-2 reasoning but are difficult to construct. Recent directions include LLM-as-a-Judge frameworks (Zheng et al., [2023](https://arxiv.org/html/2605.16302#bib.bib20); Qi et al., [2024](https://arxiv.org/html/2605.16302#bib.bib10)), outcome reward models (Yang et al., [2024](https://arxiv.org/html/2605.16302#bib.bib15); Yu et al., [2023](https://arxiv.org/html/2605.16302#bib.bib17)), and process reward models (PRMs) (Lightman et al., [2023](https://arxiv.org/html/2605.16302#bib.bib7); Luo et al., [2024](https://arxiv.org/html/2605.16302#bib.bib8); Wang et al., [2024b](https://arxiv.org/html/2605.16302#bib.bib14)) that provide step-level feedback for complex tasks. However, PRMs have critical limitations: high annotation costs, weak generalization, and noisy signals produced by automated methods such as Monte Carlo sampling or MCTS (Kang et al., [2024](https://arxiv.org/html/2605.16302#bib.bib4); Wang et al., [2024a](https://arxiv.org/html/2605.16302#bib.bib13)). Human-annotated datasets like PRM800k (Lightman et al., [2023](https://arxiv.org/html/2605.16302#bib.bib7)) are difficult to scale, and existing automatic annotation methods typically produce noisy or inconsistent reward scores. In contrast, our IBPO approach bypasses the need for fine-grained annotation through implicit comparison while still providing effective process-level supervision. Unlike existing methods, our approach does not assume that rewards can be decomposed into stepwise reward signals.

SCoRe (Kumar et al., [2024](https://arxiv.org/html/2605.16302#bib.bib5)) iteratively leverages previously generated responses, prompting the model to identify errors in earlier outputs. It improves reasoning accuracy through multi-round reinforcement learning, but suffers from low training efficiency due to repeated generation and optimization cycles.

## 3 Method

### 3.1 Problem Formulation

We consider multi-step reasoning reinforcement learning problems with terminal rewards. Given input $x$, the policy $\pi_{\theta}$ generates a reasoning trajectory of length $T$:

$$\tau=(a_{1},a_{2},\dots,a_{T}),\quad a_{t}\sim\pi_{\theta}(\cdot\mid x,a_{<t}), \tag{1}$$

where $a_{t}$ denotes the decision generated at step $t$.

The environment provides a sequence-level reward only upon trajectory completion:

$$R(\tau)\in[-1,1]. \tag{2}$$

In most reasoning tasks, $R(\tau)$ is typically sparse (e.g., binary correctness of the final answer) and does not provide explicit step-level supervision.

The standard policy gradient objective is:

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau\sim\pi_{\theta}}\left[A(\tau)\sum_{t=1}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}\mid x,a_{<t})\right]. \tag{3}$$

In multi-step reasoning tasks, the core challenge lies not in the sparsity of the reward itself, but in the extreme instability of credit assignment from terminal rewards to early decisions. When local errors occur at early steps, their effects tend to cascade and amplify through subsequent reasoning steps. However, such errors are only indirectly reflected through the terminal reward, resulting in extremely noisy gradient signals whose variance grows significantly with trajectory length.
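To make the uniform-credit issue concrete, the following minimal Python sketch writes out the per-step weights implied by Equation (3) when $A(\tau)$ is a single trajectory-level scalar. The function names `per_step_weights` and `surrogate_loss` are illustrative and do not come from the paper; the advantage is treated as a constant when differentiating.

```python
from typing import List


def per_step_weights(advantage: float, num_steps: int) -> List[float]:
    """Weight applied to grad log pi(a_t | x, a_<t) at each step under Eq. (3)."""
    # A(tau) is one scalar per trajectory, so every step gets the same weight,
    # no matter where in the trajectory an error actually occurred.
    return [advantage] * num_steps


def surrogate_loss(step_logprobs: List[float], advantage: float) -> float:
    """Scalar whose gradient w.r.t. the policy parameters is minus the summand of Eq. (3)."""
    return -advantage * sum(step_logprobs)
```

For a failed trajectory with advantage -1 and five steps, all five weights equal -1, so a correct early step is pushed down exactly as hard as the step that introduced the error.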

##### IBPO as a Framework Rather Than an Implementation\.

We emphasize that IBPO is a *training framework* for credit assignment under sparse terminal rewards, rather than a specific error-correction or rewriting algorithm. Its core contribution is establishing counterfactual trajectory comparison as a general mechanism for inducing implicit process-level learning signals. Specifically, the multi-trajectory comparison operator $\mathcal{M}$ in our framework is an abstract operator primarily used to extract differences between trajectories and generate learning signals reflecting process-level decision differences. The IBPO framework does not depend on specific implementation details, such as how counterfactual differences are computed or what comparison mechanism is used. The operator $\mathcal{M}$ can be instantiated through various means (such as consistency scoring, relative ranking, or error detection), but these are implementation details rather than components of the IBPO framework itself. Therefore, IBPO’s core contribution lies in its framework-level design, while specific instantiations can be customized according to task requirements.

Specific comparison mechanisms such as error correction, verifier-based ranking, or consistency scoring should be viewed as *instantiations* of the comparison operator used within the IBPO framework. Our theoretical analysis and optimization framework apply to any instantiation that produces trajectory-dependent comparison signals sensitive to counterfactual differences.

### 3.2 Counterfactual Trajectory Comparison

##### Counterfactual Trajectory Comparison and the Role of Operator $\mathcal{M}$.

The core idea of IBPO is to sample multiple reasoning trajectories from the same input and leverage their differences as counterfactual approximations to construct process-sensitive learning signals. Specifically, we sample $G$ trajectories from the policy. Completely correct trajectories do not require additional signals; for each incorrect trajectory $\tau_{i}$, we pair it with $K-1$ correct trajectories to form a $K$-tuple. When correct trajectories are insufficient, we duplicate them to reach the required number; if no correct trajectories exist, incorrect trajectories are randomly selected as substitutes.

$$\tau_{i}^{(1)},\ldots,\tau_{i}^{(K)}\sim\pi_{\theta}(\cdot\mid x),\qquad K\geq 2. \tag{4}$$
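The pairing rule described above can be sketched in Python as follows. This is only one possible reading: the text does not say how duplicates are chosen or whether an incorrect trajectory may serve as its own substitute, so the cyclic duplication and the exclusion of the trajectory itself are assumptions, and `build_k_tuple` is a hypothetical helper name.

```python
import random
from typing import List, Sequence


def build_k_tuple(incorrect_idx: int, is_correct: Sequence[bool], k: int,
                  rng: random.Random) -> List[int]:
    """Return indices [incorrect, ref_1, ..., ref_{k-1}] for one incorrect trajectory."""
    if k < 2:
        raise ValueError("K must be at least 2")
    if is_correct[incorrect_idx]:
        raise ValueError("tuples are only built for incorrect trajectories")
    need = k - 1
    correct = [j for j, ok in enumerate(is_correct) if ok]
    if len(correct) >= need:
        # Enough correct trajectories: draw K-1 distinct references.
        refs = rng.sample(correct, need)
    elif correct:
        # Too few correct trajectories: duplicate them (assumed: cyclically).
        refs = [correct[j % len(correct)] for j in range(need)]
    else:
        # No correct trajectory in the group: substitute random incorrect ones
        # (assumed: excluding the trajectory itself when others exist).
        others = [j for j in range(len(is_correct)) if j != incorrect_idx]
        pool = others or [incorrect_idx]
        refs = [rng.choice(pool) for _ in range(need)]
    return [incorrect_idx] + refs
```

With a group of four rollouts of which only index 1 is correct and $K=3$, the tuple for trajectory 0 is `[0, 1, 1]`: the single correct trajectory is duplicated to fill both reference slots.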
We introduce the multi-trajectory comparison operator

$$\mathcal{M}:\{\tau_{i}^{(k)}\}_{k=1}^{K}\longmapsto\mathbf{s}(\tau_{i})\in[0,1], \tag{5}$$

where each component $s(\tau_{i})$ represents the comparison-induced signal associated with trajectory $\tau_{i}^{(k)}$, summarizing its intermediate decision quality relative to other counterfactual trajectories (e.g., relative consistency, recoverability, or difference-aware quality). The operator $\mathcal{M}$ can be implemented through various mechanisms and is subsequently validated through rule-based rewards to avoid circular reasoning and potential reward hacking or self-confirmation bias. The IBPO framework only assumes that $\mathbf{s}(\tau_{i})$ is sensitive to counterfactual differences in $\{\tau_{i}^{(k)}\}$.

The output of $\mathcal{M}$ is not limited to sequence-level scalar signals; $\mathcal{M}$ can also produce token-level signals, such as the proportion of unmodified tokens for reward shaping (IBPO-ratio), or a token-level mask for blocking gradients on unmodified tokens (IBPO-mask). Detailed definitions of these variants are provided in Appendix [C.2](https://arxiv.org/html/2605.16302#A3.SS2).

##### Per\-Trajectory Shaping Function\.

Given the comparison output $\mathbf{s}(\tau_{i})=\mathcal{M}(\{\tau_{i}^{(k)}\}_{k=1}^{K})$, we define the per-trajectory shaping function:

$$\phi_{i}=\begin{cases}0&\text{if }\tau_{i}\text{ is correct},\\ \mathbf{s}(\tau_{i})\in[0,1]&\text{otherwise}.\end{cases} \tag{6}$$

$\phi$ is validated through rule-based rewards to avoid circular reasoning and potential reward hacking or self-confirmation bias. This function maps the comparison signal to a scalar shaping term for trajectory $\tau_{i}$. Importantly, $\phi_{i}$ depends on $\tau_{i}$ only through the relationship between $\tau_{i}$ and other counterfactual trajectories, requiring no explicit step-level annotation or value estimation.

##### Token\-Level Masking\.

When the operator $\mathcal{M}$ can localize which tokens in the trajectory should receive gradient updates and which should not, the comparison signal can be refined from the sequence level to a token-level mask. Let $\mathbf{m}_{i}=(m_{1},\ldots,m_{T})\in\{0,1\}^{T}$ be the token-level mask produced by $\mathcal{M}$, where $m_{t}=1$ indicates that the $t$-th token should receive gradient updates and $m_{t}=0$ indicates that the token should not be updated. Based on this, the policy gradient can selectively update only the designated tokens:

$$\nabla_{\theta}\mathcal{J}_{i}^{\mathrm{mask}}=\sum_{t=1}^{T}m_{t}\cdot\widehat{A^{\prime}}_{i}\cdot\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid y_{<t},x). \tag{7}$$

This masking mechanism concentrates gradient updates on potentially erroneous tokens, avoiding unnecessary penalization of correct reasoning steps, thereby achieving more fine-grained token-level credit assignment. The specific construction of the mask (e.g., based on edit distance) is detailed in Appendix [C.2](https://arxiv.org/html/2605.16302#A3.SS2).
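As an illustration of Equation (7), the sketch below builds a mask by aligning an original response against a corrected version and then forms the masked surrogate. It assumes a corrected response is available for each incorrect one, and it uses the alignment from Python's standard `difflib.SequenceMatcher` as a stand-in for the edit-distance construction of Appendix C.2, which is not reproduced here. Both function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def modified_token_mask(original: Sequence[str], corrected: Sequence[str]) -> List[int]:
    """m_t = 1 for tokens of `original` that the correction changed or dropped, else 0."""
    mask = [1] * len(original)
    matcher = SequenceMatcher(a=list(original), b=list(corrected), autojunk=False)
    for block in matcher.get_matching_blocks():
        # Tokens inside a matching block were left untouched by the correction.
        for t in range(block.a, block.a + block.size):
            mask[t] = 0
    return mask


def masked_surrogate(token_logprobs: Sequence[float], mask: Sequence[int],
                     advantage: float) -> float:
    """Scalar whose gradient is minus the right-hand side of Eq. (7)."""
    return -advantage * sum(m * lp for m, lp in zip(mask, token_logprobs))
```

For the token sequence `x = 2 + 2 = 5` corrected to `x = 2 + 2 = 4`, only the final token is marked for update, so the six tokens that were already right contribute nothing to the gradient.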

### 3.3 Implicit Process-Level Advantage Estimation

To inject comparison signals into the optimization process, we provide two complementary paths to replace the coarse-grained feedback determined solely by $R(\tau)$.

##### Path One: Sequence\-Level Reward Shaping\.

For candidate trajectory $\tau_{i}^{(k)}$, we define its shaped reward as:

$$R^{\prime}_{i}(x)=R(\tau_{i})+\lambda\,\phi_{i} \tag{8}$$

where $0\leq\lambda\,\phi_{i}<1$, with $\lambda\,\phi_{i}=0$ when $R(\tau_{i})=1$ and $\lambda\,\phi_{i}$ being a positive value less than 1 when $R(\tau_{i})=-1$. Although $R^{\prime}_{i}$ remains a sequence-level scalar, its value is conditioned on counterfactual comparisons across multiple trajectories, thereby statistically encoding process-level credit information.
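As a quick sanity check on Equation (8), assuming binary terminal rewards $R(\tau_{i})\in\{-1,1\}$ as in the sentence above, the shaped reward takes the following values:

$$
R^{\prime}_{i}(x)=\begin{cases}
1 & \text{if } R(\tau_{i})=1 \quad (\lambda\,\phi_{i}=0),\\
-1+\lambda\,\phi_{i}\in(-1,0) & \text{if } R(\tau_{i})=-1 \quad (0<\lambda\,\phi_{i}<1).
\end{cases}
$$

Under these constraints every incorrect trajectory still receives a strictly lower shaped reward than any correct one, while incorrect trajectories are ordered among themselves by $\phi_{i}$.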

We center $\lambda\,\phi_{i}$ through within-group advantage normalization:

$$\widehat{A^{\prime}}_{i}=\frac{R^{\prime}_{i}(x)-\text{mean}\left(\{R^{\prime}_{i}(x)\}_{i=1}^{G}\right)}{\text{std}\left(\{R^{\prime}_{i}(x)\}_{i=1}^{G}\right)}. \tag{9}$$
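Equations (6), (8), and (9) can be combined in a few lines of Python. The sketch below is illustrative only: it assumes binary rewards in which a positive reward means "correct", uses the population standard deviation, and adds a small `eps` to the denominator for numerical safety. None of these choices is stated in the text, and `lam=0.5` is a placeholder value, not a reported hyperparameter.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def shaped_rewards(rewards: Sequence[float], signals: Sequence[float],
                   lam: float = 0.5) -> List[float]:
    """Eq. (8): R'_i = R(tau_i) + lam * phi_i, with phi_i from Eq. (6)."""
    shaped = []
    for r, s in zip(rewards, signals):
        phi = 0.0 if r > 0 else s  # Eq. (6): correct trajectories get phi = 0
        if not 0.0 <= lam * phi < 1.0:
            raise ValueError("lam * phi must lie in [0, 1)")
        shaped.append(r + lam * phi)
    return shaped


def group_advantages(shaped: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Eq. (9): within-group standardization (eps is an added safeguard)."""
    mu = mean(shaped)
    sigma = pstdev(shaped)
    return [(r - mu) / (sigma + eps) for r in shaped]
```

With rewards `[1, -1, -1, -1]` and comparison signals `[0.9, 0.8, 0.2, 0.0]`, the three failed trajectories receive three different advantages, whereas normalizing the unshaped terminal rewards alone would leave them tied.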

##### Path Two: Token\-Level Gradient Masking\.

When $\mathcal{M}$ produces a token-level mask $\mathbf{m}_{i}$ (see previous section), the policy gradient can perform selective updates at the token granularity (Equation ([7](https://arxiv.org/html/2605.16302#S3.E7))). This path combines the sequence-level advantage $\widehat{A^{\prime}}_{i}$ with token-level masking: the advantage is still determined at the sequence level by counterfactual comparison, but gradients flow only through tokens marked for update. Neither path requires explicit step-level annotation or a value model.

### 3.4 Mechanism: Positive Backward Transfer in Multi-Task Learning

Counterfactual trajectory comparison does not directly provide the model with explicit error labels. Instead, by contrasting differences among multiple reasoning paths, potential errors become more salient during the comparison process. When this comparison behavior is jointly optimized with the base reasoning task during training, from a multi-task learning perspective, the auxiliary comparison task induces *positive backward transfer* to the original reasoning task: the model learns to suppress local errors more quickly, thereby accelerating convergence on the base task. This mechanism reduces the number of ineffective updates required to correct local errors and mitigates the *learning tax* commonly observed in long-horizon reinforcement learning.

##### Testable Predictions\.

This mechanism predicts significant improvements on difficult reasoning tasks, particularly in scenarios where correct trajectories are extremely scarce and training signals are dominated by sparse terminal rewards. Specifically, we expect to observe faster convergence and more stable training dynamics on challenging benchmarks. These effects are validated in Figure [2](https://arxiv.org/html/2605.16302#A0.F2): GSPO exhibits significantly greater fluctuation on the more difficult LiveCodeBench tasks, while IBPO demonstrates noticeably smoother training curves.

###### Proposition 3.1 (Effectiveness of IBPO’s Process-Level Advantage Estimation).

For any multi-step reasoning task, assume the policy $\pi_{\theta}$ generates $G$ trajectories $\{\tau_{i}\}_{i=1}^{G}$ given input $x$, where each trajectory $\tau_{i}$ corresponds to a sequence-level reward $R(\tau_{i})$. For each incorrect trajectory $\tau_{i}$, comparison signals $\mathbf{s}(\tau_{i})$ are generated through counterfactual comparison with $K-1$ correct trajectories, and injected into the optimization process through two paths: (i) mapping to shaped rewards $R^{\prime}_{i}(x)$; (ii) when $\mathcal{M}$ produces token-level masks $\mathbf{m}_{i}$, selectively updating only the marked tokens.

Under the assumption that the shaping function $\phi_{i}$ provides informative stepwise signals for suboptimal trajectories, the variance of the policy gradient estimator for any trajectory $\tau_{i}$ is significantly reduced compared to the baseline based on episode-level rewards. Specifically, in Path One, $\phi_{i}$ suppresses gradient noise through reward shaping; in Path Two, token-level masking further concentrates gradient updates on potentially erroneous tokens, avoiding unnecessary penalization of correct reasoning steps. The two paths synergistically enhance training stability and accelerate convergence.

###### Proof\.

By introducing the multi-trajectory comparison operator $\mathcal{M}$, we obtain contrastive signals $\mathbf{s}_{i}(x)$ from $G$ trajectories, which are sensitive to counterfactual differences between trajectories. These differences reflect the impact of different decisions during reasoning and are injected into optimization through two paths: (i) $\phi_{i}$ is mapped to shaped rewards $R^{\prime}_{i}(x)$, achieving finer-grained sequence-level credit assignment; (ii) token-level masks $\mathbf{m}_{i}$ restrict gradient updates to modified tokens (Equation ([7](https://arxiv.org/html/2605.16302#S3.E7))), achieving selective credit assignment at the token granularity. Both paths effectively reduce gradient variance caused by early decision errors, thereby avoiding typical training instability.

Specifically, $\phi_{i}$ provides process-level feedback for incorrect trajectories rather than relying solely on terminal rewards; $\mathbf{m}_{i}$ further filters out gradient contributions from correct tokens, making the update signal more precise. Detailed mathematical proofs are provided in Appendix [A](https://arxiv.org/html/2605.16302#A1). ∎

## 4 Experiments

##### Instantiation of IBPO\.

As discussed above, IBPO provides a framework for reward and advantage construction based on multi\-trajectory counterfactual comparison, rather than introducing a new sequence\-level policy optimizer\. Therefore, in concrete experiments, IBPO must be instantiated on top of existing sequence\-level reinforcement learning methods\.

In this work, we instantiate IBPO on top of GSPO. IBPO focuses on constructing process-sensitive advantage estimates through counterfactual multi-trajectory comparison under terminal reward conditions, while GSPO purely serves as a sequence-level optimizer to carry and apply these advantage signals. The detailed framework of IBPO+GSPO is provided in Appendix [E](https://arxiv.org/html/2605.16302#A5). We also verified similar trends on GRPO; results are omitted for brevity. The specific instantiation of the comparison operator $\mathcal{M}$ is detailed in Appendix [C](https://arxiv.org/html/2605.16302#A3).

Tasks and Datasets. We evaluate the proposed method on a set of mathematical and code reasoning benchmarks. The selected tasks are designed to assess the model’s capabilities in symbolic manipulation, multi-step reasoning, domain-specific mathematical understanding, and code reasoning. The benchmarks are HMMT25 (Balunović et al., [2025](https://arxiv.org/html/2605.16302#bib.bib1)), AIME25 (Mathematical Association of America, [2025](https://arxiv.org/html/2605.16302#bib.bib9)), and LiveCodeBench v6 (25.02-25.05) (Jain et al., [2024](https://arxiv.org/html/2605.16302#bib.bib3)).

Base Models. Qwen3-32B (Team, [2025](https://arxiv.org/html/2605.16302#bib.bib12)) and Qwen3-Next-80B-A3B-Thinking (Yang et al., [2025](https://arxiv.org/html/2605.16302#bib.bib16)).

We configure Qwen3-32B with 32k tokens and Qwen3-Next-80B-A3B-Thinking with 256k tokens. Inference is performed using the vLLM engine (version 0.11.2).

Baselines and Comparison Setup. We compare IBPO instantiated on GSPO (Zheng et al., [2025](https://arxiv.org/html/2605.16302#bib.bib19)) (denoted as IBPO+GSPO) against the following baselines: (1) vanilla GSPO; and (2) GSPO with prompt-based error correction, which introduces additional correction prompts at inference time after GSPO training to generate revised outputs.

Experiments are conducted on 32 Nvidia A800 (80G) GPUs. Training hyperparameters are as follows: initial learning rate $5\times 10^{-7}$; cosine annealing learning rate scheduler with minimum learning rate ratio of 0.1; linear warmup phase comprising 3% of total training steps; entropy regularization coefficient $\beta=0$; GSPO samples 64 rollouts per input, IBPO+GSPO samples 8 rollouts per input; mini-batch size of 32.
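The learning-rate settings above can be read as the schedule sketched below. The exact schedule implementation is not given in the text, so the shape of the warmup (linear from near zero), the rounding of the warmup length, and the function name `learning_rate` are assumptions; only the peak rate, warmup fraction, and minimum ratio come from the hyperparameter list.

```python
import math


def learning_rate(step: int, total_steps: int, base_lr: float = 5e-7,
                  warmup_frac: float = 0.03, min_ratio: float = 0.1) -> float:
    """Linear warmup followed by cosine annealing down to min_ratio * base_lr."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return base_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```

Under these assumptions a 1000-step run warms up over the first 30 steps, peaks at $5\times 10^{-7}$, and decays to $5\times 10^{-8}$ by the final step.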

##### Compute\-Matched Protocol\.

See Appendix [B](https://arxiv.org/html/2605.16302#A2) for details.

##### Complete Rewrite Filtering\.

To prevent the model from completely rewriting a response rather than locally fixing it during correction (which would lead to reward hacking), we detect complete rewrites via edit distance and set their shaped reward to zero. This mechanism serves as a defensive safeguard rather than a core component; see Appendix [C.3](https://arxiv.org/html/2605.16302#A3.SS3) for details.
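A minimal sketch of such a filter is shown below, using a standard Levenshtein distance over token sequences. The normalization by the longer sequence and the threshold `max_normalized_distance=0.8` are hypothetical, since the actual criterion is given in Appendix C.3 and not reproduced here. The sketch zeroes the shaping signal, which is one reading of "set their shaped reward to zero".

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def filter_complete_rewrite(shaping: float, original: Sequence, corrected: Sequence,
                            max_normalized_distance: float = 0.8) -> float:
    """Return 0.0 if the correction rewrote (almost) everything, else `shaping`."""
    denom = max(len(original), len(corrected), 1)
    if levenshtein(original, corrected) / denom > max_normalized_distance:
        return 0.0  # treated as a complete rewrite: no shaping credit
    return shaping
```

A correction that changes one token out of four keeps its shaping signal, while one that replaces all four tokens is treated as a complete rewrite and receives none.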

Although the experiments in this paper primarily focus on mathematical and code reasoning tasks, the framework design and applicability of IBPO are not limited to specific task domains\. Its core mechanism is not multi\-draft generation or explicit error correction per se, but rather constructing multiple counterfactual reasoning trajectories under the same input and leveraging the differences between these trajectories in terminal outcomes and intermediate decisions to induce learning signals that are more sensitive to the reasoning process\. From this perspective, IBPO is essentially a training paradigm based on counterfactual trajectory comparison, whose applicability depends solely on whether some objective or verifiable feedback signal exists during training, rather than relying on specific task formats or output structures\.

In mathematical and programming tasks, the correctness of the final solution has a clear and automatically verifiable definition, making these tasks convenient and reliable testbeds for studying the role of counterfactual trajectory comparison in multi\-step reasoning reinforcement learning\. However, for other types of tasks—such as factual question answering, structured knowledge reasoning, or multi\-step decision problems with well\-defined termination conditions—appropriate verifiable reward functions or evaluation criteria can similarly be designed to distinguish the terminal quality of different counterfactual trajectories\. By comparing multiple counterfactual trajectories sampled under the same input during training and injecting the resulting differences into reward shaping or advantage estimation, the model can statistically learn which intermediate decisions are more likely to lead to success and which are more likely to lead to failure\.

From a structural perspective, the key advantage of IBPO lies in its modeling of counterfactual trajectory independence and the resulting implicit process\-level credit assignment mechanism\. This mechanism does not rely on explicit step\-level annotations or additional value models; rather, under the condition of only terminal rewards, it introduces more discriminative learning signals for the reasoning process through multi\-trajectory counterfactual comparison\. Therefore, although we have not yet provided empirical results on non\-mathematical or non\-code tasks, the counterfactual trajectory comparison principle and process\-level advantage construction underlying IBPO are in principle applicable to any reasoning or decision\-making task with verifiable terminal outcomes\.

## 5 Results and Analysis

Table 1: We present experimental results using Qwen3-32B and Qwen3-Next-80B-A3B-Thinking. For each test set, we evaluate 64 times and report the average accuracy. We report the mean and 95% bootstrap confidence interval (mean ± 95% CI) over 5 random seeds; improvements over baseline methods are statistically significant under paired bootstrap tests ($p < 0.01$). Total training FLOPs are matched across all methods, including generation and comparison overhead. The Best-of-N method uses N = 8.

Table [1](https://arxiv.org/html/2605.16302#S5.T1) reports performance comparisons across multiple reasoning benchmarks. For the Qwen3-32B model, inference parameters are set to temperature = 0.6, TopP = 0.95, TopK = 20, MinP = 0; for the Qwen3-Next-80B-A3B-Instruct model, inference parameters are set to temperature = 0.7, TopP = 0.8, TopK = 20, MinP = 0. Compared methods include IBPO, GSPO, and GSPO augmented with prompt-based error correction.

In addition to the base IBPO (sequence-level reward shaping), we also evaluate two variants based on token-level edit distance for counterfactual analysis. IBPO(ratio) uses the proportion of unmodified tokens between the original response and the corrected output as a shaping coefficient: a higher unmodified proportion indicates the original reasoning is closer to correct, resulting in a larger shaped reward. IBPO(mask) uses edit distance to locate modified tokens, masks the gradient contribution of unmodified tokens, and applies policy gradient updates only to modified tokens. Detailed definitions of both variants are provided in Appendix [C.2](https://arxiv.org/html/2605.16302#A3.SS2).
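The shaping coefficient used by IBPO(ratio) can be approximated as follows. This sketch counts tokens of the original response that survive in the corrected output, using the alignment from Python's standard `difflib.SequenceMatcher` rather than a true edit-distance alignment. The precise definition is in Appendix C.2, which is not reproduced here, so the normalization by the original length and the name `unmodified_ratio` are assumptions.

```python
from difflib import SequenceMatcher
from typing import Sequence


def unmodified_ratio(original: Sequence[str], corrected: Sequence[str]) -> float:
    """Fraction of tokens in `original` that are kept unchanged by the correction."""
    if not original:
        return 0.0
    matcher = SequenceMatcher(a=list(original), b=list(corrected), autojunk=False)
    # Matching blocks are the aligned runs of identical tokens; the final
    # sentinel block has size 0 and does not affect the sum.
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(original)
```

A response whose correction touches only its last token out of seven scores 6/7, while a response that had to be rewritten entirely scores 0 and so earns no shaping credit under this reading.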

### 5\.1 Comparison with GSPO

To maintain approximately matched computational cost, GSPO samples 64 responses per input prompt, while IBPO samples 8 responses per prompt\.

Figure [2](https://arxiv.org/html/2605.16302#A0.F2) plots the training reward and evaluation performance curves as compute increases\.

We also report the compute required to reach a fixed reward threshold \(e\.g\., 0\.75\); GSPO\+IBPO consistently requires fewer FLOPs\.
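
The compute-to-threshold metric can be sketched as follows: scan a reward curve for the first point at or above the threshold, then normalize by the baseline's value. The helper `compute_to_threshold` and the curves below are hypothetical placeholders, not measurements from the paper.

```python
def compute_to_threshold(compute, reward, threshold=0.75):
    """Return the first compute value at which reward >= threshold, else None."""
    for c, r in zip(compute, reward):
        if r >= threshold:
            return c
    return None


# Hypothetical training curves on a shared compute axis (arbitrary units).
compute = [1, 2, 3, 4, 5]
gspo_reward = [0.40, 0.55, 0.65, 0.72, 0.78]
ibpo_reward = [0.50, 0.68, 0.77, 0.82, 0.85]

gspo_c = compute_to_threshold(compute, gspo_reward)
ibpo_c = compute_to_threshold(compute, ibpo_reward)
# Normalizing by the baseline gives a relative cost; values below 1 mean less compute.
relative = ibpo_c / gspo_c
print(gspo_c, ibpo_c, relative)
```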

Table 2: Based on Qwen3\-Next\-80B\-A3B\-Thinking, we measure the compute required to reach a fixed training reward threshold under the compute\-matched setting\. Results are normalized relative to GSPO\.

#### 5\.1\.1 Training Efficiency and Stability under Compute\-Matched Conditions

Figure [2](https://arxiv.org/html/2605.16302#A0.F2) dynamically compares GSPO and GSPO\+IBPO under compute\-matched conditions along two dimensions: training reward and external evaluation performance \(AIME25 and LiveCodeBench\)\. In this experiment, we align the training process by matching the overall compute consumption of different methods, ensuring that at any point on the horizontal axis, the total computational resources consumed by both methods \(including model forward passes, backward passes, and generation overhead\) are statistically equivalent\. Therefore, the horizontal axis is labeled *Training Compute*, which can be understood as a direct measure of the actual training compute budget\.

##### Higher Compute Efficiency under the Same Budget\.

Under a strictly matched compute budget, GSPO\+IBPO consistently achieves higher training rewards throughout the training process and enters the high\-reward regime earlier under the same compute constraint\. In comparison, GSPO shows notably slower reward improvement under equivalent compute\. These results indicate that IBPO can convert terminal rewards into more effective parameter updates *per unit of compute*, thereby substantially improving training compute efficiency\. In other words, under the same compute investment, GSPO\+IBPO produces more *effective learning*, significantly enhancing overall training efficiency\.

##### More Stable Optimization Dynamics\.

Beyond improvements in average performance, under compute\-matched conditions, the training reward curve of GSPO\+IBPO exhibits a notably smoother evolution, with significantly smaller fluctuations compared to GSPO\. When using only sequence\-level terminal rewards, local reasoning errors tend to be propagated backward uniformly through the shared terminal feedback across the entire generated sequence, causing gradient estimates to be dominated by noise from irrelevant time steps, thereby exacerbating training instability\. By introducing shaped signals based on counterfactual trajectory comparison, IBPO enables the model to more rapidly identify error\-prone regions and suppress the impact of ineffective updates on the optimization process\. Gradient variance measurements in the appendix further confirm this: IBPO reduces policy gradient variance by approximately 30% on average, which directly corresponds to the observed improvement in training stability, thereby substantially alleviating the *learning tax* in reinforcement learning\.

##### Evolution of Correction Success Rate\.

The informativeness of the shaped signal depends on whether the correction success rate falls within a meaningful intermediate range—if correction almost always succeeds or always fails, the signal degenerates to a constant\. We tracked the evolution of correction success rate during training of Qwen3\-32B \(AIME25\): at the beginning of training, the correction success rate is approximately 12%, gradually rising to approximately 67% as training progresses\. This indicates that the shaped signal remains in an information\-rich regime throughout the training process, being neither constantly zero nor constantly one, thereby continuously providing discriminative process\-level feedback for policy optimization\.
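
One way to make "degenerates to a constant" concrete is to treat correction success as a 0/1 variable with success probability $p$, whose variance is $p(1-p)$. This Bernoulli model is an illustrative assumption rather than an analysis from the paper. The sketch below evaluates that variance at the two success rates quoted above and at a few reference points.

```python
def bernoulli_variance(p):
    """Variance of a 0/1 signal that equals 1 with probability p."""
    return p * (1.0 - p)


# 0.12 and 0.67 are the start and end correction success rates quoted above;
# 0.0, 0.5, and 1.0 are reference points added for comparison.
rates = [0.0, 0.12, 0.5, 0.67, 1.0]
variances = {p: bernoulli_variance(p) for p in rates}
for p, v in variances.items():
    print(f"success rate {p:.2f} -> signal variance {v:.4f}")
```

The variance vanishes only at the two extremes, which matches the claim that a success rate strictly between 0 and 1 keeps the signal discriminative.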

#### 5\.1\.2 Positive Backward Transfer

As shown in Figure [2](https://arxiv.org/html/2605.16302#A0.F2), we observe that introducing IBPO leads to faster performance improvement and faster convergence\. We attribute this to a *positive backward transfer* effect\. In multi\-task learning, positive backward transfer refers to the phenomenon where learning a subsequent task \(Task B\) improves performance on a previous task \(Task A\), reflecting strong generalization capability\. By introducing an auxiliary task based on counterfactual reasoning trajectory comparison, IBPO induces a significant positive transfer effect on the main reasoning task during training\.

Specifically, under GSPO training, the model receives only the sequence\-level reward $R(y)$, which propagates uniformly across the entire reasoning sequence\. When the final failure is caused by only a few local tokens, this supervisory signal cannot indicate where the error occurred, forcing the model to rely on extensive sampling and iterative updates to statistically internalize these local errors over time\. This substantially increases the sample complexity of the learning process, giving rise to the well\-known *learning tax* in reinforcement learning\.

IBPO introduces a novel auxiliary task based on counterfactual reasoning trajectory comparison\. By contrasting inconsistencies across different reasoning paths, potential local errors become structurally more salient, thereby guiding the learning process\. This mechanism does not provide explicit token\-level annotations; rather, it enhances the *observability* of errors through counterfactual contrast, accelerating the model’s internalization of fine\-grained reasoning errors\. From the experimental results, this positive transfer significantly improves learning efficiency, reduces the learning tax, and enhances convergence stability in long\-horizon reasoning tasks\.

#### 5\.1\.3 Faster Convergence on Difficult Tasks

In high\-difficulty reasoning tasks, correct responses are often extremely rare, with the vast majority of model\-generated trajectories being incorrect\. This distribution leads to highly sparse sequence\-level rewards, causing policy gradient estimates to be dominated by negative samples, which slows convergence and may even lead to training instability\. To address this issue, some GRPO variants use sample truncation to artificially balance the ratio of correct to incorrect responses\. However, this count\-based truncation strategy alters the original sampling distribution, thereby violating the consistency of importance sampling and potentially introducing additional bias\.

In contrast, our method introduces an auxiliary learning mechanism based on counterfactual reasoning trajectory comparison\. Without altering the original sampling distribution, IBPO explicitly amplifies rare positive signals, enabling more stable and efficient policy updates\. Intuitively, IBPO converts a small number of correct reasoning paths into multiple information\-rich learning signals through counterfactual trajectory comparison, substantially alleviating the learning bottleneck caused by the scarcity of positive samples in difficult tasks\.

### 5\.2 Comparison with GSPO with Prompt

The core difference between GSPO\+Prompt and IBPO\+GSPO lies in whether joint training is performed\. The former conducts multi\-trajectory comparison and correction only at inference time through prompts after training is complete, while the latter performs multi\-trajectory comparison and jointly optimizes the model during training\. Experimental results validate the effectiveness of implicit process\-level rewards\.

Overall, these results indicate that IBPO not only improves overall accuracy but also substantially enhances the model’s robustness across problems of varying difficulty\. The consistent performance improvements observed across multiple datasets further support our hypothesis\.

### 5\.3 Ablation Study

Table 3: Ablation results of Qwen3\-32B on AIME25, LiveCodeBench, and HMMT25\. Each variant removes one key component from the full IBPO algorithm\.

To evaluate the contribution of each component in IBPO, we conduct ablation experiments based on the Qwen3\-32B model on the AIME25, LiveCodeBench, and HMMT25 datasets\. Table [3](https://arxiv.org/html/2605.16302#S5.T3) summarizes the corresponding results\.

##### GSPO \+ Test\-Time Prompt\.

Multi\-trajectory comparison is performed only at inference time through prompts\. Due to the absence of joint training, the gains introduced by IBPO vanish, leading to significant performance degradation\. This result validates the *positive transfer* effect induced by IBPO and the effectiveness of implicit process\-level rewards\.

##### IBPO \($k=1$\)\.

Only a single reasoning trajectory is used\. Without multi\-trajectory counterfactual comparison, accuracy drops substantially\. This indicates that inconsistencies across multiple counterfactual trajectories play a critical role in error identification\. This setting is similar to GSPO \+ SCoRe, which adopts a two\-stage reinforcement learning approach and introduces tree\-structured rewards, thereby increasing the learning burden\.

##### IBPO \(Shaping Only\)\.

In this ablation, we disable joint training with multi\-trajectory comparison but retain the reward shaping term\. Comparing this variant with the full method isolates the contribution of joint training, that is, the positive transfer effect\.

Table 4: Evaluation results under different $\lambda$ values \(IBPO score; Qwen3\-32B; AIME25\)\.

We conduct a sensitivity analysis on $\lambda$ and observe optimal performance around 0\.6\.

Overall, the ablation results clearly demonstrate that each component of IBPO makes a meaningful contribution to overall performance\.

## 6 Conclusion

We propose Implicit Behavior Policy Optimization \(IBPO\), a reinforcement learning paradigm that extracts implicit process\-level learning signals from sparse terminal rewards through counterfactual reasoning trajectory comparison\. By sampling multiple trajectories per input and comparing their outcomes, IBPO achieves stable credit assignment without requiring step\-level supervision or auxiliary value models\.

Experiments demonstrate that when combined with sequence\-level optimizers such as GSPO, IBPO consistently improves performance and training stability on mathematical and code reasoning benchmarks under compute\-matched settings\. Its formulation is agnostic to the underlying RL algorithm and directly compatible with GRPO variants and other policy gradient methods—providing a scalable path toward more robust multi\-step reasoning in large language models\.

## Limitations

IBPO incurs additional computational overhead due to sampling multiple counterfactual trajectories per input\. Although our results demonstrate higher sample efficiency under a fixed budget, reducing this overhead remains important for large\-scale deployment\. Furthermore, if counterfactual trajectories exhibit systematic errors, the comparison signal may weaken\. Enhancing trajectory diversity or incorporating external verifiers can mitigate this issue\.

## Ethics Statement

This work does not present any known ethical risks within its current scope\.

## Reproducibility Statement

We will release code, model checkpoints, and detailed experimental configurations through an anonymous public repository to support full reproducibility\.

## References

- Balunović et al\. \(2025\) Balunović, M\., Dekoninck, J\., Petrov, I\., Jovanović, N\., and Vechev, M\. Matharena: Evaluating llms on uncontaminated math competitions, February 2025\. URL [https://matharena\.ai/](https://matharena.ai/)\.
- Guo et al\. \(2025\) Guo, D\., Yang, D\., Zhang, H\., Song, J\., Zhang, R\., Xu, R\., Zhu, Q\., Ma, S\., Wang, P\., Bi, X\., et al\. Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning\. *arXiv preprint arXiv:2501\.12948*, 2025\.
- Jain et al\. \(2024\) Jain, N\., Han, K\., Gu, A\., Li, W\.\-D\., Yan, F\., Zhang, T\., Wang, S\., Solar\-Lezama, A\., Sen, K\., and Stoica, I\. Livecodebench: Holistic and contamination free evaluation of large language models for code\. *arXiv preprint*, 2024\.
- Kang et al\. \(2024\) Kang, J\., Li, X\. Z\., Chen, X\., Kazemi, A\., and Chen, B\. Mindstar: Enhancing math reasoning in pre\-trained llms at inference time\. *arXiv preprint arXiv:2405\.16265*, 2024\.
- Kumar et al\. \(2024\) Kumar, A\., Zhuang, V\., Agarwal, R\., Su, Y\., Co\-Reyes, J\. D\., Singh, A\., Baumli, K\., Iqbal, S\., Bishop, C\., Roelofs, R\., et al\. Training language models to self\-correct via reinforcement learning\. *arXiv preprint arXiv:2409\.12917*, 2024\.
- Li et al\. \(2024\) Li, M\., Chen, L\., Chen, J\., He, S\., Gu, J\., and Zhou, T\. Selective reflection\-tuning: Student\-selected data recycling for llm instruction\-tuning\. In *Findings of the Association for Computational Linguistics ACL 2024*, pp\. 16189–16211, 2024\.
- Lightman et al\. \(2023\) Lightman, H\., Kosaraju, V\., Burda, Y\., Edwards, H\., Baker, B\., Lee, T\., Leike, J\., Schulman, J\., Sutskever, I\., and Cobbe, K\. Let’s verify step by step\. *arXiv preprint arXiv:2305\.20050*, 2023\.
- Luo et al\. \(2024\) Luo, L\., Liu, Y\., Liu, R\., Phatale, S\., Lara, H\., Li, Y\., Shu, L\., Zhu, Y\., Meng, L\., Sun, J\., et al\. Improve mathematical reasoning in language models by automated process supervision\. *arXiv e\-prints*, pp\. arXiv–2406, 2024\.
- Mathematical Association of America \(2025\) Mathematical Association of America\. 2025 AIME I and AIME II Problems and Solutions, 2025\. URL [https://artofproblemsolving\.com/wiki/index\.php/2025\_AIME\_I\_Problems](https://artofproblemsolving.com/wiki/index.php/2025_AIME_I_Problems)\. Accessed: Jan 6, 2026\.
- Qi et al\. \(2024\) Qi, Z\., Ma, M\., Xu, J\., Zhang, L\. L\., Yang, F\., and Yang, M\. Mutual reasoning makes smaller llms stronger problem\-solvers\. *arXiv preprint arXiv:2408\.06195*, 2024\.
- Shao et al\. \(2024\) Shao, Z\., Wang, P\., Zhu, Q\., Xu, R\., Song, J\., Bi, X\., Zhang, H\., Zhang, M\., Li, Y\., Wu, Y\., et al\. Deepseekmath: Pushing the limits of mathematical reasoning in open language models\. *arXiv preprint arXiv:2402\.03300*, 2024\.
- Team \(2025\) Team, Q\. Qwen3 technical report, 2025\. URL [https://arxiv\.org/abs/2505\.09388](https://arxiv.org/abs/2505.09388)\.
- Wang et al\. \(2024a\) Wang, C\., Deng, Y\., Lv, Z\., Yan, S\., and Bo, A\. Q\*: Improving multi\-step reasoning for llms with deliberative planning, 2024a\.
- Wang et al\. \(2024b\) Wang, P\., Li, L\., Shao, Z\., Xu, R\. X\., Dai, D\., Li, Y\., Chen, D\., Wu, Y\., and Sui, Z\. Math\-shepherd: Verify and reinforce llms step\-by\-step without human annotations, 2024b\.
- Yang et al\. \(2024\) Yang, A\., Yang, B\., Zhang, B\., Hui, B\., Zheng, B\., Yu, B\., Li, C\., Liu, D\., Huang, F\., Wei, H\., Lin, H\., Yang, J\., Tu, J\., Zhang, J\., Yang, J\., Yang, J\., Zhou, J\., Lin, J\., Dang, K\., Lu, K\., Bao, K\., Yang, K\., Yu, L\., Li, M\., Xue, M\., Zhang, P\., Zhu, Q\., Men, R\., Lin, R\., Li, T\., Xia, T\., Ren, X\., Ren, X\., Fan, Y\., Su, Y\., Zhang, Y\., Wan, Y\., Liu, Y\., Cui, Z\., Zhang, Z\., and Qiu, Z\. Qwen2\.5 technical report\. *arXiv preprint arXiv:2412\.15115*, 2024\.
- Yang et al\. \(2025\) Yang, A\., Li, A\., Yang, B\., Zhang, B\., Hui, B\., Zheng, B\., Yu, B\., Gao, C\., Huang, C\., Lv, C\., et al\. Qwen3 technical report\. *arXiv preprint arXiv:2505\.09388*, 2025\.
- Yu et al\. \(2023\) Yu, F\., Gao, A\., and Wang, B\. Outcome\-supervised verifiers for planning in mathematical reasoning\. *arXiv preprint arXiv:2311\.09724*, 2023\.
- Yu et al\. \(2025\) Yu, Q\., Zhang, Z\., Zhu, R\., Yuan, Y\., Zuo, X\., Yue, Y\., Fan, T\., Liu, G\., Liu, L\., Liu, X\., et al\. Dapo: An open\-source llm reinforcement learning system at scale\. *arXiv preprint arXiv:2503\.14476*, 2025\.
- Zheng et al\. \(2025\) Zheng, C\., Liu, S\., Li, M\., Chen, X\.\-H\., Yu, B\., Gao, C\., Dang, K\., Liu, Y\., Men, R\., Yang, A\., et al\. Group sequence policy optimization\. *arXiv preprint arXiv:2507\.18071*, 2025\.
- Zheng et al\. \(2023\) Zheng, L\., Chiang, W\.\-L\., Sheng, Y\., Zhuang, S\., Wu, Z\., Zhuang, Y\., Lin, Z\., Li, Z\., Li, D\., Xing, E\. P\., et al\. Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\. *Advances in Neural Information Processing Systems*, 36, 2023\.

![Refer to caption](https://arxiv.org/html/2605.16302v1/quxiaotu_compute.png)

Figure 2: Training curves based on fine\-tuning Qwen3\-Next\-80B\-A3B\-Thinking indicate that IBPO achieves significantly higher training efficiency compared to GSPO\.

## Appendix A Theoretical Analysis: Variance Reduction Properties of IBPO

To formally characterize the advantage of IBPO in credit assignment, we use the representative GSPO\-class method as a baseline and prove under reasonable assumptions that the implicit process\-level advantage estimator constructed by IBPO has lower variance than the pure terminal reward advantage used in GSPO, thereby leading to more stable policy gradient updates\.

We consider a set of trajectories $\{\tau_i\}_{i=1}^{G}$ independently sampled from the policy $\pi_\theta$ given a fixed input $x$\. Let:

- $Y_i = R(\tau_i) \in \{-1, 1\}$ denote the terminal reward of the $i$\-th trajectory \(assumed to be binary for analytical convenience\);
- $\phi_i \in [0, 1]$ be the counterfactual comparison signal introduced by IBPO, satisfying

  $$\phi_i = \begin{cases} 0, & \text{if } Y_i = 1\ (\text{correct trajectory}); \\ > 0, & \text{if } Y_i = -1\ (\text{incorrect trajectory}). \end{cases}$$

  Furthermore, we assume that $\phi_i$ effectively reflects the “recoverability” or “consistency with correct reasoning” of a trajectory: the closer an incorrect trajectory is to the correct reasoning process, the larger $\phi_i$ is\.

Based on this, GSPO and IBPO define the following within\-group advantage estimators \(normalization constants are omitted as they only introduce positive proportionality factors irrelevant to variance comparison\):

$$A^{\mathrm{GSPO}}_i = Y_i - \bar{Y}, \quad \text{where } \bar{Y} = \frac{1}{G}\sum_{j=1}^{G} Y_j, \tag{10}$$

$$A^{\mathrm{IBPO}}_i = (Y_i + \lambda\phi_i) - (\bar{Y} + \lambda\bar{\phi}) = A^{\mathrm{GSPO}}_i + \lambda(\phi_i - \bar{\phi}), \tag{11}$$

where $\lambda > 0$ is the shaping weight and $\bar{\phi} = \frac{1}{G}\sum_{j=1}^{G}\phi_j$\.
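
The two within-group estimators in Equations (10) and (11) can be sketched in a few lines. The function name `group_advantages` and the example group are illustrative assumptions; as in the text, normalization constants are omitted.

```python
def group_advantages(rewards, phis, lam):
    """Return (A_GSPO, A_IBPO) for one group, following Eqs. (10) and (11)."""
    G = len(rewards)
    y_bar = sum(rewards) / G
    phi_bar = sum(phis) / G
    a_gspo = [y - y_bar for y in rewards]
    a_ibpo = [a + lam * (p - phi_bar) for a, p in zip(a_gspo, phis)]
    return a_gspo, a_ibpo


# Hypothetical group of G = 4 trajectories: one correct, three incorrect.
# Two of the incorrect ones carry a positive comparison signal phi.
Y = [1, -1, -1, -1]
phi = [0.0, 0.5, 0.0, 0.5]
a_gspo, a_ibpo = group_advantages(Y, phi, lam=0.6)
print(a_gspo)
print(a_ibpo)
```

In this example all incorrect trajectories share the same GSPO advantage, while the IBPO advantage separates the incorrect trajectories with positive $\phi_i$ from the one with $\phi_i = 0$.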

We make the following key assumption:

###### Assumption A\.1 \(Negative Correlation\)\.

The terminal reward $Y_i$ and the comparison signal $\phi_i$ satisfy $\mathrm{Cov}(Y_i, \phi_i) < 0$\. This holds because correct trajectories \($Y_i = 1$\) enforce $\phi_i = 0$, while incorrect trajectories \($Y_i = -1$\) correspond to $\phi_i > 0$, and a larger $\phi_i$ indicates closer proximity to correct reasoning\.

###### Theorem A\.2 \(Variance Reduction of IBPO Relative to GSPO\)\.

Under Assumption [A\.1](https://arxiv.org/html/2605.16302#A1.Thmtheorem1) and group size $G \geq 2$, there exists $\lambda_{\max} > 0$ such that for any $\lambda \in (0, \lambda_{\max})$:

$$\mathrm{Var}\left(A^{\mathrm{IBPO}}_i\right) < \mathrm{Var}\left(A^{\mathrm{GSPO}}_i\right).$$

Furthermore, if the policy gradient direction vector $\nabla_\theta \log \pi_\theta(\tau_i \mid x)$ is weakly correlated with the advantage estimator \(or its norm varies slowly\), then the variance of the IBPO policy gradient estimator is strictly less than that of GSPO:

$$\mathrm{Var}\left[A^{\mathrm{IBPO}}_i \cdot \nabla_\theta \log \pi_\theta(\tau_i \mid x)\right] < \mathrm{Var}\left[A^{\mathrm{GSPO}}_i \cdot \nabla_\theta \log \pi_\theta(\tau_i \mid x)\right].$$

###### Proof\.

From $A^{\mathrm{IBPO}}_i = A^{\mathrm{GSPO}}_i + \lambda(\phi_i - \bar{\phi})$, we expand its variance:

$$
\begin{aligned}
\mathrm{Var}(A^{\mathrm{IBPO}}_i) &= \mathrm{Var}\big(A^{\mathrm{GSPO}}_i + \lambda(\phi_i - \bar{\phi})\big) \\
&= \mathrm{Var}(A^{\mathrm{GSPO}}_i) + \lambda^{2}\,\mathrm{Var}(\phi_i - \bar{\phi}) + 2\lambda\,\mathrm{Cov}\big(A^{\mathrm{GSPO}}_i,\, \phi_i - \bar{\phi}\big).
\end{aligned}
\tag{12}
$$

Note that $A^{\mathrm{GSPO}}_i = Y_i - \bar{Y}$\. For a fixed input $x$ and sufficiently large group size $G$, $\bar{Y}$ and $\bar{\phi}$ can be approximately treated as constants \(converging in probability to their population means\)\. Therefore:

$$\mathrm{Cov}\big(A^{\mathrm{GSPO}}_i,\, \phi_i - \bar{\phi}\big) \approx \mathrm{Cov}(Y_i, \phi_i) < 0,$$

where the inequality is guaranteed by Assumption [A\.1](https://arxiv.org/html/2605.16302#A1.Thmtheorem1)\.

Let $C = -\mathrm{Cov}(Y_i, \phi_i) > 0$ and $V_\phi = \mathrm{Var}(\phi_i - \bar{\phi}) \geq 0$\. Then:

$$\mathrm{Var}(A^{\mathrm{IBPO}}_i) \leq \mathrm{Var}(A^{\mathrm{GSPO}}_i) - 2\lambda C + \lambda^{2} V_\phi.$$

When $V_\phi > 0$, this quadratic is strictly less than $\mathrm{Var}(A^{\mathrm{GSPO}}_i)$ for $\lambda \in \left(0, \frac{2C}{V_\phi}\right)$; when $V_\phi = 0$, it holds for any $\lambda > 0$\. Setting $\lambda_{\max} = \frac{2C}{V_\phi + \epsilon}$ \($\epsilon > 0$ to avoid division by zero\) guarantees strict variance reduction\.

Regarding gradient variance, since $\nabla_\theta \log \pi_\theta(\tau_i \mid x)$ is primarily determined by the trajectory $\tau_i$, the shaping term $\lambda\phi_i$ in $A^{\mathrm{IBPO}}_i$ injects a low\-noise signal related to the trajectory’s process quality, making it more correlated with the gradient direction than the pure terminal reward\. Consequently, IBPO achieves significantly lower gradient variance in practice, especially in scenarios with longer trajectories or multiple reasoning errors\. ∎
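
As a short worked step that the proof leaves implicit, and taking the quadratic bound above at face value, the admissible range of $\lambda$ and the value with the largest reduction follow from elementary calculus. The symbol $\lambda^{*}$ is introduced here only for illustration and does not appear in the original text.

$$
\begin{aligned}
f(\lambda) &= -2\lambda C + \lambda^{2} V_\phi = \lambda\,(\lambda V_\phi - 2C), \\
f(\lambda) < 0 &\iff 0 < \lambda < \frac{2C}{V_\phi} \qquad (V_\phi > 0), \\
f'(\lambda) = 0 &\iff \lambda^{*} = \frac{C}{V_\phi}, \qquad f(\lambda^{*}) = -\frac{C^{2}}{V_\phi}.
\end{aligned}
$$

Under these assumptions the bound is tightest at the midpoint of the admissible interval, where the reduction equals $C^{2}/V_\phi$.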

##### Discussion\.

This theorem shows that, under Assumption [A\.1](https://arxiv.org/html/2605.16302#A1.Thmtheorem1), the shaping term $\lambda\phi_i$ introduced by IBPO’s counterfactual comparison can reduce the variance of advantage estimation\. It is worth noting that the negative correlation in Assumption [A\.1](https://arxiv.org/html/2605.16302#A1.Thmtheorem1) is directly implied by the construction of $\phi_i$ \(correct trajectories have $\phi_i = 0$, incorrect trajectories have $\phi_i > 0$\), so this assumption is essentially a corollary of the definition rather than an additional constraint\. The main value of the theorem lies in quantifying the range of $\lambda$ for which variance reduction holds, providing theoretical guidance for hyperparameter selection\. More importantly, $\phi_i$ encodes process\-level information, enabling the advantage estimate to reflect not only whether the answer is correct but also the degree to which the reasoning deviates from correctness\. This achieves finer\-grained credit assignment and empirically supports the optimization stability and sample efficiency demonstrated by IBPO in mathematical and code reasoning tasks\.
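
The remark that the negative correlation follows from the construction of $\phi_i$ can be checked directly. Write $p = \Pr(Y_i = 1)$ and $\mu = \mathbb{E}[\phi_i \mid Y_i = -1] > 0$; both symbols are notation introduced here for illustration. Since $\phi_i = 0$ whenever $Y_i = 1$:

$$
\begin{aligned}
\mathbb{E}[Y_i] &= 2p - 1, \qquad \mathbb{E}[\phi_i] = (1-p)\,\mu, \qquad \mathbb{E}[Y_i\phi_i] = -(1-p)\,\mu, \\
\mathrm{Cov}(Y_i, \phi_i) &= \mathbb{E}[Y_i\phi_i] - \mathbb{E}[Y_i]\,\mathbb{E}[\phi_i] = -(1-p)\,\mu - (2p-1)(1-p)\,\mu = -2p\,(1-p)\,\mu.
\end{aligned}
$$

The covariance is therefore strictly negative exactly when $0 < p < 1$, that is, when both correct and incorrect trajectories occur with positive probability. It vanishes when the policy is always right or always wrong.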

##### Empirical Verification\.

To verify the practical significance of the above theoretical analysis, we directly measured the policy gradient variance of GSPO and GSPO\+IBPO during Qwen3\-32B training on AIME25\. The results show that: \(i\) the negative correlation condition in Assumption [A\.1](https://arxiv.org/html/2605.16302#A1.Thmtheorem1) consistently holds during actual training \($\mathrm{Cov}(Y_i, \phi_i)$ is negative at all checkpoints\); \(ii\) IBPO reduces policy gradient variance by approximately 30% on average, which is consistent with the smoother reward evolution observed in the training curves \(Figure [2](https://arxiv.org/html/2605.16302#A0.F2)\) and faster convergence speed\.
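
Separately from the measurements reported above, the advantage-level claim of Theorem A.2 can be sanity-checked on synthetic data. The sketch below is a toy simulation under stated assumptions, not the paper's experiment: correctness probability 0.3, $\phi_i$ uniform on $[0.2, 1.0]$ for incorrect trajectories, and $\bar{Y}$, $\bar{\phi}$ treated as constants as in the proof.

```python
import random


def variance(xs):
    """Population variance of a list of floats."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


rng = random.Random(0)
N = 50_000
P_CORRECT = 0.3  # hypothetical probability of a correct trajectory

Y, phi = [], []
for _ in range(N):
    if rng.random() < P_CORRECT:
        Y.append(1.0)
        phi.append(0.0)  # correct trajectories have phi = 0
    else:
        Y.append(-1.0)
        phi.append(rng.uniform(0.2, 1.0))  # incorrect trajectories have phi > 0


def shaped_variance(lam):
    """Variance of Y + lam * phi; group means only shift it by a constant."""
    return variance([y + lam * p for y, p in zip(Y, phi)])


mean_y, mean_phi = sum(Y) / N, sum(phi) / N
cov = sum((y - mean_y) * (p - mean_phi) for y, p in zip(Y, phi)) / N
lam_max = -2.0 * cov / variance(phi)  # empirical counterpart of 2C / V_phi

var_gspo = shaped_variance(0.0)
var_small = shaped_variance(0.6)            # inside (0, lam_max)
var_large = shaped_variance(1.5 * lam_max)  # outside the admissible range
print(cov, lam_max, var_gspo, var_small, var_large)
```

In this toy setting the variance drops for a shaping weight inside the admissible interval and rises once $\lambda$ exceeds it, which is the behavior the quadratic in the proof predicts.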

## Appendix B Details of Compute Budget Matching

To ensure fair comparison, we match the total training compute budget across different methods by considering the following factors: \(1\) the number of sampled trajectories, \(2\) the total computational cost\. Therefore, we only compare performance under the same training compute budget\.

In our experiments, the compute budget of IBPO\+GSPO and GSPO is matched via actual GPU usage time\. Specifically, for each prompt $x$, IBPO\+GSPO first generates 8 responses $y$\. For each incorrect response $y_i$, it is concatenated with the original input $x$ and a randomly sampled correct response to form a new input, and a correction output is generated\. We note that the input sequences in the correction phase are longer due to the concatenated context, and the quadratic complexity of attention computation makes each correction trajectory more expensive than base sampling\. Therefore, we do not assume that the FLOPs of the two\-stage generation are identical to those of GSPO’s 64 samples\. Instead, we adopt a more direct matching approach: we measure the actual GPU usage time of GSPO with 64 samples and run IBPO\+GSPO within the same GPU time budget\. This means that under the same wall\-clock training time constraint, the two methods consume equivalent actual computational resources\. The horizontal axis in Figure [2](https://arxiv.org/html/2605.16302#A0.F2) corresponds to this actual training compute budget\.

## Appendix C Instantiation Details of IBPO

This appendix presents a specific instance of IBPO used in our experiments, namely the *compare\-and\-correct* mechanism, along with the integrated training pipeline and implementation details when combined with GSPO\. We emphasize that the following design is one specific choice for the comparison operator $\mathcal{M}$ described in the main text, and the core formulation of IBPO does not depend on this particular instantiation\.

### C\.1 Operator

##### Instantiation Choice\.

In our experiments, we use the *compare\-and\-correct* mechanism to instantiate the general comparison operator $\mathcal{M}$\. We emphasize that this is *one specific implementation* of IBPO, chosen for its simplicity and effectiveness on verifiable reasoning tasks, rather than a requirement of the IBPO formulation itself\.

Specifically, we first generate multiple candidate reasoning trajectories, then use the model itself to compare these trajectories and rewrite them into corrected outputs\. This implementation can be viewed as a specific choice of $\mathcal{M}$ that maps counterfactual differences to computable shaping terms $\phi(\cdot)$\. Implementation details, such as how to induce trajectory diversity and how to construct comparison inputs, are described in this appendix\.

Given two candidate reasoning trajectories/responses for the same input $x$, namely the target response $y$ and the reference response $y^{\mathrm{ref}}$, we construct the correction input $\tilde{x} = (x; y, y^{\mathrm{ref}})$, and let the model generate a revised output conditioned on $\tilde{x}$:

$$\hat{y} \sim \pi_\theta(\cdot \mid \tilde{x}), \qquad \hat{y} = \mathcal{C}(x; y, y^{\mathrm{ref}}). \tag{13}$$

##### Prompt Template\.

In all experiments, we use the following compare\-and\-correct instruction template \(with minor adjustments depending on the task format\):

> You are given two candidate solutions to the same problem\. Compare them step by step, identify any inconsistencies or errors, and then produce a corrected solution and final answer\.
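
A minimal sketch of how the correction input $\tilde{x} = (x; y, y^{\mathrm{ref}})$ might be assembled from this template is shown below. Only the instruction text is taken from the template. The function name `build_correction_input`, the section labels, and the ordering of the two candidates are assumptions made for illustration.

```python
COMPARE_AND_CORRECT_INSTRUCTION = (
    "You are given two candidate solutions to the same problem. "
    "Compare them step by step, identify any inconsistencies or errors, "
    "and then produce a corrected solution and final answer."
)


def build_correction_input(x, y, y_ref):
    """Concatenate the instruction, problem x, target y, and reference y_ref.

    The layout (labels, blank lines, candidate order) is a hypothetical choice.
    """
    return (
        f"{COMPARE_AND_CORRECT_INSTRUCTION}\n\n"
        f"Problem:\n{x}\n\n"
        f"Candidate solution 1:\n{y}\n\n"
        f"Candidate solution 2:\n{y_ref}\n"
    )


prompt = build_correction_input(
    x="Compute 2 + 2.",
    y="2 + 2 = 5, so the answer is 5.",
    y_ref="2 + 2 = 4, so the answer is 4.",
)
print(prompt)
```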

##### Reference Sampling\.

For each target response $y$ to be corrected, we sample the reference response $y^{\mathrm{ref}}$ as follows: if there exists at least one correct response within the group, we uniformly sample from the set of correct responses; otherwise, we uniformly sample from the set of incorrect responses\. This strategy aims to provide a relatively stronger \(or at least different\) counterfactual reference for comparison, without introducing external supervision\.
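
The sampling rule above can be sketched as follows. The function name `sample_reference` is hypothetical, and excluding the target itself from the candidate pool is an assumption the text does not state explicitly.

```python
import random


def sample_reference(responses, is_correct, target_idx, rng):
    """Pick a reference for the response at target_idx.

    Prefers a uniformly sampled correct response. If the group has none,
    falls back to a uniformly sampled incorrect response. Skipping the
    target itself is an assumption made for this sketch.
    """
    pool = [i for i, ok in enumerate(is_correct) if ok and i != target_idx]
    if not pool:
        pool = [i for i, ok in enumerate(is_correct) if not ok and i != target_idx]
    if not pool:
        return None  # group of size 1: nothing to compare against
    return responses[rng.choice(pool)]


rng = random.Random(0)
responses = ["resp_a", "resp_b", "resp_c"]

# Mixed group: the only correct response is always chosen as the reference.
ref_mixed = sample_reference(responses, [False, True, False], target_idx=0, rng=rng)

# All-incorrect group: the reference is another incorrect response.
ref_all_wrong = sample_reference(responses, [False, False, False], target_idx=0, rng=rng)
print(ref_mixed, ref_all_wrong)
```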

### C\.2 Shaping Instance: Recoverability\-Induced Reward

We adopt a *recoverability*\-based shaping instance to define $\phi(\cdot)$\. Let $r(x, y) \in \{0, 1\}$ denote the terminal correctness reward \(i\.e\., whether the final answer is correct\)\. For an incorrect response $y$ and its corrected output $\hat{y} = \mathcal{C}(x; y, y^{\mathrm{ref}})$, we define the implicit process shaping term as:

$$\Delta(x; y, y^{\mathrm{ref}}) = \beta \cdot \mathbb{I}\left[r(x, y) = 0 \ \wedge\ r(x, \hat{y}) = 1\right], \qquad \beta = 0.5. \tag{14}$$

The resulting shaped sequence\-level reward is

$$r'(x, y) = r(x, y) + \lambda\,\Delta(x; y, y^{\mathrm{ref}}). \tag{15}$$
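
Equations (14) and (15) translate into a few lines of code. The sketch below uses $\beta = 0.5$ as stated and $\lambda = 0.6$, the value the sensitivity analysis reports as near-optimal. The function names are illustrative.

```python
def shaping_term(r_orig, r_corrected, beta=0.5):
    """Eq. (14): beta if an incorrect response was successfully corrected, else 0."""
    return beta if (r_orig == 0 and r_corrected == 1) else 0.0


def shaped_reward(r_orig, r_corrected, lam, beta=0.5):
    """Eq. (15): terminal reward plus the weighted shaping term."""
    return r_orig + lam * shaping_term(r_orig, r_corrected, beta)


LAM = 0.6
# Correct response: no correction is attempted, so r_corrected is None here.
reward_correct = shaped_reward(1, None, LAM)
# Incorrect response whose correction succeeds: receives lam * beta.
reward_recovered = shaped_reward(0, 1, LAM)
# Incorrect response whose correction fails: no extra penalty and no bonus.
reward_unrecovered = shaped_reward(0, 0, LAM)
print(reward_correct, reward_recovered, reward_unrecovered)
```

The shaped reward therefore orders responses as correct, then recoverable, then unrecoverable, while leaving the reward of correct responses unchanged.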

##### Token\-Level Edit Distance Variants\.

When correction succeeds \($r(x, y) = 0 \wedge r(x, \hat{y}) = 1$\), we can further localize the specific tokens that were modified by computing the token\-level edit distance between the original response $y$ and the corrected output $\hat{y}$\. Let $y = (y_1, \ldots, y_T)$, $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_T)$ \(after alignment\), and define the set of unmodified tokens $\mathcal{U} = \{t : y_t = \hat{y}_t\}$\.

IBPO\-ratio \(unmodified ratio reward shaping\): The proportion of unmodified tokens $|\mathcal{U}|/T$ is used as the recoverability measure, directly serving as $\beta$ in Equation \([14](https://arxiv.org/html/2605.16302#A3.E14)\):

$$\Delta^{\mathrm{ratio}}(x; y, y^{\mathrm{ref}}) = \frac{|\mathcal{U}|}{T} \cdot \mathbb{I}\left[r(x, y) = 0 \ \wedge\ r(x, \hat{y}) = 1\right]. \tag{16}$$

Intuitively, a higher unmodified ratio indicates that the original reasoning is closer to being correct and should receive a higher shaping reward\.

IBPO\-mask \(token\-level gradient masking\): Gradient contributions from unmodified tokens are masked, and policy gradient updates are applied only to modified tokens\. Define the token\-level mask $m_t = \mathbb{I}[t \notin \mathcal{U}]$, then the gradient coefficient for the $t$\-th token in the policy gradient is multiplied by $m_t$:

$$\nabla_\theta \mathcal{J}^{\mathrm{mask}} \propto \sum_{t=1}^{T} m_t \cdot A_t \cdot \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}, x), \tag{17}$$

where unmodified tokens \($m_t = 0$\) are not penalized, as they were retained after correction, indicating that these tokens are likely correct reasoning steps\.
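
Both variants reduce to finding which tokens of $y$ survive in $\hat{y}$. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in aligner. This is an assumption: its matching heuristic is not identical to an edit-distance alignment, and the tokenization by whitespace is for illustration only.

```python
from difflib import SequenceMatcher


def unmodified_flags(y_tokens, y_hat_tokens):
    """True at position t if token y_t is retained in y_hat under the alignment."""
    keep = [False] * len(y_tokens)
    matcher = SequenceMatcher(a=y_tokens, b=y_hat_tokens, autojunk=False)
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag == "equal":
            for t in range(i1, i2):
                keep[t] = True
    return keep


def delta_ratio(y_tokens, y_hat_tokens, r_orig, r_corrected):
    """Eq. (16): |U| / T when correction succeeds, else 0."""
    if not (r_orig == 0 and r_corrected == 1) or not y_tokens:
        return 0.0
    keep = unmodified_flags(y_tokens, y_hat_tokens)
    return sum(keep) / len(y_tokens)


def gradient_mask(y_tokens, y_hat_tokens):
    """Mask m_t from Eq. (17): 1 for modified tokens, 0 for unmodified ones."""
    return [0 if k else 1 for k in unmodified_flags(y_tokens, y_hat_tokens)]


# Hypothetical whitespace-tokenized responses.
y = "2 + 2 = 5 so answer 5".split()
y_hat = "2 + 2 = 4 so answer 4".split()

ratio = delta_ratio(y, y_hat, r_orig=0, r_corrected=1)
mask = gradient_mask(y, y_hat)
print(ratio, mask)
```

Here six of the eight original tokens are retained, so the ratio variant assigns a shaping coefficient of 0.75 and the mask variant restricts updates to the two replaced tokens.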

##### Why No Penalty for Failed Corrections\.

For cases where $r(x, y) = 0$ and $r(x, \hat{y}) = 0$, we do not impose additional negative penalties, to avoid misattributing insufficient correction capability or poor reference quality to the intrinsic quality of the original reasoning\. This design choice helps prevent unnecessary bias and training instability\.

### C.3 Full Rewrite Detection and Suppression

##### Problem\.

During the counterfactual compare\-and\-correct process, the model may not perform local repairs on the original erroneous trajectory but instead completely ignore the original reasoning and generate an entirely new solution from scratch\. This “full rewrite” behavior leads to reward hijacking: when the corrected output happens to be correct, the shaping reward is incorrectly attributed to the “recoverability” of the original trajectory, while in reality the original reasoning process was not utilized\.

##### Detection Mechanism\.

We use a token-level edit distance (implemented in Python) to detect full rewrites. Let $d(a,b)$ denote the normalized edit distance between sequences $a$ and $b$ (valued in $[0,1]$). When correction succeeds ($r(x,y)=0 \wedge r(x,\hat{y})=1$), the case is classified as a full rewrite and $\Delta=0$ is set if both of the following conditions are satisfied:

$$d(y,\hat{y})>\alpha\quad\text{and}\quad d(y,\hat{y})>d(\hat{y},y^{\mathrm{ref}}), \tag{18}$$

where $\alpha$ is the edit distance threshold. The first condition measures how far the corrected output deviates from the original trajectory; the second condition confirms that the corrected output is closer to the reference answer than to the original trajectory, i.e., the model tends to copy the reference rather than repair the original reasoning.

Combining with Equation \([14](https://arxiv.org/html/2605.16302#A3.E14)\), the complete shaping term with rewrite filtering is:

$$\Delta(x;y,y^{\mathrm{ref}})=\beta\cdot\mathbb{I}\!\left[r(x,y)=0\ \wedge\ r(x,\hat{y})=1\ \wedge\ \neg\,\text{rewrite}(y,\hat{y},y^{\mathrm{ref}})\right]. \tag{19}$$
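The rewrite filter of Equations (18) and (19) might be implemented along the following lines. This is a sketch under stated assumptions: the normalization of the edit distance is not given in the text, so the Levenshtein distance is divided by the longer sequence length here; `alpha=0.6` follows the 60% value reported for Table 5 and `beta=0.5` follows Equation (26). All function names are hypothetical.

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance via a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def normalized_distance(a, b):
    """Assumption: normalize by the longer length so the value lies in [0, 1]."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0

def is_full_rewrite(y, y_hat, y_ref, alpha=0.6):
    """Eq. (18): far from the original AND closer to the reference."""
    d_orig = normalized_distance(y, y_hat)
    d_ref = normalized_distance(y_hat, y_ref)
    return d_orig > alpha and d_orig > d_ref

def shaping_term(y, y_hat, y_ref, r_y, r_y_hat, beta=0.5, alpha=0.6):
    """Eq. (19): beta only for successful corrections that are not rewrites."""
    success = r_y == 0 and r_y_hat == 1
    if success and not is_full_rewrite(y, y_hat, y_ref, alpha):
        return beta
    return 0.0

y = "a b c d e".split()
y_ref = "p q r s t".split()
copied = shaping_term(y, y_ref, y_ref, r_y=0, r_y_hat=1)               # copies the reference
repaired = shaping_term(y, "a b c d f".split(), y_ref, r_y=0, r_y_hat=1)  # local repair
```

Copying the reference verbatim is filtered out (`copied` is zero), while a one-token repair keeps the shaping reward.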

##### Threshold Sensitivity Analysis\.

We conducted a sensitivity analysis on the threshold $\alpha$ using Qwen3-32B (AIME25), with results shown in Table [5](https://arxiv.org/html/2605.16302#A3.T5).

Table 5: Rewrite detection rate and AIME25 accuracy under different edit distance thresholds $\alpha$ (Qwen3-32B). Performance remains stable in the 55%–70% range, and 60% is selected as the midpoint of this plateau.

##### Adaptive Threshold.

A fixed threshold is not conducive to cross-domain generalization. A more robust approach is distribution-based anomaly detection: compute the mean $\mu_d$ and standard deviation $\sigma_d$ of all edit distances within the current batch, and set the threshold as $\alpha=\mu_d+2\sigma_d$. By Chebyshev's inequality, at least 75% of the data falls within $\mu_d\pm 2\sigma_d$ even if the distribution is non-normal, so at most 25% of the corrections in a batch can exceed the threshold. The criterion is therefore conservative under any distribution.
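A minimal sketch of the batch-adaptive threshold, assuming the population standard deviation is intended (the text does not say whether the population or sample estimator is used). The function name and the example distances are illustrative only.

```python
import statistics

def adaptive_threshold(distances, k=2.0):
    """alpha = mu_d + k * sigma_d over the edit distances in one batch.

    Assumption: sigma_d is the population standard deviation.
    """
    mu = statistics.fmean(distances)
    sigma = statistics.pstdev(distances)
    return mu + k * sigma

# Made-up normalized edit distances for one batch of corrections.
batch_distances = [0.10, 0.20, 0.30, 0.20, 0.90]
alpha = adaptive_threshold(batch_distances)
flagged = [d for d in batch_distances if d > alpha]
```

Because the threshold tracks the batch statistics, a domain where corrections are routinely longer raises $\alpha$ automatically instead of flagging every sample.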

##### Suppressing Rewrites at the Source via Reinforcement Learning\.

Beyond post\-hoc detection, we also incorporate the edit distance constraint directly into the reward function: when the edit distance between the corrected output and the original erroneous trajectory is too large, an additional negative reward penalty is imposed\. Specifically, multiple correction results are generated simultaneously, and contrastive reinforcement learning is applied within the group, so that the model naturally learns during training that local repair yields higher returns than full rewriting, thereby suppressing rewrite tendencies at the behavioral policy level\. This is a more fundamental solution than threshold filtering: rather than discarding samples after rewrites occur, the incentive mechanism encourages the model to actively avoid rewrites\.

##### Positioning Note\.

The edit distance check is a defensive safeguard against reward hijacking, not a core component of the IBPO framework\. When the model possesses sufficient base capability, full rewrites are inherently rare events, and the specific threshold choice has negligible impact on the main experimental results\. Combined with the above RL training mechanism, as training progresses, the model’s rewrite tendency further decreases, and the dependence on the threshold correspondingly diminishes\.

### C.4 Trajectory Diversity and Coupling Reduction

IBPO relies on sampling multiple trajectories with sufficient diversity under the same input\. In our experiments, we adopt the following strategies to increase trajectory diversity and reduce trajectory coupling:

- Stochastic decoding. We use different random seeds, temperatures, and top-$p$/top-$k$ sampling parameters.
- Prompt perturbation (optional). We apply slight perturbations to the system prompt or format prompt to induce trajectory-level differences.

## Appendix D Implementation Details

##### Infrastructure\.

Experiments are conducted on 32 Nvidia A800 \(80G\) GPUs\.

##### Optimization\.

We use an initial learning rate of $5\times 10^{-7}$ with cosine decay (minimum ratio 0.1) and linear warmup over 3% of total steps. The entropy regularization coefficient is set to 0.
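The schedule described above could look roughly like the following. This sketch assumes that the stated learning rate is the peak value reached at the end of warmup and that the "minimum ratio" is the floor of the cosine decay relative to that peak; neither detail is spelled out in the text, and the function name is hypothetical.

```python
import math

def learning_rate(step, total_steps, peak_lr=5e-7, warmup_frac=0.03, min_ratio=0.1):
    """Linear warmup followed by cosine decay to min_ratio * peak_lr."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

# Example: a 1000-step run has 30 warmup steps.
schedule = [learning_rate(s, 1000) for s in range(1001)]
```

With these settings the rate climbs to $5\times 10^{-7}$ over the first 30 steps and decays to $5\times 10^{-8}$ by the final step.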

##### Sampling\.

GSPO uses $G=64$ rollouts per prompt, while IBPO uses $G=8$, to approximately match the overall compute budget across methods.

## Appendix E An Instantiation of IBPO: Integration with GSPO

GSPO is used solely as the carrier optimizer; replacing GSPO with GRPO or PPO does not alter the IBPO formulation\. In the preceding sections, we have introduced the general formulation of IBPO and its reward shaping definition\. IBPO is a training formulation orthogonal to the underlying sequence\-level reinforcement learning method\. This section describes how to seamlessly integrate this formulation into a representative sequence\-level reinforcement learning algorithm—GSPO\.

### E.1 Preliminaries: Sequence-Level Reinforcement Learning

We treat the autoregressive language model parameterized by $\theta$ as a policy $\pi_{\theta}$. Let $\mathcal{D}$ denote the query set. Given a query $x$, the model generates a complete response $y=(y_1,\dots,y_{|y|})$, with sequence probability

$$\pi_{\theta}(y\mid x)=\prod_{t=1}^{|y|}\pi_{\theta}(y_t\mid x,y_{<t}). \tag{20}$$

We consider a general class of sequence-level policy optimization objectives:

$$J(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\Bigl[\mathcal{L}\bigl(s(\theta;x,y),\hat{A}(x,y)\bigr)\Bigr], \tag{21}$$

where $s(\theta;x,y)$ denotes the sequence-level importance sampling weight, and $\hat{A}(x,y)$ is constructed from the sequence-level reward. Methods such as GSPO and GRPO can be viewed as specific instantiations of this formulation.

IBPO does not alter the optimization form in Equation \([21](https://arxiv.org/html/2605.16302#A5.E21)\), but instead shapes the original sequence\-level reward through the model’s self\-correction process\.

### E.2 IBPO and GSPO: Single-Pass Joint Training

In each iteration, we perform a single policy update: for the same batch of queries $x$, we simultaneously construct the candidate response set for sequence-level reinforcement learning, as well as the self-correction results for evaluating recoverability. IBPO only modifies the reward definition (shaping it into $r'$), while keeping the GSPO surrogate objective unchanged.

##### \(A\) GSPO Backbone with Shaped Reward\.

For each query $x$, sample $G$ responses $\{y_i\}_{i=1}^{G}$ from the old policy. The sequence-level importance ratio in GSPO is defined as

$$s_i(\theta)=\left(\frac{\pi_{\theta}(y_i\mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i\mid x)}\right)^{\frac{1}{|y_i|}}. \tag{22}$$

We use the shaped reward to construct the within-group advantage:

$$\hat{A}_i=\frac{r'(x,y_i)-\mu'}{\sigma'}, \tag{23}$$

where $\mu'$ and $\sigma'$ denote the mean and standard deviation of the shaped rewards within the group, respectively. The GSPO optimization objective is

$$J_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,\{y_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\Biggl[\frac{1}{G}\sum_{i=1}^{G}\min\Bigl(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\bigl(s_i(\theta),1-\epsilon,1+\epsilon\bigr)\,\hat{A}_i\Bigr)\Biggr]. \tag{24}$$
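Equations (22) to (24) can be illustrated with a small, framework-free sketch. It works on plain floats, so it shows the arithmetic rather than a trainable loss. The stabilizer `1e-8` mirrors line 31 of Algorithm 1, while `clip_eps=0.2` is only a placeholder because the clipping range is not stated in this section. Function names are hypothetical.

```python
import math

def sequence_ratio(new_logps, old_logps):
    """Eq. (22): (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|), computed in log space."""
    return math.exp((sum(new_logps) - sum(old_logps)) / len(new_logps))

def group_advantages(shaped_rewards, eps=1e-8):
    """Eq. (23): standardize shaped rewards within one group of G responses."""
    g = len(shaped_rewards)
    mu = sum(shaped_rewards) / g
    sigma = math.sqrt(sum((r - mu) ** 2 for r in shaped_rewards) / g)
    return [(r - mu) / (sigma + eps) for r in shaped_rewards]

def gspo_surrogate(ratios, advantages, clip_eps=0.2):
    """Eq. (24) for one prompt: mean of the clipped sequence-level terms."""
    total = 0.0
    for s, a in zip(ratios, advantages):
        clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(s * a, clipped * a)
    return total / len(ratios)

# Toy group of two responses with placeholder ratios and advantages.
objective = gspo_surrogate(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Working with summed token log-probabilities avoids underflow from multiplying many small probabilities, which is why the ratio is exponentiated only after length normalization.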

##### \(B\) Self\-Correction Shaping Signal\.

To compute the shaped reward $r'(x,y_i)$, we construct a self-correction instance for each incorrect response $y_i$. Specifically, we randomly sample a reference response $y_i^{\mathrm{ref}}$ (sampled from correct responses if at least one exists in the group; otherwise sampled from incorrect responses), and ask the model to compare and correct:

$$\hat{y}_i=\mathcal{C}(x;y_i,y_i^{\mathrm{ref}}). \tag{25}$$

This process does not introduce any additional supervision and is used solely to evaluate whether the model can correct an incorrect response to a correct one. Based on an instantiation of Equation ([14](https://arxiv.org/html/2605.16302#A3.E14)), we define

$$\Delta_i=\beta\cdot\mathbb{I}\!\left[r(x,y_i)<1\ \wedge\ r(x,\hat{y}_i)=1\right],\qquad\beta=0.5, \tag{26}$$

and obtain the sequence-level shaped reward:

$$r'(x,y_i)=r(x,y_i)+\lambda\,\Delta_i. \tag{27}$$

Although $r'(x,y_i)$ remains a sequence-level scalar in form, its value depends on the counterfactual self-correction process, thereby implicitly encoding process-level (step-level) credit assignment information.
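The shaping step of Equations (25) to (27) can be sketched as follows. The callables `correct_fn` and `reward_fn` are hypothetical stand-ins for the correction operator $\mathcal{C}$ and the terminal reward $r$ (with the query $x$ bound inside them), and `lam=1.0` is a placeholder because the value of $\lambda$ is not given in this section.

```python
import random

def pick_reference(rewards, rng):
    """Index of a reference: a correct response if any exists, else an incorrect one."""
    correct = [j for j, r in enumerate(rewards) if r == 1]
    pool = correct if correct else [j for j, r in enumerate(rewards) if r == 0]
    return rng.choice(pool)

def shaped_rewards(responses, rewards, correct_fn, reward_fn,
                   lam=1.0, beta=0.5, seed=0):
    """r'_i = r_i + lam * Delta_i, with Delta_i = beta if the correction succeeds."""
    rng = random.Random(seed)
    shaped = []
    for y, r in zip(responses, rewards):
        delta = 0.0
        if r < 1:  # only incorrect responses are sent through correction
            y_ref = responses[pick_reference(rewards, rng)]
            y_hat = correct_fn(y, y_ref)
            if reward_fn(y_hat) == 1:
                delta = beta
        shaped.append(r + lam * delta)
    return shaped

# Toy stand-ins: the answer "42" is correct, and only "41" is recoverable.
toy_reward = lambda y: 1 if y == "42" else 0
toy_correct = lambda y, y_ref: y_ref if y == "41" else y
result = shaped_rewards(["42", "41", "7"], [1, 0, 0], toy_correct, toy_reward)
```

The toy group ends up with three distinct reward levels (correct, recoverable, unrecoverable), which is what lets the group-normalized advantage separate the two incorrect responses.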

##### \(C\) Joint GSPO Training on Correction Behavior\.

In addition to using the implicit process reward $r'(x,y)$ for sequence-level optimization on the original reasoning input $x$, we further treat the correction behavior itself as a reasoning task of the same policy on an extended input space, and train it jointly using the same GSPO objective. Formally, this process does not introduce a new Markov Decision Process (MDP), but merely applies the policy to different conditional inputs (i.e., a prompt-conditioned policy).

Specifically, for each response $y_i$ whose recoverability needs to be evaluated, we construct a correction input:

$$\tilde{x}_i=(x;\;y_i,\;y_i^{\mathrm{ref}}), \tag{28}$$

where $y_i^{\mathrm{ref}}$ denotes the reference response. Conditioned on this input, the model generates a corrected output $\hat{y}_i\sim\pi_{\theta}(\cdot\mid\tilde{x}_i)$, and receives a terminal reward based on the correctness of the final answer, $r(\tilde{x}_i,\hat{y}_i)\in\{0,1\}$.

We include these correction samples together with the original reasoning samples in GSPO's sequence-level optimization. Formally, the GSPO objective can be written as a unified expectation over a mixed input distribution:

$$J_{\mathrm{GSPO}}^{\mathrm{joint}}(\theta)=\mathbb{E}_{\tilde{x}\sim\mathcal{D}_{\mathrm{mix}}}\Biggl[\frac{1}{G}\sum_{i=1}^{G}\min\Bigl(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\bigl(s_i(\theta),1-\epsilon,1+\epsilon\bigr)\,\hat{A}_i\Bigr)\Biggr], \tag{29}$$

where $\mathcal{D}_{\mathrm{mix}}$ denotes the mixed distribution composed of original query inputs $x$ and correction inputs $\tilde{x}=(x;y_i,y_i^{\mathrm{ref}})$. The corresponding sequence-level rewards are defined as $r'(x,y)$ for original inputs and $r(\tilde{x},\hat{y})$ for correction inputs, while both share the same GSPO surrogate form.
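One way to picture the mixed distribution $\mathcal{D}_{\mathrm{mix}}$ is as a single list of training samples in which correction inputs sit alongside original queries. The sketch below is illustrative only: the prompt template is hypothetical (the wording of the correction prompt is not given here), and how correction samples are grouped for advantage normalization is left open because the text does not specify it.

```python
def build_correction_input(x, y, y_ref):
    """Eq. (28) as a text prompt. Hypothetical template, not the paper's wording."""
    return (
        f"Problem:\n{x}\n\nAttempt:\n{y}\n\nReference:\n{y_ref}\n\n"
        "Compare the attempt with the reference and write a corrected solution."
    )

def mixed_batch(x, originals, corrections):
    """Merge original and correction samples for one query.

    originals:   list of (y, shaped_reward r')
    corrections: list of (y, y_ref, y_hat, terminal_reward r)
    """
    batch = [{"input": x, "response": y, "reward": r} for y, r in originals]
    for y, y_ref, y_hat, r in corrections:
        batch.append({
            "input": build_correction_input(x, y, y_ref),
            "response": y_hat,
            "reward": r,
        })
    return batch

samples = mixed_batch(
    "What is 6 * 7?",
    originals=[("41", 0.5), ("42", 1.0)],
    corrections=[("41", "42", "42", 1)],
)
```

Every entry has the same `input`/`response`/`reward` shape, so one surrogate objective can consume both kinds of sample without a separate loss or training stage.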

We emphasize that this joint training process does not introduce additional optimization stages or different loss functions. Learning the correction capability is purely manifested as behavioral generalization of the policy under different input conditions, enabling the model to gradually internalize compare-and-correct capabilities during training, without requiring additional multi-turn calls at inference time.

### E.3 IBPO + GSPO Algorithm Pseudocode

Combining all the above stages, the complete workflow of IBPO \+ GSPO is summarized in Algorithm[1](https://arxiv.org/html/2605.16302#alg1)\.

Algorithm 1: Instantiation of IBPO with GSPO: Single-Pass Joint Training

Input: Dataset $\mathcal{D}$; current policy $\pi_{\theta}$; old policy $\pi_{\theta_{\mathrm{old}}}$; group size $G$; clipping parameter $\epsilon$; shaping weight $\lambda$; recoverability scale $\beta$; correction operator $\mathcal{C}$; terminal reward $r(\cdot)\in\{0,1\}$.

Output: Updated parameters $\theta$.

1: for each iteration do

2: Sample a mini-batch of prompts $\mathcal{B}=\{x\}$ from $\mathcal{D}$.

3: for each prompt $x\in\mathcal{B}$ do

4: (A) Sample rollout trajectories and compute GSPO ratios.

5: Sample $G$ responses $\{y_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)$.

6: for $i=1$ to $G$ do

7: Compute terminal reward $r_i\leftarrow r(x,y_i)$.

8: Compute sequence-level ratio

9: $s_i(\theta)\leftarrow\left(\frac{\pi_{\theta}(y_i\mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i\mid x)}\right)^{\frac{1}{|y_i|}}$.

10: end for

11: (B) Self-correction shaping signal and shaped reward.

12: for $i=1$ to $G$ do

13: if $r_i=0$ then

14: Sample reference response $y_i^{\mathrm{ref}}$:

15: if there exists $j$ such that $r(x,y_j)=1$ then

16: Uniformly sample $y_i^{\mathrm{ref}}$ from $\{y_j : r(x,y_j)=1\}$.

17: else

18: Uniformly sample $y_i^{\mathrm{ref}}$ from $\{y_j : r(x,y_j)=0\}$.

19: end if

20: Generate correction result $\hat{y}_i\leftarrow\mathcal{C}(x;y_i,y_i^{\mathrm{ref}})$.

21: Set $\Delta_i\leftarrow\beta\cdot\mathbb{I}\!\left[r(x,y_i)=0\wedge r(x,\hat{y}_i)=1\right]$.

22: else

23: Set $\Delta_i\leftarrow 0$.

24: end if

25: Shaped reward $r'_i\leftarrow r_i+\lambda\,\Delta_i$.

26: end for

27: (A continued) Construct group advantage from shaped rewards.

28: $\mu'\leftarrow\frac{1}{G}\sum_{i=1}^{G}r'_i$.

29: $\sigma'\leftarrow\sqrt{\frac{1}{G}\sum_{i=1}^{G}(r'_i-\mu')^{2}}$.

30: for $i=1$ to $G$ do

31: $\hat{A}_i\leftarrow\frac{r'_i-\mu'}{\sigma'+10^{-8}}$.

32: end for

33: GSPO surrogate objective on original prompts.

34: Accumulate

35: $\mathcal{J}_x(\theta)\leftarrow\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(s_i(\theta)\hat{A}_i,\;\mathrm{clip}(s_i(\theta),1-\epsilon,1+\epsilon)\hat{A}_i\Big)$.

36: end for

37: (C) Optional: Joint GSPO training on correction behavior.

38: Construct correction input $\tilde{x}_i\leftarrow(x;\,y_i,\,y_i^{\mathrm{ref}})$ for corrected cases.

39: Sample $\hat{y}_i\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid\tilde{x}_i)$ and compute terminal reward $r(\tilde{x}_i,\hat{y}_i)$.

40: If joint training is enabled, add the same GSPO surrogate on $\tilde{x}_i$ (Equation [29](https://arxiv.org/html/2605.16302#A5.E29)).

41: Update $\theta$ by maximizing $\sum_{x\in\mathcal{B}}\mathcal{J}_x(\theta)$ (plus the joint objective if enabled).

42: end for