FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning
Summary
The paper proposes FBOS-RL, a feedback-driven bi-objective synergistic reinforcement learning framework that improves training efficiency and performance ceiling over GRPO in LLM alignment and reasoning by using feedback-guided exploration and two mutually reinforcing training objectives: Exploitation-oriented Policy Alignment and Exploration-oriented Capability Cultivation.
# FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning
Source: [https://arxiv.org/html/2605.20256](https://arxiv.org/html/2605.20256)
Xikai Zhang¹, Yongzhi Li³, Likang Xiao¹, Yingze Zhang¹, Yanhua Cheng³, Quan Chen³, Peng Jiang³, Wenjun Wu²,¹, Liu Liu²,¹∗
¹Hangzhou International Innovation Institute, Beihang University; ²School of Artificial Intelligence, Beihang University; ³Kuaishou Technology
###### Abstract
Reinforcement learning (RL) has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants alternates between *rollout sampling* and *policy update*: the policy first samples rollouts from its action space, and then updates its parameters according to the advantages computed over them. Unlike supervised learning, where each gradient step is anchored to an explicit ground-truth target, the optimal gradient direction for updating model parameters in this setting is not known a priori; the high-quality rollouts drawn during the sampling stage therefore act as the implicit "teacher" that guides every parameter update. However, mainstream RL algorithms such as GRPO adopt a simple sampling scheme that conditions all rollouts on the same original prompt. When a task lies beyond the policy model’s current capability, this sampling scheme rarely yields a high-quality rollout, leaving the policy model without a meaningful gradient direction when updating its parameters, which causes training to stall. To address this issue, we propose FBOS-RL, a *Feedback-Driven Bi-Objective Synergistic* reinforcement learning framework. Specifically, we let the model perform Feedback-Guided Exploration Enhancement based on the feedback provided by the environment, and on top of this we design two mutually reinforcing training objectives: *Exploitation-oriented Policy Alignment* (EPA) and *Exploration-oriented Capability Cultivation* (ECC). Extensive experiments demonstrate that EPA and ECC reinforce each other, forming a positive flywheel effect that significantly improves both the training efficiency and the final performance ceiling of reinforcement learning.
Specifically, under an identical rollout budget \(i\.e\., the same training steps\), FBOS\-RL learns substantially faster than GRPO and feedback\-based baselines and ultimately attains a higher performance ceiling, while exhibiting higher policy entropy and lower gradient norms throughout training\.
∗Corresponding author: liuliubh@buaa.edu.cn
## 1 Introduction
Reinforcement learning (RL) plays a pivotal role in the alignment and reasoning fine-tuning of large language models (LLMs) \[[6](https://arxiv.org/html/2605.20256#bib.bib6),[7](https://arxiv.org/html/2605.20256#bib.bib7),[8](https://arxiv.org/html/2605.20256#bib.bib8),[15](https://arxiv.org/html/2605.20256#bib.bib15),[4](https://arxiv.org/html/2605.20256#bib.bib4)\]. In essence, the training process of GRPO and its variants alternates between two stages, *rollout sampling* and *policy update* \[[5](https://arxiv.org/html/2605.20256#bib.bib5),[4](https://arxiv.org/html/2605.20256#bib.bib4),[28](https://arxiv.org/html/2605.20256#bib.bib28)\]: the model first samples in its action space to draw multiple rollouts, and then updates its policy parameters according to these rollouts and their advantages. Unlike supervised learning, where the model is updated directly toward an explicit ground truth, in this setting the model does not know the optimal gradient direction a priori when its parameters are updated. The high-quality rollouts that the model happens to draw during the sampling stage in effect play the role of a “teacher” that guides the parameter update. Only when the model samples rollouts that are better than its current behavior can the parameter update obtain a better gradient direction, thereby driving the model’s capability forward.
Despite their empirical success, prevailing RL algorithms for large-scale models (such as GRPO and its variants) \[[4](https://arxiv.org/html/2605.20256#bib.bib4),[12](https://arxiv.org/html/2605.20256#bib.bib12),[15](https://arxiv.org/html/2605.20256#bib.bib15)\] commonly adopt a single, blind sampling paradigm during the sampling stage: given the original prompt, the model directly generates multiple rollouts conditioned on it \[[4](https://arxiv.org/html/2605.20256#bib.bib4),[12](https://arxiv.org/html/2605.20256#bib.bib12)\]. This paradigm has a key limitation: when faced with complex tasks that exceed the model’s current capability, the model tends to enter a regime of inefficient sampling and rarely samples a high-quality rollout (like a monkey banging on a typewriter, which would not produce the complete works of Shakespeare before the universe ends; see Figure [1](https://arxiv.org/html/2605.20256#S1.F1)). In this situation, a single scalar reward can only tell the model that its current policy “performs poorly”, but cannot indicate “why” it is poor or “how to improve” (akin to a student drilling problems without any explanation of past mistakes, whose progress is slow) \[[14](https://arxiv.org/html/2605.20256#bib.bib14)\]. Because the model cannot obtain high-quality rollouts as optimization anchors during the sampling stage, it loses a reliable gradient direction when its parameters are updated, which eventually leads to low training efficiency and even prolonged training stagnation \[[23](https://arxiv.org/html/2605.20256#bib.bib23),[24](https://arxiv.org/html/2605.20256#bib.bib24),[22](https://arxiv.org/html/2605.20256#bib.bib22)\].
Figure 1: An illustrative analogy of the rollout-sampling stage in vanilla GRPO-style RL: a monkey randomly hitting keys on a typewriter is highly unlikely to ever produce the works of Shakespeare. Likewise, when a prompt exceeds the policy’s current capability, simple sampling strategies rarely produce a high-quality rollout, leaving training without a meaningful gradient anchor.
To address this issue, we propose FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning. The core innovation of FBOS-RL is that the rollout-sampling stage is enhanced through the guidance of feedback, and on top of this, two mutually promoting training objectives are designed, namely *Exploitation-oriented Policy Alignment* (EPA) and *Exploration-oriented Capability Cultivation* (ECC). By alternately optimizing these two objectives, FBOS-RL forms a positive bootstrapping flywheel, thereby substantially improving both the training efficiency and the final performance ceiling of reinforcement learning. Although several recent works have attempted to introduce external feedback during the sampling stage, most of them merely treat feedback as a means of data augmentation, that is, they use feedback to generate higher-quality data and then let the model imitate it \[[9](https://arxiv.org/html/2605.20256#bib.bib9),[10](https://arxiv.org/html/2605.20256#bib.bib10),[11](https://arxiv.org/html/2605.20256#bib.bib11),[29](https://arxiv.org/html/2605.20256#bib.bib29)\], while overlooking the cultivation of the model’s own capability to sample better rollouts according to feedback. In contrast, FBOS-RL not only leverages feedback to enhance the quality of the rollouts obtained during the sampling stage, but also explicitly sets “understanding feedback and correcting errors” itself as an additional reinforcement learning objective.
Through the alternating optimization of the two training objectives, a positive bootstrapping flywheel is formed, which improves both the training efficiency and the final performance ceiling\.
Our contributions are summarized as follows:
- We point out the inefficiency and blindness of mainstream reinforcement learning algorithms (such as GRPO) during the rollout-sampling stage, and propose an interactive sampling mechanism, namely Feedback-Guided Exploration Enhancement, which effectively breaks the inefficient-sampling bottleneck on complex reasoning tasks.
- We propose the Feedback-Driven Bi-Objective Synergistic Reinforcement Learning (FBOS-RL) framework. By designing two mutually promoting training objectives (Exploitation-oriented Policy Alignment and Exploration-oriented Capability Cultivation), it forms a positive bootstrapping flywheel that substantially improves both the training efficiency and the final performance ceiling of reinforcement learning.
- Extensive experiments on different datasets and on models of different families and scales validate that FBOS-RL can substantially improve both the training efficiency and the final performance ceiling of reinforcement learning. Meanwhile, FBOS-RL also avoids entropy collapse, maintains higher policy entropy, and exhibits a lower gradient norm, indicating its stronger exploration capability and better training stability.
## 2 Related works
### 2.1 Reinforcement Learning for LLM Reasoning
Reinforcement learning (RL) has become a central tool for aligning LLMs and unlocking their reasoning capabilities \[[6](https://arxiv.org/html/2605.20256#bib.bib6),[7](https://arxiv.org/html/2605.20256#bib.bib7),[8](https://arxiv.org/html/2605.20256#bib.bib8)\]. Building on PPO \[[5](https://arxiv.org/html/2605.20256#bib.bib5)\], scalable algorithms such as GRPO \[[4](https://arxiv.org/html/2605.20256#bib.bib4)\] and DAPO \[[12](https://arxiv.org/html/2605.20256#bib.bib12)\] have driven remarkable progress on mathematical and code reasoning, exemplified by DeepSeek-R1 \[[15](https://arxiv.org/html/2605.20256#bib.bib15)\]. To improve credit assignment, process reward models \[[13](https://arxiv.org/html/2605.20256#bib.bib13),[16](https://arxiv.org/html/2605.20256#bib.bib16)\] and implicit process rewards \[[32](https://arxiv.org/html/2605.20256#bib.bib32)\] provide step-level supervision, while ProRL \[[22](https://arxiv.org/html/2605.20256#bib.bib22)\] and entropy-aware optimization \[[24](https://arxiv.org/html/2605.20256#bib.bib24)\] extend the reasoning frontier through prolonged or stabilized training. A growing body of work, however, points out that the rollout-sampling stage in mainstream RL pipelines is fundamentally blind: rollouts are sampled solely from the original prompt, and a scalar reward only signals that the policy “performs poorly” without indicating “why” or “how to improve” \[[14](https://arxiv.org/html/2605.20256#bib.bib14)\]. Recent studies confirm that vanilla RL struggles to push the policy beyond the base model’s reasoning boundary \[[23](https://arxiv.org/html/2605.20256#bib.bib23)\] and emphasize the under-investigated role of exploration \[[28](https://arxiv.org/html/2605.20256#bib.bib28),[25](https://arxiv.org/html/2605.20256#bib.bib25)\].
Mitigations include reverse curricula\[[31](https://arxiv.org/html/2605.20256#bib.bib31)\], tree search guided RL\[[27](https://arxiv.org/html/2605.20256#bib.bib27)\], learning from negative data\[[21](https://arxiv.org/html/2605.20256#bib.bib21)\], test\-time compute scaling\[[30](https://arxiv.org/html/2605.20256#bib.bib30)\], and self\-play or self\-rewarding schemes\[[26](https://arxiv.org/html/2605.20256#bib.bib26),[20](https://arxiv.org/html/2605.20256#bib.bib20)\]\. Yet the gradient\-driving rollouts are still produced by a single, feedback\-free pass, leaving the core blindness largely unaddressed\.
### 2.2 Feedback-Driven Self-Refinement and Self-Correction
A complementary line of work investigates how natural-language feedback can help LLMs refine their outputs. Self-Refine \[[9](https://arxiv.org/html/2605.20256#bib.bib9)\] and Reflexion \[[10](https://arxiv.org/html/2605.20256#bib.bib10)\] use self-critique or verbal reflection at inference time, and self-critiquing models \[[19](https://arxiv.org/html/2605.20256#bib.bib19)\], learning-to-self-correct \[[18](https://arxiv.org/html/2605.20256#bib.bib18)\], and tool-augmented critiquing such as CRITIC \[[17](https://arxiv.org/html/2605.20256#bib.bib17)\] further show that feedback-conditioned generation raises output quality. More recent works train the self-correction behavior itself, e.g., SCoRe \[[11](https://arxiv.org/html/2605.20256#bib.bib11)\] and recursive introspection \[[29](https://arxiv.org/html/2605.20256#bib.bib29)\]. Earlier studies caution that LLMs cannot reliably self-correct without external grounding \[[14](https://arxiv.org/html/2605.20256#bib.bib14)\], motivating verifiable signals \[[16](https://arxiv.org/html/2605.20256#bib.bib16),[13](https://arxiv.org/html/2605.20256#bib.bib13)\]. Off-policy guidance methods such as LUFFY \[[3](https://arxiv.org/html/2605.20256#bib.bib3)\] additionally reweight the importance ratio to amplify learning on critical low-probability tokens. Most of these approaches, however, treat feedback either as an inference-time scaffold or as data for imitation, and rarely close the loop between feedback-augmented exploration and the underlying policy optimization; the resulting prompt distribution shift is typically ignored, yielding biased gradients. This fundamental limitation restricts the model’s ability to iteratively refine its behavior according to real-time feedback signals.
## 3 Feedback-Driven Bi-Objective Synergistic Reinforcement Learning
In this section, we introduce FBOS-RL, the Feedback-Driven Bi-Objective Synergistic Reinforcement Learning framework, whose overall pipeline is illustrated in Figure [2](https://arxiv.org/html/2605.20256#S3.F2). FBOS-RL performs reinforcement learning on a single policy model $\pi_{\theta}$ and does not rely on any external model. It substantially boosts both training efficiency and the attainable performance ceiling through two mutually reinforcing training objectives that together form a positive bootstrapping flywheel. Each training step of FBOS-RL is organized into three stages: the *Initial Exploration* stage, the *Feedback-Guided Exploration Enhancement* stage, and the *Bi-Objective Synergistic Training* stage, which we elaborate on in turn below.
Figure 2: Overview of our Feedback-Driven Bi-Objective Synergistic RL (FBOS-RL) framework. In the sampling phase, the policy first generates $n$ initial rollouts from the original prompt $q$; a rule-based verifier produces natural-language feedback for each, which is then concatenated with $q$ and the rollout to form a Feedback-Augmented Prompt (FAP) used for a second round of feedback-guided sampling. In the optimization phase, two complementary objectives are co-optimized: Exploitation-oriented Policy Alignment (EPA) via FA-GRPO and Exploration-oriented Capability Cultivation (ECC), which together induce a positive bootstrapping flywheel.
### 3.1 Initial Exploration
In the *Initial Exploration* stage, corresponding to the leftmost “Step 1: Initial Exploration” block of Figure [2](https://arxiv.org/html/2605.20256#S3.F2), the policy model $\pi_{\theta}$ independently samples, in parallel, $n$ initial rollouts $\{\mathrm{ans}^{1}_{i}\}_{i=1}^{n}\sim\pi_{\theta}(\cdot\mid q)$ conditioned solely on the original prompt $q$. Each rollout $\mathrm{ans}^{1}_{i}$ with $i\in[n]$ is then evaluated by a rule-based verifier that returns both a scalar reward $r^{1}_{i}$ and natural-language feedback $F^{1}_{i}$ that explicitly identifies the flaws in the current answer (e.g., a specific reasoning step that violates a constraint, an incorrect intermediate result, or a violation of the required output format). Unlike a single scalar reward, $F^{1}_{i}$ provides actionable, error-localized information that tells the model not only that $\mathrm{ans}^{1}_{i}$ is suboptimal, but also *why* and *where* it went wrong.
### 3.2 Feedback-Guided Exploration Enhancement
To enhance the quality of the rollouts sampled during the sampling stage, so that the model can obtain higher-quality gradient guidance when its parameters are updated in the subsequent training phase, we proceed from the Initial Exploration stage to the *Feedback-Guided Exploration Enhancement* stage, which corresponds to the orange block in Figure [2](https://arxiv.org/html/2605.20256#S3.F2). For each initial rollout $\mathrm{ans}^{1}_{i}$, we concatenate the original prompt $q$, the rollout itself, and its corresponding feedback $F^{1}_{i}$ to construct a Feedback-Augmented Prompt (FAP): $\tilde{q}_{i}\triangleq(q\oplus\mathrm{ans}^{1}_{i}\oplus F^{1}_{i})$. The FAP is then used to re-prompt the policy model for a second round of sampling: for each $\tilde{q}_{i}$, the model samples $k$ rollouts $\{\mathrm{ans}^{2}_{i,j}\}_{j=1}^{k}\sim\pi_{\theta}(\cdot\mid\tilde{q}_{i})$, and each $\mathrm{ans}^{2}_{i,j}$ is again scored by the same rule-based verifier to obtain a reward $r^{2}_{i,j}$.
In total, the two stages above produce $n+n\cdot k$ rollouts for each query, of which $n$ are generated conditioned on the original prompt $q$ and the remaining $n\cdot k$ are generated conditioned on the feedback-augmented prompts $\tilde{q}_{i}$. The primary advantage of this mechanism is that it overcomes the blindness of the traditional sampling stage and substantially raises the quality of the sampled rollouts. Instead of forcing the model to blindly sample rollouts from the original prompt $q$, which often yields low-quality rollouts when the task exceeds the model’s current capability, we provide targeted feedback for each initial rollout. This feedback explicitly informs the model *how* to adjust its subsequent generation. Consequently, even when the model’s initial capability is weak, the FAPs serve as a guiding scaffold, improving the quality of the second-round rollouts and enabling the model to align with the environment rapidly. This process is analogous to a student learning from a detailed explanation of their mistakes rather than blindly guessing the correct answer.
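As a rough illustration of the Initial Exploration and Feedback-Guided Exploration Enhancement stages, the two sampling rounds can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the `sample` and `verify` callables, the `Rollout` record, and the FAP template inside `build_fap` are all hypothetical placeholders, since the paper does not specify the exact prompt format or interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
Sampler = Callable[[str, int], List[str]]           # (prompt, count) -> answers
Verifier = Callable[[str, str], Tuple[float, str]]  # (q, answer) -> (reward, feedback)


@dataclass
class Rollout:
    prompt: str    # conditioning prompt: q in round 1, a FAP in round 2
    answer: str
    reward: float
    stage: int     # 1 = Initial Exploration, 2 = Feedback-Guided Exploration
    parent: int    # index i of the initial rollout this sample derives from


def build_fap(q: str, answer: str, feedback: str) -> str:
    """Concatenate q, the rollout, and its feedback (the template is an assumption)."""
    return f"{q}\n\n[Previous answer]\n{answer}\n\n[Feedback]\n{feedback}"


def sample_two_rounds(q: str, sample: Sampler, verify: Verifier,
                      n: int, k: int) -> List[Rollout]:
    rollouts: List[Rollout] = []
    for i, ans1 in enumerate(sample(q, n)):      # round 1: condition on q only
        r1, fb = verify(q, ans1)
        rollouts.append(Rollout(q, ans1, r1, 1, i))
        fap = build_fap(q, ans1, fb)             # FAP for initial rollout i
        for ans2 in sample(fap, k):              # round 2: condition on the FAP
            r2, _ = verify(q, ans2)              # the same verifier scores round 2
            rollouts.append(Rollout(fap, ans2, r2, 2, i))
    return rollouts
```

With `n` initial rollouts and `k` follow-up rollouts per FAP, the returned list has `n + n * k` entries, matching the count stated above.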
### 3.3 Bi-Objective Synergistic Training
Building upon the two sampling stages described above, we design two mutually reinforcing training objectives, namely *Exploitation-oriented Policy Alignment* (EPA) and *Exploration-oriented Capability Cultivation* (ECC), as illustrated in the rightmost green block of Figure [2](https://arxiv.org/html/2605.20256#S3.F2). At each training step, we sequentially update the policy parameters using these two objectives.
In this paper, the qualifiers “exploitation\-oriented” and “exploration\-oriented” carry operational meanings specific to our framework\. By “exploitation\-oriented” we mean exploiting the high\-quality rollouts already collected as a learning signal for policy alignment\. By “exploration\-oriented” we mean cultivating the model’s capability to sample better rollouts when conditioned on feedback\-augmented prompts\. These usages are scoped to GRPO\-style training and are not intended to coincide with the classical action\-selection trade\-off in textbook reinforcement learning\.
Objective 1: Exploitation-oriented Policy Alignment (EPA). We aggregate the $n$ initial rollouts $\{\mathrm{ans}^{1}_{i}\}_{i=1}^{n}$ (drawn from $\pi_{\theta_{\text{old}}}(\cdot\mid q)$) and the $n\cdot k$ rollouts $\{\mathrm{ans}^{2}_{i,j}\}_{i=1,j=1}^{n,k}$ produced in the Feedback-Guided Exploration Enhancement stage (drawn from $\pi_{\theta_{\text{old}}}(\cdot\mid\tilde{q}_{i})$) into a single group
$$\mathcal{G}_{q}\;\triangleq\;\{\mathrm{ans}^{1}_{i}\}_{i=1}^{n}\,\cup\,\{\mathrm{ans}^{2}_{i,j}\}_{i=1,j=1}^{n,\,k}$$
of size $N\triangleq n+n\cdot k$. Based on the rewards $\mathcal{R}_{q}\triangleq\{r^{1}_{i}\}_{i=1}^{n}\cup\{r^{2}_{i,j}\}_{i=1,j=1}^{n,k}$ associated with these rollouts, we compute a group-normalized advantage following GRPO \[[4](https://arxiv.org/html/2605.20256#bib.bib4)\] for every rollout $a\in\mathcal{G}_{q}$ with reward $r(a)\in\mathcal{R}_{q}$ as
$$\hat{A}^{\text{EPA}}(a)\;=\;\frac{r(a)-\mu_{\mathcal{G}_{q}}}{\sigma_{\mathcal{G}_{q}}+\epsilon},\qquad\text{with}\quad\mu_{\mathcal{G}_{q}}=\frac{1}{N}\sum\nolimits_{a'\in\mathcal{G}_{q}}r(a'),\quad\sigma_{\mathcal{G}_{q}}^{2}=\frac{1}{N}\sum\nolimits_{a'\in\mathcal{G}_{q}}\big(r(a')-\mu_{\mathcal{G}_{q}}\big)^{2}.$$
For an initial rollout $\mathrm{ans}^{1}_{i}$ (drawn from $\pi_{\theta_{\text{old}}}(\cdot\mid q)$), the importance sampling ratio at token $t$ retains the standard form, i.e.,
$$\rho^{1}_{i,t}(\theta)\;=\;\frac{\pi_{\theta}(\mathrm{ans}^{1}_{i,t}\mid q,\,\mathrm{ans}^{1}_{i,<t})}{\pi_{\theta_{\text{old}}}(\mathrm{ans}^{1}_{i,t}\mid q,\,\mathrm{ans}^{1}_{i,<t})}.$$
However, for a rollout $\mathrm{ans}^{2}_{i,j}$ collected in the Feedback-Guided Exploration Enhancement stage, the behavior policy that actually drew it is $\pi_{\theta_{\text{old}}}(\cdot\mid\tilde{q}_{i})$, whereas the policy we ultimately deploy at inference time only sees the original prompt $q$. By definition of an importance sampling ratio, the numerator must correspond to the target policy that we wish to optimize and deploy, while the denominator must correspond to the behavior policy that actually produced the sample. The corresponding token-level importance sampling ratio is therefore defined as
$$\rho^{2}_{i,j,t}(\theta)\;=\;\frac{\pi_{\theta}(\mathrm{ans}^{2}_{i,j,t}\mid q,\,\mathrm{ans}^{2}_{i,j,<t})}{\pi_{\theta_{\text{old}}}(\mathrm{ans}^{2}_{i,j,t}\mid\tilde{q}_{i},\,\mathrm{ans}^{2}_{i,j,<t})}.$$
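To make the asymmetry in this ratio concrete, the following Python sketch computes token-level ratios from log-probabilities. The probability values are invented for illustration, and `token_ratios` is a hypothetical helper, not a function from the paper or from any library.

```python
import math
from typing import List


def token_ratios(logp_target: List[float], logp_behavior: List[float]) -> List[float]:
    """Per-token importance ratio exp(log pi_target - log pi_behavior)."""
    return [math.exp(a - b) for a, b in zip(logp_target, logp_behavior)]


# Invented per-token probabilities for one 3-token second-round rollout.
p_given_q = [0.05, 0.40, 0.90]     # pi_theta(token | q, prefix): numerator
p_given_fap = [0.50, 0.80, 0.90]   # pi_theta_old(token | FAP, prefix): denominator

rho2 = token_ratios([math.log(p) for p in p_given_q],
                    [math.log(p) for p in p_given_fap])
print([round(r, 3) for r in rho2])  # [0.1, 0.5, 1.0]
```

Because the numerator and denominator condition on different prompts, this ratio need not equal 1 even when $\theta=\theta_{\text{old}}$: tokens that are likely only in the presence of feedback receive a small ratio, as the first token in the example does.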
Inspired by Yan et al. \[[3](https://arxiv.org/html/2605.20256#bib.bib3)\], we additionally introduce a reweighting function $f(\rho)=\rho/(\rho+0.1)$ that acts on $\rho^{2}_{i,j,t}(\theta)$, in order to strengthen the learning signal on low-probability tokens that the model has not yet mastered but that may correspond to critical reasoning steps. We then treat $\mathcal{G}_{q}$ as a single group and perform GRPO training over it. Putting the above ingredients together, and applying the reweighting function $f(\cdot)$ exclusively to $\rho^{2}_{i,j,t}(\theta)$, the EPA objective is given by
$$\mathcal{L}_{\text{EPA}}(\theta)\;=\;\mathcal{L}_{\text{EPA}}^{\text{init}}(\theta)\;+\;\mathcal{L}_{\text{EPA}}^{\text{FAP}}(\theta),\tag{3.1}$$
where $\mathcal{L}_{\text{EPA}}^{\text{init}}(\theta)$ accounts for the $n$ initial rollouts drawn from $\pi_{\theta_{\text{old}}}(\cdot\mid q)$ and $\mathcal{L}_{\text{EPA}}^{\text{FAP}}(\theta)$ accounts for the $n\cdot k$ rollouts drawn from $\pi_{\theta_{\text{old}}}(\cdot\mid\tilde{q}_{i})$, respectively,
$$\begin{aligned}
\mathcal{L}_{\text{EPA}}^{\text{init}}(\theta)&=-\,\mathbb{E}\Bigg[\frac{1}{N}\sum_{i=1}^{n}\frac{1}{\lvert\mathrm{ans}^{1}_{i}\rvert}\sum_{t=1}^{\lvert\mathrm{ans}^{1}_{i}\rvert}\min\!\Big(\rho^{1}_{i,t}(\theta)\,\hat{A}^{\text{EPA}}(\mathrm{ans}^{1}_{i}),\;\mathrm{clip}_{\text{init}}\,\hat{A}^{\text{EPA}}(\mathrm{ans}^{1}_{i})\Big)\Bigg],\\
\mathcal{L}_{\text{EPA}}^{\text{FAP}}(\theta)&=-\,\mathbb{E}\Bigg[\frac{1}{N}\sum_{i=1}^{n}\sum_{j=1}^{k}\frac{1}{\lvert\mathrm{ans}^{2}_{i,j}\rvert}\sum_{t=1}^{\lvert\mathrm{ans}^{2}_{i,j}\rvert}\min\!\Big(f\!\big(\rho^{2}_{i,j,t}(\theta)\big)\,\hat{A}^{\text{EPA}}(\mathrm{ans}^{2}_{i,j}),\;\mathrm{clip}_{\text{FAP}}\,\hat{A}^{\text{EPA}}(\mathrm{ans}^{2}_{i,j})\Big)\Bigg].
\end{aligned}$$
Note that $\mathrm{clip}_{\text{init}}=\mathrm{clip}\big(\rho^{1}_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)$ and $\mathrm{clip}_{\text{FAP}}=\mathrm{clip}\big(f\big(\rho^{2}_{i,j,t}(\theta)\big),\,1-\varepsilon,\,1+\varepsilon\big)$. The essence of EPA is therefore to pull the model’s *feedback-free* policy $\pi_{\theta}(\cdot\mid q)$ toward the high-quality rollouts discovered during the feedback-guided sampling stage, thereby achieving highly efficient exploitation of the collected high-quality rollouts.
We empirically verify that EPA can be optimized in a stable manner: as shown in Appendix [C](https://arxiv.org/html/2605.20256#A3), on both Llama-3.1-8B-Instruct and Qwen3-14B, the EPA training-set score rises steadily while its standard deviation steadily decreases, and the same trend holds across training subsets of different difficulty levels.
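The EPA objective can be sketched numerically for a single query as follows. This is a minimal illustration under stated assumptions, not the authors' training code: it operates on precomputed per-token ratios instead of model outputs, omits the expectation over queries, and the defaults `clip_eps=0.2` and `eps=1e-6` are assumed values that this section does not specify. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """(r - mean) / (std + eps) over one group, using the population std."""
    n = len(rewards)
    mu = sum(rewards) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / n)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reweight(rho: float) -> float:
    """f(rho) = rho / (rho + 0.1), applied only to FAP rollouts."""
    return rho / (rho + 0.1)


def clipped_term(w: float, adv: float, clip_eps: float) -> float:
    """min(w * A, clip(w, 1 - eps, 1 + eps) * A) for one token."""
    w_clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, w))
    return min(w * adv, w_clipped * adv)


def epa_loss(init_ratios, fap_ratios, init_rewards, fap_rewards, clip_eps=0.2):
    """init_ratios / fap_ratios hold one list of per-token ratios per rollout."""
    # Advantages are normalized over the pooled group of all N rollouts.
    adv = group_advantages(list(init_rewards) + list(fap_rewards))
    n_init, total = len(init_ratios), 0.0
    for ratios, a in zip(init_ratios, adv[:n_init]):
        total += sum(clipped_term(r, a, clip_eps) for r in ratios) / len(ratios)
    for ratios, a in zip(fap_ratios, adv[n_init:]):
        total += sum(clipped_term(reweight(r), a, clip_eps) for r in ratios) / len(ratios)
    return -total / len(adv)
```

Two properties follow directly from the stated form of $f$. Its derivative is $f'(\rho)=0.1/(\rho+0.1)^{2}$, which equals $2.5$ at $\rho=0.1$ and roughly $0.083$ at $\rho=1$, so low-ratio tokens receive a much larger gradient weight. Also, $f(\rho)<1$ for every $\rho\ge 0$, so for FAP rollouts only the lower clip bound $1-\varepsilon$ can become active.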
Objective 2: Exploration-oriented Capability Cultivation (ECC). To cultivate the model’s ability to discover higher-quality rollouts when guided by FAPs, we treat the $k$ rollouts generated in the second round for *each* specific FAP $\tilde{q}_{i}$ as an independent group $\mathcal{G}_{\tilde{q}_{i}}=\{\mathrm{ans}^{2}_{i,j}\}_{j=1}^{k}$ and perform standard GRPO training. This objective focuses exclusively on cultivating the model’s *feedback-conditioned sampling capability*, i.e., its ability to discover better rollouts when conditioned on FAPs that contain feedback signals.
Based on the rewards $\{r^{2}_{i,j}\}_{j=1}^{k}$ associated with these rollouts, we compute a group-normalized advantage for each rollout $a\in\mathcal{G}_{\tilde{q}_{i}}$ following GRPO \[[4](https://arxiv.org/html/2605.20256#bib.bib4)\] as
$$\hat{A}^{\text{ECC}}(a)\;=\;\frac{r(a)-\mu_{\tilde{q}_{i}}}{\sigma_{\tilde{q}_{i}}+\epsilon},\qquad\text{with}\quad\mu_{\tilde{q}_{i}}=\frac{1}{k}\sum\nolimits_{a'\in\mathcal{G}_{\tilde{q}_{i}}}r(a'),\quad\sigma_{\tilde{q}_{i}}^{2}=\frac{1}{k}\sum\nolimits_{a'\in\mathcal{G}_{\tilde{q}_{i}}}\big(r(a')-\mu_{\tilde{q}_{i}}\big)^{2}.$$
Note that $r(a)$ denotes the reward corresponding to the rollout $a$. Because every rollout in $\mathcal{G}_{\tilde{q}_{i}}$ shares the same conditioning prompt $\tilde{q}_{i}$, the standard GRPO importance ratio applies without modification:
$$\tilde{\rho}_{i,j,t}(\theta)\;=\;\frac{\pi_{\theta}(\mathrm{ans}^{2}_{i,j,t}\mid\tilde{q}_{i},\,\mathrm{ans}^{2}_{i,j,<t})}{\pi_{\theta_{\text{old}}}(\mathrm{ans}^{2}_{i,j,t}\mid\tilde{q}_{i},\,\mathrm{ans}^{2}_{i,j,<t})}.$$
The ECC objective is defined as follows:
$$\mathcal{L}_{\text{ECC}}(\theta)=-\,\mathbb{E}\Bigg[\frac{1}{k}\sum_{j=1}^{k}\frac{1}{\lvert\mathrm{ans}^{2}_{i,j}\rvert}\sum_{t=1}^{\lvert\mathrm{ans}^{2}_{i,j}\rvert}\min\!\Big(\tilde{\rho}_{i,j,t}(\theta)\,\hat{A}^{\text{ECC}}(\mathrm{ans}^{2}_{i,j}),\;\mathrm{clip}_{\text{ECC}}\,\hat{A}^{\text{ECC}}(\mathrm{ans}^{2}_{i,j})\Big)\Bigg],$$
where $\mathrm{clip}_{\text{ECC}}=\mathrm{clip}\big(\tilde{\rho}_{i,j,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)$. By optimizing $\mathcal{L}_{\text{ECC}}$, ECC improves the model’s ability to discover higher-quality rollouts based on FAPs, so that it learns how to better correct its mistakes according to the feedback. We empirically verify that ECC can be optimized in a stable manner as well: as shown in Appendix [D](https://arxiv.org/html/2605.20256#A4), on both Llama-3.1-8B-Instruct and Qwen3-14B, the ECC training-set score rises steadily while its standard deviation steadily decreases, and the same trend holds across training subsets of different difficulty levels. The two objectives are closely analogous to how a student tackles practice problems: ECC corresponds to the act of refining answers based on the explanations of past mistakes, while EPA corresponds to gradually internalizing those lessons into the realization that “ah, so next time I encounter this problem, I should solve it this way!” Together, they enable the student to learn more efficiently and make rapid progress.
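The practical difference between ECC's per-FAP grouping and a pooled normalization can be illustrated with a small Python sketch. The reward values are invented, `ecc_advantages` is a hypothetical helper, and `eps=1e-6` is an assumed default; for simplicity the pooled comparison covers only the second-round rollouts, whereas EPA's group $\mathcal{G}_{q}$ additionally contains the $n$ initial rollouts.

```python
import math
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """(r - mean) / (std + eps) over one group, using the population std."""
    n = len(rewards)
    mu = sum(rewards) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / n)
    return [(r - mu) / (sigma + eps) for r in rewards]


def ecc_advantages(rewards_by_fap: Sequence[Sequence[float]]) -> List[List[float]]:
    """Normalize each FAP's k rewards separately (one GRPO group per FAP)."""
    return [group_advantages(group) for group in rewards_by_fap]


# Invented rewards for n = 2 FAPs with k = 3 second-round rollouts each.
rewards_by_fap = [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
per_fap = ecc_advantages(rewards_by_fap)

# For contrast: the same six rewards normalized as one pooled group.
pooled = group_advantages([r for group in rewards_by_fap for r in group])
```

In this toy case the second FAP's rollouts all share one reward, so their per-FAP advantages are zero and they contribute no ECC gradient, while under pooled normalization the same rollouts receive positive advantages.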
### 3.4 The Positive Bootstrapping Flywheel Effect
In what follows, we describe how EPA and ECC mutually reinforce each other and together give rise to a positive bootstrapping flywheel\. The two directions of this mutual reinforcement are empirically validated by our ablation study in Section[4\.3](https://arxiv.org/html/2605.20256#S4.SS3)\.
ECC boosts EPA (empirically verified in Section [4.3.1](https://arxiv.org/html/2605.20256#S4.SS3.SSS1)): ECC trains the model to discover higher-quality rollouts conditioned on FAPs, and these high-quality rollouts are subsequently injected into the group of EPA for GRPO training. Therefore, ECC raises the quality of the rollouts used for EPA training, thereby boosting EPA by providing it with a higher-quality gradient direction and elevating its policy-alignment ceiling.
EPA elevates ECC (empirically verified in Section [4.3.2](https://arxiv.org/html/2605.20256#S4.SS3.SSS2)): At the very beginning of training, the model typically cannot yet generate sufficiently good rollouts conditioned solely on the original prompt $q$, and frequently makes relatively low-level mistakes. Consequently, the feedback $\{F^{1}_{i}\}_{i=1}^{n}$ received by the rollouts $\{\mathrm{ans}^{1}_{i}\}_{i=1}^{n}$ produced in the Initial Exploration stage mostly targets low-level errors (e.g., format errors). At this stage, ECC can only train the model’s ability to discover better rollouts based on “feedback targeting low-level errors”, and can hardly train its ability to discover better rollouts based on “feedback targeting higher-level errors” (since the model rarely receives such feedback at this point). EPA, in contrast, trains the model to generate better rollouts conditioned solely on the original prompt $q$. With EPA in the loop, the model increasingly receives “feedback targeting higher-level errors” during the Initial Exploration stage, so that ECC can in turn train the model’s ability to discover better rollouts based on such feedback (avoiding a mismatch between the model’s capability and the difficulty of the feedback). As the model breaks through low-level errors, it triggers “advanced feedback” targeting deeper logical flaws. This capability progression enables ECC to continuously train the model’s correction ability in an error space of ever-higher difficulty.
## 4 Experiments
### 4\.1 Experimental Setup
##### Datasets\.
We conduct experiments on the TravelPlanner dataset \[[2](https://arxiv.org/html/2605.20256#bib.bib2)\] and the Lean4 version of the MiniF2F dataset \[[1](https://arxiv.org/html/2605.20256#bib.bib1)\]\. The TravelPlanner dataset evaluates the ability of LLMs to perform complex, multi\-constraint planning in real\-world scenarios, while MiniF2F \- Lean4 benchmarks their capacity for formal mathematical reasoning and proof generation within the Lean 4 formalization environment\. Detailed descriptions of both datasets are provided in Appendix [A](https://arxiv.org/html/2605.20256#A1)\. For the TravelPlanner dataset, we merge its training set \(45 samples\) and validation set \(180 samples\) into our training set \(225 samples in total\), and adopt its test set \(1000 samples\) as our validation set\. For the MiniF2F \- Lean4 dataset, we use its validation set \(244 samples\) as our training set and its test set \(244 samples\) as our validation set\.
##### Models\.
On the TravelPlanner dataset, we conduct experiments on the Llama\-3\.1\-8B\-Instruct and Qwen3\-14B models\. On the MiniF2F \- Lean4 dataset, we conduct experiments on the Qwen3\.5\-27B model\. All training is conducted on NVIDIA H200 GPUs\.
##### Baseline Methods\.
We evaluate our method against GRPO\[[4](https://arxiv.org/html/2605.20256#bib.bib4)\], as well as three comparison methods designed by ourselves: FBOS\-RL w/o EPA, FBOS\-RL w/o ECC, and GRPO w/ Extra Update\. Specifically:
- FBOS\-RL w/o EPA: A variant of FBOS\-RL that removes the EPA training objective, used to ablate the contribution of the EPA objective\.
- FBOS\-RL w/o ECC: A variant of FBOS\-RL that removes the ECC training objective, used to ablate the contribution of the ECC objective\.
- GRPO w/ Extra Update: A variant of GRPO that adds one extra update at each training step, used to control for the effect of the additional gradient updates introduced by FBOS\-RL\. This baseline rules out the possibility that the gains of FBOS\-RL stem merely from the additional parameter update introduced by ECC; even against this strengthened baseline, FBOS\-RL still performs substantially better, confirming that the improvements come from the synergistic flywheel rather than from the extra update itself\. Detailed results of this controlled experiment on both the Qwen3\-14B and Llama\-3\.1\-8B\-Instruct models are provided in Section [4\.4](https://arxiv.org/html/2605.20256#S4.SS4)\.
To ensure a fair comparison, we keep the order of the training data identical across runs and ensure that the number of rollouts per training step is strictly the same\. To mitigate the inherent stochasticity of reinforcement learning, we independently repeat each experiment three times\. In addition, on the TravelPlanner dataset, we also include the following baselines:
- Greedy Search: To evaluate the performance of a traditional search algorithm on TravelPlanner, we adopt a greedy search strategy as one of the baselines\. Greedy Search takes cost minimization as its core objective: among transportation options it selects the cheapest one; for dining it chooses the restaurants with the lowest average cost; for accommodation it selects the cheapest option; and for sightseeing it picks attractions at random each day\. For a 5\-day or 7\-day travel plan, it selects the top 1 to 2 cities from the returned city search results as destinations\.
- Sole\-Planning Mode: We focus on the sole\-planning mode of the TravelPlanner task\. In this mode, the model is provided in advance with the reference information required for reasoning and planning\. This setting evaluates the model’s ability to perform complex reasoning and planning directly from the given information\. The baselines under the sole\-planning mode include the following models and strategies:
  - Models: GPT\-3\.5\-Turbo, GPT\-4\-Turbo, GPT\-4o \(version: 2024\-11\-20\), Mixtral\-8×7B\-MoE, Gemini Pro, Qwen3\-8B\-Instruct, DeepSeek\-R1\.
  - Strategies: Direct, CoT, ReAct, Reflexion, and Prompt reflect\. Specifically, "Direct" prompts the model to directly generate the final travel plan; "CoT" prompts the model to reason step by step before producing the final travel plan; "ReAct" prompts the model to solve the task by alternating between Thought, Action, and Observation steps; "Reflexion" prompts the model to perform self\-reflection before generating the final travel plan; "Prompt reflect" prompts the model to reflect on its previous reasoning before producing the final travel plan\.
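The Greedy Search baseline in the list above can be sketched as follows\. This is a minimal illustration rather than the benchmark’s implementation: the `Option` record, the function names, and the toy data are all hypothetical, and the city\-selection step is omitted\.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """Hypothetical record for one candidate (transport, restaurant, hotel)."""
    name: str
    cost: float


def cheapest(options):
    """Greedy choice: return the option with the lowest cost."""
    return min(options, key=lambda o: o.cost)


def greedy_day_plan(transport, restaurants, hotels, attractions, rng):
    """Build one day of a plan from cost-minimizing choices.

    The baseline description gives no cost criterion for attractions,
    so one attraction is drawn at random.
    """
    return {
        "transport": cheapest(transport).name,
        "restaurant": cheapest(restaurants).name,
        "hotel": cheapest(hotels).name,
        "attraction": rng.choice(attractions),
    }


# Toy data, invented for illustration only.
transport = [Option("flight F1", 320.0), Option("self-driving", 95.0)]
restaurants = [Option("Bistro A", 40.0), Option("Diner B", 12.0)]
hotels = [Option("Hotel X", 150.0), Option("Hostel Y", 45.0)]
attractions = ["Museum", "Park", "Harbor"]

plan = greedy_day_plan(transport, restaurants, hotels, attractions, random.Random(0))
print(plan)
```

Because every choice is made independently per category, a plan built this way can be cheap and still violate commonsense or hard constraints, which is what makes it a useful lower\-bound baseline\.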
Table 1: Experimental results comparing FBOS\-RL with different baselines on TravelPlanner with five criteria\.
##### Evaluation Metrics\.
We adopt the standard metrics of \[[2](https://arxiv.org/html/2605.20256#bib.bib2)\] to evaluate the performance of our method, including Commonsense Constraint Pass Rate, Hard Constraint Pass Rate, and Final Pass Rate\. Detailed mathematical definitions of these metrics are provided in Appendix [B](https://arxiv.org/html/2605.20256#A2)\. For each sample in the MiniF2F \- Lean4 dataset, we adopt the following scoring scheme: a correct proof is awarded $+1$ point; a response in the correct format whose proof is wrong or leaves some steps unfinished is awarded $0$ points; a response in an incorrect format is awarded $-1$ point\. We apply this scheme to every sample and report the mean as the average score\.
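The MiniF2F \- Lean4 scoring scheme above amounts to a three\-way mapping followed by a mean\. The sketch below illustrates it; the `Outcome` enum and the function name are hypothetical, and deciding which outcome a response falls into \(format check plus Lean 4 proof check\) is assumed to happen elsewhere\.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"              # the proof is correct
    WRONG_OR_INCOMPLETE = "wrong"    # correct format, but proof wrong or unfinished
    BAD_FORMAT = "bad_format"        # response is not in the required format


# Points per outcome, as stated in the evaluation scheme.
SCORES = {
    Outcome.CORRECT: 1,
    Outcome.WRONG_OR_INCOMPLETE: 0,
    Outcome.BAD_FORMAT: -1,
}


def average_score(outcomes):
    """Mean score over all evaluated samples."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(SCORES[o] for o in outcomes) / len(outcomes)


outcomes = [
    Outcome.CORRECT,
    Outcome.CORRECT,
    Outcome.WRONG_OR_INCOMPLETE,
    Outcome.BAD_FORMAT,
]
print(average_score(outcomes))  # (1 + 1 + 0 - 1) / 4 = 0.25
```

Under this scheme the average score lies in $[-1, 1]$, and it can be negative when format errors outnumber correct proofs\.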
### 4\.2 Main Results
We first present the experimental results on the TravelPlanner dataset in Table [1](https://arxiv.org/html/2605.20256#S4.T1)\. Figure [3](https://arxiv.org/html/2605.20256#S4.F3) shows the comparison curves between our method and GRPO: on TravelPlanner with the Llama\-3\.1\-8B\-Instruct model \(left\) and the Qwen3\-14B model \(middle\), and on MiniF2F \- Lean4 with the Qwen3\.5\-27B model \(right\)\. The bottom horizontal axis denotes the number of training steps, the top horizontal axis denotes the cumulative number of rollouts, and the vertical axis denotes the final pass rate \(for TravelPlanner\) or the validation score \(for MiniF2F \- Lean4\) on the validation set\.
Figure 3: Performance on the validation set during training: our method \(FBOS\-RL\) vs\. vanilla GRPO\. Left: final pass rate of the Llama\-3\.1\-8B\-Instruct model on TravelPlanner\. Middle: final pass rate of the Qwen3\-14B model on TravelPlanner\. Right: validation score of the Qwen3\.5\-27B model on MiniF2F \- Lean4\. The bottom $x$\-axis denotes the number of training steps, the top $x$\-axis denotes the cumulative number of rollouts, and the $y$\-axis denotes the final pass rate \(for TravelPlanner\) or the validation score \(for MiniF2F \- Lean4\) on the validation set\.
The TravelPlanner dataset partitions samples by difficulty into "easy", "medium", and "hard"\. We also report the final pass rate on the validation set for each of the three difficulty levels during training\. For the Llama\-3\.1\-8B\-Instruct model, the results are shown in Figure [4](https://arxiv.org/html/2605.20256#S4.F4)\.
Figure 4: Final pass rate of the Llama\-3\.1\-8B\-Instruct model on the TravelPlanner validation set during training, broken down by difficulty level: easy \(left\), medium \(middle\), and hard \(right\)\. Our method \(FBOS\-RL\) is compared with vanilla GRPO\.
For the Qwen3\-14B model, the results are shown in Figure [5](https://arxiv.org/html/2605.20256#S4.F5)\. We observe that the advantage of our method grows as the difficulty increases\.
Figure 5: Final pass rate of the Qwen3\-14B model on the TravelPlanner validation set during training, broken down by difficulty level: easy \(left\), medium \(middle\), and hard \(right\)\. Our method \(FBOS\-RL\) is compared with vanilla GRPO\.
We further report the following four metrics on the TravelPlanner validation set during training: Commonsense Constraint Pass Rate \(Micro\), Commonsense Constraint Pass Rate \(Macro\), Hard Constraint Pass Rate \(Micro\), and Hard Constraint Pass Rate \(Macro\)\. The results for the Llama\-3\.1\-8B\-Instruct model and the Qwen3\-14B model are shown in Figure [6](https://arxiv.org/html/2605.20256#S4.F6) and Figure [7](https://arxiv.org/html/2605.20256#S4.F7), respectively\.
Figure 6: Commonsense and hard constraint pass rates of the Llama\-3\.1\-8B\-Instruct model on the TravelPlanner validation set during training: Commonsense Constraint Pass Rate \(Micro\) \(a\), Commonsense Constraint Pass Rate \(Macro\) \(b\), Hard Constraint Pass Rate \(Micro\) \(c\), and Hard Constraint Pass Rate \(Macro\) \(d\)\. Our method \(FBOS\-RL\) is compared with vanilla GRPO\.
Figure 7: Commonsense and hard constraint pass rates of the Qwen3\-14B model on the TravelPlanner validation set during training: Commonsense Constraint Pass Rate \(Micro\) \(a\), Commonsense Constraint Pass Rate \(Macro\) \(b\), Hard Constraint Pass Rate \(Micro\) \(c\), and Hard Constraint Pass Rate \(Macro\) \(d\)\. Our method \(FBOS\-RL\) is compared with vanilla GRPO\.
Furthermore, as shown in Figure [8](https://arxiv.org/html/2605.20256#S4.F8), our method does not exhibit entropy collapse, and its entropy remains higher than that of vanilla GRPO, indicating that our method achieves stronger exploration than vanilla GRPO\.
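For readers unfamiliar with the entropy diagnostic, the sketch below shows one common way to compute it: the Shannon entropy of each next\-token distribution, averaged over token positions\. How the paper’s training framework computes its "actor entropy" is not specified here, so this definition, the function names, and the toy distributions are assumptions\.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(distributions):
    """Average entropy over all token positions in a batch of rollouts."""
    return sum(entropy(d) for d in distributions) / len(distributions)


# Toy distributions over a four-token vocabulary (illustrative only).
peaked = [[0.97, 0.01, 0.01, 0.01]] * 3   # near-deterministic policy
spread = [[0.25, 0.25, 0.25, 0.25]] * 3   # uniform policy, entropy = ln(4)

print(mean_token_entropy(peaked))
print(mean_token_entropy(spread))
```

Entropy collapse corresponds to the policy drifting toward the peaked case, where repeated sampling from the same prompt returns nearly identical rollouts and exploration stalls\.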
Figure 8: Actor entropy during training: our method \(FBOS\-RL\) vs\. vanilla GRPO\. Left: Llama\-3\.1\-8B\-Instruct on TravelPlanner\. Middle: Qwen3\-14B on TravelPlanner\. Right: Qwen3\.5\-27B on MiniF2F \- Lean4\. Across all three settings, our method does not suffer from entropy collapse and consistently maintains higher entropy than vanilla GRPO\.
Figure 9: Gradient norm during training: our method \(FBOS\-RL\) vs\. vanilla GRPO\. Left: Llama\-3\.1\-8B\-Instruct on TravelPlanner\. Middle: Qwen3\-14B on TravelPlanner\. Right: Qwen3\.5\-27B on MiniF2F \- Lean4\. Across all three settings, our method exhibits a lower gradient norm than vanilla GRPO, indicating better training stability\.
Figure 10: OOD generalization to the GPQA\-Diamond dataset: comparison of three models \(Llama\-3\.1\-8B\-Instruct, Qwen3\-14B, and Qwen3\.5\-27B\) before training and after training with our FBOS\-RL method\. The Llama\-3\.1\-8B\-Instruct and Qwen3\-14B models are trained with FBOS\-RL on the TravelPlanner dataset, while the Qwen3\.5\-27B model is trained with FBOS\-RL on the MiniF2F \- Lean4 dataset\. None of the models are further trained on GPQA\-Diamond, so this evaluation directly measures the OOD generalization gains brought by FBOS\-RL\.
Moreover, as shown in Figure [9](https://arxiv.org/html/2605.20256#S4.F9), the gradient norm of our method is lower than that of vanilla GRPO, indicating that our method is more stable than vanilla GRPO\. To evaluate the generality of our method, we conduct experiments on the MiniF2F \- Lean4 dataset with the Qwen3\.5\-27B model\. The results are shown in Figure [3](https://arxiv.org/html/2605.20256#S4.F3) \(right\)\. We further evaluate the trained models on out\-of\-distribution \(OOD\) datasets, with results shown in Figure [10](https://arxiv.org/html/2605.20256#S4.F10)\. Note that the Llama\-3\.1\-8B\-Instruct and Qwen3\-14B models are only trained on the TravelPlanner dataset, while the Qwen3\.5\-27B model is only trained on the MiniF2F \- Lean4 dataset\.
### 4\.3 Ablation Study
To demonstrate that training Objective 1 \(Exploitation\-oriented Policy Alignment, EPA\) and Objective 2 \(Exploration\-oriented Capability Cultivation, ECC\) can mutually reinforce each other, forming a positive flywheel \(bootstrapping\) effect, we conduct in\-depth experiments on the Qwen3\-14B model\.
#### 4\.3\.1 Objective 2 \(ECC\) Boosts Objective 1 \(EPA\)
We design a baseline that only optimizes Objective 1 \(EPA\) during training\.
Objective 2 \(ECC\) trains the model to sample higher\-quality rollouts conditioned on the Feedback\-Augmented Prompt \(FAP\)\. These high\-quality rollouts are then injected into the GRPO group used by Objective 1 \(EPA\)\. In this way, Objective 2 \(ECC\) raises the quality of the rollouts that Objective 1 \(EPA\) leverages during the sampling stage, thereby boosting Objective 1 \(EPA\) and significantly raising the ceiling of policy alignment\. In other words, ECC cultivates the feedback\-conditioned sampling capability by teaching the model how to produce better rollouts based on feedback, and EPA exploits the resulting high\-quality rollouts as high\-quality gradient guidance\.
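A small numerical sketch may help show why injecting a better rollout into the group matters\. It assumes the commonly used GRPO normalization, in which each rollout’s advantage is its reward minus the group mean, divided by the group standard deviation\. The use of the population standard deviation, the `eps` term, and the toy rewards are assumptions, and any additional handling that FBOS\-RL applies to injected rollouts is not modeled\.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout sampled from the original prompt receives the same low reward.
initial = [0.0, 0.0, 0.0, 0.0]
print(group_advantages(initial))   # all advantages are zero: no learning signal

# One higher-quality FAP-conditioned rollout is added to the same group.
injected = initial + [1.0]
print(group_advantages(injected))  # the injected rollout gets a positive advantage
```

When all rewards in a group are equal, every advantage vanishes and the update carries no direction\. A single higher\-reward rollout restores a non\-zero signal that pushes the policy toward it and away from the failed attempts\.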
The figure below reports, at each training step on the training set, the average quality \(left\) and the best \(max\) quality \(right\) of rollouts generated by the model conditioned on the Feedback\-Augmented Prompt \(FAP\)\.
Figure 11: Mean quality \(left\) and max quality \(right\) of rollouts generated by the Qwen3\-14B model conditioned on the Feedback\-Augmented Prompt \(FAP\) at each training step on the training set: our method vs\. the baseline that only trains Objective 1 \(EPA\)\.
In addition, we report results separately for each difficulty level\.
Mean quality of rollouts generated by the model conditioned on the Feedback\-Augmented Prompt \(FAP\):
Figure 12: Mean quality of FAP\-conditioned rollouts on the training set, broken down by difficulty level: easy \(left\), medium \(middle\), and hard \(right\)\. Our method vs\. the baseline that only trains Objective 1 \(EPA\)\.
Max quality of rollouts generated by the model conditioned on the Feedback\-Augmented Prompt \(FAP\):
Figure 13: Max quality of FAP\-conditioned rollouts on the training set, broken down by difficulty level: easy \(left\), medium \(middle\), and hard \(right\)\. Our method vs\. the baseline that only trains Objective 1 \(EPA\)\.
These figures show that introducing Objective 2 \(ECC\) leads the model to generate higher\-quality rollouts under the FAP \(both mean and max quality are higher, across every difficulty level\)\.
The figure below reports, at each training step on the training set, the average quality of rollouts generated during the sampling phase \(including both the initial sampling and the second\-round FAP\-guided sampling\)\.
Figure 14: Mean quality of rollouts generated during the entire sampling phase \(initial sampling and FAP\-guided second\-round sampling combined\) at each training step on the training set: our method vs\. the baseline that only trains Objective 1 \(EPA\)\.
Furthermore, we report the mean and max quality for each difficulty level\.
Mean quality at each difficulty level:
Figure 15: Mean quality of rollouts generated during the sampling phase on the training set, broken down by difficulty level: easy \(left\), medium \(middle\), and hard \(right\)\. Our method vs\. the baseline that only trains Objective 1 \(EPA\)\.
Max quality at each difficulty level:
Figure 16: Max quality of rollouts generated during the sampling phase on the training set, broken down by difficulty level: easy \(left\), medium \(middle\), and hard \(right\)\. Our method vs\. the baseline that only trains Objective 1 \(EPA\)\.
These figures show that introducing Objective 2 \(ECC\) significantly improves the quality of rollouts discovered by the model during the sampling phase\.
On the validation set, we observe that our method significantly outperforms the baseline:
Figure 17: Final pass rate on the TravelPlanner validation set: our method vs\. the baseline that only trains Objective 1 \(EPA\)\.
The figure below reports, during training, the final pass rate on the validation set for each difficulty level \(“easy”, “medium”, “hard”\)\. As difficulty increases, the lead of our method grows larger\.
Figure 18: Final pass rate on the TravelPlanner validation set across different difficulty levels: easy \(left\), medium \(middle\), and hard \(right\)\. Our method vs\. the baseline that only trains Objective 1 \(EPA\)\.
The figure below shows the following four metrics on the validation set during training: Commonsense Constraint Pass Rate \- Micro, Commonsense Constraint Pass Rate \- Macro, Hard Constraint Pass Rate \- Micro, Hard Constraint Pass Rate \- Macro\. Our method significantly outperforms the baseline on all four metrics\.
Figure 19: Commonsense and hard constraint pass rates \(micro and macro\) on the TravelPlanner validation set: our method vs\. the baseline that only trains Objective 1 \(EPA\)\.
This demonstrates that Objective 2 \(ECC\) can effectively boost Objective 1 \(EPA\)\.
#### 4\.3\.2 Objective 1 \(EPA\) Boosts Objective 2 \(ECC\)
We design a baseline that only optimizes Objective 2 \(ECC\) during training\.
The figure below reports, at each training step on the training set, the average quality of rollouts generated by the model during the sampling phase \(including both the initial sampling and the second\-round sampling guided by the Feedback\-Augmented Prompt \(FAP\)\)\.
Figure 20: Mean quality of rollouts generated during the entire sampling phase at each training step on the training set: our method vs\. the baseline that only trains Objective 2 \(ECC\)\.
In addition, we report the mean quality at each difficulty level:
Figure 21: Mean quality of rollouts generated during the sampling phase on the training set, broken down by difficulty level: easy \(left\), medium \(middle\), and hard \(right\)\. Our method vs\. the baseline that only trains Objective 2 \(ECC\)\.
These figures show that introducing Objective 1 \(EPA\) significantly improves the quality of rollouts discovered by the model during the sampling phase\.
On the validation set, we observe that our method significantly outperforms the baseline:
Figure 22: Final pass rate on the TravelPlanner validation set: our method vs\. the baseline that only trains Objective 2 \(ECC\)\.
The figure below reports, during training, the final pass rate on the validation set for each difficulty level \(“easy”, “medium”, “hard”\)\.
Figure 23: Final pass rate on the TravelPlanner validation set across different difficulty levels: easy \(left\), medium \(middle\), and hard \(right\)\. Our method vs\. the baseline that only trains Objective 2 \(ECC\)\.
The figure below shows the following four metrics on the validation set during training: Commonsense Constraint Pass Rate \- Micro, Commonsense Constraint Pass Rate \- Macro, Hard Constraint Pass Rate \- Micro, Hard Constraint Pass Rate \- Macro\. Our method significantly outperforms the baseline on all four metrics\.
Figure 24: Commonsense and hard constraint pass rates \(micro and macro\) on the TravelPlanner validation set: our method vs\. the baseline that only trains Objective 2 \(ECC\)\.
This demonstrates that Objective 1 \(EPA\) can effectively boost Objective 2 \(ECC\)\.
### 4\.4 Controlling for the Number of Parameter Updates
Since our method performs two parameter updates per training step, while standard GRPO performs only one parameter update per training step, we design the following experiment to control for the effect of the number of parameter updates:
In our method, at each training step, Objective 1 \(EPA\) first performs one parameter update using all 72 rollouts, after which Objective 2 \(ECC\) performs an additional parameter update using the 64 rollouts produced by the second\-round sampling\. We construct a strengthened baseline \(referred to as GRPO w/ Extra Update\): at each training step, standard GRPO is also given one additional parameter update\. Concretely, it first performs one parameter update using all 72 sampled rollouts, and then randomly samples 64 rollouts from these 72 to perform an additional parameter update, thereby matching the number of parameter updates in our method\.
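The batch construction for the GRPO w/ Extra Update baseline can be sketched as below\. The function name and the string placeholders for rollouts are hypothetical, sampling without replacement is an assumption \(the text only says "randomly samples"\), and the actual gradient updates are left out\.

```python
import random


def extra_update_batches(rollouts, n_extra, rng):
    """Return the two rollout batches used in one baseline training step.

    The first batch contains every sampled rollout; the second is a random
    subset of the same rollouts, drawn without replacement (an assumption).
    """
    first = list(rollouts)
    second = rng.sample(first, n_extra)
    return first, second


# Placeholder rollouts; in practice these would be sampled responses.
rollouts = [f"rollout_{i}" for i in range(72)]
first, second = extra_update_batches(rollouts, n_extra=64, rng=random.Random(0))

print(len(first), len(second))  # 72 64
```

The key difference from FBOS\-RL is that the second batch here contains no new information: it reuses rollouts from the first update, whereas the ECC update uses the FAP\-conditioned second\-round rollouts\.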
We conduct this controlled experiment on both the Qwen3\-14B model \(Section[4\.4\.1](https://arxiv.org/html/2605.20256#S4.SS4.SSS1)\) and the Llama\-3\.1\-8B\-Instruct model \(Section[4\.4\.2](https://arxiv.org/html/2605.20256#S4.SS4.SSS2)\)\. The two settings share an identical experimental protocol and only differ in the underlying base model\.
#### 4\.4\.1 Qwen3\-14B
The experimental results on the Qwen3\-14B model are as follows:
Figure 25: Final pass rate of the Qwen3\-14B model on the TravelPlanner validation set: our method vs\. the GRPO w/ Extra Update baseline\.
The results across different difficulty levels are as follows:
Figure 26: Final pass rate of the Qwen3\-14B model on the TravelPlanner validation set across different difficulty levels: easy \(left\), medium \(middle\), and hard \(right\)\. Our method vs\. the GRPO w/ Extra Update baseline\.
The four constraint pass rate metrics on the validation set are reported below:
Figure 27: Commonsense and hard constraint pass rates \(micro and macro\) of the Qwen3\-14B model on the TravelPlanner validation set: our method vs\. the GRPO w/ Extra Update baseline\.
These results indicate that our method’s gains do not stem from the additional parameter update; even when the baseline is given the same number of parameter updates per training step, our method still significantly outperforms it\.
#### 4\.4\.2 Llama\-3\.1\-8B\-Instruct
We further repeat the same controlled experiment on the Llama\-3\.1\-8B\-Instruct model\. The experimental setup, including the construction of the GRPO w/ Extra Update baseline and the matching of the number of parameter updates per training step, is identical to that in Section[4\.4\.1](https://arxiv.org/html/2605.20256#S4.SS4.SSS1)\.
The final pass rate on the validation set is reported in Figure[28](https://arxiv.org/html/2605.20256#S4.F28):
Figure 28: Final pass rate of the Llama\-3\.1\-8B\-Instruct model on the TravelPlanner validation set: our method vs\. the GRPO w/ Extra Update baseline\.
The results across different difficulty levels are reported in Figure [29](https://arxiv.org/html/2605.20256#S4.F29):
Figure 29: Final pass rate of the Llama\-3\.1\-8B\-Instruct model on the TravelPlanner validation set across different difficulty levels: easy \(left\), medium \(middle\), and hard \(right\)\. Our method vs\. the GRPO w/ Extra Update baseline\.
The four constraint pass rate metrics on the validation set are reported in Figure [30](https://arxiv.org/html/2605.20256#S4.F30):
Figure 30: Commonsense and hard constraint pass rates \(micro and macro\) of the Llama\-3\.1\-8B\-Instruct model on the TravelPlanner validation set: our method vs\. the GRPO w/ Extra Update baseline\.
Consistent with the observations on the Qwen3\-14B model, on the Llama\-3\.1\-8B\-Instruct model our method also significantly outperforms the GRPO w/ Extra Update baseline across all metrics and difficulty levels, further confirming that the gains of FBOS\-RL do not stem from the additional parameter update introduced by ECC\.
## 5 Conclusion
We propose FBOS\-RL, a Feedback\-Driven Bi\-Objective Synergistic Reinforcement Learning framework\. FBOS\-RL synergistically optimizes two mutually reinforcing training objectives,*Exploitation\-oriented Policy Alignment*\(EPA\) and*Exploration\-oriented Capability Cultivation*\(ECC\), forming a positive self\-bootstrapping flywheel that improves both the training efficiency and the final performance ceiling of reinforcement learning\. Extensive experiments on TravelPlanner and MiniF2F \- Lean4, across Llama\-3\.1\-8B\-Instruct, Qwen3\-14B and Qwen3\.5\-27B, show that FBOS\-RL substantially improves both training efficiency and the final performance ceiling over vanilla GRPO and strong controlled baselines, and exhibits clear OOD generalization to GPQA\-Diamond\. Meanwhile, FBOS\-RL avoids entropy collapse, maintains higher policy entropy, and exhibits a lower gradient norm, evidencing stronger exploration and better training stability\.
##### Limitations\.
FBOS\-RL relies on feedback to enhance the sampling stage, and therefore its effectiveness is limited on tasks for which reliable feedback is hard to obtain\.
## Appendix A Detailed Descriptions of Datasets
In this section, we provide the detailed descriptions of the datasets used in our experiments\.
##### TravelPlanner\.
The TravelPlanner dataset is a benchmark designed to evaluate the planning capabilities of language agents in realistic travel\-planning scenarios\. Each query specifies a travel request \(e\.g\., origin, destination\(s\), duration, budget, and personalized requirements\), and the agent is required to produce a detailed multi\-day itinerary\.
##### MiniF2F \- Lean4\.
The MiniF2F \- Lean4 dataset is a benchmark for formal mathematical reasoning, consisting of Olympiad\-level and high\-school competition problems \(drawn from sources such as AMC, AIME, and IMO\) formalized as theorem\-proving problems in the Lean4 proof assistant\.
## Appendix B Detailed Definitions of Evaluation Metrics
In this section, we provide the detailed definitions of the evaluation metrics used in the main paper, following\[[2](https://arxiv.org/html/2605.20256#bib.bib2)\]\.
- Commonsense Constraint Pass Rate: This metric covers eight commonsense dimensions: whether the cities visited in the itinerary are reasonable, whether restaurant choices are non\-repetitive, whether attractions are non\-repetitive, whether the accommodations are reasonable, whether the transportation modes are reasonable, whether all daily activities \(meals, attractions, accommodations\) take place in the city the traveler is in on that day, whether all referenced information \(e\.g\., restaurant names, flight numbers\) exists in the closed sandbox database, and whether the plan’s information is complete\. It evaluates whether the model can incorporate commonsense knowledge into the generated plan without being explicitly instructed\.
- Hard Constraint Pass Rate: This metric measures whether the generated travel plan satisfies the hard constraints explicitly given in the query \(e\.g\., dietary, accommodation, transportation, and budget constraints\), aiming to test the model’s ability to adjust its plan according to diverse user requirements\.
- Final Pass Rate: This metric denotes the proportion of feasible plans \(i\.e\., those satisfying every commonsense constraint and every hard constraint\) among all evaluated plans, and measures the model’s ability to generate plans that meet practical standards:

  $$\text{Final Pass Rate} = \frac{\sum_{p \in P} \mathbb{I}_{\text{IsSatisfied}(C_{p}^{\text{all}},\, p)}}{|P|}, \tag{B.1}$$

  where $P$ denotes the set of all plans being evaluated, and $C_{p}^{\text{all}}$ denotes the set of all constraints applicable to a specific plan $p$ \(including all commonsense constraints and all hard constraints\)\. $\mathbb{I}_{\text{IsSatisfied}(X,\, Y)}$ is an indicator function that returns 1 if $Y$ satisfies the constraint $X$, and 0 otherwise\.

For both the Commonsense Constraint Pass Rate and the Hard Constraint Pass Rate, we adopt two evaluation strategies: micro and macro\. The micro strategy computes the proportion of satisfied constraints over the total number of constraints, and the macro strategy computes the proportion of plans that satisfy all commonsense constraints \(or all hard constraints\) among the evaluated plans:

$$\text{Micro Pass Rate} = \frac{\sum_{p \in P} \sum_{c \in C_{p}} \mathbb{I}_{\text{IsSatisfied}(c,\, p)}}{\sum_{p \in P} |C_{p}|}, \qquad \text{Macro Pass Rate} = \frac{\sum_{p \in P} \mathbb{I}_{\text{IsSatisfied}(C_{p},\, p)}}{|P|},$$

where $C_{p}$ denotes the set of constraints applicable to a specific plan $p$, and $|C_{p}|$ denotes the number of constraints in this set\. These two strategies evaluate the model’s ability to follow individual constraints and the full set of constraints, respectively\.
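The three pass rates defined above can be computed directly from per\-constraint check results\. The sketch below assumes each plan is represented as a list of booleans, one per applicable constraint; this representation, the function names, and the toy data are illustrative assumptions rather than the benchmark’s evaluation code\.

```python
def micro_pass_rate(results):
    """Satisfied constraints over all constraints, pooled across plans.

    `results` is a list of plans; each plan is a list of booleans, one per
    applicable constraint.
    """
    total = sum(len(plan) for plan in results)
    return sum(sum(plan) for plan in results) / total


def macro_pass_rate(results):
    """Fraction of plans that satisfy every one of their constraints."""
    return sum(all(plan) for plan in results) / len(results)


def final_pass_rate(commonsense, hard):
    """Fraction of plans satisfying all commonsense and all hard constraints."""
    passed = sum(all(c) and all(h) for c, h in zip(commonsense, hard))
    return passed / len(commonsense)


# Toy results for three plans (illustrative only).
commonsense = [[True, True], [True, False], [True, True]]
hard = [[True], [True], [False, True]]

print(micro_pass_rate(commonsense))        # 5 of 6 constraints satisfied
print(macro_pass_rate(commonsense))        # 2 of 3 plans fully satisfied
print(micro_pass_rate(hard))               # 3 of 4 constraints satisfied
print(final_pass_rate(commonsense, hard))  # only the first plan passes both
```

The toy data shows why the metrics diverge: a plan can contribute most of its constraints to the micro rate while still counting as a failure for the macro and final rates\.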
## Appendix C Training Dynamics of the EPA Objective
To empirically demonstrate that the*Exploitation\-oriented Policy Alignment*\(EPA\) objective introduced in Section [3\.3](https://arxiv.org/html/2605.20256#S3.SS3) can be trained in a stable and effective manner, we report its training\-set score and the corresponding standard deviation \(std\) as a function of training steps on two policy models, Llama\-3\.1\-8B\-Instruct \(Figures [31](https://arxiv.org/html/2605.20256#A3.F31) and [32](https://arxiv.org/html/2605.20256#A3.F32)\) and Qwen3\-14B \(Figures [33](https://arxiv.org/html/2605.20256#A3.F33) and [34](https://arxiv.org/html/2605.20256#A3.F34)\)\. Across both models, the training\-set score of EPA exhibits a clear and stable upward trend while its std steadily decreases, which together confirm that EPA can be properly optimized\. Moreover, this positive trend consistently holds on training subsets of different difficulty levels \(easy, medium, hard\), indicating that EPA is able to make robust progress regardless of sample difficulty\.
Figure 31: Training dynamics of the EPA objective on Llama\-3\.1\-8B\-Instruct\. \(a\) Training\-set score of EPA steadily increases along training steps\. \(b\) The corresponding std steadily decreases, indicating that EPA can be optimized in a stable manner\.
Figure 32: Training\-set score of the EPA objective on Llama\-3\.1\-8B\-Instruct, broken down by training\-sample difficulty \(easy, medium, hard, from left to right\)\. The score consistently rises along training steps across all three difficulty levels\.
Figure 33: Training dynamics of the EPA objective on Qwen3\-14B\. \(a\) Training\-set score of EPA steadily increases along training steps\. \(b\) The corresponding std steadily decreases, indicating that EPA can be optimized in a stable manner\.
Figure 34: Training\-set score of the EPA objective on Qwen3\-14B, broken down by training\-sample difficulty \(easy, medium, hard, from left to right\)\. The score consistently rises along training steps across all three difficulty levels\.
## Appendix D Training Dynamics of the ECC Objective
We similarly validate the trainability of the*Exploration\-oriented Capability Cultivation*\(ECC\) objective introduced in Section [3\.3](https://arxiv.org/html/2605.20256#S3.SS3)\. We report the training\-set score and the corresponding standard deviation \(std\) of ECC along training steps on Llama\-3\.1\-8B\-Instruct \(Figures [35](https://arxiv.org/html/2605.20256#A4.F35) and [36](https://arxiv.org/html/2605.20256#A4.F36)\) and Qwen3\-14B \(Figures [37](https://arxiv.org/html/2605.20256#A4.F37) and [38](https://arxiv.org/html/2605.20256#A4.F38)\)\. As with EPA, we observe that the ECC training\-set score steadily rises and its std steadily decreases on both models, and the same trend holds across all three difficulty levels \(easy, medium, hard\)\. These results demonstrate that ECC can also be properly trained, confirming that the model indeed acquires an improved ability to discover higher\-quality rollouts when guided by Feedback\-Augmented Prompts \(FAPs\)\.
Figure 35: Training dynamics of the ECC objective on Llama\-3\.1\-8B\-Instruct\. \(a\) Training\-set score of ECC steadily increases along training steps\. \(b\) The corresponding std steadily decreases, indicating that ECC can be optimized in a stable manner\.
Figure 36: Training\-set score of the ECC objective on Llama\-3\.1\-8B\-Instruct, broken down by training\-sample difficulty \(easy, medium, hard, from left to right\)\. The score consistently rises along training steps across all three difficulty levels\.
Figure 37: Training dynamics of the ECC objective on Qwen3\-14B\. \(a\) Training\-set score of ECC steadily increases along training steps\. \(b\) The corresponding std steadily decreases, indicating that ECC can be optimized in a stable manner\.
Figure 38: Training\-set score of the ECC objective on Qwen3\-14B, broken down by training\-sample difficulty \(easy, medium, hard, from left to right\)\. The score consistently rises along training steps across all three difficulty levels\.
## References
- \[1\]K\. Zheng, J\. M\. Han, and S\. Polu, “Minif2f: a cross\-system benchmark for formal olympiad\-level mathematics,” in*International Conference on Learning Representations \(ICLR\)*, 2022\.
- \[2\]J\. Xie, K\. Zhang, J\. Chen, T\. Zhu, R\. Lou, Y\. Tian, Y\. Xiao, and Y\. Su, “Travelplanner: A benchmark for real\-world planning with language agents,”*arXiv preprint arXiv:2402\.01622*, 2024\.
- \[3\]J\. Yan, Y\. Li, Z\. Hu, Z\. Wang, G\. Cui, X\. Qu, Y\. Cheng, and Y\. Zhang, “Learning to reason under off\-policy guidance,”*arXiv preprint arXiv:2504\.14945*, 2025\.
- \[4\]Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu*et al\.*, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,”*arXiv preprint arXiv:2402\.03300*, 2024\.
- \[5\]J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov, “Proximal policy optimization algorithms,”*arXiv preprint arXiv:1707\.06347*, 2017\.
- \[6\]L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray*et al\.*, “Training language models to follow instructions with human feedback,” in*Advances in Neural Information Processing Systems*, vol\. 35, 2022, pp\. 27 730–27 744\.
- \[7\]Y\. Bai, S\. Kadavath, S\. Kundu, A\. Askell, J\. Kernion, A\. Jones, A\. Chen, A\. Goldie, A\. Mirhoseini, C\. McKinnon*et al\.*, “Constitutional ai: Harmlessness from ai feedback,”*arXiv preprint arXiv:2212\.08073*, 2022\.
- \[8\]R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in*Advances in Neural Information Processing Systems*, 2023\.
- \[9\]A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang*et al\.*, “Self\-refine: Iterative refinement with self\-feedback,” in*Advances in Neural Information Processing Systems*, 2023\.
- \[10\]N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao, “Reflexion: Language agents with verbal reinforcement learning,” in*Advances in Neural Information Processing Systems*, 2023\.
- \[11\]A\. Kumar, V\. Zhuang, R\. Agarwal, Y\. Su, J\. D\. Co\-Reyes, A\. Singh, K\. Baumli, S\. Iqbal, C\. Bishop, R\. Roelofs*et al\.*, “Training language models to self\-correct via reinforcement learning,” 2024\.
- \[12\]Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, T\. Fan, G\. Liu, L\. Liu, X\. Liu*et al\.*, “Dapo: An open\-source llm reinforcement learning system at scale,”*arXiv preprint arXiv:2503\.14476*, 2025\.
- \[13\]H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe, “Let’s verify step by step,”*arXiv preprint arXiv:2305\.20050*, 2023\.
- \[14\]J\. Huang, X\. Chen, S\. Mishra, H\. S\. Zheng, A\. W\. Yu, X\. Song, and D\. Zhou, “Large language models cannot self\-correct reasoning yet,”*arXiv preprint arXiv:2310\.01798*, 2023\.
- \[15\]D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi*et al\.*, “Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning,”*arXiv preprint arXiv:2501\.12948*, 2025\.
- \[16\]P\. Wang, L\. Li, Z\. Shao, R\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui, “Math\-shepherd: Verify and reinforce llms step\-by\-step without human annotations,”*Annual Meeting of the Association for Computational Linguistics \(ACL\)*, 2024\.
- \[17\]Z\. Gou, Z\. Shao, Y\. Gong, Y\. Shen, Y\. Yang, N\. Duan, and W\. Chen, “Critic: Large language models can self\-correct with tool\-interactive critiquing,” in*International Conference on Learning Representations \(ICLR\)*, 2024\.
- \[18\]S\. Welleck, X\. Lu, P\. West, F\. Brahman, T\. Shen, D\. Khashabi, and Y\. Choi, “Generating sequences by learning to self\-correct,” in*International Conference on Learning Representations \(ICLR\)*, 2023\.
- \[19\]W\. Saunders, C\. Yeh, J\. Wu, S\. Bills, L\. Ouyang, J\. Ward, and J\. Leike, “Self\-critiquing models for assisting human evaluators,”*arXiv preprint arXiv:2206\.05802*, 2022\.
- \[20\]W\. Yuan, R\. Y\. Pang, K\. Cho, S\. Sukhbaatar, J\. Xu, and J\. Weston, “Self\-rewarding language models,”*International Conference on Machine Learning \(ICML\)*, 2024\.
- \[21\]A\. Setlur, S\. Garg, X\. Geng, N\. Garg, V\. Smith, and A\. Kumar, “Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight\-fold,”*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2024\.
- \[22\]M\. Liu, S\. Diao, X\. Lu, J\. Hu, X\. Dong, Y\. Choi, J\. Kautz, and Y\. Dong, “Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models,”*arXiv preprint arXiv:2505\.24864*, 2025\.
- \[23\]Y\. Yue, Z\. Chen, R\. Lu, A\. Zhao, Z\. Wang, Y\. Yue, S\. Song, and G\. Huang, “Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?”*arXiv preprint arXiv:2504\.13837*, 2025\.
- \[24\]G\. Cui, Y\. Zhang, J\. Chen, L\. Yuan, Z\. Wang, Y\. Zuo, H\. Li, Y\. Fan, H\. Chen, W\. Chen*et al\.*, “The entropy mechanism of reinforcement learning for reasoning language models,”*arXiv preprint arXiv:2505\.22617*, 2025\.
- \[25\]Y\. Wang, Q\. Yang, Z\. Zeng, L\. Ren, L\. Liu, B\. Peng, H\. Cheng, X\. He, K\. Wang, J\. Gao*et al\.*, “Reinforcement learning for reasoning in large language models with one training example,”*arXiv preprint arXiv:2504\.20571*, 2025\.
- \[26\]Z\. Chen, Y\. Deng, H\. Yuan, K\. Ji, and Q\. Gu, “Self\-play fine\-tuning converts weak language models to strong language models,”*International Conference on Machine Learning \(ICML\)*, 2024\.
- \[27\]T\. Zheng*et al\.*, “Beyond grpo: Tree\-search enhanced reinforcement learning for reasoning,”*arXiv preprint arXiv:2502\.10717*, 2025\.
- \[28\]Y\. Tang*et al\.*, “Exploration–exploitation trade\-off in reinforcement learning for large language models,”*arXiv preprint arXiv:2506\.10202*, 2025\.
- \[29\]Y\. Qu, T\. Zhang, N\. Garg, and A\. Kumar, “Recursive introspection: Teaching language model agents how to self\-improve,”*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2024\.
- \[30\]C\. Snell, J\. Lee, K\. Xu, and A\. Kumar, “Scaling llm test\-time compute optimally can be more effective than scaling model parameters,”*arXiv preprint arXiv:2408\.03314*, 2024\.
- \[31\]Z\. Xi, W\. Yang, R\. Chen, B\. Ding, Y\. Liu, J\. Liu, R\. Zheng, W\. Zhou, T\. Gui, Q\. Zhang, and X\. Huang, “Training large language models for reasoning through reverse curriculum reinforcement learning,” 2024\.
- \[32\]G\. Cui, L\. Yuan, Z\. Wang, H\. Wang, W\. Li, B\. He, Y\. Fan, T\. Yu, Q\. Xu, W\. Chen*et al\.*, “Process reinforcement through implicit rewards,”*arXiv preprint arXiv:2502\.01456*, 2025\.