ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

arXiv cs.AI 05/18/26, 04:00 AM Papers
Summary
This paper introduces ICRL, a framework that jointly trains a solver and critic with reinforcement learning to internalize critique guidance, enabling the solver to improve without external critique. It uses distribution calibration and role-wise group advantage estimation, achieving 6-7 point gains over GRPO on agentic and mathematical reasoning tasks.
arXiv:2605.15224v1 Announce Type: new Abstract: Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning(ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks, and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick-pid/ICRL.
Original Article
View Cached Full Text
Cached at: 05/18/26, 06:31 AM
# Learning to Internalize Self-Critique with Reinforcement Learning
Source: [https://arxiv.org/html/2605.15224](https://arxiv.org/html/2605.15224)
Jianbo Lin1,2Xiaomin Yu1Yi Xin2Yifu Guo3Zhuosong Jiang4 Zhongqi Yue7Weishi Wang6Heqing Zou5Chengwei Qin1†Hui Xiong1 1Hong Kong University of Science and Technology \(Guangzhou\)2Nanjing University 3Sun Yat\-sen University4National University of Singapore 5Nanyang Technological University6SAP7Microsoft Research jianbo\.lin@outlook\.com

###### Abstract

Large language model\-based agents make mistakes, yet critique can often guide the same model toward correct behavior\. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique’s guidance into its underlying capability\. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self\-improvement\. To address this, we propose learning toInternalize self\-Critique withReinforcementLearning\(ICRL\), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique\-induced success into unassisted solver ability\. The critic is rewarded based on the solver’s subsequent performance gain, incentivizing actionable feedback\. To address the distribution shift between critique\-conditioned and critique\-free behavior, ICRL introduces a distribution\-calibration re\-weighting ratio that selectively transfers critique\-guided improvements compatible with the solver’s own prompt distribution\. Additionally, a role\-wise group advantage estimation stabilizes joint optimization across the two roles\. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique\-conditioned behavior\. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3\-4B and Qwen3\-8B as backbones\. Results show consistent improvements, with average gains of 6\.4 points over GRPO on agentic tasks, and 7\.0 points on mathematical reasoning\. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens\. The code is available at[https://github\.com/brick\-pid/ICRL](https://github.com/brick-pid/ICRL)\.

22footnotetext:Corresponding author\.## 1Introduction

Large language model \(LLM\)\-based agents make mistakes when solving complex tasksyang2024sweagent;wang2024openhands;wang2024mobile;qin2025ui;li2025websailor;li2025chain\. As illustrated in Figure[1](https://arxiv.org/html/2605.15224#S1.F1), critique can guide the same model to correct its errors and successfully complete the taskmadaan2023self\-refine;shinn2023reflexion;liu2025trust\_verify;chen2026learning\_self\_verify;gou2023critic;Asai2023SelfRAGLT\. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the guidance from critique into its underlying capability\. Meanwhile, a frozen critic cannot improve its feedback quality over the course of training, limiting the potential for iterative self\-improvement\.

The challenge is*internalization*: how can critique\-guided revisions be transferred into the solver’s critique\-free policy? A fundamental distributional obstacle arises when critique\-based self\-improvement is introduced into training\. As illustrated in Figure[1](https://arxiv.org/html/2605.15224#S1.F1)\(a\), when an agent fails on queryqqbut succeeds after receiving critiquecc, the successful trajectory is sampled from a critique\-conditioned behavior distributionπ\(y∣q,c\)\\pi\(y\\mid q,c\)rather than the solver’s original distributionπ\(y∣q\)\\pi\(y\\mid q\)\. Training on such trajectories reinforces critique\-dependent behavior, where the solver learns to perform well*given*critique, not to perform well*without*itscheurer2023training\-feedback\. Without explicit correction for this distributional mismatch, standard policy optimization produces a biased estimate of the intended critique\-free policy update\.

The quality of critique is equally important\. The ability to diagnose the solver’s failures and propose actionable corrections should itself be learnable and co\-evolve with the solver\. However, existing critique\-based methods typically rely on a frozen critic modelzhang2025critique;tang2025self, whose feedback quality remains static regardless of how the solver updates\. This decoupling limits the agent’s capacity for sustained self\-improvement: as the solver advances, a stale critic may produce increasingly irrelevant or redundant feedback\.

![Refer to caption](https://arxiv.org/html/2605.15224v1/x1.png)Figure 1:Critique can turn failed trajectories into successful revisions, while training should internalize such revision behavior into the critique\-free solver\.In this paper, we proposeICRL, a reinforcement\-learning framework that converts critique\-induced success into unassisted solver ability\. ICRL jointly trains a solver and a critic from a shared backbone, without external teacher models or manually annotated critique data\. The critic is rewarded for producing a critique that improves the solver’s subsequent attempt, creating learning signals tied to direct critique utility\. To internalize critique\-guided revisions, ICRL reconditions revised trajectories under the solver’s critique\-free prompt and applies a token\-level distribution\-calibration re\-weighting ratio\. This ratio selectively transfers tokens whose generation is already plausible under the critique\-free distribution, while down\-weighting tokens that depend heavily on the critique context\. The result is that the solver internalizes revision patterns compatible with its own prompt distribution\. To further stabilize joint optimization, ICRL employs role\-wise group advantage estimation that normalizes solver and critic rewards separately, preserving distinct learning signals for each role\.

To our knowledge, ICRL is the first framework to improve both critique internalization and critic learning in a reinforcement\-learning setting\. We evaluate ICRL on diverse environments, including text\-world tasksALFWorld20, e\-commerce web navigationyao2022webshop, multi\-hop question answeringHotpotQA;2WikiMultiHopQA;Bamboogle;Musique, and mathematical reasoninghendrycks2021math500;lewkowycz2022minerva;he\-etal\-2024\-olympiadbench;numina\_math\_datasets\. Experiments are conducted on Qwen3\-4B and Qwen3\-8B, and compared with prompting\-based baselines, reinforcement\-learning baselines, and critique\-based methods\. Experimental results demonstrate empirical improvements over baselines on both agentic tasks and mathematical reasoning tasks\. We summarize our contributions as follows:

- •We propose the ICRL framework, a solver\-critic reinforcement\-learning framework that enables iterative self\-improvement by jointly learning to critique and to internalize critique\.
- •We introduce a distribution\-calibration re\-weighting ratio that corrects the distributional shift between critique\-conditioned and critique\-free behavior\. We further propose role\-wise group advantage estimation to stabilize joint solver\-critic optimization\.
- •Experimental results demonstrate empirical improvements over baselines on both agentic and mathematical reasoning tasks\. The jointly learned critic achieves performance comparable to 32B frozen critics while using fewer tokens\.

## 2Preliminary

### 2\.1Task Formulation

Given a queryq∈𝒬q\\in\\mathcal\{Q\}, an LLM\-based agentπθ\\pi\_\{\\theta\}, parameterized byθ\\theta, interacts with the environment to sample a trajectoryτ\\tau, and receives rewardr\(τ\)r\(\\tau\)from a task evaluator\. We model agentic tasks as a Partially Observable Markov Decision Process \(POMDP\) followingxi2025agentgym;xi2025agentgym\-rl\. At each time steptt, the agent conditions on historyht=\(q,o0,a0,…,ot\)h\_\{t\}=\(q,o\_\{0\},a\_\{0\},\\ldots,o\_\{t\}\)consisting of the initial query and the sequence of past observations and actions up to timett\. Based on this history, the agent samples an actionat∼πθ\(⋅∣ht\)a\_\{t\}\\sim\\pi\_\{\\theta\}\(\\cdot\\mid h\_\{t\}\)\. This interaction proceeds forHHsteps, resulting in a complete trajectoryτ=\(q,o0,a0,…,aH−1,oH\)\\tau=\(q,o\_\{0\},a\_\{0\},\\ldots,a\_\{H\-1\},o\_\{H\}\)\. The training objective is

J\(θ\)=𝔼τ∼πθ\[r\(τ\)\]\.J\(\\theta\)=\\mathbb\{E\}\_\{\\tau\\sim\\pi\_\{\\theta\}\}\\left\[r\(\\tau\)\\right\]\.\(1\)

### 2\.2Group\-Relative Policy Optimization

We adopt GRPOguo2025deepseekas the underlying reinforcement\-learning primitive\. For a queryqq, GRPO samples a group of candidate trajectories𝒢\(q\)=\{τ1,…,τG\}\\mathcal\{G\}\(q\)=\\\{\\tau\_\{1\},\\dots,\\tau\_\{G\}\\\}and computes the group\-normalized advantage asA^i=r\(τi\)−meanj⁡r\(τj\)stdj⁡r\(τj\)\+δ\\hat\{A\}\_\{i\}=\\frac\{r\(\\tau\_\{i\}\)\-\\operatorname\{mean\}\_\{j\}\\,r\(\\tau\_\{j\}\)\}\{\\operatorname\{std\}\_\{j\}\\,r\(\\tau\_\{j\}\)\+\\delta\}, whereδ\>0\\delta\>0is a small constant for numerical stability\. Letyty\_\{t\}denote thett\-th generated token of sampleτi\\tau\_\{i\}; the importance\-sampling ratioρt\(θ\)=πθ\(yt∣q,y<t\)πθold\(yt∣q,y<t\)\\rho\_\{t\}\(\\theta\)=\\frac\{\\pi\_\{\\theta\}\(y\_\{t\}\\mid q,y\_\{<t\}\)\}\{\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(y\_\{t\}\\mid q,y\_\{<t\}\)\}measures the deviation of the current policy from the behavior policy used to collect the group\. The resulting clipped surrogate objective is

JGRPO\(θ\)=𝔼i,t\[min⁡\(ρt\(θ\)A^i,clip⁡\(ρt\(θ\),1−ϵ,1\+ϵ\)A^i\)\],J\_\{\\mathrm\{GRPO\}\}\(\\theta\)=\\mathbb\{E\}\_\{i,\\,t\}\\\!\\left\[\\min\\\!\\Big\(\\rho\_\{t\}\(\\theta\)\\,\\hat\{A\}\_\{i\},\\;\\operatorname\{clip\}\\\!\\big\(\\rho\_\{t\}\(\\theta\),\\,1\-\\epsilon,\\,1\+\\epsilon\\big\)\\,\\hat\{A\}\_\{i\}\\Big\)\\right\],\(2\)whereiiindexes a trajectory in the sampled group𝒢\(q\)\\mathcal\{G\}\(q\),ttindexes a generated token position within that trajectory, andϵ\\epsilonis the clipping parameter\.

## 3Methodology

![Refer to caption](https://arxiv.org/html/2605.15224v1/x2.png)Figure 2:Overview of the ICRL framework\. \(1\) Rollout with critique alternates solver and critic; \(2\) Policy optimization jointly trains both solver and critic; \(3\) Internalizing critique into solver\.### 3\.1Self\-Improving Workflow

Self\-Improving Workflow\.ICRL instantiates two agent roles, a solver and a critic, from a shared backbone with parametersθ\\thetaand role\-specific promptsp𝒮p^\{\\mathcal\{S\}\}andp𝒞p^\{\\mathcal\{C\}\}\. Specifically, thesolverπθ𝒮\\pi^\{\\mathcal\{S\}\}\_\{\\theta\}generates task\-solving trajectories, while thecriticπθ𝒞\\pi^\{\\mathcal\{C\}\}\_\{\\theta\}generates natural\-language critique after failed attempts\. For each queryqq, we execute an iterative self\-improvement session with at mostKKtotal rounds\. The solver first samples an initial trajectoryτ1∼πθ𝒮\(⋅∣q\)\\tau\_\{1\}\\sim\\pi^\{\\mathcal\{S\}\}\_\{\\theta\}\(\\cdot\\mid q\), and the environment provides an outcome rewardr\(τ1\)r\(\\tau\_\{1\}\)\. If the current trajectoryτi\\tau\_\{i\}fails, the critic analyzes it and generates critiqueci∼πθ𝒞\(⋅∣q,τi\)c\_\{i\}\\sim\\pi^\{\\mathcal\{C\}\}\_\{\\theta\}\(\\cdot\\mid q,\\tau\_\{i\}\)\. The solver then generates a revised trajectory conditioned on the critique,τi\+1∼πθ𝒮\(⋅∣q,ci\)\\tau\_\{i\+1\}\\sim\\pi^\{\\mathcal\{S\}\}\_\{\\theta\}\(\\cdot\\mid q,c\_\{i\}\)\. This iterative process repeats until the task succeeds or the round budget is exhausted\. A complete rollout session may thus contain multiple rounds of solver trajectories and critic outputs, denoted asS=\(τ1,c1,τ2,…,ck−1,τk\)S=\(\\tau\_\{1\},c\_\{1\},\\tau\_\{2\},\\ldots,c\_\{k\-1\},\\tau\_\{k\}\)withk≤Kk\\leq K\.

Solver and Critic Reward\.The policy optimization is guided by separate reward signals for each role\. For the solver, we use the task outcome rewardr\(τ\)∈\[0,1\]r\(\\tau\)\\in\[0,1\]for optimization, wherer\(τ\)=1r\(\\tau\)=1indicates successful task completion\. For the critic, the reward measures whether its critique improves the solver’s subsequent performance\. Specifically, after the critic produces a critiquecic\_\{i\}for the solver trajectoryτi\\tau\_\{i\}\(i<ki<k\), the solver generates a revised trajectoryτi\+1\\tau\_\{i\+1\}conditioned oncic\_\{i\}\. As shown in Eq\. \([3](https://arxiv.org/html/2605.15224#S3.E3)\), the critic receives a reward of11if the revised trajectory succeeds; otherwise, its reward equals the temporal improvement in the solver’s reward, which is nonzero only when the environment provides non\-binary dense rewards\.

r\(ci\)=\{1,ifτi\+1succeeds,r\(τi\+1\)−r\(τi\),otherwise\.r\(c\_\{i\}\)=\\begin\{cases\}1,&\\text\{if \}\\tau\_\{i\+1\}\\text\{ succeeds\},\\\\ r\(\\tau\_\{i\+1\}\)\-r\(\\tau\_\{i\}\),&\\text\{otherwise\}\.\\end\{cases\}\(3\)This design provides learning signals tied to downstream revision utility, where the critic is rewarded not for producing plausible\-sounding feedback, but for feedback that demonstrably helps the solver\.

### 3\.2Self\-Improvement Policy Optimization

Self\-improvement introduces two challenges absent from standard policy optimization\.\(i\) Mixed prompt prefixes\.GRPO relies on group\-relative comparisons among samples generated under a shared prompt prefix and behavior distribution\. In our workflow, however, initial attempts, critiques, and revisions are generated under different role\-specific and information\-conditioned prompts, so their rewards are not directly comparable within a single group\.\(ii\) Distributional shift\.The critic changes the solver’s effective sampling distribution\. Initial attempts are drawn from the critique\-free solver behavior distributionπθrollout𝒮\(⋅∣q\)\\pi\_\{\\theta\_\{\\mathrm\{rollout\}\}\}^\{\\mathcal\{S\}\}\(\\cdot\\mid q\), whereas revised attempts are drawn from a critique\-conditioned behavior distributionπθrollout𝒮\(⋅∣q,c\)\\pi\_\{\\theta\_\{\\mathrm\{rollout\}\}\}^\{\\mathcal\{S\}\}\(\\cdot\\mid q,c\)\. Treating revised trajectories as if they were sampled from the critique\-free solver distribution yields a biased estimate of the intended policy update \(see Appendix[A](https://arxiv.org/html/2605.15224#A1)for detailed analysis\)\.

The revised trajectories are valuable because they contain solutions discovered through the critic’s feedback\. However, these trajectories are sampled from a critique\-conditioned behavior distributionπθrollout𝒮\(⋅∣q,c\)\\pi\_\{\\theta\_\{\\mathrm\{rollout\}\}\}^\{\\mathcal\{S\}\}\(\\cdot\\mid q,c\), they cannot be used to update the critique\-free policy without correction\.

Critique\-Conditioned Distribution Calibration\.To optimize critique\-guided revisions under the solver’s original behavior distribution, we recondition each revised trajectory by removing the critique from its prompt and treating it as evidence forπθ𝒮\(⋅∣q\)\\pi\_\{\\theta\}^\{\\mathcal\{S\}\}\(\\cdot\\mid q\)\. We then introduce a token\-level reweight ratio to calibrate this behavior distribution mismatch\. For a self\-improved round\(τi,ci,τi\+1\)\(\\tau\_\{i\},c\_\{i\},\\tau\_\{i\+1\}\), whereτi∼πθrollout𝒮\(⋅∣q\)\\tau\_\{i\}\\sim\\pi\_\{\\theta\_\{\\mathrm\{rollout\}\}\}^\{\\mathcal\{S\}\}\(\\cdot\\mid q\)andτi\+1∼πθrollout𝒮\(⋅∣q,ci\)\\tau\_\{i\+1\}\\sim\\pi\_\{\\theta\_\{\\mathrm\{rollout\}\}\}^\{\\mathcal\{S\}\}\(\\cdot\\mid q,c\_\{i\}\)\. We define the token\-wise reweight ratio as

wt=\{πθrollout𝒮\(yt∣q,y<t\)πθrollout𝒮\(yt∣q,c,y<t\),critique\-guided solver trajectories,1,otherwise\.w\_\{t\}=\\begin\{cases\}\\frac\{\\pi^\{\\mathcal\{S\}\}\_\{\\theta\_\{\\mathrm\{rollout\}\}\}\(y\_\{t\}\\mid q,y\_\{<t\}\)\}\{\\pi^\{\\mathcal\{S\}\}\_\{\\theta\_\{\\mathrm\{rollout\}\}\}\(y\_\{t\}\\mid q,c,y\_\{<t\}\)\}\\,,&\\text\{critique\-guided solver trajectories\},\\\\ 1,&\\text\{otherwise\}\.\\end\{cases\}\(4\)
Algorithm 1ICRL Policy Optimization1:Problem batch𝒬\\mathcal\{Q\}, group sizeGG, maximum roundsKK, current parametersθ\\theta

2:foreach problemq∈𝒬q\\in\\mathcal\{Q\}do

3:𝒢𝒮\(q\)←∅\\mathcal\{G\}^\{\\mathcal\{S\}\}\(q\)\\leftarrow\\emptyset,𝒢𝒞\(q\)←∅\\mathcal\{G\}^\{\\mathcal\{C\}\}\(q\)\\leftarrow\\emptyset

4:forGGindependent sessionsdo

5:Sampleτ1∼πθrollout𝒮\(⋅∣q\)\\tau\_\{1\}\\sim\\pi^\{\\mathcal\{S\}\}\_\{\\theta\_\{\\mathrm\{rollout\}\}\}\(\\cdot\\mid q\)

6:Compute solver rewardr𝒮\(τ1\)∈\[0,1\]r^\{\\mathcal\{S\}\}\(\\tau\_\{1\}\)\\in\[0,1\]

7:Add\(q,τ1,r𝒮\(τ1\)\)\(q,\\tau\_\{1\},r^\{\\mathcal\{S\}\}\(\\tau\_\{1\}\)\)to𝒢𝒮\(q\)\\mathcal\{G\}^\{\\mathcal\{S\}\}\(q\)

8:fori=1i=1toK−1K\-1do

9:ifr𝒮\(τi\)=1r^\{\\mathcal\{S\}\}\(\\tau\_\{i\}\)=1then

10:break

11:Sampleci∼πθrollout𝒞\(⋅∣q,τi\)c\_\{i\}\\sim\\pi^\{\\mathcal\{C\}\}\_\{\\theta\_\{\\mathrm\{rollout\}\}\}\(\\cdot\\mid q,\\tau\_\{i\}\)

12:Sampleτi\+1∼πθrollout𝒮\(⋅∣q,ci\)\\tau\_\{i\+1\}\\sim\\pi^\{\\mathcal\{S\}\}\_\{\\theta\_\{\\mathrm\{rollout\}\}\}\(\\cdot\\mid q,c\_\{i\}\)

13:Compute solver rewardr𝒮\(τi\+1\)r^\{\\mathcal\{S\}\}\(\\tau\_\{i\+1\}\)

14:Compute critic rewardr𝒞\(ci\)r^\{\\mathcal\{C\}\}\(c\_\{i\}\)via Eq\. \([3](https://arxiv.org/html/2605.15224#S3.E3)\)

15:Add\(q,τi,ci,r𝒞\(ci\)\)\(q,\\tau\_\{i\},c\_\{i\},r^\{\\mathcal\{C\}\}\(c\_\{i\}\)\)to𝒢𝒞\(q\)\\mathcal\{G\}^\{\\mathcal\{C\}\}\(q\)

16:Add\(q,ci,τi\+1,r𝒮\(τi\+1\)\)\(q,c\_\{i\},\\tau\_\{i\+1\},r^\{\\mathcal\{S\}\}\(\\tau\_\{i\+1\}\)\)to𝒢𝒮\(q\)\\mathcal\{G\}^\{\\mathcal\{S\}\}\(q\)

17:ComputeA^i𝒮\\hat\{A\}\_\{i\}^\{\\mathcal\{S\}\}andA^i𝒞\\hat\{A\}\_\{i\}^\{\\mathcal\{C\}\}via Eq\. \([5](https://arxiv.org/html/2605.15224#S3.E5)\)

18:Compute reweight ratioswtw\_\{t\}for revised solver trajectories via Eq\. \([4](https://arxiv.org/html/2605.15224#S3.E4)\)

19:Updateθ\\thetaby maximizing the multi\-role objectiveJ\(θ\)J\(\\theta\)in Eq\. \([6](https://arxiv.org/html/2605.15224#S3.E6)\)

This ratio measures whether a critique\-guided token is already plausible under the critique\-free solver behavior distribution\. Tokens whose probability remains high without the critique are transferred strongly toπθ𝒮\(⋅∣q\)\\pi^\{\\mathcal\{S\}\}\_\{\\theta\}\(\\cdot\\mid q\), while tokens that depend heavily on the critic context are downweighted\. Whenwt≈1w\_\{t\}\\approx 1, the token’s generation does not depend on the critique context and can be directly transferred to the critique\-free policy\. Whenwt≪1w\_\{t\}\\ll 1, the token relies heavily on the critique and is downweighted to avoid reinforcing critique\-dependent behavior\. Conversely,wt\>1w\_\{t\}\>1indicates tokens that the critique\-free solver would have generated with higher probability, which are upweighted\. In this way, the solver internalizes revision behavior that is compatible with its own prompt distribution instead of blindly imitating critic\-assisted outputs\.

Following the GRPO importance\-sampling ratio introduced in Section[2](https://arxiv.org/html/2605.15224#S2), we useρt\(θ\)\\rho\_\{t\}\(\\theta\)to denote the token\-level importance\-sampling ratio\. For solver trajectories, this ratio is computed under the critique\-free prompt context\(q,y<t\)\(q,y\_\{<t\}\)after removing the critique; critic trajectories are evaluated under their original prompt\(q,τi\)\(q,\\tau\_\{i\}\)\. The reweight ratiowtw\_\{t\}defaults to11for all non\-revised trajectories, whileρt\(θ\)\\rho\_\{t\}\(\\theta\)retains its standard role as the clipped GRPO importance\-sampling ratio\.

Role\-wise Advantage Estimation\.Solver and critic trajectories are generated under different prompt prefixes with distinct reward functions: the solver is optimized for task completion, while the critic is optimized for revision utility\. Directly normalizing these heterogeneous rewards within a single group would violate the relative\-comparison principle and produce misleading advantage estimates\. We therefore compute the group\-relative advantage separately for each role\. Formally, for each queryqqand roleg∈\{𝒮,𝒞\}g\\in\\\{\\mathcal\{S\},\\mathcal\{C\}\\\}, we collect a role\-specific group𝒢g\(q\)=\{τ1g,…,τGgg\}\\mathcal\{G\}^\{g\}\(q\)=\\\{\\tau\_\{1\}^\{g\},\\dots,\\tau\_\{G\_\{g\}\}^\{g\}\\\}and compute the advantage within that group:

A^ig=r\(τig\)−meanj⁡r\(τjg\)stdj⁡r\(τjg\)\+δ,g∈\{𝒮,𝒞\}\.\\hat\{A\}\_\{i\}^\{g\}=\\frac\{r\(\\tau\_\{i\}^\{g\}\)\-\\operatorname\{mean\}\_\{j\}\\,r\(\\tau\_\{j\}^\{g\}\)\}\{\\operatorname\{std\}\_\{j\}\\,r\(\\tau\_\{j\}^\{g\}\)\+\\delta\},\\quad g\\in\\\{\\mathcal\{S\},\\mathcal\{C\}\\\}\.\(5\)
whereδ\>0\\delta\>0is a small constant for numerical stability\. This role\-wise baseline preserves the relative\-comparison principle while respecting the distinct semantics of each role\.

Policy Optimization Objective\.Our final objective is a GRPO\-style clipped update over sampled trajectories and token positions\. For brevity, we omit the role superscript inρt\(θ\)\\rho\_\{t\}\(\\theta\)when the role is clear from the trajectoryτ\\tau:

J\(θ\)=𝔼τ,t\[min⁡\(wt,wmax\)min⁡\(ρt\(θ\)A^\(τ\),clip⁡\(ρt\(θ\),1−ϵ,1\+ϵ\)A^\(τ\)\)\],J\(\\theta\)=\\mathbb\{E\}\_\{\\tau,\\,t\}\\\!\\left\[\\min\\\!\\big\(w\_\{t\},w\_\{\\mathrm\{max\}\}\\big\)\\,\\min\\\!\\Big\(\\rho\_\{t\}\(\\theta\)\\hat\{A\}\(\\tau\),\\;\\operatorname\{clip\}\\\!\\big\(\\rho\_\{t\}\(\\theta\),1\-\\epsilon,1\+\\epsilon\\big\)\\hat\{A\}\(\\tau\)\\Big\)\\right\],\(6\)
where only critique\-guided revised solver trajectories receive the distribution\-calibration reweight \(wtw\_\{t\}from Eq\. \([4](https://arxiv.org/html/2605.15224#S3.E4)\)\), while initial solver and critic trajectories retainwt=1w\_\{t\}=1\. The upper boundwmaxw\_\{\\mathrm\{max\}\}prevents excessively large weights that could arise when the critique\-free probability substantially exceeds the critique\-conditioned probability, thereby bounding gradient variance\. Since the solver and critic are two prompted roles instantiated from the same backbone, maximizing Eq\. \([6](https://arxiv.org/html/2605.15224#S3.E6)\) jointly trains both roles through their role\-specific advantages and rewards\. The solver learns from initial and calibrated revised trajectories, progressively internalizing critique\-guided improvements into its critique\-free policy\. The critic simultaneously learns to produce feedback that maximizes the improvement of the downstream solver\.

## 4Experiments

### 4\.1Experimental Setup

Environments\.To comprehensively evaluate ICRL, we conduct experiments on four types of tasks\. \(1\) Text world: we use ALFWorldALFWorld20, a text\-based environment that simulates embodied household tasks requiring multi\-step navigation\. \(2\) Web navigation: we employ WebShopyao2022webshop, an e\-commerce website environment that requires agents to navigate, search for, and purchase products\. \(3\) Multi\-hop question answering: we evaluate on multi\-hop question answering tasks using an RAG\-based search environment, including HotpotQAHotpotQA, 2WikiMultiHopQA2WikiMultiHopQA, BamboogleBamboogle, and MuSiQueMusique\. \(4\) Mathematical reasoning: we evaluate on five benchmarks, including MATH500hendrycks2021math500, Minerva Mathlewkowycz2022minerva, OlympiadBenchhe\-etal\-2024\-olympiadbench, AIME24numina\_math\_datasets, and AMC23numina\_math\_datasets\. These datasets consist of high\-school and college\-level math problems\. Detailed environment descriptions are provided in Appendix[D](https://arxiv.org/html/2605.15224#A4)\.

Baselines and Backbone Models\.We compare our method with different baselines\. For prompting\-based baselines, we use off\-the\-shelf models, including Qwen3\-4Byang2025qwen3, Qwen3\-8Byang2025qwen3, Qwen3\-30B\-A3Byang2025qwen3, Gemini\-2\.5\-Flashcomanici2025gemini25pushingfrontier, and Gemini\-3\-Flashgoogle2026gemini3flash\. For single\-agent RL baselines, we include GRPOguo2025deepseekand GSPOzheng2025group\. To compare against agent\-oriented training methods, we further include ScalingInter\-RLxi2025agentgym\-rl, which gradually increases the interaction horizon; MATPOmo2025matpo, which trains the planner and subagent through role\-specific policy optimization; and Critique\-GRPOzhang2025critique\(self\-critique\), which introduces natural\-language critiques for critique\-guided policy optimization\. For backbone models, we consider Qwen3\-4B and Qwen3\-8B\.

Table 1:Main results on agentic tasks\.Boldindicates the best performance within each group\.
### 4\.2Results on Agentic Tasks

As shown in Table[1](https://arxiv.org/html/2605.15224#S4.T1), ICRL improves agentic\-task performance across the three agentic environments\. On Qwen3\-4B, ICRL achieves the best average score, reaching 57\.0%\. It improves over GRPO by 7\.8 points and surpasses Critique\-GRPO by 1\.1 points on average\. Specifically, ICRL achieves the best success rate on ALFWorld and WebShop, indicating improved decision quality in long\-horizon environments\. On 2WikiMultiHopQA and MuSiQue, ICRL also achieves the best results, and on HotpotQA and Bamboogle, it remains competitive with the best\-performing baselines\.

On Qwen3\-8B, ICRL further obtains the highest average score of 57\.8%, improving over GRPO and Critique\-GRPO by 5\.0 points and 1\.2 points, respectively\. On the WebShop environment, ICRL achieves the best success rate and reward\. On ALFWorld, ICRL also achieves competitive results\. For multi\-hop search tasks, ICRL achieves the best results on 2WikiMultiHopQA under both backbones and remains competitive on MuSiQue, HotpotQA, and Bamboogle, although it is not uniformly the best on all datasets\. Overall, these results suggest that ICRL is effective on the evaluated agentic environments and backbone scales\.

### 4\.3Results on Math Tasks

As shown in Table[2](https://arxiv.org/html/2605.15224#S4.T2), ICRL performs effectively on mathematical reasoning tasks\. On Qwen3\-8B, SFT provides only moderate improvements, increasing the average score from 55\.0% to 59\.2%, while GRPO reaches 68\.3% by directly optimizing answer correctness\. Critique\-GRPO further improves the average score to 73\.3%, showing the benefit of critique information\. ICRL achieves the best overall performance with an average score of 75\.3%, outperforming GRPO and Critique\-GRPO by 7\.0 points and 2\.0 points, respectively\. The improvements are especially clear on challenging competition\-style benchmarks, such as OlympiadBench and AIME24, where ICRL improves over GRPO from 65\.6% to 68\.9% and from 50\.0% to 65\.1%\. Compared with Critique\-GRPO, ICRL performs better on four out of five benchmarks, with the only exception being AMC23\. These math results are consistent with ICRL better internalizing critique\-guided improvements, which is associated with stronger direct reasoning performance even without critique conditioning\.

Table 2:Main results on math reasoning tasks\.Boldindicates the best performance\.

## 5Discussion

### 5\.1Test\-Time Self\-Improvement

To examine whether the learned critic can further improve agents at inference time, we evaluate multi\-round test\-time refinement on ALFWorld, WebShop, and SearchQA\. As shown in Figure[3](https://arxiv.org/html/2605.15224#S5.F3), all methods improve as the number of refinement rounds increases, but the sources of improvement differ\. GRPO exhibits only gradual gains, suggesting that repeated attempts mainly provide additional sampling opportunities rather than targeted correction\. Critic\-GRPO improves more consistently by conditioning later attempts on critique information, but its gains remain limited across all three environments\. In contrast, ICRL achieves both stronger first\-round performance and larger improvements across refinement rounds, indicating that the learned critic provides feedback that is more useful for revising failed trajectories\.

The advantage of ICRL is consistent across task types\. On ALFWorld, ICRL starts near the strongest baseline and continues to improve, reaching 98% success rate by the third round\. On WebShop, it improves from 76% to over 83\.5%, while maintaining a clear margin over both GRPO and Critic\-GRPO throughout refinement\. A similar trend appears on SearchQA, where ICRL increases from roughly 37\.5% to 45% and consistently outperforms the critique\-based baseline\. These results suggest that ICRL does not merely benefit from additional inference\-time computation; rather, its critic learns actionable diagnostic signals that help the solver identify previous errors and make more effective revisions\. This is consistent with the hypothesis that joint solver\-critic training can produce a self\-improvement mechanism that remains useful at test time\.

![Refer to caption](https://arxiv.org/html/2605.15224v1/fig/tts.png)Figure 3:Test\-time self\-improvement performance on ALFWorld, WebShop, and SearchQA\.
### 5\.2Training Dynamics

Training Stability\.Stable optimization is crucial for ICRL, since the solver and critic are trained under a shared backbone\. Figure[4](https://arxiv.org/html/2605.15224#S5.F4)reports the training dynamics of ICRL\. The solver reward steadily increases from around 40% to above 90%, showing that the solver improves its task performance throughout training\. The critic reward also improves, starting from around 26% and rising to around 60%\. This indicates that the critic learns to generate more useful critiques\. It is also observed that the critic reward curve is lower and exhibits greater fluctuations, which suggests that learning to critique is more challenging and less directly tied to task success than training the solver\. Meanwhile, the gradient norm remains bounded throughout training, suggesting that ICRL maintains stable updates and that the role\-wise advantage estimation and re\-weighting ratio are effective in practice\.

Re\-weighting Ratio Dynamics\.The re\-weighting ratiowtw\_\{t\}provides a view of how critique\-guided revisions are transferred back to the critique\-free solver distribution\. Early in training,wtw\_\{t\}stays below the unity baseline, indicating that many successful revised actions still depend strongly on the critic context and should therefore receive a conservative update when used to train the direct solver\. As training proceeds,wtw\_\{t\}increases steadily, suggesting that more tokens from critique\-guided revisions become compatible with the solver’s original prompt distribution\. This trend supports the intended effect of distribution calibration: early training avoids blindly imitating critic\-dependent behavior, while later training allows more revised trajectories to contribute once the solver begins to internalize the underlying correction pattern\.

![Refer to caption](https://arxiv.org/html/2605.15224v1/fig/dynamics.png)Figure 4:Training dynamics on ALFWorld\. \(1\) Reward curve of the solver agent\. \(2\) Reward curve of the critic agent\. \(3\) Curve of re\-weighting ratiowtw\_\{t\}\(Eq\. \([4](https://arxiv.org/html/2605.15224#S3.E4)\)\)\. \(4\) The curve of gradient norm\.
### 5\.3Learning to Critique

To investigate whether the critic learns useful feedback generation, we conduct a critic\-swap evaluation by fixing the solver as ICRL\-8B and replacing only the critic used for revision\. As shown in Table[3](https://arxiv.org/html/2605.15224#S5.T3), the learned ICRL\-8B critic provides highly effective feedback despite being much smaller than the frozen reference critics\. It achieves a 95\.0% success rate on ALFWorld, matching or surpassing larger critics, and improves WebShop performance from 76\.0% to 78\.5%\. Notably, it does so with much shorter critiques, using only 57\.0 tokens on ALFWorld and 93\.9 tokens on WebShop, far fewer than the 20B and 32B critics\. These results suggest that, in our evaluation, effective criticism is not explained by scale or verbosity alone\. Instead, by optimizing the critic for downstream revision utility, ICRL learns solver\-specific, concise, and actionable feedback that aligns with the solver’s failure modes while reducing inference cost\.

Table 3:Critic\-swap evaluation\. The revision solver is fixed as ICRL\-8B, and only the critic is replaced\. Direct denotes no critique\-based revision\. Token denotes the average critique length\.
### 5\.4Ablation Studies

Table 4:Ablation studies of ICRL\. Each variant removes one component from the full method\. Avg\. reports the average score across the evaluated benchmarks\.As shown in Table[4](https://arxiv.org/html/2605.15224#S5.T4), removing either component consistently degrades performance across all evaluated benchmarks\. Without role\-wise advantage normalization, the average score drops from 69\.8% to 68\.4%, indicating that solver and critic samples should not be normalized under a single shared reward scale\. Since the solver is optimized for task completion while the critic is optimized for improving the solver’s next attempt, role\-wise normalization preserves their distinct learning signals during joint training\. Removing the re\-weight ratio causes a drop to 67\.8%, suggesting that the critique\-conditioned response should not be optimized as a critique\-free solver rollout\. By re\-weighting these revised trajectories, ICRL better aligns the training objective with the critique\-free behavior distribution\. Overall, the ablation results show that the gains of ICRL come not only from using a critic, but also from stabilizing the joint solver\-critic optimization through role\-aware advantage estimation and re\-weight ratio adjustment\.

## 6Related Work

Critique and self\-critique\.Natural\-language critique, feedback, and reflection have been widely used to improve language\-model reasoning and agent behavior\. Methods such as Self\-Refinemadaan2023self\-refineand Reflexionshinn2023reflexionprompt models to analyze previous attempts and generate revised answers or trajectories\. Other works introduce external or self\-generated critics to detect mistakes, verify reasoning, or improve retrieval and generation qualitygou2023critic;Asai2023SelfRAGLT;liu2025trust\_verify;chen2026learning\_self\_verify\. Recent studies further show that models can improve by identifying errors in their own answers or intermediate reasoning steps and revising them accordinglyweng2023large;zhang2025incentivizing;ma2025s2r\. These methods demonstrate that critique can provide effective local guidance: a failed attempt can often be repaired when the model is conditioned on appropriate feedback\. However, they primarily improve critique\-conditioned behavior at inference time, and do not guarantee that the model internalizes the guidance into its critique\-free policy\.

Policy Optimization\.Policy\-gradient methods have become a central tool for post\-training language models and agents\. Algorithms such as PPOschulman2017proximal, GRPOguo2025deepseek, and GSPOzheng2025grouphave shown strong potential in improving both reasoning models and interactive agents\. Recent work studies stabilization strategies for agent RLchen2025minimax;zheng2025stabilizing, as well as curriculum or interaction\-scaling methods for long\-horizon trainingxi2025agentgym\-rl\. Beyond single\-agent optimization, researchers have started to explore reinforcement learning for multi\-agent or multi\-role systems, including different agent architectures and role\-specific training objectiveszhao2025stronger\-mas;feng2026dr\-mas;mo2025matpo\. Furthermore, expert data or expert models are incorporated into the training processluffy;zhang2025critique;zhang2025bread\. Specifically, Critique\-GRPOzhang2025critiqueleverages critic feedback for policy optimization, but still relies on a static critic model\. While these methods improve agent training, they typically do not study how critique\-guided revisions can be internalized into a critique\-free solver\.

## 7Conclusions

In this work, we presented ICRL, a reinforcement\-learning framework that jointly learns to critique and to internalize self\-critique\. ICRL converts critique\-induced success into unassisted solver ability through two key mechanisms: a distribution\-calibration re\-weight ratio that selectively transfers critique\-guided improvements compatible with the solver’s critique\-free distribution, and a role\-wise group advantage estimation that stabilizes joint optimization of two roles\. Extensive experiments across different models demonstrate its superiority over strong baselines on both agentic and mathematical reasoning tasks\. Our analyses further show that the learned critic remains useful at test time and provides concise feedback comparable to much larger frozen critics, suggesting that self\-critique may be most effective when critique generation and critique internalization are learned together\.

## References

## Appendix AAnalysis of Critique\-Conditioned Trajectories

Critique\-guided trajectories are not sampled from the same behavior distribution as a critique\-free solver\. The initial attempt is drawn fromπθrollout𝒮\(τ∣q\)\\pi\_\{\\theta\_\{\\mathrm\{rollout\}\}\}^\{\\mathcal\{S\}\}\(\\tau\\mid q\), while a revised trajectory is generated under an extra critiqueccand is therefore drawn fromπθrollout𝒮\(τ∣q,c\)\\pi\_\{\\theta\_\{\\mathrm\{rollout\}\}\}^\{\\mathcal\{S\}\}\(\\tau\\mid q,c\)\. This means revised trajectories are informative, but they are distribution\-shifted relative to the critique\-free prompt\. If they are used directly, the update targets the critique\-conditioned behavior distribution rather than the intended critique\-free behavior distribution, which is exactly the mismatch we want to control\.

The re\-weight in Eq\. \([4](https://arxiv.org/html/2605.15224#S3.E4)\) is meant to transfer only the parts of a trajectory that remain plausible once the critique is removed\. For a tokenyty\_\{t\}in a revised trajectory, the critique\-free and critique\-conditioned behavior mismatch is captured bywt=πθrollout𝒮\(yt∣q,y<t\)/πθrollout𝒮\(yt∣q,c,y<t\)w\_\{t\}=\\pi\_\{\\theta\_\{\\mathrm\{rollout\}\}\}^\{\\mathcal\{S\}\}\(y\_\{t\}\\mid q,y\_\{<t\}\)/\\pi\_\{\\theta\_\{\\mathrm\{rollout\}\}\}^\{\\mathcal\{S\}\}\(y\_\{t\}\\mid q,c,y\_\{<t\}\)\. We clip this weight in practice to avoid high\-variance of the gradient\. The resulting update downweights tokens that depend too heavily on the critique context and keeps the solver focused on behaviors it can internalize under its original prompt distribution\.

## Appendix BLimitations

A potential limitation of ICRL is that it relies on the critique for revision, which may lead to longer rollout times for long\-tail trajectories under synchronous RL training\. In practice, this can create a bottleneck when difficult samples require multiple rounds of critique and revision before they converge\. It may also reduce overall throughput if the training pipeline must wait for the slowest trajectories in each synchronous update\. In our current training setup, we set the iteration round to 2 to improve training efficiency\. The use of asynchronous trainingfu2025arealremains unexplored and may further improve the efficiency of iterative training by decoupling these slow paths from the main optimization loop\.

## Appendix CTraining Settings

For all agentic environments, including ALFWorld, WebShop, and SearchQA, we use the AgentGym frameworkxi2025agentgym, for math environment, we follow the same setting from Critique\-GRPOzhang2025critique\.

Training Settings\.We set the maximum number of agent\-environment interactions to 30 turns\. We use the Adam optimizer with a constant learning rate of1×10−61\\times 10^\{\-6\}, a weight decay of0\.10\.1, andβ1=0\.9,β2=0\.98\\beta\_\{1\}=0\.9,\\beta\_\{2\}=0\.98\. For each query, we sample 8 rollouts with a temperature of 1\.0, a maximum context length of 16384, a maximum response length of 2048 for each turn\. By default, training runs with tensor parallel size 2, sequence parallelism, dynamic batch size, and a maximum of 16384 tokens per GPU\. For ICRL settings, the max iteration round is set to 2, and the re\-weight ratio upper boundwmaxw\_\{\\mathrm\{max\}\}is set to 2\.

Compute Resources\.For ALFWorld, WebShop and Math environments, Qwen3\-4B experiments are trained on 2 H100 GPUs, and Qwen3\-8B experiments are trained on 4 H100 GPUs\. For SearchQA, we use 4 H100 GPUs and 8 H100 GPUs for 4B and 8B experiments, respectively\.

## Appendix DEnvironments Setting

### D\.1ALFWorld

ALFWorld is a text\-based household environment that aligns abstract TextWorld interaction with embodied household tasks from ALFREDALFWorld20\. In this environment, the agent receives a natural\-language household goal and must complete it through multi\-turn interaction, including navigation, object manipulation, and receptacle operations\. ALFWorld is evaluated by task completion\. The environment returns a binary outcome reward, wherer\(τ\)=1r\(\\tau\)=1if the final environment state satisfies the household goal andr\(τ\)=0r\(\\tau\)=0otherwise\. The prompt of ALFWorld environment is as follows\.

ALFWorld[⬇](data:text/plain;base64,WW91IGFyZSB0aGUgZXhlY3V0b3IgYWdlbnQgdG8gY29tcGxldGUgYSB0YXNrIGluIGEgZ2l2ZW4gZW52aXJvbm1lbnQuCgpFbnZpcm9ubWVudDoKQUxGV29ybGQgaXMgYSBob3VzZWhvbGQgZW52aXJvbm1lbnQgd2l0aCByb29tcywgcmVjZXB0YWNsZXMsIGFuZCBtYW5pcHVsYWJsZSBvYmplY3RzLgpZb3UgbmVlZCB0byBzb2x2ZSBhIGhvdXNlaG9sZCBnb2FsIGJ5IG5hdmlnYXRpbmcgYW5kIGludGVyYWN0aW5nIHdpdGggb2JqZWN0cyBhbmQgY29udGFpbmVycy4KClRhc2s6Cnt7dGFza319CgpFbnZpcm9ubWVudCBhY3Rpb246CkF2YWlsYWJsZSBBY3Rpb25zOgotIDxhY3Rpb24+Z28gdG8gW0xPQ0FUSU9OXTwvYWN0aW9uPjogbW92ZSB0byBhIHJlY2VwdGFjbGUgb3IgbG9jYXRpb24uCi0gPGFjdGlvbj50YWtlIFtPQkpFQ1RdIGZyb20gW1JFQ0VQVEFDTEVdPC9hY3Rpb24+OiBwaWNrIHVwIGFuIG9iamVjdCBmcm9tIGEgcmVjZXB0YWNsZS4KLSA8YWN0aW9uPnB1dCBbT0JKRUNUXSBpbi9vbiBbUkVDRVBUQUNMRV08L2FjdGlvbj46IHB1dCBhbiBvYmplY3Qgb24gYSByZWNlcHRhY2xlLgotIDxhY3Rpb24+b3BlbiBbUkVDRVBUQUNMRV08L2FjdGlvbj46IG9wZW4gYSByZWNlcHRhY2xlIHRvIHJldmVhbCBpdHMgY29udGVudHMuCi0gPGFjdGlvbj5jbG9zZSBbUkVDRVBUQUNMRV08L2FjdGlvbj46IGNsb3NlIGEgcmVjZXB0YWNsZS4KLSA8YWN0aW9uPnRvZ2dsZSBbT0JKRUNUXSBbUkVDRVBUQUNMRV08L2FjdGlvbj46IHN3aXRjaCBhbiBvYmplY3Qgb24gb3Igb2ZmLgotIDxhY3Rpb24+Y2xlYW4gW09CSkVDVF0gd2l0aCBbUkVDRVBUQUNMRV08L2FjdGlvbj46IGNsZWFuIGFuIG9iamVjdCB1c2luZyBhIHJlY2VwdGFjbGUuCi0gPGFjdGlvbj5oZWF0IFtPQkpFQ1RdIHdpdGggW1JFQ0VQVEFDTEVdPC9hY3Rpb24+OiBoZWF0IGFuIG9iamVjdCB1c2luZyBhIHJlY2VwdGFjbGUuCi0gPGFjdGlvbj5jb29sIFtPQkpFQ1RdIHdpdGggW1JFQ0VQVEFDTEVdPC9hY3Rpb24+OiBjb29sIGFuIG9iamVjdCB1c2luZyBhIHJlY2VwdGFjbGUuCi0gPGFjdGlvbj5pbnZlbnRvcnk8L2FjdGlvbj46IGxpc3Qgb2JqZWN0cyBjdXJyZW50bHkgYmVpbmcgY2FycmllZC4KLSA8YWN0aW9uPmxvb2s8L2FjdGlvbj46IGRlc2NyaWJlIHRoZSBjdXJyZW50IHNpdHVhdGlvbi4KLSA8YWN0aW9uPmV4YW1pbmUgW09CSkVDVF08L2FjdGlvbj46IGV4YW1pbmUgYW4gb2JqZWN0IG9yIHJlY2VwdGFjbGUgaW4gZGV0YWlsLgoKT3V0cHV0IGZvcm1hdDoKPHRoaW5rPgpzaG9ydCByZWFzb24KPC90aGluaz4KPGFjdGlvbj4KY2hvc2VuIGFjdGlvbgo8L2FjdGlvbj4K)Youaretheexecutoragenttocompleteataskinagivenenvironment\.Environment:ALFWorldisahouseholdenvironmentwithrooms,receptacles,andmanipulableobjects\.Youneedtosolveahouseholdgoalbynavigatingandinteractingwithobjectsandcontainers\.Task:\{\{task\}\}Environmentaction:AvailableActions:\-<action\>goto\[LOCATION\]</action\>:movetoareceptacleorlocation\.\-<action\>take\[OBJECT\]from\[RECEPTACLE\]</action\>:pickupanobjectfromareceptacle\.\-<action\>put\[OBJECT\]in/on\[RECEPTACLE\]</action\>:putanobjectonareceptacle\.\-<action\>open\[RECEPTACLE\]</action\>:openareceptacletorevealitscontents\.\-<action\>close\[RECEPTACLE\]</action\>:closeareceptacle\.\-<action\>toggle\[OBJECT\]\[RECEPTACLE\]</action\>:switchanobjectonoroff\.\-<action\>clean\[OBJECT\]with\[RECEPTACLE\]</action\>:cleananobjectusingareceptacle\.\-<action\>heat\[OBJECT\]with\[RECEPTACLE\]</action\>:heatanobjectusingareceptacle\.\-<action\>cool\[OBJECT\]with\[RECEPTACLE\]</action\>:coolanobjectusingareceptacle\.\-<action\>inventory</action\>:listobjectscurrentlybeingcarried\.\-<action\>look</action\>:describethecurrentsituation\.\-<action\>examine\[OBJECT\]</action\>:examineanobjectorreceptacleindetail\.Outputformat:<think\>shortreason</think\><action\>chosenaction</action\>

### D\.2Webshop

WebShop is an e\-commerce environment in which the agent searches, navigates product pages, inspects item attributes, and purchases a product that matches a user shopping instructionyao2022webshop;xi2025agentgym\. The task requires both information seeking and decision making because the agent must choose useful search queries and compare candidate products before purchasing\. Following the WebShop evaluation protocolyao2022webshop, the final reward compares the purchased productyywith the user instructionuuby matching product type, attributes, options, and price:

r=rtype⋅\|Uatt∩Yatt\|\+\|Uopt∩Yopt\|\+𝟏\[yprice≤uprice\]\|Uatt\|\+\|Uopt\|\+1,r=r\_\{\\mathrm\{type\}\}\\cdot\\frac\{\|U\_\{\\mathrm\{att\}\}\\cap Y\_\{\\mathrm\{att\}\}\|\+\|U\_\{\\mathrm\{opt\}\}\\cap Y\_\{\\mathrm\{opt\}\}\|\+\\mathbf\{1\}\[y\_\{\\mathrm\{price\}\}\\leq u\_\{\\mathrm\{price\}\}\]\}\{\|U\_\{\\mathrm\{att\}\}\|\+\|U\_\{\\mathrm\{opt\}\}\|\+1\},\(7\)whereUUandYYdenote instruction\-side constraints and purchased\-product fields, andrtyper\_\{\\mathrm\{type\}\}penalizes mismatched product types\. An trajectory is considered successful whenr=1r=1, meaning the purchased product fully satisfies the instruction\. The prompt of WebShop environment is as follows\.

WebShop[⬇](data:text/plain;base64,WW91IGFyZSB0aGUgZXhlY3V0b3IgYWdlbnQgdG8gY29tcGxldGUgYSB0YXNrIGluIGEgZ2l2ZW4gZW52aXJvbm1lbnQuCgpFbnZpcm9ubWVudDoKV2Vic2hvcCBpcyBhbiBlLWNvbW1lcmNlIHNob3BwaW5nIGVudmlyb25tZW50IHdoZXJlIHlvdSBuZWVkIHRvIGZpbmQgYW5kIHB1cmNoYXNlIHByb2R1Y3QgYmFzZWQgb24gZ2l2ZW4gc2hvcHBpbmcgZ29hbC4KClRhc2s6Cnt7dGFza319CgpFbnZpcm9ubWVudCBhY3Rpb246CkV2ZXJ5IHJvdW5kIEkgd2lsbCBnaXZlIHlvdSBhbiBvYnNlcnZhdGlvbiBhbmQgYSBsaXN0IG9mIGF2YWlsYWJsZSBhY3Rpb25zLCB5b3UgaGF2ZSB0byByZXNwb25kIGFuIGFjdGlvbiBiYXNlZCBvbiB0aGUgc3RhdGUgYW5kIGluc3RydWN0aW9uLgpZb3UgY2FuIHVzZSBzZWFyY2ggYWN0aW9uIGlmIHNlYXJjaCBpcyBhdmFpbGFibGUuCllvdSBjYW4gY2xpY2sgb25lIG9mIHRoZSBidXR0b25zIGluIGNsaWNrYWJsZXMuCklmIHRoZSBhY3Rpb24gaXMgbm90IHZhbGlkLCBwZXJmb3JtIG5vdGhpbmcuCktleXdvcmRzIGluIHNlYXJjaCBhcmUgdXAgdG8geW91LCBidXQgdGhlIHZhbHVlIGluIGNsaWNrIG11c3QgYmUgYSB2YWx1ZSBpbiB0aGUgbGlzdCBvZiBhdmFpbGFibGUgYWN0aW9ucy4KUmVtZW1iZXIgdGhhdCB5b3VyIGtleXdvcmRzIGluIHNlYXJjaCBzaG91bGQgYmUgY2FyZWZ1bGx5IGRlc2lnbmVkLgoKVGhlcmUgYXJlIGRpZmZlcmVudCB0eXBlcyBvZiBwYWdlczoKLSBJbml0aWFsIFBhZ2U6IFlvdSBjYW4gcGVyZm9ybSBzZWFyY2ggYWN0aW9ucyB0byBmaW5kIHByb2R1Y3RzLgotIFNlYXJjaCBSZXN1bHRzIFBhZ2U6IFlvdSBjYW4gdmlldyBzZWFyY2ggcmVzdWx0cywgbmF2aWdhdGUgdGhyb3VnaCBwYWdlcyB1c2luZyBjbGlja1tOZXh0ID5dIGFuZCBjbGlja1s8IFByZXZdLCBhbmQgY2xpY2sgb24gcHJvZHVjdCBBU0lOIHRvIHZpZXcgZGV0YWlscy4KLSBQcm9kdWN0IERldGFpbCBQYWdlOiBZb3UgY2FuIHZpZXcgcHJvZHVjdCBkZXRhaWxzLCBjaGVjayBkZXNjcmlwdGlvbiBhbmQgZmVhdHVyZXMsIGFuZCBpZiB0aGUgcHJvZHVjdCBtYXRjaGVzIHRoZSByZXF1aXJlbWVudHMsIHlvdSBjYW4gcHVyY2hhc2UgdGhlIHByb2R1Y3QuIElmIG5vdCwgeW91IGNhbiBnbyBiYWNrIHRvIHRoZSBzZWFyY2ggcmVzdWx0cyBvciBnbyB0byBpbml0aWFsIHBhZ2UuCgpBdmFpbGFibGUgQWN0aW9uczoKLSA8YWN0aW9uPnNlYXJjaFtxdWVyeV08L2FjdGlvbj46IHNlYXJjaCBmb3IgcHJvZHVjdHMgdXNpbmcgdGhlIHNwZWNpZmllZCBxdWVyeSAoaW5pdGlhbCBwYWdlIG9ubHkpLgotIDxhY3Rpb24+Y2xpY2tbYnV0dG9uXTwvYWN0aW9uPjogbmF2aWdhdGUgYnkgY2xpY2tpbmcgYnV0dG9ucyBvciBpdGVtcy4KCk91dHB1dCBmb3JtYXQ6Cjx0aGluaz4Kc2hvcnQgcmVhc29uCjwvdGhpbms+CjxhY3Rpb24+CmNob3NlbiBhY3Rpb24KPC9hY3Rpb24+)Youaretheexecutoragenttocompleteataskinagivenenvironment\.Environment:Webshopisane\-commerceshoppingenvironmentwhereyouneedtofindandpurchaseproductbasedongivenshoppinggoal\.Task:\{\{task\}\}Environmentaction:EveryroundIwillgiveyouanobservationandalistofavailableactions,youhavetorespondanactionbasedonthestateandinstruction\.Youcanusesearchactionifsearchisavailable\.Youcanclickoneofthebuttonsinclickables\.Iftheactionisnotvalid,performnothing\.Keywordsinsearchareuptoyou,butthevalueinclickmustbeavalueinthelistofavailableactions\.Rememberthatyourkeywordsinsearchshouldbecarefullydesigned\.Therearedifferenttypesofpages:\-InitialPage:Youcanperformsearchactionstofindproducts\.\-SearchResultsPage:Youcanviewsearchresults,navigatethroughpagesusingclick\[Next\>\]andclick\[<Prev\],andclickonproductASINtoviewdetails\.\-ProductDetailPage:Youcanviewproductdetails,checkdescriptionandfeatures,andiftheproductmatchestherequirements,youcanpurchasetheproduct\.Ifnot,youcangobacktothesearchresultsorgotoinitialpage\.AvailableActions:\-<action\>search\[query\]</action\>:searchforproductsusingthespecifiedquery\(initialpageonly\)\.\-<action\>click\[button\]</action\>:navigatebyclickingbuttonsoritems\.Outputformat:<think\>shortreason</think\><action\>chosenaction</action\>

### D\.3SearchQA

SearchQA is a search\-augmented question\-answering environment built on multi\-hop QA benchmarks, including HotpotQA, 2WikiMultiHopQA, Bamboogle, and MuSiQueHotpotQA;2WikiMultiHopQA;Bamboogle;Musique\. The agent alternates between issuing search queries and reading retrieved information until it has enough evidence to produce a concise final answer\. Intermediate search actions are used only to gather evidence, and the trajectory is evaluated after the agent emits a final answer or reaches the interaction budget\. The environment returns a binary outcome reward based on answer correctness, withr\(τ\)=1r\(\\tau\)=1when the normalized final answer matches an accepted reference answer andr\(τ\)=0r\(\\tau\)=0otherwise\. The prompt of SearchQA environment is as follows\.

SearchQA[⬇](data:text/plain;base64,WW91IGFyZSB0aGUgZXhlY3V0b3IgYWdlbnQgdG8gY29tcGxldGUgYSB0YXNrIGluIGEgZ2l2ZW4gZW52aXJvbm1lbnQuCgpFbnZpcm9ubWVudDoKU2VhcmNoRW52IGlzIGEgd2Vic2VhcmNoIGVudmlyb25tZW50LiBZb3UgbmVlZCB0byBhbnN3ZXIgYSBxdWVzdGlvbiBieSBpc3N1aW5nIHNlYXJjaCBxdWVyaWVzIGFuZCBpdGVyYXRpdmVseSByZWZpbmluZyB5b3VyIHNlYXJjaCBiYXNlZCBvbiByZXRyaWV2ZWQgaW5mb3JtYXRpb24uCldoZW4gYWxsIG5lY2Vzc2FyeSBpbmZvcm1hdGlvbiBpcyBnYXRoZXJlZCwgcmV0dXJuIGEgc2hvcnQgYW5kIGNvbmNpc2UgZmluYWwgYW5zd2VyLgoKVGFzazogCllvdSBtdXN0IHJlYXNvbiBpbnNpZGUgPHRoaW5rPi4uLjwvdGhpbms+IGZpcnN0LiBJZiB5b3UgZG8gbm90IGhhdmUgZW5vdWdoIGtub3dsZWRnZSwgaXNzdWUgYSA8c2VhcmNoPi4uLjwvc2VhcmNoPiBhbmQgdGhlbiBTVE9QLiBEbyBub3QgZ2VuZXJhdGUgPGluZm9ybWF0aW9uPiBvciA8YW5zd2VyPiB5ZXQuIFdhaXQgZm9yIGV4dGVybmFsIGlucHV0IHdyYXBwZWQgaW4gPGluZm9ybWF0aW9uPi4uLjwvaW5mb3JtYXRpb24+LiBBZnRlciByZWNlaXZpbmcgaW5mb3JtYXRpb24sIHJlYXNvbiBhZ2FpbiBpbiA8dGhpbms+LiBJZiBjb25maWRlbnQsIG91dHB1dCB5b3VyIGZpbmFsIGFuc3dlciBpbiA8YW5zd2VyPi4uLjwvYW5zd2VyPi4gRG8gbm90IG91dHB1dCA8YW5zd2VyPiBiZWZvcmUgcmVjZWl2aW5nIDxpbmZvcm1hdGlvbj4gdW5sZXNzIHlvdSBhcmUgZnVsbHkgY29uZmlkZW50LiBJZiB5b3UgZmluZCBubyBmdXJ0aGVyIGV4dGVybmFsIGtub3dsZWRnZSBuZWVkZWQsIHlvdSBjYW4gZGlyZWN0bHkgcHJvdmlkZSB0aGUgYW5zd2VyIGluc2lkZSA8YW5zd2VyPiBhbmQgPC9hbnN3ZXI+LCB3aXRob3V0IGRldGFpbGVkIGlsbHVzdHJhdGlvbnMuIEZvciBleGFtcGxlLCA8YW5zd2VyPiBCZWlqaW5nIDwvYW5zd2VyPi4gRm9sbG93IHRoaXMgcHJvY2VzcyBldmVyeSB0aW1lLgp7e3Rhc2t9fQoKRW52aXJvbm1lbnQgYWN0aW9uOgpBdmFpbGFibGUgQWN0aW9uczoKLSA8c2VhcmNoPnF1ZXJ5PC9zZWFyY2g+OiBzZWFyY2ggZm9yIHJlbGV2YW50IGluZm9ybWF0aW9uLgotIDxhbnN3ZXI+YW5zd2VyPC9hbnN3ZXI+OiBwcm92aWRlIHRoZSBmaW5hbCBjb25jaXNlIGFuc3dlci4KCldoZW4gZ2l2aW5nIHRoZSBmaW5hbCBhbnN3ZXIsIG1ha2UgaXQgc2hvcnQgYW5kIGNvbmNpc2UuIERvbid0IGluY2x1ZGUgYW55IGFkZGl0aW9uYWwgZXhwbGFuYXRpb25zIG9yIG5vdGVzLgpGb3IgZXhhbXBsZSwgaWYgdGhlIHF1ZXN0aW9uIGlzICJXaGF0IGlzIHRoZSBjYXBpdGFsIG9mIEZyYW5jZT8iOgo8YW5zd2VyPlBhcmlzPC9hbnN3ZXI+CgpPdXRwdXQgZm9ybWF0Ogo8dGhpbms+CnNob3J0IHJlYXNvbgo8L3RoaW5rPgo8YWN0aW9uPgpjaG9zZW4gYWN0aW9uCjwvYWN0aW9uPg==)Youaretheexecutoragenttocompleteataskinagivenenvironment\.Environment:SearchEnvisawebsearchenvironment\.Youneedtoansweraquestionbyissuingsearchqueriesanditerativelyrefiningyoursearchbasedonretrievedinformation\.Whenallnecessaryinformationisgathered,returnashortandconcisefinalanswer\.Task:Youmustreasoninside<think\>\.\.\.</think\>first\.Ifyoudonothaveenoughknowledge,issuea<search\>\.\.\.</search\>andthenSTOP\.Donotgenerate<information\>or<answer\>yet\.Waitforexternalinputwrappedin<information\>\.\.\.</information\>\.Afterreceivinginformation,reasonagainin<think\>\.Ifconfident,outputyourfinalanswerin<answer\>\.\.\.</answer\>\.Donotoutput<answer\>beforereceiving<information\>unlessyouarefullyconfident\.Ifyoufindnofurtherexternalknowledgeneeded,youcandirectlyprovidetheanswerinside<answer\>and</answer\>,withoutdetailedillustrations\.Forexample,<answer\>Beijing</answer\>\.Followthisprocesseverytime\.\{\{task\}\}Environmentaction:AvailableActions:\-<search\>query</search\>:searchforrelevantinformation\.\-<answer\>answer</answer\>:providethefinalconciseanswer\.Whengivingthefinalanswer,makeitshortandconcise\.Don’tincludeanyadditionalexplanationsornotes\.Forexample,ifthequestionis"WhatisthecapitalofFrance?":<answer\>Paris</answer\>Outputformat:<think\>shortreason</think\><action\>chosenaction</action\>

### D\.4Math

The math environment evaluates mathematical reasoning on MATH500, Minerva Math, OlympiadBench, AMC23, and AIME24hendrycks2021math500;lewkowycz2022minerva;he\-etal\-2024\-olympiadbench;numina\_math\_datasets\. The agent is given a math problem and should reason step by step to arrive at the final answer\. The final answer is checked against the reference answer\. The environment returns a binary outcome reward, wherer\(τ\)=1r\(\\tau\)=1if the final answer is equivalent to the reference solution andr\(τ\)=0r\(\\tau\)=0otherwise\. The prompt of Math environment is as follows\.

Math[⬇](data:text/plain;base64,WW91IGFyZSBhIG1hdGggcHJvYmxlbSBzb2x2ZXIuCgpUYXNrOgp7e3Rhc2t9fQpQbGVhc2UgcmVhc29uIHN0ZXAgYnkgc3RlcCwgYW5kIHB1dCB5b3VyIGZpbmFsIGFuc3dlciB3aXRoaW4gXGJveGVke30uCgpPdXRwdXQgZm9ybWF0Ogo8dGhpbms+CnNob3J0IHJlYXNvbgo8L3RoaW5rPgo8YW5zd2VyPgpcYm94ZWR7YW5zd2VyfQo8L2Fuc3dlcj4=)Youareamathproblemsolver\.Task:\{\{task\}\}Pleasereasonstepbystep,andputyourfinalanswerwithin\\boxed\{\}\.Outputformat:<think\>shortreason</think\><answer\>\\boxed\{answer\}</answer\>

## Appendix ETrajectory Examples

The following boxes show a trajectory comparison between a failed GRPO trajectory and a successful trajectory produced by our method under the same task setting\.

GRPO[⬇](data:text/plain;base64,VGFzazogcHV0IGEgaG90IHBvdGF0byBpbiBmcmlkZ2UuClJlc3BvbnNlOgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgouLi4gTGV0IG1lIHN0YXJ0IGJ5IGdvaW5nIHRvIGNvdW50ZXJ0b3AgMSB0byBjaGVjay4KPC90aGluaz4KCjxhY3Rpb24+Z28gdG8gY291bnRlcnRvcCAxPC9hY3Rpb24+PHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD51c2VyCk9uIHRoZSBjb3VudGVydG9wIDEsIHlvdSBzZWUgYSBkaXNoc3BvbmdlIDMsIGEgZGlzaHNwb25nZSAyLCBhIGZvcmsgMSwgYSBzcG9vbiAxLCBhbmQgYSB0b21hdG8gMS48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PmFzc2lzdGFudAo8dGhpbms+Ci4uLiBJIG5lZWQgdG8gZmluZCB0aGUgcG90YXRvIGZpcnN0LiBMZXQncyBnbyB0byBjb3VudGVydG9wIDIgbmV4dC4KPC90aGluaz4KCjxhY3Rpb24+Z28gdG8gY291bnRlcnRvcCAyPC9hY3Rpb24+PHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD51c2VyCk9uIHRoZSBjb3VudGVydG9wIDIsIHlvdSBzZWUgYSBicmVhZCAyLCBhIGN1cCAxLCBhbmQgYSBmb3JrIDMuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgouLi4gTGV0J3MgZ28gdG8gY291bnRlcnRvcCAzIG5leHQuCjwvdGhpbms+Cgo8YWN0aW9uPmdvIHRvIGNvdW50ZXJ0b3AgMzwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpPbiB0aGUgY291bnRlcnRvcCAzLCB5b3Ugc2VlIGEgYnV0dGVya25pZmUgMiwgYSBidXR0ZXJrbmlmZSAxLCBhIGxldHR1Y2UgMywgYSBsZXR0dWNlIDIsIGEgbXVnIDEsIGEgc29hcGJvdHRsZSAzLCBhbmQgYSBzcGF0dWxhIDEuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgouLi4gU28gcGVyaGFwcyBnbyB0byBjYWJpbmV0IDEsIG9wZW4gaXQsIGFuZCBjaGVjayBjb250ZW50cy4gTGV0J3MgdHJ5IHRoYXQuCjwvdGhpbms+Cgo8YWN0aW9uPmdvIHRvIGNhYmluZXQgMTwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpUaGUgY2FiaW5ldCAxIGlzIGNsb3NlZC48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PmFzc2lzdGFudAo8dGhpbms+Li4uIExldCdzIG9wZW4gY2FiaW5ldCAxIHRvIHNlZS48L3RoaW5rPgoKPGFjdGlvbj5vcGVuIGNhYmluZXQgMTwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpZb3Ugb3BlbiB0aGUgY2FiaW5ldCAxLiBUaGUgY2FiaW5ldCAxIGlzIG9wZW4uIEluIGl0LCB5b3Ugc2VlIGEgYm93bCAxLjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+YXNzaXN0YW50Cjx0aGluaz4uLi4gTGV0J3MgZ28gdG8gY2FiaW5ldCAyIGFuZCBvcGVuIGl0LjwvdGhpbms+Cgo8YWN0aW9uPmdvIHRvIGNhYmluZXQgMjwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpUaGUgY2FiaW5ldCAyIGlzIGNsb3NlZC48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PmFzc2lzdGFudAo8dGhpbms+Ck9rYXksIGNhYmluZXQgMiBpcyBjbG9zZWQuIExldCdzIG9wZW4gaXQgdG8gY2hlY2suIElmIHRoZXJlJ3MgYSBwb3RhdG8gaW5zaWRlLCB0aGF0J3MgZ3JlYXQuIElmIG5vdCwgcHJvY2VlZCB0byB0aGUgbmV4dCBjYWJpbmV0LiBBbHRlcm5hdGl2ZWx5LCBtYXliZSBjaGVjayBkcmF3ZXJzLiBCdXQgc2luY2UgY2FiaW5ldHMgYXJlIGNvbW1vbiBzdG9yYWdlLCBwZXJoYXBzIGNvbnRpbnVlIGNoZWNraW5nIGNhYmluZXRzLiBMZXQncyBvcGVuIGNhYmluZXQgMi4KPC90aGluaz4KCjxhY3Rpb24+b3BlbiBjYWJpbmV0IDI8L2FjdGlvbj48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PnVzZXIKWW91IG9wZW4gdGhlIGNhYmluZXQgMi4gVGhlIGNhYmluZXQgMiBpcyBvcGVuLiBJbiBpdCwgeW91IHNlZSBub3RoaW5nLjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+YXNzaXN0YW50Cjx0aGluaz4KLi4uIExldCdzIGNoZWNrIGRyYXdlciAxLgo8L3RoaW5rPgoKPGFjdGlvbj5nbyB0byBkcmF3ZXIgMTwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpUaGUgZHJhd2VyIDEgaXMgY2xvc2VkLjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+YXNzaXN0YW50Cjx0aGluaz4KRHJhd2VyIDEgaXMgY2xvc2VkLiBOZWVkIHRvIG9wZW4gaXQuIExldCdzIG9wZW4gZHJhd2VyIDEgdG8gY2hlY2sgaWYgdGhlcmUncyBhIHBvdGF0byBpbnNpZGUuIElmIG5vdCwgcHJvY2VlZCB0byB0aGUgbmV4dCBkcmF3ZXIgb3IgY2FiaW5ldC4gTGV0J3Mgb3BlbiBpdC4KPC90aGluaz4KCjxhY3Rpb24+b3BlbiBkcmF3ZXIgMTwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpZb3Ugb3BlbiB0aGUgZHJhd2VyIDEuIFRoZSBkcmF3ZXIgMSBpcyBvcGVuLiBJbiBpdCwgeW91IHNlZSBhIGZvcmsgMi48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PmFzc2lzdGFudAo8dGhpbms+Ci4uLiBMZXQncyB0cnkgZHJhd2VyIDIuCjwvdGhpbms+Cgo8YWN0aW9uPmdvIHRvIGRyYXdlciAyPC9hY3Rpb24+PHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD51c2VyClRoZSBkcmF3ZXIgMiBpcyBjbG9zZWQuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgpEcmF3ZXIgMiBpcyBjbG9zZWQuIExldCdzIG9wZW4gaXQgdG8gY2hlY2suIElmIHRoZXJlJ3MgYSBwb3RhdG8sIGdyZWF0LiBJZiBub3QsIGNvbnRpbnVlIHRvIG5leHQgZHJhd2VyIG9yIGNoZWNrIG90aGVyIGxvY2F0aW9ucy4gTGV0J3Mgb3BlbiBkcmF3ZXIgMi4KPC90aGluaz4KCjxhY3Rpb24+b3BlbiBkcmF3ZXIgMjwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpZb3Ugb3BlbiB0aGUgZHJhd2VyIDIuIFRoZSBkcmF3ZXIgMiBpcyBvcGVuLiBJbiBpdCwgeW91IHNlZSBhIGtuaWZlIDEuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgouLi4gTGV0J3MgdHJ5IGRyYXdlciAzLgo8L3RoaW5rPgoKPGFjdGlvbj5nbyB0byBkcmF3ZXIgMzwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpUaGUgZHJhd2VyIDMgaXMgY2xvc2VkLjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+YXNzaXN0YW50Cjx0aGluaz4KRHJhd2VyIDMgaXMgY2xvc2VkLiBMZXQncyBvcGVuIGl0IHRvIGNoZWNrLiBJZiB0aGVyZSdzIGEgcG90YXRvLCBncmVhdC4gSWYgbm90LCBtb3ZlIHRvIGRyYXdlciA0LiBMZXQncyBvcGVuIGRyYXdlciAzLgo8L3RoaW5rPgoKPGFjdGlvbj5vcGVuIGRyYXdlciAzPC9hY3Rpb24+PHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD51c2VyCllvdSBvcGVuIHRoZSBkcmF3ZXIgMy4gVGhlIGRyYXdlciAzIGlzIG9wZW4uIEluIGl0LCB5b3Ugc2VlIGEgcGVwcGVyc2hha2VyIDEsIGFuZCBhIHNwb29uIDIuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgouLi4gTGV0J3MgcHJvY2VlZCB0byBkcmF3ZXIgNC4KPC90aGluaz4KCjxhY3Rpb24+Z28gdG8gZHJhd2VyIDQ8L2FjdGlvbj48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PnVzZXIKVGhlIGRyYXdlciA0IGlzIGNsb3NlZC48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PmFzc2lzdGFudAo8dGhpbms+Ci4uLiBMZXQncyBvcGVuIGRyYXdlciA0Lgo8L3RoaW5rPgoKPGFjdGlvbj5vcGVuIGRyYXdlciA0PC9hY3Rpb24+PHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD51c2VyCllvdSBvcGVuIHRoZSBkcmF3ZXIgNC4gVGhlIGRyYXdlciA0IGlzIG9wZW4uIEluIGl0LCB5b3Ugc2VlIG5vdGhpbmcuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgouLi4gTWF5YmUgdGhlIHBvdGF0byBpcyBpbiB0aGUgZnJpZGdlIGFscmVhZHk/IE5vLCB0aGUgdGFzayBpcyB0byBwdXQgaXQgaW4gdGhlIGZyaWRnZS4gU28gaXQgbXVzdCBiZSBlbHNld2hlcmUuIExldCdzIGNoZWNrIGNhYmluZXQgMyBuZXh0Lgo8L3RoaW5rPgoKPGFjdGlvbj5nbyB0byBjYWJpbmV0IDM8L2FjdGlvbj48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PnVzZXIKVGhlIGNhYmluZXQgMyBpcyBjbG9zZWQuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgouLi4gTGV0J3MgZ28gdG8gY2FiaW5ldCA0IG5leHQuCjwvdGhpbms+Cgo8YWN0aW9uPmdvIHRvIGNhYmluZXQgNDwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpPbiB0aGUgY2FiaW5ldCA0LCB5b3Ugc2VlIGEgc29hcGJvdHRsZSAxLjx8aW1fZW5kfD4=)Task:putahotpotatoinfridge\.Response:<\|im\_start\|\>assistant<think\>\.\.\.Letmestartbygoingtocountertop1tocheck\.</think\><action\>gotocountertop1</action\><\|im\_end\|\><\|im\_start\|\>userOnthecountertop1,youseeadishsponge3,adishsponge2,afork1,aspoon1,andatomato1\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>\.\.\.Ineedtofindthepotatofirst\.Let’sgotocountertop2next\.</think\><action\>gotocountertop2</action\><\|im\_end\|\><\|im\_start\|\>userOnthecountertop2,youseeabread2,acup1,andafork3\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>\.\.\.Let’sgotocountertop3next\.</think\><action\>gotocountertop3</action\><\|im\_end\|\><\|im\_start\|\>userOnthecountertop3,youseeabutterknife2,abutterknife1,alettuce3,alettuce2,amug1,asoapbottle3,andaspatula1\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>\.\.\.Soperhapsgotocabinet1,openit,andcheckcontents\.Let’strythat\.</think\><action\>gotocabinet1</action\><\|im\_end\|\><\|im\_start\|\>userThecabinet1isclosed\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>\.\.\.Let’sopencabinet1tosee\.</think\><action\>opencabinet1</action\><\|im\_end\|\><\|im\_start\|\>userYouopenthecabinet1\.Thecabinet1isopen\.Init,youseeabowl1\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>\.\.\.Let’sgotocabinet2andopenit\.</think\><action\>gotocabinet2</action\><\|im\_end\|\><\|im\_start\|\>userThecabinet2isclosed\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>Okay,cabinet2isclosed\.Let’sopenittocheck\.Ifthere’sapotatoinside,that’sgreat\.Ifnot,proceedtothenextcabinet\.Alternatively,maybecheckdrawers\.Butsincecabinetsarecommonstorage,perhapscontinuecheckingcabinets\.Let’sopencabinet2\.</think\><action\>opencabinet2</action\><\|im\_end\|\><\|im\_start\|\>userYouopenthecabinet2\.Thecabinet2isopen\.Init,youseenothing\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>\.\.\.Let’scheckdrawer1\.</think\><action\>gotodrawer1</action\><\|im\_end\|\><\|im\_start\|\>userThedrawer1isclosed\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>Drawer1isclosed\.Needtoopenit\.Let’sopendrawer1tocheckifthere’sapotatoinside\.Ifnot,proceedtothenextdrawerorcabinet\.Let’sopenit\.</think\><action\>opendrawer1</action\><\|im\_end\|\><\|im\_start\|\>userYouopenthedrawer1\.Thedrawer1isopen\.Init,youseeafork2\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>\.\.\.Let’strydrawer2\.</think\><action\>gotodrawer2</action\><\|im\_end\|\><\|im\_start\|\>userThedrawer2isclosed\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>Drawer2isclosed\.Let’sopenittocheck\.Ifthere’sapotato,great\.Ifnot,continuetonextdrawerorcheckotherlocations\.Let’sopendrawer2\.</think\><action\>opendrawer2</action\><\|im\_end\|\><\|im\_start\|\>userYouopenthedrawer2\.Thedrawer2isopen\.Init,youseeaknife1\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>\.\.\.Let’strydrawer3\.</think\><action\>gotodrawer3</action\><\|im\_end\|\><\|im\_start\|\>userThedrawer3isclosed\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>Drawer3isclosed\.Let’sopenittocheck\.Ifthere’sapotato,great\.Ifnot,movetodrawer4\.Let’sopendrawer3\.</think\><action\>opendrawer3</action\><\|im\_end\|\><\|im\_start\|\>userYouopenthedrawer3\.Thedrawer3isopen\.Init,youseeapeppershaker1,andaspoon2\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>\.\.\.Let’sproceedtodrawer4\.</think\><action\>gotodrawer4</action\><\|im\_end\|\><\|im\_start\|\>userThedrawer4isclosed\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>\.\.\.Let’sopendrawer4\.</think\><action\>opendrawer4</action\><\|im\_end\|\><\|im\_start\|\>userYouopenthedrawer4\.Thedrawer4isopen\.Init,youseenothing\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>\.\.\.Maybethepotatoisinthefridgealready?No,thetaskistoputitinthefridge\.Soitmustbeelsewhere\.Let’scheckcabinet3next\.</think\><action\>gotocabinet3</action\><\|im\_end\|\><\|im\_start\|\>userThecabinet3isclosed\.<\|im\_end\|\><\|im\_start\|\>assistant<think\>\.\.\.Let’sgotocabinet4next\.</think\><action\>gotocabinet4</action\><\|im\_end\|\><\|im\_start\|\>userOnthecabinet4,youseeasoapbottle1\.<\|im\_end\|\>Reward = 0

ICRL[⬇](data:text/plain;base64,VGFzazogcHV0IGEgaG90IHBvdGF0byBpbiBmcmlkZ2UuClJlc3BvbnNlOgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgouLi4gTGV0J3Mgc3RhcnQgYnkgY2hlY2tpbmcgdGhlIGNvdW50ZXJ0b3BzIGFuZCBvdGhlciB2aXNpYmxlIHBsYWNlcy4KPC90aGluaz4KCjxhY3Rpb24+Cmxvb2sKPC9hY3Rpb24+PHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD51c2VyCllvdSBhcmUgaW4gdGhlIG1pZGRsZSBvZiBhIHJvb20uIExvb2tpbmcgcXVpY2tseSBhcm91bmQgeW91LCB5b3Ugc2VlIG5vdGhpbmcuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgo8L3RoaW5rPgoKPGFjdGlvbj4KZ28gdG8gY291bnRlcnRvcCAxCjwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpPbiB0aGUgY291bnRlcnRvcCAxLCB5b3Ugc2VlIGEgZGlzaHNwb25nZSAzLCBhIGRpc2hzcG9uZ2UgMiwgYSBmb3JrIDEsIGEgc3Bvb24gMSwgYW5kIGEgdG9tYXRvIDEuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgo8L3RoaW5rPgoKPGFjdGlvbj4KZ28gdG8gY291bnRlcnRvcCAyCjwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpPbiB0aGUgY291bnRlcnRvcCAyLCB5b3Ugc2VlIGEgYnJlYWQgMiwgYSBjdXAgMSwgYW5kIGEgZm9yayAzLjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+YXNzaXN0YW50Cjx0aGluaz4KPC90aGluaz4KCjxhY3Rpb24+CmdvIHRvIGNvdW50ZXJ0b3AgMwo8L2FjdGlvbj48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PnVzZXIKT24gdGhlIGNvdW50ZXJ0b3AgMywgeW91IHNlZSBhIGJ1dHRlcmtuaWZlIDIsIGEgYnV0dGVya25pZmUgMSwgYSBsZXR0dWNlIDMsIGEgbGV0dHVjZSAyLCBhIG11ZyAxLCBhIHNvYXBib3R0bGUgMywgYW5kIGEgc3BhdHVsYSAxLjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+YXNzaXN0YW50Cjx0aGluaz4KPC90aGluaz4KCjxhY3Rpb24+CmdvIHRvIHN0b3ZlYnVybmVyIDEKPC9hY3Rpb24+PHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD51c2VyCk9uIHRoZSBzdG92ZWJ1cm5lciAxLCB5b3Ugc2VlIGEgcG90IDEuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgo8L3RoaW5rPgoKPGFjdGlvbj4KZ28gdG8gc3RvdmVidXJuZXIgMgo8L2FjdGlvbj48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PnVzZXIKT24gdGhlIHN0b3ZlYnVybmVyIDIsIHlvdSBzZWUgYSBwb3QgMi48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PmFzc2lzdGFudAo8dGhpbms+CjwvdGhpbms+Cgo8YWN0aW9uPgpnbyB0byBzdG92ZWJ1cm5lciAzCjwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpPbiB0aGUgc3RvdmVidXJuZXIgMywgeW91IHNlZSBub3RoaW5nLjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+YXNzaXN0YW50Cjx0aGluaz4KPC90aGluaz4KCjxhY3Rpb24+CmdvIHRvIHN0b3ZlYnVybmVyIDQKPC9hY3Rpb24+PHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD51c2VyCk9uIHRoZSBzdG92ZWJ1cm5lciA0LCB5b3Ugc2VlIG5vdGhpbmcuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgo8L3RoaW5rPgoKPGFjdGlvbj4KZ28gdG8gbWljcm93YXZlIDEKPC9hY3Rpb24+PHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD51c2VyClRoZSBtaWNyb3dhdmUgMSBpcyBjbG9zZWQuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgo8L3RoaW5rPgoKPGFjdGlvbj4Kb3BlbiBtaWNyb3dhdmUgMQo8L2FjdGlvbj48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PnVzZXIKWW91IG9wZW4gdGhlIG1pY3Jvd2F2ZSAxLiBUaGUgbWljcm93YXZlIDEgaXMgb3Blbi4gSW4gaXQsIHlvdSBzZWUgbm90aGluZy48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PmFzc2lzdGFudAo8dGhpbms+CjwvdGhpbms+Cgo8YWN0aW9uPgpnbyB0byBmcmlkZ2UgMQo8L2FjdGlvbj48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PnVzZXIKVGhlIGZyaWRnZSAxIGlzIGNsb3NlZC48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PmFzc2lzdGFudAo8dGhpbms+CjwvdGhpbms+Cgo8YWN0aW9uPgpvcGVuIGZyaWRnZSAxCjwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpZb3Ugb3BlbiB0aGUgZnJpZGdlIDEuIFRoZSBmcmlkZ2UgMSBpcyBvcGVuLiBJbiBpdCwgeW91IHNlZSBhIGFwcGxlIDIsIGEgYnJlYWQgMSwgYSBlZ2cgMiwgYSBwb3RhdG8gMiwgYSBwb3RhdG8gMSwgYW5kIGEgdG9tYXRvIDIuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgo8L3RoaW5rPgoKPGFjdGlvbj4KaW52ZW50b3J5CjwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpZb3UgYXJlIG5vdCBjYXJyeWluZyBhbnl0aGluZy48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PmFzc2lzdGFudAo8dGhpbms+CjwvdGhpbms+Cgo8YWN0aW9uPgp0YWtlIHBvdGF0byAxIGZyb20gZnJpZGdlIDEKPC9hY3Rpb24+PHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD51c2VyCllvdSBwaWNrIHVwIHRoZSBwb3RhdG8gMSBmcm9tIHRoZSBmcmlkZ2UgMS48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PmFzc2lzdGFudAo8dGhpbms+CjwvdGhpbms+Cgo8YWN0aW9uPgpnbyB0byBtaWNyb3dhdmUgMQo8L2FjdGlvbj48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PnVzZXIKVGhlIG1pY3Jvd2F2ZSAxIGlzIG9wZW4uIEluIGl0LCB5b3Ugc2VlIG5vdGhpbmcuPHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD5hc3Npc3RhbnQKPHRoaW5rPgo8L3RoaW5rPgoKPGFjdGlvbj4KaGVhdCBwb3RhdG8gMSB3aXRoIG1pY3Jvd2F2ZSAxCjwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpZb3UgaGVhdCB0aGUgcG90YXRvIDEgdXNpbmcgdGhlIG1pY3Jvd2F2ZSAxLjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+YXNzaXN0YW50Cjx0aGluaz4KPC90aGluaz4KCjxhY3Rpb24+CmdvIHRvIGZyaWRnZSAxCjwvYWN0aW9uPjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+dXNlcgpUaGUgZnJpZGdlIDEgaXMgb3Blbi4gSW4gaXQsIHlvdSBzZWUgYSBhcHBsZSAyLCBhIGJyZWFkIDEsIGEgZWdnIDIsIGEgcG90YXRvIDIsIGFuZCBhIHRvbWF0byAyLjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+YXNzaXN0YW50Cjx0aGluaz4KPC90aGluaz4KCjxhY3Rpb24+CnB1dCBwb3RhdG8gMSBpbi9vbiBmcmlkZ2UgMQo8L2FjdGlvbj48fGltX2VuZHw+Cjx8aW1fc3RhcnR8PnVzZXIKWW91IHB1dCB0aGUgcG90YXRvIDEgaW4vb24gdGhlIGZyaWRnZSAxLjx8aW1fZW5kfD4KPHxpbV9zdGFydHw+YXNzaXN0YW50)Task:putahotpotatoinfridge\.Response:<\|im\_start\|\>assistant<think\>\.\.\.Let’sstartbycheckingthecountertopsandothervisibleplaces\.</think\><action\>look</action\><\|im\_end\|\><\|im\_start\|\>userYouareinthemiddleofaroom\.Lookingquicklyaroundyou,youseenothing\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>gotocountertop1</action\><\|im\_end\|\><\|im\_start\|\>userOnthecountertop1,youseeadishsponge3,adishsponge2,afork1,aspoon1,andatomato1\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>gotocountertop2</action\><\|im\_end\|\><\|im\_start\|\>userOnthecountertop2,youseeabread2,acup1,andafork3\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>gotocountertop3</action\><\|im\_end\|\><\|im\_start\|\>userOnthecountertop3,youseeabutterknife2,abutterknife1,alettuce3,alettuce2,amug1,asoapbottle3,andaspatula1\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>gotostoveburner1</action\><\|im\_end\|\><\|im\_start\|\>userOnthestoveburner1,youseeapot1\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>gotostoveburner2</action\><\|im\_end\|\><\|im\_start\|\>userOnthestoveburner2,youseeapot2\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>gotostoveburner3</action\><\|im\_end\|\><\|im\_start\|\>userOnthestoveburner3,youseenothing\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>gotostoveburner4</action\><\|im\_end\|\><\|im\_start\|\>userOnthestoveburner4,youseenothing\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>gotomicrowave1</action\><\|im\_end\|\><\|im\_start\|\>userThemicrowave1isclosed\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>openmicrowave1</action\><\|im\_end\|\><\|im\_start\|\>userYouopenthemicrowave1\.Themicrowave1isopen\.Init,youseenothing\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>gotofridge1</action\><\|im\_end\|\><\|im\_start\|\>userThefridge1isclosed\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>openfridge1</action\><\|im\_end\|\><\|im\_start\|\>userYouopenthefridge1\.Thefridge1isopen\.Init,youseeaapple2,abread1,aegg2,apotato2,apotato1,andatomato2\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>inventory</action\><\|im\_end\|\><\|im\_start\|\>userYouarenotcarryinganything\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>takepotato1fromfridge1</action\><\|im\_end\|\><\|im\_start\|\>userYoupickupthepotato1fromthefridge1\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>gotomicrowave1</action\><\|im\_end\|\><\|im\_start\|\>userThemicrowave1isopen\.Init,youseenothing\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>heatpotato1withmicrowave1</action\><\|im\_end\|\><\|im\_start\|\>userYouheatthepotato1usingthemicrowave1\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>gotofridge1</action\><\|im\_end\|\><\|im\_start\|\>userThefridge1isopen\.Init,youseeaapple2,abread1,aegg2,apotato2,andatomato2\.<\|im\_end\|\><\|im\_start\|\>assistant<think\></think\><action\>putpotato1in/onfridge1</action\><\|im\_end\|\><\|im\_start\|\>userYouputthepotato1in/onthefridge1\.<\|im\_end\|\><\|im\_start\|\>assistantReward = 1

## Appendix FCritic Examples

The following box shows an example of the learned critic diagnosing a failed ALFWorld trajectory and producing concise feedback for the next solver attempt\.

ALFWorld Critique[⬇](data:text/plain;base64,PHxpbV9zdGFydHw+c3lzdGVtCllvdSBhcmUgdGhlIGNyaXRpYyBhZ2VudC4gWW91IHdpbGwgYmUgZ2l2ZW4gYW4gZXhlY3V0b3IgdHJhamVjdG9yeSB0aGF0IGZhaWwgdG8gY29tcGxldGUgdGhlIHRhc2suIElkZW50aWZ5IHRoZSBjb25jcmV0ZSBmYWlsdXJlIG1vZGUgYW5kIHByb2R1Y2UgYSBzaG9ydCBzZXQgb2YgY29yZSB0aXBzIHRoYXQgY2FuIGhlbHAgdGhlIG5leHQgZXhlY3V0b3Igcm91bmQgc3VjY2VlZC4KCkVudmlyb25tZW50Ogp7ZW52aXJvbm1lbnR9CgpSdWxlczoKLSBEaWFnbm9zZSB0aGUgaGlnaGVzdC12YWx1ZSBtaXN0YWtlLgotIEZvY3VzIG9uIGFjdGlvbmFibGUgY29ycmVjdGlvbnMgZm9yIHRoZSBuZXh0IGV4ZWN1dG9yIHJ1bi4KLSBUaGUgbmV4dCByb3VuZCBpcyBhIGZyZXNoIHJldHJ5IGZyb20gdGhlIGluaXRpYWwgb2JzZXJ2YXRpb24uCi0gRG8gbm90IG91dHB1dCBhY3Rpb25zLCBYTUwgYWN0aW9uIHRhZ3MsIG9yIGxvbmcgbmFycmF0aXZlcy4KLSBTdW1tYXJpemUgb25seSB0aGUgcmV1c2FibGUgY29yZSB0aXBzLgoKT3V0cHV0IGZvcm1hdDoKPHRoaW5rPgpzaG9ydCByZWFzb24KPC90aGluaz4KPGNyaXRpYz4KVGlwIDE6IC4uLgpUaXAgMjogLi4uClRpcCAzOiAuLi4KPC9jcml0aWM+PHxpbV9lbmR8Pgo8fGltX3N0YXJ0fD51c2VyCiMgVGFzawp7dGFza30KIyBUcmFqZWN0b3J5IFN1bW1hcnkKcmV3YXJkOiB7cmV3YXJkfQp0cmFqZWN0b3J5OiB7dHJhamVjdG9yeX0KPHxpbV9zdGFydHw+YXNzaXN0YW50Cjx0aGluaz4KT2theSwgbGV0J3MgbG9vayBhdCB0aGUgdHJhamVjdG9yeSB0aGUgZXhlY3V0b3IgdG9vay4gVGhlIHRhc2sgd2FzIHRvIGNvb2wgYW4gYXBwbGUgYW5kIHB1dCBpdCBpbiB0aGUgbWljcm93YXZlLiBUaGUgZXhlY3V0b3Igc3RhcnRlZCBieSBnb2luZyB0byB0aGUgZnJpZGdlLCBvcGVuZWQgaXQsIGJ1dCBkaWRuJ3QgZmluZCBhbiBhcHBsZSB0aGVyZS4gVGhlbiB0aGV5IGNoZWNrZWQgY291bnRlcnRvcHMgYW5kIGZvdW5kIGFwcGxlIDEgb24gY291bnRlcnRvcCAzLCB0b29rIGl0LCB3ZW50IHRvIHRoZSBzaW5rLCB0cmllZCB0byBjb29sIGl0IHdpdGggdGhlIHNpbmssIGJ1dCBub3RoaW5nIGhhcHBlbmVkLiBUaGVuIHRoZXkgY2xlYW5lZCBpdCwgcHV0IGl0IGluIHRoZSBtaWNyb3dhdmUsIGNsb3NlZCBpdCwgYnV0IHdoZW4gdHJ5aW5nIHRvIHRha2UgaXQgYmFjaywgaXQgZGlkbid0IHdvcmsuIFRoZXkga2VwdCBsb29waW5nIGJldHdlZW4gZnJpZGdlIGFuZCBzaW5rLCB0cnlpbmcgdG8gY29vbCB0aGUgYXBwbGUgYWdhaW4uCgpIbW0sIHRoZSBtYWluIGlzc3VlIGhlcmUgaXMgdGhhdCB0aGUgZXhlY3V0b3IgZGlkbid0IHJlYWxpemUgdGhhdCBjb29saW5nIHRoZSBhcHBsZSByZXF1aXJlcyBwbGFjaW5nIGl0IGluIHRoZSBmcmlkZ2UgZmlyc3QuIFRoZSBhdmFpbGFibGUgYWN0aW9ucyBtaWdodCBoYXZlIGEgc3BlY2lmaWMgd2F5IHRvIGNvb2wgYW4gaXRlbSwgbGlrZSBwdXR0aW5nIGl0IGluIHRoZSBmcmlkZ2UuIEluc3RlYWQgb2YgdXNpbmcgdGhlIHNpbmsgb3IgdHJ5aW5nIHRvIGNsZWFuIGl0LCB0aGV5IHNob3VsZCBoYXZlIHBsYWNlZCB0aGUgYXBwbGUgaW4gdGhlIGZyaWRnZSB0byBjb29sIGl0LiBBbHNvLCBhZnRlciBjb29saW5nLCB0aGV5IG5lZWRlZCB0byB0YWtlIGl0IG91dCBvZiB0aGUgZnJpZGdlIGFuZCB0aGVuIHB1dCBpdCBpbiB0aGUgbWljcm93YXZlLiBUaGUgZXhlY3V0b3Iga2VwdCB0cnlpbmcgdG8gY29vbCBpdCB3aXRoIHRoZSBzaW5rLCB3aGljaCBpc24ndCB0aGUgY29ycmVjdCBtZXRob2QuIFNvIHRoZSBjb3JlIG1pc3Rha2VzIGFyZSBub3QgdXNpbmcgdGhlIGZyaWRnZSBmb3IgY29vbGluZyBhbmQgbm90IHByb3Blcmx5IHJldHJpZXZpbmcgdGhlIGFwcGxlIGFmdGVyIGNvb2xpbmcuCjwvdGhpbms+Cgo8Y3JpdGljPgpUaXAgMTogVXNlIGZyaWRnZSBmb3IgY29vbGluZywgbm90IHNpbmsuIENvb2xpbmcgcmVxdWlyZXMgcGxhY2luZyBpdGVtIGluIGZyaWRnZSwgbm90IGludGVyYWN0aW5nIHdpdGggc2luay4KVGlwIDI6IEFmdGVyIGNvb2xpbmcsIHJldHJpZXZlIGFwcGxlIGZyb20gZnJpZGdlIGJlZm9yZSBwbGFjaW5nIGluIG1pY3Jvd2F2ZS4KVGlwIDM6IEF2b2lkIHJlZHVuZGFudCBhY3Rpb25zIC0gZG9uJ3QgcmVwZWF0ZWRseSB0cnkgdG8gY29vbCBhbHJlYWR5IGNvb2xlZCBhcHBsZS4KPC9jcml0aWM+PHxpbV9lbmR8Pgo=)<\|im\_start\|\>systemYouarethecriticagent\.Youwillbegivenanexecutortrajectorythatfailtocompletethetask\.Identifytheconcretefailuremodeandproduceashortsetofcoretipsthatcanhelpthenextexecutorroundsucceed\.Environment:\{environment\}Rules:\-Diagnosethehighest\-valuemistake\.\-Focusonactionablecorrectionsforthenextexecutorrun\.\-Thenextroundisafreshretryfromtheinitialobservation\.\-Donotoutputactions,XMLactiontags,orlongnarratives\.\-Summarizeonlythereusablecoretips\.Outputformat:<think\>shortreason</think\><critic\>Tip1:\.\.\.Tip2:\.\.\.Tip3:\.\.\.</critic\><\|im\_end\|\><\|im\_start\|\>user\#Task\{task\}\#TrajectorySummaryreward:\{reward\}trajectory:\{trajectory\}<\|im\_start\|\>assistant<think\>Okay,let’slookatthetrajectorytheexecutortook\.Thetaskwastocoolanappleandputitinthemicrowave\.Theexecutorstartedbygoingtothefridge,openedit,butdidn’tfindanapplethere\.Thentheycheckedcountertopsandfoundapple1oncountertop3,tookit,wenttothesink,triedtocoolitwiththesink,butnothinghappened\.Thentheycleanedit,putitinthemicrowave,closedit,butwhentryingtotakeitback,itdidn’twork\.Theykeptloopingbetweenfridgeandsink,tryingtocooltheappleagain\.Hmm,themainissuehereisthattheexecutordidn’trealizethatcoolingtheapplerequiresplacingitinthefridgefirst\.Theavailableactionsmighthaveaspecificwaytocoolanitem,likeputtingitinthefridge\.Insteadofusingthesinkortryingtocleanit,theyshouldhaveplacedtheappleinthefridgetocoolit\.Also,aftercooling,theyneededtotakeitoutofthefridgeandthenputitinthemicrowave\.Theexecutorkepttryingtocoolitwiththesink,whichisn’tthecorrectmethod\.Sothecoremistakesarenotusingthefridgeforcoolingandnotproperlyretrievingtheappleaftercooling\.</think\><critic\>Tip1:Usefridgeforcooling,notsink\.Coolingrequiresplacingiteminfridge,notinteractingwithsink\.Tip2:Aftercooling,retrieveapplefromfridgebeforeplacinginmicrowave\.Tip3:Avoidredundantactions\-don’trepeatedlytrytocoolalreadycooledapple\.</critic\><\|im\_end\|\>Reward = 1
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

Similar Articles

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

Self-Review Reinforcement Learning (SRRL) with Cross-Episode Memory and Policy Distillation

Critic Experience Bank: Self-Evolving Step-Level Confidence Estimation for LLM Agents

Rethinking Groups in Critic-Free RLVR

Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

Submit Feedback

Similar Articles

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning
Self-Review Reinforcement Learning (SRRL) with Cross-Episode Memory and Policy Distillation
Critic Experience Bank: Self-Evolving Step-Level Confidence Estimation for LLM Agents
Rethinking Groups in Critic-Free RLVR
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards