Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution
Summary
Role-Agent introduces a framework where a single LLM acts as both agent and environment, enabling bootstrapped co-evolution via World-In-Agent and Agent-In-World components. It achieves over 4% average improvement on multiple benchmarks over strong baselines.
# Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution
Source: [https://arxiv.org/html/2606.10917](https://arxiv.org/html/2606.10917)
Xucong Wang1,2, Ziyu Ma2, Shidong Yang2, Tongwen Huang2, Pengkun Wang1†, Yong Wang2†, Xiangxiang Chu2

1 University of Science and Technology of China; 2 AMAP, Alibaba Group

GitHub: [https://github.com/AMAP-ML/roleagent](https://github.com/AMAP-ML/roleagent)

Work done during internship at AMAP, Alibaba. † Project lead: Yong Wang; corresponding authors: Yong Wang and Pengkun Wang.
###### Abstract
Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, which hinder broader generalization. To address these limitations, this paper introduces Role-Agent, a framework that harnesses a single LLM to function concurrently as both the agent and the environment, enabling a bootstrapped co-evolution. Role-Agent comprises two synergistic components: World-In-Agent (WIA) and Agent-In-World (AIW). In WIA, the LLM acts as the agent and predicts future states after each action; the alignment between predicted and actual states is then used as a process reward, encouraging environment-aware reasoning. In AIW, the LLM analyzes failure modes from failed trajectories and retrieves tasks with similar failure patterns, thereby reshaping the training data distribution for targeted practice. Experiments on multiple benchmarks show that Role-Agent consistently improves performance, yielding an average gain of over 4% over strong baselines.
## 1 Introduction
Beyond simple question answering, Large Language Model (LLM) agents Team et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib4)); Yang et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib62)); Chen et al. ([2025a](https://arxiv.org/html/2606.10917#bib.bib1)); Ou et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib30)); Ma et al. ([2026](https://arxiv.org/html/2606.10917#bib.bib48)) have found wide application in complex real-world challenges, owing to their unique abilities to think, reason, and reflect Yao et al. ([2022b](https://arxiv.org/html/2606.10917#bib.bib57)); Shinn et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib58)); Liu et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib26)); Dong et al. ([2025a](https://arxiv.org/html/2606.10917#bib.bib28)) within their environments. In more dynamic applications such as coding Jiang et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib27)), navigation Comanici et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib61)), deep research Citron ([2024](https://arxiv.org/html/2606.10917#bib.bib32)); Team et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib31)), and embodied applications Shridhar et al. ([2020](https://arxiv.org/html/2606.10917#bib.bib2)), the multi-turn tool-use and long-horizon capabilities of agents are critical and have therefore been widely explored Liu et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib26)); Dong et al. ([2025b](https://arxiv.org/html/2606.10917#bib.bib60)).
Building on the use of Reinforcement Learning (RL) in LLM post-training Schulman et al. ([2017](https://arxiv.org/html/2606.10917#bib.bib45)); Rafailov et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib54)); Chu et al. ([2026](https://arxiv.org/html/2606.10917#bib.bib49)), Agentic Reinforcement Learning (ARL) incorporates full interaction rollout trajectories into the RL framework, enabling agents to optimize their problem-solving abilities through environment feedback. In contrast to supervised fine-tuning Zhang and Zhang ([2024](https://arxiv.org/html/2606.10917#bib.bib18)), where agents are trained to mimic expert behavior, ARL allows greater solution diversity Guo et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib53)) and can substantially enhance agents’ reasoning and problem-solving capabilities.
Figure 1: (a) Static environments provide sparse and non-specific feedback that limits the agent’s exploration; (b) Synthetic environments incur high labor and runtime costs; (c) The proposed Role-Agent enables one model to switch roles between agent and environment to achieve bootstrapped co-evolution.

Beyond using expert trajectories and static rewards to optimize agent policies, recent studies of self-evolving agents Gao et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib23)) focus on continuous capability growth by autonomously discovering their own deficiencies and updating the agent harness Fernando et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib17)); Hemberg et al. ([2024](https://arxiv.org/html/2606.10917#bib.bib16)); Zhang et al. ([2025a](https://arxiv.org/html/2606.10917#bib.bib21), [2026b](https://arxiv.org/html/2606.10917#bib.bib20)); Anthropic ([2024](https://arxiv.org/html/2606.10917#bib.bib19)); Xia et al. ([2026](https://arxiv.org/html/2606.10917#bib.bib22)). However, most existing methods evolve only the agent itself while treating the environment as a fixed source of tasks, observations, and rewards. Such an environment fails to expose the agent’s hidden weaknesses or provide feedback targeted to its current failure modes. A more desirable paradigm is the synthetic environment, where the agent improves through interaction while the environment also adapts to diagnose the agent’s deficiencies and present more challenges. Yet building such an adaptive environment often requires additional environment models, task generators, or scheduling mechanisms Zhuo et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib59)); Xue et al. ([2026](https://arxiv.org/html/2606.10917#bib.bib14)), which increases deployment complexity. This raises a natural question: can we achieve agent-environment co-evolution by using a single LLM to act as both the agent and the environment?
Guided by this idea, we propose Role\-Agent, which enables bootstrapped agent\-environment co\-evolution by using a single LLM as a dual\-role entity\. Role\-Agent consists of:
\(a\) World\-In\-Agent \(WIA\), where the LLM agent predicts the future observations resulting from its actions, thereby incorporating environment priors into its rollouts\. Role\-Agent measures the gap between agent\-predicted future states and actual states to estimate the agent’s ability to predict action consequences\. By integrating this measure into reward and credit assignment, WIA encourages more reliable decision\-making in states where action consequences are uncertain\.
\(b\) Agent\-In\-World \(AIW\), where the same LLM provides environment feedback and adapts the data distribution to prioritize difficult and easily overlooked tasks\. Specifically, we instruct the LLM to analyze failed trajectories step by step, producing failure modes and reflections that reveal the root causes of failure\. We then retrieve tasks with similar failure modes and adjust the data distribution accordingly, enabling the agent to focus training on its historical deficiencies\.
Extensive experiments demonstrate that Role\-Agent consistently outperforms existing approaches, showing that a single LLM can serve as both agent and environment to achieve practical gains in text\-based interactive environments\. Our contributions are threefold:
- Different from agent-side self-improvement and state-grouped RL methods, we investigate bootstrapped agent-environment co-evolution without human supervision.
- We propose Role-Agent, which uses the World-In-Agent and Agent-In-World modules to cast a single LLM in dual roles, enabling fine-grained environment prediction and adaptive task redistribution.
- Extensive experiments demonstrate that Role-Agent achieves substantial improvements over strong baselines across diverse benchmarks.
## 2 Related Work
#### Large Language Model \(LLM\) Agents\.
Large language models (LLMs) are increasingly being adopted as autonomous agents across a wide range of domains Wang et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib33), [2024](https://arxiv.org/html/2606.10917#bib.bib36)); Jiang et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib27)); Ou et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib30)). Early LLM agents were equipped with tool-use Yao et al. ([2022b](https://arxiv.org/html/2606.10917#bib.bib57)), reflection Shinn et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib58)), or memory schemes Xu et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib56)); Fang et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib25)); Zhang et al. ([2026a](https://arxiv.org/html/2606.10917#bib.bib24)) to transform LLM backbones into autonomous, interactive agents. More recent studies incorporate RL methods Lambert et al. ([2024](https://arxiv.org/html/2606.10917#bib.bib37)) to endow agents with long-horizon reasoning and multi-turn interaction abilities, exemplified by PPO Schulman et al. ([2017](https://arxiv.org/html/2606.10917#bib.bib45)), DPO Rafailov et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib54)), GRPO Guo et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib53)), DAPO Yu et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib51)), GSPO Zheng et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib50)), and GPG Chu et al. ([2026](https://arxiv.org/html/2606.10917#bib.bib49)). While these approaches sample full tool-use trajectories and leverage final outcome rewards with limited extra supervision, another line of studies adopts process reward models Shao et al. ([2024](https://arxiv.org/html/2606.10917#bib.bib46)); Zhang et al. ([2025b](https://arxiv.org/html/2606.10917#bib.bib34)); Wang et al. ([2025b](https://arxiv.org/html/2606.10917#bib.bib43)) to assign credit to each action, improving complex reasoning tasks.
#### Self\-Evolving Agents\.
Unlike agents optimized under fixed data distributions and tasks, self-evolving agents Gao et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib23)); Zhai et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib15)) emphasize autonomous capability iteration within dynamically evolving open environments. EvolveR Wu et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib13)) introduces a self-contained lifecycle where the agent distills its own experiences into principles and evolves its policy. Other works Hu et al. ([2024](https://arxiv.org/html/2606.10917#bib.bib38)); Novikov et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib39)) focus on automated exploration of agent design. MAE Chen et al. ([2025b](https://arxiv.org/html/2606.10917#bib.bib35)) instantiates three roles (Proposer, Solver and Judge) to co-evolve without human-curated data. More recently, AgentEvolver Zhai et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib15)) leverages self-questioning, self-navigation, and self-attribution to facilitate agent evolution. GiGPO Feng et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib47)) further introduces state-grouped advantage estimation for LLM agent RL. In contrast, Role-Agent achieves bootstrapped agent-environment co-evolution, differing from these studies, whose auxiliary roles mainly remain on the agent side.
## 3 Methodology
### 3.1 Preliminaries
#### Problem Setup\.
We first formalize general multi-step agent-environment interaction tasks as follows: given a task prompt $\bm{x}\in\mathcal{X}$, the agent generates an action $\bm{a}_{t}\in\mathcal{A}$ based on the current state $\bm{s}_{t}$ and its policy $\pi_{\theta}(\bm{a}_{t}|\bm{s}_{t},\bm{x})$ at each step $t$ ($1\leq t\leq T$, where $T$ is the interaction length of the trajectory and $\theta$ denotes the policy parameters). The environment then provides the next state $\bm{s}_{t+1}$ and an instant reward $r_{t}$. This yields a trajectory (rollout) $\bm{\tau}=\{(\bm{s}_{t},\bm{a}_{t},r_{t})\}_{t=1}^{T}$. We denote a batch of rollouts as $\mathcal{T}=\{\bm{\tau}_{i}\}_{i=1}^{N}$. Notably, in sparse-reward open-world applications, process-level rewards $r_{t}$ are often replaced by trajectory-level rewards $\mathcal{R}^{E}(\bm{\tau}_{i})$, such as whether the agent achieves the goal at the final step Shridhar et al. ([2020](https://arxiv.org/html/2606.10917#bib.bib2)).
Figure 2: Overview of the Role-Agent. A single LLM is leveraged to switch between the roles of agent and environment. As an agent, it is prompted to predict states for the next $H$ steps; the alignment between these predictions and ground-truth states serves as a reward signal to compute trajectory-level and state-level advantages. As the environment, it analyzes failure modes from failed trajectories and reshapes the data distribution by retrieving tasks with similar modes. This closed-loop process enables bootstrapped agent-environment co-evolution.
#### Agentic Reinforcement Learning (ARL).
ARL incorporates full trajectories of agent reasoning and actions Wang et al. ([2025a](https://arxiv.org/html/2606.10917#bib.bib29)) into the RL framework. A representative formulation is Group Relative Policy Optimization (GRPO) Guo et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib53)); for the task $\bm{x}$ and sampled rollouts $\{\bm{\tau}_{i}\}_{i=1}^{N}\sim\bm{\pi}_{old}$, GRPO is formulated as follows:

$$
\begin{split}
\mathcal{J}(\theta) &= \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|\bm{\tau}_{i}|}\sum_{t=1}^{|\bm{\tau}_{i}|}\min\Big(\rho_{\theta,t}^{(i)}A^{E}(\bm{\tau}_{i}),\ \mathrm{clip}\big(\rho_{\theta,t}^{(i)},1\pm\epsilon\big)A^{E}(\bm{\tau}_{i})\Big)-\beta\mathcal{D}_{KL}[\bm{\pi}_{\theta}\|\bm{\pi}_{ref}],\\
A^{E}(\bm{\tau}_{i}) &= \frac{\mathcal{R}^{E}(\bm{\tau}_{i})-\mathrm{avg}\big(\{\mathcal{R}^{E}(\bm{\tau}_{i})\}_{i=1}^{N}\big)}{\mathrm{std}\big(\{\mathcal{R}^{E}(\bm{\tau}_{i})\}_{i=1}^{N}\big)},
\end{split}
\tag{1}
$$

where $\bm{y}^{(i)}_{t}$ represents the partial trajectory under rollout $i$ at token $t$, and $\rho_{\theta,t}^{(i)}=\bm{\pi}_{\theta}(\bm{y}^{(i)}_{t}|\bm{x},\bm{y}^{(i)}_{<t})/\bm{\pi}_{old}(\bm{y}^{(i)}_{t}|\bm{x},\bm{y}^{(i)}_{<t})$ is the importance sampling ratio of $\bm{y}^{(i)}_{t}$ in rollout $i$; $\mathcal{D}_{KL}$, $\bm{\pi}_{old}$ and $\bm{\pi}_{ref}$ are the KL divergence, old policy and reference policy, respectively. $\beta$ controls the strength of the KL penalty. The following subsections present our proposed Role-Agent, which integrates the World-In-Agent (WIA) and Agent-In-World (AIW) designs to achieve bootstrapped agent-environment co-evolution.
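To make Eq. (1) concrete, the following is a minimal sketch of the group-relative advantage and the per-token clipped term. The function names, the use of the population standard deviation, the small `eps` stabilizer, and the default `clip_eps=0.2` are illustrative assumptions, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize trajectory-level rewards within one group of rollouts (A^E in Eq. 1).

    Assumption: population std plus a small eps to avoid division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-token term min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) from Eq. 1."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four rollouts of the same task with binary success rewards (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative advantages, and the clipping bounds how far a single update can move the policy away from the old policy.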
### 3.2 World-In-Agent (WIA)
Role-Agent first assigns the LLM the role of an agent and requires it to develop fine-grained, interleaved perception of the world. Inspired by world models Li et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib41)); Gu et al. ([2024](https://arxiv.org/html/2606.10917#bib.bib40)), we internalize environment dynamics into the agent by rewarding future-state prediction.
#### Predicting the Future State\.
During rollout, at each interaction step $t$, after the agent generates an action $\bm{a}_{t}$, we prompt it with the augmented prompt $\bm{x}_{pre}$ to predict the future states induced by this action. This encourages the agent to explicitly model how its actions may change the environment, rather than relying only on observed rewards. For each prediction horizon $h\in\{1,\ldots,H\}$, the agent predicts the state at step $t+h$:

$$
\hat{\bm{s}}_{t,h}\sim\bm{\pi}(\cdot\mid\bm{a}_{t},\bm{x}_{pre}),
\tag{2}
$$

where $\hat{\bm{s}}_{t,h}$ denotes the prediction made at step $t$ for the future state $\bm{s}_{t+h}$. We denote the prediction set at step $t$ as $\mathcal{E}_{pre,t}=\{\hat{\bm{s}}_{t,h}\mid 1\leq h\leq H\}$, and collect all prediction sets after rollout:

$$
\mathcal{E}_{pre}=\{\mathcal{E}_{pre,t}\mid 1\leq t\leq T\}.
\tag{3}
$$
Inspired by GiGPO Feng et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib47)), we measure the discrepancy between predicted and ground-truth states using the Longest Matching Subsequence (LMS) over their textual state contexts, yielding a predictive reward matrix $\tilde{\bm{r}}\in\mathbb{R}^{T\times H}$:

$$
\tilde{r}_{t,h}=\operatorname{LMS}(\hat{\bm{s}}_{t,h},\bm{s}_{t+h}),
\tag{4}
$$

where each $\tilde{r}_{t,h}\in[0,1]$ quantifies the agent’s foresight in predicting the state $h$ steps ahead. In implementation, predictive rewards are computed at the end of each rollout. In parallel, we obtain the full trajectory $\bm{\tau}=\{(\bm{s}_{t},\bm{a}_{t},r_{t})\}_{t=1}^{T}$. The task reward for each action $\bm{a}_{t}$ is computed as the discounted return from step $t$, while the predictive reward aggregates future-state prediction scores within horizon $H$:

$$
\mathcal{R}_{task}(\bm{a}_{t})=\sum_{k=t}^{T}\gamma^{k-t}r_{k},\qquad
\mathcal{R}_{pre}(\bm{a}_{t})=\sum_{h=1}^{H}\gamma^{h-1}\tilde{r}_{t,h}.
\tag{5}
$$
We combine the task and predictive rewards according to two principles: (a) accurate state prediction preserves and amplifies the original credit, reflecting reliable environment perception; and (b) inaccurate prediction weakens the advantage signal, reducing credit for actions that achieve high returns by chance. Thus, predictive reward serves as a reliability-aware modulation of task reward:

$$
\mathcal{R}_{t}=\mathcal{R}_{task}(\bm{a}_{t})\big(1+\mathcal{R}_{pre}(\bm{a}_{t})\big).
\tag{6}
$$

We use multiplication rather than addition so that predictive reward cannot independently introduce extra credit. Instead, it only modulates actions with non-zero task reward, preventing failed trajectories from being rewarded solely for plausible state predictions.
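The reward construction in Eqs. (4) to (6) can be sketched as follows. This is an illustration under stated assumptions: `difflib.SequenceMatcher.ratio()` is used only as a stand-in similarity in $[0,1]$ for the LMS score, horizons that run past the end of the trajectory are skipped, steps are 0-indexed, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(predicted: str, actual: str) -> float:
    """Stand-in for LMS(predicted, actual) in Eq. 4; returns a value in [0, 1]."""
    return SequenceMatcher(None, predicted, actual).ratio()


def task_return(rewards, t, gamma):
    """Discounted return from step t (0-indexed): sum_k gamma^(k - t) * r_k."""
    return sum(gamma ** i * r for i, r in enumerate(rewards[t:]))


def predictive_reward(predictions, states, t, gamma):
    """Sum over horizons h = 1..H of gamma^(h - 1) * similarity(pred_h, s_{t+h}).

    Assumption: horizons beyond the last observed state contribute nothing.
    """
    total = 0.0
    for h, predicted in enumerate(predictions, start=1):
        if t + h < len(states):
            total += gamma ** (h - 1) * similarity(predicted, states[t + h])
    return total


def modulated_reward(r_task, r_pre):
    """Eq. 6: the predictive reward scales the task reward multiplicatively."""
    return r_task * (1.0 + r_pre)


# Toy rollout: three states, success only at the final step (made-up values).
states = ["a", "b", "c"]
rewards = [0.0, 0.0, 1.0]
r_task = task_return(rewards, t=0, gamma=0.9)
r_pre = predictive_reward(["b", "x"], states, t=0, gamma=0.9)
r_total = modulated_reward(r_task, r_pre)
```

Because the combination is multiplicative, `modulated_reward(0.0, r_pre)` stays at zero for any `r_pre`, which mirrors the stated goal of not rewarding failed trajectories for plausible predictions alone.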
#### State Grouping & State\-level Advantage\.
Instead of employing only the trajectory-level advantage, we follow GiGPO Feng et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib47)) and observe that even within the same environment and initial settings, there can be significant redundancy among states in a trajectory. By grouping actions that occur under identical states and computing state-level advantages, we can more clearly attribute rewards at the state level, independent of their temporal ordering. Formally, we identify a set of non-repetitive states $\mathcal{O}=\{\bm{s}^{\dagger}_{o}\}_{o=1}^{|\mathcal{O}|}$ from the batch with hash-maps, then group the actions as:

$$
\mathcal{G}=\Big\{\big\{(\bm{s}_{t},\bm{a}_{t})\ \big|\ \mathrm{hash}(\bm{s}^{(i)}_{t})=\mathrm{hash}(\bm{s}^{\dagger}_{o})\big\}\ \Big|\ \bm{s}^{\dagger}_{o}\in\mathcal{O}\Big\}
\tag{7}
$$

Accordingly, we denote $\mathcal{G}^{(o)}=\{(\bm{s}^{(o)}_{t},\bm{a}^{(o)}_{t})\}$ as the set of state-action pairs grouped by $\bm{s}^{\dagger}_{o}$. Finally, the state-level advantage for each $\bm{a}_{t}^{(o)}$ is calculated as:

$$
A^{S}(\bm{a}^{(o)}_{t})=\frac{\mathcal{R}^{(o)}_{t}-\mathrm{avg}\big(\{\mathcal{R}^{(o)}_{t}\mid(\bm{s}^{(o)}_{t},\bm{a}^{(o)}_{t})\in\mathcal{G}^{(o)}\}\big)}{\mathrm{std}\big(\{\mathcal{R}^{(o)}_{t}\mid(\bm{s}^{(o)}_{t},\bm{a}^{(o)}_{t})\in\mathcal{G}^{(o)}\}\big)}
\tag{8}
$$

With the state-level advantages, we finally revise the trajectory-level policy optimization of GRPO into the following variant:

$$
\mathcal{J}_{ours}(\theta)=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{|\bm{\tau}_{i}|}\sum_{t=1}^{|\bm{\tau}_{i}|}\min\Big(\rho_{\theta,t}^{(i)}A(\bm{a}_{t}^{(i)}),\ \mathrm{clip}\big(\rho_{\theta,t}^{(i)},1\pm\epsilon\big)A(\bm{a}_{t}^{(i)})\Big)-\beta\mathcal{D}_{KL}[\pi_{\theta}\|\pi_{ref}]
\tag{9}
$$

where $\rho_{\theta,t}^{(i)}=\bm{\pi}_{\theta}(\bm{a}^{(i)}_{t}|\bm{s}_{t}^{(i)},\bm{y}^{(i)}_{<t})/\bm{\pi}_{old}(\bm{a}^{(i)}_{t}|\bm{s}_{t}^{(i)},\bm{y}^{(i)}_{<t})$ is the importance sampling ratio at step $t$ for rollout $i$. The advantage is derived from the trajectory-level and state-level advantages, linked with coefficient $\alpha$, i.e., $A(\bm{a}_{t}^{(i)})=A^{S}(\bm{a}^{(o)}_{t})+\alpha\cdot A^{E}(\bm{\tau}_{i})$, where $o$ denotes the group to which $\bm{a}_{t}^{(i)}$ belongs.
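A minimal sketch of the grouping in Eq. (7), the state-level advantage in Eq. (8), and the combined advantage is given below. It assumes exact string identity as the grouping key (a Python dict plays the role of the hash-map), the population standard deviation with a small `eps`, and hypothetical function names; singleton groups therefore receive a zero state-level advantage.

```python
from collections import defaultdict
import statistics


def state_level_advantages(samples, eps=1e-8):
    """samples: list of (state_text, reward) pairs pooled over a batch of rollouts.

    Returns one normalized advantage per sample, computed within the group of
    samples that share the same state (Eqs. 7 and 8).
    """
    groups = defaultdict(list)
    for index, (state, _) in enumerate(samples):
        groups[state].append(index)  # the dict hashes the state text for us

    advantages = [0.0] * len(samples)
    for indices in groups.values():
        group_rewards = [samples[i][1] for i in indices]
        mean = statistics.fmean(group_rewards)
        std = statistics.pstdev(group_rewards)
        for i in indices:
            advantages[i] = (samples[i][1] - mean) / (std + eps)
    return advantages


def combined_advantage(a_state, a_trajectory, alpha):
    """A = A^S + alpha * A^E, as defined after Eq. 9."""
    return a_state + alpha * a_trajectory


# Two actions taken from the same state and one from a different state.
samples = [("kitchen", 2.0), ("kitchen", 0.0), ("hall", 1.0)]
a_state = state_level_advantages(samples)
```

Actions are compared only against other actions taken from the same state, so credit no longer depends on where in the trajectory that state happened to occur.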
### 3.3 Agent-In-World (AIW)
Beyond enabling the agent to perceive world dynamics, we argue that the environment should also dynamically adjust itself based on the agent’s capability\. To this end, we propose Agent\-In\-World \(AIW\), which allows the agent itself to act as a source of environmental feedback\. By receiving, validating, and filtering its own interaction history, the agent expands the data distribution in a self\-regulated manner\.
#### Failure Mode Analysis\.
For each failed trajectory, we feed all interaction sequences, along with the task description and objective, into an LLM for analysis\. We prompt the LLM to identify one or more action patterns that led to the failure, and to generate a failure\-mode reflection that includes the failure type, core lessons, and query contexts to be used for retrieving similar tasks subsequently\.
#### Task Retrieval & Changing Data Distribution\.
Subsequently, we store these failure modes along with the corresponding failed trajectories and task information in an offline interaction history. The entire history of failure modes is then fed into the LLM, which is instructed to retrieve patterns similar to the current failure mode and return the indices of relevant interaction histories. In practice, we organize tasks under unique failure modes rather than referring to every failed trajectory. On ALFWorld, this library comprises 11 unique modes across training, and storage or retrieval costs remain negligible. Tasks grouped by shared failure modes highlight the LLM’s deficiencies and oversights when facing specific situations. Accordingly, we reintegrate these retrieved tasks into the training set. Compared with random failed-task replay or task-text retrieval, AIW retrieves by the underlying error pattern, which can connect surface-different tasks sharing the same procedural weakness. By using the same LLM to switch roles, we establish an agent-environment co-evolution without introducing a separate model in the fine-tuning stage.
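The bookkeeping behind this retrieval step might look like the sketch below. Everything here is hypothetical: the `FailureLibrary` class, the task identifiers, and the `keyword_match` function, which is only a placeholder for the LLM's similarity judgment so that the example runs without a model.

```python
from collections import defaultdict


class FailureLibrary:
    """Hypothetical store that organizes failed tasks under unique failure modes."""

    def __init__(self):
        self.tasks_by_mode = defaultdict(set)

    def record(self, mode, task_id):
        """Attach a failed task to the failure mode identified for it."""
        self.tasks_by_mode[mode].add(task_id)

    def retrieve(self, mode, match_fn):
        """Return tasks stored under every mode that match_fn judges similar to `mode`."""
        hits = set()
        for stored_mode, tasks in self.tasks_by_mode.items():
            if match_fn(mode, stored_mode):
                hits |= tasks
        return sorted(hits)


def keyword_match(a, b):
    """Placeholder for the LLM judgment: modes match if they share a word."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


def reshape_training_set(train_tasks, retrieved):
    """Re-add retrieved tasks so they are sampled more often in later epochs."""
    return list(train_tasks) + list(retrieved)


library = FailureLibrary()
library.record("wrong receptacle", "task-3")
library.record("wrong object", "task-7")
library.record("loop without progress", "task-9")

retrieved = library.retrieve("wrong receptacle", keyword_match)
new_train_set = reshape_training_set(["task-1", "task-3"], retrieved)
```

Keying the library by failure mode rather than by trajectory keeps it small, which is consistent with the negligible storage and retrieval cost reported above; appending duplicates is one simple way to upweight a task, chosen here for illustration.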
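| Backbone | Type | Method | ALFWorld Pick | ALFWorld Look | ALFWorld Clean | ALFWorld Heat | ALFWorld Cool | ALFWorld Pick2 | ALFWorld All | WebShop Score | WebShop Succ. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source Model | Prompting | GPT-4o | 75.3 | 60.8 | 31.2 | 56.7 | 21.6 | 49.8 | 48.0 | 31.8 | 23.7 |
| Closed-source Model | Prompting | Gemini-2.5-Pro | 92.8 | 63.3 | 62.1 | 69.0 | 26.6 | 58.7 | 60.3 | 42.5 | 35.9 |
| Qwen2.5-1.5B-Instruct | Prompting | Qwen-2.5 | 5.9 | 5.5 | 3.3 | 9.7 | 4.2 | 0.0 | 4.1 | 23.1 | 5.2 |
| Qwen2.5-1.5B-Instruct | Prompting | ReAct | 17.4 | 20.5 | 15.7 | 6.2 | 7.7 | 2.0 | 12.8 | 40.1 | 11.3 |
| Qwen2.5-1.5B-Instruct | Prompting | Reflexion | 35.3 | 22.2 | 21.7 | 13.6 | 19.4 | 3.7 | 21.8 | 55.8 | 21.9 |
| Qwen2.5-1.5B-Instruct | RL Training | PPO | 64.8 | 40.5 | 57.1 | 60.6 | 46.4 | 47.4 | 54.4 | 73.8 | 51.5 |
| Qwen2.5-1.5B-Instruct | RL Training | RLOO | 88.3 | 52.8 | 71.0 | 62.8 | 66.4 | 56.9 | 69.7 | 73.9 | 52.1 |
| Qwen2.5-1.5B-Instruct | RL Training | GRPO | 85.3 | 53.7 | 84.5 | 78.2 | 59.7 | 53.5 | 72.8 | 75.8 | 56.8 |
| Qwen2.5-1.5B-Instruct | RL Training | GiGPO | 94.4 | 67.5 | 94.8 | 94.4 | 79.8 | 76.4 | 86.7 | 83.1 | 65.0 |
| Qwen2.5-1.5B-Instruct | RL Training | Role-Agent | 95.8 | 78.3 | 95.0 | 97.0 | 87.5 | 91.7 | 90.9 | 87.7 | 71.9 |
| Qwen2.5-7B-Instruct | Prompting | Qwen-2.5 | 33.4 | 21.6 | 19.3 | 6.9 | 2.8 | 3.2 | 14.8 | 26.4 | 7.8 |
| Qwen2.5-7B-Instruct | Prompting | ReAct | 48.5 | 35.4 | 34.3 | 13.2 | 18.2 | 17.6 | 31.2 | 46.2 | 19.5 |
| Qwen2.5-7B-Instruct | Prompting | Reflexion | 62.0 | 41.6 | 44.9 | 30.9 | 36.3 | 23.8 | 42.7 | 58.1 | 28.8 |
| Qwen2.5-7B-Instruct | RL Training | PPO | 92.3 | 64.0 | 92.5 | 89.5 | 80.3 | 68.8 | 80.4 | 81.4 | 68.7 |
| Qwen2.5-7B-Instruct | RL Training | RLOO | 87.6 | 78.2 | 87.3 | 81.3 | 71.9 | 48.9 | 75.5 | 80.3 | 65.7 |
| Qwen2.5-7B-Instruct | RL Training | GRPO | 90.8 | 66.1 | 89.3 | 74.7 | 72.5 | 64.7 | 77.6 | 79.3 | 66.1 |
| Qwen2.5-7B-Instruct | RL Training | GiGPO | 97.7 | 82.7 | 98.8 | 83.7 | 89.3 | 79.2 | 90.8 | 84.4 | 72.8 |
| Qwen2.5-7B-Instruct | RL Training | Role-Agent | 98.3 | 93.7 | 98.5 | 88.9 | 90.0 | 92.8 | 93.8 | 88.0 | 77.1 |

Table 1: Performance comparison on ALFWorld and WebShop. We report the average success rate (%) for each task and the averaged performance in ALFWorld; for WebShop, we report the averaged score and success rate (%).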
## 4 Experiments
### 4.1 Experiment Setups
#### Benchmarks\.
We evaluate our method across three types of tasks: ALFWorld Shridhar et al. ([2020](https://arxiv.org/html/2606.10917#bib.bib2)), WebShop Yao et al. ([2022a](https://arxiv.org/html/2606.10917#bib.bib12)), and search-augmented question answering (QA). ALFWorld assesses multi-step decision-making through household tasks, in which the agent must navigate the environment using textual commands to achieve given goals. WebShop is a simulated e-commerce platform where agents interact with a realistic web interface containing over 1.18 million real-world products. Additionally, we employ search-augmented QA tasks, which include single-hop QA datasets such as NQ Kwiatkowski et al. ([2019](https://arxiv.org/html/2606.10917#bib.bib11)), TriviaQA Joshi et al. ([2017](https://arxiv.org/html/2606.10917#bib.bib10)), and PopQA Mallen et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib9)), as well as multi-hop QA datasets including HotpotQA Yang et al. ([2018](https://arxiv.org/html/2606.10917#bib.bib7)), 2WikiMultiHopQA Ho et al. ([2020](https://arxiv.org/html/2606.10917#bib.bib8)), MuSiQue Trivedi et al. ([2022](https://arxiv.org/html/2606.10917#bib.bib6)), and Bamboogle Press et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib3)). Together, these benchmarks enable a comprehensive evaluation of an agent’s ability to ground language while effectively leveraging external information.
#### Baselines\.
We compare Role-Agent with various competitive models, categorized as follows: (a) Closed-source models: GPT-4o Achiam et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib5)) and Gemini-2.5-Pro Team et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib4)), which achieve superior performance in general-purpose reasoning and understanding. (b) Prompt engineering methods: ReAct Yao et al. ([2022b](https://arxiv.org/html/2606.10917#bib.bib57)) and Reflexion Shinn et al. ([2023](https://arxiv.org/html/2606.10917#bib.bib58)), which leverage prompts to structure the multi-step behavior of agents. (c) RL training methods: PPO Schulman et al. ([2017](https://arxiv.org/html/2606.10917#bib.bib45)), which utilizes the collaboration between actor and critic networks along with a reward model; RLOO Kool et al. ([2019](https://arxiv.org/html/2606.10917#bib.bib44)); Ahmadian et al. ([2024](https://arxiv.org/html/2606.10917#bib.bib42)) and GRPO Guo et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib53)), which compute advantages within grouped trajectories. (d) Search-based models (evaluated only on search-QA tasks): R1-Instruct Jin et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib55)), Search-R1 Jin et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib55)), ZeroSearch Sun et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib52)), and StepSearch Wang et al. ([2025b](https://arxiv.org/html/2606.10917#bib.bib43)).
#### Implementation Details\.
We employ Qwen2.5-1.5B/3B/7B-Instruct as backbone models for all experiments. All baselines use the same values as our method for any shared hyper-parameters. Following Feng et al. ([2025](https://arxiv.org/html/2606.10917#bib.bib47)), the group size for RLOO and GRPO is set to 8. For the search tasks we use E5 as the retriever, with a group size of 5 and a maximum of 4 turns. All models are trained on a single node with 8 NVIDIA H20 GPUs. State grouping is performed by merging states whose longest-matching subsequence similarity exceeds 0.9. We keep this threshold from GiGPO for fair comparison; moreover, since the states of all datasets we employ are short and templated, a high threshold avoids the conflation of genuinely different states. The maximum number of steps $T_{max}$ for ALFWorld, WebShop and Search QA is 50, 15 and 4, respectively. Details of datasets, implementations and prompts are in Appendix [A](https://arxiv.org/html/2606.10917#A1) / [C](https://arxiv.org/html/2606.10917#A3) / [D](https://arxiv.org/html/2606.10917#A4), respectively.
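The threshold-based state merging described above can be illustrated with a greedy grouping pass. This is a sketch under assumptions: `difflib.SequenceMatcher.ratio()` stands in for the longest-matching subsequence similarity, each state is compared only against the first member of each existing group, and the function name and example observations are made up.

```python
from difflib import SequenceMatcher


def merge_states(states, threshold=0.9):
    """Assign each state a group id; states join the first group whose
    representative is more similar than `threshold`, else start a new group."""
    representatives = []
    assignment = []
    for state in states:
        for group_id, representative in enumerate(representatives):
            if SequenceMatcher(None, state, representative).ratio() > threshold:
                assignment.append(group_id)
                break
        else:
            # No sufficiently similar representative: open a new group.
            representatives.append(state)
            assignment.append(len(representatives) - 1)
    return assignment


observations = [
    "You open the drawer 1. It is empty.",
    "You open the drawer 1. It is empty",  # same state, trailing period lost
    "You are in the hallway.",
]
groups = merge_states(observations)
```

With short, templated observations, a high threshold such as 0.9 merges near-duplicates that differ only in formatting while keeping clearly different observations apart.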
| Type | Method | NQ† (Single-Hop) | TriviaQA∗ (Single-Hop) | PopQA∗ (Single-Hop) | HotpotQA† (Multi-Hop) | 2Wiki∗ (Multi-Hop) | MuSiQue∗ (Multi-Hop) | Bamboogle∗ (Multi-Hop) | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| RL Training | R1-Instruct | 27.0 | 53.7 | 19.9 | 23.7 | 29.2 | 7.2 | 29.3 | 27.1 |
| RL Training | Search-R1 | 34.1 | 54.5 | 37.8 | 32.4 | 31.9 | 10.3 | 26.4 | 32.5 |
| RL Training | ZeroSearch | 41.4 | 57.4 | 44.8 | 27.4 | 30.0 | 9.8 | 11.1 | 31.7 |
| RL Training | StepSearch | - | - | - | 34.5 | 32.0 | 17.4 | 34.4 | - |
| RL Training | GiGPO | 42.0 | 59.5 | 42.4 | 36.9 | 37.0 | 12.6 | 64.1 | 42.1 |
| RL Training | Role-Agent | 40.1 | 60.4 | 49.8 | 38.8 | 45.2 | 17.8 | 68.4 | 45.8 |

Table 2: Comparison on search-augmented QA tasks. Role-Agent is trained on NQ and HotpotQA. † and ∗ indicate in-domain and out-of-domain datasets, respectively. All methods are evaluated with Qwen2.5-3B-Instruct.
### 4.2 Experimental Results
#### Results on ALFWorld and WebShop\.
Table [1](https://arxiv.org/html/2606.10917#S3.T1) presents the comparison results, showing that Role-Agent consistently outperforms various existing baselines. Our key observations are as follows:

(a) Traditional prompt-based methods such as ReAct and Reflexion yield considerable gains over zero-shot models but still underperform compared to RL-based methods. Role-Agent, in particular, outperforms these approaches by an average of 78.0% on ALFWorld and 59.1% on WebShop in terms of success rate. While closed-source models like Gemini achieve competitive performance on specific tasks (e.g., 92.8% on Pick-ALFWorld), their average performance lags behind, underscoring both the difficulty of the tasks and the benefits of post-training. This suggests that prompts can enhance the in-context learning ability of agents but do not enable internal adaptation.

(b) RL training methods yield substantial gains with Qwen2.5-1.5B-Instruct: GRPO achieves 72.8% / 75.8% and GiGPO achieves 86.7% / 83.1% (ALFWorld success rate / WebShop score). The success of GiGPO stems from its group-level advantages, which unify action evaluation across different steps for the same state. Nevertheless, Role-Agent mostly outperforms GiGPO across both backbone models, with success-rate gains of 4.2 / 6.9 percentage points on ALFWorld / WebShop with Qwen2.5-1.5B-Instruct, validating that the co-evolution in Role-Agent equips the agent with stronger generalization abilities.

(c) Role-Agent demonstrates consistent superiority with the larger backbone model (Qwen2.5-7B-Instruct), achieving average gains of 3.8%. Moreover, improvements are more pronounced in complex and compositional tasks. For instance, Role-Agent shows a +11.0% increase on the Look task (i.e., look_at_obj_in_light) and a +13.6% increase on the Pick2 task (i.e., pick_two_obj_and_place), both of which require stable memory and multi-step planning to ensure the correctness of each sub-task. These results further indicate that the bootstrapped agent-environment co-evolution endows agents with substantial generalization capabilities.
| Method | ALFWorld | WebShop | Average |
| --- | --- | --- | --- |
| Role\-Agent | 90\.9 | 71\.9 | 81\.4 |
| \- w/o Agent\-In\-World | 87\.5 | 66\.9 | 77\.2 |
| \- w/o Predictive Reward | 88\.0 | 68\.3 | 78\.2 |
| GiGPO | 86\.7 | 65\.0 | 75\.9 |
Table 3: Ablation study of components with Qwen2\.5\-1\.5B\-Instruct\. We report the average success rate\.

#### Results on Search\-Augmented QA tasks\.

Table [2](https://arxiv.org/html/2606.10917#S4.T2) presents the results\. Role\-Agent achieves the best average performance of 45\.8%, outperforming the GiGPO average by 3\.7 percentage points\. Notably, the gains are more pronounced on multi\-hop QA tasks than on single\-hop ones, with improvements of \+8\.2 points on 2Wiki and \+5\.2 points on MuSiQue\. These results suggest that agent\-environment co\-evolution equips the agent with enhanced multi\-step retrieval and reasoning capabilities\. We also observe that Role\-Agent slightly underperforms GiGPO on the NQ dataset; we attribute this to Role\-Agent favoring generalization rather than overfitting to the training set\. Since search\-agent baselines differ in training and evaluation protocols, we use these results as a cross\-domain comparison and rely on the ALFWorld/WebShop comparisons for the most direct assessment\.
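The averages and gaps quoted for the search\-augmented QA tasks can be re\-derived from the per\-dataset scores in Table 2\. The sketch below only transcribes those scores; the variable names are illustrative, and the differences are absolute percentage points rather than relative changes\.

```python
# Scores transcribed from Table 2 (Qwen2.5-3B-Instruct).
DATASETS = ["NQ", "TriviaQA", "PopQA", "HotpotQA", "2Wiki", "MuSiQue", "Bamboogle"]
GIGPO = [42.0, 59.5, 42.4, 36.9, 37.0, 12.6, 64.1]
ROLE_AGENT = [40.1, 60.4, 49.8, 38.8, 45.2, 17.8, 68.4]


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


avg_gigpo = round(mean(GIGPO), 1)            # 42.1
avg_role_agent = round(mean(ROLE_AGENT), 1)  # 45.8
avg_gap = round(avg_role_agent - avg_gigpo, 1)  # 3.7

# Per-dataset difference (Role-Agent minus GiGPO), in percentage points.
gaps = {
    name: round(ours - base, 1)
    for name, ours, base in zip(DATASETS, ROLE_AGENT, GIGPO)
}

print(avg_gigpo, avg_role_agent, avg_gap)
print(gaps)
```

The per\-dataset dictionary makes the pattern in the text visible: the gap is negative only on NQ and largest on the multi\-hop sets 2Wiki and MuSiQue\.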
### 4\.3 Ablation Study & Sensitivity Analysis
#### Effects of Components\.

We conduct an ablation study by comparing the performance of Role\-Agent against variants with specific components removed\. The results are presented in Table [3](https://arxiv.org/html/2606.10917#S4.T3)\. Specifically, we find that removing either the AIW module or the predictive reward leads to a drop in overall performance, with the effect being more pronounced for AIW removal \(a 5\.0\-point decrease on WebShop\)\. This highlights the pivotal role of targeted environment feedback in AIW\. Without a dynamically adjusted data distribution, the agent lacks iterative practice on critical failure modes\. The results also confirm that the predictive reasoning in WIA equips the agent with valuable implicit world priors, enhancing its decision\-making capabilities at every step\. Notably, both ablated variants still outperform GiGPO on average, indicating that WIA and AIW each provide gains beyond state\-grouped credit assignment alone and are complementary rather than redundant\.
| Hyper\-Param\. | Value | ALFWorld | WebShop | Average |
| --- | --- | --- | --- | --- |
| Adv\. Scaling Coef\. $\alpha$ | 0\.5 | 89\.5 | 71\.0 | 80\.2 |
| | 1\.0 | 90\.9 | 71\.9 | 81\.4 |
| | 2\.0 | 86\.0 | 65\.4 | 75\.7 |
| \# Prediction Step $H$ | $5\% \cdot T_{max}$ | 90\.9 | 71\.9 | 81\.4 |
| | $10\% \cdot T_{max}$ | 90\.2 | 68\.5 | 79\.3 |
| | $20\% \cdot T_{max}$ | 75\.6 | 62\.3 | 69\.0 |
Table 4: Sensitivity analysis of hyper\-parameters with Qwen2\.5\-1\.5B\-Instruct \($H \geq 1$ and is rounded\)\.

Figure 3: Running dynamics on ALFWorld\. \(left\): success rate on the validation set; \(right\): the averaged difference between training and inference rollouts\.

Figure 4: Tasks of failure modes accumulated in training\. Tracked on ALFWorld with Qwen2\.5\-7B\-Instruct\.

Figure 5: Case study of Agent\-In\-World in Role\-Agent on ALFWorld, illustrating how the environment LLM extracts failure modes from failed trajectories and retrieves tasks with similar failure modes\.

Figure 6: Per\-step time breakdown of Role\-Agent\. The gray bar represents the average time of a complete generation\. The blue bar indicates the runtime of the comparative baseline \(GiGPO\), while the orange bars highlight the additional runtime from our method\.

#### Failure Mode Evolution\.

Figure [4](https://arxiv.org/html/2606.10917#S4.F4) visualizes the cumulative evolution of failure modes during training\. The total number of recorded failures grows from 996 at step 15 to 3931 at step 150, rising quickly in the early stage and then gradually saturating\. This suggests that tasks are assigned to failure\-mode buckets rapidly at first, and per\-mode accumulation then tapers off as the library becomes sufficiently populated\. Among the categories, repetitive exploration, wrong target location, and wrong receptacle account for a large proportion of failures, suggesting the importance of exploration and grounding\. The growth of the other modes shows that AIW captures diverse and fine\-grained weaknesses rather than a single dominant error type\. These results show that by accumulating structured failure modes over training, the environment can provide more targeted tasks for the agent to revisit its historical deficiencies\. Failure modes of all datasets are provided in Appendix C\.
#### Hyper\-parameter Sensitivity\.

We vary the advantage scaling coefficient $\alpha$ and the number of steps per prediction $H$, and report the results in Table [4](https://arxiv.org/html/2606.10917#S4.T4)\. When studying one hyper\-parameter, the other is fixed at its optimal value\. Our findings are:
\(a\): The coefficient $\alpha$ controls the balance between trajectory\-level and state\-level advantages in the final optimization signal\. When $\alpha = 0.5$, Role\-Agent achieves 89\.5% on ALFWorld and 71\.0% on WebShop, which is slightly lower than the default setting\. This indicates that under\-weighting the trajectory\-level advantage may weaken the global task\-completion signal\. When $\alpha$ is increased to 2\.0, the average performance drops markedly to 75\.7%, suggesting that excessive trajectory\-level weighting can dilute the fine\-grained state\-level credit assignment\. Therefore, setting $\alpha = 1.0$ provides a balanced integration of both advantage terms and achieves the best average performance\.
\(b\): Increasing $H$ beyond $5\% \cdot T_{max}$ \($T_{max}$ is the maximum number of interaction steps\) degrades performance, and the decline becomes sharp at larger values\. For instance, at $H = 10\% \cdot T_{max}$, the success rate on WebShop drops to 68\.5%; at $H = 20\% \cdot T_{max}$, the average falls to 69\.0%, rendering Role\-Agent ineffective and causing it to underperform most RL methods\. This degradation may arise because excessive predictions occupy the context window and diminish the agent’s focus on action planning\. Additionally, predicting states too far beyond the current context can lead to speculative guesswork and reward hacking\. Therefore, setting $H = 5\% \cdot T_{max}$ offers the best trade\-off between efficiency and effectiveness among the tested values\.
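To make the prediction\-step settings concrete, the sketch below converts each percentage into an integer $H$ for the three $T_{max}$ values given in the implementation details \(50, 15, 4\)\. The text only states that $H \geq 1$ and is rounded; the tie\-breaking rule is not specified, so this sketch assumes Python's built\-in `round` \(ties to even\), and the function and dictionary names are hypothetical\.

```python
from fractions import Fraction

# Maximum interaction steps per benchmark, as stated in the text.
T_MAX = {"ALFWorld": 50, "WebShop": 15, "Search QA": 4}


def prediction_steps(percent: int, t_max: int) -> int:
    """H = max(1, round(percent% * T_max)).

    Exact rational arithmetic avoids floating-point surprises.
    Assumption: ties are rounded to the nearest even integer (Python's round).
    """
    return max(1, round(Fraction(percent, 100) * t_max))


for percent in (5, 10, 20):
    row = {env: prediction_steps(percent, t_max) for env, t_max in T_MAX.items()}
    print(f"{percent}%: {row}")
```

Under this assumption, the lower bound $H \geq 1$ is what keeps the setting meaningful for short\-horizon tasks: with $T_{max} = 4$, all three percentages collapse to $H = 1$\.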
#### Running Dynamics\.

Figure [3](https://arxiv.org/html/2606.10917#S4.F3) compares the training curves of Role\-Agent and GiGPO\. In Figure [3](https://arxiv.org/html/2606.10917#S4.F3) \(left\), we find that while Role\-Agent falls behind GiGPO or fluctuates in the early stage, it generally achieves a higher performance ceiling \(90\.9%\) and faster convergence\. This suggests that the effects of closed\-loop agent\-environment evolution intensify as the adjusted data distribution accumulates, with the agent receiving targeted training on its failures\. In Figure [3](https://arxiv.org/html/2606.10917#S4.F3) \(right\), we plot the difference between training and inference rollouts\. Compared with GiGPO, Role\-Agent substantially mitigates the train\-inference mismatch\. Higher consistency between the rollout and training policies leads to lower variance in gradient estimation and improves training stability\.
#### Efficiency Study\.

Figure [6](https://arxiv.org/html/2606.10917#S4.F6) compares the running time of different components, with Role\-Agent\-specific costs highlighted in orange\. All efficiency results are measured on ALFWorld\.
The extra predictions during rollout, the calculation of the predictive reward, and the Agent\-In\-World feedback are minor \(18\.63s, 0\.14s, and 8\.92s, respectively\) compared with the overall running time per step, inducing only about 5\.2% extra computation\. The state comparison contains only the task description and two short state descriptions, and the retrieval repository contains only a small number of unique failure modes\. Additional retrieved tasks alter the sampling distribution but do not require a separate model\. Together with the gains in Table [1](https://arxiv.org/html/2606.10917#S3.T1), these results show that Role\-Agent balances efficiency and effectiveness\.
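A back\-of\-the\-envelope check of the overhead figures is sketched below\. The three per\-step costs and the 5\.2% figure come from the text; the implied per\-step time is only an estimate that assumes the 5\.2% is measured relative to the per\-step time without these extras, which the text does not state explicitly\.

```python
# Per-step extra costs in seconds, as quoted in the text.
EXTRA_SECONDS = {
    "rollout predictions": 18.63,
    "predictive reward": 0.14,
    "Agent-In-World feedback": 8.92,
}
REPORTED_OVERHEAD = 0.052  # "about 5.2%" from the text

extra_total = sum(EXTRA_SECONDS.values())
shares = {name: secs / extra_total for name, secs in EXTRA_SECONDS.items()}

# Assumption: the overhead is relative to the per-step time without the extras.
implied_step_seconds = extra_total / REPORTED_OVERHEAD

print(f"extra per step: {extra_total:.2f}s")
for name, share in shares.items():
    print(f"  {name}: {share:.1%} of the extra cost")
print(f"implied per-step time (assumption): ~{implied_step_seconds:.0f}s")
```

The breakdown shows that the rollout predictions account for roughly two thirds of the added time, while the reward computation itself is negligible\.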
### 4\.4 Case Study
Figure [5](https://arxiv.org/html/2606.10917#S4.F5) further illustrates how the environment LLM adjusts the data distribution by analyzing failure trajectories\. In the shown failed trajectory, the agent mistakenly picks "Apple 2" from the fridge in step 3\. The environment LLM then identifies the failure mode as ENTITY\_CONFUSION, along with a description of how the failure occurred and queries for retrieving similar failure modes\. Finally, it searches for analogous failures in the stored history of \(task, failure mode\) pairs\. This workflow shows how structured failure analysis enables more targeted subsequent training\.
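The last step of this workflow, looking up analogous failures in a stored history of \(task, failure mode\) pairs, can be sketched as follows\. This is a deliberately simplified illustration and not the authors' implementation: it matches on the exact failure\-mode label instead of the query\-based retrieval described above, and the class name, method names, and task strings are all hypothetical\.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FailureLibrary:
    """Stores (task, failure mode) pairs and returns tasks sharing a mode."""

    by_mode: dict = field(default_factory=lambda: defaultdict(list))

    def add(self, task: str, mode: str) -> None:
        """Record a failed task under its (case-insensitive) failure mode."""
        key = mode.lower()
        if task not in self.by_mode[key]:
            self.by_mode[key].append(task)

    def retrieve(self, mode: str, exclude: Optional[str] = None, k: int = 2) -> list:
        """Return up to k stored tasks with the same mode, skipping `exclude`."""
        candidates = [t for t in self.by_mode.get(mode.lower(), []) if t != exclude]
        return candidates[:k]


library = FailureLibrary()
library.add("put a clean apple in the fridge", "ENTITY_CONFUSION")
library.add("heat an egg and put it on the countertop", "entity_confusion")
library.add("put two pillows on the sofa", "wrong_receptacle")

similar = library.retrieve("ENTITY_CONFUSION", exclude="put a clean apple in the fridge")
print(similar)  # ['heat an egg and put it on the countertop']
```

In a training loop, the retrieved tasks would be mixed back into the sampled batch so that the agent practices the failure mode again; how that mixing is weighted is not specified in this section\.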
## 5 Conclusion
This paper introduces Role\-Agent, a bootstrapped framework for agent\-environment co\-evolution designed to overcome the challenges of undirected and non\-specific feedback in static environments\. Role\-Agent leverages a single Large Language Model \(LLM\) to act as both the agent and the environment, realized through our World\-In\-Agent \(WIA\) and Agent\-In\-World \(AIW\) components\. WIA enhances the agent’s planning and reasoning by equipping it with the ability to predict future states based on its actions; AIW uses the same LLM to analyze failure patterns from unsuccessful trajectories and retrieve analogous tasks, thereby dynamically reshaping the training data distribution\. Extensive experiments across diverse benchmarks validate that Role\-Agent achieves strong performance, demonstrating the effectiveness of our approach\.
## Limitations
Despite its effectiveness, Role\-Agent has several limitations\. First, a stronger frozen environment LLM can improve the AIW component, but it also introduces extra external knowledge and changes the fairness of comparison with same\-backbone baselines\. Second, the state grouping mechanism employs a similarity threshold from previous studies, limiting cross\-task generalization\. Finally, the current evaluation is confined to text\-based environments; extensions to multi\-modal or real\-time embodied settings may require vision\-language state descriptions or latent\-state matching and remain important future work\.
## References
- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat, et al\. \(2023\) Gpt\-4 technical report\. arXiv preprint arXiv:2303\.08774\.
- A\. Ahmadian, C\. Cremer, M\. Gallé, M\. Fadaee, J\. Kreutzer, O\. Pietquin, A\. Üstün, and S\. Hooker \(2024\) Back to basics: revisiting reinforce\-style optimization for learning from human feedback in llms\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 12248–12267\.
- Anthropic \(2024\) The claude 3 model family: opus, sonnet, haiku\. Claude\-3 Model Card 1 \(1\), pp\. 4\.
- K\. Chen, P\. Wang, Y\. Yu, X\. Zhan, and H\. Wang \(2025a\) Large language model\-based data science agent: a survey\. arXiv preprint arXiv:2508\.02744\.
- Y\. Chen, Y\. Wang, S\. Zhu, H\. Yu, T\. Feng, M\. Zhang, M\. Patwary, and J\. You \(2025b\) Multi\-agent evolve: llm self\-improve through co\-evolution\. arXiv preprint arXiv:2510\.23595\.
- X\. Chu, H\. Huang, X\. Zhang, F\. Wei, and Y\. Wang \(2026\) GPG: a simple and strong reinforcement learning baseline for model reasoning\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=inccdtfx8x)\.
- D\. Citron \(2024\) Try deep research and our new experimental model in gemini, your ai assistant\. Google Blog, December 11\.
- G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, E\. Rosen, et al\. \(2025\) Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\. arXiv preprint arXiv:2507\.06261\.
- G\. Dong, Y\. Chen, X\. Li, J\. Jin, H\. Qian, Y\. Zhu, H\. Mao, G\. Zhou, Z\. Dou, and J\. Wen \(2025a\) Tool\-star: empowering llm\-brained multi\-tool reasoner via reinforcement learning\. arXiv preprint arXiv:2505\.16410\.
- G\. Dong, H\. Mao, K\. Ma, L\. Bao, Y\. Chen, Z\. Wang, Z\. Chen, J\. Du, H\. Wang, F\. Zhang, et al\. \(2025b\) Agentic reinforced policy optimization\. arXiv preprint arXiv:2507\.19849\.
- R\. Fang, Y\. Liang, X\. Wang, J\. Wu, S\. Qiao, P\. Xie, F\. Huang, H\. Chen, and N\. Zhang \(2025\) Memp: exploring agent procedural memory\. arXiv preprint arXiv:2508\.06433\.
- L\. Feng, Z\. Xue, T\. Liu, and B\. An \(2025\) Group\-in\-group policy optimization for llm agent training\. arXiv preprint arXiv:2505\.10978\.
- C\. Fernando, D\. Banarse, H\. Michalewski, S\. Osindero, and T\. Rocktäschel \(2023\) Promptbreeder: self\-referential self\-improvement via prompt evolution\. arXiv preprint arXiv:2309\.16797\.
- H\. Gao, J\. Geng, W\. Hua, M\. Hu, X\. Juan, H\. Liu, S\. Liu, J\. Qiu, X\. Qi, Y\. Wu, et al\. \(2025\) A survey of self\-evolving agents: what, when, how, and where to evolve on the path to artificial super intelligence\. arXiv preprint arXiv:2507\.21046\.
- Y\. Gu, K\. Zhang, Y\. Ning, B\. Zheng, B\. Gou, T\. Xue, C\. Chang, S\. Srivastava, Y\. Xie, P\. Qi, et al\. \(2024\) Is your llm secretly a world model of the internet? model\-based planning for web agents\. arXiv preprint arXiv:2411\.06559\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, et al\. \(2025\) Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\. arXiv preprint arXiv:2501\.12948\.
- E\. Hemberg, S\. Moskal, and U\. O’Reilly \(2024\) Evolving code with a large language model\. Genetic Programming and Evolvable Machines 25 \(2\), pp\. 21\.
- X\. Ho, A\. D\. Nguyen, S\. Sugawara, and A\. Aizawa \(2020\) Constructing a multi\-hop qa dataset for comprehensive evaluation of reasoning steps\. In Proceedings of the 28th International Conference on Computational Linguistics, pp\. 6609–6625\.
- S\. Hu, C\. Lu, and J\. Clune \(2024\) Automated design of agentic systems\. arXiv preprint arXiv:2408\.08435\.
- Z\. Jiang, D\. Schmidt, D\. Srikanth, D\. Xu, I\. Kaplan, D\. Jacenko, and Y\. Wu \(2025\) Aide: ai\-driven exploration in the space of code\. arXiv preprint arXiv:2502\.13138\.
- B\. Jin, H\. Zeng, Z\. Yue, J\. Yoon, S\. Arik, D\. Wang, H\. Zamani, and J\. Han \(2025\) Search\-r1: training llms to reason and leverage search engines with reinforcement learning\. arXiv preprint arXiv:2503\.09516\.
- M\. Joshi, E\. Choi, D\. S\. Weld, and L\. Zettlemoyer \(2017\) Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension\. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 1601–1611\.
- W\. Kool, H\. van Hoof, and M\. Welling \(2019\) Buy 4 reinforce samples, get a baseline for free\! ICLR 2019 Workshop\.
- T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee, et al\. \(2019\) Natural questions: a benchmark for question answering research\. Transactions of the Association for Computational Linguistics 7, pp\. 453–466\.
- N\. Lambert, J\. Morrison, V\. Pyatkin, S\. Huang, H\. Ivison, F\. Brahman, L\. J\. V\. Miranda, A\. Liu, N\. Dziri, S\. Lyu, et al\. \(2024\) Tulu 3: pushing frontiers in open language model post\-training\. arXiv preprint arXiv:2411\.15124\.
- J\. Li, D\. Guo, D\. Yang, R\. Xu, Y\. Wu, and J\. He \(2025\) Codei/o: condensing reasoning patterns via code input\-output prediction\. arXiv preprint arXiv:2502\.07316\.
- X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang, et al\. \(2023\) Agentbench: evaluating llms as agents\. arXiv preprint arXiv:2308\.03688\.
- Z\. Ma, S\. Yang, Y\. Ji, X\. Wang, Y\. Wang, Y\. Hu, T\. Huang, and X\. Chu \(2026\) SkillClaw: let skills evolve collectively with agentic evolver\. arXiv preprint arXiv:2604\.08377\.
- A\. Mallen, A\. Asai, V\. Zhong, R\. Das, D\. Khashabi, and H\. Hajishirzi \(2023\) When not to trust language models: investigating effectiveness of parametric and non\-parametric memories\. In Proceedings of the 61st annual meeting of the association for computational linguistics \(volume 1: Long papers\), pp\. 9802–9822\.
- A\. Novikov, N\. Vũ, M\. Eisenberger, E\. Dupont, P\. Huang, A\. Z\. Wagner, S\. Shirobokov, B\. Kozlovskii, F\. J\. Ruiz, A\. Mehrabian, et al\. \(2025\) Alphaevolve: a coding agent for scientific and algorithmic discovery\. arXiv preprint arXiv:2506\.13131\.
- Y\. Ou, Y\. Luo, J\. Zheng, L\. Wei, Z\. Yu, S\. Qiao, J\. Zhang, D\. Zheng, Y\. Mao, Y\. Gao, et al\. \(2025\) Automind: adaptive knowledgeable agent for automated data science\. arXiv preprint arXiv:2506\.10974\.
- O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis \(2023\) Measuring and narrowing the compositionality gap in language models\. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp\. 5687–5711\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\) Direct preference optimization: your language model is secretly a reward model\. Advances in neural information processing systems 36, pp\. 53728–53741\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal policy optimization algorithms\. arXiv preprint arXiv:1707\.06347\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu, et al\. \(2024\) Deepseekmath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. Advances in neural information processing systems 36, pp\. 8634–8652\.
- M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht \(2020\) Alfworld: aligning text and embodied environments for interactive learning\. arXiv preprint arXiv:2010\.03768\.
- H\. Sun, Z\. Qiao, J\. Guo, X\. Fan, Y\. Hou, Y\. Jiang, P\. Xie, Y\. Zhang, F\. Huang, and J\. Zhou \(2025\) Zerosearch: incentivize the search capability of llms without searching\. arXiv preprint arXiv:2505\.04588\.
- G\. Team, R\. Anil, S\. Borgeaud, J\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican, et al\. \(2023\) Gemini: a family of highly capable multimodal models\. arXiv preprint arXiv:2312\.11805\.
- T\. D\. Team, B\. Li, B\. Zhang, D\. Zhang, F\. Huang, G\. Li, G\. Chen, H\. Yin, J\. Wu, J\. Zhou, et al\. \(2025\) Tongyi deepresearch technical report\. arXiv preprint arXiv:2510\.24701\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\) MuSiQue: multihop questions via single\-hop question composition\. Transactions of the Association for Computational Linguistics 10, pp\. 539–554\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2023\) Voyager: an open\-ended embodied agent with large language models\. arXiv preprint arXiv:2305\.16291\.
- J\. Wang, H\. Xu, H\. Jia, X\. Zhang, M\. Yan, W\. Shen, J\. Zhang, F\. Huang, and J\. Sang \(2024\) Mobile\-agent\-v2: mobile device operation assistant with effective navigation via multi\-agent collaboration\. Advances in Neural Information Processing Systems 37, pp\. 2686–2710\.
- Z\. Wang, K\. Wang, Q\. Wang, P\. Zhang, L\. Li, Z\. Yang, X\. Jin, K\. Yu, M\. N\. Nguyen, L\. Liu, et al\. \(2025a\) Ragen: understanding self\-evolution in llm agents via multi\-turn reinforcement learning\. arXiv preprint arXiv:2504\.20073\.
- Z\. Wang, X\. Zheng, K\. An, C\. Ouyang, J\. Cai, Y\. Wang, and Y\. Wu \(2025b\) Stepsearch: igniting llms search ability via step\-wise proximal policy optimization\. arXiv preprint arXiv:2505\.15107\.
- R\. Wu, X\. Wang, J\. Mei, P\. Cai, D\. Fu, C\. Yang, L\. Wen, X\. Yang, Y\. Shen, Y\. Wang, et al\. \(2025\) Evolver: self\-evolving llm agents through an experience\-driven lifecycle\. arXiv preprint arXiv:2510\.16079\.
- P\. Xia, J\. Chen, H\. Wang, J\. Liu, K\. Zeng, Y\. Wang, S\. Han, Y\. Zhou, X\. Zhao, H\. Chen, et al\. \(2026\) SkillRL: evolving agents via recursive skill\-augmented reinforcement learning\. arXiv preprint arXiv:2602\.08234\.
- W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang \(2025\) A\-mem: agentic memory for llm agents\. arXiv preprint arXiv:2502\.12110\.
- T\. Xue, C\. Peng, M\. Huang, L\. Guo, T\. Han, H\. Wang, J\. Wang, X\. Zhang, X\. Yang, D\. Zhao, et al\. \(2026\) Evocua: evolving computer use agents via learning from scalable synthetic experience\. arXiv preprint arXiv:2601\.15876\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\) HotpotQA: a dataset for diverse, explainable multi\-hop question answering\. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp\. 2369–2380\.
- S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2022a\) Webshop: towards scalable real\-world web interaction with grounded language agents\. Advances in Neural Information Processing Systems 35, pp\. 20744–20757\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao \(2022b\) React: synergizing reasoning and acting in language models\. In The eleventh international conference on learning representations\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu, et al\. \(2025\) Dapo: an open\-source llm reinforcement learning system at scale\. arXiv preprint arXiv:2503\.14476\.
- Y\. Zhai, S\. Tao, C\. Chen, A\. Zou, Z\. Chen, Q\. Fu, S\. Mai, L\. Yu, J\. Deng, Z\. Cao, et al\. \(2025\) Agentevolver: towards efficient self\-evolving agent system\. arXiv preprint arXiv:2511\.10395\.
- G\. Zhang, H\. Ren, C\. Zhan, Z\. Zhou, J\. Wang, H\. Zhu, W\. Zhou, and S\. Yan \(2025a\) Memevolve: meta\-evolution of agent memory systems\. arXiv preprint arXiv:2512\.18746\.
- H\. Zhang, Q\. Long, J\. Bao, T\. Feng, W\. Zhang, H\. Yue, and W\. Wang \(2026a\) MemSkill: learning and evolving memory skills for self\-evolving agents\. arXiv preprint arXiv:2602\.02474\.
- S\. Zhang, J\. Wang, R\. Zhou, J\. Liao, Y\. Feng, Z\. Li, Y\. Zheng, W\. Zhang, Y\. Wen, Z\. Li, et al\. \(2026b\) Memrl: self\-evolving agents via runtime reinforcement learning on episodic memory\. arXiv preprint arXiv:2601\.03192\.
- W\. Zhang, X\. Li, K\. Dong, Y\. Wang, P\. Jia, X\. Li, Y\. Zhang, D\. Xu, Z\. Du, H\. Guo, et al\. \(2025b\) Process vs\. outcome reward: which is better for agentic rag reinforcement learning\. arXiv preprint arXiv:2505\.14069\.
- Z\. Zhang and A\. Zhang \(2024\) You only look at screens: multimodal chain\-of\-action agents\. In Findings of the Association for Computational Linguistics: ACL 2024, pp\. 3132–3149\.
- C\. Zheng, S\. Liu, M\. Li, X\. Chen, B\. Yu, C\. Gao, K\. Dang, Y\. Liu, R\. Men, A\. Yang, et al\. \(2025\) Group sequence policy optimization\. arXiv preprint arXiv:2507\.18071\.
- T\. Y\. Zhuo, D\. Wang, H\. Ding, V\. Kumar, and Z\. Wang \(2025\) Cyber\-zero: training cybersecurity agents without runtime\. arXiv preprint arXiv:2508\.00910\.
## Appendix A Dataset Details
### A\.1 ALFWorld
ALFWorld is an interactive framework that bridges text\-based environments and physically embodied simulations\. Agents learn high\-level policies in TextWorld and apply them within the visual ALFRED benchmark\. With parallel representations of the same world, ALFWorld enables agents to leverage semantic priors and language\-based reasoning to generalize more effectively to new tasks\. This dual\-modality design promotes stronger generalization and greater training efficiency compared to vision\-only approaches\.
| Parameter | Value | Parameter | Value |
| --- | --- | --- | --- |
| learning rate | 1\.00E\-06 | evaluation temperature | 0 |
| training batch size | 16/16/256 | max response length | 4096 |
| optimizer | AdamW | reward | suc=1, fail=0 |
| clip ratio low | 0\.2 | max interaction step | 50/15/4 |
| clip ratio high | 0\.28 | state similarity threshold | 0\.9 |
| KL coefficient | 1\.00E\-03 | reflection temperature | 0\.5 |
| rollout temperature | 0\.9 | total epoch | 150 |
| val\_data\_size | 128/128/512 | group\_size | 8 |
Table 5: Hyper\-parameters for RL training\.

| ALFWorld | WebShop | Search |
| --- | --- | --- |
| repetitive\_exploration | irrelevant\_query | wrong\_answer |
| wrong\_target\_location | wrong\_product\_selection | insufficient\_retrieval |
| wrong\_receptacle | wrong\_attribute\_selection | irrelevant\_retrieval\_query |
| premature\_give\_up | missing\_attribute\_selection | repeated\_retrieval\_query |
| missing\_precondition | premature\_purchase | information\_misinterpretation |
| repeated\_failed\_action | excessive\_browsing | partial\_answer |
| navigation\_loop | repeated\_query | hallucinated\_answer |
| entity\_confusion | navigation\_error | premature\_answer |
| wrong\_object\_interaction | price\_constraint\_violation | action\_format\_error |
| exhaustive\_exploration\_failure | action\_format\_error | |
| action\_format\_error | premature\_termination | |
Table 6:Failure modes used in Agent\-In\-World across ALFWorld, WebShop, and search\-augmented QA tasks\.
### A\.2 WebShop
WebShop is a large\-scale simulated e\-commerce environment comprising over 1\.18 million real\-world products and 12,087 crowd\-sourced natural language instructions for training grounded language agents\. In this benchmark, agents use two actions, i\.e\., search\[query\] and click\[element\], to fulfill complex user requirements\. The environment features an automatically computable reward function based on product attributes, and agents trained in it show sim\-to\-real transfer capabilities when deployed on actual shopping websites like Amazon and eBay\.
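The two action types can be illustrated with a small parser\. This is a sketch under assumptions: it takes the textual forms `search[query]` and `click[element]` literally from the description above, and the `Action` type, the regular expression, and the example strings are hypothetical rather than part of the WebShop codebase\.

```python
import re
from typing import NamedTuple, Optional


class Action(NamedTuple):
    kind: str      # "search" or "click"
    argument: str  # query text or element label


# Matches "search[...]" or "click[...]" with a non-empty argument.
_ACTION_RE = re.compile(r"^\s*(search|click)\[(.+)\]\s*$", re.IGNORECASE | re.DOTALL)


def parse_action(text: str) -> Optional[Action]:
    """Return the parsed action, or None if the text is not a valid action."""
    match = _ACTION_RE.match(text)
    if match is None:
        return None
    return Action(match.group(1).lower(), match.group(2).strip())


print(parse_action("search[red running shoes size 9]"))
print(parse_action("click[Buy Now]"))
print(parse_action("buy the first item"))  # None: not a valid action string
```

Returning `None` for malformed output gives the caller a natural hook for the action\_format\_error failure mode listed in Table 6\.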
### A\.3 Search\-QA Tasks
Natural Questions \(NQ\) is a large\-scale open\-domain QA dataset built from real Google search queries\. Each example pairs a user question with an answer extracted from a Wikipedia page\. It’s widely used to benchmark single\-hop retrieval and reading comprehension, where the task is to locate and extract an answer from a single passage\.
TriviaQA consists of question\-answer\-evidence triples sourced from Wikipedia and news articles\. The questions involve complex entity relationships, and the evidence is collected via distant supervision, meaning the answer isn’t necessarily tied to a single pre\-selected passage\. It’s commonly used to test how well models retrieve and synthesize facts from unstructured text\.
PopQA focuses on long\-tail entities\. It samples over 14,000 questions about less frequently mentioned entities\. It tests whether a retrieval method can actually look up obscure knowledge instead of relying on parametric memory\.
HotpotQA requires multi\-hop reasoning across two or more Wikipedia paragraphs\. It’s designed to test whether a model can follow a chain of evidence, not just locate a single fact\.
2WikiMultihopQA \(2Wiki\) is constructed using a rule\-based template system\. This ensures each question has a predefined reasoning path, and questions are categorized by logical type, such as comparison, temporal, or compositional\. It provides a controlled setting for evaluating whether models can perform specific kinds of multi\-step inference\.
MuSiQue is built by programmatically composing single\-hop questions from existing datasets like SQuAD and TriviaQA\. The composition ensures strict connectivity between reasoning steps\. It includes unanswerable distractors\. This tests whether models can follow a dependency chain while filtering out irrelevant information\.
Bamboogle is designed to be unsolvable by parametric models alone\. The questions require decomposition and sequential retrieval across multiple documents\. It’s used to evaluate whether search\-augmented agents can generalize compositionally, meaning they can combine facts from different sources in ways that weren’t seen during training\.
## Appendix B More Studies
### B\.1 Standard Deviations
Table [7](https://arxiv.org/html/2606.10917#A2.T7) reports the mean and standard deviation over three runs\.
### B\.2 Relation between predictive reward and outcome reward
On 200 ALFWorld rollouts with Qwen2\.5\-3B\-Instruct, the predictive reward has a point\-biserial correlation of 0\.41 \(p<0\.01\) with the outcome reward\. Its average value also rises from about 0\.60 at initialization to the mid\-to\-high 0\.70s near convergence, indicating improved state prediction quality\.
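For readers unfamiliar with the statistic, the point\-biserial correlation is the Pearson correlation between a binary variable (here, success or failure) and a continuous one (here, the predictive reward). The sketch below computes it from the standard closed form; the function name is hypothetical, and any numbers passed to it are illustrative rather than data from the paper.

```python
import math
from typing import Sequence


def point_biserial(binary: Sequence[int], values: Sequence[float]) -> float:
    """Point-biserial correlation between a 0/1 variable and a continuous one.

    Uses r = (M1 - M0) / s * sqrt(p * (1 - p)), where M1 and M0 are the means
    of the two groups, s is the population standard deviation of all values,
    and p is the fraction of ones.
    """
    if len(binary) != len(values) or not values:
        raise ValueError("inputs must be non-empty and of equal length")
    ones = [v for b, v in zip(binary, values) if b == 1]
    zeros = [v for b, v in zip(binary, values) if b == 0]
    if not ones or not zeros:
        raise ValueError("both outcome classes must be present")
    n = len(values)
    mean_all = sum(values) / n
    std_all = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if std_all == 0:
        raise ValueError("values must not be constant")
    p = len(ones) / n
    mean_gap = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return mean_gap / std_all * math.sqrt(p * (1 - p))
```

A positive value means successful rollouts tend to have higher predictive reward than failed ones.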

| Backbone | Method | ALFWorld | WebShop |
| --- | --- | --- | --- |
| Qwen2\.5\-1\.5B\-Instruct | GRPO | 72\.8 ± 1\.5 | 56\.8 ± 0\.7 |
| Qwen2\.5\-1\.5B\-Instruct | GiGPO | 86\.7 ± 0\.6 | 65\.0 ± 1\.1 |
| Qwen2\.5\-1\.5B\-Instruct | Role\-Agent | 90\.9 ± 0\.8 | 71\.9 ± 0\.9 |
| Qwen2\.5\-7B\-Instruct | GRPO | 77\.6 ± 1\.0 | 66\.1 ± 0\.9 |
| Qwen2\.5\-7B\-Instruct | GiGPO | 90\.8 ± 0\.5 | 72\.8 ± 1\.8 |
| Qwen2\.5\-7B\-Instruct | Role\-Agent | 93\.8 ± 0\.8 | 77\.1 ± 0\.6 |

Table 7: Stability results over three runs with Qwen2\.5\-1\.5B/7B\-Instruct\.
## Appendix C Implementation Details
Role\-Agent adopts the VeRL framework to train agents\. The detailed hyper\-parameters are listed in Table [5](https://arxiv.org/html/2606.10917#A1.T5)\. All backbones used in the experiments, i\.e\., Qwen2\.5\-1\.5B/3B/7B\-Instruct, are trained on 8× NVIDIA H20 GPUs with the tensor\-parallel size set to 1\. The failure modes employed by Role\-Agent are listed in Table [6](https://arxiv.org/html/2606.10917#A1.T6)\.
## Appendix D Prompts
We provide all the prompts used in the experiments in Figures [7](https://arxiv.org/html/2606.10917#A6.F7) to [9](https://arxiv.org/html/2606.10917#A6.F9)\.
Specifically, Figure [7](https://arxiv.org/html/2606.10917#A6.F7) shows the prompt for search\-augmented QA tasks, which provides the interaction history, i\.e\., previous search queries and their results, and asks the LLM to either issue a new search or answer the question\. The prompt in Figure [8](https://arxiv.org/html/2606.10917#A6.F8) first feeds the LLM the task context and failed trajectories, then asks it to generate typical failure modes, including failure categories, core lessons, and suggested queries for the subsequent retrieval\. The prompt in Figure [9](https://arxiv.org/html/2606.10917#A6.F9) takes the generated content and asks the LLM to retrieve tasks with similar failure modes, which are stored in the offline library\.
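The failure\-analysis prompt in Figure 8 asks for a `<reflection>` block with five named fields, so its output can be turned into a record before it is stored or used for retrieval. The following is a minimal sketch of such a parser; the field names come from the prompt's output format, while the function name `parse_reflection` and the choice to reject incomplete blocks are assumptions made for illustration.

```python
import re
from typing import Dict, Optional

# Field names taken from the output format of the failure-analysis prompt.
FIELDS = ("DOMINANT_TYPE", "DETAIL", "CRITICAL_STEP", "CORE_LESSON", "RETRIEVAL_QUERY")


def parse_reflection(text: str) -> Optional[Dict[str, str]]:
    """Extract the five fields from a <reflection>...</reflection> block.

    Returns None when the block or any field is missing (an assumption;
    the source does not say how malformed outputs are handled).
    """
    block = re.search(r"<reflection>(.*?)</reflection>", text, re.DOTALL)
    if block is None:
        return None
    body = block.group(1)
    # Each field value runs until the next known field name or the block end.
    names = "|".join(FIELDS)
    pattern = re.compile(rf"({names}):\s*(.*?)(?=(?:{names}):|\Z)", re.DOTALL)
    parsed = {name: value.strip() for name, value in pattern.findall(body)}
    if any(name not in parsed for name in FIELDS):
        return None
    return parsed
```

In this sketch, the `RETRIEVAL_QUERY` field would be the natural input to the retrieval prompt in Figure 9, and `DOMINANT_TYPE` could be checked against the categories in Table 6.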
## Appendix E Algorithm
The full training procedure is given in Algorithm [1](https://arxiv.org/html/2606.10917#alg1)\.
Algorithm 1 Role\-Agent Training
1: Input: initial policy $\pi_{\theta}$, reference policy $\pi_{\rm ref}$, task pool $\mathcal{D}$, prediction horizon $H$, discount factor $\gamma$, mixing coefficient $\alpha$
2: Output: optimized policy $\pi_{\theta}$
3: Initialize task distribution $p_{\mathcal{D}}$ and failure memory $\mathcal{M} \leftarrow \emptyset$
4: for each training iteration do
5: Sample a batch of tasks $\{q_{i}\}_{i=1}^{N} \sim p_{\mathcal{D}}$
6: for each task $q_{i}$ do
7: Roll out the LLM agent to obtain trajectory $\bm{\tau}_{i} = \{(\bm{s}^{(i)}_{t}, \bm{a}^{(i)}_{t}, r^{(i)}_{t})\}_{t=1}^{T_{i}}$
8: for each step $t$ in $\bm{\tau}_{i}$ do
9: Use the same LLM with prompt $\bm{x}_{pre}$ to predict future states $\{\hat{\bm{s}}^{(i)}_{t,h}\}_{h=1}^{H}$
10: Compute predictive scores $\tilde{r}^{(i)}_{t,h} = \operatorname{LMS}(\hat{\bm{s}}^{(i)}_{t,h}, \bm{s}^{(i)}_{t+h})$
11: Compute task and predictive rewards: $\mathcal{R}_{task}^{(i)}(\bm{a}_{t}) = \sum_{k=t}^{T_{i}} \gamma^{k-t} r^{(i)}_{k}, \quad \mathcal{R}_{pre}^{(i)}(\bm{a}_{t}) = \sum_{h=1}^{H} \gamma^{h-1} \tilde{r}^{(i)}_{t,h}$
12: Modulate the reward: $\mathcal{R}^{(i)}_{t} = \mathcal{R}_{task}^{(i)}(\bm{a}_{t}) \bigl(1 + \mathcal{R}_{pre}^{(i)}(\bm{a}_{t})\bigr)$
13: end for
14: end for
15: Group identical states across the rollout batch using hash maps
16: Compute state\-level advantage $A^{S}_{o}(\bm{a}^{(i)}_{t})$ within each state group $\mathcal{G}_{o}$
17: Compute the final advantage: $A(\bm{a}^{(i)}_{t}) = A^{S}_{o}(\bm{a}^{(i)}_{t}) + \alpha A^{E}(\bm{\tau}_{i})$
18: Update $\pi_{\theta}$ with the GRPO\-style clipped objective using $A(\bm{a}^{(i)}_{t})$
19: for each failed trajectory $\bm{\tau}_{i}$ do
20: Use the same LLM as the environment role to analyze failure causes
21: Generate failure mode and reflection, then store them in $\mathcal{M}$
22: end for
23: Retrieve tasks in $\mathcal{D}$ similar to the accumulated failure modes
24: Update $p_{\mathcal{D}}$ to prioritize difficult and overlooked tasks
25: end for
26: return $\pi_{\theta}$
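Steps 10 to 12 of Algorithm 1 can be written out numerically as follows. This is a minimal sketch under stated assumptions: the `similarity` function is a stand\-in for the LMS score (it uses `difflib.SequenceMatcher`, which is not claimed to be the authors' scoring function), step indices are 0\-based, and how predictions that reach past the end of a trajectory are scored is left to the caller because this appendix does not specify it. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Sequence


def similarity(predicted: str, actual: str) -> float:
    """Stand-in for LMS: a string similarity in [0, 1] (assumption)."""
    return SequenceMatcher(None, predicted, actual).ratio()


def task_return(rewards: Sequence[float], t: int, gamma: float) -> float:
    """R_task at step t: discounted sum of environment rewards from t onward."""
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))


def predictive_return(scores: Sequence[float], gamma: float) -> float:
    """R_pre: discounted sum of the H predictive scores.

    scores[0] is the score for horizon h = 1, so its weight is gamma ** 0,
    matching the gamma^(h-1) factor in step 11.
    """
    return sum(gamma ** h * score for h, score in enumerate(scores))


def modulated_reward(r_task: float, r_pre: float) -> float:
    """Step 12: R = R_task * (1 + R_pre)."""
    return r_task * (1.0 + r_pre)
```

For example, with rewards `[0.0, 0.0, 1.0]`, `gamma = 0.5`, and predictive scores `[1.0, 0.5]`, the task return at the first step is 0.25, the predictive return is 1.25, and the modulated reward is 0.5625. Because the predictive term enters multiplicatively, a trajectory whose task return is zero receives zero reward regardless of how well it predicted future states.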
## Appendix F The Use of Large Language Models
During manuscript preparation, we use large language models \(LLMs\) to \(i\) improve grammar and spelling without altering the intended scientific content, and \(ii\) provide lightweight coding assistance \(e\.g\., scripts and formatting help\)\. All reported numerical results, analyses, and claims are produced by the authors\. The authors design the methods, conduct the experiments, and verify the findings\.
Prompt Template for Search
You are an expert assistant whose task is to answer the given question step by step\.
Your question: \{task\_description\}\. So far, you have completed step\_count step\(s\)\. Below is the interaction history, where <search\> and </search\> enclose your previous search queries, and <information\> and </information\> enclose the corresponding results returned by the external search engine\.
History: memory\_context
Now it is your turn to respond at the current step\. Begin by conducting your reasoning process\. This reasoning must be enclosed within <think\> and </think\> tags\.
After reasoning, choose only one of the following actions \(do not attempt both\):
\(1\) If you determine that you are missing some necessary information, you may use a search engine to obtain more external knowledge by formatting your query as: <search\> your query </search\>\.
\(2\) If you have sufficient knowledge to confidently answer the question, provide your final answer enclosed within <answer\> and </answer\> tags, without any detailed explanation\.
Figure 7: The prompt template of Search agents\.
Prompt Template for Abstracting Failure Modes from Failed Trajectories
You are an expert AI trainer specializing in diagnosing why AI agents fail at multi\-step reasoning tasks\.
\#\# Task Context
\{task\_description\}
\#\# Failed Trajectory
The agent attempted the task above but failed\. Here are the steps it took:
\{trajectory\_description\}
\#\# Your Analysis Task
Carefully examine the trajectory and produce a structured failure analysis\.
\*\*Step 1 – Root Cause Identification\*\*
Identify the PRIMARY failure modes and describe briefly:
\*\*Step 2 – Critical Step Identification\*\*
Identify the SINGLE step where the failure became irreversible \(the "point of no return"\)\.
\*\*Step 3 – Core Lesson\*\*
State a concise, generalizable lesson \(1\-2 sentences\) that would help an agent avoid this class of mistake on SIMILAR tasks in the future\. Focus on the decision rule, not the specific content\.
\#\# Output Format
Wrap your entire analysis in <reflection\> tags using this exact structure:
<reflection\>
DOMINANT\_TYPE: \[category from Step 1\]
DETAIL: \[1\-2 sentences explaining why this root cause applies\]
CRITICAL\_STEP: \[step number and brief description of what went wrong\]
CORE\_LESSON: \[the generalizable rule an agent should follow\]
RETRIEVAL\_QUERY: \[the query for the retrieval stage\]
</reflection\>
Figure 8: The prompt template for abstracting failure modes from failed trajectories\.
Prompt Template for Retrieving Tasks with Similar Failure Modes
You are an expert AI curriculum designer\. Your job is to identify which historical training tasks are most relevant for helping an agent overcome a specific failure pattern\.
\#\# Current Failure Pattern
The agent is currently struggling with the following error pattern:
error\_pattern
\#\# Historical Task Candidates
Below are historical tasks where the agent previously failed\. Each entry shows the task description and a brief failure analysis\.
candidates\_text
\#\# Your Task
Select tasks from the list above that are MOST SIMILAR to the current failure pattern\.
Similarity means:
1\. The task requires the same type of reasoning or skill that the agent is currently failing at\.
2\. The task’s failure analysis describes a similar root cause or mistake\.
3\. Re\-training on this task would most directly help the agent overcome the current pattern\.
\#\# Output Format
Output ONLY the following structured block, with no additional text:
<selected\_tasks\>
INDEX/TASK/REFLECTIONS: <index, task and reflections from the candidate list\>
REASON: <one sentence explaining why this task matches the current failure pattern\>
INDEX/TASK/REFLECTIONS: <index, task and reflections from the candidate list\>
REASON: <one sentence explaining why this task matches the current failure pattern\>
</selected\_tasks\>
Figure 9: The prompt template for retrieving tasks with similar failure modes\.
OBSERVATION \(step 1\) "You are in the middle of a room\. Looking quickly around you, you see a cabinet 1, a cabinet 2, a countertop 1, a garbagecan 1, a handtowelholder 1, a sinkbasin 1, a toilet 1, a toiletpaperhanger 1, and a towelholder 1\. Your task is to: put a clean soapbar in toiletpaperhanger\."
AGENT\_ACTION \(step 1\) "go to cabinet 1"
OBSERVATION \(step 2\) "You open the cabinet 1\. The cabinet 1 is open\. In it, you see a cloth 1\."
AGENT\_ACTION \(step 2\) "go to cabinet 2"
OBSERVATION \(step 3\) "You open the cabinet 2\. The cabinet 2 is open\. In it, you see a soapbar 1\."
AGENT\_ACTION \(step 3\) "take soapbar 1 from cabinet 2"
OBSERVATION \(step 4\) "You pick up the soapbar 1 from the cabinet 2\."
AGENT\_ACTION \(step 4\) "go to toiletpaperhanger 1"
OBSERVATION \(step 5\) "You arrive at toiletpaperhanger 1\."
AGENT\_ACTION \(step 5\) "put soapbar 1 in/on toiletpaperhanger 1"
OBSERVATION \(step 6\) "You put the soapbar 1 in/on the sinkbasin 1\."
AGENT\_ACTION \(step 6\) "clean soapbar 1 with sinkbasin 1"
OBSERVATION ……………
OBSERVATION "You arrive at cabinet 1\. The episode has ended due to exceeding max steps\."
AGENT\_ACTION ""
Figure 10: Case\-1: failure trajectories\.
Failure Mode Analysis for Case\-1
DOMINANT\_TYPE: "MISSING\_PRECONDITION"
DETAIL: "The agent picked up the soapbar and immediately attempted to place it at the target location without first checking whether the object satisfied the task precondition \(clean\)\. The task description explicitly states ‘clean soapbar’, but the agent ignored this constraint until the placement failed\."
"CRITICAL\_STEP": "Step 3, agent navigated to the target location with a dirty soapbar instead of cleaning it first at the sinkbasin\."
"CORE\_LESSON": "Before transporting an object to its target location, always verify that the object satisfies all task preconditions \(e\.g\., clean, heated, cooled\)\. If the task specifies an object state, perform the required state\-change action first\."
"RETRIEVAL\_QUERY": "precondition check, object state, clean before place, task constraint, soapbar, sinkbasin"
Retrieved Tasks for Case\-1
\[4\] "TASK": "Put a clean cloth in toilet\.", "RETRIEVED\_REFLECTION": "Agent placed dirty cloth directly into toilet\. Should have cleaned cloth at sinkbasin first\. Rule: check object state precondition before final placement\."
\[28\] "TASK": "Put a clean sponge in bathtubbasin\.", "RETRIEVED\_REFLECTION": "Agent failed to clean sponge before placing in bathtubbasin\. Cleaning at faucet or sinkbasin is required when task specifies ’clean’ object\."
\[50\] "TASK": "Put a clean dishsponge in cabinet\.", "RETRIEVED\_REFLECTION": "Agent must clean dishsponge at sinkbasin before placing in cabinet\. Always read task description for object state requirements\."
INDEX 4 REASON: "Same failure pattern: agent must clean an object \(cloth\) before placing it at the target\. Requires sinkbasin cleaning step before placement\."
INDEX 28 REASON: "Identical precondition structure: object must be cleaned before placement\. Agent skipped cleaning step\."
INDEX 50 REASON: "Same root cause: MISSING\_PRECONDITION of ’clean’ precondition\. Agent transported object without state verification\."
OBSERVATION \(step 1\) "You are in the middle of a room\. You see a desk 1, a drawer 1, a drawer 2, a garbagecan 1, a shelf 1, a shelf 2, a shelf 3\. Your task is to: examine the alarmclock with the desklamp\."
AGENT\_ACTION \(step 1\) "go to desk 1"
OBSERVATION \(step 2\) "You arrive at desk 1\. On the desk 1, you see a alarmclock 1, a book 1, and a pen 1\."
AGENT\_ACTION \(step 2\) "take alarmclock 1 from desk 1"
OBSERVATION \(step 3\) "You pick up the alarmclock 1 from the desk 1\."
AGENT\_ACTION \(step 3\) "examine alarmclock 1"
OBSERVATION \(step 4\) "The alarmclock 1 is a digital alarm clock\. It is currently off\."
AGENT\_ACTION \(step 4\) "go to shelf 1"
OBSERVATION \(step 5\) "You arrive at shelf 1\. On the shelf 1, you see a creditcard 1 and a keychain 1\."
AGENT\_ACTION \(step 5\) "go to shelf 2"
OBSERVATION \(step 6\) "You arrive at shelf 2\. On the shelf 2, you see a laptop 1\."
AGENT\_ACTION \(step 6\) "go to shelf 3"
OBSERVATION ……………
OBSERVATION "The episode has ended due to exceeding max steps\. Desklamp was never found or used\."
AGENT\_ACTION ""
Figure 11: Case\-2: failure trajectories\.
Failure Mode Analysis for Case\-2
DOMINANT\_TYPE: "WRONG\_TARGET\_LOCATION"
DETAIL: "The agent correctly identified the need for a desklamp but failed to find it due to an inefficient and incomplete search strategy\. The agent searched shelves and drawers but never checked the desk itself for the desklamp, which is the most likely location for a desk\-related item\."
"CRITICAL\_STEP": "Step 3, after picking up the alarmclock, the agent should have looked for the desklamp on the desk first, but instead began an unfocused search of peripheral locations\."
"CORE\_LESSON": "When searching for a tool or instrument \(e\.g\., desklamp, knife, pan\), prioritize locations semantically associated with that object type before searching peripheral locations\. A desklamp is most likely on a desk; a knife is most likely in a drawer near a countertop\."
"RETRIEVAL\_QUERY": "search strategy, desklamp, semantic location, object search, desk, unfocused exploration, tool finding"
Retrieved Tasks for Case\-2
\[17\] "TASK": "Examine the book with the desklamp\.", "RETRIEVED\_REFLECTION": "Desklamp search should start at desk\. Agent spent too many steps on shelves and drawers before checking the obvious location\."
\[20\] "TASK": "Look at mug under the desklamp\.", "SIMILARITY\_REASON": "Same WRONG\_TARGET\_LOCATION: desklamp not found within step budget due to poor search ordering\."
\[21\] "TASK": "Examine the pen with the desklamp\.", "RETRIEVED\_REFLECTION": "Always check the desk for desklamp first\. If not on desk, check nearby shelves\. Do not exhaust steps on low\-probability locations\."
INDEX 17: REASON: "Identical tool\-finding failure: agent must locate desklamp to examine an object\. Same WRONG\_TARGET\_LOCATION pattern\."
INDEX 20: REASON: "Agent wasted steps searching random locations for desklamp\. Desklamp is almost always on the desk\. Check desk first before exploring other furniture\."
INDEX 21: REASON: "Same root cause: agent needs to find desklamp but uses inefficient search\. Semantic location heuristic applies\."