
# SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
Source: [https://arxiv.org/html/2605.08693](https://arxiv.org/html/2605.08693)
Min Yang¹,², Jinghua Piao²,³\*, Xu Xia²,⁴, Xiaochong Lan³, Jiaju Chen²,⁵, Yongshun Gong¹, Yong Li²,³
¹Shandong University ²Zhongguancun Academy ³Tsinghua University ⁴Southeast University ⁵University of Science and Technology of China
minyang@mail.sdu.edu.cn, {Pjh22, lanxc22}@mail.tsinghua.edu.cn, s-xx25@bza.edu.cn, cjj01@mail.ustc.edu.cn, ysgong@sdu.edu.cn, liyong07@tsinghua.edu.cn

###### Abstract

Skills provide an effective mechanism for improving LLM agents on complex tasks, yet in existing agent frameworks, their creation, refinement, and selection are typically governed by external teachers, hand-designed rules, or auxiliary modules. As a result, skills remain external resources to be invoked, rather than capabilities that agents can develop, adapt, and internalize through experience. To endow LLM agents with autonomous skill mastery, we propose SkillMaster, a training framework that teaches agents to create new skills, refine existing skills, and select accumulated skills during task solving. This capability is achieved through three key designs. First, we train agents through trajectory-informed skill review, teaching agents to propose, update, or retain skills based on evidence from completed episodes. Second, each candidate skill edit is evaluated by its counterfactual utility on related probe tasks, providing a direct learning signal for training skill-editing decisions. Third, we introduce DualAdv-GRPO, which separately estimates advantages for task-solving actions and skill-editing decisions, stabilizing joint training across task solving and skill management. Experiments on ALFWorld and WebShop show that SkillMaster improves the overall success rate over state-of-the-art baselines by 8.8% and 9.3%, respectively, achieving the best performance among all compared methods. Further analysis reveals a marked shift in agent capability: agents trained with SkillMaster can identify skill failures, refine procedural knowledge from trajectory evidence, and transfer improvements to future tasks with limited skill-bank edits. Overall, SkillMaster moves LLM agents beyond mere skill use toward self-improving agents capable of developing, adapting, and applying their own skill repertoires.

Our code is released at [https://github.com/sduyangmin/Skill-Master](https://github.com/sduyangmin/Skill-Master).

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.08693v1/Figures/motivation.png)

Figure 1: From externally-managed to autonomous skill mastery. (a) Prior work: skill management is handled by an external module; the agent only retrieves skills. (b) SkillMaster: the agent self-manages its skill bank via tool calls, forming a closed loop where skill management is a learned RL objective.

Large language model (LLM) agents have demonstrated impressive capabilities across complex tasks such as embodied household manipulation (Shridhar et al., [2021](https://arxiv.org/html/2605.08693#bib.bib2); Wang et al., [2023](https://arxiv.org/html/2605.08693#bib.bib7)), web navigation (Yao et al., [2022](https://arxiv.org/html/2605.08693#bib.bib3); Liu et al., [2025](https://arxiv.org/html/2605.08693#bib.bib41); Qiu et al., [2025](https://arxiv.org/html/2605.08693#bib.bib42)), information seeking (Jin et al., [2025](https://arxiv.org/html/2605.08693#bib.bib20); Zhang et al., [2025](https://arxiv.org/html/2605.08693#bib.bib43); Li et al., [2025](https://arxiv.org/html/2605.08693#bib.bib44)), and software engineering and code generation (Soni et al., [2026](https://arxiv.org/html/2605.08693#bib.bib45); Rashid et al., [2025](https://arxiv.org/html/2605.08693#bib.bib46); Yu et al., [2025a](https://arxiv.org/html/2605.08693#bib.bib47); Puvvadi et al., [2025](https://arxiv.org/html/2605.08693#bib.bib48)). Despite strong performance on individual tasks, LLM agents remain largely episodic, failing to effectively leverage past experience for cross-task learning (Xia et al., [2026](https://arxiv.org/html/2605.08693#bib.bib1)). Memory-based methods (Chhikara et al., [2025](https://arxiv.org/html/2605.08693#bib.bib19); Liu et al., [2026](https://arxiv.org/html/2605.08693#bib.bib31)) can store raw trajectories, but these memories are lengthy and noisy, making it difficult to extract core principles (Xia et al., [2026](https://arxiv.org/html/2605.08693#bib.bib1)). To address this limitation, recent work has introduced the concept of skills: compact, reusable abstractions of effective experience that guide future behavior (Xia et al., [2026](https://arxiv.org/html/2605.08693#bib.bib1); Wang et al., [2026](https://arxiv.org/html/2605.08693#bib.bib37); Li et al., [2026a](https://arxiv.org/html/2605.08693#bib.bib38); Jiang et al., [2026](https://arxiv.org/html/2605.08693#bib.bib39); Gao et al., [2026](https://arxiv.org/html/2605.08693#bib.bib40)). Unlike raw trajectories, skills distill essential procedures and heuristics, substantially improving task success and efficiency. Their practical value is already evident in deployed systems: the personal assistant OpenClaw (OpenClaw Foundation, [2026](https://arxiv.org/html/2605.08693#bib.bib49)) and the coding agent Claude Code (Anthropic, [2026](https://arxiv.org/html/2605.08693#bib.bib50)) both rely on skill-based approaches. However, existing methods (Xia et al., [2026](https://arxiv.org/html/2605.08693#bib.bib1)) typically rely on an external LLM teacher that distills skills from completed trajectories on a fixed schedule, while the main policy only retrieves and applies them. Consequently, skill management remains an external mechanism rather than a learnable component of the agent's policy, limiting autonomous skill mastery.

Recent work has attempted to make skill management learnable (Li et al., [2026b](https://arxiv.org/html/2605.08693#bib.bib23); Zhang et al., [2026b](https://arxiv.org/html/2605.08693#bib.bib24); Wu et al., [2025b](https://arxiv.org/html/2605.08693#bib.bib26)), but these approaches typically rely on auxiliary modules or separate pipelines. Moreover, skill management is often guided solely by task outcome rewards, which are sparse and coarse and fail to capture how a specific skill edit impacts downstream behavior. This lack of explicit skill-quality signals prevents the agent from fully integrating skill management into its own policy.

Consequently, enabling agents to achieve autonomous skill mastery remains challenging. *First, skills are externally managed, not internally mastered.* Skill mastery should be an internalized capability of the agent, rather than an external maintenance procedure. Existing training can teach agents to invoke skills, but it rarely teaches them to treat the skill bank as something they can actively improve from their own experience. *Second, evaluating skill quality is difficult.* Task success alone is too sparse to indicate whether a specific skill edit helps. Our key insight is that high-quality skills should produce two measurable downstream effects: increasing success rates on previously failing tasks and reducing steps on already-solvable tasks. These observable effects provide the explicit signal missing from external management approaches. *Third, joint optimization is challenging.* The optimization objectives for skill management and task execution differ, and combining them in a single policy often causes interference, making training unstable.

To address these challenges, we propose the SkillMaster framework, which incorporates skill management into the agent's learning loop. Our framework is built on three key designs. The first, *trajectory-informed skill review*, allows the agent to use tool-integrated reasoning to propose, update, or retain skills based on completed task trajectories, unifying task execution and skill management in an end-to-end reinforcement learning framework. The second, a *downstream utility reward*, evaluates each candidate skill modification through counterfactual comparisons on related probe tasks, providing an explicit skill-quality signal for training skill-editing decisions. Finally, *DualAdv-GRPO* separately normalizes advantages for task-solving actions and skill-editing decisions, enabling stable joint training of the two optimization objectives within a unified policy.

Our contributions are as follows:

- We propose SkillMaster, a framework that integrates task execution with learned skill-management decisions in a single policy, jointly optimized through reinforcement learning.
- We introduce a downstream utility reward that evaluates candidate skill revisions by measuring their counterfactual impact on related probe tasks.
- We propose DualAdv-GRPO, which decouples action optimization from skill-management optimization through separate advantage normalization, enabling joint training without objective interference.
- On ALFWorld and WebShop, SkillMaster improves the overall success rate by 8.8% and 9.3%, respectively, over the strongest baseline.

![Refer to caption](https://arxiv.org/html/2605.08693v1/Figures/main.png)

Figure 2: Overview of SkillMaster. (a) Trajectory Design for Skill Mastery: the agent interacts with the environment guided by retrieved skills and then reviews the episode to propose, update, or retain skills via tool calls. (b) Counterfactual Skill Utility Reward: candidate skill changes are evaluated by counterfactual comparison on related probe tasks. (c) DualAdv-GRPO: action and skill advantages are normalized separately and merged via a tunable weight $\gamma$ into a unified PPO loss for stable joint training.

## 2 Method

Figure [2](https://arxiv.org/html/2605.08693#S1.F2) provides an overview of the SkillMaster framework. We describe its three main components, corresponding to the modules illustrated in the figure. The first component, Trajectory Design for Skill Mastery (§[2.1](https://arxiv.org/html/2605.08693#S2.SS1)), unifies the *Acting Phase* and the *Skill-Mastery Phase*: the agent interacts with the environment guided by retrieved skills, and then reviews the episode to propose, update, or retain skills via tool calls. The second component, Counterfactual Skill Utility Reward (§[2.2](https://arxiv.org/html/2605.08693#S2.SS2)), evaluates candidate skill changes by comparing performance on related probe tasks under the original and modified skill banks. The third component, DualAdv-GRPO (§[2.3](https://arxiv.org/html/2605.08693#S2.SS3)), separately normalizes advantages for task-solving actions and skill-editing decisions, merging them via a tunable weight $\gamma$ into a unified PPO loss to stabilize joint training.

### 2.1 Trajectory Design for Skill Mastery

SkillMaster augments standard agent RL training with an explicit skill-mastery phase appended after each episode. Every episode proceeds in two stages:

- **Acting Phase.** The agent interacts with the environment step by step. At each step, relevant skills are retrieved from the skill bank $\mathcal{B}$ based on the current task and injected into the observation prompt. The agent produces an action and receives a scalar environment reward $r_{\text{env}}$. A trajectory $\tau=\{o_0,a_0,r_0,\dots,o_T,a_T,r_T\}$ is collected.
- **Skill-Mastery Phase.** After the episode terminates, the system constructs a *skill-review prompt* that presents the task description, the episode outcome, the retrieved skills, the action-observation trajectory, and the final environment feedback. Skill mastery is exposed through three function-calling tools: `propose_skill` adds a new skill, `update_skill` revises an existing skill, and `keep_skill` leaves the bank unchanged. The agent is instructed to reason briefly and then output exactly one tool call; each call is executed by a backend that mutates the skill bank and returns structured status metadata. The prompt also enforces grounding constraints that prevent common failure modes, such as proposing skills unrelated to the current domain or updating the bank solely because an episode failed. Full tool schemas and prompt templates are provided in Appendix [E](https://arxiv.org/html/2605.08693#A5) and Appendix [A](https://arxiv.org/html/2605.08693#A1). This phase receives a dedicated skill-mastery reward $R_{\text{skill}}$ (defined in §[2.2](https://arxiv.org/html/2605.08693#S2.SS2)).

The two phases follow distinct optimization objectives: the acting phase is optimized for immediate environment feedback, whereas the skill-mastery phase is optimized for the long-term quality of the skill bank. We therefore treat them as *heterogeneous phases* and introduce a specialized optimization algorithm to train them jointly (§[2.3](https://arxiv.org/html/2605.08693#S2.SS3)).
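To make the tool interface concrete, here is a minimal sketch of how the three skill-management tools might be exposed as function-calling schemas. The field names (`name`, `content`, `category`, `skill_id`) are illustrative assumptions; the paper's exact schemas are given in its Appendix E.

```python
# Hypothetical function-calling schemas for the three skill-management tools.
# Field names are assumptions for illustration; the paper's Appendix E holds
# the actual schemas.
SKILL_TOOLS = [
    {
        "name": "propose_skill",
        "description": "Add a genuinely new reusable skill distilled from this episode.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "content": {"type": "string"},   # the reusable procedural lesson
                "category": {"type": "string"},  # task family, or "general"
            },
            "required": ["name", "content", "category"],
        },
    },
    {
        "name": "update_skill",
        "description": "Revise an existing skill that episode evidence contradicts.",
        "parameters": {
            "type": "object",
            "properties": {
                "skill_id": {"type": "string"},  # which skill in the bank to revise
                "content": {"type": "string"},   # the corrected lesson text
            },
            "required": ["skill_id", "content"],
        },
    },
    {
        "name": "keep_skill",
        "description": "Leave the skill bank unchanged.",
        "parameters": {"type": "object", "properties": {}},
    },
]
```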

### 2.2 Counterfactual Skill Utility Reward

A central challenge in self-managed skill evolution is defining what makes a skill *good*. In prior work such as SkillRL (Xia et al., [2026](https://arxiv.org/html/2605.08693#bib.bib1)), skills are distilled by an external teacher and the agent is trained with only the environment outcome reward, a binary success signal that conflates task-execution quality with skill quality. This reward cannot distinguish a genuinely useful skill from one that merely *looks* plausible, because it provides no signal about whether the skill actually helps on future tasks.

Our key insight is that a high-quality skill should produce two measurable effects on tasks that share similar requirements: (1) tasks that previously *failed* should become more likely to *succeed*, and (2) tasks that already succeeded should be completable in *fewer steps*, since the skill encodes more efficient strategies. We operationalize this intuition through a *downstream utility reward* based on counterfactual probe evaluation.

#### 2.2.1 Probe-Based Counterfactual Evaluation

When the agent calls `propose_skill` or `update_skill`, we select $K$ *probe tasks* that are semantically related to the current episode (e.g., tasks from the same task family, which share similar skill requirements). The selection uses a deterministic seed derived from the current task identifier, ensuring reproducibility. The probe pool is drawn from a held-out split of the training tasks, so utility evaluation measures genuine skill transfer rather than memorization of seen tasks. The benchmark-specific definitions of probe tasks are detailed in §[3.1](https://arxiv.org/html/2605.08693#S3.SS1). A minimal sketch of this selection step follows.
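The deterministic selection can be implemented in a few lines. The sketch below makes two stated assumptions: `probe_pool` is already restricted to held-out tasks from the same family or category, and the hash-based seeding is an illustrative choice, not necessarily the paper's.

```python
import hashlib
import random

def select_probes(task_id: str, probe_pool: list[str], k: int = 4) -> list[str]:
    """Pick K related probe tasks with a seed derived deterministically
    from the task identifier, so the same episode is always evaluated
    against the same probes (reproducibility)."""
    seed = int(hashlib.sha256(task_id.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return rng.sample(probe_pool, min(k, len(probe_pool)))
```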

For each probe task $p_i$, we evaluate the impact of a candidate skill modification by comparing performance on the task under two skill banks:

$$\mathcal{B}\;\xrightarrow{\text{apply candidate mutation}}\;\mathcal{B}'$$

Specifically, we first roll out $p_i$ using the original skill bank $\mathcal{B}$. Next, we create a temporary bank $\mathcal{B}'$ by applying the candidate skill modification, and then roll out the same probe task $p_i$ using $\mathcal{B}'$. This counterfactual comparison provides a direct measure of the skill modification's effect on task performance.

Each probe rollout is scored according to our two desiderata, success and efficiency:

$$\text{score}(p_i,\mathcal{B})=\mathbf{1}[\text{success}_i]+\frac{M-\text{steps}_i}{M}\cdot\mathbf{1}[\text{success}_i] \tag{1}$$

where $M$ is the maximum allowed number of steps. The first term captures whether the task now succeeds where it may have failed before; the second term captures step efficiency: a skill that encodes a shorter, more direct strategy yields a higher score. Failed probes receive a score of 0, since a skill that does not enable success confers no benefit regardless of step count.
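Equation (1) translates directly into a scoring helper. This is a minimal sketch; the function name and signature are assumptions for illustration.

```python
def probe_score(success: bool, steps: int, max_steps: int) -> float:
    """Eq. (1): a failed probe scores 0; a successful probe earns 1 plus a
    step-efficiency bonus that grows as the rollout uses fewer steps."""
    if not success:
        return 0.0
    return 1.0 + (max_steps - steps) / max_steps
```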

#### 2.2.2 Utility Computation

Define $\delta_i=\text{score}(p_i,\mathcal{B}')-\text{score}(p_i,\mathcal{B})$ as the per-probe performance delta. Over $K$ probes, let:

$$\bar{\delta}=\frac{1}{K}\sum_{i=1}^{K}\delta_i,\qquad w=\sum_{i=1}^{K}\mathbf{1}[\delta_i>0],\qquad \ell=\sum_{i=1}^{K}\mathbf{1}[\delta_i<0] \tag{2}$$

The utility reward combines the average improvement magnitude with a directional consistency term:

$$R_{\text{utility}}=\bar{\delta}+\alpha\cdot\frac{w-\ell}{K} \tag{3}$$

where $\alpha$ balances the magnitude of improvement against directional consistency. The directional consistency term $\alpha\cdot(w-\ell)/K$ reduces sensitivity to single-probe outliers, favoring skill edits whose effects are supported by multiple probes. This design encourages skill exploration while ensuring that rewards are assigned only to edits with consistent downstream impact. The full skill-mastery reward is:

$$R_{\text{skill}}=R_{\text{format}}+R_{\text{utility}} \tag{4}$$

where $R_{\text{format}}$ is a composite correctness reward that gives a small positive bonus for valid tool execution and penalizes parse errors, missing `<think>` or `<tool_call>` tags, and placeholder content.
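Putting Equations (2)-(4) together, the utility part of the reward is a short computation over the per-probe deltas. The sketch below assumes the deltas have already been computed from before/after probe rollouts; names are illustrative.

```python
def utility_reward(deltas: list[float], alpha: float = 0.3) -> float:
    """Eqs. (2)-(3): mean per-probe improvement plus a directional
    consistency term (wins minus losses, scaled by alpha, over K probes)."""
    k = len(deltas)
    mean_delta = sum(deltas) / k
    wins = sum(d > 0 for d in deltas)    # probes improved by the edit
    losses = sum(d < 0 for d in deltas)  # probes hurt by the edit
    return mean_delta + alpha * (wins - losses) / k

# Eq. (4): the full skill-mastery reward adds the format/validity term,
# which would come from tool-call validation (parse errors, missing tags):
# R_skill = r_format + utility_reward(deltas)
```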

### 2.3 DualAdv-GRPO: Decoupled Optimization over Heterogeneous Phases

Standard GRPO normalizes rewards within a group of $G$ rollouts, implicitly assuming that rewards share a common scale and semantics. This assumption fails for SkillMaster: acting phases receive binary task rewards $r_{\text{env}}$, whereas skill-mastery phases receive continuous rewards $R_{\text{skill}}$ from tool validity and probe utility. Normalizing them together would entangle task execution with skill evolution and distort within-type preference ordering.

#### 2.3.1 Dual-Stream Advantage Estimation

We address this through *dual-stream advantage normalization*. For each prompt, we sample $G$ independent trajectories, each containing acting phases followed by one skill-mastery phase. We compute separate GRPO statistics for the two reward streams: $(\mu_{\text{act}},\sigma_{\text{act}})$ from $\{r_{\text{env}}^{(j)}\}_{j=1}^{G}$ and $(\mu_{\text{skill}},\sigma_{\text{skill}})$ from $\{R_{\text{skill}}^{(j)}\}_{j=1}^{G}$. The corresponding normalized advantages are:

$$A_{\text{act}}^{(j)}=\frac{r_{\text{env}}^{(j)}-\mu_{\text{act}}}{\sigma_{\text{act}}+\epsilon},\qquad A_{\text{skill}}^{(j)}=\frac{R_{\text{skill}}^{(j)}-\mu_{\text{skill}}}{\sigma_{\text{skill}}+\epsilon} \tag{5}$$

where $\epsilon$ is a small constant for numerical stability. Thus, advantages assigned to action tokens are normalized only against action rewards from other trajectories of the same prompt, while advantages assigned to skill-mastery tokens are normalized only against skill-mastery rewards from the same prompt group.
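Equation (5) amounts to two independent group normalizations. A minimal NumPy sketch, assuming one environment reward and one skill-mastery reward per trajectory in a group of $G$ rollouts:

```python
import numpy as np

def dual_stream_advantages(r_env, r_skill, eps: float = 1e-8):
    """Eq. (5): normalize the two reward streams separately within a group
    of G rollouts of the same prompt, so action advantages and skill-mastery
    advantages each keep their own preference ordering."""
    r_env = np.asarray(r_env, dtype=float)      # shape (G,), binary task rewards
    r_skill = np.asarray(r_skill, dtype=float)  # shape (G,), continuous R_skill
    a_act = (r_env - r_env.mean()) / (r_env.std() + eps)
    a_skill = (r_skill - r_skill.mean()) / (r_skill.std() + eps)
    return a_act, a_skill
```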

#### 2.3.2 Type-Conditioned Policy Gradient

We next merge the two normalized advantage streams at the token level. For each trajectory $j$, let $S_{\text{act}}^{(j)}$ and $S_{\text{skill}}^{(j)}$ denote the token positions belonging to acting phases and the skill-mastery phase, respectively. DualAdv-GRPO assigns advantages according to token type:

$$A_{\text{DualAdv}}^{(j)}(l)=\begin{cases}A_{\text{act}}^{(j)},&l\in S_{\text{act}}^{(j)}\\ \gamma\cdot A_{\text{skill}}^{(j)},&l\in S_{\text{skill}}^{(j)}\end{cases} \tag{6}$$

where $\gamma>0$ controls the relative weight of the skill-mastery objective. This preserves within-type preference ordering while allowing action learning and skill-mastery learning to update a single policy.

The per-token policy gradient is computed through the standard PPO-clipped objective. Let $r_l(\theta)=\pi_\theta(a_l\mid o_l)/\pi_{\text{old}}(a_l\mid o_l)$ be the probability ratio and $L$ the number of generated tokens:

$$\mathcal{L}^{(j)}(\theta)=-\frac{1}{L}\sum_{l=1}^{L}\min\Big(r_l(\theta)\,A_{\text{DualAdv}}^{(j)}(l),\;\operatorname{clip}\big(r_l(\theta),1-\varepsilon,1+\varepsilon\big)\,A_{\text{DualAdv}}^{(j)}(l)\Big) \tag{7}$$

The full objective averages over all $N$ trajectories with KL regularization:

$$\mathcal{L}(\theta)=\frac{1}{N}\sum_{j=1}^{N}\mathcal{L}^{(j)}(\theta)+\beta\cdot\mathcal{L}_{\text{KL}}(\theta) \tag{8}$$

The complete DualAdv-GRPO procedure is summarized in Algorithm [1](https://arxiv.org/html/2605.08693#alg1) in Appendix [B](https://arxiv.org/html/2605.08693#A2).
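For one trajectory, the type-conditioned merge and PPO clip of Equations (6)-(7) can be sketched in PyTorch as follows. All inputs are per-token tensors over the trajectory's generated tokens; the trajectory-level advantages are passed as scalars and broadcast by token type. This is an illustrative sketch under those assumptions, not the released implementation.

```python
import torch

def dualadv_ppo_loss(logp_new: torch.Tensor,        # (L,) log-probs under pi_theta
                     logp_old: torch.Tensor,        # (L,) log-probs under pi_old
                     is_skill_token: torch.Tensor,  # (L,) bool: skill-mastery tokens
                     a_act: float, a_skill: float,
                     gamma: float = 1.0, clip_eps: float = 0.2) -> torch.Tensor:
    """Eqs. (6)-(7): give each token its phase's normalized advantage
    (skill tokens scaled by gamma), then apply the standard PPO clip."""
    skill_mask = is_skill_token.float()
    adv = a_act * (1.0 - skill_mask) + gamma * a_skill * skill_mask  # Eq. (6)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()                     # Eq. (7)
```

The full objective of Equation (8) would then average this loss over the $N$ trajectories in the batch and add the KL penalty term.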

## 3 Experiments

We evaluate SkillMaster on two standard LLM-agent benchmarks: ALFWorld and WebShop. Our experiments are organized around four research questions:

1. Q1. Does SkillMaster outperform state-of-the-art methods, including closed-source LLMs, prompt-based agents, and RL-based approaches?
2. Q2. How do individual components affect performance?
3. Q3. Does SkillMaster internalize learned skills without relying on test-time retrieval?
4. Q4. How does SkillMaster manage skills in practice?

Table 1: Performance comparison on ALFWorld and WebShop. We report ALFWorld per-family success rates (%) and the overall average success rate, together with WebShop score and success rate. The best results are shown in **bold** and the second-best results are underlined.

| Method | Pick | Look | Clean | Heat | Cool | Pick2 | All | Score | Succ. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Closed-source LLMs* | | | | | | | | | |
| GPT-4o | 75.3 | 60.8 | 31.2 | 56.7 | 21.6 | 49.8 | 48.0 | 31.8 | 23.7 |
| Gemini-2.5-Pro | 92.8 | 63.3 | 62.1 | 69.0 | 26.6 | 58.7 | 60.3 | 42.5 | 35.9 |
| *Prompt-based Agentic or Memory-based Methods* | | | | | | | | | |
| Qwen2.5-7B-Instruct | 33.4 | 21.6 | 19.3 | 6.90 | 2.80 | 3.20 | 14.8 | 26.4 | 7.80 |
| ReAct | 48.5 | 35.4 | 34.3 | 13.2 | 18.2 | 17.6 | 31.2 | 46.2 | 19.5 |
| Reflexion | 62.0 | 41.6 | 44.9 | 30.9 | 36.3 | 23.8 | 42.7 | 58.1 | 28.8 |
| Mem0 | 54.0 | 55.0 | 26.9 | 36.4 | 20.8 | 7.69 | 33.6 | 23.9 | 2.00 |
| ExpeL | 21.0 | 67.0 | 55.0 | 52.0 | 71.0 | 6.00 | 46.3 | 30.9 | 11.2 |
| MemP | 54.3 | 38.5 | 48.1 | 56.2 | 32.0 | 16.7 | 41.4 | 25.3 | 6.40 |
| SimpleMem | 64.5 | 33.3 | 20.0 | 12.5 | 33.3 | 3.84 | 29.7 | 33.2 | 8.59 |
| *RL-based Methods* | | | | | | | | | |
| RLOO | 87.6 | <u>78.2</u> | 87.3 | 81.3 | 71.9 | 48.9 | 75.5 | 80.3 | 65.7 |
| GRPO | 90.8 | 66.1 | 89.3 | 74.7 | 72.5 | 64.7 | 77.6 | 79.3 | 66.1 |
| *Memory-Augmented RL-based Methods* | | | | | | | | | |
| MemRL | 62.8 | 38.5 | 22.2 | 12.5 | 8.00 | 0.00 | 21.4 | 29.5 | 9.20 |
| EvolveR | 64.9 | 33.3 | 46.4 | 13.3 | 33.3 | 33.3 | 43.8 | 42.5 | 17.6 |
| Mem0+GRPO | 78.1 | 54.8 | 56.1 | 31.0 | 65.0 | 26.9 | 54.7 | 58.1 | 37.5 |
| SimpleMem+GRPO | 89.5 | 36.3 | 60.0 | 50.0 | 64.9 | 26.3 | 62.5 | 67.8 | 46.9 |
| SkillRL | <u>97.9</u> | 71.4 | <u>90.0</u> | <u>90.0</u> | <u>95.5</u> | <u>87.5</u> | <u>89.9</u> | <u>85.2</u> | <u>72.7</u> |
| **SkillMaster** | **100** | **100** | **100** | **97.1** | **95.7** | **100** | **98.7** | **95.0** | **82.0** |

### 3.1 Experimental Setup

##### Environments.

We evaluate on ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.08693#bib.bib2)), an embodied household benchmark spanning six manipulation task families, and WebShop (Yao et al., [2022](https://arxiv.org/html/2605.08693#bib.bib3)), an online shopping benchmark requiring multi-step product search and purchase. Detailed task descriptions are provided in Appendix [G](https://arxiv.org/html/2605.08693#A7).

##### Baselines.

We compare against four categories of methods: closed-source LLMs as zero-shot agents (GPT-4o, Gemini-2.5-Pro); prompt-based and memory-augmented frameworks (ReAct, Reflexion, Mem0, ExpeL, MemP, SimpleMem); standard RL methods (RLOO, GRPO); and memory-augmented RL approaches (EvolveR, MemRL, Mem0+GRPO, SimpleMem+GRPO), among which SkillRL serves as the state-of-the-art teacher-driven skill-evolution baseline. Full baseline descriptions are provided in Appendix [D](https://arxiv.org/html/2605.08693#A4). A detailed fair-comparison statement is provided in Appendix [J](https://arxiv.org/html/2605.08693#A10).

##### Implementation Details.

We follow the cold-start SFT pipeline of SkillRL (Xia et al., [2026](https://arxiv.org/html/2605.08693#bib.bib1)), using Claude as a teacher to generate skill-augmented reasoning traces for the Qwen2.5-7B-Instruct base model. RL training uses GRPO with group size $G=8$, KL penalty coefficient $0.01$, and learning rate $1\times10^{-6}$, on 8 A100 GPUs via the Verl framework with vLLM for rollout generation. All reported results are averaged over three independent runs. The initial skill bank is adapted from SkillRL with light deduplication. Skill-mastery phases are appended after every training episode. The utility reward uses $K=4$ same-family probes with $\alpha=0.3$, selected by task family for ALFWorld and by product category for WebShop. Full configuration details are provided in Appendix [C](https://arxiv.org/html/2605.08693#A3); the key hyperparameters are collected in the sketch below.
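For quick reference, the stated hyperparameters can be summarized as a configuration dictionary. The key names below are illustrative assumptions, not the Verl framework's actual config schema.

```python
# Hypothetical summary of the stated training configuration; key names are
# illustrative and do not reflect Verl's real config schema.
TRAIN_CONFIG = {
    "base_model": "Qwen2.5-7B-Instruct",
    "algorithm": "DualAdv-GRPO",
    "group_size": 8,        # G rollouts per prompt
    "kl_coef": 0.01,        # beta, KL penalty coefficient in Eq. (8)
    "learning_rate": 1e-6,
    "probes_per_edit": 4,   # K same-family probe tasks per skill edit
    "alpha": 0.3,           # directional-consistency weight in Eq. (3)
    "seeds": 3,             # results averaged over three independent runs
}
```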

### 3.2 Main Results

Table [1](https://arxiv.org/html/2605.08693#S3.T1) confirms two baseline trends. First, RL-trained agents substantially outperform closed-source LLMs and prompt-based methods, confirming that policy optimization is critical for interactive tasks. Second, structured skill libraries (SkillRL) significantly outperform generic memory augmentation, demonstrating the value of procedural abstraction over raw trajectory storage.

SkillMaster achieves the best performance while removing the external teacher. On ALFWorld, it raises the strongest baseline from 89.9% to 98.7%, with perfect success on four of six task families and above 95% on the remaining two. The average improvement of 8.8 percentage points over SkillRL is particularly notable given that SkillRL already represents a strong starting point near 90% success. On WebShop, success improves from 72.7% to 82.0%, and the overall score increases from 85.2 to 95.0. The gap between score (95.0) and success (82.0) suggests that SkillMaster not only completes more tasks but also achieves higher-quality decisions on the tasks it solves. These gains are broadly distributed across all task types, with no family falling below 95% on ALFWorld, indicating improved general competence rather than overfitting to specific patterns.

Unlike the teacher-driven approach of SkillRL, SkillMaster learns skill management end-to-end as part of its policy. The downstream utility reward provides a direct quality signal for skill edits via counterfactual comparison on probe tasks, and the agent learns to decide *when* and *how* to revise skills from trajectory evidence. The substantial gains over SkillRL in Table [1](https://arxiv.org/html/2605.08693#S3.T1) demonstrate the advantage of this utility-guided design over fixed external curation. A detailed case study appears in Section [3.5](https://arxiv.org/html/2605.08693#S3.SS5).

![Refer to caption](https://arxiv.org/html/2605.08693v1/Figures/ablation.png)

Figure 3: (a) Ablation of skill-mastery components on ALFWorld and WebShop. (b) Skill internalization on ALFWorld: the trained SkillMaster policy evaluated with and without skill retrieval, compared against SkillRL, which always uses skills.

![Refer to caption](https://arxiv.org/html/2605.08693v1/Figures/case_study.png)

Figure 4: Case studies of skill management and utility evaluation. Case 1 (Propose Skill): the agent failed a cooling task after exhaustively searching low-probability food zones (cabinets, drawers). It proposed "Do Not Search Invalid Zones", encoding a reusable search-prior principle. Case 2 (Update Skill): the agent observed that heating succeeded only when holding the apple, contradicting the existing skill "Open Then Heat" (which instructed placing the object inside first). It revised the skill to "Heat While Holding Target". Case 3 (Skill Utility): a probe task (heat apple → fridge) was evaluated before and after the skill revision of Case 2. Before: the old skill caused an 8-step microwave confusion (place inside → heat fails → repeated open/close → take back → re-heat). After: the revised skill enabled correct 2-step execution (open microwave, heat while holding), eliminating 6 unnecessary steps.

### 3.3 Ablation Study

Figure [3](https://arxiv.org/html/2605.08693#S3.F3)(a) reports the impact of removing each component of SkillMaster. Removing the utility reward leads to a clear drop on both benchmarks, suggesting that counterfactual probe evaluation provides an important signal for distinguishing useful skill edits from merely well-formed ones. Replacing per-type advantage normalization with a single coupled normalization group also reduces performance, showing that decoupling action and skill-mastery advantages is important for joint optimization. Random probes consistently underperform same-family probes, showing that task-related probe selection is important for informative skill-quality feedback. Removing the cold-start SFT phase causes the largest degradation, highlighting the importance of initializing the agent with basic tool-use and task-execution capabilities. Overall, each component contributes to stable and effective self-managed skill evolution. To further disentangle the effect of utility-credited skill evolution from extra post-episode computation, and to test robustness to weaker initial skill banks, we provide additional control experiments in Appendix [L](https://arxiv.org/html/2605.08693#A12) and Appendix [M](https://arxiv.org/html/2605.08693#A13).

### 3.4 Analysis of Skill Internalization

We observe an intriguing pattern on ALFWorld. Figure [3](https://arxiv.org/html/2605.08693#S3.F3)(b) compares the trained SkillMaster policy evaluated with and without test-time skill retrieval, alongside SkillRL evaluated with its skill library. The two SkillMaster settings differ by only 0.7 percentage points overall, and the gap is zero on four of six task families. The only measurable differences appear in Heat (0.9) and Cool (4.0), two procedurally more complex families. Notably, SkillMaster without test-time skills still outperforms SkillRL with skills on Pick, Pick Two, Look, and Clean, and achieves a higher overall score. A plausible explanation is skill internalization: the training process forces the agent to compare trajectory evidence against retrieved skills and decide whether to propose, update, or retain them, which may encourage procedural knowledge to be absorbed into the policy parameters. While other interpretations exist, this evidence suggests that SkillMaster effectively internalizes reusable knowledge, reducing its dependence on explicit skill retrieval at test time.

### 3.5 Case Study

Figure [4](https://arxiv.org/html/2605.08693#S3.F4) presents three cases drawn from ALFWorld rollouts, illustrating how SkillMaster manages its skill bank through tool-integrated reasoning (TIR). Full case descriptions are provided in Appendix [H](https://arxiv.org/html/2605.08693#A8).

Case 1 demonstrates *proposing* a new skill when a gap is detected. The agent failed a cooling task because it swept low-probability food zones instead of checking high-probability locations first. During skill review, it recognized that none of the retrieved skills addressed search-prior allocation, and called `propose_skill` to add a reusable search heuristic.

Case 2 demonstrates *updating* an existing skill when it proves imprecise. The agent succeeded at a heating task but observed a recurring pattern: the existing skill instructed placing the object inside the microwave before heating, yet heating only worked when the object was held in hand. The agent called `update_skill` to correct this operational guidance.

Case 3 *validates* the downstream impact of the revision from Case 2. A probe task from the same family was evaluated before and after the skill change. Before the revision, the imprecise skill caused an 8-step confusion at the microwave. After the revision, the same task completed heating in 2 steps, eliminating six unnecessary operations. This direct efficiency gain is precisely what the utility reward is designed to measure and reinforce.

### 3.6 Computational Overhead

We also analyze computational overhead in Appendix [K](https://arxiv.org/html/2605.08693#A11). On ALFWorld, enabling utility-based skill management increases the average training time from 12.13 minutes per step to about 16.00 minutes per step, a 31.9% wall-clock increase. It also raises average GPU memory allocation from 33.51 GiB to 34.57 GiB and peak GPU memory allocation from 61.29 GiB to 70.49 GiB. These results show that the utility reward introduces noticeable but manageable computational overhead.

## 4 Related Work

We summarize the most relevant work here and defer a broader discussion to Appendix [I](https://arxiv.org/html/2605.08693#A9).

##### Skill management in LLM agents.

Prior agent-memory methods store interaction traces for later retrieval (Park et al., [2023](https://arxiv.org/html/2605.08693#bib.bib6); Shinn et al., [2023](https://arxiv.org/html/2605.08693#bib.bib8); Chhikara et al., [2025](https://arxiv.org/html/2605.08693#bib.bib19)), whereas skill-based methods abstract experience into reusable procedural knowledge (Wang et al., [2023](https://arxiv.org/html/2605.08693#bib.bib7); Xia et al., [2026](https://arxiv.org/html/2605.08693#bib.bib1)). In most such approaches, however, skill management is handled by an external teacher, a fixed update rule, or a specialized module rather than by the acting policy itself. Recent concurrent methods make skill evolution more adaptive but still rely on managerial roles, controllers, or skill-generation pipelines (Li et al., [2026b](https://arxiv.org/html/2605.08693#bib.bib23); Zhang et al., [2026b](https://arxiv.org/html/2605.08693#bib.bib24); Wu et al., [2025b](https://arxiv.org/html/2605.08693#bib.bib26); Wang et al., [2025](https://arxiv.org/html/2605.08693#bib.bib25); Zhang et al., [2026a](https://arxiv.org/html/2605.08693#bib.bib27)). Moreover, they typically assess skill quality only indirectly through task outcomes or heuristic update rules, rather than explicitly measuring the causal utility of a particular skill edit. This leaves skill editing weakly supervised, since task-level rewards need not reveal whether a specific write operation was beneficial. SkillMaster instead uses tool-integrated reasoning to let the same policy both act and edit its own skill bank, with edits credited by counterfactual downstream utility.

##### RL and tool-integrated reasoning.

RL post-training methods such as PPO, RLOO, GRPO, DAPO, and GiGPO optimize LLM agents through environment feedback (Schulman et al., [2017](https://arxiv.org/html/2605.08693#bib.bib5); Ahmadian et al., [2024](https://arxiv.org/html/2605.08693#bib.bib11); Shao et al., [2024](https://arxiv.org/html/2605.08693#bib.bib4); Yu et al., [2025b](https://arxiv.org/html/2605.08693#bib.bib12); Feng et al., [2025](https://arxiv.org/html/2605.08693#bib.bib13)). Tool-use and TIR methods train models to invoke tools or structured function calls (Schick et al., [2023](https://arxiv.org/html/2605.08693#bib.bib9); Wei et al., [2025](https://arxiv.org/html/2605.08693#bib.bib33); Fang and Sun, [2026](https://arxiv.org/html/2605.08693#bib.bib34); Zeng et al., [2026](https://arxiv.org/html/2605.08693#bib.bib35); Zhang et al., [2026d](https://arxiv.org/html/2605.08693#bib.bib36)). Most such methods use tools to access external information or execute auxiliary computation, rather than to modify the agent's own reusable memory. Our setting is different because memory-write actions affect future behavior only indirectly, often across multiple later episodes. As a result, action decisions and write decisions induce different credit-assignment structures and should not share a single normalized advantage signal. Our work extends this paradigm to self-editing skill memory, where write actions are evaluated by downstream utility and optimized separately from task-execution turns.

## 5 Conclusion and Discussion

We presented SkillMaster, a framework that enables LLM agents to manage reusable skill banks through tool-integrated reasoning and reinforcement learning. The central idea is to turn skill management from an externally managed maintenance process into an optimization problem within the agent's own policy, where skill changes are generated through tool calls and credited by counterfactual downstream utility. This enables the policy to jointly learn how to act and how to improve the knowledge it will reuse in future episodes. Experiments on ALFWorld and WebShop show that SkillMaster outperforms standard RL and externally managed skill-library baselines, with both the utility reward and the decoupled optimization contributing to the gains. These results suggest that learned skill management can be a useful mechanism for improving the quality of reusable procedural knowledge in LLM agents. At the same time, the current method relies on predefined task groupings for probe selection, and the cost of probe evaluation grows with episode length and mutation frequency, even if the overhead is manageable in our current settings. Moreover, our evaluation is limited to two relatively structured benchmarks, and further validation on more open-ended environments is needed. Promising directions for future work include skill deletion, multi-agent skill sharing, continual learning under distribution shift, and extending TIR to other forms of agent self-improvement. A more detailed discussion is provided in Appendix [F](https://arxiv.org/html/2605.08693#A6).

## References

- A. Ahmadian, C. Cremer, M. Gallé, M. Fadaee, J. Kreutzer, O. Pietquin, A. Üstün, and S. Hooker (2024) Back to basics: revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12248–12267.
- Anthropic (2026) Claude Code. [https://www.anthropic.com/product/claude-code](https://www.anthropic.com/product/claude-code). Accessed: 2026-05-07.
- P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025) Mem0: building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.
- R. Fang, Y. Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang (2025) Memp: exploring agent procedural memory. arXiv preprint arXiv:2508.06433.
- Z. Fang and R. Sun (2026) AdaTIR: adaptive tool-integrated reasoning via difficulty-aware policy optimization. arXiv preprint arXiv:2601.14696.
- L. Feng, Z. Xue, T. Liu, and B. An (2025) Group-in-group policy optimization for LLM agent training. arXiv preprint arXiv:2505.10978.
- Y. Gao, Z. Li, Z. Ji, P. Ma, S. Wang, et al. (2026) SkillReducer: optimizing LLM agent skills for token efficiency. arXiv preprint arXiv:2603.29919.
- Y. Jiang, D. Li, H. Deng, B. Ma, X. Wang, Q. Wang, and G. Yu (2026) SoK: agentic skills – beyond tool use in LLM agents. arXiv preprint arXiv:2602.20867.
- B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han (2025) Search-R1: training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.
- X. Li, W. Chen, Y. Liu, S. Zheng, X. Chen, Y. He, Y. Li, B. You, H. Shen, J. Sun, et al. (2026a) SkillsBench: benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670.
- X. Li, G. Dong, J. Jin, Y. Zhang, Y. Zhou, Y. Zhu, P. Zhang, and Z. Dou (2025) Search-o1: agentic search-enhanced large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 5420–5438.
- Y. Li, R. Miao, Z. Qi, and T. Lan (2026b) Arise: agent reasoning with intrinsic skill evolution in hierarchical reinforcement learning. arXiv preprint arXiv:2603.16060.
- J. Liu, Y. Su, P. Xia, S. Han, Z. Zheng, C. Xie, M. Ding, and H. Yao (2026) SimpleMem: efficient lifelong memory for LLM agents. arXiv preprint arXiv:2601.02553.
- J. Liu, J. Hao, C. Zhang, and Z. Hu (2025) WEPO: web element preference optimization for LLM-based web navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 26614–26622.
- OpenClaw Foundation (2026) OpenClaw documentation. [https://docs.openclaw.ai/index](https://docs.openclaw.ai/index). Accessed: 2026-05-07.
- J. S. Park, J. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein (2023) Generative agents: interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22.
- S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez (2024) Gorilla: large language model connected with massive APIs. Advances in Neural Information Processing Systems 37, pp. 126544–126565.
- M. Puvvadi, S. K. Arava, A. Santoria, S. S. P. Chennupati, and H. V. Puvvadi (2025) Coding agents: a comprehensive survey of automated bug fixing systems and benchmarks. In 2025 IEEE 14th International Conference on Communication Systems and Network Technologies (CSNT), pp. 680–686.
- H. Qiu, A. R. Fabbri, D. Agarwal, K. Huang, S. Tan, N. Peng, and C. Wu (2025) Evaluating cultural and social awareness of LLM web agents. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 3978–4005.
- M. S. Rashid, C. Bock, Y. Zhuang, A. Buchholz, T. Esler, S. Valentin, L. Franceschi, M. Wistuba, P. T. Sivaprasad, W. J. Kim, et al. (2025) SWE-PolyBench: a multi-language benchmark for repository-level evaluation of coding agents. arXiv preprint arXiv:2504.08703.
- T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023) Toolformer: language models can teach themselves to use tools. Advances in Neural Information Processing Systems 36, pp. 68539–68551.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, pp. 8634–8652.
- M. Shridhar, X. Yuan, M. Cote, Y. Bisk, A. Trischler, and M. Hausknecht (2021) ALFWorld: aligning text and embodied environments for interactive learning. In International Conference on Learning Representations.
- A. B. Soni, B. Li, X. Wang, V. Chen, and G. Neubig (2026) Coding agents with multimodal browsing are generalist problem solvers. In Findings of the Association for Computational Linguistics: EACL 2026, pp. 6052–6069.
- R. S. Sutton, D. Precup, and S. Singh (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2), pp. 181–211.
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023) Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
- J. Wang, Q. Yan, Y. Wang, Y. Tian, S. S. Mishra, Z. Xu, M. Gandhi, P. Xu, and L. L. Cheong (2025) Reinforcement learning for self-improving agent with skill library. arXiv preprint arXiv:2512.17102.
- J. Wang, Z. Tao, H. Zeng, Z. Yang, H. Zamani, and H. Yu (2026) TARSE: test-time adaptation via retrieval of skills and experience for reasoning agents. arXiv preprint arXiv:2603.01241.
- Y. Wei, X. Yu, Y. Weng, T. Pan, A. Li, and L. Du (2025) AutoTIR: autonomous tools integrated reasoning via reinforcement learning. arXiv preprint arXiv:2507.21836.
- R. Wu, X. Wang, J. Mei, P. Cai, D. Fu, C. Yang, L. Wen, X. Yang, Y. Shen, Y. Wang, et al. (2025a) EvolveR: self-evolving LLM agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079.
- X. Wu, Z. Li, G. Shi, A. Duffy, T. Marques, M. L. Olson, T. Zhou, and D. Manocha (2025b) Co-evolving LLM decision and skill bank agents for long-horizon tasks. arXiv preprint arXiv:2604.20987.
- P. Xia, J. Chen, H. Wang, J. Liu, K. Zeng, Y. Wang, S. Han, Y. Zhou, X. Zhao, H. Chen, et al. (2026) SkillRL: evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234.
- J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press (2024) SWE-agent: agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37, pp. 50528–50652.
- S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022) WebShop: towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems 35, pp. 20744–20757.
- B. Yu, Y. Zhu, P. He, and D. Kang (2025a) UTBoost: rigorous evaluation of coding agents on SWE-Bench. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3762–3774.
- Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. (2025b) DAPO: an open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
- Y. Zeng, X. Ding, Y. Liu, Y. Wang, Q. Du, Y. Hou, W. Ning, H. Song, D. Tang, D. Tu, et al. (2026) AutoTool: automatic scaling of tool-use capabilities in RL via decoupled entropy constraints. arXiv preprint arXiv:2603.13348.
- D. Zhang, Y. Zhao, J. Wu, L. Zhang, B. Li, W. Yin, Y. Jiang, Y. Li, K. Tu, P. Xie, et al. (2025) EvolveSearch: an iterative self-evolving search agent. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 13134–13147.
- H. Zhang, S. Fan, H. P. Zou, Y. Chen, Z. Wang, J. Zhou, C. Li, W. Huang, Y. Yao, K. Zheng, et al. (2026a) CoEvoSkills: self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687.
- H. Zhang, Q. Long, J. Bao, T. Feng, W. Zhang, H. Yue, and W. Wang (2026b) MemSkill: learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474.
- S. Zhang, J. Wang, R. Zhou, J. Liao, Y. Feng, Z. Li, Y. Zheng, W. Zhang, Y. Wen, Z. Li, et al. (2026c) MemRL: self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192.
- X. Zhang, Q. He, Z. Zheng, Z. Zhang, X. He, and D. Li (2026d) ASTER: agentic scaling with tool-integrated extended reasoning. arXiv preprint arXiv:2602.01204.

## Appendix A Skill-Review Prompt Templates

Figures [5](https://arxiv.org/html/2605.08693#A1.F5) and [6](https://arxiv.org/html/2605.08693#A1.F6) show the full skill-review prompt templates used in SkillMaster for ALFWorld and WebShop, respectively. The prompts are constructed by `build_skill_management_prompt` in `skill_management.py`. Placeholders ({task}, {category}, etc.) are filled per episode with the task description, inferred skill category, retrieved skills, trajectory trace, and outcome.

ALFWorld Skill-Review Prompt Template

```
You are reviewing a completed ALFWorld episode.
Decide whether the skill bank should be updated based on reusable evidence
from this episode.

Rules:
- Call at most one skill-management tool.
- Choose the tool that best reflects whether the episode reveals a reusable
  lesson that is missing or incorrect in the current bank.
- Use propose_skill only for a genuinely new reusable lesson.
- Use update_skill only when an existing skill should be revised.
- Use keep_skill only when the current retrieved skills already cover the
  observed strategy or failure pattern well enough.
- Base your decision on the task, episode evidence, and the current retrieved
  skills.
- Compare the observed pattern against the retrieved skills explicitly; do not
  choose keep_skill just because the broad task category already has some skills.
- Success alone is not a reason to keep_skill: if a successful episode
  demonstrates a concise reusable strategy that is not already covered,
  propose_skill or update_skill.
- Failure alone is not a reason to change the bank: if the failure does not
  reveal a concrete reusable lesson beyond the current retrieved skills, use
  keep_skill.
- For failed episodes, treat repeated invalid loops, repeated ineffective
  actions after "Nothing happens.", missed visible targets, incorrect subgoal
  switching, and losing track of required object counts as strong evidence for
  propose_skill or update_skill unless an existing retrieved skill already
  states that rule explicitly.
- For successful episodes, prefer propose_skill or update_skill when the
  success depends on a reusable tactic, ordering rule, or search heuristic that
  is not already stated explicitly in the retrieved skills.
- Add or revise a skill only if it is generic, concise, and useful for future
  ALFWorld episodes.
- Do not propose or update a skill that merely restates an already retrieved
  skill with minor wording changes.
- Only write skills grounded in the current ALFWorld household environment,
  task, objects, and receptacles.
- Do not store task-instance details such as specific object instances, room
  names, receptacle IDs, or one-off episode narration unless the lesson is
  clearly reusable.
- Do not include meta-commentary about skill-bank decisions, prompt quality,
  guidance quality, success/failure labels, or whether the bank should change.
- Do not output placeholders, ellipses, half-finished text, or copied
  trajectory fragments.
- Do not output an ALFWorld <action>; the episode is already over.
- Do not choose navigation or environment actions such as look, go to, take,
  move, clean, heat, cool, or done.
- First reason inside <think> </think>, then output exactly one skill-management
  tool call in JSON.
- Keep <think> extremely short: 1-3 sentences, no bullet points, no long
  episode recap, and no copied trajectory details.
- The JSON must be enclosed in <tool_call> </tool_call> tags.
- For task-specific skills, set category to the Skill Category shown below.
- Use category="general" only if clearly reusable across multiple task types.

Output requirements:
- Reason inside <think>...</think>, but keep it brief and decision-focused.
- Final content must be exactly one skill-management tool call wrapped in
  <tool_call>...</tool_call>.
- Inside <tool_call>...</tool_call>, output only valid JSON.
- Do not use markdown code fences.

Task: {task}
Skill Category: {category}
Episode Outcome: {outcome}
Episode Reward: {episode_reward}

Retrieved Skills:
{retrieved_skills_text}

Episode Trajectory:
{episode_trace}
END EPISODE TRAJECTORY.

Now output only the final skill-management decision using the required
<tool_call> JSON format.
```

Figure 5: ALFWorld skill-review prompt template.

WebShop Skill-Review Prompt Template
```
You are reviewing a completed WebShop episode.
Decide whether the skill bank should be updated based on reusable evidence
from this shopping episode.

Rules:
- Call at most one skill-management tool.
- Use propose_skill only for a genuinely new reusable lesson.
- Use update_skill only when an existing skill should be revised.
- Use keep_skill only when the current retrieved skills already cover the
  observed strategy or failure pattern well enough.
- Every decision must be justified by concrete evidence from THIS EPISODE, not
  by generic real-world shopping advice.
- Compare the observed pattern against the retrieved skills explicitly.
- Success alone is not a reason to keep_skill.
- Failure alone is not a reason to change the bank.
- Favor lessons about search query formulation, attribute filtering, option
  selection, product comparison, and the search-click-option-buy workflow.
- Treat missed required attributes, wrong option selection, premature buying,
  ineffective repeated searches, and losing track of constraints as strong
  evidence for propose_skill or update_skill.
- For keep_skill, the reason must tie to the actual trajectory.
- Do not default to "verify more attributes" unless trajectory evidence
  specifically supports it over a search or result-screening lesson.
- Add or revise a skill only if it is generic, concise, and useful for future
  WebShop tasks.
- Do not store one-off product titles, product IDs, page indices, exact prices,
  or episode narration unless the lesson is clearly reusable.
- Do not output a WebShop <action>; the episode is already over.
- Do not choose environment actions such as search[...], click[...], buy, etc.
- First reason inside <think> </think>, then output exactly one skill-management
  tool call in JSON.
- Keep <think> extremely short: 1-3 sentences.
- The JSON must be enclosed in <tool_call> </tool_call> tags.
- For task-specific skills, set category to the WebShop Category shown below.
- Use category="general" only if clearly reusable across multiple categories.

Shopping Task: {task}
WebShop Category: {category}
Episode Outcome: {outcome}
Episode Reward: {episode_reward}
Goal Index: {goal_idx}
Goal Text: {goal}

Retrieved Skills:
{retrieved_skills_text}

Episode Trajectory:
{episode_trace}
END EPISODE TRAJECTORY.

Now output only the final skill-management decision using the required
<tool_call> JSON format.
```

Figure 6: WebShop skill-review prompt template.
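Both templates require the review turn to end with exactly one JSON tool call inside <tool_call> tags. Below is a minimal sketch of how that decision could be extracted from the model output; the parser is illustrative, since the paper does not show its rollout-side parsing code.

```python
import json
import re

def parse_skill_decision(review_output: str):
    """Extract the single skill-management tool call required by the
    templates above; returns None if the output is malformed."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>",
                      review_output, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```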
## Appendix B DualAdv-GRPO Algorithm

Algorithm [1](https://arxiv.org/html/2605.08693#alg1) summarizes one training iteration of the DualAdv-GRPO optimization loop.

Algorithm 1 DualAdv-GRPO: One Training Iteration

Require: policy $\pi_{\theta}$, reference policy $\pi_{\text{ref}}$, skill bank $\mathcal{B}$, environment $\mathcal{E}$, group size $G$, skill weight $\gamma$

1: for each prompt $p$ in the batch do
2:   for $g = 1$ to $G$ do
3:     sample trajectory $\tau_{g}$ from $\pi_{\theta}$: act in $\mathcal{E}$ with skills from $\mathcal{B}$, then output one skill-management tool call
4:     collect $r_{\text{env}}^{(g)}$ and $R_{\text{skill}}^{(g)}$
5:   end for
6:   compute $A_{\text{act}}^{(g)}, A_{\text{skill}}^{(g)}$ via Eq. [5](https://arxiv.org/html/2605.08693#S2.E5) using per-type group statistics
7:   for each token $l$ of $\tau_{g}$ do
8:     if $l \in S_{\text{act}}^{(g)}$ then
9:       use advantage $A_{\text{act}}^{(g)}$
10:    else
11:      use advantage $A_{\text{skill}}^{(g)}$ weighted by $\gamma$
12:    end if
13:  end for
14: end for
15: $\mathcal{L} \leftarrow \frac{1}{N}\sum_{j=1}^{N}$ Eq. [7](https://arxiv.org/html/2605.08693#S2.E7) $+\, \beta \cdot \mathcal{L}_{\text{KL}}$
16: $\theta \leftarrow \theta - \eta \nabla_{\theta}\mathcal{L}$
17: return updated policy $\pi_{\theta}$
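To make the per-type advantage computation in steps 6–12 concrete, the sketch below assumes Eq. 5 is a standard group mean/std normalization applied separately to environment rewards and skill rewards; the function names and this exact normalization are illustrative assumptions, not the released implementation.

```python
import numpy as np

def dual_advantages(r_env, r_skill, gamma, eps=1e-8):
    """Group-relative advantages computed separately per reward type.

    r_env, r_skill: shape (G,) rewards for the G rollouts of one prompt.
    Action tokens later receive a_act[g]; skill-management tokens receive
    a_skill[g], already scaled by the skill weight gamma (illustrative).
    """
    a_act = (r_env - r_env.mean()) / (r_env.std() + eps)
    a_skill = gamma * (r_skill - r_skill.mean()) / (r_skill.std() + eps)
    return a_act, a_skill

def token_advantages(is_action_token, a_act_g, a_skill_g):
    """Per-token advantages for one trajectory g: action tokens get the
    action advantage, all other (skill-management) tokens the skill one."""
    return np.where(is_action_token, a_act_g, a_skill_g)
```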

## Appendix C Implementation and Configuration Details

##### Probe task selection.

For ALFWorld, probe tasks are selected from the same task family as the current episode, which forms a natural skill-sharing boundary: a skill about microwave usage should affect all heat tasks but is unlikely to affect clean or pick-and-place tasks. For WebShop, probes are drawn from the same product category (apparel, footwear, electronics, accessories, home decor, beauty, health), under the heuristic that skills for searching and filtering shirts transfer to other apparel tasks while being largely irrelevant to electronics or footwear. The selection uses a deterministic seed derived from the current task identifier, ensuring reproducibility while avoiding overlap with the current task. Probe rollouts use a separate environment pool whose seed is offset by 50,000 to avoid data leakage.
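A minimal sketch of this deterministic selection is given below, assuming the seed is derived by hashing the task identifier; the hashing scheme and function name are illustrative, while the K probes, same-family restriction, and 50,000 seed offset follow the description above.

```python
import hashlib
import random

def select_probe_tasks(task_id: str, family_tasks: list[str],
                       k: int = 4, probe_seed_offset: int = 50_000):
    """Deterministically choose K same-family probe tasks, excluding the
    current task, for counterfactual utility evaluation (illustrative)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16) % (2**31) + probe_seed_offset)
    candidates = [t for t in family_tasks if t != task_id]
    return rng.sample(candidates, min(k, len(candidates)))
```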

##### Training hyperparameters.

All experiments use GRPO with group size $G=8$, a KL penalty coefficient of $0.01$, and a constant learning rate of $1\times 10^{-6}$. The utility reward uses $K=4$ same-family probes and $\alpha=0.3$. Skill-management turns are configured at the trajectory level. The skill bank uses the standard SKILLRL JSON format, with general skills and task-specific skills organized by category.
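These settings can be summarized in one place; the key names below are illustrative conveniences, not taken from the released configuration files.

```python
# Training configuration as reported in this appendix (illustrative key names).
TRAIN_CONFIG = {
    "algorithm": "DualAdv-GRPO",
    "group_size": 8,        # G: rollouts sampled per prompt
    "kl_coeff": 0.01,       # beta: KL penalty toward the reference policy
    "learning_rate": 1e-6,  # constant schedule
    "probe_count": 4,       # K: same-family probes per candidate skill edit
    "alpha": 0.3,           # utility-reward weight
    "skill_bank_format": "SKILLRL JSON (general + per-category skills)",
}
```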

## Appendix D Baseline Descriptions

Our baselines span four categories. First, we include strong closed-source LLMs as zero-shot agents: GPT-4o and Gemini-2.5-Pro serve as upper bounds for general-purpose reasoning without task-specific training. Second, we compare against prompt-based and memory-augmented agent frameworks: ReAct and Reflexion use in-context reasoning chains to improve multi-step decision making, while Mem0, ExpeL, and MemP maintain external memory stores or experience pools to guide agent behavior without updating model parameters through reinforcement learning. Third, we evaluate standard RL-based methods, RLOO and GRPO, which optimize agent policies through group-relative advantage estimation over trajectory rollouts. Fourth, we consider memory-augmented RL approaches that embed persistent memory mechanisms directly into the RL optimization loop: EvolveR, MemRL, Mem0 combined with GRPO, and SimpleMem combined with GRPO. Finally, beyond these four categories, we include SKILLRL as a direct point of comparison, since it represents the state of the art in teacher-driven skill evolution and shares the same skill bank infrastructure.

## Appendix E Tool Schemas

The three skill-management tools are registered in the multi-turn rollout infrastructure as standard OpenAI-compatible function-calling schemas. Figures [7](https://arxiv.org/html/2605.08693#A5.F7), [8](https://arxiv.org/html/2605.08693#A5.F8), and [9](https://arxiv.org/html/2605.08693#A5.F9) show the exact JSON definitions passed to the LLM during the skill-management turn.

propose_skill
```
{
  "type": "function",
  "function": {
    "name": "propose_skill",
    "description": "Propose and add a new reusable skill to the skill bank
                    after a completed episode.",
    "parameters": {
      "type": "object",
      "properties": {
        "category":      {"type": "string",
                          "description": "Skill category."},
        "title":         {"type": "string",
                          "description": "Short skill title."},
        "principle":     {"type": "string",
                          "description": "Reusable principle."},
        "when_to_apply": {"type": "string",
                          "description": "Trigger condition."},
        "evidence":      {"type": "string",
                          "description": "Episode evidence."}
      },
      "required": ["category", "title", "principle",
                   "when_to_apply", "evidence"]
    }
  }
}
```

Figure 7: propose_skill tool schema.

update_skill
```
{
  "type": "function",
  "function": {
    "name": "update_skill",
    "description": "Update an existing skill in the skill bank after a
                    completed episode.",
    "parameters": {
      "type": "object",
      "properties": {
        "skill_id":      {"type": "string",
                          "description": "Existing skill_id or exact
                                          retrieved skill title."},
        "title":         {"type": "string",
                          "description": "Updated title."},
        "principle":     {"type": "string",
                          "description": "Updated principle."},
        "when_to_apply": {"type": "string",
                          "description": "Updated trigger condition."},
        "reason":        {"type": "string",
                          "description": "Why the skill should be revised."}
      },
      "required": ["skill_id", "title", "principle",
                   "when_to_apply", "reason"]
    }
  }
}
```

Figure 8: update_skill tool schema.

keep_skill
```
{
  "type": "function",
  "function": {
    "name": "keep_skill",
    "description": "Keep the skill bank unchanged after a completed episode.",
    "parameters": {
      "type": "object",
      "properties": {
        "reason": {"type": "string",
                   "description": "Why no new skill or update is needed."}
      },
      "required": ["reason"]
    }
  }
}
```

Figure 9: keep_skill tool schema.
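As an illustration of how these schemas enter the skill-management turn, the sketch below passes them to an OpenAI-compatible chat endpoint. The endpoint, client setup, and surrounding rollout loop are assumptions for illustration; only the schema contents and the Qwen2.5-7B-Instruct backbone come from the paper.

```python
from openai import OpenAI  # any OpenAI-compatible client

# keep_skill exactly as in Figure 9; propose_skill and update_skill
# (Figures 7 and 8) would be appended to the same tools list.
keep_skill = {
    "type": "function",
    "function": {
        "name": "keep_skill",
        "description": "Keep the skill bank unchanged after a completed episode.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {
                    "type": "string",
                    "description": "Why no new skill or update is needed.",
                }
            },
            "required": ["reason"],
        },
    },
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint
review_prompt = "..."  # the filled skill-review template from Appendix A

response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": review_prompt}],
    tools=[keep_skill],  # plus propose_skill and update_skill in practice
    tool_choice="auto",
)
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```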
## Appendix F Extended Discussion

##### Probe selection.

The current probe selection uses task families for ALFWorld and product categories for WebShop. In more open-ended environments without natural task groupings, an adaptive similarity metric, such as one based on skill-retrieval embeddings or learned task representations, would be required. This is a common challenge for any method that relies on measuring transfer between tasks, and we expect that advances in task representation learning will directly benefit this aspect of SkillMaster.

##### Probe evaluation cost.

Each skill mutation triggers $O(K)$ additional probe rollouts. While the overhead is modest in our settings, it would grow for environments with longer episodes or for skill banks that trigger frequent mutations. Strategies such as importance sampling over probes, amortized evaluation across multiple candidate mutations, or learned value functions that approximate probe scores could reduce this cost in future work.

##### Domain scope.

Our experiments are conducted on two established but relatively structured benchmarks. Whether SkillMaster transfers to more open-ended environments, such as software engineering or general web browsing, remains to be validated. The core mechanism is domain-agnostic: any environment that admits a notion of related tasks can in principle support counterfactual probe evaluation.

##### Future directions.

Beyond addressing the limitations above, several research directions are promising. First, the utility reward framework could support skill deletion, enabling the agent to identify and remove outdated or harmful skills rather than only adding and revising them. Second, multi-agent skill sharing, where agents trained on different task distributions exchange skills through a shared bank, could accelerate collective learning. Third, combining SkillMaster with continual learning setups where the task distribution shifts over time would test whether self-managed skill evolution enables faster adaptation than static baselines. Finally, extending tool-integrated reasoning (TIR) to other forms of agent self-improvement, such as prompt refinement or tool composition, could broaden the impact of this approach.

## Appendix G Environment Details

ALFWorld is an embodied household benchmark where the agent navigates and manipulates objects in simulated indoor scenes to complete goal-directed tasks. It contains six task families: Pick and Place requires retrieving and delivering objects to specified locations; Look at Object involves examining items under a desklamp; Clean requires washing objects at a sink; Heat requires warming objects using a microwave; Cool requires chilling objects in a fridge; and Pick Two requires collecting and delivering two objects. WebShop is an online shopping benchmark where the agent must search for products, navigate result pages, select appropriate options such as size and color, and complete purchases to match a given goal specification.

## Appendix H Extended Case Study Details

All cases are drawn from ALFWorld validation rollouts.

##### Case 1: Propose Skill.

The agent was tasked with cooling a tomato and placing it on a countertop. It spent over twenty steps exhaustively sweeping cabinets and drawers, which are low-probability zones for food targets, before finally checking the fridge and countertops where the tomato was most likely to be found. The episode failed. During the subsequent skill-review phase, the agent examined the retrieved skills (Systematic Exploration, Immediate Acquisition, Destination First Policy) and recognized a gap: none of them addressed the allocation of search effort across food versus non-food zones. It called propose_skill to add Do Not Search Invalid Zones, which instructs future agents to prioritize high-probability food locations over indiscriminate sweeping.

##### Case 2: Update Skill.

The agent was tasked with heating an apple and placing it in a sink basin. The episode succeeded, but the trajectory revealed an operational imprecision in the existing skill Open Then Heat, which instructed the agent to open the microwave, place the apple inside, and then heat it. In practice, heating failed every time the apple was placed inside first, and only succeeded when the agent held the apple in hand during the heat action. The agent called update_skill to revise this skill to Heat While Holding Target, correcting the operational guidance to keep holding the target object while executing the heat action.

##### Case 3: Skill Utility.

The probe task belongs to the same task family as the episode from Case 2 and was selected as one of the same-family probes during the utility reward evaluation triggered by that update. This probe task requires the agent to heat an apple and place it in a fridge. Before the revision, the skill Open Then Heat caused an 8-step confusion at the microwave: open the door, place the apple inside, attempt to heat with no effect, close and reopen the door repeatedly, take the apple back into hand, and finally heat successfully. After the revision to Heat While Holding Target, the same probe task completed the heating phase in just 2 steps: open the microwave and heat while holding the apple. Six unnecessary steps were eliminated purely by correcting the operational guidance.

## Appendix I Extended Related Work

### I.1 Skill Management in LLM Agents

Early work on LLM agent memory stores raw interaction histories and retrieves relevant experiences at inference time. Generative Agents [Park et al., [2023](https://arxiv.org/html/2605.08693#bib.bib6)] maintain memory streams to support long-term behavior, while Reflexion [Shinn et al., [2023](https://arxiv.org/html/2605.08693#bib.bib8)] uses verbal self-reflection to improve subsequent attempts. More recent memory-based agents such as Mem0 [Chhikara et al., [2025](https://arxiv.org/html/2605.08693#bib.bib19)], MemRL [Zhang et al., [2026c](https://arxiv.org/html/2605.08693#bib.bib18)], MemP [Fang et al., [2025](https://arxiv.org/html/2605.08693#bib.bib29)], EvolveR [Wu et al., [2025a](https://arxiv.org/html/2605.08693#bib.bib30)], and SimpleMem [Liu et al., [2026](https://arxiv.org/html/2605.08693#bib.bib31)] further explore episodic, procedural, or lifelong memory mechanisms. These methods demonstrate the value of reusing past experience, but raw or loosely organized memories can become redundant, noisy, and difficult to maintain as experience accumulates.

A more compact alternative is to abstract trajectories into reusable *skills*. Voyager [Wang et al., [2023](https://arxiv.org/html/2605.08693#bib.bib7)] maintains a skill library in Minecraft by accumulating reusable action programs, enabling open-ended exploration and reuse of past solutions. SkillRL [Xia et al., [2026](https://arxiv.org/html/2605.08693#bib.bib1)] introduces a hierarchical SkillBank distilled by a teacher LLM and uses retrieved skills to guide RL training. Related work on temporal abstraction and options in reinforcement learning also motivates the use of reusable procedural units for long-horizon decision making [Sutton et al., [1999](https://arxiv.org/html/2605.08693#bib.bib14)]. However, in these approaches, skill management is typically handled outside the agent's own learned policy: a teacher model, fixed rule, or predefined procedure decides what to store and when to update.

Recent concurrent work has begun to make skill management more adaptive. ARISE [Li et al., [2026b](https://arxiv.org/html/2605.08693#bib.bib23)] adopts a hierarchical Manager-Worker architecture with a shared policy, where the Manager maintains a tiered skill library and selects relevant skills, while the Worker generates task solutions conditioned on those skills. MemSkill [Zhang et al., [2026b](https://arxiv.org/html/2605.08693#bib.bib24)] introduces a Controller-Designer framework, where the Controller selects relevant memory skills and the Designer revises the skill set based on hard cases. COS-PLAY [Wu et al., [2025b](https://arxiv.org/html/2605.08693#bib.bib26)] co-evolves a decision policy with a skill bank through an agent-managed pipeline that extracts and refines skills from unlabeled rollouts. Wang et al. [[2025](https://arxiv.org/html/2605.08693#bib.bib25)] use RL to transform interaction experiences into reusable action-sequence skills, but rely on fixed extraction procedures rather than learned skill-management decisions. CoEvoSkills [Zhang et al., [2026a](https://arxiv.org/html/2605.08693#bib.bib27)] targets autonomous construction of multi-file agent skill packages through a Skill Generator and a co-evolving Surrogate Verifier. These methods move toward adaptive skill evolution, but they still realize skill management through dedicated managerial roles, controllers, verifiers, or skill-evolution pipelines, rather than exposing skill editing as explicit self-directed decisions of the acting policy.

SkillMaster differs from these approaches in two ways. First, it does not delegate skill evolution to a dedicated managerial role, controller, or external skill-evolution pipeline. Instead, it applies tool-integrated reasoning to skill management: the same policy that acts in the environment also decides, through structured tool calls, whether to keep, propose, or update skills. This turns skill management into explicit self-editing behavior within the acting policy itself. Second, the value of a skill edit is not judged by surface form, heuristic triggers, or task outcome rewards alone. Instead, each candidate edit is evaluated by its counterfactual downstream utility on related probe tasks, providing an explicit skill-quality signal for RL optimization.

### I.2 RL and Tool-Integrated Reasoning

RL has become a standard post-training paradigm for LLMs and LLM agents. PPO [Schulman et al., [2017](https://arxiv.org/html/2605.08693#bib.bib5)] is widely used for policy optimization, and recent variants such as RLOO [Ahmadian et al., [2024](https://arxiv.org/html/2605.08693#bib.bib11)], GRPO [Shao et al., [2024](https://arxiv.org/html/2605.08693#bib.bib4)], DAPO [Yu et al., [2025b](https://arxiv.org/html/2605.08693#bib.bib12)], and GiGPO [Feng et al., [2025](https://arxiv.org/html/2605.08693#bib.bib13)] further improve stability, scalability, or group-based credit assignment. These methods have been applied to interactive settings such as embodied environments, web interaction, software engineering, and search-based reasoning [Shridhar et al., [2021](https://arxiv.org/html/2605.08693#bib.bib2); Yao et al., [2022](https://arxiv.org/html/2605.08693#bib.bib3); Yang et al., [2024](https://arxiv.org/html/2605.08693#bib.bib21); Jin et al., [2025](https://arxiv.org/html/2605.08693#bib.bib20)]. Our work builds on this RL post-training paradigm but introduces a heterogeneous trajectory structure, where task-execution turns and skill-management turns carry different reward semantics.

Tool-use methods train or prompt LLMs to invoke external functions, APIs, or tools during reasoning. Toolformer [Schick et al., [2023](https://arxiv.org/html/2605.08693#bib.bib9)] shows that language models can learn to use tools through self-supervision, and Gorilla [Patil et al., [2024](https://arxiv.org/html/2605.08693#bib.bib10)] connects LLMs with large-scale API usage. Recent tool-integrated reasoning methods further combine tool calls with RL, including AutoTIR [Wei et al., [2025](https://arxiv.org/html/2605.08693#bib.bib33)], AdaTIR [Fang and Sun, [2026](https://arxiv.org/html/2605.08693#bib.bib34)], AutoTool [Zeng et al., [2026](https://arxiv.org/html/2605.08693#bib.bib35)], and ASTER [Zhang et al., [2026d](https://arxiv.org/html/2605.08693#bib.bib36)]. These works typically treat tools as external services that return information, perform computation, or execute actions. In contrast, the tool in SkillMaster edits the agent's own skill memory. This creates a distinct credit-assignment problem: the value of a write action can only be assessed by its downstream effect on future behavior. DualAdv-GRPO addresses this setting by decoupling advantage normalization for action and skill-management turns while still optimizing a single policy.

## Appendix J Fair Comparison Protocol

To ensure fair evaluation, all methods in our main results are compared under a consistent protocol. All RL-based methods, including SkillMaster, SKILLRL, GRPO, RLOO, and the memory-augmented RL baselines, share the same backbone model, Qwen2.5-7B-Instruct. For methods that use a skill bank, the same initial skill set is provided at the start of training. All methods use the same rollout budget per episode and are evaluated under identical test-time conditions. For SkillMaster and SKILLRL, the test-time skill retrieval mechanism uses identical top-K settings and retrieval mode. Closed-source LLM baselines and prompt-based methods are evaluated zero-shot under the same task instructions and maximum step limits. Results for SKILLRL, RLOO, GRPO, ReAct, and Reflexion are reproduced from prior work [Xia et al., [2026](https://arxiv.org/html/2605.08693#bib.bib1)] under the same evaluation protocol. Any differences in reported numbers therefore reflect the algorithmic contributions of each method rather than confounding factors such as model scale, training data, or inference budget.

## Appendix K Computational Overhead

Table [2](https://arxiv.org/html/2605.08693#A11.T2) reports the computational overhead introduced by utility-based skill management on ALFWorld. Without the utility reward, training takes on average 12.13 minutes per step, while enabling the utility reward increases the average step time to about 16.00 minutes. This corresponds to an additional 3.87 minutes per step, or roughly a 31.9% increase in wall-clock time.

Table 2: Computational overhead of utility-based skill management on ALFWorld.

| Metric | Without | With |
| --- | --- | --- |
| Avg. time / step | 12.13 min | 16.00 min |
| Avg. GPU memory | 33.51 GiB | 34.57 GiB |
| Peak GPU memory | 61.29 GiB | 70.49 GiB |

We also compare GPU memory usage over the first four training steps. With the utility reward enabled, average GPU memory allocation increases from 33.51 GiB to 34.57 GiB, while peak allocation rises from 61.29 GiB to 70.49 GiB. The main computational cost of SkillMaster therefore comes from the additional probe-based evaluation required by the utility reward: it adds a noticeable runtime overhead and a comparatively modest increase in average memory usage, in exchange for a direct learning signal on whether a candidate skill edit improves downstream performance.

## Appendix L Analysis of Skill-Management Necessity

Table [3](https://arxiv.org/html/2605.08693#A12.T3) analyzes whether the gains of SkillMaster require learned skill-management decisions, or whether they can be explained merely by appending a post-episode review turn without enabling skill management. Adding a review turn alone yields only marginal improvements over GRPO on both benchmarks, suggesting that post-episode reflection by itself is insufficient to explain the large performance gains.

Table 3: Analysis of skill-management necessity. ALFWorld reports overall success rate (All, %); WebShop reports success rate (Succ., %).

| Method | ALFWorld | WebShop |
| --- | --- | --- |
| GRPO | 77.6 | 66.1 |
| GRPO + Review-Only | 78.4 | 66.9 |
| SkillMaster | 98.7 | 82.0 |

Full SkillMaster remains substantially stronger, indicating that the improvement is not explained by review alone but requires effective skill-management decisions trained with utility-based feedback.

## Appendix M Robustness to Weaker Initial Skill Banks

Table [4](https://arxiv.org/html/2605.08693#A13.T4) evaluates whether the gains of SkillMaster depend primarily on inheriting a strong initial skill bank from SkillRL. We vary the initial skill coverage by retaining only a subset of the original skills, or removing the initial bank entirely, while keeping the same initialization for both methods. SkillMaster consistently outperforms SkillRL across all settings, and the performance gap remains substantial even when the initial skill bank is heavily weakened or completely removed.

Table 4: Robustness to weaker initial skill banks. Results report ALFWorld overall success rate (All, %).

| Initial bank | SkillRL | SkillMaster |
| --- | --- | --- |
| Full (100%) | 89.9 | 98.7 |
| Reduced (50%) | 84.6 | 94.1 |
| Sparse (25%) | 78.8 | 92.5 |
| None (0%) | 74.8 | 89.4 |

Performance degrades gracefully as the initial skill bank becomes weaker, but SkillMaster remains clearly stronger than the teacher-driven baseline under the same initialization, including the 0% setting. These results suggest that the gains of SkillMaster cannot be explained solely by inheriting a strong initial skill bank; the method continues to benefit from its learned skill management and policy adaptation even when the initial bank is substantially weakened or removed.

## NeurIPS Paper Checklist

1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
   - Answer: [Yes]
   - Justification: The abstract and introduction state the paper's contributions and empirical scope: a learned skill-management policy, a counterfactual utility reward, DualAdv-GRPO, and evaluations on ALFWorld and WebShop.
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: [Yes]
   - Justification: The Discussion section covers limitations including probe evaluation cost, domain scope, probe selection heuristics, and future extensions.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
   - Answer: [N/A]
   - Justification: The paper does not include theoretical results or formal proofs.
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
   - Answer: [Yes]
   - Justification: The paper describes the environments, backbone model, SFT setup, optimization hyperparameters, probe selection, tool schemas, and training protocol in the experimental setup and implementation details.
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   - Answer: [Yes]
   - Justification: The submission includes anonymized code and reproduction instructions for the main experiments. ALFWorld and WebShop are publicly available benchmarks.
6. Experimental setting/details
   - Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?
   - Answer: [Yes]
   - Justification: The paper specifies the training and evaluation setup, optimizer, group size, KL coefficient, learning rate, probe configuration, skill-bank format, and test-time protocol in the main setup description and implementation details.
7. Experiment statistical significance
   - Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
   - Answer: [No]
   - Justification: The current draft reports aggregate performance but does not include error bars, confidence intervals, or formal significance tests for the main results.
8. Experiments compute resources
   - Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
   - Answer: [Yes]
   - Justification: Hyperparameters, model configuration, the SFT procedure, and the training infrastructure are detailed in the paper.
9. Code of ethics
   - Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
   - Answer: [Yes]
   - Justification: The work uses simulated benchmark environments and does not involve human subjects, personally identifiable data, or deceptive data collection. The submission is anonymized and cites the external assets used.
10. Broader impacts
    - Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
    - Answer: [N/A]
    - Justification: The current draft does not include a dedicated discussion of broader societal impacts, although the method could inform future work on more capable autonomous agents.
11. Safeguards
    - Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?
    - Answer: [N/A]
    - Justification: The submission does not release a new foundation model, scraped dataset, or other asset that poses an obvious high-risk misuse concern.
12. Licenses for existing assets
    - Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
    - Answer: [Yes]
    - Justification: ALFWorld and WebShop are publicly available research benchmarks. The SFT dataset is constructed using a publicly available LLM.
13. New assets
    - Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
    - Answer: [N/A]
    - Justification: The current submission does not provide a new public dataset, model checkpoint, or code release alongside the paper.
14. Crowdsourcing and research with human subjects
    - Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
    - Answer: [N/A]
    - Justification: This work studies LLM agents in simulated household and shopping environments and does not involve human subjects or personally identifiable data.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
    - Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
    - Answer: [N/A]
    - Justification: The work evaluates agents in controlled benchmark environments and does not involve real-world deployment or safety-critical applications.
16. Declaration of LLM usage
    - Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required.
    - Answer: [Yes]
    - Justification: LLMs are a core component of the method and training pipeline. The paper describes the backbone model, the teacher model used for cold-start SFT data generation, and the role of tool-calling in the skill-management phase.
