SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs


Summary

SkillGraph is a framework that represents reusable skills as nodes in a directed graph to enable large language model agents to handle compositional tasks more effectively through structured skill retrieval and continuous evolution.

arXiv:2605.12039v1 Announce Type: new Abstract: Skill libraries enable large language model agents to reuse experience from past interactions, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to two key challenges for compositional tasks. Firstly, an agent must identify not only relevant skills but also how they depend on and build upon each other. Secondly, it also makes library maintenance difficult, since the system lacks structural cues for deciding when skills should be merged, split, or removed. We propose SKILLGRAPH, a framework that represents reusable skills as nodes in a directed graph, with typed edges encoding prerequisite, enhancement, and co-occurrence relations. Given a new task, SKILLGRAPH retrieves not just individual skills, but an ordered skill subgraph that can guide multi-step decision making. The graph is continuously updated from agent trajectories and reinforcement learning feedback, allowing both the skill library and the agent policy to improve together. Experiments on ALFWorld, WebShop, and seven search-augmented QA tasks show that SKILLGRAPH achieves state-of-the-art performance against memory-augmented RL methods, with especially large gains on complex tasks that require composing multiple skills.

Cached at: 05/13/26, 06:21 AM

# SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
Source: [https://arxiv.org/html/2605.12039](https://arxiv.org/html/2605.12039)
Xiaoyuan Li¹, Moxin Li³, Keqin Bao¹, Yubo Ma², Wenjie Wang¹, Dayiheng Liu², Fuli Feng¹

¹University of Science and Technology of China, ²Alibaba Group, ³National University of Singapore

###### Abstract

Skill libraries enable large language model agents to reuse experience from past trajectories, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to key challenges for compositional tasks, where an agent must identify not only relevant skills but also how they depend on and build upon each other. It also makes library maintenance difficult, since the system lacks structural cues for deciding when skills should be merged, split, or removed. We propose SKILLGRAPH, a framework that represents reusable skills as nodes in a directed graph, with typed edges encoding prerequisite, enhancement, and co-occurrence relations. Given a new task, SKILLGRAPH retrieves not just individual skills, but an ordered skill subgraph that can guide multi-step decision making. The graph is continuously updated from agent trajectories and reinforcement learning feedback, allowing both the skill library and the agent policy to improve together. Experiments on ALFWorld, WebShop, and seven search-augmented QA tasks show that SKILLGRAPH achieves state-of-the-art performance against memory-augmented RL methods, with especially large gains on complex tasks that require composing multiple skills.

## 1 Introduction

Large Language Model (LLM) agents have shown strong capabilities in complex interactive tasks, including web navigation (Yao et al., [2022a](https://arxiv.org/html/2605.12039#bib.bib5)), embodied household manipulation (Shridhar et al., [2021](https://arxiv.org/html/2605.12039#bib.bib4)), and tool-augmented question answering (Yao et al., [2022b](https://arxiv.org/html/2605.12039#bib.bib7)). Yet most agents treat each task as an isolated episode (Yao et al., [2022b](https://arxiv.org/html/2605.12039#bib.bib7); Shinn et al., [2023](https://arxiv.org/html/2605.12039#bib.bib8)), struggling to learn from past successes or failures even when structurally similar problems have been encountered (Xia et al., [2026](https://arxiv.org/html/2605.12039#bib.bib1)). Since many tasks share recurring subproblems and compositional action patterns, an agent that can *learn from experience*, extracting reusable knowledge from past interactions, would avoid redundant exploration, transfer strategies to similar tasks, and progressively build up the ability to solve more complex problems.

To reuse experience, a common approach is to maintain a *skill library*, which stores reusable units of knowledge for solving recurring subproblems (Wang et al., [2024](https://arxiv.org/html/2605.12039#bib.bib9); Zhao et al., [2024](https://arxiv.org/html/2605.12039#bib.bib11); Xia et al., [2026](https://arxiv.org/html/2605.12039#bib.bib1)). A skill can be either manually designed by humans (Xu and Yan, [2026](https://arxiv.org/html/2605.12039#bib.bib44)) or automatically acquired from agent experience, for instance by distilling successful trajectories into natural language (Zhao et al., [2024](https://arxiv.org/html/2605.12039#bib.bib11); Xia et al., [2026](https://arxiv.org/html/2605.12039#bib.bib1)) or executable programs (Wang et al., [2024](https://arxiv.org/html/2605.12039#bib.bib9)). Compared with manually crafted skills, automatically acquired skills are more scalable and can continuously expand as the agent encounters new tasks and environments. Therefore, we focus on automatically acquiring skills from interaction trajectories.

Despite their promise, existing skill libraries are often organized as flat collections, where each skill is stored as an independent entry and retrieved mainly by semantic similarity (Xia et al., [2026](https://arxiv.org/html/2605.12039#bib.bib1); Zhao et al., [2024](https://arxiv.org/html/2605.12039#bib.bib11); Liu et al., [2026](https://arxiv.org/html/2605.12039#bib.bib33)). This ignores the fact that skills are inherently related: some skills are prerequisites for others, some enhance others, and some frequently co-occur in successful trajectories. As a result, flat libraries suffer from two key limitations. First, retrieval is not compositional. Complex tasks often require an ordered sequence of skills; for example, a “heat and place” task in ALFWorld may require locating an object, picking it up, heating it with an appliance, and then placing it at the target destination. A flat Top-$K$ retriever can return relevant skills, but it does not indicate their dependencies or execution order. Second, skill updates are not structured. When skills are maintained independently, the library lacks explicit evidence for merging redundant skills, splitting overly broad skills, deprecating obsolete skills, or strengthening useful relations between skills (Xu and Yan, [2026](https://arxiv.org/html/2605.12039#bib.bib44)). These limitations suggest that the core problem is not only how to acquire skills, but also how to *organize, retrieve, and update* them. If inter-skill relations are explicitly represented, retrieval can produce dependency-aware skill sequences rather than unordered hints, and both individual skills and their relations can be updated in a principled way.

Motivated by this, we propose SkillGraph, a framework that organizes skills into a structured graph and co-evolves it with the agent’s policy through reinforcement learning (RL). In SkillGraph, nodes represent skills distilled from trajectories, while typed edges capture relations such as prerequisite, enhancement, and co-occurrence. SkillGraph consists of three stages. First, graph construction builds an initial skill graph from interaction trajectories, making inter-skill relations explicit. Second, graph-aware retrieval starts from task-relevant seed skills, expands along graph edges, and orders retrieved skills according to their dependencies, producing a coherent skill sequence for decision-making. Third, graph evolution updates the graph during training by refining skill nodes and adjusting edge relations according to skill usage and success rate. Together, these stages form a closed loop: the skill graph provides structured guidance for policy learning, while the improving policy generates new trajectories that further refine the graph.

![Refer to caption](https://arxiv.org/html/2605.12039v1/x1.png)

Figure 1: Overview of SkillGraph. The skill graph and the agent’s policy *co-evolve* through a closed loop: (1) graph construction distills skills and their typed relations (prerequisite, enhancement, co-occurrence) from trajectories; (2) graph-aware retrieval traverses these relations to produce dependency-ordered skill sequences that guide the policy; (3) graph evolution uses training feedback to refine skill nodes, adjust edge weights, and restructure the graph, which in turn improves future retrieval and policy learning.

Empirically, we evaluate SkillGraph on ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.12039#bib.bib4)), WebShop (Yao et al., [2022a](https://arxiv.org/html/2605.12039#bib.bib5)), and seven search-augmented question answering tasks (Kwiatkowski et al., [2019](https://arxiv.org/html/2605.12039#bib.bib6)), covering embodied manipulation, web navigation, and information retrieval. Experimental results show that SkillGraph achieves state-of-the-art performance across benchmarks, with especially strong gains on complex multi-step tasks requiring skill composition. Further analysis shows that the graph structure improves skill reuse, reduces redundancy compared with flat libraries, and enables transfer of compositional knowledge from simpler tasks to more complex ones.

Our main contributions are summarized as follows:

- We propose a graph-structured formulation of the skill library for LLM agents, where skills are connected by explicit prerequisite, enhancement, and co-occurrence relations.
- We introduce SkillGraph, a closed-loop framework that supports dependency-aware skill retrieval and structured skill updates during RL.
- We conduct experiments on ALFWorld, WebShop, and seven search-augmented QA tasks, demonstrating state-of-the-art performance and substantial gains on complex multi-step tasks.

## 2 Related Work

##### Memory mechanisms in agents.

External memory helps LLM agents reuse experience beyond the context window. Early methods store raw trajectories as examples (Zhao et al., [2024](https://arxiv.org/html/2605.12039#bib.bib11); Chhikara et al., [2025](https://arxiv.org/html/2605.12039#bib.bib10)), while later work compresses experience into summaries or knowledge entries (Fang et al., [2025](https://arxiv.org/html/2605.12039#bib.bib12); Liu et al., [2026](https://arxiv.org/html/2605.12039#bib.bib33); Ouyang et al., [2025](https://arxiv.org/html/2605.12039#bib.bib16); Tang et al., [2025](https://arxiv.org/html/2605.12039#bib.bib38)). Recent studies further apply RL directly to agent knowledge structures: MemRL (Zhang et al., [2026](https://arxiv.org/html/2605.12039#bib.bib13)) performs runtime RL on episodic memory, MemEvolve (Zhang et al., [2025](https://arxiv.org/html/2605.12039#bib.bib35)) meta-evolves memory systems, Mem-$\alpha$ (Wang et al., [2025](https://arxiv.org/html/2605.12039#bib.bib39)) learns memory construction policies, and EvolveR (Wu et al., [2025](https://arxiv.org/html/2605.12039#bib.bib14)) co-adapts the policy and memory bank. In contrast, SkillGraph represents experience as explicit skill abstractions with typed dependencies and evolves this structure jointly with the policy.

##### Graph structures for LLMs.

Graph structures have been widely adopted in LLM systems: Graph-of-Thought (Besta et al., [2024](https://arxiv.org/html/2605.12039#bib.bib41)) models reasoning steps as a directed graph to enable non-linear thought exploration, GraphRAG (Edge et al., [2024](https://arxiv.org/html/2605.12039#bib.bib43)) builds entity-relation graphs over corpora for structured retrieval, and Nonkes et al. ([2024](https://arxiv.org/html/2605.12039#bib.bib42)) encode task decompositions as planning graphs for agent execution. SkillGraph applies graph structures to agent skill management, jointly evolving the graph topology and the policy through RL, so that the skill graph adapts continuously rather than remaining static after construction.

##### Agent skill evolution.

Agentic skills compactly encode reusable strategies for subtasks. Voyager (Wang et al., [2024](https://arxiv.org/html/2605.12039#bib.bib9)) accumulates executable code skills, and ExpeL (Zhao et al., [2024](https://arxiv.org/html/2605.12039#bib.bib11)) distills transferable strategic experience from trajectories. Most closely related, SkillRL (Xia et al., [2026](https://arxiv.org/html/2605.12039#bib.bib1)) co-evolves a hierarchical skill bank with the agent’s policy through recursive RL. SkillGraph builds on this line by elevating the flat skill bank into a structured dependency graph, enabling typed relational modeling and topology evolution throughout training.

## 3 SkillGraph

We present SkillGraph, a framework that organizes agent skills as a directed dependency graph and co-evolves the graph with the agent’s policy through RL. The key insight is that explicitly modeling inter-skill relations enables two mutually reinforcing capabilities: *structured retrieval* that produces dependency-aware skill sequences for compositional planning, and *principled evolution* that uses training feedback to refine both individual skills and their relations. As illustrated in Figure [1](https://arxiv.org/html/2605.12039#S1.F1), the framework consists of three stages, namely graph construction (Section [3.1](https://arxiv.org/html/2605.12039#S3.SS1)), graph-aware retrieval (Section [3.2](https://arxiv.org/html/2605.12039#S3.SS2)), and graph evolution (Section [3.3](https://arxiv.org/html/2605.12039#S3.SS3)), integrated into a closed-loop training procedure (Section [3.4](https://arxiv.org/html/2605.12039#S3.SS4)).

### 3.1 Graph Construction

The first step is to build a skill graph that makes inter-skill relations explicit, providing the structural foundation for both retrieval and evolution.

##### Skill distillation.

We collect trajectories by rolling out the base policy $\pi_{\text{base}}$ in the environment. A teacher language model $\mathcal{M}$ then distills successful trajectories $\tau^{+}$ and failed trajectories $\tau^{-}$ into two types of skills: *general skills*, which capture domain-independent reasoning strategies applicable across tasks (e.g., “verify each sub-goal before proceeding”), and *task-specific skills*, which encode strategies tied to particular task types (e.g., “check the microwave for heated objects”). Each skill is represented as a compact record containing a title, a core principle describing the strategy, an applicability condition, and a category label indicating its type.
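The four-field skill record described above can be pictured as a small data class. This is a minimal sketch, not the authors' schema: the class name `Skill`, the field names, the `"heat"` category label, and the example values other than the two quoted principles are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """One distilled skill record (field names are illustrative assumptions)."""
    title: str
    principle: str       # core principle describing the strategy
    applicability: str   # condition under which the skill applies
    category: str        # "general" or a task-type label such as "heat"


# A general skill and a task-specific skill, using the examples quoted above.
general = Skill(
    title="Verify sub-goals",
    principle="Verify each sub-goal before proceeding.",
    applicability="Any multi-step task",
    category="general",
)
specific = Skill(
    title="Check the microwave",
    principle="Check the microwave for heated objects.",
    applicability="Heat tasks in a household environment",
    category="heat",
)
```

Keeping the category on the record is what later lets seed selection filter by task type without consulting the graph.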

##### Graph structure.

The distilled skills form the node set $\mathcal{V}$ of a directed graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{E}$ denotes the edge set. To capture how skills relate to one another, we define three typed edges:

- Prerequisite ($A\xrightarrow{\texttt{prereq}}B$): skill $A$ must be applied before skill $B$.
- Enhances ($A\xrightarrow{\texttt{enhance}}B$): general skill $A$ improves the effectiveness of task-specific skill $B$.
- Co-occurs ($A\xleftrightarrow{\texttt{co\_occur}}B$): skills $A$ and $B$ frequently appear together in successful episodes.

Each edge $e\in\mathcal{E}$ carries a weight $w(e)\in[0,1]$ reflecting the strength of the relation, which is dynamically adjusted during training. Each node $v\in\mathcal{V}$ maintains running statistics (usage count $n_{\text{use}}(v)$, success count $n_{\text{succ}}(v)$, and empirical success rate $\hat{p}(v)=n_{\text{succ}}(v)/n_{\text{use}}(v)$) that drive both evolution decisions and progressive unlocking in Section [3.3](https://arxiv.org/html/2605.12039#S3.SS3). Based on the directed prerequisite and enhancement edges, each node is assigned a topological level $\ell(v)$ indicating its position in the dependency hierarchy: level-0 skills have no prerequisites, while higher-level skills depend on lower-level ones. Details of edge initialization and level computation are provided in Appendix [A](https://arxiv.org/html/2605.12039#A1).
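To make the structure concrete, the sketch below stores typed, weighted edges and per-node statistics, and computes topological levels. The level rule used here (0 for nodes with no incoming prereq/enhance edge, otherwise one more than the highest predecessor level) is an assumption, since the exact computation is deferred to the appendix. The class names, the example skills, and the weights 0.9, 0.8, and 0.7 are likewise illustrative, and the graph is assumed to be acyclic.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStats:
    n_use: int = 0    # times the skill was used
    n_succ: int = 0   # times those episodes succeeded

    @property
    def success_rate(self) -> float:
        # Defined as 0.0 before the first use (an assumption of this sketch).
        return self.n_succ / self.n_use if self.n_use else 0.0


@dataclass
class TypedSkillGraph:
    stats: dict = field(default_factory=dict)   # node -> NodeStats
    edges: dict = field(default_factory=dict)   # (src, dst, kind) -> weight

    def add_edge(self, src: str, dst: str, kind: str, weight: float) -> None:
        for v in (src, dst):
            self.stats.setdefault(v, NodeStats())
        self.edges[(src, dst, kind)] = min(max(weight, 0.0), 1.0)  # keep in [0, 1]

    def levels(self) -> dict:
        # Only directed prereq/enhance edges define the hierarchy.
        preds = {v: [] for v in self.stats}
        for src, dst, kind in self.edges:
            if kind in ("prereq", "enhance"):
                preds[dst].append(src)
        memo = {}

        def level(v):
            if v not in memo:
                memo[v] = 1 + max((level(u) for u in preds[v]), default=-1)
            return memo[v]

        return {v: level(v) for v in self.stats}


g = TypedSkillGraph()
g.add_edge("locate", "pick_up", "prereq", 0.9)
g.add_edge("pick_up", "heat", "prereq", 0.8)
g.add_edge("heat", "place", "prereq", 0.7)
g.add_edge("verify", "heat", "enhance", 0.2)
g.add_edge("pick_up", "place", "co_occur", 0.3)
skill_levels = g.levels()
```

In this toy graph the co-occurrence edge does not affect levels, so `place` sits one level above `heat` purely because of the prerequisite chain.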

### 3.2 Graph-Aware Retrieval

Flat skill retrieval returns a set of individually relevant skills but ignores their dependencies, making it inadequate for tasks that require ordered skill composition. To address this, we design a graph-aware retrieval procedure that traverses the skill graph to produce a dependency-respecting sequence of skills. Given a task description $d$ with task type $t(d)$, retrieval proceeds in three steps.

##### Seed selection.

We first identify task-relevant entry points from the currently active skill set $\mathcal{V}_{\text{active}}\subseteq\mathcal{V}$, which contains skills that have been progressively unlocked (see Section [3.3.3](https://arxiv.org/html/2605.12039#S3.SS3.SSS3)). From $\mathcal{V}_{\text{active}}$, we select all general skills and task-type-matched skills as seed nodes, where $\mathcal{R}$ denotes a retrieved subset of skill nodes:

$$\mathcal{R}_{\text{seed}}=\left\{v\in\mathcal{V}_{\text{active}}:\mathrm{category}(v)=\texttt{general}\;\vee\;\mathrm{category}(v)=t(d)\right\}. \tag{1}$$
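Equation (1) is a plain filter over the active set. A minimal sketch, assuming categories are held in a dictionary keyed by skill name (the function name and the example skills are hypothetical):

```python
def select_seeds(category: dict, active: set, task_type: str) -> set:
    """Equation (1): active skills that are general or match the task type."""
    return {v for v in active if category[v] in ("general", task_type)}


category = {
    "verify": "general",
    "heat_item": "heat",
    "cool_item": "cool",
    "heat_twice": "heat",
}
active = {"verify", "heat_item", "cool_item"}   # heat_twice is still locked
seeds = select_seeds(category, active, task_type="heat")
```

Here `heat_twice` matches the task type but is excluded because it is not yet in the active set.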

##### Graph expansion.

Starting from the seed set $\mathcal{R}_{\text{seed}}$, we expand in two complementary directions to recover the full dependency context:

- *Backward expansion* traverses incoming prerequisite edges via breadth-first search (BFS) up to a maximum depth $D$, producing the backward-expanded set $\mathcal{R}_{\text{BFS}}$ that recovers foundational skills the seeds depend on but that may belong to other task categories.
- *Forward expansion* explores outgoing edges via beam search with beam width $B$, producing the forward-expanded set $\mathcal{R}_{\text{beam}}$. Each candidate node $v$ receives an expansion score $\sigma(v)$ propagated from its predecessors: $\sigma(v)=\max_{u\in\text{parents}(v)}\sigma(u)\cdot w(u,v)$, where seed nodes are initialized with $\sigma=1$. This prioritizes skills connected by well-validated relations.
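The two expansion directions can be sketched as follows. This is one possible reading rather than the authors' implementation: the number of beam steps (`steps`) is a hypothetical parameter that the text does not specify, the score maximum is taken over parents in the current frontier only, and the skill names and weights are made up.

```python
from collections import deque


def backward_bfs(seeds, prereq_edges, max_depth):
    """Collect prerequisites reachable within max_depth incoming hops."""
    preds = {}
    for u, v in prereq_edges:
        preds.setdefault(v, []).append(u)
    found, queue = set(), deque((s, 0) for s in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for u in preds.get(node, []):
            if u not in found and u not in seeds:
                found.add(u)
                queue.append((u, depth + 1))
    return found


def forward_beam(seeds, edges, beam_width, steps=2):
    """Keep the beam_width best-scoring successors at each step."""
    succs = {}
    for (u, v), w in edges.items():
        succs.setdefault(u, []).append((v, w))
    sigma = {s: 1.0 for s in seeds}          # seeds start with score 1
    frontier, selected = set(seeds), set()
    for _ in range(steps):
        cand = {}
        for u in frontier:
            for v, w in succs.get(u, []):
                if v not in sigma:           # skip seeds and earlier picks
                    cand[v] = max(cand.get(v, 0.0), sigma[u] * w)
        if not cand:
            break
        best = sorted(cand, key=lambda v: (-cand[v], v))[:beam_width]
        for v in best:
            sigma[v] = cand[v]
        selected.update(best)
        frontier = set(best)
    return selected, sigma


prereq = {("locate", "pick_up"): 0.9, ("pick_up", "heat"): 0.8, ("heat", "place"): 0.7}
all_edges = {**prereq, ("heat", "inspect"): 0.1}
r_bfs = backward_bfs({"heat"}, prereq, max_depth=2)
r_beam, scores = forward_beam({"heat"}, all_edges, beam_width=1, steps=1)
```

With a beam width of 1, the weakly supported `inspect` successor is dropped in favour of `place`, which is the prioritization effect the score is meant to produce.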

##### Topological ordering.

The union of seeds, backward-expanded, and forward-expanded skills is topologically sorted according to the graph’s dependency edges, producing an ordered skill sequence:

$$\mathcal{R}_{\text{ret}}=\text{TopoSort}_{\mathcal{G}}\!\left(\mathcal{R}_{\text{seed}}\cup\mathcal{R}_{\text{BFS}}\cup\mathcal{R}_{\text{beam}}\right). \tag{2}$$

This sequence, capped at $K_{\max}$ skills, is prepended to the task prompt as structured guidance for the policy. Because the ordering reflects dependency relations, the agent receives skills in a natural simple-to-complex order that mirrors how sub-tasks should be composed.
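Equation (2) can be sketched with Python's standard `graphlib` module. How the cap is applied is not stated, so simple truncation of the sorted sequence is an assumption here, as are the function name and the example skills. `static_order()` raises `CycleError` if the dependency edges contain a cycle.

```python
from graphlib import TopologicalSorter


def ordered_retrieval(retrieved, dependency_edges, k_max=8):
    """Topologically sort the retrieved skills, then keep the first k_max."""
    preds = {v: set() for v in retrieved}
    for u, v in dependency_edges:
        if u in preds and v in preds:        # ignore edges leaving the subset
            preds[v].add(u)
    order = list(TopologicalSorter(preds).static_order())
    return order[:k_max]


seed_set, bfs_set, beam_set = {"heat"}, {"locate", "pick_up"}, {"place"}
deps = [("locate", "pick_up"), ("pick_up", "heat"), ("heat", "place")]
sequence = ordered_retrieval(seed_set | bfs_set | beam_set, deps)
```

For this chain the order is unique, so prerequisites always precede the skills that depend on them in the prompt.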

### 3.3 Graph Evolution

A static skill graph cannot keep pace with a continuously improving policy: new failure modes demand new skills, redundant skills accumulate, and the relative importance of inter-skill relations shifts over training. To address this, we evolve both the skill nodes and their edges at each validation step, driven by trajectory-level feedback.

#### 3.3.1 Node-Level: Adaptive Granularity Control

We maintain appropriate skill granularity through four operations, each triggered by specific diagnostic signals from the training process.

##### Insert.

When the agent fails on tasks that existing skills do not adequately cover, we generate targeted new skills. The teacher model $\mathcal{M}$ analyzes a batch of failed trajectories $\tau^{-}$ together with the current skill set $\mathcal{R}_{\text{existing}}$, and proposes up to $m$ new skills addressing the identified failure causes:

$$\{s_{\text{new}}^{1},\ldots,s_{\text{new}}^{m}\}=\mathcal{M}(\text{insert},\,\tau^{-},\,\mathcal{R}_{\text{existing}}). \tag{3}$$

##### Merge.

Redundant skills inflate context length and dilute retrieval precision. We identify candidates for merging by measuring the overlap of their graph neighborhoods: let $\mathcal{N}(v)$ denote the set of neighbors of node $v$ in $\mathcal{G}$; when two skills $s_{i}$ and $s_{j}$ share most of their neighbors (Jaccard similarity $J(\mathcal{N}(s_{i}),\mathcal{N}(s_{j}))\geq\tau_{\text{merge}}$, where $\tau_{\text{merge}}$ is the merge threshold), they likely encode redundant strategies and are synthesized into a single unified skill by $\mathcal{M}$.
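The merge test reduces to a Jaccard comparison of neighbor sets. The sketch below assumes neighborhoods are given as plain sets and uses hypothetical skill names. Whether neighbors are collected across all edge types and directions is not specified in the text, so that choice is left to the caller.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """|a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_candidates(neighbors: dict, tau_merge: float = 0.85) -> list:
    """Pairs of skills whose neighborhoods overlap at least tau_merge."""
    return [
        (si, sj)
        for si, sj in combinations(sorted(neighbors), 2)
        if jaccard(neighbors[si], neighbors[sj]) >= tau_merge
    ]


neighbors = {
    "heat_in_microwave": {"locate", "pick_up", "place", "verify", "open", "close", "toggle"},
    "use_microwave": {"locate", "pick_up", "place", "verify", "open", "close"},
    "cool_in_fridge": {"locate", "pick_up", "place"},
}
candidates = merge_candidates(neighbors)
```

The two microwave skills share 6 of 7 distinct neighbors (about 0.857), which clears the 0.85 threshold, while the fridge skill overlaps with them far less.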

##### Split.

Overly broad skills that conflate distinct sub-strategies exhibit moderate success rates despite high usage ($\hat{p}(v)\in[0.15,0.4]$ and $n_{\text{use}}(v)\geq 10$). We decompose such skills into more focused sub-skills via $\mathcal{M}$, reconnecting them with prerequisite edges.

##### Deprecate.

Skills that are frequently retrieved but consistently fail ($n_{\text{use}}(v)\geq 20$ and $\hat{p}(v)<0.15$) are deprecated and excluded from future retrieval, preventing them from degrading policy performance.
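The split and deprecate triggers are simple threshold checks on the node statistics. A minimal sketch using the thresholds stated above; the function name and the `"keep"` label for skills that trigger neither operation are assumptions.

```python
def triage(n_use: int, n_succ: int) -> str:
    """Classify a skill as 'deprecate', 'split', or 'keep' from its statistics."""
    if n_use == 0:
        return "keep"                      # no evidence yet
    p_hat = n_succ / n_use
    if n_use >= 20 and p_hat < 0.15:
        return "deprecate"                 # frequently used, consistently failing
    if n_use >= 10 and 0.15 <= p_hat <= 0.4:
        return "split"                     # heavily used, only moderately successful
    return "keep"


decisions = {
    "often_fails": triage(n_use=25, n_succ=2),     # p_hat = 0.08
    "too_broad": triage(n_use=12, n_succ=3),       # p_hat = 0.25
    "too_few_uses": triage(n_use=12, n_succ=1),    # p_hat ≈ 0.083, but n_use < 20
    "healthy": triage(n_use=30, n_succ=20),        # p_hat ≈ 0.67
}
```

The two conditions cannot both hold because their success-rate ranges do not overlap. A skill with a low success rate but fewer than 20 uses is left alone until more evidence accumulates.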

#### 3.3.2 Edge-Level: Topology Evolution

While node-level operations adjust *what* skills are available, edge-level operations adjust *how* skills relate to one another, directly shaping retrieval quality.

##### Path reinforcement.

Successful trajectories provide evidence that the retrieved skill sequence was effective. We reinforce this signal by increasing the weight of every edge along the successful path:

$$w(e)\leftarrow\min\bigl(w(e)+\alpha,\;1.0\bigr),\quad\forall e\in\text{path}(\tau^{+}), \tag{4}$$

where $\alpha\in(0,1)$ is the reinforcement step size and $\text{path}(\tau^{+})$ denotes the set of edges traversed by the skill sequence used in successful trajectory $\tau^{+}$. This makes validated dependency paths more likely to be traversed in future retrieval.

##### Co-occurrence discovery.

New inter-skill relations emerge as the policy improves. When two skills co-occur in a successful episode but are not yet connected in $\mathcal{G}$, we add a `co_occur` edge to capture this discovered association.

##### Decay and pruning.

To prevent stale relations from persisting indefinitely, all edge weights undergo multiplicative decay with decay factor $\gamma\in(0,1)$: $w(e)\leftarrow\gamma\cdot w(e)$ at each checkpoint. Edges whose weights fall below a pruning threshold $w_{\min}$ are removed from $\mathcal{E}$. After all updates, node levels $\ell(v)$ are recomputed to reflect the new topology.
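The three edge-level updates can be combined into one checkpoint routine. The order of operations (reinforce, discover, decay, prune) and the weight given to newly discovered edges (`w_new`, set to the 0.3 initialization value for co-occur edges) are assumptions of this sketch; the function name and example edges are hypothetical.

```python
def evolve_edges(weights, successful_paths, co_occurring_pairs,
                 alpha=0.05, gamma=0.99, w_min=0.05, w_new=0.3):
    """Return updated weights; edge keys are (u, v, kind) tuples."""
    w = dict(weights)
    # 1) Path reinforcement, Equation (4).
    for path in successful_paths:
        for edge in path:
            if edge in w:
                w[edge] = min(w[edge] + alpha, 1.0)
    # 2) Co-occurrence discovery for pairs with no existing edge.
    linked = {frozenset((u, v)) for u, v, _ in w}
    for a, b in co_occurring_pairs:
        if frozenset((a, b)) not in linked:
            w[(a, b, "co_occur")] = w_new
            linked.add(frozenset((a, b)))
    # 3) Multiplicative decay, then 4) pruning of edges below w_min.
    return {e: gamma * x for e, x in w.items() if gamma * x >= w_min}


weights = {
    ("pick_up", "heat", "prereq"): 0.3,
    ("heat", "inspect", "enhance"): 0.05,
}
updated = evolve_edges(
    weights,
    successful_paths=[[("pick_up", "heat", "prereq")]],
    co_occurring_pairs=[("heat", "place")],
)
```

In this example the reinforced edge ends near 0.3465, the unreinforced edge decays to 0.0495 and is pruned, and a new `co_occur` edge appears between `heat` and `place`.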

#### 3.3.3 Progressive Unlocking

Exposing the agent to complex, high-level skills before it has mastered their prerequisites can hinder learning. To implement a curriculum over skill complexity, SkillGraph progressively unlocks skills based on their topological level. Initially, only level-0 foundational skills are active. Let $L$ denote the current highest active level. When the average success rate of level-$L$ skills exceeds an unlocking threshold $\theta_{\text{unlock}}$, level-$(L+1)$ skills are activated:

$$\bar{p}(L)=\frac{1}{|\{v:\ell(v)=L\}|}\sum_{v:\,\ell(v)=L}\hat{p}(v)\;\geq\;\theta_{\text{unlock}}\;\;\Longrightarrow\;\;\mathcal{V}_{\text{active}}\leftarrow\mathcal{V}_{\text{active}}\cup\{v:\ell(v)=L+1\}. \tag{5}$$

This ensures that the agent builds competence from the ground up, with advanced compositional skills becoming available only when their foundations are reliable.
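The unlocking rule can be sketched as a single check per validation step. The comparison follows the $\geq$ in Equation (5). The function name, the dictionary-based inputs, and the example values are assumptions.

```python
def maybe_unlock(levels, success_rate, active, current_level, theta_unlock=0.6):
    """Apply Equation (5) once; return the (possibly enlarged) active set and level."""
    at_level = [v for v, lvl in levels.items() if lvl == current_level]
    if not at_level:
        return active, current_level
    mean_rate = sum(success_rate[v] for v in at_level) / len(at_level)
    if mean_rate >= theta_unlock:
        unlocked = {v for v, lvl in levels.items() if lvl == current_level + 1}
        return active | unlocked, current_level + 1
    return active, current_level


levels = {"locate": 0, "verify": 0, "pick_up": 1, "heat": 2}
active = {"locate", "verify"}

# Mean level-0 success rate is 0.45: nothing is unlocked.
early = maybe_unlock(levels, {"locate": 0.5, "verify": 0.4}, active, 0)
# Mean level-0 success rate is 0.7: level-1 skills become active.
later = maybe_unlock(levels, {"locate": 0.8, "verify": 0.6}, active, 0)
```

Only one level is unlocked per call, so level-2 skills such as `heat` stay locked until the level-1 skills have themselves met the threshold.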

### 3.4 Policy Optimization and Closed-Loop Training

We optimize the skill-augmented policy $\pi_{\theta}$, parameterized by $\theta$, using GRPO (Shao et al., [2024](https://arxiv.org/html/2605.12039#bib.bib2)). For each task, we sample a group of $G$ rollouts from $\pi_{\theta}$ conditioned on the task description $d$ and the retrieved skill sequence $\mathcal{R}_{\text{ret}}$. Each rollout $i$ receives a binary reward $R_{i}\in\{0,1\}$ indicating task success, and the estimated advantage $\hat{A}_{i}$ is computed by within-group normalization:

$$\hat{A}_{i}=\frac{R_{i}-\text{mean}(\{R_{j}\}_{j=1}^{G})}{\text{std}(\{R_{j}\}_{j=1}^{G})+\epsilon}, \tag{6}$$

where $\epsilon$ is a small constant for numerical stability. The policy is updated via the clipped surrogate objective with a KL penalty anchored to the reference policy $\pi_{\text{ref}}$ (initialized from the SFT model):

$$\mathcal{L}(\theta)=\mathbb{E}\!\left[\min\!\left(r(\theta)\,\hat{A}_{i},\;\text{clip}(r(\theta),1-\epsilon_{c},1+\epsilon_{c})\,\hat{A}_{i}\right)-\beta\,D_{\text{KL}}\!\left(\pi_{\theta}\,\|\,\pi_{\text{ref}}\right)\right], \tag{7}$$

where $r(\theta)=\pi_{\theta}/\pi_{\text{old}}$ is the importance sampling ratio between the current and previous policies, $\epsilon_{c}$ is the clipping parameter, $\beta$ is the KL penalty coefficient, and $D_{\text{KL}}$ denotes the Kullback–Leibler divergence.
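Equations (6) and (7) can be checked numerically with scalar arithmetic. In this sketch the population standard deviation is an assumption (the text does not say which estimator is used), the KL term is passed in as a precomputed number, and the function names are hypothetical.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Equation (6): normalize rewards within one rollout group."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, eps_c=0.2):
    """Inner min(...) of Equation (7) for a single sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps_c), 1.0 + eps_c)
    return min(ratio * advantage, clipped_ratio * advantage)


def objective(ratios, advantages, kl, beta=0.01):
    """Equation (7) as a sample mean minus the KL penalty."""
    terms = [clipped_term(r, a) for r, a in zip(ratios, advantages)]
    return fmean(terms) - beta * kl


rewards = [1, 0, 0, 1, 1, 0, 0, 0]        # G = 8 binary outcomes
adv = group_advantages(rewards)
uniform = group_advantages([1, 1, 1, 1])  # all rollouts succeed
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no policy gradient. Clipping caps the benefit of raising the ratio above $1+\epsilon_{c}$ when the advantage is positive, and it does not soften the penalty when the ratio is high and the advantage is negative.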

At each validation step, the full graph evolution pipeline is executed, creating a closed training loop: the improving policy generates richer trajectories that refine the skill graph through node- and edge-level updates, while the refined graph provides higher-quality structured retrieval that accelerates subsequent policy learning. The complete procedure is summarized in Algorithm [1](https://arxiv.org/html/2605.12039#alg1).

Algorithm 1: SkillGraph: Skill-Augmented RL for Agents via Evolving Skill Graphs

Input: base policy $\pi_{\text{base}}$, teacher model $\mathcal{M}$, environment $\mathrm{Env}$, unlocking threshold $\theta_{\text{unlock}}$

Output: trained policy $\pi_{\theta^{*}}$, evolved skill graph $\mathcal{G}^{*}$

- 1: *Graph Construction*
- 2: $\mathcal{T}^{+},\mathcal{T}^{-}\leftarrow\text{Rollout}(\pi_{\text{base}},\mathrm{Env})$
- 3: $\mathcal{V}\leftarrow\mathcal{M}(\mathcal{T}^{+},\mathcal{T}^{-})$ $\triangleright$ Distill general & task-specific skills
- 4: $\mathcal{G}=(\mathcal{V},\mathcal{E})\leftarrow\text{InitGraph}(\mathcal{V})$ $\triangleright$ Add prereq, enhance, co-occur edges
- 5: Compute topological levels $\ell(v)$ for all $v\in\mathcal{V}$
- 6: *Cold-Start SFT*
- 7: $\theta\leftarrow\text{SFT}(\pi_{\text{base}},\,\mathcal{M}(\mathrm{Env},\mathcal{G}))$; $\pi_{\text{ref}}\leftarrow\pi_{\theta}$
- 8: $\mathcal{V}_{\text{active}}\leftarrow\{v:\ell(v)=0\}$; $L\leftarrow 0$ $\triangleright$ Unlock level-0 skills
- 9: *Closed-Loop RL Training*
- 10: for epoch $=1$ to $N$ do
  - 11: for each task $d$ with type $t(d)$ do
    - 12: Graph-Aware Retrieval:
    - 13: $\mathcal{R}_{\text{seed}}\leftarrow\{v\in\mathcal{V}_{\text{active}}:\mathrm{category}(v)=\texttt{general}\vee\mathrm{category}(v)=t(d)\}$
    - 14: $\mathcal{R}_{\text{BFS}}\leftarrow\text{BackwardBFS}(\mathcal{R}_{\text{seed}},\mathcal{G},D)$; $\mathcal{R}_{\text{beam}}\leftarrow\text{ForwardBeam}(\mathcal{R}_{\text{seed}},\mathcal{G},B)$
    - 15: $\mathcal{R}_{\text{ret}}\leftarrow\text{TopoSort}_{\mathcal{G}}(\mathcal{R}_{\text{seed}}\cup\mathcal{R}_{\text{BFS}}\cup\mathcal{R}_{\text{beam}})$ $\triangleright$ Cap at $K_{\max}$ skills
    - 16: Sample $G$ rollouts $\{\tau^{(i)}\}_{i=1}^{G}\sim\pi_{\theta}(\cdot\mid d,\mathcal{R}_{\text{ret}})$; update $\theta$ via GRPO
  - 17: end for
  - 18: if validation step then
    - 19: Graph Evolution:
    - 20: Node-level: Insert / Merge / Split / Deprecate skills via $\mathcal{M}$
    - 21: Edge-level: Reinforce paths in $\tau^{+}$; discover new co-occur edges; decay & prune weak edges
    - 22: Recompute topological levels $\ell(v)$
    - 23: Progressive Unlocking: if $\bar{p}(L)\geq\theta_{\text{unlock}}$ then $\mathcal{V}_{\text{active}}\leftarrow\mathcal{V}_{\text{active}}\cup\{v:\ell(v)=L+1\}$; $L\leftarrow L+1$
  - 24: end if
- 25: end for
- 26: return $\pi_{\theta},\mathcal{G}$

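## 4 Experiments

### 4.1 Experimental Setup

Table 1: Main results on ALFWorld and WebShop. The ALFWorld columns (Pick through All) report per-subtask and overall success rates (%); the WebShop columns (Score, Succ.) report task score and success rate (%). **Bold** and <u>underline</u> denote the best and second-best results, respectively.

| Method | Pick | Look | Clean | Heat | Cool | Pick2 | All | Score | Succ. |
|---|---|---|---|---|---|---|---|---|---|
| *Closed-source LLMs* | | | | | | | | | |
| GPT-4o | 75.3 | 60.8 | 31.2 | 56.7 | 21.6 | 49.8 | 48.0 | 31.8 | 23.7 |
| Gemini-2.5-Pro | 92.8 | 63.3 | 62.1 | 69.0 | 26.6 | 58.7 | 60.3 | 42.5 | 35.9 |
| *Prompt-based Agentic or Memory-based Methods* | | | | | | | | | |
| ReAct | 48.5 | 35.4 | 34.3 | 13.2 | 18.2 | 17.6 | 31.2 | 46.2 | 19.5 |
| Reflexion | 62.0 | 41.6 | 44.9 | 30.9 | 36.3 | 23.8 | 42.7 | 58.1 | 28.8 |
| Mem0 | 54.0 | 55.0 | 26.9 | 36.4 | 20.8 | 7.69 | 33.6 | 23.9 | 2.00 |
| MemP | 54.3 | 38.5 | 48.1 | 56.2 | 32.0 | 16.7 | 41.4 | 25.3 | 6.40 |
| ExpeL | 21.0 | 67.0 | 55.0 | 52.0 | 11.0 | 6.00 | 46.3 | 30.9 | 11.2 |
| SimpleMem | 64.5 | 33.3 | 20.0 | 12.5 | 33.3 | 3.84 | 29.7 | 33.2 | 8.59 |
| *RL-based Methods* | | | | | | | | | |
| RLOO | 87.6 | <u>78.2</u> | 87.3 | 81.3 | 71.9 | 48.9 | 75.5 | 80.3 | 65.7 |
| GRPO | 90.8 | 66.1 | 89.3 | 74.7 | 72.5 | 64.7 | 77.6 | 79.3 | 66.1 |
| *Memory-Augmented RL-based Methods* | | | | | | | | | |
| MemRL | 62.8 | 38.5 | 22.2 | 12.5 | 8.00 | 0.00 | 21.4 | 29.5 | 9.20 |
| EvolveR | 64.9 | 33.3 | 46.4 | 13.3 | 33.3 | 33.3 | 43.8 | 42.5 | 17.6 |
| Mem0+GRPO | 78.1 | 54.8 | 56.1 | 31.0 | 65.0 | 26.9 | 54.7 | 58.1 | 37.5 |
| SimpleMem+GRPO | 89.5 | 63.6 | 60.0 | 50.0 | 64.9 | 26.3 | 62.5 | 67.8 | 46.9 |
| SkillRL | <u>97.9</u> | 71.4 | <u>90.0</u> | <u>90.0</u> | **95.5** | **87.5** | <u>89.9</u> | <u>85.2</u> | <u>72.7</u> |
| SkillGraph (Ours) | **100.0** | **80.0** | **100.0** | **100.0** | <u>80.0</u> | <u>83.3</u> | **90.6** | **91.5** | **84.4** |

##### Environments.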

ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.12039#bib.bib4)) is a text-based household interaction environment that covers six task categories (Pick, Look, Clean, Heat, Cool, Pick2), each requiring multi-step goal-directed manipulation. WebShop (Yao et al., [2022a](https://arxiv.org/html/2605.12039#bib.bib5)) presents a web navigation challenge in which agents must search, browse, and purchase products meeting specific user requirements. For search-augmented question answering, we evaluate on three single-hop benchmarks, namely NQ (Kwiatkowski et al., [2019](https://arxiv.org/html/2605.12039#bib.bib6)), TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2605.12039#bib.bib21)), and PopQA (Mallen et al., [2023](https://arxiv.org/html/2605.12039#bib.bib26)), and four multi-hop benchmarks, namely HotpotQA (Yang et al., [2018](https://arxiv.org/html/2605.12039#bib.bib22)), 2Wiki (Ho et al., [2020](https://arxiv.org/html/2605.12039#bib.bib23)), MuSiQue (Trivedi et al., [2022](https://arxiv.org/html/2605.12039#bib.bib24)), and Bamboogle (Press et al., [2023](https://arxiv.org/html/2605.12039#bib.bib25)).

##### Baselines\.

We compare SkillGraph against four groups of methods. (1) Closed-source LLMs: GPT-4o (Hurst et al., [2024](https://arxiv.org/html/2605.12039#bib.bib31)) and Gemini-2.5-Pro (Comanici et al., [2025](https://arxiv.org/html/2605.12039#bib.bib32)), serving as strong references. (2) Prompt-based and memory-augmented methods: ReAct (Yao et al., [2022b](https://arxiv.org/html/2605.12039#bib.bib7)), Reflexion (Shinn et al., [2023](https://arxiv.org/html/2605.12039#bib.bib8)), Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2605.12039#bib.bib10)), MemP (Fang et al., [2025](https://arxiv.org/html/2605.12039#bib.bib12)), ExpeL (Zhao et al., [2024](https://arxiv.org/html/2605.12039#bib.bib11)), and SimpleMem (Liu et al., [2026](https://arxiv.org/html/2605.12039#bib.bib33)), which use in-context experience without parameter updates. (3) RL-based methods: RLOO (Ahmadian et al., [2024](https://arxiv.org/html/2605.12039#bib.bib17)) and GRPO (Shao et al., [2024](https://arxiv.org/html/2605.12039#bib.bib2)). (4) Memory-augmented RL methods: MemRL (Zhang et al., [2026](https://arxiv.org/html/2605.12039#bib.bib13)), EvolveR (Wu et al., [2025](https://arxiv.org/html/2605.12039#bib.bib14)), Mem0+GRPO, SimpleMem+GRPO, and SkillRL (Xia et al., [2026](https://arxiv.org/html/2605.12039#bib.bib1)). For search-augmented QA, we additionally compare against CoT (Wei et al., [2022](https://arxiv.org/html/2605.12039#bib.bib45)), RAG (Arslan et al., [2024](https://arxiv.org/html/2605.12039#bib.bib46)), Search-o1 (Li et al., [2025](https://arxiv.org/html/2605.12039#bib.bib30)), Search-R1 (Jin et al., [2025](https://arxiv.org/html/2605.12039#bib.bib27)), and ZeroSearch (Sun et al., [2025](https://arxiv.org/html/2605.12039#bib.bib28)).

##### Implementation details\.

We adopt Qwen2.5-7B-Instruct (Yang et al., [2025](https://arxiv.org/html/2605.12039#bib.bib19)) as the base policy $\pi_{\text{base}}$, initialized via cold-start SFT, and OpenAI o3 (Jaech et al., [2024](https://arxiv.org/html/2605.12039#bib.bib20)) as the teacher model $\mathcal{M}$ for skill distillation, SFT data generation, and graph evolution operations. RL training uses GRPO with learning rate $1\times 10^{-6}$, KL coefficient $\beta=0.01$, clipping parameter $\epsilon_{c}=0.2$, train batch size $16$, and group size $G=8$. For graph-aware retrieval, we cap the retrieved skill sequence at $K_{\max}=8$ and set the backward-BFS depth to $D=2$ and the forward beam width to $B=3$. For graph evolution, edges are initialized with weights $w=0.3$ (co-occur) and $w=0.2$ (enhance); at each validation checkpoint, successful paths receive additive reinforcement $\alpha=0.05$, all weights decay by factor $\gamma=0.99$, and edges below $w_{\min}=0.05$ are pruned. Node-level evolution uses merge threshold $\tau_{\text{merge}}=0.85$ and inserts at most $m=3$ new skills per update. Progressive unlocking activates level-$(L+1)$ skills when the average success rate of level-$L$ skills exceeds $\theta_{\text{unlock}}=0.6$.
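
The edge-level update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the constants come from the text, while the function name `evolve_edge_weights`, the order of operations (reinforce, then decay, then prune), and the choice to reinforce only edges that already exist are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (source skill id, target skill id)

ALPHA = 0.05   # additive reinforcement for edges on successful paths
GAMMA = 0.99   # multiplicative decay applied to every edge
W_MIN = 0.05   # edges whose weight falls below this are pruned


def evolve_edge_weights(
    weights: Dict[Edge, float],
    successful_paths: Iterable[List[str]],
) -> Dict[Edge, float]:
    """Return updated weights after one validation checkpoint.

    Assumed order (not specified in the text): reinforce, decay, prune.
    """
    updated = dict(weights)
    for path in successful_paths:
        # Reinforce each consecutive skill pair that is already an edge.
        for edge in zip(path, path[1:]):
            if edge in updated:
                updated[edge] += ALPHA
    # Decay all weights, then drop edges below the pruning threshold.
    decayed = {edge: w * GAMMA for edge, w in updated.items()}
    return {edge: w for edge, w in decayed.items() if w >= W_MIN}
```

Under these assumptions, an edge that is never reinforced shrinks by 1% per checkpoint, so an enhance edge starting at $0.2$ would need on the order of a hundred unreinforced checkpoints to fall below $w_{\min}$.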

### 4.2 Main Results

##### Comparison with baselines\.

Table [1](https://arxiv.org/html/2605.12039#S4.T1) reports results on ALFWorld and WebShop. SkillGraph achieves the best overall performance on both benchmarks. (i) Notably, SkillGraph with a 7B open-source model substantially outperforms closed-source LLMs: it surpasses GPT-4o by 42.6 points and Gemini-2.5-Pro by 30.3 points on ALFWorld, and exceeds both by over 48 points on WebShop, suggesting that structured skill reasoning can compensate for the scale gap. (ii) Compared with prompt-based and memory methods, SkillGraph outperforms the best method (ExpeL) by 44.3 points on ALFWorld, with large gains over the best score in this group on Clean (100.0 vs. 55.0) and Heat (100.0 vs. 56.2). These subtasks require executing prerequisite actions in a strict order, which flat retrieval cannot enforce but graph-aware retrieval handles naturally. (iii) Over the vanilla GRPO baseline with the same optimizer, SkillGraph improves by 13.0 and 18.3 points on ALFWorld and WebShop respectively, directly quantifying the benefit of graph-structured skill guidance in reducing exploration burden. (iv) Against the strongest prior method, SkillRL, SkillGraph achieves slightly higher ALFWorld performance (90.6 vs. 89.9) while gaining 11.7 points on WebShop success rate. We attribute the gap to the evolving graph structure: graph evolution continuously refines the skill set and discovers inter-skill relations (e.g., query refinement → attribute matching → price comparison), providing higher-quality compositional guidance than a static flat skill bank.
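
The point gaps quoted above follow directly from Table 1. As a quick arithmetic check, the snippet below recomputes them from the overall ALFWorld success rates and the WebShop success rates; the dictionary and function names are ours, chosen only for illustration.

```python
# Overall ALFWorld success (%) and WebShop success rate (%), copied from Table 1.
alfworld_all = {
    "GPT-4o": 48.0, "Gemini-2.5-Pro": 60.3, "ExpeL": 46.3,
    "GRPO": 77.6, "SkillRL": 89.9, "SkillGraph": 90.6,
}
webshop_succ = {
    "GPT-4o": 23.7, "Gemini-2.5-Pro": 35.9,
    "GRPO": 66.1, "SkillRL": 72.7, "SkillGraph": 84.4,
}


def gap(table, baseline, ours="SkillGraph"):
    """Absolute improvement of `ours` over `baseline`, in percentage points."""
    return round(table[ours] - table[baseline], 1)


for name in ("GPT-4o", "Gemini-2.5-Pro", "ExpeL", "GRPO", "SkillRL"):
    print(f"ALFWorld gap over {name}: {gap(alfworld_all, name)}")
for name in ("GRPO", "SkillRL"):
    print(f"WebShop success gap over {name}: {gap(webshop_succ, name)}")
```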

##### Generalization to search\-augmented QA\.

Table 2: Results on search-augmented QA. SkillGraph is trained on NQ♡ and HotpotQA♡ (in-domain) and evaluated zero-shot on the remaining five benchmarks♠ (out-of-domain).

Table [2](https://arxiv.org/html/2605.12039#S4.T2) reports results on seven QA benchmarks. Trained only on NQ and HotpotQA, SkillGraph achieves the highest average performance (48.9) and generalizes zero-shot to five unseen datasets. On single-hop tasks, SkillGraph surpasses all baselines on NQ (52.9) and PopQA (52.6), improving over SkillRL by 2.1 and 2.6 respectively. This advantage stems from graph evolution, which keeps the skill set aligned with the evolving policy rather than relying on a fixed skill library. On multi-hop tasks, SkillGraph leads on HotpotQA (44.7) and 2Wiki (43.4), where prerequisite-ordered retrieval helps decompose chained queries into sub-questions. These results indicate that the structured skill representation learned from two training domains transfers to unseen tasks without task-specific adaptation.

![Refer to caption](https://arxiv.org/html/2605.12039v1/x2.png)

Figure 2: Skill graph evolution over training on WebShop. Left: node counts (total, active, inserted, deprecated). Middle: edge counts by type. Right: average node success rate.

### 4.3 Analysis

Table 3: Ablation study on ALFWorld and WebShop success rate (%).

##### Ablation study.

Table [3](https://arxiv.org/html/2605.12039#S4.T3) isolates each component's contribution. The components exhibit complementary strengths across environments. On ALFWorld, removing graph-aware retrieval causes the largest single drop (−31.2), confirming that the rigid multi-step subtasks (e.g., Clean, Heat) depend critically on prerequisite-ordered skill sequences, consistent with the large gains reported in the main results. On WebShop, removing graph evolution (−14.1) or graph structure (−11.7) hurts the most, indicating that WebShop benefits primarily from maintaining a high-quality, evolving skill set: in this flexible navigation setting, having the correct skills matters more than their ordering, which explains why the graph structure gap over SkillRL (+11.7) is larger than the retrieval ordering gap. Removing cold-start SFT yields the largest combined drop (−17.2 on both benchmarks), confirming that a good initialization is essential for RL convergence in complex agent environments.

##### Skill graph evolution dynamics\.

Figure [2](https://arxiv.org/html/2605.12039#S4.F2) tracks graph statistics over training. Node count grows from ∼20 to ∼140 via failure-driven insertion, but the active count plateaus earlier as deprecation prunes failing skills, a self-regulating loop that prevents unbounded growth. Co-occur edges grow fastest through automatic discovery, while prerequisite and enhance edges increase steadily via path reinforcement, showing that the graph discovers relational structure beyond initial construction. The average node success rate rises over training, confirming that evolution progressively filters low-quality skills while reinforcing useful ones.

![Refer to caption](https://arxiv.org/html/2605.12039v1/x3.png)

![Refer to caption](https://arxiv.org/html/2605.12039v1/x4.png)

Figure 3: Training dynamics and context efficiency. Left: WebShop task score over training epochs. Right: average prompt length during training.

##### Convergence and context efficiency.

Figure [3](https://arxiv.org/html/2605.12039#S4.F3) (left) shows that SkillGraph surpasses SkillRL after roughly 50 training steps and maintains a consistently higher score thereafter, converging to a superior final performance. The faster convergence is driven by dependency-ordered retrieval, which reduces early-stage exploration, and by progressive unlocking, which acts as an automatic curriculum. Figure [3](https://arxiv.org/html/2605.12039#S4.F3) (right) shows that graph-guided retrieval maintains shorter prompts than flat retrieval throughout training, because graph traversal limits the retrieved set to topologically relevant skills rather than all semantically similar entries, improving both inference cost and signal-to-noise ratio.

## 5 Conclusion

We presented SkillGraph, a framework that organizes agent skills into a structured dependency graph with typed relational edges and co-evolves the graph with the agent's policy through RL. By unifying graph construction, graph-aware retrieval, and graph evolution into a closed training loop, SkillGraph addresses three key limitations of flat skill libraries: weak compositional planning, poor granularity control, and the inability to accumulate inter-skill relational signals. Experiments on ALFWorld, WebShop, and seven search-augmented QA benchmarks demonstrate state-of-the-art performance, with the largest gains on complex multi-step tasks that require ordered skill composition.

##### Limitations and future work\.

Our current framework relies on a strong teacher model for skill distillation and graph-adaptive operations, which introduces additional inference cost during graph evolution. Exploring lightweight alternatives such as self-distillation or critic-based skill generation could reduce this dependency. Additionally, the skill graph is currently constructed and evolved within a single environment; investigating cross-environment skill transfer, where a graph trained on one domain bootstraps learning in another, is a promising direction. Finally, scaling SkillGraph to larger base models and more diverse task distributions remains an open question.

## References

- A. Ahmadian, C. Cremer, M. Gallé, M. Fadaee, J. Kreutzer, O. Pietquin, A. Üstün, and S. Hooker (2024) Back to basics: revisiting reinforce-style optimization for learning from human feedback in llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12248–12267. Cited by: [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- Arslan et al. (2024) A survey on rag with llms. Procedia Computer Science 246, pp. 3781–3790. Cited by: [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al. (2024) Graph of thoughts: solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 17682–17690. Cited by: [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px2.p1.1).
- P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025) Mem0: building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413. Cited by: [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. (2025) Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261. Cited by: [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, D. Metropolitansky, R. O. Ness, and J. Larson (2024) From local to global: a graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130. Cited by: [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px2.p1.1).
- R. Fang, Y. Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang (2025) Memp: exploring agent procedural memory. arXiv preprint arXiv:2508.06433. Cited by: [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- X. Ho, A. D. Nguyen, S. Sugawara, and A. Aizawa (2020) Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 6609–6625. Cited by: [Table 8](https://arxiv.org/html/2605.12039#A6.T8.1.11.11.4), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px1.p1.1).
- A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024) Gpt-4o system card. arXiv preprint arXiv:2410.21276. Cited by: [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. (2024) Openai o1 system card. arXiv preprint arXiv:2412.16720. Cited by: [Table 8](https://arxiv.org/html/2605.12039#A6.T8.1.16.16.4), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px3.p1.20).
- B. Jin, H. Zeng, Z. Yue, J. Yoon, S. O. Arik, D. Wang, H. Zamani, and J. Han (2025) Search-r1: training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling. External Links: [Link](https://openreview.net/forum?id=Rwhi91ideu). Cited by: [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611. Cited by: [Table 8](https://arxiv.org/html/2605.12039#A6.T8.1.7.7.4), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px1.p1.1).
- T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. (2019) Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 453–466. Cited by: [Table 8](https://arxiv.org/html/2605.12039#A6.T8.1.6.6.4), [§1](https://arxiv.org/html/2605.12039#S1.p5.1), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px1.p1.1).
- X. Li, G. Dong, J. Jin, Y. Zhang, Y. Zhou, Y. Zhu, P. Zhang, and Z. Dou (2025) Search-o1: agentic search-enhanced large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 5420–5438. Cited by: [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- J. Liu, Y. Su, P. Xia, S. Han, Z. Zheng, C. Xie, M. Ding, and H. Yao (2026) SimpleMem: efficient lifelong memory for llm agents. arXiv preprint arXiv:2601.02553. Cited by: [§1](https://arxiv.org/html/2605.12039#S1.p3.1), [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi (2023) When not to trust language models: investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9802–9822. Cited by: [Table 8](https://arxiv.org/html/2605.12039#A6.T8.1.8.8.4), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px1.p1.1).
- N. Nonkes, S. Agaronian, E. Kanoulas, and R. Petcu (2024) Leveraging graph structures to detect hallucinations in large language models. In Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing, pp. 93–104. Cited by: [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px2.p1.1).
- S. Ouyang, J. Yan, I. Hsu, Y. Chen, K. Jiang, Z. Wang, R. Han, L. T. Le, S. Daruki, X. Tang, et al. (2025) Reasoningbank: scaling agent self-evolving with reasoning memory. arXiv preprint arXiv:2509.25140. Cited by: [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px1.p1.1).
- O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis (2023) Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711. Cited by: [Table 8](https://arxiv.org/html/2605.12039#A6.T8.1.13.13.4), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px1.p1.1).
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: [§3.4](https://arxiv.org/html/2605.12039#S3.SS4.p1.9), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, pp. 8634–8652. Cited by: [§1](https://arxiv.org/html/2605.12039#S1.p1.1), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- M. Shridhar, X. Yuan, M. Cote, Y. Bisk, A. Trischler, and M. Hausknecht (2021) ALFWorld: aligning text and embodied environments for interactive learning. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=0IOX0YcCdTn). Cited by: [Table 8](https://arxiv.org/html/2605.12039#A6.T8.1.3.3.4), [§1](https://arxiv.org/html/2605.12039#S1.p1.1), [§1](https://arxiv.org/html/2605.12039#S1.p5.1), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px1.p1.1).
- H. Sun, Z. Qiao, J. Guo, X. Fan, Y. Hou, Y. Jiang, P. Xie, Y. Zhang, F. Huang, and J. Zhou (2025) Zerosearch: incentivize the search capability of llms without searching. arXiv preprint arXiv:2505.04588. Cited by: [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- X. Tang, T. Qin, T. Peng, Z. Zhou, D. Shao, T. Du, X. Wei, P. Xia, F. Wu, H. Zhu, et al. (2025) Agent kb: leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229. Cited by: [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px1.p1.1).
- H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal (2022) MuSiQue: multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics 10, pp. 539–554. Cited by: [Table 8](https://arxiv.org/html/2605.12039#A6.T8.1.12.12.4), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px1.p1.1).
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2024) Voyager: an open-ended embodied agent with large language models. Transactions on Machine Learning Research. External Links: ISSN 2835-8856, [Link](https://openreview.net/forum?id=ehfRiF0R3a). Cited by: [§1](https://arxiv.org/html/2605.12039#S1.p2.1), [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px3.p1.1).
- Y. Wang, R. Takanobu, Z. Liang, Y. Mao, Y. Hu, J. McAuley, and X. Wu (2025) Mem-α: learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911. Cited by: [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px1.p1.1).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, pp. 24824–24837. Cited by: [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- R. Wu, X. Wang, J. Mei, P. Cai, D. Fu, C. Yang, L. Wen, X. Yang, Y. Shen, Y. Wang, et al. (2025) Evolver: self-evolving llm agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079. Cited by: [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- P. Xia, J. Chen, H. Wang, J. Liu, K. Zeng, Y. Wang, S. Han, Y. Zhou, X. Zhao, H. Chen, et al. (2026) Skillrl: evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234. Cited by: [§1](https://arxiv.org/html/2605.12039#S1.p1.1), [§1](https://arxiv.org/html/2605.12039#S1.p2.1), [§1](https://arxiv.org/html/2605.12039#S1.p3.1), [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px3.p1.1), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- R. Xu and Y. Yan (2026) Agent skills for large language models: architecture, acquisition, security, and the path forward. arXiv preprint arXiv:2602.12430. Cited by: [§1](https://arxiv.org/html/2605.12039#S1.p2.1), [§1](https://arxiv.org/html/2605.12039#S1.p3.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [Table 8](https://arxiv.org/html/2605.12039#A6.T8.1.15.15.4), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px3.p1.20).
- Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380. Cited by: [Table 8](https://arxiv.org/html/2605.12039#A6.T8.1.10.10.4), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px1.p1.1).
- S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022a) Webshop: towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems 35, pp. 20744–20757. Cited by: [Table 8](https://arxiv.org/html/2605.12039#A6.T8.1.4.4.4), [§1](https://arxiv.org/html/2605.12039#S1.p1.1), [§1](https://arxiv.org/html/2605.12039#S1.p5.1), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px1.p1.1).
- S. Yao, J. Zhao, D. Yu, I. Shafran, K. R. Narasimhan, and Y. Cao (2022b) ReAct: synergizing reasoning and acting in language models. In NeurIPS 2022 Foundation Models for Decision Making Workshop. External Links: [Link](https://openreview.net/forum?id=tvI4u1ylcqs). Cited by: [§1](https://arxiv.org/html/2605.12039#S1.p1.1), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- G. Zhang, H. Ren, C. Zhan, Z. Zhou, J. Wang, H. Zhu, W. Zhou, and S. Yan (2025) Memevolve: meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746. Cited by: [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px1.p1.1).
- S. Zhang, J. Wang, R. Zhou, J. Liao, Y. Feng, Z. Li, Y. Zheng, W. Zhang, Y. Wen, Z. Li, et al. (2026) Memrl: self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192. Cited by: [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).
- A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang (2024) Expel: llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 19632–19642. Cited by: [§1](https://arxiv.org/html/2605.12039#S1.p2.1), [§1](https://arxiv.org/html/2605.12039#S1.p3.1), [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.12039#S2.SS0.SSS0.Px3.p1.1), [§4.1](https://arxiv.org/html/2605.12039#S4.SS1.SSS0.Px2.p1.1).

## Appendix A Supplementary Details for SkillGraph

This appendix provides formal definitions, derivations, and implementation specifics that supplement the method description in Section [3](https://arxiv.org/html/2605.12039#S3).

### A.1 Level Computation

Each node $v\in\mathcal{V}$ is assigned a topological level $\ell(v)$ used for progressive unlocking and dependency-respecting retrieval ordering. Levels are computed via BFS over the directional dependency edges:

$$
\ell(v)=\begin{cases}0 & \text{if } v \text{ has no prerequisite/enhancement parents},\\ \max_{u:(u,v)\in\mathcal{E}_{\text{dep}}}\ell(u)+1 & \text{otherwise},\end{cases}
\tag{8}
$$

where $\mathcal{E}_{\text{dep}}=\{e\in\mathcal{E}:\text{type}(e)\in\{\texttt{prereq},\texttt{enhance}\}\}$. We exclude `co_occur` edges from this computation because co-occurrence captures symmetric association rather than directional dependency; including them would introduce cycles and blur the prerequisite hierarchy. Levels are recomputed after every graph evolution step to reflect topology changes.
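
One way to realize Eq. (8) is a breadth-first pass in topological order, sketched below. This is an illustration rather than the authors' code: the function name `compute_levels` and the `(parent, child, type)` edge representation are assumptions, and the sketch assumes the prerequisite and enhancement edges form a DAG (it raises an error otherwise).

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Tuple

TypedEdge = Tuple[str, str, str]  # (parent u, child v, edge type)
DEP_TYPES = {"prereq", "enhance"}


def compute_levels(nodes: Iterable[str], edges: Iterable[TypedEdge]) -> Dict[str, int]:
    """Assign level 0 to roots, else 1 + the max level over dependency parents."""
    children: Dict[str, List[str]] = defaultdict(list)
    indegree = {v: 0 for v in nodes}
    for u, v, edge_type in edges:
        if edge_type in DEP_TYPES:  # co_occur edges are ignored
            children[u].append(v)
            indegree[v] += 1

    level = {v: 0 for v in indegree}
    queue = deque(v for v, d in indegree.items() if d == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in children[u]:
            level[v] = max(level[v], level[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:  # all parents processed, level[v] is final
                queue.append(v)
    if visited != len(indegree):
        raise ValueError("dependency edges contain a cycle")
    return level
```

Processing a node only after all of its parents guarantees that the maximum in Eq. (8) is taken over final parent levels.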

### A.2 Edge Initialization

Before training begins, edges are initialized with structural priors rather than learned from data: `co_occur` edges ($w=0.3$) connect task-specific skills within the same category, and `enhance` edges ($w=0.2$) connect each general skill to all task-specific skills. No `prereq` edges are created at initialization; they emerge through graph evolution as the agent discovers ordering dependencies.
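
A minimal sketch of this initialization is given below, under two assumptions that the text does not settle: each symmetric `co_occur` relation is stored once per unordered pair, and the category label `"general"` marks general skills. The function name `init_edges` is hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

W_CO_OCCUR = 0.3
W_ENHANCE = 0.2


def init_edges(categories: Dict[str, str]) -> List[Tuple[str, str, str, float]]:
    """Build initial (source, target, type, weight) edges from skill categories.

    `categories` maps skill id -> category; "general" marks general skills.
    """
    general = sorted(s for s, c in categories.items() if c == "general")
    specific = sorted(s for s, c in categories.items() if c != "general")
    edges = []
    # co_occur: pairs of task-specific skills sharing a category.
    for a, b in combinations(specific, 2):
        if categories[a] == categories[b]:
            edges.append((a, b, "co_occur", W_CO_OCCUR))
    # enhance: every general skill points to every task-specific skill.
    for g in general:
        for s in specific:
            edges.append((g, s, "enhance", W_ENHANCE))
    return edges
```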

### A.3 Statistics Update

At each validation checkpoint, node statistics are updated incrementally before evolution decisions are made:

$$
\hat{p}(v)\leftarrow\frac{n_{\text{succ}}(v)+n_{\text{succ}}^{\text{new}}(v)}{n_{\text{use}}(v)+n_{\text{use}}^{\text{new}}(v)},
\tag{9}
$$

where $n_{\text{succ}}^{\text{new}}(v)$ and $n_{\text{use}}^{\text{new}}(v)$ are the success and usage counts observed in the latest trajectory batch. A skill receives one usage count when it is retrieved into the prompt and one success count when the corresponding rollout succeeds.
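
Eq. (9) amounts to keeping two running counters per skill. The sketch below shows one possible bookkeeping class; the name `SkillStats` and the convention of returning 0.0 for a skill that has never been used are assumptions, since Eq. (9) is undefined when the usage count is zero.

```python
from dataclasses import dataclass


@dataclass
class SkillStats:
    n_use: int = 0    # times the skill was retrieved into a prompt
    n_succ: int = 0   # times the corresponding rollout succeeded

    def update(self, new_use: int, new_succ: int) -> float:
        """Fold one batch of counts in and return the updated success rate."""
        self.n_use += new_use
        self.n_succ += new_succ
        return self.success_rate

    @property
    def success_rate(self) -> float:
        # Assumed convention: an unused skill reports 0.0.
        return self.n_succ / self.n_use if self.n_use else 0.0
```

For example, a skill with 4 successes in 10 uses that then sees 8 successes in 10 new uses ends at 12/20 = 0.6.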

### A.4 Node Evolution: Additional Details

##### Insertion and edge bootstrapping\.

Newly inserted skills start as isolated nodes with no edges. Connections are established in subsequent checkpoints via the co-occurrence discovery mechanism: if a new skill and an existing skill co-appear in at least $c_{\min}=2$ successful episodes, a `co_occur` edge is added automatically.
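
The co-occurrence discovery rule can be sketched as a pair count over successful episodes. This is an illustrative version only: the function name `discover_co_occur`, the sorted-tuple representation of an undirected pair, and counting each pair at most once per episode are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple

C_MIN = 2  # minimum co-appearances in successful episodes


def discover_co_occur(
    successful_episodes: Iterable[Iterable[str]],
    existing: Set[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return new undirected co_occur pairs, each stored as a sorted tuple."""
    counts: Counter = Counter()
    for skills in successful_episodes:
        # Count each unordered pair at most once per episode.
        for pair in combinations(sorted(set(skills)), 2):
            counts[pair] += 1
    return sorted(p for p, c in counts.items() if c >= C_MIN and p not in existing)
```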

##### Merge: edge inheritance\.

When two skills $v_{i}$ and $v_{j}$ are merged into $v_{\text{merged}}$, the merged node inherits the union of edges from both originals. Duplicate edges to the same neighbor are resolved by keeping the higher weight.
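
The edge-inheritance rule can be sketched as a relabel-and-deduplicate pass. The helper `merge_edges` is hypothetical, and two details are assumptions: edges count as duplicates only when they share neighbor, direction, and type, and any edge between the two merged skills is dropped rather than kept as a self-loop.

```python
from typing import Dict, Tuple

EdgeKey = Tuple[str, str, str]  # (source, target, edge type)


def merge_edges(
    edges: Dict[EdgeKey, float], v_i: str, v_j: str, merged: str
) -> Dict[EdgeKey, float]:
    """Redirect edges of v_i and v_j to `merged`, keeping the higher weight."""
    old = {v_i, v_j}
    result: Dict[EdgeKey, float] = {}
    for (src, dst, kind), w in edges.items():
        src = merged if src in old else src
        dst = merged if dst in old else dst
        if src == dst:
            continue  # drop edges between the two merged skills (assumed)
        key = (src, dst, kind)
        result[key] = max(w, result.get(key, float("-inf")))
    return result
```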

##### Split: edge reconnection\.

When a skill $v$ is split into sub-skills $\{v_{1}^{\prime},v_{2}^{\prime},\ldots\}$, the sub-skills are connected by `prereq` edges in the order produced by the teacher model. Existing edges of $v$ are redistributed to the sub-skill whose description is most relevant.

### A.5 Progressive Unlocking: Additional Details

During the initial warmup phase (the first 5 training steps), only non-deprecated level-0 skills are active: $\mathcal{V}_{\text{active}}^{(0)}=\{v:\ell(v)=0\}$. After warmup, unlocking is checked at each validation checkpoint. Success rates are smoothed with a $\text{Beta}(1,1)$ prior to avoid premature unlocking from small sample sizes. If the newly unlocked level already satisfies the threshold, the next level is unlocked as well, so multiple levels can be unlocked within a single checkpoint; this enables rapid progression when the policy has strong foundational competence.
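
A sketch of the unlocking check follows. It assumes the Beta(1,1) smoothing is applied per skill as the posterior mean $(n_{\text{succ}}+1)/(n_{\text{use}}+2)$ and that a level's rate is the plain average over its skills; neither detail is stated explicitly, and the function names are ours.

```python
from typing import Dict, List, Tuple

THETA_UNLOCK = 0.6


def smoothed_rate(n_succ: int, n_use: int) -> float:
    """Posterior mean of a Bernoulli rate under a Beta(1, 1) prior."""
    return (n_succ + 1) / (n_use + 2)


def unlock_levels(
    stats_by_level: Dict[int, List[Tuple[int, int]]],
    max_unlocked: int,
) -> int:
    """Return the highest unlocked level after one checkpoint.

    stats_by_level[L] lists (n_succ, n_use) for the active skills at level L.
    """
    top = max(stats_by_level, default=0)
    while max_unlocked < top:
        skills = stats_by_level.get(max_unlocked, [])
        if not skills:
            break
        mean = sum(smoothed_rate(s, n) for s, n in skills) / len(skills)
        if mean <= THETA_UNLOCK:
            break
        max_unlocked += 1  # keep looping: several levels may unlock at once
    return max_unlocked
```

With this smoothing, a skill that has never been used reports 0.5, and a single lucky success out of one use reports 2/3 rather than 1.0.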

## Appendix B Additional Implementation Details

##### Skill schema\.

Each natural-language skill is stored as a compact record containing a unique skill identifier, a short title, a principle, an applicability condition, and a category (`general` or an environment-specific task type). The same record format is used by flat skill-library baselines; SkillGraph augments it with graph metadata including level, exposure count, successful-exposure count, success rate, creation step, and deprecation status.

##### Co\-occurrence edge threshold\.

New `co_occur` edges require at least $c_{\min}=2$ co-appearances in successful validation episodes before being added, preventing spurious edges from single lucky episodes. Deprecated nodes are retained in the saved graph for auditability but excluded from $\mathcal{V}_{\text{active}}$.

## Appendix C Experimental Details

##### Metric definitions\.

For ALFWorld, we report success rate. For WebShop, Table [1](https://arxiv.org/html/2605.12039#S4.T1) reports both normalized task score and binary success rate, while Table [3](https://arxiv.org/html/2605.12039#S4.T3) reports task score to match the training-curve analysis. For QA benchmarks, we report exact-match accuracy under the search-augmented QA evaluation protocol. The search experiments use a tool-augmented QA environment where the retriever returns the top-3 passages from a Wikipedia index built with an E5 retriever. SkillGraph is trained on NQ and HotpotQA and evaluated on the seven datasets reported in Table [2](https://arxiv.org/html/2605.12039#S4.T2).

##### Hyperparameters\.

Table [4](https://arxiv.org/html/2605.12039#A3.T4) lists the training hyperparameters for each environment. Table [5](https://arxiv.org/html/2605.12039#A3.T5) lists the SkillGraph-specific hyperparameters, which are shared across all three environments.

Table 4: Training hyperparameters per environment.

Table 5: SkillGraph hyperparameters (shared across all environments).

| Hyperparameter | Symbol | Value |
|---|---|---|
| *Graph-aware retrieval* | | |
| Retrieved skill cap | $K_{\max}$ | 8 |
| Backward BFS depth | $D$ | 2 |
| Forward beam width | $B$ | 3 |
| *Node-level evolution* | | |
| Max new skills per update | $m$ | 3 |
| Merge threshold (Jaccard) | $\tau_{\text{merge}}$ | 0.85 |
| Deprecation threshold | – | 0.15 |
| Min usage for deprecation | – | 20 |
| *Edge-level evolution* | | |
| Path reinforcement step | $\alpha$ | 0.05 |
| Edge decay factor | $\gamma$ | 0.99 |
| Edge pruning threshold | $w_{\min}$ | 0.05 |
| *Progressive unlocking* | | |
| Curriculum warmup epochs | – | 5 |
| Level unlock threshold | $\theta_{\text{unlock}}$ | 0.6 |

## Appendix D Confidence Intervals

We compute 95% confidence intervals for SkillGraph to quantify evaluation uncertainty. Table [6](https://arxiv.org/html/2605.12039#A4.T6) reports the results.

Table 6: 95% confidence intervals for SkillGraph across all benchmarks.

| Benchmark | Metric | SkillGraph |
|---|---|---|
| *ALFWorld & WebShop* | | |
| ALFWorld | Overall Succ. (%) | 90.6 ± 7.1 |
| WebShop | Task Score | 91.5 ± 6.8 |
| WebShop | Success Rate (%) | 84.4 ± 8.9 |
| *Search-Augmented QA (EM)* | | |
| NQ | EM | 48.0 ± 4.4 |
| TriviaQA | EM | 63.8 ± 4.2 |
| PopQA | EM | 48.5 ± 4.4 |
| HotpotQA | EM | 44.7 ± 4.4 |
| 2Wiki | EM | 43.4 ± 4.3 |
| MuSiQue | EM | 19.5 ± 3.5 |
| Bamboogle | EM | 72.6 ± 3.9 |
| Average (QA) | EM | 48.9 ± 4.4 |
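
The text does not say how these intervals were computed or how many evaluation episodes were used. For readers who want to reason about intervals of this size, the sketch below shows one common choice, the normal-approximation (Wald) interval for a proportion. Both the method and the sample size `n` are assumptions on our part, not the authors' stated procedure.

```python
import math


def wald_half_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a proportion."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


# Hypothetical example: a 90.6% success rate measured on 64 episodes.
half_width_points = 100 * wald_half_width(0.906, n=64)
print(f"+/- {half_width_points:.1f} points")
```

With the hypothetical `n = 64`, the half-width at a 90.6% success rate comes out at roughly 7 points. The Wald interval is known to be unreliable near 0% or 100%, where a Wilson or bootstrap interval is usually preferred.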
## Appendix E Prompt Templates

This section gives representative prompt templates used by SkillGraph. Environment-specific prompts differ mainly in the action space and observation format, while the retrieved skill block is shared across ALFWorld, WebShop, and search-augmented QA.

##### Agent prompt with retrieved skills\.

The following template shows the common structure used when skill memory is enabled. The ALFWorld and WebShop variants replace the action-space description with admissible environment actions, while the search variant replaces it with the choice between issuing a `<search>` query and returning an `<answer>`.

Agent Prompt Template

You are an expert agent operating in the target environment. Your task is to: {task\_description}

\#\# Retrieved Relevant Experience

{retrieved\_skills}

\#\# Current Progress

Prior to this step, you have already taken {step\_count} step(s). Below are the most recent observations and actions:

{action\_history}

Current observation: {current\_observation}

Admissible actions: \[{admissible\_actions}\]

Now it is your turn to take an action. First reason step-by-step within <think\> </think\> tags. Then choose one admissible action within <action\> </action\> tags.
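
A minimal sketch of how such a template might be filled in and how the model's reply might be parsed. The placeholder names come from the template above. The line breaks inside the template string, the helper names `build_agent_prompt` and `parse_action`, and the rule of rejecting actions outside the admissible set are assumptions.

```python
import re

AGENT_PROMPT = (
    "You are an expert agent operating in the target environment. "
    "Your task is to: {task_description}\n\n"
    "## Retrieved Relevant Experience\n{retrieved_skills}\n\n"
    "## Current Progress\n"
    "Prior to this step, you have already taken {step_count} step(s). "
    "Below are the most recent observations and actions:\n{action_history}\n\n"
    "Current observation: {current_observation}\n"
    "Admissible actions: [{admissible_actions}]\n\n"
    "Now it is your turn to take an action. "
    "First reason step-by-step within <think> </think> tags. "
    "Then choose one admissible action within <action> </action> tags."
)


def build_agent_prompt(task, skills_block, step_count, history, observation, actions):
    """Fill the template; `history` is a list of (observation, action) pairs."""
    history_text = "\n".join(
        f"Observation: {obs}\nAction: {act}" for obs, act in history
    )
    return AGENT_PROMPT.format(
        task_description=task,
        retrieved_skills=skills_block,
        step_count=step_count,
        action_history=history_text,
        current_observation=observation,
        admissible_actions=", ".join(actions),
    )


def parse_action(response, admissible):
    """Return the action inside <action> tags, or None if it is missing
    or not one of the admissible actions."""
    match = re.search(r"<action>(.*?)</action>", response, re.DOTALL)
    if match is None:
        return None
    action = match.group(1).strip()
    return action if action in admissible else None


actions = ["go to desk 1", "look"]
prompt = build_agent_prompt(
    "put a mug on the desk", "- (retrieved skills go here)", 1,
    [("You see a mug 1.", "take mug 1")], "You are holding mug 1.", actions,
)
chosen = parse_action("<think>The desk is next.</think><action> go to desk 1 </action>", actions)
```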

##### Graph\-ordered skill injection\.

In graph retrieval mode, retrieved skills are rendered in dependency order before being inserted into the agent prompt\. This makes the graph structure visible to the policy without requiring special model architecture changes\.

Graph-Ordered Skill Block

\#\#\# Skills (ordered by dependency)

\- \*\*\[category\] Skill Title\*\* \[skill\_id\]: Skill principle. \_Apply when: Applicability condition.\_

\- \*\*Next Skill Title\*\* \[skill\_id\]: Skill principle. \_Apply when: Applicability condition.\_

\#\#\# Mistakes to Avoid

\- \*\*Don't\*\*: Failure pattern or bad action. \*\*Instead\*\*: Corrective strategy.
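
One way to obtain a dependency order is a topological sort over prerequisite edges, so that every skill is listed after the skills it depends on. The sketch below uses Python's standard `graphlib` module for this. The dictionary layout of a skill, the function name `render_skill_block`, and the choice to print a category tag on every entry are assumptions, and the paper's actual ordering procedure may differ.

```python
from graphlib import TopologicalSorter


def render_skill_block(skills, prerequisites):
    """Render retrieved skills with prerequisites listed first.

    skills: dict mapping skill_id -> dict with keys
            'category', 'title', 'principle', 'when_to_apply'.
    prerequisites: dict mapping skill_id -> set of prerequisite skill_ids.
    Prerequisites that were not retrieved are ignored. A cyclic graph
    raises graphlib.CycleError.
    """
    graph = {
        sid: {p for p in prerequisites.get(sid, ()) if p in skills}
        for sid in skills
    }
    lines = ["### Skills (ordered by dependency)"]
    for sid in TopologicalSorter(graph).static_order():
        s = skills[sid]
        lines.append(
            f"- **[{s['category']}] {s['title']}** [{sid}]: "
            f"{s['principle']} _Apply when: {s['when_to_apply']}_"
        )
    return "\n".join(lines)


skills = {
    "use_sink": {"category": "clean", "title": "Rinse At Sink",
                 "principle": "Clean objects at the sink basin.",
                 "when_to_apply": "The task asks for a clean object."},
    "find_obj": {"category": "search", "title": "Locate Target First",
                 "principle": "Find the object before planning further.",
                 "when_to_apply": "The target is not yet visible."},
}
block = render_skill_block(skills, {"use_sink": {"find_obj"}})
```

Because the ordering is carried entirely by the rendered text, the policy sees the dependency structure without any change to the model architecture.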

##### Failure\-driven skill insertion prompt\.

During dynamic updates, failed validation trajectories are summarized and passed to the teacher model\. The teacher is asked to produce a small number of new skills and to avoid duplicating existing skill titles\. Returned identifiers are reassigned by the implementation to prevent collisions\.

Failure-Driven Skill Insertion Prompt

Analyze these failed agent trajectories and suggest NEW skills to add to the skill bank.

FAILED TRAJECTORIES:

Example 1: Task: {task} Task Type: {task\_type} Trajectory (last 5 steps): Action: {action} Observation: {observation} ...

EXISTING SKILL TITLES (avoid duplicating these): {existing\_titles}

Generate 1-{max\_new\_skills} NEW actionable skills that would help avoid these failures. Each skill must have: skill\_id, title (3-5 words), principle (1-2 sentences), when\_to\_apply. Use skill\_ids: {dyn\_id\_list}

Return ONLY a JSON array of skills, no other text.
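
The post-processing described above (parse the JSON array, skip duplicate titles, reassign identifiers) could look roughly like the following sketch. The four required fields come from the prompt. The `dyn_NNN` identifier format, the case-insensitive title comparison, and the function name `parse_new_skills` are assumptions.

```python
import json

REQUIRED_FIELDS = ("skill_id", "title", "principle", "when_to_apply")


def parse_new_skills(raw, existing_ids, existing_titles, max_new_skills=3):
    """Parse the teacher's JSON array and return validated new skills
    with freshly assigned, collision-free identifiers."""
    skills = json.loads(raw)
    if not isinstance(skills, list):
        raise ValueError("expected a JSON array of skills")

    taken_ids = set(existing_ids)
    seen_titles = {t.strip().lower() for t in existing_titles}
    accepted, counter = [], 0

    for skill in skills:
        if len(accepted) >= max_new_skills:
            break
        # Require every schema field to be a non-empty string.
        if not isinstance(skill, dict) or not all(
            isinstance(skill.get(f), str) and skill[f].strip()
            for f in REQUIRED_FIELDS
        ):
            continue
        title_key = skill["title"].strip().lower()
        if title_key in seen_titles:
            continue  # duplicates an existing or already accepted title
        # Ignore the returned skill_id and allocate the next free one.
        while f"dyn_{counter:03d}" in taken_ids:
            counter += 1
        new_id = f"dyn_{counter:03d}"
        taken_ids.add(new_id)
        seen_titles.add(title_key)
        accepted.append({**skill, "skill_id": new_id})
    return accepted


raw = (
    '[{"skill_id": "x", "title": "Check Fridge First", '
    '"principle": "p", "when_to_apply": "w"}, '
    '{"skill_id": "x", "title": "Open Before Taking", '
    '"principle": "p", "when_to_apply": "w"}]'
)
new_skills = parse_new_skills(raw, {"dyn_000"}, ["check fridge first"])
```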

##### Skill merge and split prompts\.

For graph evolution, the teacher is also used as a skill-bank curator. Merge prompts ask it to combine two semantically overlapping skills into one concise skill. Split prompts ask it to decompose a high-usage but low-success skill into two or three simpler sub-skills, optionally conditioned on failure contexts where the original skill did not help. Both prompts require the same JSON schema as insertion: skill\_id, title, principle, and when\_to\_apply.
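
Before a merge prompt is issued, candidate pairs have to be selected. Table 5 lists a Jaccard merge threshold of 0.85, but this appendix does not say which sets the similarity is computed over. The sketch below assumes lowercase word sets of each skill's text, which is only one possible reading. The function names are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_candidates(skill_texts, threshold=0.85):
    """Return pairs of skill ids whose word-set similarity meets the threshold.

    skill_texts: dict mapping skill_id -> text (for example title + principle).
    Tokenizing on whitespace is an ASSUMPTION made for this sketch.
    """
    tokens = {sid: set(text.lower().split()) for sid, text in skill_texts.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(tokens), 2)
        if jaccard(tokens[x], tokens[y]) >= threshold
    ]


texts = {
    "a": "check the fridge before searching shelves",
    "b": "Check the fridge before searching shelves",
    "c": "compare prices across listings",
}
pairs = merge_candidates(texts)
```

Each returned pair would then be passed to the teacher with the merge prompt. The pairwise comparison is quadratic in the number of skills, which is acceptable for small skill banks.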

## Appendix F Additional Search Training Results

Table [7](https://arxiv.org/html/2605.12039#A6.T7) reports intermediate validation checkpoints for the search-augmented QA SkillGraph run. The final checkpoint at step 200 gives the best average score, while NQ and HotpotQA peak slightly earlier at step 180. We report the unified step-200 checkpoint in Table [2](https://arxiv.org/html/2605.12039#S4.T2) so that a single, consistent model selection rule applies across datasets.

Table 7: Search-augmented QA validation accuracy (%) for SkillGraph over training. NQ and HotpotQA are in-domain training datasets; the remaining datasets are held-out transfer evaluations.

Table 8: Licenses of datasets, environments, and models used in this work.

| Asset | Type | License | Reference |
| --- | --- | --- | --- |
| Environments | | | |
| ALFWorld | Environment | MIT | Shridhar et al. \[[2021](https://arxiv.org/html/2605.12039#bib.bib4)\] |
| WebShop | Environment | MIT | Yao et al. \[[2022a](https://arxiv.org/html/2605.12039#bib.bib5)\] |
| Datasets: Single-hop QA | | | |
| Natural Questions (NQ) | Dataset | Apache 2.0 | Kwiatkowski et al. \[[2019](https://arxiv.org/html/2605.12039#bib.bib6)\] |
| TriviaQA | Dataset | Apache 2.0 | Joshi et al. \[[2017](https://arxiv.org/html/2605.12039#bib.bib21)\] |
| PopQA | Dataset | MIT | Mallen et al. \[[2023](https://arxiv.org/html/2605.12039#bib.bib26)\] |
| Datasets: Multi-hop QA | | | |
| HotpotQA | Dataset | CC BY-SA 4.0 | Yang et al. \[[2018](https://arxiv.org/html/2605.12039#bib.bib22)\] |
| 2WikiMultiHopQA | Dataset | Apache 2.0 | Ho et al. \[[2020](https://arxiv.org/html/2605.12039#bib.bib23)\] |
| MuSiQue | Dataset | CC BY 4.0 | Trivedi et al. \[[2022](https://arxiv.org/html/2605.12039#bib.bib24)\] |
| Bamboogle | Dataset | MIT | Press et al. \[[2023](https://arxiv.org/html/2605.12039#bib.bib25)\] |
| Models | | | |
| Qwen2.5-7B-Instruct | Model | Apache 2.0 | Yang et al. \[[2025](https://arxiv.org/html/2605.12039#bib.bib19)\] |
| OpenAI o3 | API Service | Proprietary | Jaech et al. \[[2024](https://arxiv.org/html/2605.12039#bib.bib20)\] |

## Appendix G Compute Resources

All training experiments are conducted on a single node equipped with 8× NVIDIA A100 80GB GPUs, 224 CPU cores, and 2048 GB of system memory. The total compute budget across all training runs amounts to approximately 280 GPU-hours.

## Appendix H Broader Impact

This work proposes a general framework for organizing and evolving reusable skills in LLM\-based agents\. We discuss potential broader impacts below\.

##### Positive impacts\.

By enabling agents to accumulate structured knowledge from experience and reuse it across tasks, SkillGraph can improve the sample efficiency and reliability of autonomous agents in domains such as household assistance, web navigation, and information retrieval. The graph-structured skill memory also enhances interpretability: users can inspect which skills were retrieved, how they are related, and why certain decisions were made, facilitating human oversight of agent behavior. Furthermore, the progressive unlocking mechanism provides a built-in safety property: agents are restricted to well-mastered foundational skills before being exposed to more complex behaviors, reducing the risk of premature deployment of unreliable capabilities.

##### Potential risks and limitations\.

As with other LLM agent systems, SkillGraph inherits the biases and failure modes of the underlying language model. Skills distilled from trajectories may encode undesirable patterns if the training data contains biased behaviors. The teacher model used for graph evolution (e.g., skill insertion, merge, split) may introduce errors or hallucinated skills, which could propagate through the graph. We mitigate this through the deprecation mechanism that removes consistently failing skills, but additional safeguards (e.g., human-in-the-loop skill review) may be necessary for safety-critical applications. Our current evaluation focuses on simulated environments; deployment in real-world settings would require careful validation of skill quality and additional safety constraints.

## Appendix I LLM Usage Statement

Large language models were used in this work in two capacities. (1) As part of the research methodology: LLMs serve as the teacher model for skill distillation, SFT data generation, and graph evolution operations (insertion, merge, split), and as the base policy fine-tuned via RL, as described in Section [3](https://arxiv.org/html/2605.12039#S3). (2) For writing assistance: LLMs were used to polish the language and improve the presentation of this manuscript. All LLM-assisted content has been manually reviewed, verified, and edited by the authors. The authors take full responsibility for the accuracy and integrity of all claims, results, and statements presented in this paper.

## Appendix J Asset Licenses

Table [8](https://arxiv.org/html/2605.12039#A6.T8) summarizes the licenses of all datasets, environments, and models used in this work. All assets are publicly available and permit academic research use.
