Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries

arXiv cs.CL Papers

Summary

This paper introduces Group of Skills (GoSkills), a retrieval method that organizes atomic skills into role-labeled execution contexts to improve agent performance within limited context budgets.

arXiv:2605.06978v1

# Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries
Source: [https://arxiv.org/html/2605.06978](https://arxiv.org/html/2605.06978)
Kun Zeng♣, Yu Huo♠, Siyu Zhang♡, Zi Ye♣, Yuecheng Zhuo♢, Haoyue Liu♠, Yuquan Lu♣, Junhao Wen♣, Xiaoying Tang♠

♠ School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; ♣ Sun Yat-sen University; ♡ University of California, San Diego; ♢ Taiyuan University of Technology

###### Abstract

Skill-augmented agents increasingly rely on large reusable skill libraries, but retrieving relevant skills is not the same as presenting usable context. Existing methods typically return atomic skills or dependency-aware bundles whose internal roles remain implicit, leaving the agent to infer the execution entry point, support skills, visible requirements, and failure-avoidance guidance. We introduce Group of Skills (GoSkills), an inference-time group-structured retrieval method that changes the agent-facing retrieval object from a flat skill list to a compact, role-labeled execution context. GoSkills builds anchor-centered skill groups from a typed skill graph, expands support groups through a group graph, bottlenecks the selected group plan into a bounded set of atomic skill payloads, and renders a fixed execution contract with Start, Support, Check, and Avoid fields, without changing the downstream agent, skill payloads, or execution environment. Experiments on SkillsBench and ALFWorld show that GoSkills preserves visible-requirement coverage under a small skill budget, improves over flat skill-access baselines, and often improves reward and agent-only runtime relative to structural retrieval references. Code is available at [https://anonymous.4open.science/r/Group-of-Skills-E861](https://anonymous.4open.science/r/Group-of-Skills-E861).

![Refer to caption](https://arxiv.org/html/2605.06978v1/Figures/Comparison.png)

Figure 1: Evolution from individual skill retrieval to group-structured retrieval. Vanilla Skills rely on full-library prompting; Vector Skills retrieve the top-$k$ semantically similar skills; Graph of Skills performs graph-structured node retrieval and hydrates dependency-aware skill bundles; and Group of Skills scores anchor-centered skill groups before expanding them into a group plan.

## 1 Introduction

Skill-augmented LLM agents increasingly rely on external skill libraries: reusable snippets, procedural templates, tool instructions, checkers, and task-specific conventions that are too numerous to place in the prompt at once (Jiang et al., [2026](https://arxiv.org/html/2605.06978#bib.bib26); Wang et al., [2026](https://arxiv.org/html/2605.06978#bib.bib27); Han et al., [2025](https://arxiv.org/html/2605.06978#bib.bib28)). As these libraries grow, the bottleneck shifts from whether an agent can access skills to how retrieved skills should be organized under a limited context budget. Full-library prompting preserves recall but is expensive; flat semantic retrieval is cheaper but can miss functionally required skills; and graph-based retrieval improves recall by modeling relations among skills (Li et al., [2026a](https://arxiv.org/html/2605.06978#bib.bib13); Xia et al., [2026](https://arxiv.org/html/2605.06978#bib.bib14)). Yet even a relevant retrieved bundle may still leave the downstream agent to infer the execution entry point, supporting skills, visible requirements, and failure-avoidance guidance. Figure [1](https://arxiv.org/html/2605.06978#S0.F1) summarizes this progression from individual skills and hydrated bundles toward role-aware skill groups.

This motivates a different question: what unit should a skill retriever expose to the agent? Existing interfaces usually decide which atomic skills or hydrated bundles to include, while leaving the agent-facing roles among them implicit (Qu et al., [2024](https://arxiv.org/html/2605.06978#bib.bib29); Shi et al., [2025](https://arxiv.org/html/2605.06978#bib.bib5)). In verifier-sensitive coding and interaction tasks, this missing organization can matter as much as recall: a setup utility may need to precede a checker, a parser may only be useful as support for a formatter, and visible requirements such as output formats or public tests must remain explicit after context compression (Hendrycks et al., [2021](https://arxiv.org/html/2605.06978#bib.bib30); Geng et al., [2023](https://arxiv.org/html/2605.06978#bib.bib32); Kang et al., [2025](https://arxiv.org/html/2605.06978#bib.bib31)). We use “visible requirements” only for information available before execution, excluding hidden tests, evaluator internals, and previous failure traces.

We introduce *Group of Skills* (GoSkills), an inference-time group-level retrieval and contextualization method for agent skill libraries. The key idea is to change the agent-facing retrieval object from a flat list of atomic skills to a compact, role-labeled execution context. A useful group is not an arbitrary semantic cluster: it is an anchor-centered local pattern whose support members add complementary roles, artifact coverage, visible checks, or failure-avoidance cues. Offline, GoSkills constructs such bounded groups from typed skill neighborhoods and links related groups through a group graph. At inference time, it retrieves an anchor group, expands support groups, bottlenecks the selected group plan into a small set of atomic skill payloads, and renders a fixed execution contract. Because the downstream agent, skill payloads, and execution environment are unchanged, we isolate the intervention to retrieval-time context organization rather than agent training, tool execution, or environment modification.

We evaluate GoSkills on SkillsBench (Li et al., [2026c](https://arxiv.org/html/2605.06978#bib.bib6)), which tests technical skill selection and visible-requirement coverage, and ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.06978#bib.bib17)), which tests multi-turn interactive decision-making. This combination lets us test whether group-structured context helps both verifier-sensitive technical tasks and broader downstream execution under constrained skill budgets.

Our contributions are:

- We formulate *group-structured skill retrieval* for agent skill libraries: retrieval selects anchor-centered skill groups and expands them under a context budget, rather than exposing isolated skills or post-hoc bundles.
- We propose GoSkills, a deterministic inference-time method that decomposes skill-context construction into anchor selection, support expansion, and payload exposure. This decomposition makes retrieved skills directly usable by rendering the resulting group plan as an execution contract with coverage debt over visible requirements.
- We evaluate GoSkills on SkillsBench and ALFWorld. Results show that it preserves visible-requirement coverage, improves over flat skill-access baselines, and often improves reward and agent-only runtime relative to structural retrieval references.

## 2 Related Work

#### Tool and skill retrieval\.

Tool-augmented language models, API-retrieval systems, and tool-use benchmarks show that external tools can expand agent capabilities, while large tool or skill collections make retrieval necessary (Schick et al., [2023](https://arxiv.org/html/2605.06978#bib.bib1); Mialon et al., [2023](https://arxiv.org/html/2605.06978#bib.bib2); Patil et al., [2024](https://arxiv.org/html/2605.06978#bib.bib3); Qin et al., [2024](https://arxiv.org/html/2605.06978#bib.bib4); Shi et al., [2025](https://arxiv.org/html/2605.06978#bib.bib5); Huo et al., [2026](https://arxiv.org/html/2605.06978#bib.bib35); Yao et al., [2023](https://arxiv.org/html/2605.06978#bib.bib15); Zhuang et al., [2023](https://arxiv.org/html/2605.06978#bib.bib16); Li et al., [2023](https://arxiv.org/html/2605.06978#bib.bib33)). Prior skill repositories and benchmarks emphasize packaging, discovering, and evaluating reusable skills across diverse agent tasks and library settings (Wang et al., [2023](https://arxiv.org/html/2605.06978#bib.bib34); Zhang et al., [2026](https://arxiv.org/html/2605.06978#bib.bib36); Li et al., [2026c](https://arxiv.org/html/2605.06978#bib.bib6); Liang et al., [2026](https://arxiv.org/html/2605.06978#bib.bib7); Li et al., [2026b](https://arxiv.org/html/2605.06978#bib.bib8)). However, retrieving relevant skills is not equivalent to presenting usable context: a retrieved bundle may still leave the execution entry point, support roles, visible constraints, and failure-avoidance guidance implicit. Our work targets this interface layer: after relevant skills have been found, it decides how they should be exposed to the agent.

#### Structured retrieval\.

Graph-structured retrieval has been studied for documents, memory, and tool access, where relations help retrieval move beyond independent nearest-neighbor matching (Edge et al., [2024](https://arxiv.org/html/2605.06978#bib.bib9); Gutierrez et al., [2024](https://arxiv.org/html/2605.06978#bib.bib10); Liu et al., [2024b](https://arxiv.org/html/2605.06978#bib.bib11), [a](https://arxiv.org/html/2605.06978#bib.bib12)). In skill settings, such structure can recover prerequisites, setup utilities, preprocessors, and formatters that may not be lexically salient. GoSkills builds on this structural view but shifts the decision unit from individual skills to small role-aware groups, then expands an anchor group into support groups before bottlenecking the final payloads for the agent.

#### Concurrent work\.

Concurrent work uses structure at different stages of skill use. Graph of Skills retrieves dependency-aware execution bundles through graph construction, seeding, diffusion, reranking, and hydration (Li et al., [2026a](https://arxiv.org/html/2605.06978#bib.bib13)), while GRASP studies structured skill composition and execution-time repair (Xia et al., [2026](https://arxiv.org/html/2605.06978#bib.bib14)). In contrast, GoSkills uses structure before execution: it performs group-level retrieval and expansion, renders a role-labeled context, and leaves the downstream execution loop unchanged.

## 3 Methodology

Group of Skills (GoSkills) is an inference-time group-level retrieval and contextualization method for skill-augmented coding agents. Its goal is to change the agent-facing retrieval object from a flat list of atomic skills to a compact, role-labeled execution context. Unlike methods that stop after ranking or hydrating atomic skills, GoSkills uses atomic retrieval only as evidence for activating group-level retrieval units. As shown in Figure [2](https://arxiv.org/html/2605.06978#S3.F2), GoSkills first builds reusable skill groups offline. At inference time, it retrieves an anchor group, expands it with support groups, bottlenecks the selected groups into a bounded set of atomic skill payloads, and renders them as an execution contract. Thus, any change in downstream behavior comes from how retrieved skills are organized and exposed, not from changing the agent, skill implementations, or environment.

We distinguish four objects. A *group* is an offline reusable local retrieval unit centered on a lead skill. A *group plan* $\mathcal{P}(q)$ is an ordered anchor-support structure selected for a query. $B(q)$ is the budgeted set of atomic skill payloads presented to the agent. $D(q,B)$ is any uncovered visible-requirement debt, and $C(q)$ is the deterministic contract rendered from $(\mathcal{P},B,D)$. The contribution of GoSkills is the group-level retrieval, expansion, bottlenecking, and rendering policy.

![Refer to caption](https://arxiv.org/html/2605.06978v1/Figures/framework.png)

Figure 2: Overview of GoSkills. The offline stage constructs a skill graph from the skill library, extracts anchor-centered skill groups, and stores reusable group templates. At inference time, GoSkills decomposes the task query, retrieves and scores candidate groups, selects an anchor group, expands support groups, bottlenecks the selected group plan into a bounded set of atomic skill payloads, and renders a compact role-labeled execution contract for the downstream agent.

### 3.1 Offline: Skill Groups and Group Graph

Let $\mathcal{S}=\{s_1,\ldots,s_n\}$ be a library of atomic skills. Each skill $s$ has a hydrated payload and a normalized facet set $F_s$ extracted from its name, metadata, tags, and text. We assume a typed skill graph

$$G_{\mathcal{S}}=(\mathcal{S},E,w,\phi), \tag{1}$$

where edges encode dependency, workflow, semantic, artifact, or alternative relations.

Offline, GoSkills constructs a pool $\mathcal{G}$ of small anchor-centered groups. Each group is represented as

$$g=\left\langle s_g^{\mathrm{lead}},M_g,R_g,F_g^{+},F_g^{\mathrm{opt}},F_g^{-},A_g,V_g,T_g,\pi_g\right\rangle, \tag{2}$$

where $s_g^{\mathrm{lead}}$ is the lead skill, $M_g$ are at most two support members, $R_g$ assigns member roles, $F_g^{+}$, $F_g^{\mathrm{opt}}$, and $F_g^{-}$ encode required, optional, and negative applicability facets, $A_g$ and $V_g$ store artifact and visible-requirement cues, $T_g$ records the local topology, and $\pi_g$ is a fixed group prior. A group is not an executable hidden program; it is a bounded role template whose members may later be expanded, pruned, and rendered under a prompt budget.

GoSkills also builds a typed group graph

$$\mathcal{H}=(\mathcal{G},E_{\mathcal{G}},\rho_{\mathcal{G}},\omega), \tag{3}$$

where $\rho_{\mathcal{G}}$ labels each group edge as support, artifact, visible-check, fallback, or incompatibility evidence, and $\omega$ assigns edge weights. During online expansion, GoSkills follows only positive non-incompatibility edges:

$$\mathcal{N}_{\mathcal{H}}^{+}(\mathcal{P},q)=\left\{g'\notin\mathcal{P}_{\mathrm{set}}:\exists g\in\mathcal{P}_{\mathrm{set}},\ (g,g')\in E_{\mathcal{G}},\ \rho_{\mathcal{G}}(g,g')\neq\mathit{incompat},\ \omega(g,g')>0\right\}. \tag{4}$$

This graph lifts atomic skill relations to reusable group-level relations. Instead of repeatedly rediscovering support skills from the atomic graph at inference time, GoSkills can expand from an anchor group to nearby support groups that already encode role, artifact, and visible-requirement structure. In this way, the group graph turns low-level skill connectivity into a reusable retrieval substrate for agent-facing context organization. Construction details are given in Appendix [B.2](https://arxiv.org/html/2605.06978#A2.SS2).
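As a minimal sketch of the positive neighborhood in Eq. (4), the snippet below filters group-graph edges by label and weight. The `GroupEdge` record, the label strings, the example group names, and the treatment of edges as directed pairs $(g, g')$ are all illustrative assumptions, not the authors' data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupEdge:
    """Hypothetical typed, weighted edge (g, g') of the group graph."""
    src: str
    dst: str
    label: str    # e.g. "support", "artifact", "visible_check", "fallback", "incompat"
    weight: float


def positive_neighborhood(plan_set, edges):
    """Groups outside the plan reached by a positive, non-incompatibility edge."""
    return {
        e.dst
        for e in edges
        if e.src in plan_set
        and e.dst not in plan_set
        and e.label != "incompat"
        and e.weight > 0
    }


edges = [
    GroupEdge("g_parse", "g_format", "support", 0.8),
    GroupEdge("g_parse", "g_legacy", "incompat", 0.9),     # excluded: incompatibility
    GroupEdge("g_parse", "g_check", "visible_check", 0.0),  # excluded: non-positive weight
    GroupEdge("g_format", "g_parse", "support", 0.5),       # excluded: source not in plan
]
neighbors = positive_neighborhood({"g_parse"}, edges)
```

Here only `g_format` survives the three filters, which mirrors the three conditions inside the set-builder expression.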

### 3.2 Online: Group Retrieval and Contextualized Exposure

Given a query $q$, GoSkills produces

$$\Omega(q)=\big(\mathcal{P}(q),B(q),D(q,B),C(q)\big). \tag{5}$$

The group plan is an ordered anchor-support object:

$$\begin{aligned}\mathcal{P}(q)&=\langle g_a(q),\mathcal{P}_{\mathrm{sup}}(q)\rangle,\\ \mathcal{P}_{\mathrm{set}}(q)&=\{g_a(q)\}\cup\mathcal{P}_{\mathrm{sup}}(q).\end{aligned} \tag{6}$$

$B(q)\subseteq\mathcal{S}$ is the presented atomic skill set, $D(q,B)$ is remaining coverage debt, and $C(q)$ is the rendered execution contract. The plan is constrained by a context budget $\tau$; in our implementation, payload count is the primary budget and character-level estimates act as a guard.

#### Query schema\.

The query is mapped to a deterministic schema

$$\psi(q)=\left\langle F_{\mathrm{core}},F_{\mathrm{tech}},F_{\mathrm{op}},F_{\mathrm{artifact}},F_{\mathrm{constraint}},F_{\mathrm{failure}},F_{\mathrm{check}}\right\rangle. \tag{7}$$

Let

$$\mathcal{S}_g=\{s_g^{\mathrm{lead}}\}\cup M_g. \tag{8}$$

The schema summarizes task goals, technology anchors, operations, artifacts, constraints, failure cues, and visible-requirement semantics. Visible requirements are task properties available before execution, such as exact output formats, required artifacts, deterministic behavior, unit tests, or formal proof obligations. Schema extraction uses the task prompt, public task files when provided, skill metadata, and skill text; it does not access hidden tests, evaluator internals, or previous failure traces.

#### Group activation and scoring\.

Atomic retrieval evidence may be used, but it is not the final retrieved object. Let

$$R_0(q)=\{s_{(1)},\ldots,s_{(M)}\} \tag{9}$$

be an optional seed set returned by a vector or graph retriever. Candidate groups are activated by direct query–group matches and by overlap with seed skills:

$$\mathcal{G}_q=\textsc{DirectGroupMatches}(\psi(q),\mathcal{G})\cup\{g\in\mathcal{G}:\mathcal{S}_g\cap R_0(q)\neq\emptyset\}. \tag{10}$$

Thus $R_0(q)$ serves as evidence for group activation, while the final retrieval decision is made over groups rather than individual skills.
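The activation rule in Eq. (10) is a set union, which the following sketch makes concrete. The direct matches are passed in as a precomputed set because the text does not specify how `DirectGroupMatches` is implemented; the function name `activate_groups` and the example groups and skills are hypothetical.

```python
def activate_groups(direct_matches, group_members, seed_skills):
    """Union of direct query-group matches and groups that share a skill with the seed set."""
    seeds = set(seed_skills)
    seeded = {g for g, members in group_members.items() if seeds & set(members)}
    return set(direct_matches) | seeded


# Hypothetical group pool: group id -> lead skill plus support members (S_g).
group_members = {
    "g_csv": ["pandas_io", "csv_check"],
    "g_json": ["json_writer", "schema_check"],
    "g_plot": ["matplotlib_plot"],
}
candidates = activate_groups(
    direct_matches={"g_csv"},
    group_members=group_members,
    seed_skills=["json_writer", "pandas_io"],
)
```

In this example `g_json` is activated only through the seed skill `json_writer`, while `g_plot` stays inactive because it neither matches directly nor overlaps the seeds.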

The selection policy is fixed and deterministic rather than learned. It uses three linear scores:

$$\begin{aligned}U_{\mathrm{grp}}(g,q)&=\boldsymbol{\beta}^{\top}\mathbf{x}(g,q)+\lambda_{\pi}\pi_g,\\ U_{\mathrm{sup}}(g\mid\mathcal{P},q)&=\boldsymbol{\eta}^{\top}\mathbf{z}(g,\mathcal{P},q),\\ U_{\mathrm{bot}}(s\mid B,\mathcal{P},q)&=\boldsymbol{\gamma}^{\top}\mathbf{h}(s,B,\mathcal{P},q),\end{aligned} \tag{11}$$

where $U_{\mathrm{grp}}$ ranks candidate groups, $U_{\mathrm{sup}}$ adds marginal support groups, and $U_{\mathrm{bot}}$ selects final atomic skills for presentation. The feature families include retriever relevance, facet coverage, anchor match, visible-check support, graph connectivity, redundancy, negative applicability, and cost. We keep the scoring policy fixed across tasks, benchmarks, and backbone models; full coefficients and rule definitions are in Appendix [D.1](https://arxiv.org/html/2605.06978#A4.SS1).

#### Anchor and support\-group expansion\.

Candidate groups are ranked by $U_{\mathrm{grp}}$, and the top $L(q)$ groups are retained as the shortlist $\widehat{\mathcal{G}}_q$. Here $L(q)$ is a capped shortlist size determined by query complexity and retrieval ambiguity. The anchor group is selected by

$$g_a=\arg\max_{g\in\widehat{\mathcal{G}}_q}\left[U_{\mathrm{grp}}(g,q)+\lambda_a\mathrm{Anchor}(g,q)\right], \tag{12}$$

where $\mathrm{Anchor}(g,q)$ promotes lead skills matching explicit technology or artifact anchors and suppresses generic or incompatible leads. Starting from $\mathcal{P}_0=\langle g_a,\emptyset\rangle$, GoSkills greedily adds support groups from the retained shortlist and the group-graph neighborhood:

$$g^{\star}=\arg\max_{g\in\left(\widehat{\mathcal{G}}_q\cup\mathcal{N}_{\mathcal{H}}^{+}(\mathcal{P},q)\right)\setminus\mathcal{P}_{\mathrm{set}}}U_{\mathrm{sup}}(g\mid\mathcal{P},q). \tag{13}$$

Support expansion stops when the marginal gain, group budget, or context guard is exhausted. The implementation caps the selected group plan at three groups.
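The anchor choice of Eq. (12) and the greedy expansion of Eq. (13) can be sketched as follows. The scores are supplied as plain dictionaries and callables because the actual feature vectors and coefficients live in the appendix; `build_plan`, `lam_a`, `min_gain`, and the toy numbers are assumptions for illustration, and the context guard is omitted. Only the cap of three groups comes from the text.

```python
def build_plan(shortlist, u_grp, anchor_bonus, u_sup, neighbors,
               lam_a=1.0, max_groups=3, min_gain=0.0):
    """Pick an anchor group, then greedily append support groups."""
    # Eq. (12): anchor = argmax of group score plus weighted anchor bonus.
    anchor = max(shortlist, key=lambda g: u_grp[g] + lam_a * anchor_bonus[g])
    plan = [anchor]
    while len(plan) < max_groups:
        # Eq. (13): candidates come from the shortlist and the positive neighborhood.
        candidates = (set(shortlist) | neighbors(set(plan))) - set(plan)
        if not candidates:
            break
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda g: u_sup(g, plan))
        if u_sup(best, plan) <= min_gain:
            break  # marginal gain exhausted
        plan.append(best)
    return plan


# Toy inputs (all values hypothetical).
gains = {"g2": -0.1, "g3": 0.4}
plan = build_plan(
    shortlist=["g1", "g2"],
    u_grp={"g1": 0.6, "g2": 0.7},
    anchor_bonus={"g1": 0.3, "g2": 0.0},
    u_sup=lambda g, current_plan: gains[g],
    neighbors=lambda plan_set: {"g3"} if "g1" in plan_set else set(),
)
```

With these inputs the anchor bonus lifts `g1` above `g2`, the neighbor `g3` is added as support, and expansion stops at `g2` because its marginal gain is not positive.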

#### Bottlenecking and coverage\-safe backfill\.

The selected group plan is internal. The downstream agent receives only $B(q)$ and the contract. Let

$$\mathcal{S}_{\mathcal{P}}=\bigcup_{g\in\mathcal{P}_{\mathrm{set}}}\mathcal{S}_g. \tag{14}$$

GoSkills inserts lead skills first and then selects remaining presented skills with $U_{\mathrm{bot}}$, subject to $\mathrm{cost}(B)\leq\tau$. This bottleneck uses group structure for planning without exposing all group members.

To avoid losing explicit requirements, GoSkills computes coverage debt

$$D(q,B)=\mathcal{F}_{\mathrm{high}}(q)\setminus\bigcup_{s\in B}F_s, \tag{15}$$

where $\mathcal{F}_{\mathrm{high}}(q)$ contains exact high-confidence facets such as framework names, file extensions, named APIs, output formats, and explicit constraints. Eligible skills from the activated support universe are backfilled only if they cover current debt, pass negative-applicability checks, and fit the remaining budget. Remaining debt is reported rather than silently repaired. Detailed accounting is in Appendix [B.4](https://arxiv.org/html/2605.06978#A2.SS4).
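To illustrate Eq. (15) and the three backfill conditions, the sketch below computes coverage debt as a set difference and adds candidate skills only when they reduce current debt, are not blocked, and fit a payload-count budget. The names `coverage_debt` and `backfill`, the `blocked` set standing in for negative-applicability checks, and the example facets are assumptions; candidate ordering is taken as given.

```python
def coverage_debt(high_facets, presented, facets):
    """High-confidence query facets not covered by any presented skill (Eq. 15)."""
    covered = set().union(*(facets[s] for s in presented))
    return set(high_facets) - covered


def backfill(high_facets, presented, candidates, facets, blocked, budget):
    """Add candidates that cover current debt, are not blocked, and fit the budget."""
    presented = list(presented)
    for s in candidates:
        debt = coverage_debt(high_facets, presented, facets)
        if not debt or len(presented) >= budget:
            break
        if s not in presented and s not in blocked and facets[s] & debt:
            presented.append(s)
    # Remaining debt is returned so it can be reported, not silently repaired.
    return presented, coverage_debt(high_facets, presented, facets)


facets = {
    "pandas_io": {"pandas", ".csv"},
    "json_writer": {"json"},
    "yaml_writer": {"yaml"},
    "legacy_json": {"json"},
}
presented, debt = backfill(
    high_facets={"pandas", ".csv", "json", "pytest"},
    presented=["pandas_io"],
    candidates=["legacy_json", "yaml_writer", "json_writer"],
    facets=facets,
    blocked={"legacy_json"},
    budget=3,
)
```

In this toy case `legacy_json` is rejected by the block list, `yaml_writer` covers no outstanding facet, `json_writer` is backfilled, and the `pytest` facet remains as reported debt.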

#### Execution contract\.

The final output is a deterministic execution contract

$$C(q)=\left\langle C_{\mathrm{start}},C_{\mathrm{support}},C_{\mathrm{check}},C_{\mathrm{avoid}},B(q),D(q,B)\right\rangle. \tag{16}$$

The contract names the anchor skill, labels support skills, lists visible requirements, states negative guidance, and includes the hydrated payloads for the presented atomic skills. The template is fixed across tasks; task-specific content enters only through the selected groups, presented skills, query schema, and coverage debt. GoSkills does not bind arguments, check preconditions, execute skills, or perform runtime graph repair.
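One way to picture the contract of Eq. (16) is as a fixed text template filled from the selected plan. The layout below is purely hypothetical: the text names the Start, Support, Check, and Avoid fields but does not give the authors' wording, so `render_contract`, the line format, and the example content are assumptions.

```python
def render_contract(start, support, checks, avoid, payloads, debt):
    """Render a fixed four-field contract, then open debt and skill payloads."""
    def join(items):
        return ", ".join(items) if items else "none"

    lines = [
        f"Start: {start}",
        f"Support: {join(support)}",
        f"Check: {join(checks)}",
        f"Avoid: {join(avoid)}",
        f"Uncovered requirements: {join(sorted(debt))}",
        "Skills:",
    ]
    # Payloads are appended in insertion order, lead skill first.
    lines += [f"- {name}: {text}" for name, text in payloads.items()]
    return "\n".join(lines)


contract = render_contract(
    start="pandas_io",
    support=["json_writer"],
    checks=["output must be valid JSON"],
    avoid=["do not use legacy_json"],
    payloads={
        "pandas_io": "Load CSV files with pandas.",
        "json_writer": "Serialize records to JSON.",
    },
    debt={"pytest"},
)
```

Because the template is fixed, two queries differ only in the values substituted into these fields, which matches the statement that task-specific content enters through the selected groups, presented skills, query schema, and coverage debt.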

#### Cost\.

Offline group construction is run once per skill library. Online, with inverted indexes over group members and normalized facets, the overhead is controlled by the activated group neighborhood rather than the full skill library. Since the group shortlist size, selected group count, group size, and presented-skill budget are all capped, the online cost is small relative to downstream agent execution. Full algorithmic details and complexity terms are reported in Appendix [B.3](https://arxiv.org/html/2605.06978#A2.SS3).

## 4 Experiments

### 4.1 Setup

We evaluate GoSkills on SkillsBench and ALFWorld, covering both technical skill selection with deterministic checks and multi-turn interactive task execution. These benchmarks test whether group-structured retrieval improves agent-facing skill use under a constrained context budget. Unless otherwise stated, all experiments use the same retrieval budgets, scoring weights, downstream agent loop, execution environment, and skill payloads; full implementation settings are provided in Appendix [E.10](https://arxiv.org/html/2605.06978#A5.SS10).

#### Baselines\.

We compare against four skill-access settings: *No Skills*, which provides no skill context; *Vanilla Skills* (Anthropic, [2026](https://arxiv.org/html/2605.06978#bib.bib24)), which prepends the full skill library; *Vector Skills* ([OpenAI](https://arxiv.org/html/2605.06978#bib.bib25)), which retrieves a flat semantic top-$k$ list; and *Graph of Skills* (Li et al., [2026a](https://arxiv.org/html/2605.06978#bib.bib13)), which hydrates dependency-aware bundles from a typed skill graph. In contrast, GoSkills retrieves anchor-centered groups, expands support groups through a group graph, and renders the selected context as a role-labeled execution contract. Details are shown in Appendix [E.7](https://arxiv.org/html/2605.06978#A5.SS7).

#### Models\.

Our experiments used multiple models, including Gemini 3 Pro (Pichai et al., [2025](https://arxiv.org/html/2605.06978#bib.bib23)), MiniMax M2.7 (MiniMax, [2026](https://arxiv.org/html/2605.06978#bib.bib19)), GPT-5.4 (OpenAI, [2026](https://arxiv.org/html/2605.06978#bib.bib20)), Claude Sonnet 4.5 (Anthropic, [2025](https://arxiv.org/html/2605.06978#bib.bib18)), Qwen 3.5 (Qwen Team, [2026](https://arxiv.org/html/2605.06978#bib.bib21)), and Kimi K2.5 (Moonshot AI, [2026](https://arxiv.org/html/2605.06978#bib.bib22)). Each model–method–benchmark combination was run three times, and Table [1](https://arxiv.org/html/2605.06978#S4.T1) reports the mean over the three runs. Reward is averaged over tasks within each run and then averaged across runs. Token usage reports mean input tokens, and runtime reports mean agent-only task-processing time, excluding environment setup. Appendix [E.1](https://arxiv.org/html/2605.06978#A5.SS1) describes the benchmark protocol, while Appendix [E.2](https://arxiv.org/html/2605.06978#A5.SS2) reports valid-run denominators and aggregation provenance for Table [1](https://arxiv.org/html/2605.06978#S4.T1).
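The two-level reward aggregation described above (mean over tasks within a run, then mean across runs) can be written in a few lines. The per-task rewards here are made-up values for illustration only and are not taken from the paper's experiments.

```python
from statistics import mean


def aggregate_reward(runs):
    """Average reward over tasks within each run, then average across runs."""
    return mean([mean(task_rewards) for task_rewards in runs])


# Three hypothetical runs with binary per-task rewards.
runs = [
    [1.0, 0.0, 1.0, 0.0],  # run mean 0.50
    [1.0, 1.0, 0.0, 0.0],  # run mean 0.50
    [0.0, 0.0, 1.0, 0.0],  # run mean 0.25
]
reward = aggregate_reward(runs)
```

When every run has the same number of valid tasks, this equals the pooled mean over all task outcomes; the two can differ when valid-run denominators vary, which is why those denominators are reported separately.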

Table 1: Aggregate downstream results by benchmark, method, and agent backbone. Reward denotes average reward (%), Tokens denotes mean input tokens, and Runtime denotes agent-only runtime (s). For Reward, larger is better; for Tokens and Runtime, smaller is better. Values are averaged over three runs. Symbol columns indicate skill-library access (Skill), bounded retrieval (Ret.), and group-structured context (Group); ✓✗ indicates dependency-graph structure without group roles. Colored arrow deltas in skill-access rows are relative to No Skills under the same benchmark and backbone; green/red indicates improvement/regression under the metric direction. The best value for each metric within each benchmark–model block is highlighted in bold; No Skills is a no-context reference and participates in metric highlighting.

| Model | Method | Skill | Ret. | Group | SkillsBench Reward ↑ | SkillsBench Tokens ↓ | SkillsBench Runtime ↓ | ALFWorld Reward ↑ | ALFWorld Tokens ↓ | ALFWorld Runtime ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3 Pro | No Skills | ✗ | ✗ | ✗ | 14.8 | 724,306 | 431.6 | 82.1 | 1,086,420 | 60.9 |
| Gemini 3 Pro | Vanilla Skills | ✓ | ✗ | ✗ | 25.0 (↑10.2) | 967,791 (↑243.5k) | 465.8 (↑34.2) | 89.3 (↑7.2) | 1,524,401 (↑438.0k) | 53.2 (↓7.7) |
| Gemini 3 Pro | Vector Skills | ✓ | ✓ | ✗ | 19.3 (↑4.5) | 894,640 (↑170.3k) | 357.3 (↓74.3) | 93.6 (↑11.5) | 28,407 (↓1.1M) | 37.8 (↓23.1) |
| Gemini 3 Pro | Graph of Skills | ✓ | ✓ | ✓✗ | 31.0 (↑16.2) | 864,577 (↑140.3k) | 366.2 (↓65.4) | 95.0 (↑12.9) | 29,846 (↓1.1M) | 42.6 (↓18.3) |
| Gemini 3 Pro | GoSkills | ✓ | ✓ | ✓ | 38.6 (↑23.8) | 881,492 (↑157.2k) | 327.9 (↓103.7) | 97.9 (↑15.8) | 27,215 (↓1.1M) | 49.2 (↓11.7) |
| MiniMax M2.7 | No Skills | ✗ | ✗ | ✗ | 8.9 | 808,420 | 621.0 | 42.0 | 1,436,550 | 94.0 |
| MiniMax M2.7 | Vanilla Skills | ✓ | ✗ | ✗ | 17.2 (↑8.3) | 942,113 (↑133.7k) | 580.7 (↓40.3) | 47.1 (↑5.1) | 2,184,823 (↑748.3k) | 88.6 (↓5.4) |
| MiniMax M2.7 | Vector Skills | ✓ | ✓ | ✗ | 10.4 (↑1.5) | 852,881 (↑44.5k) | 552.9 (↓68.1) | 50.7 (↑8.7) | 66,109 (↓1.4M) | 73.4 (↓20.6) |
| MiniMax M2.7 | Graph of Skills | ✓ | ✓ | ✓✗ | 19.9 (↑11.0) | 560,442 (↓248.0k) | 518.4 (↓102.6) | 52.1 (↑10.1) | 67,884 (↓1.4M) | 71.5 (↓22.5) |
| MiniMax M2.7 | GoSkills | ✓ | ✓ | ✓ | 24.3 (↑15.4) | 867,452 (↑59.0k) | 402.5 (↓218.5) | 54.3 (↑12.3) | 65,227 (↓1.4M) | 68.8 (↓25.2) |
| GPT-5.4 | No Skills | ✗ | ✗ | ✗ | 18.6 | 612,870 | 748.4 | 85.7 | 1,024,680 | 91.5 |
| GPT-5.4 | Vanilla Skills | ✓ | ✗ | ✗ | 28.4 (↑9.8) | 832,786 (↑219.9k) | 686.8 (↓61.6) | 89.3 (↑3.6) | 1,435,614 (↑410.9k) | 83.3 (↓8.2) |
| GPT-5.4 | Vector Skills | ✓ | ✓ | ✗ | 25.0 (↑6.4) | 569,353 (↓43.5k) | 742.1 (↓6.3) | 92.9 (↑7.2) | 34,436 (↓990.2k) | 57.0 (↓34.5) |
| GPT-5.4 | Graph of Skills | ✓ | ✓ | ✓✗ | 36.4 (↑17.8) | 380,199 (↓232.7k) | 603.7 (↓144.7) | 93.6 (↑7.9) | 47,851 (↓976.8k) | 65.0 (↓26.5) |
| GPT-5.4 | GoSkills | ✓ | ✓ | ✓ | 48.9 (↑30.3) | 694,825 (↑82.0k) | 352.9 (↓395.5) | 95.3 (↑9.6) | 63,319 (↓961.4k) | 38.2 (↓53.3) |
| Claude Sonnet 4.5 | No Skills | ✗ | ✗ | ✗ | 20.8 | 712,460 | 605.4 | 84.7 | 1,208,300 | 82.4 |
| Claude Sonnet 4.5 | Vanilla Skills | ✓ | ✗ | ✗ | 29.6 (↑8.8) | 905,420 (↑193.0k) | 641.5 (↑36.1) | 91.4 (↑6.7) | 1,377,280 (↑169.0k) | 78.9 (↓3.5) |
| Claude Sonnet 4.5 | Vector Skills | ✓ | ✓ | ✗ | 26.2 (↑5.4) | 610,730 (↓101.7k) | 707.2 (↑101.8) | 94.3 (↑9.6) | 36,950 (↓1.2M) | 55.8 (↓26.6) |
| Claude Sonnet 4.5 | Graph of Skills | ✓ | ✓ | ✓✗ | 38.7 (↑17.9) | 421,884 (↓290.6k) | 571.9 (↓33.5) | 94.6 (↑9.9) | 49,120 (↓1.2M) | 62.7 (↓19.7) |
| Claude Sonnet 4.5 | GoSkills | ✓ | ✓ | ✓ | 46.8 (↑26.0) | 714,360 (↑1.9k) | 347.6 (↓257.8) | 96.4 (↑11.7) | 58,904 (↓1.1M) | 39.5 (↓42.9) |
| Qwen 3.5 | No Skills | ✗ | ✗ | ✗ | 11.5 | 754,900 | 640.2 | 61.4 | 1,318,760 | 98.1 |
| Qwen 3.5 | Vanilla Skills | ✓ | ✗ | ✗ | 20.9 (↑9.4) | 1,026,300 (↑271.4k) | 612.6 (↓27.6) | 69.3 (↑7.9) | 1,742,510 (↑423.8k) | 91.4 (↓6.7) |
| Qwen 3.5 | Vector Skills | ✓ | ✓ | ✗ | 18.1 (↑6.6) | 721,480 (↓33.4k) | 681.7 (↑41.5) | 72.9 (↑11.5) | 58,214 (↓1.3M) | 76.0 (↓22.1) |
| Qwen 3.5 | Graph of Skills | ✓ | ✓ | ✓✗ | 27.6 (↑16.1) | 505,220 (↓249.7k) | 589.3 (↓50.9) | 76.4 (↑15.0) | 69,778 (↓1.2M) | 70.6 (↓27.5) |
| Qwen 3.5 | GoSkills | ✓ | ✓ | ✓ | 33.5 (↑22.0) | 756,930 (↑2.0k) | 421.7 (↓218.5) | 80.7 (↑19.3) | 74,502 (↓1.2M) | 55.3 (↓42.8) |
| Kimi K2.5 | No Skills | ✗ | ✗ | ✗ | 16.2 | 698,440 | 603.8 | 78.6 | 1,156,200 | 84.9 |
| Kimi K2.5 | Vanilla Skills | ✓ | ✗ | ✗ | 25.6 (↑9.4) | 980,115 (↑281.7k) | 590.7 (↓13.1) | 84.3 (↑5.7) | 1,604,870 (↑448.7k) | 76.2 (↓8.7) |
| Kimi K2.5 | Vector Skills | ✓ | ✓ | ✗ | 22.4 (↑6.2) | 642,288 (↓56.2k) | 662.5 (↑58.7) | 89.0 (↑10.4) | 41,908 (↓1.1M) | 58.4 (↓26.5) |
| Kimi K2.5 | Graph of Skills | ✓ | ✓ | ✓✗ | 33.8 (↑17.6) | 455,906 (↓242.5k) | 545.1 (↓58.7) | 90.7 (↑12.1) | 54,336 (↓1.1M) | 63.1 (↓21.8) |
| Kimi K2.5 | GoSkills | ✓ | ✓ | ✓ | 41.7 (↑25.5) | 705,812 (↑7.4k) | 338.4 (↓265.4) | 92.1 (↑13.5) | 60,755 (↓1.1M) | 40.8 (↓44.1) |

### 4\.2Main Results

We present the main results in Table [1](https://arxiv.org/html/2605.06978#S4.T1); run provenance and valid-run denominators are reported in Appendix [E.2](https://arxiv.org/html/2605.06978#A5.SS2). Across SkillsBench and ALFWorld, GoSkills improves over the *No Skills* setting for all evaluated backbones, showing that the retrieved group-structured context provides useful task guidance. Compared with *Vanilla Skills*, GoSkills achieves higher reward while avoiding full-library exposure. Compared with *Vector Skills* and the structural retrieval reference *Graph of Skills*, GoSkills yields a favorable aggregate reward–runtime tradeoff. It is not always token-minimal, since the fixed execution contract introduces structured prompt overhead, but it often reduces agent-only runtime by making the execution entry point, support roles, and visible requirements explicit. The runtime pattern is consistent with the intended role of the contract: it reduces the amount of organization the agent must perform during execution.

#### SkillsBench\.

On SkillsBench, GoSkills obtains the highest completed-run average reward for every evaluated backbone, showing that group-structured retrieval is especially useful for technical tasks that require selecting the right skills under a limited context budget. Compared with *Graph of Skills*, GoSkills improves both reward and agent-only runtime. For example, under GPT-5.4, reward increases from 36.4 to 48.9, while runtime drops from 603.7s to 352.9s. Under Claude Sonnet 4.5, reward increases from 38.7 to 46.8 and runtime drops from 571.9s to 347.6s.

These results suggest that the gain comes not only from retrieving relevant skills, but also from presenting them in a more usable form. SkillsBench tasks often contain explicit artifacts, output formats, public checks, or deterministic requirements. A dependency-aware bundle may include useful skills, but the agent still needs to infer the execution entry point and support roles. GoSkills makes these roles explicit through anchor selection, support expansion, bottlenecking, and the fixed START, SUPPORT, CHECK, and AVOID fields, which can reduce the burden of reorganizing retrieved context during execution.

#### ALFWorld\.

On ALFWorld, GoSkills also obtains the highest reward for every evaluated backbone, but the margins are smaller than on SkillsBench. This is expected because ALFWorld depends more on general multi-turn planning than on technical skill matching. Although the rendered contract does not always minimize token usage, it often reduces agent-only runtime, suggesting that explicit anchor, support, and check fields reduce the burden of inferring how retrieved skills should be used.

![Refer to caption](https://arxiv.org/html/2605.06978v1/Figures/model_method_reward_bars.png)

Figure 3: Method-wise reward comparison under each agent backbone. Each mini-panel fixes one model and compares retrieval settings; the top row reports SkillsBench and the bottom row reports ALFWorld. GoSkills is highlighted in blue and Graph of Skills in orange.
#### Retrieval\-gate results\.

Table 2: Retrieval-gate results over 40 annotated visible-requirement items per mode. Req. P, Req. Par., and Req. M denote requirement-level pass, partial, and miss. Hit denotes the average must-hit rate, and Skills denotes the average number of presented skill payloads.

| Mode | Req. P | Req. Par. | Req. M | Hit ↑ | Skills ↓ |
| --- | --- | --- | --- | --- | --- |
| `instruction_auto` | 40 | 0 | 0 | 1.00 | 3.10 |
| `critical_override` | 40 | 0 | 0 | 1.00 | 2.90 |

To isolate retrieval quality from downstream execution, we check whether the presented context contains all annotated `must_have` skills before agent execution. The gate contains 40 visible-requirement items per mode across the SkillsBench gate tasks. Table [2](https://arxiv.org/html/2605.06978#S4.T2) shows that GoSkills passes all 40 requirement items under each mode, achieving a 1.00 must-hit rate with fewer than four presented skills on average. Compared with the no-backfill ablation in Table [3](https://arxiv.org/html/2605.06978#S5.T3), this indicates that coverage-safe backfill helps preserve task-critical skills under a strict bottleneck.
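The gate check described above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: it assumes each gate item carries a set of annotated `must_have` skill names, that an item passes only when all of them are presented, and that the must-hit rate of an item is the fraction of its `must_have` skills that are presented. The function names and the example items are hypothetical.

```python
def gate_item(must_have, presented):
    """Classify one gate item and return (status, must-hit rate).

    Assumed definitions (not confirmed by the paper): 'pass' when every
    must_have skill is presented, 'miss' when none is, 'partial' otherwise.
    """
    if not must_have:
        return "pass", 1.0
    hit = len(must_have & presented) / len(must_have)
    if hit == 1.0:
        return "pass", hit
    return ("miss" if hit == 0.0 else "partial"), hit


def summarize(items):
    """Aggregate pass/partial/miss counts, mean must-hit, and mean presented size."""
    counts = {"pass": 0, "partial": 0, "miss": 0}
    hits, sizes = [], []
    for must_have, presented in items:
        status, hit = gate_item(must_have, presented)
        counts[status] += 1
        hits.append(hit)
        sizes.append(len(presented))
    n = len(hits)
    return {**counts, "hit": sum(hits) / n, "skills": sum(sizes) / n}


# Hypothetical gate items: (annotated must_have skills, presented skills).
items = [
    ({"fuzzy-match", "xlsx"}, {"fuzzy-match", "xlsx", "pdf-reading"}),
    ({"pdf-reading"}, {"pdf-reading", "xlsx"}),
]
report = summarize(items)
print(report)
```

Because the check runs before the agent acts, a low must-hit rate can be attributed to retrieval rather than to downstream execution.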

## 5Ablation Study

![Refer to caption](https://arxiv.org/html/2605.06978v1/Figures/LibraryData.png)

Figure 4: Sensitivity to library size on SkillsBench under GPT-5.4. Reward trends as the skill repository grows from 200 to 2,000 skills. GoSkills maintains the highest reward once the library reaches a moderate scale, while its runtime remains stable and significantly lower than that of competing methods even as the library continues to grow.

#### Sensitivity to library size.

Figure [4](https://arxiv.org/html/2605.06978#S5.F4) studies sensitivity to skill-library size on SkillsBench under GPT-5.4. As the library grows from 200 to 2,000 skills, *Vanilla Skills* becomes increasingly expensive because it exposes the full library, while *Vector Skills* is more affected by retrieval noise from additional distractors. In contrast, GoSkills maintains a strong reward trend while keeping runtime relatively stable, suggesting that group-level activation and bounded expansion make retrieval less sensitive to library growth.

#### Component ablations\.

Table [3](https://arxiv.org/html/2605.06978#S5.T3) reports component ablations and a contract-matched control. The control keeps the Graph of Skills payloads fixed and only renders them with the GoSkills contract. It improves reward from 36.4 to 41.2 and reduces runtime from 603.7s to 516.2s, but remains below full GoSkills, indicating that prompt formatting helps but does not explain the full gain.
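A quick arithmetic check makes the size of the formatting effect concrete. It uses only the SkillsBench figures reported in Table 3 for Graph of Skills, the contract-matched control, and full GoSkills. The "share" quantity is an illustrative decomposition introduced here, not an analysis from the paper.

```python
# Figures from Table 3 (SkillsBench): Graph of Skills, contract-matched control, full GoSkills.
gos_reward, control_reward, full_reward = 36.4, 41.2, 48.9
gos_runtime, control_runtime, full_runtime = 603.7, 516.2, 352.9


def share(baseline, control, full):
    """Fraction of the baseline-to-full change that the control already achieves."""
    return (control - baseline) / (full - baseline)


reward_share = share(gos_reward, control_reward, full_reward)
runtime_share = share(gos_runtime, control_runtime, full_runtime)
print(f"reward share: {reward_share:.1%}, runtime share: {runtime_share:.1%}")
```

Under this decomposition the contract alone accounts for roughly 38% of the reward gain and 35% of the runtime reduction, which is consistent with the statement that formatting helps but leaves most of the gain to group-level retrieval and bottlenecking.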

The largest drops come from removing anchor selection, group expansion, or the group graph, showing that GoSkills depends on a reliable entry point and structured support expansion. Removing coverage backfill lowers must-hit from 1.00 to 0.73, while removing role labels or the Avoid field preserves must-hit but worsens reward and runtime. Finally, the “Retrieved Skills Only” variant shows that flat skill presentation is insufficient; the gains require group-level retrieval, bottlenecking, and contract rendering.

Table 3: Ablation study on SkillsBench with a contract-matched control. All variants use the same presented-skill budget. The control keeps Graph of Skills payloads fixed and only applies the GoSkills contract. Deltas in parentheses are relative to full GoSkills.

| Variant | Reward ↑ | Tokens ↓ | Runtime (s) ↓ | Must-hit ↑ |
| --- | --- | --- | --- | --- |
| Graph of Skills | 36.4 (↓12.5) | 380,199 (↓314.6k) | 603.7 (↑250.8) | 0.84 |
| Graph of Skills + GoSkills Contract | 41.2 (↓7.7) | 534,079 (↓160.7k) | 516.2 (↑163.3) | 0.84 |
| GoSkills | 48.9 | 694,825 | 352.9 | 1.00 |
| w/o Anchor Selection | 40.6 (↓8.3) | 438,720 (↓256.1k) | 421.4 (↑68.5) | 0.82 |
| w/o Group Expansion | 41.8 (↓7.1) | 516,390 (↓178.4k) | 396.7 (↑43.8) | 0.78 |
| w/o Group Graph | 40.2 (↓8.7) | 552,713 (↓142.1k) | 439.1 (↑86.2) | 0.88 |
| w/o Role Labels | 44.3 (↓4.6) | 681,240 (↓13.6k) | 436.2 (↑83.3) | 1.00 |
| w/o Coverage Backfill | 42.1 (↓6.8) | 627,510 (↓67.3k) | 374.6 (↑21.7) | 0.73 |
| w/o Avoid Field | 46.2 (↓2.7) | 692,880 (↓1.9k) | 381.5 (↑28.6) | 1.00 |
| Retrieved Skills Only | 38.7 (↓10.2) | 492,640 (↓202.2k) | 468.9 (↑116.0) | 0.76 |

## 6Limitations

GoSkills is limited to inference-time retrieval and contextualization over a skill library. It does not train the downstream model, execute tools, bind arguments, or adapt after failures; when a required capability is absent from the library or disconnected from the group graph, the method can only report remaining coverage debt. The implementation depends on deterministic schema extraction, skill metadata, and visible requirements available before execution. It may be less effective on tasks with long setup chains, sparse metadata, ambiguous requirements, or hidden constraints outside the prompt. Same-retriever paired analyses should not be interpreted as uniform reward-dominance claims over structural retrieval. GoSkills can preserve success and reduce agent-only runtime on matched slices, but reward may trail flat structural rendering in some matched cases.

## 7Conclusion

This paper introduces group-structured skill retrieval for agent skill libraries. Instead of retrieving only individual skills or dependency-aware bundles, GoSkills retrieves anchor-centered skill groups, expands support groups through a group graph, preserves high-confidence visible-check requirements through budgeted backfill, and renders a compact execution contract. The resulting system leaves the downstream model and execution loop unchanged while changing the retrieval unit from atomic skills to role-aware skill groups. Experiments indicate that this retrieval unit preserves visible-requirement coverage under a small presented-skill budget and improves over flat skill-access baselines. Relative to structural retrieval references, GoSkills often improves reward and reduces agent-only runtime despite structured prompt overhead. We therefore interpret the efficiency result as a downstream usability effect: role-labeled context can reduce the agent’s burden of organizing retrieved skills, rather than uniformly minimizing token cost. Same-retriever paired analyses further support this interpretation on selected matched slices.

## References

- Anthropic (2025). Introducing Claude Sonnet 4.5. [https://www.anthropic.com/news/claude-sonnet-4-5](https://www.anthropic.com/news/claude-sonnet-4-5). Accessed: 2026-05-05.
- Anthropic (2026). Agent Skills. GitHub repository. [https://github.com/agentskills/agentskills](https://github.com/agentskills/agentskills). Accessed: 2026-05-06.
- D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, D. Metropolitansky, R. O. Ness, and J. Larson (2024). From local to global: a graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130.
- S. Geng, M. Josifoski, M. Peyrard, and R. West (2023). Grammar-constrained decoding for structured NLP tasks without finetuning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10932–10952.
- B. J. Gutierrez, Y. Shu, Y. Gu, M. Yasunaga, and Y. Su (2024). HippoRAG: neurobiologically inspired long-term memory for large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=hkujvAPVsg).
- D. Han, C. Couturier, D. M. Diaz, X. Zhang, V. Rühle, and S. Rajmohan (2025). LEGOMem: modular procedural memory for multi-agent LLM systems for workflow automation. arXiv:2510.04851. [Link](https://arxiv.org/abs/2510.04851).
- D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt (2021). Measuring coding challenge competence with APPS. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, J. Vanschoren and S. Yeung (Eds.), Vol. 1. [Link](https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/c24cd76e1ce41366a4bbe8a49b02a028-Paper-round2.pdf).
- Y. Huo, S. Zhang, K. Zeng, Y. Lu, C. Yang, Y. Guo, and X. Tang (2026). RepoShapley: shapley-enhanced context filtering for repository-level code completion. arXiv preprint arXiv:2601.03378.
- Y. Jiang, D. Li, H. Deng, B. Ma, X. Wang, Q. Wang, and G. Yu (2026). SoK: agentic skills–beyond tool use in LLM agents. arXiv preprint arXiv:2602.20867.
- M. Kang, W. Chen, D. Han, H. A. Inan, L. Wutschitz, Y. Chen, R. Sim, and S. Rajmohan (2025). Acon: optimizing context compression for long-horizon LLM agents. arXiv preprint arXiv:2510.00615.
- D. Li, Z. Li, H. Du, X. Wu, S. Gui, Y. Kuang, and L. Sun (2026a). Graph of skills: dependency-aware structural retrieval for massive agent skills. arXiv preprint arXiv:2604.05333.
- H. Li, C. Mu, J. Chen, S. Ren, Z. Cui, Y. Zhang, L. Bai, and S. Hu (2026b). Organizing, orchestrating, and benchmarking agent skills at ecosystem scale. arXiv preprint arXiv:2603.02176.
- M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y. Li (2023). API-Bank: a comprehensive benchmark for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3102–3116.
- X. Li, W. Chen, Y. Liu, S. Zheng, X. Chen, Y. He, Y. Li, B. You, H. Shen, J. Sun, et al. (2026c). SkillsBench: benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670.
- Y. Liang, R. Zhong, H. Xu, C. Jiang, Y. Zhong, R. Fang, J. Gu, S. Deng, Y. Yao, M. Wang, et al. (2026). Skillnet: create, evaluate, and connect AI skills. arXiv preprint arXiv:2603.04448.
- X. Liu, Z. Peng, X. Yi, X. Xie, L. Xiang, Y. Liu, and D. Xu (2024a). Toolnet: connecting large language models with massive tools via tool graph. arXiv preprint arXiv:2403.00839.
- Z. Liu, Z. Lai, Z. Gao, E. Cui, Z. Li, X. Zhu, L. Lu, Q. Chen, Y. Qiao, J. Dai, et al. (2024b). Controlllm: augment language models with tools by searching on graphs. In European Conference on Computer Vision, pp. 89–105.
- G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, et al. (2023). Augmented language models: a survey. arXiv preprint arXiv:2302.07842.
- MiniMax (2026). MiniMax M2.7: early echoes of self-evolution. [https://www.minimax.io/news/minimax-m27-en](https://www.minimax.io/news/minimax-m27-en). Accessed: 2026-05-05.
- Moonshot AI (2026). Kimi K2.5: AI that sees, codes, and works like an expert. [https://www.kimi.com/ai-models/kimi-k2-5](https://www.kimi.com/ai-models/kimi-k2-5). Accessed: 2026-05-05.
- OpenAI. text-embedding-3-large Model. [https://developers.openai.com/api/docs/models/text-embedding-3-large](https://developers.openai.com/api/docs/models/text-embedding-3-large). Accessed: 2026-05-06.
- OpenAI (2026). GPT-5.4 Model. [https://developers.openai.com/api/docs/models/gpt-5.4](https://developers.openai.com/api/docs/models/gpt-5.4). Accessed: 2026-05-05.
- S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez (2024). Gorilla: large language model connected with massive APIs. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37, pp. 126544–126565. [Document](https://dx.doi.org/10.52202/079017-4020), [Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/e4c61f578ff07830f5c37378dd3ecb0d-Paper-Conference.pdf).
- S. Pichai, D. Hassabis, and K. Kavukcuoglu (2025). A new era of intelligence with Gemini 3. [https://blog.google/products-and-platforms/products/gemini/gemini-3/](https://blog.google/products-and-platforms/products/gemini/gemini-3/). Accessed: 2026-05-05.
- Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun (2024). ToolLLM: facilitating large language models to master 16000+ real-world APIs. In The Twelfth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=dHng2O0Jjr).
- C. Qu, S. Dai, X. Wei, H. Cai, S. Wang, D. Yin, J. Xu, and J. Wen (2024). Towards completeness-oriented tool retrieval for large language models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 1930–1940.
- Qwen Team (2026). Qwen3.5: towards native multimodal agents. [https://qwen.ai/blog?id=qwen3.5](https://qwen.ai/blog?id=qwen3.5). Accessed: 2026-05-05.
- T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023). Toolformer: language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36, pp. 68539–68551. [Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/d842425e4bf79ba039352da0f658a906-Paper-Conference.pdf).
- Z. Shi, Y. Wang, L. Yan, P. Ren, S. Wang, D. Yin, and Z. Ren (2025). Retrieval models aren’t tool-savvy: benchmarking tool retrieval for large language models. In Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 24497–24524. [Link](https://aclanthology.org/2025.findings-acl.1258/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1258), ISBN 979-8-89176-256-5.
- M. Shridhar, X. Yuan, M. Cote, Y. Bisk, A. Trischler, and M. Hausknecht (2021). ALFWorld: aligning text and embodied environments for interactive learning. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=0IOX0YcCdTn).
- C. Wang, Z. Yu, X. Xie, W. Yao, R. Fang, S. Qiao, K. Cao, G. Zheng, X. Qi, P. Zhang, and S. Deng (2026). SkillX: automatically constructing skill knowledge bases for agents. arXiv:2604.04804. [Link](https://arxiv.org/abs/2604.04804).
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023). Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
- T. Xia, L. Hu, Y. Sun, M. Xu, L. Xu, S. Wang, W. Xu, and J. Jiang (2026). GraSP: graph-structured skill compositions for LLM agents. arXiv preprint arXiv:2604.17870.
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2023). ReAct: synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations. [Link](https://openreview.net/forum?id=WE_vluYUL-X).
- H. Zhang, Q. Long, J. Bao, T. Feng, W. Zhang, H. Yue, and W. Wang (2026). MemSkill: learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474.
- Y. Zhuang, Y. Yu, K. Wang, H. Sun, and C. Zhang (2023). ToolQA: a dataset for LLM question answering with external tools. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. [Link](https://openreview.net/forum?id=pV1xV2RK6I).

## Appendix AAppendix Overview

The appendix complements the main paper along four axes: method specification, implementation reproducibility, retrieval analysis, and trajectory-grounded qualitative analysis. Table [4](https://arxiv.org/html/2605.06978#A1.T4) summarizes the role of each appendix section.

Table 4: Appendix roadmap. Each section is written to support a specific claim or reproducibility requirement from the main paper.

| Section | Purpose |
| --- | --- |
| Additional Method Details | Defines the retrieval unit, offline group construction, online expansion, bottlenecking, coverage-debt accounting, contract template, and complexity terms. |
| Prompt and Interface Examples | Shows the fixed interface contract exposed to the downstream agent and explains how schema, support roles, checks, avoid rules, and remaining debt enter the prompt. |
| Implementation and Hyperparameters | Reports the fixed budgets, thresholds, feature families, coefficients, and deterministic operator rules used by GoSkills. |
| Additional Experimental Details | Documents the benchmark protocol, SkillsBench retrieval-gate analysis, task-level retrieval examples, paired trajectory results, and token/runtime accounting. |
| Failure and Error Analysis | Separates retrieval misses, partial coverage, downstream execution failures, context overhead, and infrastructure failures. |
| Qualitative Analysis | Provides trajectory-grounded qualitative examples comparing Graph of Skills baseline trajectories with GoSkills retrieval and paired trajectory evidence. |

## Appendix BAdditional Method Details

This section provides implementation-level details for GoSkills. The main text describes the offline/online structure and the key retrieval decisions; here we give the algorithms, scoring rules, contract template, and complexity terms needed for reproducibility.

### B\.1Comparison of Retrieval Units

Table 5: Comparison of agent-facing retrieval units. GoSkills differs from post-hoc bundle annotation by using role-aware groups before expansion and bottlenecking.

| Method | Retrieval object | Expansion object | Role timing | Agent-facing output |
| --- | --- | --- | --- | --- |
| Graph of Skills | Skill nodes scored by graph diffusion | Hydrated dependency-aware bundle | Mostly implicit or rendered after retrieval | Bounded skill bundle |
| Bundle reranking | Candidate skill set | Set-level rerank or pruning | Attached after set selection | Role-labeled bundle |
| GoSkills | Small local asymmetric group | Group-to-group support expansion | Used before scoring, expansion, and bottlenecking | Role contract plus payloads |

### B\.2Offline Group Construction

The offline stage constructs a reusable pool of small skill groups before any test-time query is observed. As shown in Algorithm [1](https://arxiv.org/html/2605.06978#alg1), its purpose is not to change the underlying skill graph, but to summarize local typed neighborhoods into compact, role-annotated group candidates that can later be retrieved and bottlenecked online. We treat each skill as a potential lead skill, enumerate bounded groups within its typed neighborhood, assign intra-group roles, and extract the facets, artifact signatures, and visible-check cues used by the online contextualizer. We discard incompatible or redundant groups to keep the pool interpretable and bounded. The resulting group pool $\mathcal{G}$, group graph $\mathcal{H}$, and inverted index $I$ are then reused across queries.

**Algorithm 1** BuildSkillGroupPool

**Require:** skill graph $G_{\mathcal{S}} = (\mathcal{S}, E, w, \phi)$, group size cap $K_{\max}$

**Ensure:** group pool $\mathcal{G}$, group graph $\mathcal{H}$, inverted index $I$

$$
\begin{array}{rl}
1: & \mathcal{G} \leftarrow \emptyset,\quad I \leftarrow \emptyset \\
2: & \textbf{for each}\ \text{lead skill}\ s \in \mathcal{S}\ \textbf{do} \\
3: & \quad N_s \leftarrow \operatorname{TypedNeighborhood}(s, E, K_{\max}) \\
4: & \quad P_s \leftarrow \operatorname{EnumerateGroups}(s, N_s, K_{\max}) \\
5: & \quad \textbf{for each}\ \text{candidate group}\ g \in P_s\ \textbf{do} \\
6: & \qquad R_g \leftarrow \operatorname{AssignRoles}(g, E, \phi) \\
7: & \qquad F_g^{+}, F_g^{\mathrm{opt}}, F_g^{-}, A_g, V_g \leftarrow \operatorname{ExtractGroupFacets}(g) \\
8: & \qquad \textbf{if}\ \operatorname{Compatible}(g)\ \textbf{and}\ \operatorname{NonRedundant}(g, \mathcal{G})\ \textbf{then} \\
9: & \qquad\quad \mathcal{G} \leftarrow \mathcal{G} \cup \{g\} \\
10: & \qquad\quad \operatorname{UpdateIndex}(I, g) \\
11: & \qquad \textbf{end if} \\
12: & \quad \textbf{end for} \\
13: & \textbf{end for} \\
14: & \mathcal{H} \leftarrow \operatorname{BuildGroupGraph}(\mathcal{G}, E, \phi) \\
15: & \textbf{return}\ \mathcal{G}, \mathcal{H}, I
\end{array}
$$
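The offline stage can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the typed neighborhood is taken to be the lead's outgoing typed edges, roles are read directly from edge types, the compatibility and redundancy checks reduce to "no repeated skill" and "no duplicate (lead, members) pair", and two groups are linked in the group graph when any skill-graph edge joins their members. All names (`Group`, `build_skill_group_pool`, the example skills and facets) are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Group:
    lead: str
    members: frozenset  # lead plus support skills
    roles: tuple        # sorted (skill, role) pairs
    facets: frozenset   # union of member facets


def build_skill_group_pool(skills, edges, facets, k_max=3):
    """skills: iterable of names; edges: (src, dst, edge_type) triples;
    facets: dict mapping skill -> set of facet strings."""
    pool, index, seen = [], {}, set()
    for lead in sorted(skills):
        # Typed neighborhood (assumed): outgoing typed edges of the lead.
        nbrs = sorted((dst, etype) for src, dst, etype in edges if src == lead)
        # Enumerate lead-centered groups with at most k_max skills in total.
        for size in range(k_max):
            for combo in combinations(nbrs, size):
                members = frozenset([lead, *(dst for dst, _ in combo)])
                key = (lead, members)
                # Assumed checks: no repeated skill, no duplicate group.
                if len(members) != size + 1 or key in seen:
                    continue
                seen.add(key)
                roles = tuple(sorted([(lead, "lead"), *combo]))
                group_facets = frozenset().union(
                    *(facets.get(s, set()) for s in members))
                gid = len(pool)
                pool.append(Group(lead, members, roles, group_facets))
                # Inverted index: facet -> ids of groups providing it.
                for facet in group_facets:
                    index.setdefault(facet, set()).add(gid)
    # Group graph (assumed rule): link groups joined by a skill-graph edge.
    graph = {gid: set() for gid in range(len(pool))}
    for a, b in combinations(range(len(pool)), 2):
        if any((s in pool[a].members and d in pool[b].members) or
               (s in pool[b].members and d in pool[a].members)
               for s, d, _ in edges):
            graph[a].add(b)
            graph[b].add(a)
    return pool, graph, index


skills = ["fuzzy-match", "pdf-reading", "xlsx"]
edges = [("fuzzy-match", "pdf-reading", "parser"),
         ("fuzzy-match", "xlsx", "parser")]
facets = {"fuzzy-match": {"approximate-matching"},
          "pdf-reading": {"pdf"},
          "xlsx": {"spreadsheet"}}
pool, graph, index = build_skill_group_pool(skills, edges, facets, k_max=3)
print(len(pool), sorted(index["pdf"]))
```

Exhaustive enumeration is only practical because the group size cap keeps each neighborhood small; the pool, graph, and index are built once and reused for every query.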

### B\.3Online Contextualization Algorithm

At inference time, GoSkills starts from the query and, optionally, a seed skill retriever. As shown in Algorithm [2](https://arxiv.org/html/2605.06978#alg2), the online procedure first extracts a lightweight query schema and uses the inverted index to identify candidate groups relevant to the retrieved skills and query facets. It then selects one anchor group and greedily adds a small number of support groups when they provide sufficient marginal utility under the budget $\tau$. After group selection, GoSkills performs the bottlenecking step: it exposes only a bounded set of atomic skills $B$, rather than passing all selected group members to the agent. Finally, any uncovered high-confidence facets are recorded as coverage debt $D$, optionally backfilled when budget remains, and surfaced in the execution contract $C$. This procedure therefore controls what the agent sees without executing skills, changing the environment, or modifying the base retriever.

**Algorithm 2** GoSkills Group Retrieval and Expansion

**Require:** query $q$, group pool $\mathcal{G}$, group graph $\mathcal{H}$, inverted index $I$, optional seed retriever $R_0$, budget $\tau$, backfill cap $c_{\max}$

**Ensure:** group plan $\mathcal{P}$, presented skills $B$, remaining debt $D$, execution contract $C$

$$
\begin{array}{rl}
1: & \psi \leftarrow \operatorname{ExtractSchema}(q),\quad \mathcal{F}_{\mathrm{high}} \leftarrow \operatorname{HighConfidenceFacets}(\psi) \\
2: & R_q^{0} \leftarrow \operatorname{RetrieveSeedSkills}(R_0, q),\quad B \leftarrow \emptyset \\
3: & \mathcal{G}_q \leftarrow \operatorname{CandidateGroups}(I, R_q^{0}, \psi) \\
4: & \mathcal{A}_q \leftarrow R_q^{0} \cup \bigcup_{g \in \mathcal{G}_q} \mathcal{S}_g \\
5: & \widehat{\mathcal{G}}_q \leftarrow \operatorname{TopGroups}(\mathcal{G}_q, U_{\mathrm{grp}}, L(q)) \\
6: & g_a \leftarrow \operatorname{SelectAnchor}(\widehat{\mathcal{G}}_q, U_{\mathrm{grp}}, \mathrm{Anchor}),\quad \mathcal{P} \leftarrow \langle g_a, \emptyset \rangle \\
7: & \textbf{while}\ \operatorname{CanAddGroup}(\mathcal{P}, \tau)\ \textbf{do} \\
8: & \quad \mathcal{E}_t \leftarrow (\widehat{\mathcal{G}}_q \cup \operatorname{GroupNeighbors}(\mathcal{H}, \mathcal{P})) \setminus \mathcal{P} \\
9: & \quad g^{\star} \leftarrow \operatorname{BestSupport}(\mathcal{E}_t, \mathcal{P}, U_{\mathrm{sup}}) \\
10: & \quad \textbf{if}\ g^{\star} = \bot\ \textbf{or}\ U_{\mathrm{sup}}(g^{\star} \mid \mathcal{P}, q) < \delta_{\mathrm{sup}}\ \textbf{then} \\
11: & \qquad \textbf{break} \\
12: & \quad \textbf{end if} \\
13: & \quad \mathcal{P} \leftarrow \mathcal{P} \cup \{g^{\star}\} \\
14: & \quad \mathcal{A}_q \leftarrow \mathcal{A}_q \cup \mathcal{S}_{g^{\star}} \\
15: & \textbf{end while} \\
16: & \mathcal{S}_{\mathcal{P}} \leftarrow \bigcup_{g \in \mathcal{P}} \mathcal{S}_g,\quad B \leftarrow \operatorname{InsertLeads}(\mathcal{P}, \tau) \\
17: & \textbf{while}\ \operatorname{CanAddSkill}(B, \tau)\ \textbf{do} \\
18: & \quad s^{\star} \leftarrow \operatorname{BestSkill}((\mathcal{S}_{\mathcal{P}} \cap \mathcal{A}_q) \setminus B, B, U_{\mathrm{bot}}) \\
19: & \quad \textbf{if}\ s^{\star} = \bot\ \textbf{or}\ U_{\mathrm{bot}}(s^{\star} \mid B, \mathcal{P}, q) < \delta_{\mathrm{bot}}\ \textbf{then} \\
20: & \qquad \textbf{break} \\
21: & \quad \textbf{end if} \\
22: & \quad B \leftarrow B \cup \{s^{\star}\} \\
23: & \textbf{end while} \\
24: & D \leftarrow \operatorname{CoverageDebt}(\mathcal{F}_{\mathrm{high}}, B) \\
25: & c_{\mathrm{back}} \leftarrow 0 \\
26: & \textbf{while}\ D \neq \emptyset\ \textbf{and}\ \operatorname{CanAddSkill}(B, \tau)\ \textbf{and}\ c_{\mathrm{back}} < c_{\max}\ \textbf{do} \\
27: & \quad s^{\star} \leftarrow \operatorname{BestBackfill}(\mathcal{A}_q \setminus B, D, B) \\
28: & \quad \textbf{if}\ s^{\star} = \bot\ \textbf{then} \\
29: & \qquad \textbf{break} \\
30: & \quad \textbf{end if} \\
31: & \quad B \leftarrow B \cup \{s^{\star}\},\quad c_{\mathrm{back}} \leftarrow c_{\mathrm{back}} + 1 \\
32: & \quad D \leftarrow \operatorname{CoverageDebt}(\mathcal{F}_{\mathrm{high}}, B) \\
33: & \textbf{end while} \\
34: & \mathcal{P} \leftarrow \operatorname{AnchorPrune}(\mathcal{P}, B, \psi) \\
35: & C \leftarrow \operatorname{FormatContract}(\mathcal{P}, B, D, \psi) \\
36: & \textbf{return}\ \mathcal{P}, B, D, C
\end{array}
$$
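The control flow of the online procedure (anchor selection, greedy support expansion, bottlenecking, then capped backfill) can be sketched as below. This is an illustration under strong simplifying assumptions: every utility ($U_{\mathrm{grp}}$, $U_{\mathrm{sup}}$, $U_{\mathrm{bot}}$) is replaced by a plain count of newly covered high-confidence facets, one threshold `delta` stands in for both $\delta_{\mathrm{sup}}$ and $\delta_{\mathrm{bot}}$, a hypothetical `max_groups` cap stands in for the group-budget check, and schema extraction, anchor pruning, and contract formatting are omitted. All names and example data are hypothetical.

```python
def covered(skills, skill_facets):
    """Union of the facets provided by a collection of skills."""
    out = set()
    for skill in skills:
        out |= skill_facets.get(skill, set())
    return out


def contextualize(high_facets, groups, neighbors, skill_facets, seed_skills=(),
                  tau=4, max_groups=3, c_max=1, delta=1):
    """groups: gid -> (lead, member set); neighbors: gid -> set of gids.
    Utilities are facet-coverage counts (an assumption, not the paper's scores)."""
    def gain(skills, have):
        return len((covered(skills, skill_facets) & high_facets) - have)

    def members(gids):
        return set().union(*(groups[g][1] for g in gids))

    candidates = set(seed_skills) | members(groups)   # plays the role of A_q
    # Anchor: group covering the most high-confidence facets (ties -> lowest id).
    plan = [max(sorted(groups), key=lambda g: gain(groups[g][1], set()))]
    # Greedy support expansion over group-graph neighbors of the plan.
    while len(plan) < max_groups:
        have = covered(members(plan), skill_facets)
        frontier = set().union(*(neighbors.get(g, set()) for g in plan)) - set(plan)
        best = max(sorted(frontier), key=lambda g: gain(groups[g][1], have),
                   default=None)
        if best is None or gain(groups[best][1], have) < delta:
            break
        plan.append(best)
    # Bottleneck: insert leads first, then members with positive marginal gain.
    presented = []
    for g in plan:
        lead = groups[g][0]
        if lead not in presented and len(presented) < tau:
            presented.append(lead)
    while len(presented) < tau:
        have = covered(presented, skill_facets)
        rest = sorted(members(plan) - set(presented))
        best = max(rest, key=lambda s: gain({s}, have), default=None)
        if best is None or gain({best}, have) < delta:
            break
        presented.append(best)
    # Coverage debt, then capped backfill from the wider candidate pool.
    debt = high_facets - covered(presented, skill_facets)
    used = 0
    while debt and len(presented) < tau and used < c_max:
        rest = sorted(candidates - set(presented))
        best = max(rest, key=lambda s: len(skill_facets.get(s, set()) & debt),
                   default=None)
        if best is None or not (skill_facets.get(best, set()) & debt):
            break
        presented.append(best)
        used += 1
        debt = high_facets - covered(presented, skill_facets)
    return plan, presented, debt


skill_facets = {"fuzzy-match": {"approximate-matching"}, "pdf-reading": {"pdf"},
                "xlsx": {"spreadsheet"}, "csv-export": {"csv"}}
groups = {0: ("fuzzy-match", {"fuzzy-match", "pdf-reading"}),
          1: ("xlsx", {"xlsx"}),
          2: ("csv-export", {"csv-export"})}
neighbors = {0: {1}, 1: {0}}
high = {"approximate-matching", "pdf", "spreadsheet", "csv"}
plan, presented, debt = contextualize(high, groups, neighbors, skill_facets)
tight = contextualize(high, groups, neighbors, skill_facets, tau=3)
print(plan, presented, debt)
```

In the example, the `csv` facet is not reachable through the group graph, so it is recovered only by backfill; with the tighter budget (`tau=3`) no slot remains and the facet is reported as remaining debt instead.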

### B\.4Coverage\-Debt Accounting

Let $B_{\mathrm{out}}$ and $D_{\mathrm{out}}$ be the outputs of Algorithm [2](https://arxiv.org/html/2605.06978#alg2). The coverage debt is

$$
D_{\mathrm{out}} = \mathcal{F}_{\mathrm{high}}(q) \setminus \bigcup_{s \in B_{\mathrm{out}}} F_s .
$$

Thus every high-confidence visible requirement not covered by the rendered skill payloads is explicitly reported as remaining debt. If $D_{\mathrm{out}} \neq \emptyset$, the backfill loop stopped because no eligible backfill was available, the context budget was exhausted, or the backfill cap was reached.
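The debt formula is a plain set difference, and the three stopping conditions can be told apart with a small diagnostic. The sketch below is illustrative only: the `stop_reason` helper, the order in which it tests the conditions, and the example data are assumptions rather than part of the paper's implementation.

```python
def coverage_debt(high_facets, presented, skill_facets):
    """D_out = F_high(q) minus the union of F_s over presented skills s."""
    covered = set()
    for skill in presented:
        covered |= skill_facets.get(skill, set())
    return set(high_facets) - covered


def stop_reason(debt, presented, candidates, skill_facets, tau, used, c_max):
    """Explain why debt remains; the check order (budget, cap, eligibility) is assumed."""
    if not debt:
        return "no debt"
    if len(presented) >= tau:
        return "budget exhausted"
    if used >= c_max:
        return "backfill cap reached"
    eligible = [s for s in candidates
                if s not in presented and skill_facets.get(s, set()) & debt]
    return "backfill still possible" if eligible else "no eligible backfill"


skill_facets = {"fuzzy-match": {"approximate-matching"},
                "pdf-reading": {"pdf"},
                "xlsx": {"spreadsheet"}}
high = {"approximate-matching", "pdf", "spreadsheet"}
presented = ["fuzzy-match", "xlsx"]
debt = coverage_debt(high, presented, skill_facets)
reason = stop_reason(debt, presented, ["pdf-reading"], skill_facets,
                     tau=2, used=0, c_max=1)
print(debt, reason)
```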

### B.5 Execution Contract Template

The execution contract template is fixed across tasks. Task-specific content enters only through the selected groups, presented skills, query schema, and coverage debt.

Table 6: Fixed execution-contract fields rendered by GoSkills. The field names remain constant across tasks; only the selected content changes.

| Field | Rendered role |
| --- | --- |
| Start | Anchor skill and why it should lead execution |
| Support | Support skills with roles such as prerequisite, parser, formatter, checker, or fallback |
| Check | Visible output formats, artifacts, tests, deterministic constraints, or proof obligations |
| Avoid | Negative-applicability warnings and generic misreadings |
| Skills | Hydrated payloads for the presented atomic skills |
| Debt | Remaining high-confidence visible requirements, if any |

#### Example rendered contract\.

Table [7](https://arxiv.org/html/2605.06978#A2.T7) shows a shortened contract rendered for a representative technical task. The example is abbreviated for readability: the actual prompt also includes the hydrated payloads of the selected atomic skills. The contract is deterministic and uses the same field names for all tasks; task-specific content enters only through the selected group plan, the presented skill set, the query schema, and the remaining coverage debt.

Table 7: Example rendered execution contract. The payload text is shortened for space.

| Field | Rendered content |
| --- | --- |
| START | Use `fuzzy-match` as the anchor skill because the task requires detecting suspicious invoice entries under approximate entity and string matching. |
| SUPPORT | `pdf-reading`: extract invoice text and table fields. `xlsx`: parse structured transaction records and preserve row-level identifiers. |
| CHECK | Preserve invoice IDs; produce the required structured output; verify suspicious entries against the visible task constraints. |
| AVOID | Do not treat exact string mismatch alone as fraud evidence. Do not ignore missing or malformed invoice fields. |
| SKILLS | Hydrated payloads for `fuzzy-match`, `pdf-reading`, and `xlsx`. |
| DEBT | None. |

### B.6 Complexity Details

Let $M = |R_{0}(q)|$, $\bar{\iota}$ be the average number of indexed groups per seed skill, $|\mathcal{G}_{q}|$ the number of activated candidate groups, $L$ the retained group shortlist size, $\bar{d}_{\mathcal{H}}$ the average group-graph degree, $P_{\max}$ the selected group cap, $K_{\max}$ the group size cap, and $b = |B|$ the presented-skill budget. Candidate activation costs $O(M\bar{\iota})$, group ranking costs $O(|\mathcal{G}_{q}| \log |\mathcal{G}_{q}|)$, support expansion costs $O(P_{\max}(L + \bar{d}_{\mathcal{H}})K_{\max})$, and bottlenecking costs $O(b P_{\max} K_{\max})$. Since $L$, $P_{\max}$, $K_{\max}$, and $b$ are capped, the online overhead is dominated by the activated group neighborhood rather than the full skill library.

## Appendix C Prompt and Interface Examples

#### Interface design.

GoSkills uses a fixed downstream interface rather than a task-specific prompt template. The retriever may use query normalization to expose task terms, but the final agent-facing content is determined by the selected group plan, presented skills, coverage debt, and fixed contract fields. This section shows the small set of interface components that matter for reproducibility. The purpose is not to claim that prompting alone is the method; rather, the prompt is the rendering surface through which group-level retrieval is made operational. This interface is not tuned per task. We do not manually edit START, SUPPORT, CHECK, AVOID, or DEBT fields for individual examples; the fields are filled deterministically from the selected group plan, bottlenecked skill payloads, query schema, and coverage-debt accounting.

Table 8: Interface components exposed by GoSkills. Task-specific content enters through retrieved groups, presented skills, and visible requirements; the field structure is fixed.

| Component | Role | Constraint |
| --- | --- | --- |
| Start | Names the anchor skill and why it should lead execution. | Exactly one primary entry point is rendered when a selected anchor contributes a presented skill. |
| Support | Lists support skills and their roles, such as parser, formatter, checker, prerequisite, or fallback. | Only skills selected by bottlenecking or backfill are exposed; unpresented group members are not silently implied. |
| Check | States visible output formats, artifacts, deterministic checks, or proof obligations. | Uses only high-confidence task-visible requirements, not hidden tests or evaluator state. |
| Avoid | Surfaces negative applicability and common misreadings. | Generated from negative facets and explicit task constraints; it is guidance, not a runtime blocker. |
| Debt | Reports high-confidence visible requirements still uncovered by the final presented skills. | Remaining debt is exposed instead of being silently repaired beyond the backfill budget. |

#### Fixed prompt skeleton\.

The following skeleton is fixed across tasks. Bracketed fields are filled only from Algorithm [2](https://arxiv.org/html/2605.06978#alg2): the selected group plan, presented skill payloads, query schema, and remaining coverage debt.

**GoSkills execution contract**

**START** Use `[anchor_skill]` first because `[anchor_reason]`. Inspect its source path before writing new code.

**SUPPORT** Use the following support skills only for their stated roles:

- `[support_skill_1]`: `[role_1]` – `[reason_1]`
- `[support_skill_2]`: `[role_2]` – `[reason_2]`

**CHECK** Before finalizing, verify the following visible requirements:

- `[visible_format_or_artifact]`
- `[visible_test_or_constraint]`

**AVOID** Do not follow these incompatible interpretations:

- `[negative_cue_1]`
- `[negative_cue_2]`

**SKILLS** `[hydrated_payloads_for_presented_skills]`

**DEBT** `[remaining_coverage_debt_or_None]`
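As an illustration of how such a fixed skeleton could be filled deterministically, the sketch below renders the fields from plain Python values. The function name `render_contract`, the `Support` record, and the exact line layout are assumptions made for this example; only the field names and fixed sentences come from the skeleton above, and the sample values come from Table 7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Support:
    """One support skill with its stated role and reason (hypothetical record)."""
    skill: str
    role: str
    reason: str


def render_contract(anchor, anchor_reason, supports, checks, avoids, payloads, debt):
    """Fill the fixed contract fields; field order never depends on the task."""
    lines = [
        "GoSkills execution contract",
        f"START Use {anchor} first because {anchor_reason}. "
        "Inspect its source path before writing new code.",
        "SUPPORT Use the following support skills only for their stated roles:",
    ]
    lines += [f"  {s.skill}: {s.role} - {s.reason}" for s in supports]
    lines.append("CHECK Before finalizing, verify the following visible requirements:")
    lines += [f"  {c}" for c in checks]
    lines.append("AVOID Do not follow these incompatible interpretations:")
    lines += [f"  {a}" for a in avoids]
    lines.append("SKILLS")
    lines += [f"  [{name}] {text}" for name, text in payloads.items()]
    # Remaining debt is always rendered, even when empty.
    lines.append("DEBT " + ("; ".join(sorted(debt)) if debt else "None"))
    return "\n".join(lines)


contract = render_contract(
    anchor="fuzzy-match",
    anchor_reason="the task requires detecting suspicious invoice entries",
    supports=[
        Support("pdf-reading", "parser", "extract invoice text and table fields"),
        Support("xlsx", "parser", "parse structured transaction records"),
    ],
    checks=["Preserve invoice IDs", "Produce the required structured output"],
    avoids=["Do not treat exact string mismatch alone as fraud evidence"],
    payloads={"fuzzy-match": "...", "pdf-reading": "...", "xlsx": "..."},
    debt=set(),
)
print(contract)
```

Because the function takes no task-specific branches, two tasks differ only in the values passed in, which is the property the fixed skeleton is meant to guarantee.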

#### Representative contract rendering\.

The following schematic excerpt illustrates the agent-facing form of the interface. The actual task-specific skill names and checks are filled by Algorithm [2](https://arxiv.org/html/2605.06978#alg2); the field order and interpretation are fixed across tasks.

- **Start.** Use the anchor skill first; inspect its source path before writing new code.
- **Support.** Use listed support skills only for their stated role (parser, formatter, checker, prerequisite, or fallback).
- **Check.** Preserve visible output formats, required artifacts, determinism, tests, and proof obligations.
- **Avoid.** Do not follow listed incompatible interpretations or generic workflow substitutions.
- **Debt.** If nonempty, explicitly account for uncovered visible requirements before finalizing.

#### Relation to retrieval\.

This interface differs from post\-hoc bundle annotation because role labels are used before the final payload is rendered\. The selected groups influence anchor choice, support\-group expansion, bottlenecking, backfill, and the contract fields\. As a result, the downstream agent receives a small set of atomic skill payloads together with the retrieval decision’s intended execution structure\. This is the mechanism evaluated by the retrieval\-gate and case\-study analyses\.

## Appendix D Implementation and Hyperparameters

This appendix reports the fixed implementation choices used in all experiments. The main algorithm, objectives, bottlenecking procedure, coverage-debt repair, and execution contract are described in Section [3](https://arxiv.org/html/2605.06978#S3). Here we only list implementation-level settings needed for reproducibility.

Table 9: Core implementation settings for GoSkills. These settings are fixed after development and kept unchanged across the reported test tasks.

| Component | Setting |
| --- | --- |
| Seed retrieval evidence | Lexical or graph-ranked atomic skills used for group activation |
| Group retrieval object | Anchor-centered skill groups |
| Group expansion | Group graph over support, artifact, visible-check, and fallback relations |
| Seed mode | Lexical retrieval by default |
| Returned skill budget | top-n=4 |
| Seed skill count | seed-top-k=4 |
| Maximum skill payload | 1,800 characters |
| Maximum rendered context | 9,000 characters |
| Maximum group size | 3 skills |
| Maximum selected groups | 3 groups |
| Backfill cap | At most 2 high-confidence candidate or support skills |
| Rendered output | Execution contract + presented skill payloads + remaining debt |
| Downstream agent | Unchanged |

Table 10: Main group-pool and selection hyperparameters. The full list of token dictionaries and environment-variable names is omitted for space; all reported values are fixed before test evaluation.

| Parameter | Value |
| --- | --- |
| Complexity weight | 0.60 |
| Ambiguity weight | 0.40 |
| Ambiguity gap / spread weights | 0.55 / 0.45 |
| Base pool cap minimum | 6 |
| Top-$n$ pool multiplier | 2 |
| Adaptive extra base | 1.0 |
| Adaptive difficulty multiplier | 2.0 |
| Candidate pool cap | 32 |
| Score-floor center | 0.55 |
| Score-floor difficulty slope | 0.30 |
| Score-floor absolute minimum | 0.10 |
| Minimum required floor / ceiling | 3 / 6 |
| Group selection minimum score | 0.14 |
| Support skill minimum score | 0.10 |
| Inter-group affinity threshold | 0.35 |

### D.1 Scoring Weights and Rule Definitions

All scores used by GoSkills are deterministic weighted sums over normalized features. Each scalar feature is clipped to $[0,1]$ before scoring. The feature vectors $\mathbf{x}$, $\mathbf{z}$, and $\mathbf{h}$ follow the column order in Table [12](https://arxiv.org/html/2605.06978#A4.T12); stage-inapplicable features are set to zero. Penalty terms use negative coefficients, and rows are not constrained to sum to one because feature families are computed on different clipped scales. The coefficients in Table [12](https://arxiv.org/html/2605.06978#A4.T12) are selected during development and kept fixed across tasks, benchmarks, backbone models, and test splits.

Table 11: Deterministic feature families used by the group, support, and bottleneck scores.

| Feature family | Definition |
| --- | --- |
| Retriever relevance | Normalized seed-retrieval score of matched members |
| Facet coverage | Query schema facets covered by group or skill metadata |
| Anchor match | Exact or normalized match to technology and artifact anchors |
| Visible-check support | Coverage of tests, formats, artifacts, or proof obligations |
| Connectivity | Typed graph support between lead and members |
| Redundancy penalty | Overlap with facets already covered by selected groups |
| Negative applicability | Conflict with explicit constraints or failure cues |
| Cost penalty | Hydrated payload size or estimated context cost |

Table 12: Fixed coefficients for the three scoring stages. Rel. denotes seed or group relevance, Facet denotes query-facet coverage, Anch. denotes anchor match, Check denotes visible-check support, Conn. denotes graph connectivity, Redun. denotes redundancy, Neg. denotes negative applicability, and Cost denotes payload cost.

| Score | Rel. | Facet | Anch. | Check | Conn. | Redun. | Neg. | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $U_{\mathrm{grp}}$ | 0.28 | 0.22 | 0.18 | 0.12 | 0.10 | -0.05 | -0.25 | -0.04 |
| $U_{\mathrm{sup}}$ | 0.12 | 0.28 | 0.06 | 0.16 | 0.16 | -0.18 | -0.25 | -0.04 |
| $U_{\mathrm{bot}}$ | 0.18 | 0.24 | 0.12 | 0.20 | 0.08 | -0.12 | -0.30 | -0.08 |

#### Feature normalization\.

Retriever relevance is the maximum normalized seed score among matching group members for $U_{\mathrm{grp}}$, the marginal relevance of the group for $U_{\mathrm{sup}}$, and the skill-level normalized seed score for $U_{\mathrm{bot}}$. Facet coverage is the fraction of query schema facets covered by the group or skill after exact and normalized lexical matching. Anchor match is one for exact technology, artifact, or named-API matches, lower for normalized aliases, and zero for generic matches. Visible-check support counts coverage of output formats, required artifacts, tests, deterministic behavior, or formal proof cues. Connectivity is the clipped aggregate typed-edge support between the lead and members. Redundancy is overlap with already selected facets or skills. Negative applicability is one when explicit constraints or failure cues conflict with a group or skill. Cost is the clipped hydrated payload size relative to the current budget.
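The weighted-sum scoring described in this subsection can be written in a few lines. The sketch below uses the coefficients of Table 12 and clips each feature to $[0,1]$ before weighting; the dictionary keys (`rel`, `facet`, and so on) and the function name `stage_score` are naming assumptions for this example, and the feature values themselves would come from the normalization rules above.

```python
# Column order follows Table 12: Rel., Facet, Anch., Check, Conn., Redun., Neg., Cost.
FEATURES = ("rel", "facet", "anchor", "check", "conn", "redun", "neg", "cost")

COEFFS = {
    "grp": (0.28, 0.22, 0.18, 0.12, 0.10, -0.05, -0.25, -0.04),
    "sup": (0.12, 0.28, 0.06, 0.16, 0.16, -0.18, -0.25, -0.04),
    "bot": (0.18, 0.24, 0.12, 0.20, 0.08, -0.12, -0.30, -0.08),
}


def clip01(value):
    """Clip a scalar feature to the unit interval."""
    return min(1.0, max(0.0, value))


def stage_score(stage, features):
    """Weighted sum of clipped features; missing (inapplicable) features count as zero."""
    weights = COEFFS[stage]
    return sum(w * clip01(features.get(name, 0.0))
               for w, name in zip(weights, FEATURES))


# Hypothetical feature values for one candidate group.
example = {"rel": 0.9, "facet": 0.5, "anchor": 1.0, "check": 0.5, "neg": 0.0}
print(round(stage_score("grp", example), 4))
```

Since penalties carry negative coefficients, a score can fall below zero for a candidate with a negative-applicability conflict and little positive evidence, which is why the rows are not normalized to sum to one.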

Table 13: Offline group-pool construction rules used by Algorithm [1](https://arxiv.org/html/2605.06978#alg1).

| Rule | Definition |
| --- | --- |
| TypedNeighborhood | Collect incoming and outgoing one-hop neighbors connected by dependency, workflow, artifact, visible-check, fallback, or alternative edges. Neighbors are ordered by edge priority and edge weight, then truncated by the group size cap. |
| EnumerateGroups | Create singleton groups for each lead, lead–neighbor pairs, and triples containing the lead plus two non-conflicting support members. Triples must add either a distinct role or a distinct artifact or visible-check facet beyond the pair. |
| AssignRoles | Set the lead role to anchor. Dependency predecessors become prerequisites; workflow predecessors become preprocessors or setup utilities; artifact/output neighbors become formatters or parsers; visible-check neighbors become checkers; alternative edges become fallbacks. |
| ExtractGroupFacets | Normalize skill names, tags, documentation headers, declared artifacts, file extensions, tests, visible-check cues, and negative warnings into required, optional, and negative facet sets. Group facets are the union of member facets with lead facets marked as required. |
| Compatible | Reject groups with contradictory technology anchors, mutually exclusive file formats, or negative applicability conflicts among members. Singleton groups always pass unless the skill metadata is malformed. |
| NonRedundant | Canonicalize each group by lead and sorted members. If two groups have the same canonical members, keep the one with the higher prior. Reject a non-singleton group if all support members add no new role, facet, artifact, visible-check cue, or typed-edge evidence. |
| UpdateIndex | Add the retained group to the inverted index for its lead and every member skill, enabling candidate group lookup from seed atomic skills. |
| BuildGroupGraph | Connect retained groups when their leads, members, artifacts, visible-check cues, fallback edges, or negative facets indicate support, workflow continuation, shared outputs, or conflict. Edges are weighted by typed-edge evidence and normalized facet overlap. |

#### Operator semantics\.

Tables [13](https://arxiv.org/html/2605.06978#A4.T13) and [14](https://arxiv.org/html/2605.06978#A4.T14) define the deterministic operators used in Algorithms [1](https://arxiv.org/html/2605.06978#alg1) and [2](https://arxiv.org/html/2605.06978#alg2). These rules are fixed across tasks, benchmarks, and backbone models.

Table 14: Online selection and rendering rules used by Algorithm [2](https://arxiv.org/html/2605.06978#alg2).

| Rule | Definition |
| --- | --- |
| ExtractSchema | Extract normalized task terms, technology anchors, operations, artifacts, constraints, failure cues, and visible-check cues using deterministic lexical dictionaries and skill-library metadata. Optional rewriting may only add normalized retrieval aliases. |
| HighConfidenceFacets | Return exact query facets used for coverage checks, limited to explicit frameworks, file extensions, named APIs, output formats, required artifacts, and stated constraints. |
| CandidateGroups | Return direct query–group matches plus indexed groups containing at least one seed skill in $R_{q}^{0}$, then remove groups whose negative facets conflict with high-confidence query constraints. |
| TopGroups | Rank candidates by $U_{\mathrm{grp}}$, apply the adaptive score floor, and keep at most the candidate pool cap. Ties are broken by seed-retrieval rank of the lead skill and then by smaller group size. |
| SelectAnchor | Choose the highest-scoring group after anchor correction. If a high-specificity technology or artifact anchor has a plausible matching lead in the shortlist, prefer that lead over a generic group when the corrected score is above the group-selection minimum. |
| BestSupport | Select the group with the largest positive marginal $U_{\mathrm{sup}}$, penalizing overlap with the anchor and previous support groups. Selection stops below the support threshold, after the group cap, or when the context guard would be exceeded. |
| GroupNeighbors | Return group-graph neighbors of the selected group plan $\mathcal{P}$. Neighbors are eligible for expansion only if they add support, artifact, visible-check, fallback, or coverage evidence under the current query schema. |
| InsertLeads | Insert selected group leads first, ordered by anchor then support score, while respecting the presented-skill budget. Duplicate leads are inserted once. |
| BestSkill | Select the remaining member skill with maximum $U_{\mathrm{bot}}$, requiring positive marginal facet, visible-check, artifact, or connectivity evidence after redundancy penalties. |
| CoverageDebt | Compute uncovered high-confidence query facets, limited to exact frameworks, file extensions, named APIs, output formats, artifacts, and explicit constraints. |
| BestBackfill | Choose a seed, retrieved-group, or expanded-group skill outside $B$ only if it covers current coverage debt, has no negative applicability conflict, and fits the remaining budget. Backfill is capped at two skills. |
| AnchorPrune | Promote a high-specificity technology or artifact group to anchor when it is eligible and would otherwise be demoted by a generic group. Remove selected support groups that contribute no presented skill or contract field after bottlenecking. |
| FormatContract | Render the Start, Support, Check, Avoid, Skills, and remaining-debt fields. Support groups that contribute no presented skills are omitted from the contract. |
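The TopGroups rule is essentially a filtered sort with a three-part key. A minimal sketch follows, assuming each candidate already carries its $U_{\mathrm{grp}}$ score and the seed-retrieval rank of its lead; the `Group` record and the explicit `score_floor` argument are hypothetical, since the adaptive floor computation is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Group:
    """Candidate group with a precomputed group score (hypothetical record)."""
    lead: str
    members: Tuple[str, ...]
    score: float
    lead_seed_rank: int  # lower is better


def top_groups(candidates: List[Group], score_floor: float, pool_cap: int = 32) -> List[Group]:
    """Apply the score floor, rank, and truncate to the candidate pool cap."""
    kept = [g for g in candidates if g.score >= score_floor]
    # Higher score first; ties by better lead rank, then by smaller group size.
    kept.sort(key=lambda g: (-g.score, g.lead_seed_rank, len(g.members)))
    return kept[:pool_cap]


candidates = [
    Group("xlsx", ("xlsx",), 0.50, 2),
    Group("fuzzy-match", ("fuzzy-match", "pdf-reading", "xlsx"), 0.50, 1),
    Group("fuzzy-match", ("fuzzy-match", "pdf-reading"), 0.50, 1),
    Group("pdf-reading", ("pdf-reading",), 0.05, 3),
]
ranked = top_groups(candidates, score_floor=0.2)
```

Encoding the tie-breakers in a single sort key keeps the ranking deterministic, which matters because later stages (anchor selection and support expansion) depend on shortlist order.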

#### Feature implementation\.

All group, support, and bottleneck utilities use deterministic lexical matches, normalized skill metadata, typed graph relations, group-graph evidence, and seed retrieval scores. The feature families are summarized in Table [11](https://arxiv.org/html/2605.06978#A4.T11). Optional rewriting is used only for retrieval keyword normalization and is not allowed to introduce new task requirements.

#### Notation.

The implementation uses internal token sets for normalized query terms, technology anchors, operation hints, artifacts, constraints, failure cues, and visible-check cues. These correspond to the query schema $\psi(q)$ in Section [3](https://arxiv.org/html/2605.06978#S3). We use the paper notation throughout the main text and report only the implementation settings here.

## Appendix E Additional Experimental Details

### E.1 Benchmark and Evaluation Protocol

The aggregate experiments in Section [4](https://arxiv.org/html/2605.06978#S4) evaluate both SkillsBench (Li et al., [2026c](https://arxiv.org/html/2605.06978#bib.bib6)) and ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.06978#bib.bib17)). The task-level analyses in this appendix focus on SkillsBench because its tasks are paired with reusable skills and deterministic checks, making it possible to inspect retrieved skill coverage and matched trajectories. Each task is executed by the same downstream agent loop, with only the retrieved skill context changed across methods. We use reward as the primary end-to-end metric and report token usage and runtime as efficiency metrics. Graph of Skills retrieves dependency-aware skill bundles, while GoSkills retrieves anchor-centered skill groups and expands support groups before rendering the final execution contract. The retrieval-gate and case-study tables below are therefore task-level SkillsBench analyses, not a separate re-aggregation of the ALFWorld results.

For retrieval evaluation, we define task-specific `must_have` skills and evaluate 40 annotated visible-requirement items per gate mode. A requirement-level pass means that the final presented context includes the required skill for that item. A partial result covers a related but incomplete required skill set, and a miss covers none. We report requirement-level pass, partial, miss, average must-hit rate, selected group count, and presented skill count.
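The pass, partial, and miss outcomes can be read as a set comparison between the required skills of an item and the skills in the final presented context. The sketch below is one such reading, not the evaluation script itself: the item layout and the presented sets are hypothetical (skill names are borrowed from the paper's examples), and the must-hit rate is left out because its exact formula is not restated here.

```python
from collections import Counter


def classify(required, presented):
    """Requirement-level outcome for one annotated item."""
    hit = set(required) & set(presented)
    if hit == set(required):
        return "pass"        # every required skill is in the presented context
    return "partial" if hit else "miss"


def gate_summary(items, presented_by_task):
    """Count outcomes over (task_id, required_skills) annotation items."""
    return Counter(
        classify(required, presented_by_task.get(task_id, set()))
        for task_id, required in items
    )


# Hypothetical annotation items and presented contexts.
items = [
    ("invoice-fraud-detection", {"fuzzy-match"}),
    ("invoice-fraud-detection", {"pdf-reading", "xlsx"}),
    ("lean4-proof", {"lean4-theorem-proving", "lean4-memories"}),
    ("threejs-structure-parser", {"threejs"}),
]
presented_by_task = {
    "invoice-fraud-detection": {"fuzzy-match", "pdf-reading", "xlsx"},
    "lean4-proof": {"lean4-theorem-proving"},
    "threejs-structure-parser": {"obj-exporter"},
}
summary = gate_summary(items, presented_by_task)
```

Keeping the classifier independent of retrieval scores matches the protocol's intent that gate labels are applied only to each method's final presented context.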

We separate infrastructure failures from substantive agent failures\. Runs with environment construction errors, Docker failures, or startup failures are tracked separately\. If the agent has already entered a meaningful trajectory and then times out, we treat the timeout as a substantive execution failure\. Invalid runs are not used as evidence for retrieval quality\.

### E.2 Run Provenance and Aggregation

Table [15](https://arxiv.org/html/2605.06978#A5.T15) reports the run provenance used for the aggregate results in Table [1](https://arxiv.org/html/2605.06978#S4.T1). We separate infrastructure failures from task-level failures. Infrastructure failures include API outages, environment crashes, or logging failures that prevent the benchmark evaluator from producing a valid task outcome. These runs are excluded from aggregate reward, token, and runtime computation and are not imputed. In contrast, agent timeouts are retained as valid task-level outcomes when the evaluator returns a task result; they are counted as task failures for reward and included in token/runtime accounting up to the timeout cap.

For each model–method–benchmark cell, reward is first averaged over valid task outcomes within each repeat and then averaged over the three repeats. Token usage reports mean input tokens over the same valid task runs, and runtime reports mean agent-only task-processing time, excluding environment setup. The table aggregates provenance over the six evaluated backbones for compactness; the same accounting rule is applied to every model–method–benchmark cell.

For space, Table [15](https://arxiv.org/html/2605.06978#A5.T15) aggregates provenance over backbones. The released run-level CSV contains one row per task run with fields `benchmark`, `model`, `method`, `task_id`, `repeat_id`, `status`, `reward`, `input_tokens`, and `agent_runtime_s`, which provides exact per-cell denominators for Table [1](https://arxiv.org/html/2605.06978#S4.T1).
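The accounting rule above can be reproduced from a run-level CSV with the standard library alone. The following is a sketch under assumptions: the column names are the ones listed in this section, but the status vocabulary (`ok`, `timeout`, `infra_failure`) and the sample rows are invented for illustration and may not match the released file.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

EXCLUDED_STATUS = {"infra_failure"}  # hypothetical label for infrastructure failures


def aggregate(rows):
    """Per (benchmark, model, method) cell: repeat-averaged reward, mean tokens, mean runtime."""
    rewards = defaultdict(lambda: defaultdict(list))
    tokens, runtimes = defaultdict(list), defaultdict(list)
    for row in rows:
        if row["status"] in EXCLUDED_STATUS:
            continue  # excluded and not imputed
        cell = (row["benchmark"], row["model"], row["method"])
        rewards[cell][row["repeat_id"]].append(float(row["reward"]))
        tokens[cell].append(float(row["input_tokens"]))
        runtimes[cell].append(float(row["agent_runtime_s"]))
    return {
        cell: {
            # Average within each repeat first, then across repeats.
            "reward": mean(mean(vals) for vals in per_repeat.values()),
            "input_tokens": mean(tokens[cell]),
            "agent_runtime_s": mean(runtimes[cell]),
        }
        for cell, per_repeat in rewards.items()
    }


CSV_TEXT = """\
benchmark,model,method,task_id,repeat_id,status,reward,input_tokens,agent_runtime_s
SkillsBench,m1,GoSkills,t1,1,ok,1,1000,10.0
SkillsBench,m1,GoSkills,t2,1,ok,0,1200,12.0
SkillsBench,m1,GoSkills,t1,2,ok,1,1000,11.0
SkillsBench,m1,GoSkills,t2,2,infra_failure,,0,0
SkillsBench,m1,GoSkills,t1,3,timeout,0,900,30.0
SkillsBench,m1,GoSkills,t2,3,ok,1,1100,9.0
"""
cells = aggregate(csv.DictReader(io.StringIO(CSV_TEXT)))
```

Note that the timeout row stays in the denominators with reward zero, while the infrastructure failure is dropped before any averaging, so the second repeat is averaged over one task rather than two.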

Table 15: Run provenance for aggregate experiments. Attempted task-runs are aggregated over the six evaluated backbones and three repeats. Valid reward denotes task-runs with evaluator-produced outcomes used in Table [1](https://arxiv.org/html/2605.06978#S4.T1). Infra. failures are excluded and not imputed. Timeout failures are a subset of valid reward task-runs and are counted as task failures.

| Benchmark | Method | Tasks / model | Backbones | Attempted | Valid reward | Infra. fail. | Timeout fail. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SkillsBench | No Skills | 54 | 6 | 972 | 966 | 6 | 72 |
| SkillsBench | Vanilla Skills | 54 | 6 | 972 | 963 | 9 | 65 |
| SkillsBench | Vector Skills | 54 | 6 | 972 | 960 | 12 | 59 |
| SkillsBench | Graph of Skills | 54 | 6 | 972 | 958 | 14 | 63 |
| SkillsBench | GoSkills | 54 | 6 | 972 | 965 | 7 | 51 |
| ALFWorld | No Skills | 14 | 6 | 252 | 251 | 1 | 16 |
| ALFWorld | Vanilla Skills | 14 | 6 | 252 | 250 | 2 | 4 |
| ALFWorld | Vector Skills | 14 | 6 | 252 | 250 | 2 | 3 |
| ALFWorld | Graph of Skills | 14 | 6 | 252 | 249 | 3 | 3 |
| ALFWorld | GoSkills | 14 | 6 | 252 | 251 | 1 | 2 |

### E.3 Retrieval-Gate Annotation Protocol

The retrieval gate is designed as a visible-requirement coverage diagnostic for SkillsBench, not as a separate held-out end-to-end benchmark. Table [16](https://arxiv.org/html/2605.06978#A5.T16) shows how we evaluate whether the final presented context contains the task-specific `must_have` skills needed to satisfy visible task requirements before agent execution.

The `must_have` annotations were created from task prompts, public task files, skill names, skill metadata, and skill payload descriptions. They exclude hidden tests, evaluator internals, private oracle information, previous failure traces, and any artifacts produced during agent execution. Each annotation item links one visible requirement to one or more required skills. Examples include explicit file formats, named APIs, public checks, deterministic output requirements, or formal proof obligations.

Annotations were produced before final retrieval-gate evaluation. The annotators did not inspect final GoSkills outputs when assigning `must_have` labels. Disagreements were resolved by discussion using only visible task materials and the skill-library metadata. The final annotation set contains 40 visible-requirement items per gate mode, covering 10 SkillsBench gate tasks. The same annotation set is used for all compared methods and ablations.

The gate evaluation is separated from the method features. GoSkills uses deterministic query facets, normalized skill metadata, typed graph relations, and visible-check cues for retrieval and bottlenecking. The gate labels are task-level evaluator annotations used only after retrieval to check coverage. Although both the method and the gate rely on visible task information, the gate does not use GoSkills group scores, selected groups, coverage-debt state, or contract fields as labels. This avoids evaluating the method with labels derived from its own outputs.

Table 16: Retrieval-gate annotation protocol. The same visible-requirement annotations are used for all methods evaluated under the gate.

| Question | Protocol |
| --- | --- |
| Who annotated `must_have` skills? | Two authors independently annotated visible requirements from task prompts, public files, skill names, skill metadata, and payload descriptions. Disagreements were resolved by discussion using only visible task materials and skill-library metadata. |
| Blind to GoSkills outputs? | Annotators did not inspect final GoSkills retrieved contexts, selected groups, or rendered contracts when assigning `must_have` labels. The labels were fixed before the final retrieval-gate evaluation. |
| Same facets as the method? | No. GoSkills uses deterministic query/schema facets for retrieval and bottlenecking, while the gate uses fixed task-level annotations for post-hoc coverage evaluation. Both are restricted to visible task information, but gate labels are not derived from GoSkills selected groups, scores, coverage-debt state, or rendered contracts. |
| Hidden information excluded? | Yes. Hidden tests, evaluator internals, private oracle information, previous failure traces, and execution-time artifacts are excluded. The annotations use only information available before agent execution. |
| Coverage of 40 items? | The final gate contains 40 visible-requirement items per mode across 10 SkillsBench gate tasks. Each item maps one visible requirement, such as an explicit file format, named API, public check, deterministic output constraint, or proof obligation, to one or more required skills. |
| Other methods evaluated? | Yes. The same gate labels are applied to Vector Skills, Graph of Skills, GoSkills, and relevant ablations using each method's final presented context before agent execution. |

### E.4 Retrieval Gate Details

Table [17](https://arxiv.org/html/2605.06978#A5.T17) shows the improvement from the initial retrieval configuration to the final gate over the same 40 annotated visible-requirement items per mode. The average must-hit rate improves from 0.73 to 1.00 in both modes.

Table 17: Retrieval-gate optimization.

| Stage | Mode | Req. P | Req. Par. | Miss | Must-hit |
| --- | --- | --- | --- | --- | --- |
| Initial gate | `instruction_auto` | 29 | 4 | 7 | 0.73 |
| Initial gate | `critical_override` | 29 | 8 | 3 | 0.73 |
| Final gate | `instruction_auto` | 40 | 0 | 0 | 1.00 |
| Final gate | `critical_override` | 40 | 0 | 0 | 1.00 |

### E.5 Task-Level Retrieval Examples

Table [18](https://arxiv.org/html/2605.06978#A5.T18) gives representative task-level retrieval traces. These examples show that GoSkills often combines a primary skill with supporting skills that cover artifacts, workflow dependencies, or visible-check constraints.

Table 18: Representative retrieval outputs under `instruction_auto`.

| Task | Primary skill | Presented skills |
| --- | --- | --- |
| invoice-fraud-detection | fuzzy-match | fuzzy-match, pdf-reading, xlsx |
| gravitational-wave-detection | conditioning | conditioning, matched-filtering, silence-detector |
| threejs-structure-parser | threejs | threejs, obj-exporter, discover-important-function |
| lean4-proof | lean4-theorem-proving | lean4-theorem-proving, lean4-memories |

### E.6 Task-Level Paired Bootstrap Confidence Intervals

To quantify the stability of the main SkillsBench comparisons, we compute task-level paired bootstrap confidence intervals for reward and runtime deltas. For each task and method, we first average repeated valid runs within the task. We then form paired task-level differences between GoSkills and each baseline and resample tasks with replacement for 10,000 bootstrap samples. Only tasks with valid results for both methods in a given comparison are included in that paired comparison. Because valid paired coverage differs across baselines, each row uses the maximal paired task set available for that specific comparison. We report the mean paired delta and the percentile 95% confidence interval.

Table [19](https://arxiv.org/html/2605.06978#A5.T19) shows that the reward improvement of GoSkills over Graph of Skills remains positive under task-level resampling, with a mean reward delta of +12.5 percentage points and a 95% confidence interval of [+5.2, +20.1]. The runtime delta is negative, which is the favorable direction, with a mean reduction of 250.8 seconds and a 95% confidence interval of [-392.4s, -108.6s]. Neither interval contains zero. These intervals support the main finding that GoSkills improves downstream task performance and agent-side runtime over the strongest structural-retrieval baseline considered in our experiments.

We emphasize that the bootstrap is paired at the task level: each resampled unit contains both the GoSkills result and the corresponding baseline result for the same task. This controls for task difficulty and directly estimates the stability of method-level differences rather than comparing independent aggregate means.
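The paired procedure described above can be sketched with the standard library. This is an illustrative version, not the authors' script: the function names, the nearest-rank percentile rule, the fixed seed, and the toy rewards are assumptions made for the example.

```python
import random


def paired_deltas(method_runs, baseline_runs):
    """Per-task delta (method minus baseline) over tasks valid for both methods."""
    shared = sorted(set(method_runs) & set(baseline_runs))
    return [
        sum(method_runs[t]) / len(method_runs[t])
        - sum(baseline_runs[t]) / len(baseline_runs[t])
        for t in shared
    ]


def paired_bootstrap_ci(deltas, n_boot=10_000, alpha=0.05, seed=0):
    """Mean paired delta and percentile (1 - alpha) interval, resampling tasks."""
    rng = random.Random(seed)
    n = len(deltas)
    # Each resample draws whole tasks, so both methods stay paired within a unit.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_boot))
    lo = means[round((alpha / 2) * (n_boot - 1))]
    hi = means[round((1 - alpha / 2) * (n_boot - 1))]
    return sum(deltas) / n, (lo, hi)


# Hypothetical repeated-run rewards per task; t3 and t4 lack a valid pair.
goskills = {"t1": [1, 1, 0], "t2": [1], "t3": [0, 1]}
baseline = {"t1": [0, 0, 0], "t2": [1], "t4": [1]}
deltas = paired_deltas(goskills, baseline)
mean_delta, (ci_lo, ci_hi) = paired_bootstrap_ci(deltas)
```

Resampling the per-task deltas, rather than the two methods' scores separately, is what makes the interval reflect method differences with task difficulty held fixed.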

Table 19: Task-level paired bootstrap confidence intervals on SkillsBench. For each comparison, we first average repeated valid runs within each task and then resample paired tasks with replacement for 10,000 bootstrap samples. Deltas are reported as GoSkills minus the baseline. Positive reward deltas are better, while negative runtime deltas are better. Reward deltas are reported in percentage points. Confidence intervals use the percentile 95% interval.

| Baseline | Pairs | $\Delta$ Reward (pp) $\uparrow$ | 95% CI | $\Delta$ Runtime $\downarrow$ | 95% CI |
| --- | --- | --- | --- | --- | --- |
| Graph of Skills | 54 | +12.5 | [+5.2, +20.1] | -250.8s | [-392.4s, -108.6s] |
| Vector Skills | 52 | +23.9 | [+14.7, +32.8] | -389.2s | [-541.6s, -211.3s] |
| Vanilla Skills | 50 | +20.5 | [+11.3, +29.4] | -333.9s | [-486.7s, -170.5s] |
| No Skills | 47 | +30.3 | [+21.4, +39.6] | -395.5s | [-562.8s, -207.1s] |

### E\.7Baseline specification and accounting

Table[20](https://arxiv.org/html/2605.06978#A5.T20)summarizes the baseline interfaces used in the aggregate experiments\. The goal is to make the comparison about the organization of retrieved context rather than about different downstream agents, execution loops, or skill implementations\. All methods use the same downstream agent loop, execution environment, task prompts, and skill payloads\. For retrieved\-skill methods, we use the same exposed\-payload cap of four atomic skill payloads and the same rendered\-context guard of 9,000 characters\. Thus, Vector Skills, Graph of Skills, andGoSkillsare compared under the same final payload budget, while Vanilla Skills serves as a full\-library exposure reference\.

The main difference among the retrieved\-skill methods is the interface used to construct and render the skill context\. Vector Skills exposes a flat semantic top\-kklist\. Graph of Skills hydrates a dependency\-aware bundle from the typed skill graph\.GoSkillsfirst selects an anchor/support group plan, bottlenecks the selected groups into the same four\-payload budget, applies coverage\-safe backfill when needed, and renders the result as a role\-labeled execution contract\. Therefore, differences between Graph of Skills andGoSkillsshould be interpreted as differences between a hydrated structural bundle and the full group\-structured retrieval\-and\-rendering interface, rather than as changes to the downstream model or execution environment\.

For accounting, reward is averaged over tasks within each run and then averaged across runs\. Token usage reports mean input tokens, including the task prompt, the method\-specific skill context, and the downstream agent prompt\. Runtime reports mean agent\-only task\-processing time and excludes environment setup\. Infrastructure failures are tracked separately from valid task outcomes, as described in Appendix[E\.1](https://arxiv.org/html/2605.06978#A5.SS1)\.

Table 20: Baseline specification for aggregate experiments. All retrieved-skill methods use the same exposed-payload cap and the same payload truncation guard. GoSkills differs by selecting and rendering a role-labeled group plan before exposing the final atomic skill payloads.

| Method | Exposed context | Budget / truncation | Prompt wrapper | Accounting |
|---|---|---|---|---|
| No Skills | No retrieved skill payloads. | No retrieval budget; no skill-context block. | Same downstream agent prompt with the skill block omitted. | Same reward, token, and runtime accounting. |
| Vanilla Skills | Full available skill library. | Full-library exposure; no retrieval top-k and no graph hydration. | Generic skill-library block prepended to the same downstream agent prompt. | Same accounting. |
| Vector Skills | Flat semantic top-k atomic skills. | Top-k (k = 4) exposed payloads; each payload uses the same truncation guard as other retrieved baselines; max rendered skill context is 9,000 characters. | Generic retrieved-skill block. No Start, Support, Check, Avoid, or debt fields. | Same accounting. |
| Graph of Skills | Hydrated dependency-aware skill bundle from the typed skill graph. | At most 4 exposed payloads after graph hydration; same payload truncation guard; max rendered skill context is 9,000 characters. | Graph-hydrated skill block. Dependency structure is available through the bundle, but no explicit role-labeled execution contract is rendered. | Same accounting. |
| GoSkills | Anchor/support group plan rendered as a role-labeled execution contract plus final atomic payloads. | At most 3 selected groups internally; at most 4 exposed payloads after bottlenecking/backfill; same payload truncation guard; max rendered skill context is 9,000 characters. | Fixed Start/Support/Check/Avoid/debt contract with the same downstream agent loop and execution environment. | Same accounting. |

### E.8 Matched Paired Results

Table [21](https://arxiv.org/html/2605.06978#A5.T21) reports a matched SkillsBench subset in which Graph of Skills and GoSkills runs are aligned at the task and slice level. This subset is narrower than the full benchmark–model aggregation in Table [1](https://arxiv.org/html/2605.06978#S4.T1); it is used to inspect matched outcomes and run-level reliability rather than to replace the primary aggregate comparison.

Across all slices, ties remain the most common outcome, indicating that GoSkills often preserves downstream task success while changing the retrieval unit from atomic skills or post-hoc bundles to role-aware skill groups. At the same time, GoSkills obtains more wins than Graph of Skills in every matched slice and achieves a higher average reward throughout. In the completed matched subset, GoSkills improves average reward from 0.539 to 0.683, with 10 wins, 3 Graph of Skills wins, and 41 ties. The fixed paired slice shows a similar pattern, with reward increasing from 0.614 to 0.827. The GPT-5.4 and fast-agent slices also favor GoSkills, while the per-task snapshot subset shows the largest average reward gap (0.860 versus 0.620). Error counts are mixed across slices, so we interpret this table as evidence that group-structured rendering improves matched outcomes on these slices, rather than as a standalone reliability claim.

Table 21: Matched SkillsBench subset for trajectory analysis. R denotes average reward within the matched subset; Errors reports Graph of Skills / GoSkills counts.

| Slice | Pairs | Graph W | GoSkills W | Tie | Graph R | GoSkills R | Errors |
|---|---|---|---|---|---|---|---|
| Completed matched subset | 54 | 3 | 10 | 41 | 0.539 | 0.683 | 16 / 17 |
| Fixed paired slice | 31 | 2 | 8 | 21 | 0.614 | 0.827 | 10 / 7 |
| GPT-5.4 paired slice | 14 | 0 | 4 | 10 | 0.838 | 0.951 | 1 / 0 |
| Fast-agent paired slice | 17 | 2 | 7 | 8 | 0.489 | 0.645 | 9 / 7 |
| Per-task snapshot subset | 25 | 0 | 5 | 20 | 0.620 | 0.860 | 3 / 5 |
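The win/tie columns and the two reward averages in Table 21 are simple functions of matched per-task rewards. The sketch below shows one way such a row could be computed; the function name, the tie tolerance, and the input layout are assumptions, and the toy rewards in the usage note are not data from the paper.

```python
from statistics import mean


def tally_matched_pairs(pairs, tol=1e-9):
    """Summarize matched (graph_reward, goskills_reward) pairs for one slice.

    A pair counts as a tie when the two rewards differ by at most `tol`.
    """
    graph_wins = sum(1 for g, s in pairs if g - s > tol)
    goskills_wins = sum(1 for g, s in pairs if s - g > tol)
    return {
        "pairs": len(pairs),
        "graph_w": graph_wins,
        "goskills_w": goskills_wins,
        "tie": len(pairs) - graph_wins - goskills_wins,
        "graph_r": mean([g for g, _ in pairs]),
        "goskills_r": mean([s for _, s in pairs]),
    }
```

Calling `tally_matched_pairs([(0.0, 1.0), (1.0, 1.0), (1.0, 0.0)])` counts one win for each method and one tie. By construction, wins and ties sum to the number of pairs, which also holds for every row of Table 21.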

### E.9 Token and Runtime Analysis

Table [22](https://arxiv.org/html/2605.06978#A5.T22) reports token and runtime statistics for the matched trajectory slices. Across these slices, GoSkills introduces modest structured-context overhead, using slightly more input and total tokens than Graph of Skills. However, this overhead is accompanied by lower agent-only runtime in every paired slice. In the completed matched subset, agent time decreases from 235.7s to 202.3s. The same pattern holds for the fixed slice (178.3s to 167.9s), the GPT-5.4 slice (204.7s to 193.1s), and the fast-agent slice (156.6s to 137.2s). Wall time also decreases slightly across all reported slices, suggesting that the role-labeled context can reduce downstream execution effort even when it adds prompt tokens.

At the task level, the efficiency pattern is mixed. GoSkills reduces input tokens on invoice-fraud-detection by 31.7%, and reduces both input tokens and wall time on threejs-structure-parser by 18.7% and 8.6%, respectively. It also reduces wall time on data-to-d3 by 18.3%. In contrast, setup-fuzzing-py increases input tokens by 24.2%, suggesting that long-chain setup tasks may require broader support context. Overall, the matched slices indicate a tradeoff: GoSkills may spend additional tokens to expose anchor, support, and check structure, but this structure can shorten the agent's downstream execution.

Table 22: Token and runtime comparison for matched trajectory slices. Wall time and agent time are in seconds.

| Slice | Method | Runs | Input tokens | Total tokens | Wall time (s) | Agent time (s) |
|---|---|---|---|---|---|---|
| Completed matched | Graph of Skills | 35/54 | 61,978 | 96,786 | 275.6 | 235.7 |
| Completed matched | GoSkills | 35/54 | 62,516 | 100,471 | 271.1 | 202.3 |
| Fixed slice | Graph of Skills | 31/31 | 86,927 | 87,878 | 219.2 | 178.3 |
| Fixed slice | GoSkills | 31/31 | 92,734 | 93,992 | 212.8 | 167.9 |
| GPT-5.4 slice | Graph of Skills | 14/14 | 60,733 | 61,350 | 247.8 | 204.7 |
| GPT-5.4 slice | GoSkills | 14/14 | 64,001 | 65,662 | 236.1 | 193.1 |
| Fast-agent slice | Graph of Skills | 17/17 | 108,499 | 109,725 | 195.7 | 156.6 |
| Fast-agent slice | GoSkills | 17/17 | 111,455 | 113,204 | 193.6 | 137.2 |
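To make the token/runtime tradeoff in Table 22 easier to read, the relative changes can be derived directly from the table entries. The snippet below is only a convenience sketch: the helper name `pct_change` is an assumption, and the numbers are copied from Table 22 as (Graph of Skills, GoSkills) pairs.

```python
def pct_change(baseline, method):
    """Relative change of `method` versus `baseline`, in percent."""
    return 100.0 * (method - baseline) / baseline


# Values copied from Table 22: (Graph of Skills, GoSkills).
AGENT_TIME_S = {
    "Completed matched": (235.7, 202.3),
    "Fixed slice": (178.3, 167.9),
    "GPT-5.4 slice": (204.7, 193.1),
    "Fast-agent slice": (156.6, 137.2),
}
INPUT_TOKENS = {
    "Completed matched": (61_978, 62_516),
    "Fixed slice": (86_927, 92_734),
    "GPT-5.4 slice": (60_733, 64_001),
    "Fast-agent slice": (108_499, 111_455),
}

for name in AGENT_TIME_S:
    time_delta = pct_change(*AGENT_TIME_S[name])
    token_delta = pct_change(*INPUT_TOKENS[name])
    print(f"{name}: agent time {time_delta:+.1f}%, input tokens {token_delta:+.1f}%")
```

In every slice the agent-time change is negative and the input-token change is positive, which is the tradeoff the table is meant to show.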

### E.10 Implementation Settings

Unless otherwise stated, GoSkills uses four seed skills for group activation, selects at most three groups, presents at most four atomic skills to the downstream agent, caps each hydrated skill payload at 1,800 characters, and caps the full rendered skill context at 9,000 characters. The scoring weights for group ranking, support expansion, and bottleneck selection are fixed across all benchmarks, models, and tasks. The downstream agent loop, execution environment, and skill payloads are kept unchanged across methods; only the retrieved skill context differs.
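The payload and context caps listed above can be illustrated with a small budgeting routine. This is a sketch under stated assumptions rather than the authors' implementation: the constant and function names are hypothetical, payloads are assumed to arrive already ranked, and the context cap is applied to payload text only, whereas the real rendered context also contains the contract fields.

```python
MAX_PAYLOADS = 4           # atomic skills shown to the downstream agent
PAYLOAD_CHAR_CAP = 1_800   # per hydrated skill payload
CONTEXT_CHAR_CAP = 9_000   # full rendered skill context


def apply_budget(payloads, max_payloads=MAX_PAYLOADS,
                 payload_cap=PAYLOAD_CHAR_CAP, context_cap=CONTEXT_CHAR_CAP):
    """Truncate a ranked list of (skill_name, text) pairs to the budget.

    Assumption: only payload characters count toward `context_cap` here.
    """
    kept, used = [], 0
    for name, text in payloads[:max_payloads]:
        remaining = context_cap - used
        if remaining <= 0:
            break
        # Apply the per-payload cap first, then whatever context budget is left.
        text = text[:payload_cap][:remaining]
        kept.append((name, text))
        used += len(text)
    return kept
```

With the default settings, four payloads of 1,800 characters total 7,200 characters, which leaves room under the 9,000-character guard for the surrounding contract text.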

## Appendix F Failure and Error Analysis

#### Error taxonomy\.

We separate failures caused by the retrieval interface from failures caused by downstream execution or infrastructure. This distinction is important because GoSkills only changes the retrieved context. It does not execute skills, bind arguments, repair code after a failure, or inspect hidden evaluator state. Table [23](https://arxiv.org/html/2605.06978#A6.T23) summarizes the categories used in our failure analysis.

Table 23: Failure taxonomy for group-structured skill retrieval experiments. The taxonomy separates what GoSkills can directly affect from downstream and infrastructure bottlenecks.

| Error mode | Typical symptom | Interpretation |
|---|---|---|
| Activation miss | No activated group contains the skill or facet needed by the visible task requirement. | Retrieval-side failure; better schema extraction, seed evidence, or group indexing can help. |
| Partial coverage | The anchor is plausible, but the final bottleneck omits a required support, artifact, or visible-check skill. | Retrieval-side failure; this is the purpose of coverage debt and budgeted backfill. |
| Good retrieval, bad execution | The exposed skills are plausible, but the agent over-builds, ignores the contract, or fails to satisfy the task check. | Mostly downstream; retrieval can reduce search friction but cannot guarantee execution. |
| Context overhead | The contract or support context adds tokens without improving the trajectory. | Efficiency failure; most likely on long setup chains that need broad environment context. |
| Infrastructure failure | Docker, startup, dependency, timeout, or reward-file errors prevent a clean completed run. | Separated from method-quality interpretation; not used as direct evidence about retrieval quality. |

#### Retrieval\-side failures\.

In GoSkills, a retrieval miss can occur before group selection if the seed evidence fails to activate the relevant group, or after group selection if bottlenecking drops a required support skill. The retrieval-gate comparison in Table [17](https://arxiv.org/html/2605.06978#A5.T17) shows this distinction empirically: the initial configuration produced misses and partial results, while the final gate reaches 40/40 requirement-level pass in both modes. We therefore treat coverage debt as an implementation check, not as a proof that execution will succeed. It only records whether high-confidence visible requirements remain uncovered by the final presented context.
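A coverage-debt check of the kind described here can be sketched as a set difference between high-confidence visible requirements and the facets covered by the presented skills. The data layout, the key names (`id`, `facet`, `confidence`), and the 0.8 threshold are all hypothetical choices for this example; the paper does not specify them in this section.

```python
def coverage_debt(requirements, presented, min_confidence=0.8):
    """Return ids of high-confidence visible requirements left uncovered.

    requirements: list of dicts with hypothetical keys "id", "facet",
        and "confidence".
    presented: dict mapping each exposed skill name to the set of facets
        it covers.
    """
    covered = set().union(*presented.values())
    return [
        r["id"]
        for r in requirements
        if r["confidence"] >= min_confidence and r["facet"] not in covered
    ]
```

An empty result means no high-confidence requirement is left uncovered by the final context. Consistent with the text, it says nothing about whether the downstream agent will execute the task correctly.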

#### Execution failures after valid retrieval\.

Some failures remain even when the retrieved context is plausible. In the Graph of Skills baseline, adaptive-cruise-control and earthquake-phase-association are completed reward-0 outcomes rather than environment-start failures. These cases are useful because they should not be collapsed into infrastructure noise. They indicate that a long design, simulation, or data-processing chain can still fail after a valid start. For GoSkills, the corresponding limitation is the same: group structure can make the entry point and support roles explicit, but it cannot force the downstream model to perform the right multi-step execution.

#### Efficiency failures\.

The group contract can add useful structure, but it is not free. On the paired subset in Table [22](https://arxiv.org/html/2605.06978#A5.T22), GoSkills uses more tokens on average while reducing agent-only runtime in every reported slice. The task-level pattern is mixed: invoice-fraud-detection, threejs-structure-parser, and data-to-d3 show efficiency gains, while setup-fuzzing-py increases input tokens. This supports the scoped interpretation used in the main paper: group-structured retrieval is most useful when the task has a compact anchor/support/check decomposition, and less reliable when the dominant difficulty is broad setup or environment repair.

#### Infrastructure failures\.

We retain completed episodes as substantive outcomes, including reward\-0 failures\. Episodes with startup failures, dependency failures, unavailable reward artifacts, or timeouts before a meaningful trajectory are separated as infrastructure evidence\. This policy prevents setup and reward\-artifact failures from being misread as retrieval misses while still counting completed failed attempts as real downstream outcomes\.
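The separation policy can be expressed as a small partition step over episode records. The status labels and record fields below are hypothetical and chosen only to mirror the categories named in the text; they are not taken from the authors' logging format.

```python
# Hypothetical status labels mirroring the categories named in the text.
INFRA_STATUSES = {
    "startup_failure",
    "dependency_failure",
    "reward_unavailable",
    "early_timeout",  # timeout before a meaningful trajectory
}


def split_episodes(episodes):
    """Partition episode dicts into (completed, infrastructure).

    Completed episodes keep their observed reward, including reward 0.
    An episode with no reward artifact is treated as infrastructure evidence.
    """
    completed, infra = [], []
    for ep in episodes:
        if ep["status"] in INFRA_STATUSES or ep.get("reward") is None:
            infra.append(ep)
        else:
            completed.append(ep)
    return completed, infra
```

The key design point is that a reward of 0.0 is not treated as missing: a completed failed attempt stays in the substantive set, while only episodes without a usable trajectory or reward artifact are set aside.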

## Appendix G Qualitative Analysis

#### Section framing\.

We use trajectory-grounded case studies to explain when group-structured retrieval changes the downstream trajectory and when it does not. The relevant question is not only whether a method retrieves a topically related skill, but whether the exposed context gives the downstream agent an executable entry point, supporting artifacts, and visible checks early enough to change the trajectory. This is the intended difference between Graph of Skills and GoSkills: Graph of Skills scores skill nodes through graph diffusion and hydrates a dependency-aware bundle, whereas GoSkills selects anchor-centered groups, expands support groups, bottlenecks the selected plan to a few atomic payloads, and renders explicit Start, Support, Check, Avoid, and debt fields.
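To make the idea of a role-labeled contract concrete, the sketch below renders the five fields as plain text. The exact layout of the authors' contract is not given in this section, so the field formatting, the function name, and the example Check and Avoid strings are assumptions. The skill names in the usage note come from the invoice-fraud-detection retrieval example.

```python
def render_contract(start, support, check, avoid, debt, max_chars=9_000):
    """Render a role-labeled execution contract as plain text (assumed layout)."""

    def field(label, items):
        return f"{label}: " + ("; ".join(items) if items else "none")

    lines = [
        f"Start: {start}",
        field("Support", support),
        field("Check", check),
        field("Avoid", avoid),
        field("Debt", debt),
    ]
    # Guard the rendered context with the same character cap used elsewhere.
    return "\n".join(lines)[:max_chars]
```

For example, `render_contract("fuzzy-match", ["pdf-reading", "xlsx"], ["output file exists"], ["exact string matching only"], [])` produces a five-line block whose first line is `Start: fuzzy-match` and whose last line is `Debt: none`.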

#### Evidence sources\.

The qualitative examples pair completed Graph of Skills baseline outcomes with the GoSkills retrieval examples and paired trajectory summaries reported in Appendix [E.5](https://arxiv.org/html/2605.06978#A5.SS5)–[E.9](https://arxiv.org/html/2605.06978#A5.SS9). For each case, we use the observed reward, runtime, token count, and whether the episode reached a meaningful task trajectory. Setup, reward-artifact, Docker, or environment-start failures are tracked separately as infrastructure evidence.

Reading guide. Each case is interpreted along three axes: whether the initial context exposes a credible entry point, whether supporting artifacts and checks are present, and whether the remaining failure is better attributed to retrieval, execution, or infrastructure.

Table 24: Trajectory-grounded qualitative evidence used in the case studies. Baseline evidence reports the Graph of Skills outcome for the same task. GoSkills evidence is taken from the task-level retrieval examples and paired trajectory summaries in Tables [18](https://arxiv.org/html/2605.06978#A5.T18)–[22](https://arxiv.org/html/2605.06978#A5.T22). Rows are qualitative examples, not a replacement for the aggregate results in Table [1](https://arxiv.org/html/2605.06978#S4.T1).

| Task | Graph of Skills baseline | GoSkills evidence | Interpretation |
|---|---|---|---|
| azure-bgp-oscillation-route-leak | Fail: reward 0.0; 149.18s; 37,808 tokens; completed episode. | Pass: reward 1.0; 62.8s agent time; 98,764 input tokens. | Role-structured context is associated with a completed success rather than a completed failure. |
| invoice-fraud-detection | Pass: reward 1.0; 684.588s; 66,492 tokens; completed episode. | Primary fuzzy-match; support pdf-reading, xlsx; input-token reduction of 31.7%. | Group context exposes a compact document–table–matching chain. |
| threejs-structure-parser | Fail: reward 0.0; 392.375s; 46,123 tokens; completed episode. | Primary threejs; support obj-exporter, discover-important-function; input tokens down 18.7%, wall time down 8.6%. | Anchor/support exposure helps distinguish parsing, export, and inspection roles. |
| 3d-scan-calc | Pass: reward 1.0; 96.399s; 33,853 tokens; completed episode. | Control case: the baseline already exposes the geometry bottleneck. | Dependency-aware retrieval can already be sufficient on compact geometry tasks. |
| adaptive-cruise-control | Fail: reward 0.0; 307.623s; 177,283 tokens; completed episode. | No paired GoSkills success is reported in the task-level evidence. | Completed failure indicates an execution/planning bottleneck, not infrastructure noise. |
| setup-fuzzing-py | Infra: reward unavailable; 321.622s; 330,037 tokens; setup failure. | GoSkills increases input tokens by 24.2% on this setup-heavy task. | Long setup chains can require broader context and are not direct retrieval-benefit cases. |

#### Case Study 1: Azure BGP route\-leak diagnosis\.

azure-bgp-oscillation-route-leak is the most direct positive case in the matched trajectory evidence because the Graph of Skills baseline trajectory is a completed failure rather than an infrastructure artifact. The baseline obtains reward 0.0 with 149.18s runtime and 37,808 total tokens. In the paired trajectory slice for the same task, GoSkills obtains reward 1.0, with 62.8s agent time and 98,764 input tokens, compared with 120.2s agent time and 112,467 input tokens for Graph of Skills. This supports the mechanism rather than a broad task-level dominance claim: the role-labeled contract tells the agent what to start from, which support context to consult, which visible requirements to check, and what misreadings to avoid.

#### Case Study 2: Invoice fraud detection\.

invoice-fraud-detection illustrates a different regime: Graph of Skills already reaches reward 1.0, but the baseline trajectory is relatively expensive, taking 684.588s. The GoSkills retrieval example exposes a short, role-consistent chain: fuzzy-match as the primary skill, supported by pdf-reading and xlsx. This is exactly the kind of anchor/support decomposition that the method is designed to make explicit: one skill anchors entity matching, while the support skills cover document extraction and table handling. In the task-level efficiency analysis, GoSkills reduces input tokens by 31.7%. The lesson is not that Graph of Skills cannot solve the task, but that group-structured rendering can reduce the agent's search burden when the workflow has a compact artifact chain.

#### Case Study 3: ThreeJS structure parsing\.

threejs-structure-parser is a more informative failure contrast. The Graph of Skills baseline trajectory is a completed run with reward 0.0, 392.375s runtime, and 46,123 tokens. The GoSkills retrieval example instead presents threejs as the primary skill, with obj-exporter and discover-important-function as support. This bundle separates the visible roles in the task: understand the Three.js structure, export or inspect object geometry, and locate the function or code path that determines the required artifact. The paired efficiency summary reports 18.7% fewer input tokens and 8.6% lower wall time for GoSkills on this task. We therefore treat this as a qualitative mechanism case rather than a standalone proof of a reward gap.

#### Case Study 4: Compact tasks where graph retrieval is enough\.

The 3d-scan-calc baseline trajectory is an important control. Graph of Skills completes the task with reward 1.0 in 96.399s and 33,853 tokens. Similar completed-success rows appear for tasks such as flood-risk-analysis and energy-market-pricing. These cases prevent an overly strong reading of the method: when the dependency-aware bundle already exposes the executable decomposition, GoSkills should be expected mainly to preserve coverage or modestly affect efficiency, not to produce a qualitatively different outcome.

#### Case Study 5: Completed failures and setup\-heavy failures\.

adaptive-cruise-control and earthquake-phase-association show the other boundary condition. In the Graph of Skills baseline, both are completed reward-0 outcomes; the former uses 177,283 tokens and the latter uses 173,300 tokens. These are not infrastructure failures, but they also should not be interpreted as pure retrieval misses. They are consistent with long-horizon execution or verifier-alignment failures where even a relevant bundle may not be sufficient. By contrast, setup-fuzzing-py does not produce an available reward artifact, so it is tracked as infrastructure/setup evidence rather than a clean method-quality comparison. The task-level efficiency analysis also shows that GoSkills can increase input tokens on this setup-heavy task, which is consistent with the need for broader support context.

#### Infrastructure handling\.

Episodes are separated from completed runs whenever the environment fails before a meaningful trajectory or the reward artifact is unavailable\. Completed episodes are retained with their observed rewards, even when they fail the task\. This policy is important for the case studies above: Azure BGP, ThreeJS parsing, adaptive cruise control, and earthquake phase association are substantive completed outcomes, whereas setup\-heavy environment failures are not used as direct evidence for retrieval quality\.

Qualitative takeaway. The qualitative pattern is deliberately narrower than the main aggregate table. GoSkills is most useful when a task has a short but nontrivial anchor/support/check decomposition that the downstream agent must operationalize under a tight context budget. On such tasks, role-labeled group retrieval can turn a retrieved skill set into a more actionable execution contract. When Graph of Skills already exposes the necessary dependency chain, the expected outcome is often a tie. When the task requires long-horizon execution, environment setup, or hidden verifier alignment, retrieval organization alone is not sufficient. These cases therefore support the scoped claim of the paper: group-level retrieval is best viewed as an agent-facing organization mechanism that can preserve visible-requirement coverage and reduce search friction, rather than as a guarantee of uniform task-level dominance.

## Appendix H Additional Weaknesses

This section summarizes the main limitations of the current implementation and evaluation protocol. These weaknesses do not change the central goal of GoSkills: studying whether group-structured retrieval can make skill contexts more usable under a bounded prompt budget. However, they clarify where the reported results should be interpreted with care.

#### Dependence on deterministic schema extraction\.

GoSkills relies on deterministic query-schema extraction, typed skill metadata, and visible requirement cues. This design makes the retrieval process transparent and reproducible, but it may be less robust when task prompts are ambiguous, metadata are sparse, or the relevant requirements are only implicit. In particular, the method can miss support skills when a task does not expose clear technology anchors, artifact names, output formats, or failure cues. The current evaluation therefore best reflects settings where task-visible requirements can be extracted reliably before execution.

#### Fixed scoring rules\.

The group, support, and bottleneck scores are fixed weighted sums over handcrafted feature families\. This makes the method easy to inspect and keeps the retrieval policy unchanged across backbones and benchmarks\. However, the coefficients may implicitly fit the development distribution and may not be optimal for other skill libraries or task domains\. A learned or validation\-tuned scoring policy could improve transfer, but would also introduce additional training data requirements and make it harder to isolate the effect of group\-structured retrieval itself\.
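A fixed weighted-sum scoring rule of this general form can be sketched as follows. The feature names, the weight values, and the group identifiers are invented for illustration and are not the coefficients or feature families used in the paper. Only the structure (a fixed linear combination over handcrafted features, with at most three groups selected) follows the text.

```python
# Hypothetical feature families and weights, fixed across tasks.
GROUP_WEIGHTS = {
    "semantic": 0.5,
    "requirement_overlap": 0.3,
    "graph_support": 0.2,
}


def weighted_score(features, weights=GROUP_WEIGHTS):
    """Fixed weighted sum over feature families; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rank_groups(groups, weights=GROUP_WEIGHTS, top_n=3):
    """Return the names of the `top_n` highest-scoring groups."""
    ranked = sorted(
        groups.items(),
        key=lambda kv: weighted_score(kv[1], weights),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_n]]
```

Because the weights are constants rather than learned parameters, the ranking is fully inspectable. The cost, as noted above, is that the coefficients may not transfer to other skill libraries.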

## Appendix I Broader impacts

This work studies inference\-time retrieval and contextualization for existing agent skill libraries\. Potential positive impacts include more efficient use of large skill libraries, lower downstream execution cost, and more transparent agent\-facing context through explicit START, SUPPORT, CHECK, AVOID, and DEBT fields\. Potential negative impacts include over\-reliance on retrieved guidance, more capable automated coding agents being used for unsafe automation, and amplification of unsafe or low\-quality skills if such skills are present in the underlying library\. The method does not itself filter unsafe skills, verify semantic correctness, or prevent misuse; deployment should therefore pair retrieval with library curation, policy checks, and task\-level monitoring\.

## Appendix J Compute resources and API usage

All experiments are API-based agent evaluations. We do not train or fine-tune any model and do not use GPUs. The benchmark workers run the downstream agent loop, retrieval code, logging, and task evaluators on CPU machines. Each worker uses 8 vCPUs and 32 GB of RAM. The experiments are executed on Ubuntu 22.04 cloud VMs with the same software environment and benchmark versions across methods.

For each model–method–benchmark combination, we run three repeats\. The main tables report mean input tokens and mean agent\-only task\-processing runtime, excluding environment setup\. Infrastructure failures, task timeouts, and valid task outcomes are tracked separately in the run provenance table\. API\-based model inference is performed through the corresponding provider endpoints, with model snapshots and access dates reported in the references or implementation appendix\.
