
# SkillGen: Verified Inference-Time Agent Skill Synthesis
Source: [https://arxiv.org/html/2605.10999](https://arxiv.org/html/2605.10999)
Yuchen Ma¹, Yue Huang², Han Bao², Haomin Zhuang², Swadheen Shukla³, Michel Galley³, Xiangliang Zhang², Stefan Feuerriegel¹
¹Munich Center for Machine Learning, LMU Munich; ²University of Notre Dame; ³Microsoft Research

###### Abstract

Skills are a promising way to improve LLM agent capabilities without retraining, while keeping the added procedure reusable and controllable. However, high-quality skills are still largely written by hand. We introduce SkillGen, a multi-agent framework that synthesizes a single auditable skill from trajectories generated by a base agent. The output is a human-readable artifact that can be inspected before use. Rather than merely summarizing trajectories, SkillGen leverages contrastive induction over both successful and failed trajectories to identify reusable success patterns, recurring failure modes, and behaviors that appear in nearby successes but are missing from failures. SkillGen then generates candidate skills and iteratively refines the skill. A key novelty in SkillGen is that we model agent skills as interventions to empirically verify the net effect of skills on the overall performance. Specifically, we compare outcomes on the same instances with and without the skill, so that we account for both repairs (cases where the skill fixes a baseline failure) and regressions (cases where the skill breaks a baseline success). Across a broad range of agents and datasets, SkillGen consistently improves held-out performance, outperforms existing skill-generation baselines, and produces skills that transfer across models.

## 1 Introduction

Large language models (LLMs) are increasingly used to solve complex, multi-step tasks (Schick et al., [2023](https://arxiv.org/html/2605.10999#bib.bib18); Qin et al., [2024](https://arxiv.org/html/2605.10999#bib.bib19); Yao et al., [2023](https://arxiv.org/html/2605.10999#bib.bib15); Wang et al., [2023a](https://arxiv.org/html/2605.10999#bib.bib12)). A common way to formalize such behavior is through *skills*: reusable, inference-time procedures that encode task-specific guidance, such as instructions, executable code, and domain knowledge, without modifying model weights (Zhang et al., [2025](https://arxiv.org/html/2605.10999#bib.bib1); Anthropic, [2025](https://arxiv.org/html/2605.10999#bib.bib2)). Skills are modular and auditable: because they are readable inference-time artifacts rather than weight updates or prompt searches, one can inspect the procedure they encode, revise it directly, and test its effect before deployment. In practice, however, high-quality skills are still largely hand-written.

Automated skill synthesis aims to learn reusable skills from agent experience (Shinn et al., [2023](https://arxiv.org/html/2605.10999#bib.bib14); Zhao et al., [2024](https://arxiv.org/html/2605.10999#bib.bib13); Ni et al., [2026](https://arxiv.org/html/2605.10999#bib.bib5); Alzubi et al., [2026](https://arxiv.org/html/2605.10999#bib.bib6); Wang et al., [2026a](https://arxiv.org/html/2605.10999#bib.bib7); Zhang et al., [2026](https://arxiv.org/html/2605.10999#bib.bib37)). However, existing methods have two key shortcomings. First, existing methods primarily learn from successful trajectories, and even when failures are considered, they are typically summarized in isolation rather than contrasted against nearby successes on the same task. As a result, prior work misses a key contrastive signal between success and failure, that is, what the agent executes correctly in similar contexts and what it omits in failed rollouts. For example, a successful trajectory may include an intermediate validation step that is absent in a failed attempt, but success-only learning does not isolate the importance of that intermediate validation and does not encode it as a reusable pattern. Second, existing methods do not explicitly *verify* the empirical benefit of a generated skill. While a skill may repair some failures, it can also introduce new failure modes on cases that the agent previously solved correctly. As a result, skill synthesis is fundamentally an interventional problem, where one compares the net effect on the agent's performance with and without the candidate skill. Such performance evaluation is also necessary to eventually build iterative approaches that refine candidate skills in a principled manner.

We introduce SkillGen: a *multi-agent framework for automatic, inference-time skill synthesis* (see Fig. [1](https://arxiv.org/html/2605.10999#S2.F1)). SkillGen takes an existing dataset of LLM trajectories as input and derives a single auditable skill: a readable intervention whose task context, success procedures, and failure lessons can be inspected and whose empirical net effect is verified before deployment. The input dataset can be collected during a baseline elicitation phase to compile successful and failed trajectories. Our framework operates through three specialized agents: (1) A *contrastive induction agent* analyzes the input trajectories to extract reusable success patterns *and* identify recurring failure modes, with the aim of surfacing contrasts between successful vs. failed rollouts. As a result, it outputs a compact and interpretable summary with task diagnostics. (2) In a generation–verification–refinement loop, the diagnostics are converted into candidate skills, and the skills are then iteratively refined based on feedback (using a *generation agent* and a *verification agent*). The final skill is selected by measuring the net effect on the final held-out performance. This ensures that the selected skill improves the overall performance and thus accounts for "repairs" (i.e., when a skill fixes a failure) and "regressions" (i.e., when a skill breaks a correct case). To the best of our knowledge, SkillGen is the first agentic framework that models inference-time skill synthesis as an intervention problem to ensure a positive, empirically verified effect on performance.

We also demonstrate the effectiveness of SkillGen across a broad range of interactive, scientific, coding, and other tool-use benchmarks. We further evaluate SkillGen using several open-weight and proprietary base LLMs. As our main result, SkillGen improves average accuracy for all eight evaluated base LLMs, with held-out gains ranging from +3.27 to +10.08 percentage points. We also compare SkillGen against state-of-the-art skill-generation baselines (Ni et al., [2026](https://arxiv.org/html/2605.10999#bib.bib5); Wang et al., [2026a](https://arxiv.org/html/2605.10999#bib.bib7); Alzubi et al., [2026](https://arxiv.org/html/2605.10999#bib.bib6); Zhang et al., [2026](https://arxiv.org/html/2605.10999#bib.bib37)), where SkillGen is consistently positive and achieves the largest average improvement by a clear margin. Our ablations show that contrastive induction, verification-guided refinement, and the verification gate each contribute to the performance gains. We also perform a cross-model transfer analysis to show that generated skills are generalizable and not tied to the LLM that produced them.

Contributions. Our main contributions are three-fold (code is available via [https://github.com/yccm/SkillGen](https://github.com/yccm/SkillGen)): (1) We formulate a general, end-to-end learning task for automatic inference-time skill synthesis: to produce a single, auditable skill that improves a base agent. (2) We introduce SkillGen, a multi-agent framework that learns from both failed and successful trajectories via contrastive induction, and then generates new candidate skills that are iteratively refined and verified. The final skills are selected to have a positive net effect on the overall performance. (3) We provide an extensive empirical study demonstrating consistent and large held-out performance gains. SkillGen outperforms state-of-the-art skill-generation baselines and produces skills that transfer across models without parameter updates.

## 2 Preliminaries

We view inference-time skills as *interventions* that modify the behavior of a base agent and thereby change its task performance. This perspective naturally induces a comparison between outcomes with and without a given skill on the same inputs.

Task setting. Let $\mathcal{X}$ be the input space, $p$ a task distribution over $\mathcal{X}$, and $\mathcal{T}$ the space of agent trajectories. A trajectory $\tau \in \mathcal{T}$ consists of the full sequence of LLM interactions, including messages, tool calls, environment observations, and the final output. For skill synthesis, we split the training data into: (i) an induction subset $\mathcal{D}_{\mathrm{ind}} = \{x_i\}_{i=1}^{n}$ used to analyze agent behavior, and (ii) a construction-time verification subset $\mathcal{D}_{\mathrm{ver}} = \{\tilde{x}_j\}_{j=1}^{m}$ used for evaluating and selecting candidate skills. We consider a *base agent* $\mathcal{A}$ that maps inputs to trajectories and that we seek to improve upon. We model $\mathcal{A}$ as a stochastic trajectory kernel $P_{\mathcal{A}}(\tau \mid x; \eta)$, where $x$ is the task instance and $\eta$ is an inference-time intervention loaded into the agent's context. The empty intervention $\eta = \varnothing$ defines the "no-skill" behavior, with $\tau^{0}(x) \sim P_{\mathcal{A}}(\cdot \mid x; \varnothing)$.

To formalize the outcome $Y$, we define a task-level evaluator $\mathcal{E}: \mathcal{X} \times \mathcal{T} \rightarrow [0,1]$. In practice, this could be an LLM-as-a-judge, a benchmark score, or a successful check against some environment outcome. As a result, the evaluator assigns a success probability to each instance–trajectory pair. The observed outcome is $Y(x,\tau) \sim \operatorname{Bernoulli}(\mathcal{E}(x,\tau))$, with deterministic evaluators as the special case $\mathcal{E}(x,\tau) \in \{0,1\}$. For any instance $x$, we define the baseline outcome $Y^{0}(x) = Y(x, \tau^{0}(x))$; for induction instances, we write $\tau_i^{0} = \tau^{0}(x_i)$ and let $y_i^{0}$ denote the realized outcome of the base agent.
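As a concrete illustration (not from the paper), a toy evaluator and its sampled outcome might look as follows in Python; `evaluate` and `observed_outcome` are hypothetical names:

```python
import random

def evaluate(x: str, trajectory: list[str]) -> float:
    """Task-level evaluator E(x, tau) in [0, 1].

    Hypothetical stand-in: a real evaluator could be an LLM judge, a
    benchmark scorer, or an environment check. Here we return 1.0 iff
    the final output contains a toy expected answer.
    """
    expected = "42"  # placeholder ground truth for this instance x
    return 1.0 if trajectory and expected in trajectory[-1] else 0.0

def observed_outcome(x: str, trajectory: list[str], rng: random.Random) -> int:
    """Y(x, tau) ~ Bernoulli(E(x, tau)); deterministic evaluators are the 0/1 special case."""
    p = evaluate(x, trajectory)
    return 1 if rng.random() < p else 0

rng = random.Random(0)
print(observed_outcome("what is 6*7?", ["thinking...", "the answer is 42"], rng))  # -> 1
```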

Skill interventions: We define a candidate *skill* as an inference-time intervention $s = (u, a, \mathcal{P}, \mathcal{R})$, where $u$ is a structured prompt, $a$ is task metadata (e.g., a task description), $\mathcal{P}$ is an optional set of executable scripts, and $\mathcal{R}$ is an optional collection of auxiliary documents. Together, these components define the skill space considered by SkillGen.

We model skills as interventions that change the agent's behavior and thus its outcomes. To make comparative assessments of skill learning, we adopt the potential outcomes framework (Rubin, [2005](https://arxiv.org/html/2605.10999#bib.bib41)) as a principled way to formalize treatment effects. For any input $x$ and a candidate skill $s$, we define two potential outcomes: the baseline outcome $Y^{0}(x) = Y(x, \tau^{0}(x))$ and the skill-augmented outcome $Y^{s}(x) = Y(x, \tau^{s}(x))$ with $\tau^{s}(x) \sim P_{\mathcal{A}}(\cdot \mid x; \eta(s))$. Loading a skill $s$ corresponds to applying the intervention $\eta(s)$ to $\mathcal{A}$.
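To fix ideas, here is a minimal Python container for the skill tuple $s = (u, a, \mathcal{P}, \mathcal{R})$ and the induced intervention $\eta(s)$; all field names are our own illustration, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """Candidate skill s = (u, a, P, R); field names are illustrative."""
    prompt: str                                                # u: structured prompt body
    metadata: dict                                             # a: task metadata (e.g., task description)
    scripts: dict[str, str] = field(default_factory=dict)      # P: optional executable scripts
    references: dict[str, str] = field(default_factory=dict)   # R: optional auxiliary documents

def intervention(skill: Skill | None) -> str:
    """eta(s): text loaded into the agent's context; None is the empty intervention (no-skill baseline)."""
    return "" if skill is None else skill.prompt
```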

Objective: Our goal is to measure and maximize the expected effect of a skill relative to the baseline agent:

$$\Delta(s) = \mathbb{E}_{x \sim p}\left[\mathbb{E}\left[Y^{s}(x) \mid x, s\right] - \mathbb{E}\left[Y^{0}(x) \mid x\right]\right]. \quad (1)$$

Thus, $\Delta(s)$ captures the net effect induced by a skill intervention $s$: it measures how much the skill improves (or degrades) performance on the same input distribution, while accounting for both "repairs" (i.e., cases where the skill fixes a baseline failure) and "regressions" (i.e., cases where the skill breaks a baseline success). The objective for skill synthesis is therefore to select a skill with a positive net effect on held-out performance, without relying on human-written task-specific skills. During construction, each candidate skill is evaluated on $\mathcal{D}_{\mathrm{ver}}$ under identical inputs with and without the skill. As a result, we yield a so-called *status* $\sigma_{\mathrm{ver}}(s; \mathcal{D}_{\mathrm{ver}}) \in \{\mathrm{active}, \mathrm{deprecated}\}$. At deployment, only active skills are loaded; deprecated skills are subsumed under the empty intervention $\varnothing$.

![Refer to caption](https://arxiv.org/html/2605.10999v1/x1.png)
Figure 1: SkillGen overview. Our multi-agent framework synthesizes a single auditable skill from baseline trajectories. (1) It first elicits successful and failed rollouts as input. (2) It extracts reusable success patterns and failure modes. (3) It follows an iterative generation–verification–refinement loop to generate and refine new candidate skills.
## 3 SkillGen

Overview. SkillGen takes as input: a base agent, a set of observed LLM trajectories (split into an induction subset and a verification subset), and a task-level evaluator. SkillGen then returns a single auditable skill. SkillGen follows an agentic, three-stage framework (see Fig. [1](https://arxiv.org/html/2605.10999#S2.F1); pseudocode in Alg. [1](https://arxiv.org/html/2605.10999#alg1) in Appendix [A](https://arxiv.org/html/2605.10999#A1)). *Stage 1* (baseline elicitation): this stage uses the base agent to collect successful vs. failed trajectories. *Stage 2* (contrastive induction): this stage extracts recurring failure modes to identify local patterns that distinguish successful vs. failed rollouts; the patterns are combined into a compact, interpretable summary of task-level diagnostics (via the *induction agent*). *Stage 3* (an iterative generation–verification–refinement loop): this stage turns the diagnostics into candidate skills (via the *generation agent*), tests each candidate skill on the verification subset for performance evaluation (via the *verification agent*), and then refines the candidate skill. Finally, the candidate skill with the largest construction-time net effect $\Delta(s)$ is returned.

### 3.1 Stage 1: Baseline Elicitation

We first run the base agent on the induction subset and store

$$\mathcal{B} = \{(x_i, \tau_i^{0}, y_i^{0})\}_{i=1}^{n}, \qquad \mathcal{I}^{-} = \{i : y_i^{0} = 0\}, \quad \mathcal{I}^{+} = \{i : y_i^{0} = 1\}. \quad (2)$$

Here, $\mathcal{I}^{-}$ indexes failed baseline rollouts and $\mathcal{I}^{+}$ indexes successful ones. Intuitively, failures show where the base agent needs help, and successes show procedures the base agent can already execute. Using both strata is unique to SkillGen and important: failures alone can produce misleading or unhelpful advice, while successes alone do not identify the capability gap.

We further cache no\-skill outcomes on the construction\-time verification subset, i\.e\.,

$$\mathcal{B}_{\mathrm{ver}} = \{(\tilde{x}_j, \tilde{\tau}_j^{0}, b_j)\}_{j=1}^{m}, \qquad \tilde{\tau}_j^{0} \sim P_{\mathcal{A}}(\cdot \mid \tilde{x}_j; \varnothing), \qquad b_j = Y(\tilde{x}_j, \tilde{\tau}_j^{0}). \quad (3)$$

The cached outcomes are used neither by the induction agent nor by the first-round generation agent; instead, they are used later for construction-time verification and subsequent refinement. The main motivation is that we can later compare each candidate skill against the behavior of the no-skill agent on the same verification subset.
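A minimal sketch of Stage 1, assuming a hypothetical `agent.rollout(x, intervention=...)` interface and reusing the hypothetical `observed_outcome` from the evaluator sketch above:

```python
def elicit_baseline(agent, d_ind, d_ver, rng):
    """Stage 1 sketch: roll out the base agent with the empty intervention
    and record paired (input, trajectory, outcome) tuples."""
    B = []
    for x in d_ind:
        tau = agent.rollout(x, intervention="")      # tau^0(x) ~ P_A(. | x; empty)
        B.append((x, tau, observed_outcome(x, tau, rng)))  # realized y_i^0
    I_minus = [i for i, (_, _, y) in enumerate(B) if y == 0]  # failed rollouts
    I_plus = [i for i, (_, _, y) in enumerate(B) if y == 1]   # successful rollouts
    B_ver = []
    for xj in d_ver:                                 # cache no-skill outcomes b_j on D_ver
        tau0 = agent.rollout(xj, intervention="")
        B_ver.append((xj, tau0, observed_outcome(xj, tau0, rng)))
    return B, I_minus, I_plus, B_ver
```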

### 3.2 Stage 2: Contrastive Behavioral Induction

The *induction agent* compresses baseline trajectories into an explicit diagnostic summary for skill synthesis:

$$\operatorname{Compress}(\mathcal{B}) = \mathcal{Z} = (a_0, \mathcal{F}, \mathcal{S}, \mathcal{C}), \quad (4)$$

where $a_0$ is a task-level summary of the induction inputs, $\mathcal{F}$ is a set of cluster-level failure summaries, $\mathcal{S}$ is a set of cluster-level success summaries, and $\mathcal{C}$ is a set of local contrastive observations between nearby failed and successful rollouts. Any of the three set-valued components may be empty (e.g., if the corresponding success or failure stratum is empty). Stage 3 receives only $\mathcal{Z}$, so this stage converts variable-length LLM trajectories into a lower-dimensional diagnostic summary for skill generation.
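One possible container for the diagnostic summary $\mathcal{Z}$, as a sketch; the paper does not prescribe a concrete data structure:

```python
from dataclasses import dataclass, field

@dataclass
class Diagnostics:
    """Z = (a0, F, S, C); names illustrative, not from the paper."""
    task_summary: str                                             # a0: task-level summary
    failure_summaries: list[str] = field(default_factory=list)    # F: cluster-level failure modes
    success_summaries: list[str] = field(default_factory=list)    # S: cluster-level success patterns
    contrasts: list[str] = field(default_factory=list)            # C: local success-vs-failure observations
```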

• Task summary ($a_0$). The induction agent applies a fixed abstraction prompt, denoted by $\operatorname{Abs}$, to the induction inputs and writes a task summary $a_0 = \operatorname{Abs}(\{x_i\}_{i=1}^{n})$. The summary is designed to describe the task family rather than any single instance, giving the generation agent a general description of what the skill is for. The component $a_0$ is part of the diagnostic summary $\mathcal{Z}$ and is distinct from the skill metadata $a$ in $s = (u, a, \mathcal{P}, \mathcal{R})$.

• Failure analysis ($\mathcal{F}$). For each failed rollout $i \in \mathcal{I}^{-}$, the induction agent writes a root-cause summary $\rho_i^{-}$. Let $\phi$ be a text encoder applied to a serialized input–summary pair, and define $e_i^{-} = \phi([x_i; \rho_i^{-}])$. We cluster the resulting failure embeddings via

$$\Pi^{-} = \operatorname{Cluster}\big(\{e_i^{-} : i \in \mathcal{I}^{-}\}\big), \qquad f_P = \operatorname{Summ}^{-}\big(\{(x_i, \tau_i^{0}, \rho_i^{-}) : i \in P\}\big) \quad (5)$$

for each cluster $P \in \Pi^{-}$. Each $f_P$ is a cluster-level failure summary: it describes the recurring root cause, the trajectory point at which the failure typically appears, and the corrective rule that would avoid the failure. The resulting set is $\mathcal{F} = \{f_P\}_{P \in \Pi^{-}}$.

• Success analysis ($\mathcal{S}$). For each $i \in \mathcal{I}^{+}$, an LLM writes a success summary $\rho_i^{+}$, and we further embed $e_i^{+} = \phi([x_i; \rho_i^{+}])$. We apply the same embedding and clustering procedure to successful rollouts:

$$\Pi^{+} = \operatorname{Cluster}\big(\{e_i^{+} : i \in \mathcal{I}^{+}\}\big), \qquad h_P = \operatorname{Summ}^{+}\big(\{(x_i, \tau_i^{0}, \rho_i^{+}) : i \in P\}\big), \quad (6)$$

and $\mathcal{S} = \{h_P\}_{P \in \Pi^{+}}$. Each $h_P$ is a cluster-level success summary: it describes the reusable procedure, the task conditions under which it appears, and checks that make the procedure robust.

• Local contrastive analysis ($\mathcal{C}$). The above cluster summaries can miss some of the "small" action choices that separate a success from a failure. When $\mathcal{I}^{+}$ is non-empty, for each failed instance $i \in \mathcal{I}^{-}$, we retrieve the nearest successful neighbor under embedding distance $d$, i.e.,

$$j(i) = \operatorname*{arg\,min}_{j \in \mathcal{I}^{+}} d(e_i^{-}, e_j^{+}). \quad (7)$$

The induction agent first checks whether $x_i$ and $x_{j(i)}$ share the same task type. If they do, the induction agent compares the two full trajectories and generates a contrastive observation

$$c_i = \operatorname{Contr}(x_i, \tau_i^{0}, x_{j(i)}, \tau_{j(i)}^{0}),$$

which describes the behavior present in the successful rollout and the corresponding behavior omitted in the failed rollout. The observations whose pairs pass the same-task check form $\mathcal{C}$. Thus, $\mathcal{C}$ provides local contrastive evidence: it anchors advice in behavior that the same base agent has already demonstrated, but that was absent in a nearby failure.
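The clustering and nearest-neighbor steps of Eqs. (5)–(7) can be sketched as follows; the paper leaves $\operatorname{Cluster}$ and the distance $d$ abstract, so k-means and Euclidean distance here are our assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_failures(e_minus: np.ndarray, k: int = 3) -> np.ndarray:
    """Pi^-: one concrete choice of Cluster(.); k-means is an assumption,
    since the paper does not fix the clustering algorithm."""
    n_clusters = min(k, len(e_minus))
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(e_minus)

def nearest_success(e_minus: np.ndarray, e_plus: np.ndarray) -> np.ndarray:
    """For each failure embedding e_i^-, return the index j(i) of the nearest
    successful neighbor (Eq. 7) under Euclidean distance."""
    # pairwise distances, shape (n_minus, n_plus)
    d = np.linalg.norm(e_minus[:, None, :] - e_plus[None, :, :], axis=-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
e_minus = rng.normal(size=(5, 8))   # toy phi([x_i; rho_i^-]) embeddings
e_plus = rng.normal(size=(7, 8))    # toy phi([x_i; rho_i^+]) embeddings
print(cluster_failures(e_minus))         # cluster labels Pi^-
print(nearest_success(e_minus, e_plus))  # j(i) for each failed instance
```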

### 3.3 Stage 3: Generation–Verification–Refinement Loop

Overview. Stage 3 turns the diagnostic summary $\mathcal{Z}$ into a sequence of candidate skills and uses paired verification to decide which candidate should be deployed. The loop is designed to repair baseline failures while explicitly tracking regressions on instances that the base agent already solved. It consists of four steps: (i) *generation*, which produces candidate skills from the diagnostic summary; (ii) *verification*, which evaluates each candidate on the verification subset; (iii) *refinement*, which updates candidates using structured feedback from repairs and regressions; and (iv) *selection*, which returns the candidate with the largest verified net gain for deployment. Let $K \geq 1$ denote the round budget. We index rounds by $r \in \{1, \ldots, K\}$, write $s^{(r)}$ for the candidate skill at round $r$, write $\Phi^{(r)}$ for the feedback produced after verifying $s^{(r)}$, and write $s^{\star}$ for the selected skill.

• (i) Generation: In each round $r$, the *generation agent* uses the diagnostic summary $\mathcal{Z}$ as well as feedback from the previous round (with $\Phi^{(0)} = \varnothing$) to produce a new candidate skill

$$s^{(r)} = (u^{(r)}, a^{(r)}, \mathcal{P}^{(r)}, \mathcal{R}^{(r)}). \quad (8)$$

*Skill structure:* To write the new candidate skill, we use a prompt template with a fixed three-part schema

$$u^{(r)} = (u^{(r)}_{\mathrm{ctx}}, u^{(r)}_{\mathrm{succ}}, u^{(r)}_{\mathrm{fail}}) \quad (9)$$

in natural language, where: (i) $u_{\mathrm{ctx}}$ encodes task context, i.e., a concise description of the task distribution and constraints (derived from $a_0$); (ii) $u_{\mathrm{succ}}$ encodes reusable success patterns distilled from $\mathcal{S}$ and the successful instances from the contrastive analysis $\mathcal{C}$; and (iii) $u_{\mathrm{fail}}$ encodes reusable failure-avoidance patterns derived from $\mathcal{F}$ and the negative instances from the contrastive analysis $\mathcal{C}$. The above schema acts as a constrained projection from the diagnostic summary to the skill space. Intuitively, the idea is to learn reusable patterns that characterize successful vs. failed instances, so that refinement can encourage the former and avoid the latter.

For tool-intensive tasks, the generation agent may additionally emit scripts $\mathcal{P}^{(r)}$ and reference documents $\mathcal{R}^{(r)}$; however, after round $r > 1$, refinement edits are restricted to the natural-language body $u^{(r)}$, which keeps the tool interface fixed to prevent uncontrolled expansion.
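A minimal sketch of how the three-part schema of Eq. (9) could be rendered into a single prompt body; the section headers are illustrative, not the paper's template:

```python
def assemble_skill_body(ctx: str, succ: list[str], fail: list[str]) -> str:
    """u = (u_ctx, u_succ, u_fail): render the fixed three-part schema as one
    prompt body. Headers and bullet style are assumptions for illustration."""
    lines = ["## Task context", ctx, "", "## Success patterns"]
    lines += [f"- {p}" for p in succ]
    lines += ["", "## Failure lessons"]
    lines += [f"- {p}" for p in fail]
    return "\n".join(lines)

print(assemble_skill_body(
    "Household tasks: locate, transform, and place objects.",
    ["Verify the object is in hand before moving to the target receptacle."],
    ["Do not issue 'put' before observing a successful 'take'."],
))
```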

• (ii) Verification: The *verification agent* evaluates each candidate skill on all instances in the verification subset $\mathcal{D}_{\mathrm{ver}}$. For a candidate skill $s$, we load the intervention $\eta(s)$ into the base agent and roll it out on each $\tilde{x}_j \in \mathcal{D}_{\mathrm{ver}}$, i.e.,

$$z_j(s) = Y(\tilde{x}_j, \tilde{\tau}_j^{s}), \qquad \tilde{\tau}_j^{s} \sim P_{\mathcal{A}}(\cdot \mid \tilde{x}_j; \eta(s)). \quad (10)$$

*Causal evaluation of the skill intervention:* We treat a candidate skill $s$ as an intervention on the base agent and evaluate its effect by comparing outcomes with and without the intervention on the same inputs. For each $\tilde{x}_j \in \mathcal{D}_{\mathrm{ver}}$, we observe the baseline outcome $b_j = Y^{0}(\tilde{x}_j)$ and the skill-augmented outcome $z_j(s) = Y^{s}(\tilde{x}_j)$. Applying the skill to all verification instances yields a direct comparison between $Y^{0}$ and $Y^{s}$ on identical inputs. In this view, *"repairs"* correspond to $Y^{0} = 0 \rightarrow Y^{s} = 1$, while *"regressions"* correspond to $Y^{0} = 1 \rightarrow Y^{s} = 0$.

*Comparative metrics.* We aggregate outcomes via

$$n_{\alpha\beta}(s) = \sum_{j=1}^{m} \mathbf{1}\{Y^{0}(\tilde{x}_j) = \alpha,\, Y^{s}(\tilde{x}_j) = \beta\}, \quad (11)$$

with repairs $n_{01}(s)$ and regressions $n_{10}(s)$. The empirical net effect under this comparison is

$$\widehat{\Delta}_m(s) = \frac{1}{m} \sum_{j=1}^{m} \big(Y^{s}(\tilde{x}_j) - Y^{0}(\tilde{x}_j)\big) = \frac{n_{01}(s) - n_{10}(s)}{m}, \qquad G_m(s) = n_{01}(s) - n_{10}(s). \quad (12)$$

For a fixed, non-adaptively chosen skill and i.i.d. verification instances, $\mathbb{E}[\widehat{\Delta}_m(s)] = \Delta(s)$.
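The paired counts of Eq. (11) and the net-effect estimate of Eq. (12) reduce to a few lines; `paired_metrics` below is a direct transcription, run on toy outcomes:

```python
def paired_metrics(b: list[int], z: list[int]) -> tuple[int, int, float, int]:
    """Repairs n01, regressions n10, empirical net effect Delta_hat_m,
    and net gain G_m from paired baseline/skill outcomes on D_ver."""
    n01 = sum(1 for bj, zj in zip(b, z) if bj == 0 and zj == 1)  # repairs: 0 -> 1
    n10 = sum(1 for bj, zj in zip(b, z) if bj == 1 and zj == 0)  # regressions: 1 -> 0
    m = len(b)
    return n01, n10, (n01 - n10) / m, n01 - n10

# toy paired outcomes over m = 8 instances: 3 repairs, 1 regression
b = [0, 0, 0, 1, 1, 1, 0, 1]   # baseline outcomes b_j = Y^0(x_j)
z = [1, 1, 1, 1, 1, 0, 0, 1]   # skill-augmented outcomes z_j(s) = Y^s(x_j)
print(paired_metrics(b, z))    # -> (3, 1, 0.25, 2)
```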

• (iii) Refinement: Refinement uses structured feedback to update the skill.

*Feedback signals.* After each round, the verification agent summarizes the diagnostic evidence rather than sending raw trajectories back to the generation agent. For this, the verification agent partitions instances into

$$\mathcal{Q}^{(r)}_{\mathrm{repair}} = \{j : b_j = 0,\, z_j(s^{(r)}) = 1\}, \quad \mathcal{Q}^{(r)}_{\mathrm{regress}} = \{j : b_j = 1,\, z_j(s^{(r)}) = 0\}, \quad \mathcal{Q}^{(r)}_{\mathrm{fail}} = \{j : b_j = 0,\, z_j(s^{(r)}) = 0\}. \quad (13)$$

Here, $\mathcal{Q}^{(r)}_{\mathrm{repair}}$ contains baseline failures repaired by $s^{(r)}$, $\mathcal{Q}^{(r)}_{\mathrm{regress}}$ contains baseline successes broken by $s^{(r)}$, and $\mathcal{Q}^{(r)}_{\mathrm{fail}}$ contains baseline failures that remain unresolved.

*Feedback aggregation.* The verification agent creates explanations of how the skill affected selected repairs, regressions, and unresolved failures. It then aggregates these explanations into

$$\Phi^{(r)} = \big(\Phi^{(r)}_{\mathrm{keep}}, \Phi^{(r)}_{\mathrm{remove}}, \Phi^{(r)}_{\mathrm{add}}, \Phi^{(r)}_{\mathrm{emphasize}}\big), \quad (14)$$

which specifies, for the next round, which parts of the current skill to keep, remove, add, and emphasize.

*Update rule.* The refinement uses the following update (to avoid writing a new prompt from scratch):

$$s^{(r)} = \begin{cases} \operatorname{Gen}(\mathcal{Z}), & r = 1, \\ \operatorname{Refine}(s^{(r-1)}, \mathcal{Z}, \Phi^{(r-1)}), & r > 1. \end{cases} \quad (15)$$
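Putting Eqs. (13)–(15) together, the round structure of the loop can be sketched as follows, with hypothetical `gen`, `refine`, and `verify` callables standing in for the generation and verification agents:

```python
def refinement_loop(gen, refine, verify, Z, K: int):
    """Stage 3 sketch: s^(1) = Gen(Z), then s^(r) = Refine(s^(r-1), Z, Phi^(r-1))
    for r > 1, keeping every round's verified net gain G_m for later selection."""
    candidates, s, phi = [], None, None
    for r in range(1, K + 1):
        s = gen(Z) if r == 1 else refine(s, Z, phi)
        g_m, phi = verify(s)   # G_m(s^(r)) plus feedback Phi^(r) (keep/remove/add/emphasize)
        candidates.append((g_m, s))
    return candidates          # consumed by best-of-K selection below
```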
• (iv) Final skill selection: Since later refinement rounds need not improve empirical performance, SkillGen performs a best-of-$K$ selection over the candidate sequence $\{s^{(r)}\}_{r=1}^{K}$ and returns the candidate skill with the largest construction-time net gain $G_m$, i.e.,

$$r^{\star} = \operatorname*{arg\,max}_{1 \leq r \leq K} G_m(s^{(r)}), \qquad s^{\star} = s^{(r^{\star})}. \quad (16)$$

*Verification gate.* A candidate skill is marked 'active' only if it satisfies

$$G_m(s^{\star}) \geq \gamma_m, \qquad \gamma_m = \max\{g_{\mathrm{abs}}, \lceil g_{\mathrm{rel}}\, m \rceil, 1\}. \quad (17)$$

Otherwise, it is marked 'deprecated' and replaced by the empty intervention. Here, $g_{\mathrm{abs}} \in \mathbb{Z}_{\geq 0}$ is an absolute minimum number of net repairs, and $g_{\mathrm{rel}} \in [0,1]$ is a relative minimum as a fraction of the construction-time verification subset. The gate is a simple construction-time safeguard: the absolute term prevents deploying candidates whose gain is negligible in count, the relative term requires the gain to scale with the size of the verification subset, and the final lower bound of 1 requires a strictly positive construction-time net gain. The threshold $\gamma_m$ defines the construction-time deployment rule used by SkillGen: across refinement rounds, we first select the candidate with the largest $G_m$, and then mark the selected skill as active if it satisfies Eq. ([17](https://arxiv.org/html/2605.10999#S3.E17)); otherwise, the skill is deprecated.
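Best-of-$K$ selection (Eq. 16) and the verification gate (Eq. 17) combine into a short selection routine; the default `g_abs`/`g_rel` values below are illustrative, not the paper's settings:

```python
import math

def select_and_gate(candidates, m: int, g_abs: int = 1, g_rel: float = 0.02):
    """Pick the round with the largest net gain G_m (Eq. 16), then apply the
    gate gamma_m = max{g_abs, ceil(g_rel * m), 1} (Eq. 17)."""
    g_star, s_star = max(candidates, key=lambda c: c[0])
    gamma_m = max(g_abs, math.ceil(g_rel * m), 1)
    status = "active" if g_star >= gamma_m else "deprecated"
    return s_star, status

cands = [(-1, "s1"), (3, "s2"), (2, "s3")]   # (G_m, skill) per round, toy values
print(select_and_gate(cands, m=100))          # -> ('s2', 'active'); gamma_m = max(1, 2, 1) = 2
```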

*Deployment:* At runtime, an active skill is injected into a dedicated slot of the system prompt. Reference documents are loaded on demand via skill_load_reference; executable scripts expose only declared top-level functions prefixed with skill_. This ensures that the deployed capability matches the verified skill.
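A minimal sketch of the deployment step, assuming a placeholder slot marker in the system prompt (the paper specifies a dedicated slot but not its format):

```python
def deploy(system_prompt: str, skill: Skill, status: str) -> str:
    """Inject an active skill into the system prompt; a deprecated skill
    collapses to the empty intervention. The {SKILL_SLOT} marker is our
    assumption for illustration."""
    body = skill.prompt if status == "active" else ""
    return system_prompt.replace("{SKILL_SLOT}", body)
```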

## 4 Experiments

We evaluate SkillGen on held-out test instances across interactive, scientific, coding, web, and tool-use benchmarks; full implementation details are in Appendix [C](https://arxiv.org/html/2605.10999#A3). All claims use paired held-out evaluations: after construction is complete, the same task instances are rolled out with and without the generated skill.

RQ1

Does SkillGen improve base agents across model families and benchmark domains?

Before any held-out rollout, each skill and its active/deprecated status is fixed using only the skill-training dataset: the induction subset for trajectory analysis and the construction-time verification subset for refinement and selection. Table [1](https://arxiv.org/html/2605.10999#S4.T1) reports the no-skill baseline accuracy, the skill-augmented accuracy, and the absolute accuracy change over 80 held-out benchmark–split–model combinations.

Table 1: Main results across open-weight and proprietary models. For each model, we report the no-skill baseline accuracy (Base), the skill-augmented accuracy (Skill), and the absolute accuracy change ($\Delta$) on held-out test instances. Values are from the paired rollout per instance under the split protocol in Appendix [C.2](https://arxiv.org/html/2605.10999#A3.SS2).

![Refer to caption](https://arxiv.org/html/2605.10999v1/x2.png)
Figure 2: Comparison with skill-generation baselines. Accuracy improvement ($\Delta$) from adding a generated skill across representative benchmark–model entries. Mini, Grok, and Gemma denote GPT-5.4-Mini, Grok-4-Fast, and Gemma-4-26B, respectively. All methods use the same evaluation harness.

Table [1](https://arxiv.org/html/2605.10999#S4.T1) shows three main patterns: (i) SkillGen improves average accuracy for all eight base agents, with gains from +3.27 to +10.08 percentage points; (ii) the effect holds across both open-weight models (+3.27 to +4.77 pp) and proprietary models (+4.79 to +10.08 pp); and (iii) out of 80 held-out benchmark–split–model entries, 50 improve, 25 remain unchanged, and only 5 show regressions. The largest gains appear on procedural, multi-step benchmarks: ALFWorld improves in 14 of 16 entries, and ScienceWorld improves for all eight agents. Further, SkillGen is especially useful when the base model has enough task capability to execute a learned procedure but still has room to improve.

![Refer to caption](https://arxiv.org/html/2605.10999v1/x3.png)
Figure 3: SkillGen ablations. $\Delta$ accuracy over a shared no-skill baseline on ALFWorld (OOD) and ChemLLMBench yield prediction. A1: ICL ($k=3$) instead of the induced skill; A2: no refinement; A3: no verification gate; A4: no failure lessons; A5: plain-text skill (no script+reference bundle); Full: complete SkillGen. Full wins on every dataset–model pair, showing that each component contributes.

![Refer to caption](https://arxiv.org/html/2605.10999v1/x4.png)
Figure 4: Cross-model skill transferability. Each heatmap reports $\Delta$ accuracy when a skill generated by a source model (row) is executed by an evaluator model (column). Diagonal cells are self-transfer, while off-diagonal cells are cross-model transfer. Right and bottom margins show transfer-out and transfer-in means, respectively; color saturates at ±30 pp. The transfer matrix is evaluated on a shared pool of 100 held-out instances per benchmark, distinct from the main evaluation split, to ensure that baseline trajectories are consistent across all 36 (source, evaluator) pairs.

RQ2

How does SkillGen compare with state-of-the-art automatic skill-generation baselines?

We compare SkillGen against four recent skill-generation baselines: Trace2Skill (Ni et al., [2026](https://arxiv.org/html/2605.10999#bib.bib5)), SkillX (Wang et al., [2026a](https://arxiv.org/html/2605.10999#bib.bib7)), EvoSkill (Alzubi et al., [2026](https://arxiv.org/html/2605.10999#bib.bib6)), and CoEvoSkills, a co-evolutionary baseline instantiated from EvoSkills (Zhang et al., [2026](https://arxiv.org/html/2605.10999#bib.bib37)). The implementation details of the baselines are in Appendix [C.6](https://arxiv.org/html/2605.10999#A3.SS6). We evaluate on ALFWorld IOD, ALFWorld OOD, and ScienceWorld, using three agent models spanning different capability tiers and providers. Figure [2](https://arxiv.org/html/2605.10999#S4.F2) summarizes $\Delta$ accuracy across the benchmark–model entries. We observe that SkillGen leads to consistent gains across settings and achieves the largest overall improvement.

![Refer to caption](https://arxiv.org/html/2605.10999v1/x5.png)
Figure 5: Insights for τ-Bench. Held-out accuracy on τ-Bench retail for the five models where the SkillGen verification gate activated. Gray bars are no-skill baselines and teal bars apply the induced skill; deltas are absolute percentage-point changes.

RQ3

Which components are necessary for reliable skill construction?

Figure [3](https://arxiv.org/html/2605.10999#S4.F3) compares SkillGen against several ablations on representative prediction tasks and shows that each component is relevant for overall performance. We find: (i) the induced skill outperforms simple $k=3$ demonstration reuse, so the improvement is not just retrieval; (ii) refinement and the verification gate are both needed for reliable interactive-task gains, because early candidates can repair some failures while introducing regressions; and (iii) task-specific skill structure is also relevant (e.g., the failure patterns help on ALFWorld OOD and the script+reference bundle helps on ChemLLMBench). The complete SkillGen system achieves the best result on every dataset–model pair in the ablation study.

RQ4

Are generated skills transferable across agents?

We evaluate transfer by reusing the final SkillGen skill from one source model without retraining it, then executing it with a different evaluator model on ALFWorld OOD, ScienceWorld, Mind2Web, and SocialMaze FTS. Each transferred skill is compared against the evaluator's own no-skill baseline; skills marked 'deprecated' by the source pipeline are retained as no-op skills. Figure [4](https://arxiv.org/html/2605.10999#S4.F4) shows that SkillGen produces skills that often transfer across models, though the choice of skill-generating model matters. Across 120 off-diagonal comparisons, 70% are non-negative, and 42% exceed +5 pp. We see a clear pattern: transferable skills are not simply written by the strongest baseline agents; on ALFWorld, Qwen-2.5-7B is the best skill-generating model on average, while on ScienceWorld, GPT-5.4-Nano is best.

RQ5

In which additional task regimes does SkillGen provide useful gains?

Figure [5](https://arxiv.org/html/2605.10999#S4.F5) evaluates long-horizon retail tool use on τ-Bench. SkillGen improves every model whose skill passes the verification gate, with an average gain of +5.3 pp; models whose candidates fail the verification gate are left unchanged rather than exposed to an unverified intervention.

![Refer to caption](https://arxiv.org/html/2605.10999v1/x6.png)
Figure 6: Insights for ChemLLMBench. Held-out accuracy on ChemLLMBench property prediction (left) and yield prediction (right). Gray bars are no-skill baselines and teal bars apply the SkillGen skill; bars labeled "±0.0" or "gate off" indicate no measurable change or rejection by the verification gate.

![Refer to caption](https://arxiv.org/html/2605.10999v1/x7.png)
Figure 7: Refinement rounds vs. skill accuracy. Each refinement round produces one candidate skill evaluated on the construction-time verification subset. (a) Per-round candidate accuracy for representative runs, with dashed no-skill baselines. (b) Best-so-far accuracy under a budget of $K$ rounds. (c) Aggregate mean $\Delta$ accuracy over all runs with 95% bootstrap confidence intervals.

Figure [6](https://arxiv.org/html/2605.10999#S4.F6) evaluates script- and reference-augmented skills on ChemLLMBench. Here, yield prediction benefits, with an average improvement of +16.1 pp across six models, while property prediction is more knowledge-bound and improves only for a small subset of agents. Together, these results suggest that resource-bundle skills are especially useful when the task is procedurally learnable, rather than being primarily a matter of recalling domain facts.

RQ6

Why select the best verified refinement round instead of using the latest candidate?

Figure [7](https://arxiv.org/html/2605.10999#S4.F7) provides an empirical justification for treating refinement as best-of-$K$ search rather than using the latest candidate. Per-round candidates are noisy: by round 8, the latest candidate has an expected $\Delta = -3.1$ pp, while the best verified candidate reaches +8.1 pp (i.e., a gap of ~11 pp). Thus, the verification gate is not just a safety check; it is also the selection mechanism that turns unstable refinement trajectories into a reliable final skill.

RQ7

What qualitative failure modes and insights emerge from the generated skills?

We also inspect manually logged benchmark summaries and per-model skill-analysis reports; a detailed qualitative analysis is in Appendix [C.9](https://arxiv.org/html/2605.10999#A3.SS9). Our analysis supports three takeaways: (i) the verification gate removes many harmful candidate skills, although accepted skills can still overgeneralize on held-out instances; (ii) residual failures in interactive environments usually reflect incomplete procedure execution rather than single local action mistakes; and (iii) chemistry and coding failures often reflect grounding or global-structure limits that a reusable inference-time skill cannot always repair. These observations align with the design of SkillGen: summary diagnostics for contrastive learning identify recurring procedural gaps, while construction-time verification limits the deployment of harmful skill interventions.

## Acknowledgments and Disclosure of Funding

This paper is supported by the DAAD program "Konrad Zuse Schools of Excellence in Artificial Intelligence", sponsored by the Federal Ministry of Education and Research\.

## References

- S. Alzubi, N. Provenzano, J. Bingham, W. Chen, and T. Vu (2026). EvoSkill: automated skill discovery for multi-agent systems. arXiv preprint arXiv:2603.02766.
- Anthropic (2025). Agent skills overview. [https://agentskills.io/home](https://agentskills.io/home). Accessed: 2026.
- H. Bao, Y. Huang, Y. Wang, J. Ye, X. Wang, X. Chen, Y. Zhao, T. Zhou, M. Elhoseiny, and X. Zhang (2024). AutoBench-V: can large vision-language models benchmark themselves? arXiv preprint arXiv:2410.21259.
- B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan, and S. Yao (2023). FireAct: toward language agent fine-tuning. arXiv preprint arXiv:2310.05915.
- Z. Chen, K. Liu, Q. Wang, W. Zhang, J. Liu, D. Lin, K. Chen, and F. Zhao (2024). Agent-FLAN: designing data and methods of effective agent tuning for large language models. In Findings of the Association for Computational Linguistics: ACL 2024.
- S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi, et al. (2023). Textbooks are all you need. arXiv preprint arXiv:2306.11644.
- Y. Huang, H. Hua, Y. Zhou, P. Jing, M. Nagireddy, I. Padhi, G. Dolcetti, Z. Xu, S. Chaudhury, A. Rawat, L. Nedoshivina, P. Chen, P. Sattigeri, and X. Zhang (2025a). Building a foundational guardrail for general agentic systems via synthetic data. arXiv preprint arXiv:2510.09781.
- Y. Huang, Z. Jiang, X. Luo, K. Guo, H. Zhuang, Y. Zhou, Z. Yuan, X. Sun, J. Schleinitz, Y. Wang, S. Zhang, M. Surve, N. V. Chawla, O. Wiest, and X. Zhang (2025b). ChemOrch: empowering LLMs with chemical intelligence via synthetic instructions. arXiv preprint arXiv:2509.16543.
- Y. Huang, S. Wu, C. Gao, D. Chen, Q. Zhang, Y. Wan, T. Zhou, J. Gao, C. Xiao, L. Sun, et al. (2025c). DataGen: unified synthetic dataset generation via large language models. In International Conference on Learning Representations.
- Y. Jiang, D. Li, H. Deng, B. Ma, X. Wang, Q. Wang, and G. Yu (2026). SoK: agentic skills – beyond tool use in LLM agents. arXiv preprint arXiv:2602.20867.
- S. Kaur, S. Park, A. Goyal, and S. Arora (2025). Instruct-SkillMix: a powerful pipeline for LLM instruction tuning. In International Conference on Learning Representations.
- H. Li, Q. Dong, Z. Tang, C. Wang, X. Zhang, H. Huang, S. Huang, X. Huang, Z. Huang, D. Zhang, Y. Gu, X. Cheng, X. Wang, S. Chen, L. Dong, W. Lu, Z. Sui, B. Wang, W. Lam, and F. Wei (2024). Synthetic data (almost) from scratch: generalized instruction tuning for language models. arXiv preprint arXiv:2402.13064.
- A. Mitra, L. D. Corro, G. Zheng, S. Mahajan, D. Rouhana, A. Codas, Y. Lu, W. Chen, O. Vrousgos, C. Rosset, F. Silva, H. Khanpour, Y. Lara, and A. Awadallah (2024). AgentInstruct: toward generative teaching with agentic flows. arXiv preprint arXiv:2407.03502.
- S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah (2023). Orca: progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.
- J. Ni, Y. Liu, X. Liu, Y. Sun, M. Zhou, P. Cheng, D. Wang, X. Jiang, and G. Jiang (2026). Trace2Skill: distill trajectory-local lessons into transferable agent skills. arXiv preprint arXiv:2603.25158.
- C. Qian, C. Han, Y. R. Fung, Y. Qin, Z. Liu, and H. Ji (2023). CREATOR: tool creation for disentangling abstract and concrete reasoning of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023.
- Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun (2024). ToolLLM: facilitating large language models to master 16000+ real-world APIs. In International Conference on Learning Representations.
- D. B. Rubin (2005). Causal inference using potential outcomes: design, modeling, decisions. Journal of the American Statistical Association 100(469), pp. 322–331.
- T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023). Toolformer: language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023). Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems.
- R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto (2023). Alpaca: a strong, replicable instruction-following model. Stanford CRFM Blog. [https://crfm.stanford.edu/2023/03/13/alpaca.html](https://crfm.stanford.edu/2023/03/13/alpaca.html).
- C. Wang, Z. Yu, X. Xie, W. Yao, R. Fang, S. Qiao, K. Cao, G. Zheng, X. Qi, P. Zhang, and S. Deng (2026a). SkillX: automatically constructing skill knowledge bases for agents. arXiv preprint arXiv:2604.04804.
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023a). Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
- Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi (2023b). Self-Instruct: aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Z. Wang, Q. Wu, X. Zhang, C. Zhang, W. Yao, F. E. Faisal, B. Peng, S. Qin, S. Nath, Q. Lin, C. Bansal, D. Zhang, S. Rajmohan, J. Gao, and H. Yao (2026b). WebXSkill: skill learning for autonomous web agents. arXiv preprint arXiv:2604.13318.
- P. Xia, J. Chen, H. Wang, J. Liu, K. Zeng, Y. Wang, S. Han, Y. Zhou, X. Zhao, H. Chen, Z. Zheng, C. Xie, and H. Yao (2026). SkillRL: evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234.
- C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D. Jiang (2024). WizardLM: empowering large pre-trained language models to follow complex instructions. In International Conference on Learning Representations.
- R. Xu and Y. Yan (2026). Agent skills for large language models: architecture, acquisition, security, and the path forward. arXiv preprint arXiv:2602.12430.
- Z. Xu, F. Jiang, L. Niu, Y. Deng, R. Poovendran, Y. Choi, and B. Y. Lin (2025). Magpie: alignment data synthesis from scratch by prompting aligned LLMs with nothing. In International Conference on Learning Representations.
- Y. Yang, J. Li, Q. Pan, B. Zhan, Y. Cai, L. Du, J. Zhou, K. Chen, Q. Chen, X. Li, B. Zhang, and L. He (2026). AutoSkill: experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145.
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023). ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations.
- A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang (2024). AgentTuning: enabling generalized agent abilities for LLMs. In Findings of the Association for Computational Linguistics: ACL 2024.
- B. Zhang, K. Lazuka, and M. Murag (2025). Equipping agents for the real world with agent skills. Anthropic Engineering Blog. [https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills](https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills).
- H. Zhang, S. Fan, H. P. Zou, Y. Chen, Z. Wang, J. Zhou, C. Li, W. Huang, Y. Yao, K. Zheng, X. Liu, X. Li, and P. S. Yu (2026). EvoSkills: self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687.
- A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang (2024). ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 19632–19642.
- B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srinivasa, G. Liu, G. Neubig, and Y. Su (2025). SkillWeaver: web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079.
- C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy (2023). LIMA: less is more for alignment. In Advances in Neural Information Processing Systems.

## Appendix A Algorithm

Algorithm [1](https://arxiv.org/html/2605.10999#alg1) summarizes the full SkillGen construction procedure.

**Algorithm 1** SkillGen: contrastive induction with generation–verification–refinement loop

**Input:** induction subset $\mathcal{D}_{\mathrm{ind}}$; construction-time verification subset $\mathcal{D}_{\mathrm{ver}}$; base agent $\mathcal{A}$; evaluator $\mathcal{E}$; round budget $K \geq 1$; gate parameters $g_{\mathrm{abs}} \in \mathbb{Z}_{\geq 0}$ and $g_{\mathrm{rel}} \in [0,1]$.
**Output:** skill $s^{\star}$ marked $\mathrm{active}$ or $\mathrm{deprecated}$.

*Stage 1: Baseline elicitation.*
1. Collect induction trajectories $\mathcal{B} = \{(x_i, \tau_i^{0}, y_i^{0})\}_{i=1}^{n}$ by rolling out $\mathcal{A}$ on $\mathcal{D}_{\mathrm{ind}}$.
2. Cache verification baselines $\mathcal{B}_{\mathrm{ver}} = \{(\tilde{x}_j, \tilde{\tau}_j^{0}, b_j)\}_{j=1}^{m}$ by rolling out $\mathcal{A}$ on $\mathcal{D}_{\mathrm{ver}}$.

*Stage 2: Contrastive behavioral induction.*
3. The induction agent analyzes trajectories into $\mathcal{Z} = (a_0, \mathcal{F}, \mathcal{S}, \mathcal{C})$.

*Stage 3: Generation–verification–refinement.*
4. Initialize $\Phi^{(0)} \leftarrow \varnothing$, $g^{\star} \leftarrow -\infty$, $s^{\star} \leftarrow \varnothing$.
5. Set the gate threshold $\gamma_m \leftarrow \max\{g_{\mathrm{abs}}, \lceil g_{\mathrm{rel}}\, m \rceil, 1\}$.
6. **For** $r = 1, \ldots, K$:
7. The generation agent computes $s^{(r)} \leftarrow \operatorname{Gen}(\mathcal{Z})$ for $r = 1$, and $s^{(r)} \leftarrow \operatorname{Refine}(s^{(r-1)}, \mathcal{Z}, \Phi^{(r-1)})$ for $r > 1$.
8. The verification agent evaluates $s^{(r)}$ on all $\tilde{x}_j \in \mathcal{D}_{\mathrm{ver}}$ and computes $G_m(s^{(r)})$.
9. The verification agent builds feedback $\Phi^{(r)}$ from repairs, regressions, and unresolved failures.
10. **If** $G_m(s^{(r)}) > g^{\star}$: set $g^{\star} \leftarrow G_m(s^{(r)})$ and $s^{\star} \leftarrow s^{(r)}$.
11. Mark $s^{\star}$ as $\mathrm{active}$ if $g^{\star} \geq \gamma_m$; otherwise mark it $\mathrm{deprecated}$.
12. **Return** $s^{\star}$.
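To make the control flow concrete, the following is a minimal Python sketch of Algorithm 1. The callables `gen`, `refine`, and `net_gain` are hypothetical stand-ins for the generation and verification agents rather than the released implementation; the default parameter values are taken from Appendix C.4.

```python
import math

def skillgen_loop(Z, D_ver, gen, refine, net_gain, K=8, g_abs=2, g_rel=0.05):
    """Minimal sketch of Algorithm 1. `Z` is the induction summary
    (a_0, F, S, C); `gen`, `refine`, and `net_gain` wrap the generation
    and verification agents (hypothetical interfaces)."""
    m = len(D_ver)
    gamma_m = max(g_abs, math.ceil(g_rel * m), 1)   # gate threshold (step 5)
    best_gain, best_skill = float("-inf"), None
    skill, feedback = None, None
    for r in range(1, K + 1):                       # refinement rounds (step 6)
        skill = gen(Z) if r == 1 else refine(skill, Z, feedback)
        gain, feedback = net_gain(skill, D_ver)     # repairs minus regressions, plus diagnostics
        if gain > best_gain:                        # keep the best candidate so far (step 10)
            best_gain, best_skill = gain, skill
    status = "active" if best_gain >= gamma_m else "deprecated"
    return best_skill, status
```

Under this reading, a deprecated skill is still returned as an artifact but never deployed, which matches the gate-off convention in Appendix C.7.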

## Appendix B Related Work

**Agent skills.** Early LLM agents augmented models with external tool use [Schick et al., [2023](https://arxiv.org/html/2605.10999#bib.bib18), Qin et al., [2024](https://arxiv.org/html/2605.10999#bib.bib19)] or tool creation [Qian et al., [2023](https://arxiv.org/html/2605.10999#bib.bib16)], while interaction-based agents such as ReAct [Yao et al., [2023](https://arxiv.org/html/2605.10999#bib.bib15)], Reflexion [Shinn et al., [2023](https://arxiv.org/html/2605.10999#bib.bib14)], ExpeL [Zhao et al., [2024](https://arxiv.org/html/2605.10999#bib.bib13)], and Voyager [Wang et al., [2023a](https://arxiv.org/html/2605.10999#bib.bib12)] showed that trajectories can support reusable reasoning and action routines.

Agent skills provide a first-class abstraction that often follows the Anthropic Agent Skills standard [Zhang et al., [2025](https://arxiv.org/html/2605.10999#bib.bib1), Anthropic, [2025](https://arxiv.org/html/2605.10999#bib.bib2)], which defines skills as composable bundles of instructions, scripts, and resources loaded dynamically at inference time. Recent surveys systematize skill architecture, acquisition, security, and deployment [Xu and Yan, [2026](https://arxiv.org/html/2605.10999#bib.bib3), Jiang et al., [2026](https://arxiv.org/html/2605.10999#bib.bib4)]. On the construction side, Trace2Skill [Ni et al., [2026](https://arxiv.org/html/2605.10999#bib.bib5)], EvoSkill [Alzubi et al., [2026](https://arxiv.org/html/2605.10999#bib.bib6)], EvoSkills [Zhang et al., [2026](https://arxiv.org/html/2605.10999#bib.bib37)], and SkillX [Wang et al., [2026a](https://arxiv.org/html/2605.10999#bib.bib7)] synthesize skills from agent experience, while SkillWeaver [Zheng et al., [2025](https://arxiv.org/html/2605.10999#bib.bib10)], WebXSkill [Wang et al., [2026b](https://arxiv.org/html/2605.10999#bib.bib8)], AutoSkill [Yang et al., [2026](https://arxiv.org/html/2605.10999#bib.bib9)], and SkillRL [Xia et al., [2026](https://arxiv.org/html/2605.10999#bib.bib38)] study web, dialogue, and RL deployment regimes. SkillGen differs by making failure analysis central: it clusters error patterns, identifies capability boundaries, and verifies induced skills through multi-agent collaboration with a construction-time deployment rule over paired repairs and regressions.

**Synthetic data for LLM agents.** Self-Instruct [Wang et al., [2023b](https://arxiv.org/html/2605.10999#bib.bib20)] developed LLM-bootstrapped instruction generation, which inspired later extensions such as Alpaca [Taori et al., [2023](https://arxiv.org/html/2605.10999#bib.bib21)] and WizardLM [Xu et al., [2024](https://arxiv.org/html/2605.10999#bib.bib22)], as well as further quality improvements through reasoning traces, curation, synthetic textbooks, and taxonomy-driven generation [Mukherjee et al., [2023](https://arxiv.org/html/2605.10999#bib.bib23), Zhou et al., [2023](https://arxiv.org/html/2605.10999#bib.bib24), Gunasekar et al., [2023](https://arxiv.org/html/2605.10999#bib.bib25), Li et al., [2024](https://arxiv.org/html/2605.10999#bib.bib26)]. Recent pipelines scale synthetic data with agentic flows or self-synthesis [Mitra et al., [2024](https://arxiv.org/html/2605.10999#bib.bib27), Xu et al., [2025](https://arxiv.org/html/2605.10999#bib.bib28)], controllable generation and verification [Huang et al., [2025c](https://arxiv.org/html/2605.10999#bib.bib32)], domain-specific tool-aware construction [Huang et al., [2025b](https://arxiv.org/html/2605.10999#bib.bib33)], verified visual question answering [Bao et al., [2024](https://arxiv.org/html/2605.10999#bib.bib42)], and risk-injected safety trajectories [Huang et al., [2025a](https://arxiv.org/html/2605.10999#bib.bib34)].

For agent-specific capabilities, AgentTuning [Zeng et al., [2024](https://arxiv.org/html/2605.10999#bib.bib29)], Agent-FLAN [Chen et al., [2024](https://arxiv.org/html/2605.10999#bib.bib31)], FireAct [Chen et al., [2023](https://arxiv.org/html/2605.10999#bib.bib30)], and Instruct-SkillMix [Kaur et al., [2025](https://arxiv.org/html/2605.10999#bib.bib11)] construct trajectory or skill-composition data for instruction tuning. These methods produce data for *model fine-tuning*; in contrast, SkillGen synthesizes *inference-time skills* (structured bundles loaded without parameter updates) that target capability gaps exposed by systematic failure analysis.

## Appendix C Experimental Details

### C.1 Model Details

Table 2: Base agent models used across the reported experiments. Open-weight indicates that model weights are publicly available; proprietary models are accessed through hosted APIs.

### C.2 Datasets and Splits

All reported accuracy changes are based on the following comparative assessment: for each held-out test instance, we roll out the same base agent once without a skill and once with the generated skill, using the same instance identifier and random seed (a minimal sketch of this paired protocol follows Table 3). Unless otherwise specified, we use seed 42 and keep the skill-training dataset and held-out test pool disjoint. Within the skill-training dataset, SkillGen uses an induction subset for trajectory analysis and a construction-time verification subset for refinement and selection, matching the usual train/validation/test separation. Table [3](https://arxiv.org/html/2605.10999#A3.T3) summarizes the controlled split protocol for the benchmark-specific studies that require explicit sampling.

Table 3: Controlled split protocol for benchmark-specific studies.
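As a concrete illustration of the paired rollout protocol above, the sketch below rolls out each instance twice under an identical seed. The `agent.run` interface is a hypothetical stand-in for the evaluation harness, assumed to return True on task success.

```python
import random

def paired_rollout(agent, skill, test_pool, seed=42):
    """Roll out each held-out instance twice, once without and once with the
    skill, under the same instance identifier and random seed."""
    outcomes = []
    for instance in test_pool:
        random.seed(seed)                           # identical seeding for both arms
        base_ok = agent.run(instance, skill=None)   # no-skill baseline arm
        random.seed(seed)
        skill_ok = agent.run(instance, skill=skill) # skill-augmented arm
        outcomes.append((base_ok, skill_ok))
    return outcomes
```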
### C.3 Model Routing and Auxiliary Roles

Each baseline or skill-augmented rollout is executed by the base agent model listed in Table [2](https://arxiv.org/html/2605.10999#A3.T2); the auxiliary models used inside SkillGen never replace the base agent during rollout. To isolate the effect of the generated skill from the capability of the skill-writing model, we use a fixed auxiliary model, GPT-5.4-Mini, for the induction agent, generation agent, and verification agent across all base agents. Non-OpenAI model calls are routed through OpenRouter. Embeddings for clustering and skill-card merging use text-embedding-3-small. Decoding is deterministic with temperature 0; the default output budget is 4,096 tokens, increased to 16,384 tokens for skill generation.
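The routing can be summarized in a small configuration sketch. The key names below are our illustrative labels for the settings reported in this subsection, not the authors' actual configuration schema.

```python
# Illustrative summary of the auxiliary-model routing; key names are ours.
AUXILIARY_ROUTING = {
    "auxiliary_model": "GPT-5.4-Mini",            # induction, generation, verification agents
    "embedding_model": "text-embedding-3-small",  # clustering and skill-card merging
    "non_openai_router": "OpenRouter",            # routing for non-OpenAI model calls
    "temperature": 0,                             # deterministic decoding
    "max_output_tokens": 4_096,                   # default output budget
    "max_output_tokens_skillgen": 16_384,         # raised budget for skill generation
}
```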

### C.4 SkillGen Hyperparameters

Unless otherwise noted, all runs use the same benchmark-specific configuration template. The induction stage uses at most eight failure clusters and eight success clusters, with adaptive $k$-means clustering over $k \in [2, 8]$ and a target cluster size of 15. The contrastive module keeps up to 20 nearest failure–success pairs. The generation prompt receives up to six failure clusters, six success clusters, and eight contrastive observations; web search is disabled.

The main experiments use a maximum refinement budget of eight rounds. For candidate verification, the verification gate evaluates uniformly sampled construction-time verification instances from the skill-training dataset, using a 70/30 induction/verification split with at least four verification instances when the pool is small. The deployment decision follows the rule in §[3.3](https://arxiv.org/html/2605.10999#S3.SS3): a skill is accepted only if its construction-time verification net gain $G_m(s)$ satisfies $G_m(s) \geq \max\{2, \lceil 0.05\,m \rceil, 1\}$. We additionally run up to 30 baseline-success guard checks to expose regressions on already-solved instances. Skills that fail the gate are persisted with status deprecated; downstream evaluation treats them as empty interventions, so cells labeled “gate off” report zero change rather than an unverified skill. The pipeline uses four workers for independent runs, and the verification agent’s feedback stage uses eight workers.
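The acceptance rule is easy to state in code. The sketch below implements the stated gate $G_m(s) \geq \max\{2, \lceil 0.05\,m \rceil, 1\}$; the function name is ours.

```python
import math

def passes_gate(net_gain: int, m: int, g_abs: int = 2, g_rel: float = 0.05) -> bool:
    """Deployment rule: accept a skill only if its construction-time
    verification net gain clears max{g_abs, ceil(g_rel * m), 1}."""
    return net_gain >= max(g_abs, math.ceil(g_rel * m), 1)

# With m = 30 verification instances the threshold is max(2, ceil(1.5), 1) = 2,
# so a skill needs at least two more repairs than regressions to go active.
```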

### C.5 Token Cost Analysis

Table 4: Token cost of SkillGen. *Train* is the one-time construction budget; *Base* and *Skill* are average tokens per call.

Table [4](https://arxiv.org/html/2605.10999#A3.T4) separates one-time construction cost from per-call inference overhead. All values are computed per model and then averaged within each benchmark. The skill-construction pipeline, including baseline trajectory collection, induction, generation, refinement, and verification, is a one-time cost per model–benchmark pair, ranging from 2.2M tokens on ScienceWorld to 10.2M on τ-Bench (mean 5.6M). Using GPT-5.4-Mini standard API prices ($0.75/M input tokens and $4.50/M output tokens; see [https://openai.com/api/pricing](https://openai.com/api/pricing), accessed April 2026) and the prompt/output mix observed in our training logs, the mean construction budget corresponds to approximately $8.2 per generated skill. This cost is paid once for a model–benchmark pair, after which the same skill can be reused across subsequent rollouts and repeated evaluations. The per-call columns show that retrieval keeps inference prompts in the same few-thousand-token regime: the median skill-augmented call is 5,919 tokens, and the largest absolute per-call average in the table is 6,358 tokens on τ-Bench.
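The per-skill dollar figure can be roughly reproduced from the reported numbers. The sketch below assumes an illustrative 80/20 input/output token mix; the paper derives the exact mix from its training logs, so the result here is only approximate.

```python
# Rough reproduction of the reported ~$8.2 per-skill construction cost.
# The 80/20 input/output split is an assumption for illustration only.
mean_tokens_millions = 5.6               # mean one-time construction budget
price_input, price_output = 0.75, 4.50   # $/M tokens, GPT-5.4-Mini standard API
input_share = 0.80

cost = mean_tokens_millions * (
    input_share * price_input + (1 - input_share) * price_output
)
print(f"approx. ${cost:.1f} per generated skill")  # about $8.4 under this assumed mix
```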

#### Compute resources.

All experiments are orchestrated locally but executed through hosted LLM APIs routed through OpenRouter. The base-agent and auxiliary-model routing is reported in Appendix [C.3](https://arxiv.org/html/2605.10999#A3.SS3); token budgets are reported in Table [4](https://arxiv.org/html/2605.10999#A3.T4); and concurrency settings are reported in Appendix [C.4](https://arxiv.org/html/2605.10999#A3.SS4). Because model inference is served by third-party API providers, the underlying accelerator type, memory configuration, and provider-side scheduling are not exposed to us. We therefore report reproducible API-level compute usage in terms of model calls, token budgets, and worker concurrency. Local compute is used only for orchestration, logging, clustering, and evaluation bookkeeping.

### C.6 Skill-Generation Baselines

#### Trace2Skill.

For Trace2Skill [Ni et al., [2026](https://arxiv.org/html/2605.10999#bib.bib5)], we run one no-skill rollout over the training pool. Success-branch and error-branch analysts process trajectories in parallel, and their proposed patches are consolidated through hierarchical LLM merging. We preserve the original prompt structure and do not impose an additional output schema beyond the shared Markdown skill wrapper.

#### SkillX.

For SkillX [Wang et al., [2026a](https://arxiv.org/html/2605.10999#bib.bib7)], we run two refinement rounds. Each round rolls out the current library, extracts skill cards from successful trajectories, clusters cards by cosine similarity at threshold 0.80 using text-embedding-3-small, merges clusters with an LLM, and filters cards with an LLM quality-score threshold of 3/5. The retained library is capped at 12 cards before being canonicalized into the final skill.
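For reference, the card-merging step can be sketched as a greedy cosine-threshold grouping. The seed-based greedy assignment below is an assumption of this sketch, not the reproduction code; in the actual pipeline, the grouped cards are then merged by an LLM.

```python
import numpy as np

def merge_skill_cards(embeddings, threshold=0.80):
    """Greedy cosine-threshold grouping of skill-card embeddings. Each card
    joins the first cluster whose seed it matches at or above the threshold."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit norm: dot product = cosine
    clusters = []                                      # each cluster holds card indices
    for i in range(len(X)):
        for cluster in clusters:
            if float(X[i] @ X[cluster[0]]) >= threshold:  # compare against cluster seed
                cluster.append(i)
                break
        else:                                          # no cluster matched: start a new one
            clusters.append([i])
    return clusters
```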

#### EvoSkill.

For EvoSkill [Alzubi et al., [2026](https://arxiv.org/html/2605.10999#bib.bib6)], we maintain a frontier of $k=3$ candidate programs for four iterations. A proposer chooses among `add_new`, `edit`, and `keep` operations based on failures from a fixed validation subset, and a builder emits the next candidate library. Admission requires the new candidate to outperform the weakest frontier member on the same validation subset.
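The frontier admission rule can be sketched as follows; the flat list-of-pairs representation and the function name are our assumptions for illustration.

```python
def admit(frontier, candidate, score, k=3):
    """Sketch of frontier admission: with fewer than k members the candidate
    always enters; otherwise it must beat the weakest member on the same
    validation subset. `frontier` is a list of (library, score) pairs."""
    if len(frontier) < k:
        frontier.append((candidate, score))
        return True
    weakest = min(range(len(frontier)), key=lambda i: frontier[i][1])
    if score > frontier[weakest][1]:
        frontier[weakest] = (candidate, score)  # replace the weakest member
        return True
    return False                                # candidate rejected
```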

#### CoEvoSkills.

We instantiate the co-evolutionary baseline from EvoSkills [Zhang et al., [2026](https://arxiv.org/html/2605.10999#bib.bib37)] using an information-isolated surrogate verifier. The surrogate writes binary natural-language assertions, judges rollouts against those assertions, and returns structured diagnostics to the skill generator. To match the rollout budget of the other baselines, we use three outer iterations and two inner verifier iterations. Because the shared evaluation interface consumes Markdown-formatted skills rather than executable multi-file bundles, the surrogate assertions are LLM-judged rather than executed as code.

#### Scope of the baseline comparison.

Our baseline comparison evaluates all methods under the same deployment problem studied in this paper: synthesize one fixed, auditable inference-time skill for each benchmark–model pair, and evaluate that fixed intervention on held-out instances. This requires adapting methods that natively construct or retrieve from multi-skill libraries, such as SkillX and EvoSkill, into the same single-skill interface used by SkillGen. We emphasize that this controlled adaptation is not intended to measure the full native deployment potential of those systems under all possible library sizes, retrieval policies, or routing strategies. Instead, it supports a like-for-like comparison of skill-synthesis quality when the deployed artifact must be one fixed skill.

This design also reduces evaluation instability and selection bias. If library-based methods were allowed to select different skills at test time, held-out performance would conflate skill construction with additional design choices such as library size, retrieval scoring, context-budget allocation, routing policy, and stochastic skill selection. Those choices are important in their own right, but they introduce extra degrees of freedom that are not shared by all methods and can make a comparative evaluation less stable (and potentially unfair). The resulting comparison should therefore be interpreted as a controlled single-skill adaptation of each baseline, rather than as a claim that the adapted baselines exhaust their native multi-skill capabilities. All reproduced baselines are adapted to this shared single-skill evaluation interface: each method emits one Markdown-formatted skill per benchmark–model pair, which is then injected and evaluated by the shared paired rollout harness. The base agent model is used for task rollouts, while GPT-5.4-Mini is used for auxiliary extraction, merging, proposal, and judging steps.

#### No-tool comparison protocol.

For the skill-generation baseline comparison, we use only benchmark settings in which the evaluated skill does not rely on external tools, generated helper scripts, or reference-resource loading. To ensure a like-for-like comparison, SkillGen’s optional script and reference components are disabled in these runs: every method, including SkillGen, emits a single Markdown-formatted natural-language skill, i.e., $s = (u, a, \varnothing, \varnothing)$. No method is allowed to provide executable helper functions, generated tools, retrieval documents, or calls to skill_load_reference. All skills are injected through the same prompt slot and evaluated with the same paired rollout harness. Thus, the comparison isolates the quality of the synthesized natural-language skill rather than differences in tool availability.
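To illustrate the restricted skill format, the sketch below models $s = (u, a, \varnothing, \varnothing)$ as a small container together with the shared prompt-slot injection. The field names, their ordering in the prompt, and the reading of $u$ and $a$ as the two natural-language components are assumptions of this sketch; the precise component definitions live in the main text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MarkdownSkill:
    """Illustrative container for s = (u, a, scripts, references)."""
    u: str                             # natural-language component (assumed semantics)
    a: str                             # natural-language component (assumed semantics)
    scripts: Optional[str] = None      # always None under the no-tool protocol
    references: Optional[str] = None   # always None under the no-tool protocol

def inject(task_prompt: str, skill: Optional[MarkdownSkill]) -> str:
    """Shared prompt-slot injection: one Markdown skill per rollout, or none."""
    if skill is None:
        return task_prompt             # no-skill baseline arm
    return f"{skill.u}\n\n{skill.a}\n\n{task_prompt}"
```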

### C.7 Evaluation Metrics and Gate-Off Handling

For every evaluated benchmark–split–model cell, we report baseline accuracy, skill-augmented accuracy, and the paired difference $\Delta = \mathrm{acc}_{\mathrm{skill}} - \mathrm{acc}_{\mathrm{base}}$ in percentage points. We also record repair counts (baseline wrong, skill correct), regression counts (baseline correct, skill wrong), and net gain as repairs minus regressions. When SkillGen marks a skill as deprecated during construction-time verification, evaluation reuses the no-skill baseline for the skill side. This convention makes rejected skills explicit and prevents an unverified prompt from introducing hidden regressions.
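Combined with the paired rollout sketch in Appendix C.2, these quantities reduce to a few counts over paired boolean outcomes; a minimal sketch:

```python
def paired_metrics(outcomes):
    """Reduce paired (base_ok, skill_ok) outcomes to the reported quantities:
    Δ in percentage points, repairs, regressions, and net gain. For a
    deprecated skill the skill arm reuses the baseline, so Δ is exactly 0."""
    n = len(outcomes)
    repairs = sum(1 for b, s in outcomes if not b and s)      # baseline wrong, skill correct
    regressions = sum(1 for b, s in outcomes if b and not s)  # baseline correct, skill wrong
    acc_base = sum(1 for b, _ in outcomes if b) / n
    acc_skill = sum(1 for _, s in outcomes if s) / n
    delta_pp = 100.0 * (acc_skill - acc_base)  # equals 100 * (repairs - regressions) / n
    return delta_pp, repairs, regressions, repairs - regressions
```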

### C.8 t-SNE Visualizations

Fig. [8](https://arxiv.org/html/2605.10999#A3.F8) shows the t-SNE visualization of the contrastive induction of SkillGen on ALFWorld (gpt-5.4-nano).

![Refer to caption](https://arxiv.org/html/2605.10999v1/x8.png)

Figure 8: t-SNE visualization of SkillGen’s induction on ALFWorld (gpt-5.4-nano). Red triangles (F1–F7) are failure trajectories; green circles (S1–S8) are successes. Gray arrows link each failure to its nearest same-type success (20 contrastive pairs). The yellow band marks the decision boundary between the two populations. Failures cluster compactly in the upper region (recurring planning errors), while successes spread broadly (diverse solving strategies), motivating the contrastive analysis that drives skill generation.
### C.9 Failure Analysis

We complement the aggregate gains in Section [4](https://arxiv.org/html/2605.10999#S4) with a qualitative inspection of the archived benchmark summaries and per-model skill-analysis reports. Rather than listing isolated examples, we distill four recurring findings about when skill augmentation still fails.

#### The verification gate substantially reduces harm, but accepted skills can still regress on held-out instances.

The clearest evidence is the contrast between rejected and accepted negative skills. In LiveCodeBench, rejected skills would have reduced Qwen-2.5-7B from 25.33% to 16.00% and Gemma-4-26B from 83.33% to 80.67%; these cells are therefore reported as zero-delta after gating. Similar rejected regressions appear in Mind2Web and PubMedQA. However, filtering is not perfect. On ALFWorld OOD, an accepted skill for Llama-3.1-8B still lowers accuracy from 67.45% to 65.10%, with 51 repairs but 57 regressions, and on ChemLLMBench yield prediction, Mistral-Nemo drops from 43.33% to 20.00%, with only three repairs against ten regressions. The failure mode here is overgeneralization: a skill that looks beneficial on the verification subset can still perturb correct baseline behavior on the final split.

#### In interactive environments, the dominant residual error is incomplete procedure execution rather than local action selection.

Both ALFWorld and ScienceWorld exhibit this pattern. For Llama-3.1-8B on ALFWorld, the training-time analysis records all failures, and the largest cluster, with 65 cases, is labeled *incomplete dependency planning*. These trajectories often begin with the first plausible subgoal but omit a later prerequisite, such as turning on a lamp before inspection or using an intermediate receptacle before transport. ScienceWorld shows the same phenomenon at the level of experimental procedure. For Gemma-4-26B, the analysis contains 110 failures out of 150 construction instances, with large clusters labeled *ungrounded action planning* (25), *incomplete task-sequence planning* (20), and *incomplete goal-to-action planning* (19). In both benchmarks, the agent usually recognizes the task theme but fails to maintain the full ordered procedure needed to finish it.

#### On chemistry tasks, failures are driven less by missing facts than by incorrect grounding of reaction roles and decision criteria.

For Qwen-2.5-7B on ChemLLMBench yield prediction, the training analysis reports 27 failures out of 30 examples, with two dominant clusters: *superficial reaction-feasibility assessment* (14) and *reaction-role misparsing in cross-coupling* (13). A typical failure is that the model recognizes a familiar reaction family but confuses substrates with catalysts, bases, solvents, or additives, and then predicts yield from a generic template rather than checking the actual electrophile, nucleophile, ligand/base combination, and substrate scope. This helps explain why resource-bundle skills are especially effective on yield prediction: the gain comes from enforcing a grounded checking procedure, not from merely restating chemical knowledge.

#### On code-generation tasks, skill augmentation cannot fully compensate for missing global problem structure.

LiveCodeBench failures for Qwen-2.5-7B are dominated by upstream reasoning errors rather than surface-level implementation mistakes. The training analysis reports 113 failures out of 150 problems, with major clusters labeled *incomplete algorithmic modeling* (19), *incomplete algorithm realization* (14), and *structure-mapping failure* (6). Typical trajectories either pursue brute-force search where an invariant or transformation is required, apply a local greedy heuristic where dynamic programming is needed, or emit truncated code after only partially deriving the solution. This suggests that a general skill can regularize recurring reasoning habits, but it cannot reliably rescue cases where the model never identifies the right combinatorial structure in the first place.

## Appendix D Broader Impacts

SkillGen aims to make LLM agents more reliable and easier to adapt without retraining, which could reduce the manual effort required to build task-specific agent skills and make procedural knowledge more inspectable through human-readable skill artifacts. This may benefit scientific, coding, web, and tool-use workflows where reusable guidance and verification can improve consistency. At the same time, any method that improves agent performance can also improve agents used for harmful or unintended purposes, and generated skills may overgeneralize, introduce regressions on unseen cases, or amplify mistakes in domains where incorrect tool use has real consequences. The risks are especially relevant when skills are deployed in open-ended environments, safety-sensitive applications, or workflows involving external tools and resources. SkillGen’s paired verification provides a partial check against these harms by making regressions visible during construction. However, these checks are limited to the evaluated task distribution, so they should be paired with application-specific safety evaluation, human review of generated skills, access controls, and ongoing monitoring before deployment.
