Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition
Summary
This paper introduces W2S, a framework that automatically constructs executable Skills for LLM agents from historical interaction traces using a Skill-IR intermediate representation, improving behavioral replay consistency by 10.5% over baselines.
# Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition
Source: [https://arxiv.org/html/2606.06893](https://arxiv.org/html/2606.06893)
Yuyang Zhang¹, Xinyuan Han², Xudong Jiang¹, Run Wang¹
¹Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University
²Nanchang University
###### Abstract
Large language model agents increasingly rely on *Skills* to encode procedural knowledge, including when to invoke a capability, how to decompose a task, what constraints to follow, and how to verify intermediate outcomes\. Despite their importance, high\-quality Skills are still largely hand\-written, making them difficult to scale across domains, tools, and execution environments\. This paper studies how to automatically construct executable Skills from heterogeneous interaction evidence, such as demonstrations, agent trajectories, tool\-use traces, and execution logs\. We argue that this is not a standard summarization problem: historical traces are often fragmented across scenarios, contain redundant or accidental steps, and may omit low\-frequency but safety\-critical operations\. To address this challenge, we introduce Skill\-IR, an intermediate representation that interprets a Skill from a workflow perspective and decomposes its content into three complementary components: Workflow structure, execution Semantics, and runtime Attachments\. Together, these WSA components capture the structural, behavioral, and operational elements required for executable Skills, including task decomposition, step\-level execution requirements, control\-flow conditions, verification procedures, and safety\-critical state management\. Building on Skill\-IR, we propose W2S, a trace\-to\-skill construction framework that converts historical execution evidence into reusable agent Skills\. W2S segments traces into procedural units, induces local Skill drafts from individual paths, aligns and merges shared structures across scenarios, reconciles conditional branches, and compresses redundant steps while preserving verification, approval, rollback, and state\-management behaviors with evidence and confidence annotations\. Experiments on 70 skills show that W2S improves behavioral replay consistency over summarization\- and prompting\-based baselines, with an improvement of 10\.5%\. These results suggest that reliable Skill generation requires treating historical traces as evidence for executable runtime specifications rather than as text to be compressed\.
## 1 Introduction
LLM agents \(Luo et al\., [2025](https://arxiv.org/html/2606.06893#bib.bib31)\) are rapidly evolving from systems that merely generate responses into runtime systems that execute workflows \(Wang et al\., [2024](https://arxiv.org/html/2606.06893#bib.bib9)\), invoke tools \(Shi et al\., [2025](https://arxiv.org/html/2606.06893#bib.bib20)\), and read and write state \(Xie et al\., [2024](https://arxiv.org/html/2606.06893#bib.bib36)\)\. As agents assume these runtime responsibilities, an abstraction is needed to package reusable agent capabilities and specify how they should be applied across tasks\. In this transition, skills have emerged as a key abstraction for organizing such reusable capabilities \(Ling et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib13)\)\. A skill specifies what capability an agent can reuse, together with clear instructions about when it should be activated and how it should be used \(Li et al\., [2026a](https://arxiv.org/html/2606.06893#bib.bib14)\)\. In this sense, a skill is not merely a prompt fragment; it is a runtime specification intended to guide agent behavior across future tasks \(Xu and Yan, [2026](https://arxiv.org/html/2606.06893#bib.bib37)\)\. Recent practice suggests that skills can improve the reliability, transferability, and maintainability of agent behavior, and that prompts and tool\-use procedures are increasingly being reorganized or re\-implemented as skills \(Jiang et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib19); Ling et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib13)\)\. Because skills can provide a scalable interface for accumulating, transferring, and operationalizing agent experience, they are likely to remain an important building block in future agent systems and have been attracting growing attention from both industry and academia \(Zhou et al\., [2026b](https://arxiv.org/html/2606.06893#bib.bib17)\)\.
However, despite their demonstrated value, current skills are largely manually authored, making them costly to scale and difficult to keep aligned with evolving usage scenarios, tool environments, and execution requirements \(Liu et al\., [2026b](https://arxiv.org/html/2606.06893#bib.bib15); Ma et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib16)\)\. Fortunately, rich evidence of task\-oriented behavior already exists in the form of interaction traces, tool calls, expert demonstrations, user feedback, and execution logs, providing a natural basis for automated skill induction \(Huang et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib5)\)\. Yet current approaches to skill generation are often limited to summarizing traces, rather than producing structured runtime specifications that future agents can reliably reuse \(Li et al\., [2026c](https://arxiv.org/html/2606.06893#bib.bib10); Yang et al\., [2026a](https://arxiv.org/html/2606.06893#bib.bib12)\)\. As a result, the induced skills may overfit incidental details, omit critical preconditions or recovery procedures, and become difficult to verify or maintain\. A more principled formulation is therefore needed: skill induction should transform execution data into structured, reusable procedural knowledge that can guide future agent behavior\.
Skill creation differs from ordinary summarization in its objective\. Summarization typically compresses historical content according to semantic salience \(Tang et al\., [2023](https://arxiv.org/html/2606.06893#bib.bib38)\), whereas skill creation aims to reconstruct procedural knowledge that can support future execution \(Wu and Zhang, [2026](https://arxiv.org/html/2606.06893#bib.bib18)\)\. For an induced skill to be reusable, it must preserve not only the main intent of prior traces, but also the runtime structure that determines how an agent should act: triggering conditions, task decomposition \(Yao et al\., [2022](https://arxiv.org/html/2606.06893#bib.bib32)\), tool\-use policies, constraints, failure handling, and validation criteria \(Shinn et al\., [2023](https://arxiv.org/html/2606.06893#bib.bib39)\)\. These elements are often functionally distinct rather than hierarchically important, and therefore may be merged or omitted by a purely summary\-oriented process\. Thus, skill creation requires transforming traces into a compact but structured workflow, rather than simply producing a concise description of what happened \(Zhou et al\., [2026a](https://arxiv.org/html/2606.06893#bib.bib11)\)\.
Our key insight is that the proper unit of skill induction should not be a textual instruction, but a structured runtime specification, especially for automated skill generation\. Unlike ordinary text summarization, which compresses source data into salient semantic content \(Radford et al\., [2021](https://arxiv.org/html/2606.06893#bib.bib23)\), skill creation must preserve the operational properties that make a skill executable and reusable\. A generated skill should not merely describe what the source data is about; it should specify when the skill applies, how the task proceeds, how local decisions are made, and what runtime safeguards constrain execution\.
To this end, we introduce Skill\-IR, an intermediate representation that converts interaction traces into computable objects before rendering them into executable agent instructions\. Skill\-IR represents a skill with a routing header and three runtime components\. The *routing header*, comprising the front matter and description used for skill discovery, specifies when a skill should be considered applicable\. As shown in Figure [1](https://arxiv.org/html/2606.06893#S2.F1), the *workflow backbone* captures the control structure of execution, including workflow nodes and directed transitions among them \(when the skill has at least two workflow nodes connected by a directed edge\)\. *Node\-level semantics* define the local objectives and decision criteria that govern branch, retry, and termination behavior recorded in the workflow paths\. *Runtime attachments* describe the operational context required by execution, such as tools, scripts, resources, references, templates, configuration constraints, and output requirements\. Together with the routing header, this decomposition separates when a skill applies, how it executes, how decisions are made, and how runtime effects are constrained\.
Building on this insight, we propose W2S, an evidence\-driven skill construction framework\. W2S aligns traces into scenarios, extracts path\-level observations, and generates grounded skill drafts for each path\. It then fuses shared workflow nodes, reconciles branches and conflicts, and compresses redundancy while preserving execution\-critical behaviors such as verification, approval, rollback, and state management\. The drafted intermediate representation is finally rendered as a reusable agent skill\.
Experiments on multi\-scenario agent traces show that W2S improves replay\-based behavioral fidelity compared with summarization\- and prompting\-based baselines\. These results suggest that historical traces should be treated as execution evidence for inducing runtime specifications, rather than as documents to be compressed\. More broadly, they show that reliable agent learning from experience benefits from explicit intermediate representations, path\-level evidence tracking, and runtime\-aware structure\.
Our contributions are threefold:
- We identify automated agent skill generation as a structured induction task rather than as summarization of past traces\. To support this formulation, we introduce Skill\-IR, which represents a skill through a routing header, a workflow backbone, node\-level semantics, and runtime attachments\.
- We propose W2S, an evidence\-driven framework that constructs Skill\-IR from path\-level execution traces by aligning scenarios, merging compatible patterns, and preserving execution\-critical constraints\.
- Experiments show that, under the same interaction evidence, W2S consistently outperforms Anthropic Skill Creator in both structural fidelity and behavioral consistency on the WSASkill dataset\.
## 2 Related Work
### 2\.1 Agent Skills
LLM\-based agents are increasingly moving from monolithic prompting toward modular procedural abstractions \(Shi et al\., [2025](https://arxiv.org/html/2606.06893#bib.bib20); Ruan et al\., [2023](https://arxiv.org/html/2606.06893#bib.bib21)\)\. An *agent skill*, which is intended to persist across related tasks and sessions, is a reusable operational package that specifies when it should be activated, how the agent should proceed, and what task\-specific resources, scripts, tools, or constraints should be used during execution \(Ling et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib13)\)\. Unlike a low\-level tool or API, it does not necessarily expand the primitive action space; rather, it organizes existing instructions, actions, and resources into a repeatable procedure\. In this sense, skills serve as a procedural layer between high\-level task conditioning and concrete environment interaction: memory stores facts or preferences, tools expose primitive capabilities, whereas skills describe how those capabilities should be composed for recurring tasks \(Jiang et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib19); Zhou et al\., [2026b](https://arxiv.org/html/2606.06893#bib.bib17)\)\.
Recent agent runtimes and skill formats make this abstraction explicit\. A skill is typically packaged with metadata for discovery, natural\-language instructions for execution, and optional auxiliary files such as scripts, references, templates, or examples \(Nous Research, [2026](https://arxiv.org/html/2606.06893#bib.bib30)\)\. Such designs also adopt progressive disclosure: agents first inspect lightweight descriptions to decide whether a skill is relevant, and load the full procedural content only when needed\. This makes skills attractive for long\-horizon agent systems, where reusable procedures must be invoked selectively without flooding the context with all available experience\. As a result, skills are becoming a central form of *procedural memory* for LLM agents: they are editable, versionable, portable across compatible runtimes, and auditable as explicit artifacts rather than latent model behavior \(Jiang et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib19); Wu and Zhang, [2026](https://arxiv.org/html/2606.06893#bib.bib18)\)\.
### 2\.2 Trace\-grounded Skill Induction
A growing line of work studies how such skills can be acquired automatically from agent experience \(Wang et al\., [2024](https://arxiv.org/html/2606.06893#bib.bib9); Xia et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib2); Wang et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib8); Huang et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib5)\)\. We refer to this direction as *trace\-grounded skill induction*: the process of converting historical interaction or execution traces into reusable skill artifacts\. A trace may contain user requests, observations, intermediate reasoning, tool calls, environment actions, execution outcomes, corrections, and repeated failure or success patterns \(Ni et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib4)\)\. The key idea is to treat these traces not as passive logs to be retrieved or summarized, but as behavioral evidence from which future operating procedures can be reconstructed \(Li et al\., [2026b](https://arxiv.org/html/2606.06893#bib.bib3)\)\.
Existing methods instantiate this idea in different forms\. Agent Workflow Memory induces reusable workflows from past web\-agent trajectories and retrieves them to guide future action generation \(Wang et al\., [2024](https://arxiv.org/html/2606.06893#bib.bib9)\)\. Agent Skill Induction further represents induced skills as executable programs, enabling the system to verify skill correctness through execution rather than using free\-form textual lessons alone \(Wang et al\., [2025](https://arxiv.org/html/2606.06893#bib.bib28)\)\. AutoSkill abstracts recurring user requirements and interaction patterns into explicit, maintainable skills that can be updated and injected across sessions \(Yang et al\., [2026b](https://arxiv.org/html/2606.06893#bib.bib29)\)\. SkillRL distills raw trajectories into a hierarchical skill library and lets the skill library co\-evolve with the agent policy during reinforcement learning \(Xia et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib2)\)\. Trace2Skill analyzes multiple executions in parallel, extracts trajectory\-local lessons, and consolidates them into transferable skill directories that can either deepen existing human\-written skills or create new ones from scratch \(Ni et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib4)\)\. Together, these studies show that experience can be compressed into persistent procedural artifacts, improving agent success, efficiency, transfer, and long\-term adaptation without retraining the underlying model\.
However, this line of work also exposes a fundamental representation challenge \(Liu et al\., [2026a](https://arxiv.org/html/2606.06893#bib.bib6)\)\. The usefulness of a skill does not depend only on whether it preserves salient task content, but also on whether it preserves the runtime structure that makes the behavior executable \(Liang et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib7)\)\. If traces are compressed into free\-form summaries or loosely organized lessons, the resulting skill may lose operational details such as activation conditions, workflow stages, branch criteria, retry and fallback rules, tool\-use requirements, validation checks, and termination conditions \(Li et al\., [2026b](https://arxiv.org/html/2606.06893#bib.bib3)\)\. These details are not cosmetic: changing them can alter whether a skill is invoked, which path the agent follows, when it retries or stops, and what output constraints are enforced \(Liang et al\., [2026](https://arxiv.org/html/2606.06893#bib.bib7)\)\. Thus, effective trace\-grounded skill induction requires more than extracting important observations from past traces\. It must reconstruct the structured runtime specification that governs future agent behavior\.
Our work builds on this perspective\. Rather than treating trace\-derived skills as generic summaries of past experience, we study them as structured operational artifacts whose control logic, node\-level execution semantics, and runtime attachments jointly determine how an agent acts\. This framing allows us to analyze trace\-grounded skill induction as a conversion from interaction evidence to executable specifications\.
Figure 1: Overview of the Skill\-IR representation, with panels \(a\) Workflow Backbone $W$, \(b\) Operational Semantics $S$, and \(c\) Runtime Attachments $A$\. Skill\-IR models a skill as a structured runtime specification composed of three complementary components: the Workflow Backbone $W$, Operational Semantics $S$, and Runtime Attachments $A$\. Together, these components separate workflow structure, node\-level behavior, and runtime constraints, enabling skills to be reconstructed and evaluated as executable specifications rather than free\-form summaries\.
## 3 WSA Representation
While summarization primarily produces a compact natural\-language description, we view a skill as a structured object with explicitly defined inputs, outputs, and execution semantics\. Specifically, we model a skill as the combination of a routing header $R$ and a runtime specification\. The routing header determines when the skill should be selected \(i\.e\., through its description\), while the runtime specification determines how the skill behaves once selected:
$$\mathrm{Skill}=(R,\;\underbrace{W+S+A}_{\text{runtime specification}}),\tag{1}$$
where $R$ determines when the skill should be selected, while $W$, $S$, and $A$ jointly determine how it behaves once selected\. $W$ is the workflow backbone that captures the execution skeleton and node dependencies; $S$ is the operational semantics that specify how nodes are interpreted and executed; and $A$ denotes runtime attachments that define the operational boundary, including tools, resources, constraints, state operations, validation checks, and output schemas\.
### 3\.1 Workflow Backbone
The workflow backbone $W$ captures the execution structure of a skill:
$$W=(N,E),\tag{2}$$
where $N$ is a set of workflow nodes and $E$ is a set of directed links between nodes\. Each node represents an abstract execution unit, and each directed link represents an ordering or dependency relation between execution units\. We say a skill *has a workflow backbone* when $|N|\geq 2$ and the nodes are connected by at least one directed link, indicating a multi\-step execution flow\. A single\-node skill \($|N|=1$\) has no structural backbone to recover, even though it still possesses operational semantics and possibly runtime attachments\. Thus, $W$ describes the skeleton of the procedure: what units exist and how execution may move among them\.
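The backbone criterion above can be sketched in a few lines of Python\. This is only an illustrative sketch, not the authors' implementation: the `Workflow` class and its `has_backbone` method are hypothetical names, and excluding self\-loops from the "directed link" test is an assumption made here\.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Workflow:
    """W = (N, E): workflow nodes and directed links between them."""

    nodes: frozenset[str]
    edges: frozenset[tuple[str, str]] = field(default_factory=frozenset)

    def has_backbone(self) -> bool:
        # Criterion from the text: |N| >= 2 and at least one directed link.
        # Assumption: a link counts only if it joins two distinct known nodes.
        linked = any(
            u in self.nodes and v in self.nodes and u != v
            for u, v in self.edges
        )
        return len(self.nodes) >= 2 and linked


etl = Workflow(frozenset({"fetch", "parse"}), frozenset({("fetch", "parse")}))
tone_guide = Workflow(frozenset({"draft"}))
print(etl.has_backbone(), tone_guide.has_backbone())
```

Under this reading, a single\-node skill still carries semantics and attachments elsewhere in the representation; only its structural component is treated as absent\.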
### 3\.2 Operational Semantics
Operational semantics $S$ specify how the workflow nodes in $W$ should be interpreted and executed\. For each relevant node, $S$ records its behavioral meaning, including its local objective, decision logic, execution conditions, and criteria for success or failure\. In this sense, $S$ complements the workflow backbone by assigning executable meaning to the structural units\.
### 3\.3 Runtime Attachments
Runtime attachments $A$ specify the external and contextual dependencies required by skill execution\. They include the resources, interfaces, constraints, state interactions, validation requirements, and output commitments that define the operational boundary of the skill\. Unlike $W$, which captures execution structure, and $S$, which captures node\-level behavior, $A$ captures what the skill depends on or constrains at runtime\.
| ID | Components | Type | Representative Use Case | Typical skill |
| --- | --- | --- | --- | --- |
| T0 | none | Prompt Fragment | “Write in a professional tone” for email drafting\. | shakespearean\-english |
| T1 | $A$ | Attachment Wrapper | Expose a code formatter tool or API documentation template\. | weather\-api |
| T2 | $S$ | Semantic Guideline | “Prioritize user safety over feature completeness” in content moderation\. | code\-review\-checklist |
| T3 | $S+A$ | Semantic Resource | Use a customer database under the rule “only query active accounts”\. | react\-best\-practices |
| T4 | $W$ | Bare Workflow | “Fetch data → Parse → Return result” for simple ETL tasks\. | prototype |
| T5 | $W+A$ | Tool\-Driven Workflow | “Check inventory → Reserve stock → Send confirmation email” in order fulfillment\. | azure\-compliance |
| T6 | $W+S$ | Semantically Guided Workflow | “If urgent, escalate; else queue” in customer support routing\. | to\-issues |
| T7 | $W+S+A$ | Full Runtime Workflow | “Retrieve user profile → Apply business rules → Call payment API → Log transaction” in checkout\. | test\-driven\-development |
Table 1: Skill types defined by the presence of workflow backbone $W$, operational semantics $S$, and runtime attachments $A$\. A skill is assigned $W$ when $|N|\geq 2$ with at least one directed edge; single\-node skills \($|N|=1$\) are treated as having no $W$\.
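The type labels in Table 1 follow a regular pattern: treating the presence of $W$, $S$, and $A$ as binary digits worth 4, 2, and 1 reproduces the indices T0 to T7\. The helper below is a hypothetical illustration of that mapping \(the function name `skill_type` is not from the paper\)\.

```python
def skill_type(has_w: bool, has_s: bool, has_a: bool) -> str:
    """Map a WSA presence profile to the T0..T7 labels of Table 1."""
    # W contributes 4, S contributes 2, A contributes 1, matching the
    # row order of the table (T1 = A, T2 = S, T4 = W, T7 = W+S+A).
    index = 4 * int(has_w) + 2 * int(has_s) + int(has_a)
    return f"T{index}"


print(skill_type(has_w=True, has_s=False, has_a=True))  # tool-driven workflow
```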
### 3\.4 Skill Type Coverage
The Skill\-IR model induces eight skill types, summarized in Table [1](https://arxiv.org/html/2606.06893#S3.T1)\. These types define the coverage target for the dataset\. Non\-workflow skills \($|N|\leq 1$\) are included because they test whether a SkillCreator can reconstruct semantic or resource\-oriented capabilities without a multi\-step backbone to anchor on\. Workflow\-oriented skills are especially important because they test whether the model can reconstruct persistent execution contracts\.
Our dataset construction aims to cover this taxonomy explicitly\. For each skill, annotators or automated analyzers assign a WSA profile, identify the dominant skill type, and record which components are observable from the available traces\. This makes it possible to report reconstruction performance not only in aggregate, but also by skill type\.
| Statistic | Content |
| --- | --- |
| Reference skills | 70 skills |
| WSA skill\-type taxonomy | 8 skill types |
Table 2: Current dataset snapshot used by the W2S reconstruction study\. We count a skill scenario type as a distinct WSA path or use case, rather than as a raw dialogue turn\.
## 4 Dataset Construction
We construct WSASkill to evaluate whether a method can reconstruct executable agent skills from interaction evidence\. The dataset contains 70 reference skills and covers the eight WSA skill types defined in Table [1](https://arxiv.org/html/2606.06893#S3.T1)\. Each reference skill is first analyzed under the Skill\-IR framework: we extract its workflow backbone $W$, operational semantics $S$, and runtime attachments $A$\. This produces a structured reference representation that specifies both the organization of the skill and the runtime behavior it is expected to preserve\.
For workflow\-bearing skills, we parse the workflow backbone and enumerate W\-paths from the parsed graph\. Each W\-path corresponds to one INPUT\-to\-OUTPUT route, with loops collapsed into loop units and branch or loop annotations retained\. This path\-level enumeration turns a skill into a set of controlled runtime scenarios, allowing the dataset to cover not only the main execution flow but also alternative branches, loop behavior, fallback cases, and termination outcomes\.
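As a rough illustration of W\-path enumeration, the sketch below lists every INPUT\-to\-OUTPUT route in a small workflow graph\. It assumes loops have already been collapsed into single loop\-unit nodes, so the graph is acyclic; the function name `enumerate_w_paths` and the example node names are hypothetical, and the paper's actual parser may differ\.

```python
def enumerate_w_paths(edges, source="INPUT", sink="OUTPUT"):
    """Return every source-to-sink route in a (loop-collapsed) workflow graph."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)

    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        if node == sink:
            paths.append(prefix)
            return
        # Sorting keeps the enumeration order deterministic.
        for nxt in sorted(successors.get(node, [])):
            if nxt not in prefix:  # guard in case a cycle was not collapsed
                walk(nxt, prefix)

    walk(source, [])
    return paths


edges = [
    ("INPUT", "triage"),
    ("triage", "escalate"),
    ("triage", "queue"),
    ("escalate", "OUTPUT"),
    ("queue", "OUTPUT"),
]
for path in enumerate_w_paths(edges):
    print(" -> ".join(path))
```

Each returned route corresponds to one controlled runtime scenario; branch and loop annotations would be attached to the nodes in a fuller implementation\.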
We then collect interaction evidence for each W\-path\. Specifically, for every enumerated path, we collect 10 traces or real interaction instances that demonstrate the expected behavior of that path\. These traces serve as the input evidence for skill reconstruction, while the extracted WSA representation provides the reference used for replay\-based behavioral evaluation\.
This construction makes WSASkill a behavior\-centered benchmark rather than a collection of raw dialogue logs\. By combining WSA component extraction, deterministic W\-path enumeration, and multiple traces per path, the dataset supports fine\-grained evaluation across different skill structures and runtime scenarios\.
## 5 Methodology
W2S is a Skill\-IR\-guided trace\-grounded skill induction method\. The goal is not to produce a fluent summary of demonstrations, but to recover a reusable runtime specification whose workflow, node\-level rules, and runtime bindings are grounded in interaction evidence\. We use notation to describe the interface between stages, not to claim a closed\-form estimator\. Given trace evidence $\mathcal{D}$, W2S first reconstructs an intermediate WSA representation and then renders that representation as a reusable skill:
$$\mathcal{D}\xrightarrow{\mathrm{WSA\ parse}}\widehat{\Theta}=(\widehat{W},\widehat{S},\widehat{A})\xrightarrow{\mathrm{skill\ generation}}\widehat{\mathcal{S}}.\tag{3}$$
The central methodological claim is that reconstruction should be organized around WSA component recovery\. Without this structure, a model may produce a plausible instruction file while losing branch coverage, decision criteria, tool constraints, or validation behavior\.
Figure 2: Overview of our W2S\. The framework converts historical agent interaction traces into structured evidence, including workflow evidence, semantic evidence, and runtime evidence\. These evidence types are integrated into an intermediate skill representation with four components, *i\.e\.*, a routing header, a workflow backbone, node\-level semantics, and runtime attachments\. Finally, the intermediate representation is rendered as a reusable executable skill and refined through validation feedback\.
### 5\.1 WSA Reconstruction from Traces
W2S begins by converting interaction traces into WSA\-oriented evidence\. This step does not assign numerical support scores\. Instead, it separates the observations needed to reconstruct a runtime skill:
$$\mathcal{E}=\{\mathcal{E}_{W},\mathcal{E}_{S},\mathcal{E}_{A}\},\tag{4}$$
where $\mathcal{E}_{W}$ contains evidence about execution units and ordering, $\mathcal{E}_{S}$ contains evidence about node functions and decision rules, and $\mathcal{E}_{A}$ contains evidence about tools, resources, validation checks, state requirements, and output contracts\. The abstraction also records provenance: whether a statement is directly observed in a trace, inferred from repeated behavior, or unobserved\. This provenance discipline is central to the method because the output skill is a persistent runtime contract, not a summary of examples\.
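One way to picture the evidence split of Eq\. \(4\) together with the provenance labels is a small record type\. The sketch below is an assumption\-laden illustration: `Provenance`, `EvidenceItem`, `split_by_component`, and the example statements are hypothetical names and data, not taken from the W2S implementation\.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    OBSERVED = "observed"      # stated directly in a trace
    INFERRED = "inferred"      # generalized from repeated behavior
    UNOBSERVED = "unobserved"  # not supported by any trace


@dataclass(frozen=True)
class EvidenceItem:
    component: str             # "W", "S", or "A"
    statement: str
    provenance: Provenance
    trace_ids: tuple[str, ...] = ()


def split_by_component(items):
    """Group evidence into the E_W, E_S, E_A buckets of Eq. (4)."""
    buckets = {"W": [], "S": [], "A": []}
    for item in items:
        buckets[item.component].append(item)
    return buckets


items = [
    EvidenceItem("W", "triage precedes escalate", Provenance.OBSERVED, ("t1", "t4")),
    EvidenceItem("S", "escalate when the ticket is urgent", Provenance.INFERRED, ("t1",)),
    EvidenceItem("A", "retry limit for the ticket API", Provenance.UNOBSERVED),
]
evidence = split_by_component(items)
print({name: len(bucket) for name, bucket in evidence.items()})
```

Keeping the provenance on each item, rather than on the bucket, is what lets later stages mark a single rule as unobserved without discarding the rest of a component\.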
The WSA reconstruction stage treats a trace as a multi\-signal object\. User requests reveal activation conditions and task scope; agent actions reveal workflow order; justifications reveal decision criteria; failed or rejected paths reveal node\-local constraints; and final responses reveal output contracts\. W2S keeps these signals separated so that formatting conventions are not mistaken for workflow steps and isolated decisions are not promoted into global rules\.
The workflow component is reconstructed as $\widehat{W}=(\widehat{N},\widehat{E})$, where $\widehat{N}$ denotes execution units such as intake, retrieval, filtering, planning, checking, and response generation, and $\widehat{E}$ denotes directed links among those units\. When the reference skill contains only a single execution unit \($|N|=1$\), the WSA reconstruction stage produces a degenerate $\widehat{W}$ with one node and no edges, and the generation step skips workflow instructions\. The method deliberately keeps $W$ structural: branching, looping, approval, fallback, and termination are not encoded as separate elements of $W$\. They are node functions recovered in $S$\.
The semantic and attachment components are then aligned to the recovered workflow\. Symbolically, the semantic layer is a node\-indexed interpretation:
$$\widehat{S}:\widehat{N}\rightarrow\{\mathrm{goal},\mathrm{criteria},\mathrm{outcome},\mathrm{quality}\}.\tag{5}$$
This notation is intentionally schematic: it states the scope of semantic reconstruction rather than a fixed schema\. It allows the method to attach goals, decision criteria, validation rules, examples, and outcomes to the node where they affect behavior\. Runtime attachments are similarly scoped:
$$\widehat{A}=\widehat{A}_{\mathrm{global}}\cup\widehat{A}_{\mathrm{node}}.\tag{6}$$
Global attachments define the skill\-wide environment, while node\-local attachments define resources, actions, checks, or output obligations that apply only at a particular execution unit\. This joint WSA reconstruction makes errors inspectable: a failure can be attributed to a missing node, a wrong link, a wrong node function, or a mis\-scoped runtime attachment\.
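The scoping in Eq\. \(6\) can be illustrated with a lookup that combines skill\-wide and node\-local attachments\. This is a minimal sketch under the assumption that attachments are plain string identifiers; `attachments_at` and the checkout\-style names are hypothetical\.

```python
def attachments_at(node, global_attachments, node_attachments):
    """Attachments in force at one node: skill-wide ones plus node-local ones."""
    return set(global_attachments) | set(node_attachments.get(node, ()))


global_attachments = {"audit_log"}
node_attachments = {
    "call_payment": {"payment_api", "amount_validation"},
}

print(sorted(attachments_at("call_payment", global_attachments, node_attachments)))
print(sorted(attachments_at("load_profile", global_attachments, node_attachments)))
```

A mis\-scoped attachment then shows up as a concrete difference: for example, `payment_api` appearing in the global set would make it visible at `load_profile`, where the traces never used it\.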
### 5\.2 WSA\-Constrained Skill Generation
Once WSA components are identified, W2S generates a skill document under explicit structural constraints\. The reconstructed skill must include activation conditions, a coherent workflow, node\-level semantics for decision\-bearing nodes, runtime attachments, validation requirements, and output behavior\. The generation step follows three rules\.
First, every major instruction must be traceable to W, S, or A evidence\. Second, branch alternatives and rare paths should remain explicit rather than being absorbed into generic prose\. Third, uncertainty must be represented as uncertainty\. If the traces do not demonstrate a tool argument, failure mode, or branch condition, the reconstructed skill may mark it as unobserved or request clarification, but should not fabricate a precise rule\.
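The first and third rules can be made concrete with a small rendering helper that attaches evidence to each instruction and flags gaps instead of filling them\. This is a hedged sketch: `render_instruction`, the bracketed output format, and the trace identifiers are illustrative assumptions, not the format W2S actually emits\.

```python
def render_instruction(statement, provenance, trace_ids=()):
    """Render one instruction line while keeping uncertainty explicit."""
    if provenance == "unobserved" or not trace_ids:
        # Rule 3: flag the gap rather than inventing a precise rule.
        return f"[UNOBSERVED] {statement} (not demonstrated in traces; request clarification)"
    # Rule 1: every rendered instruction carries its supporting evidence.
    sources = ", ".join(trace_ids)
    return f"{statement} [{provenance}; evidence: {sources}]"


print(render_instruction("Require approval before issuing a refund", "observed", ("t3", "t7")))
print(render_instruction("Retry limit for the payment API", "unobserved"))
```

Treating an instruction with no trace identifiers as unobserved, even when it is labeled otherwise, is a conservative design choice in this sketch: a claim that cannot be traced is handled the same way as one that was never seen\.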
This constrained synthesis reduces two common reconstruction errors\. The first is *over\-compression*, where distinct branches or validation cases are collapsed into a single broad instruction\. The second is *over\-generalization*, where the model invents workflow steps, decision criteria, or attachments that were not supported by the traces\. Path\-aware WSA generation keeps rare behavior visible, while evidence discipline limits unsupported generalization\.
Operationally, generation aims to produce a skill text whose parsed WSA structure matches the recovered components:
$$\mathrm{Parse}_{WSA}(\widehat{\mathcal{S}})=(\widehat{W},\widehat{S},\widehat{A})\approx\left(\widehat{W}(\mathcal{D}),\widehat{S}(\mathcal{D}),\widehat{A}(\mathcal{D})\right).$$
For non\-workflow skills \($|N|=1$\), $\widehat{W}$ is a degenerate single\-node graph with no edges\. The approximation is deliberate: the final skill must be readable and usable, but it should not erase component boundaries\. Workflow instructions state the node\-link structure; semantic instructions explain node functions and criteria; attachment instructions bind resources, tools, checks, and output requirements to the appropriate scope\.
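The approximate match between the parsed skill and the recovered components can be checked component by component\. The sketch below assumes each component has been flattened into a set of hashable items \(node names, edge pairs, `(node, field)` pairs for semantics, `(scope, name)` pairs for attachments\); this flattening and the name `wsa_mismatches` are illustrative assumptions\.

```python
def wsa_mismatches(parsed, recovered):
    """List per-component differences between a parsed skill and recovered WSA."""
    report = {}
    for component in ("W_nodes", "W_edges", "S", "A"):
        have = set(parsed.get(component, ()))
        want = set(recovered.get(component, ()))
        if have != want:
            report[component] = {
                "missing": sorted(want - have),  # recovered but not rendered
                "extra": sorted(have - want),    # rendered without support
            }
    return report


recovered = {
    "W_nodes": {"triage", "escalate", "queue"},
    "W_edges": {("triage", "escalate"), ("triage", "queue")},
    "S": {("triage", "criteria")},
    "A": {("global", "ticket_api")},
}
parsed = {
    "W_nodes": {"triage", "escalate", "queue"},
    "W_edges": {("triage", "escalate")},  # the rare "queue" branch was dropped
    "S": {("triage", "criteria")},
    "A": {("global", "ticket_api")},
}
print(wsa_mismatches(parsed, recovered))
```

In this toy case the report isolates a single missing edge, which is the over\-compression pattern described earlier: a rare branch absorbed into the main flow\.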
The generation stage is therefore constrained by three writing obligations\. First, workflow content must preserve the recovered nodes and links\. Second, semantic content must remain node\-local when it governs a specific decision, validation, fallback, or termination behavior\. Third, attachment content must state the scope of resources and checks\. These obligations are qualitative constraints, not claims that the generator solves a particular optimization problem\.
### 5\.3Feedback Refinement
W2S includes a feedback loop that evaluates and revises the draft before producing the final skill\. The loop has three checks\. The *coverage check* compares the reconstructed WSA components against the trace evidence and flags missing path segments, branch outcomes, decision criteria, or attachments\. The *consistency check* detects contradictions, such as a workflow that requires approval while the attachment section makes the corresponding action unconditional\. The *executability check* reviews whether the reconstructed skill is actionable for a downstream agent: steps must be ordered, decision criteria must be placed at the right nodes, and validation or output requirements must be specific enough to guide behavior\.
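The first two checks lend themselves to a small set\-based sketch\. The function names, the path encoding as tuples of node names, and the `"unconditional"` mode label below are illustrative assumptions, not the W2S implementation\. The executability check is omitted because it depends on judgments about ordering and specificity that a few lines of set logic cannot capture\.

```python
def coverage_check(
    evidence_paths: set[tuple[str, ...]],
    skill_paths: set[tuple[str, ...]],
) -> list[tuple[str, ...]]:
    """Return execution paths seen in the traces but absent from the draft skill."""
    return sorted(evidence_paths - skill_paths)


def consistency_check(
    approval_required: set[str],
    attachment_modes: dict[str, str],
) -> list[str]:
    """Return actions the workflow gates behind approval but attachments mark unconditional."""
    return sorted(
        action
        for action in approval_required
        if attachment_modes.get(action) == "unconditional"
    )


# Toy evidence: the rare reject branch was dropped by the draft.
evidence_paths = {
    ("intake", "validate", "approve", "deploy"),
    ("intake", "validate", "reject"),
}
skill_paths = {("intake", "validate", "approve", "deploy")}
missing = coverage_check(evidence_paths, skill_paths)

# Toy contradiction: deploy needs approval in W but is unconditional in A.
conflicts = consistency_check(
    {"deploy"},
    {"deploy": "unconditional", "notify": "unconditional"},
)
print(missing, conflicts)
```

In the toy data, the dropped reject branch is reported as a coverage gap, and only `deploy` is reported as a contradiction because `notify` was never gated by the workflow\.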
When a check fails, W2S returns targeted feedback to the generation step\. The feedback is WSA\-local: a missing edge is repaired in $W$, a vague criterion is repaired in $S$, and a missing tool or validation binding is repaired in $A$\. The revised skill is then re\-parsed into $(\widehat{W},\widehat{S},\widehat{A})$ and checked again\. This iterative process continues until no high\-priority WSA error remains or a fixed refinement budget is reached\. In this way, feedback refinement is not a generic polishing step; it is a structured attempt to improve reconstruction fidelity along the same components used for evaluation\.
We write the refinement loop as:
$$\widehat{\mathcal{S}}^{(t+1)}=\mathrm{Revise}\left(\widehat{\mathcal{S}}^{(t)},\Delta_{W}^{(t)},\Delta_{S}^{(t)},\Delta_{A}^{(t)}\right). \tag{7}$$

The feedback terms are obtained by comparing the parsed candidate skill with the WSA evidence:

$$(\Delta_{W}^{(t)},\Delta_{S}^{(t)},\Delta_{A}^{(t)})=\mathrm{Check}\left(\mathrm{Parse}_{WSA}(\widehat{\mathcal{S}}^{(t)}),\mathcal{E}\right). \tag{8}$$

$\Delta_{W}$ contains node\-link coverage errors, $\Delta_{S}$ contains node\-function or criterion errors, and $\Delta_{A}$ contains missing or mis\-scoped runtime bindings\. The process stops when no high\-priority WSA error remains or the refinement budget is exhausted\.
The repair operation is typed\. A W\-level repair may split a collapsed node, restore a missing link, or separate two alternative paths\. An S\-level repair may rewrite a vague instruction into a node\-local predicate, add a missing accept/reject criterion, or move a termination condition from the backbone description into the semantic record of the relevant node\. An A\-level repair may bind a resource to the node that uses it, mark an approval requirement as mandatory, or restore an output schema omitted by the previous draft\. Since the feedback is expressed in the same WSA language as the evaluation, the method aligns its revisions with the fidelity dimensions later reported by the benchmark\.
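The loop in Eqs\. \(7\) and \(8\), together with the typed repairs, can be sketched in a few lines of Python\. This is a toy under stated assumptions: a skill is modeled as a dictionary of string sets, parsing is the identity, every error counts as high priority, and a repair simply restores the missing item\. The names `Feedback`, `check`, `revise`, and `refine` are hypothetical, and in W2S the corresponding steps would involve a generator model, not set arithmetic\.

```python
from typing import NamedTuple

Skill = dict[str, set[str]]  # hypothetical: component name ("W", "S", "A") -> items


class Feedback(NamedTuple):
    """Typed feedback, one list per WSA component (Delta_W, Delta_S, Delta_A)."""
    delta_w: list[str]
    delta_s: list[str]
    delta_a: list[str]

    def is_empty(self) -> bool:
        return not (self.delta_w or self.delta_s or self.delta_a)


def check(parsed: Skill, evidence: Skill) -> Feedback:
    """Eq. (8): compare the parsed candidate with the WSA evidence."""
    return Feedback(*(sorted(evidence[k] - parsed[k]) for k in ("W", "S", "A")))


def revise(skill: Skill, fb: Feedback) -> Skill:
    """Eq. (7): typed repair, where each delta only touches its own component."""
    return {
        "W": skill["W"] | set(fb.delta_w),
        "S": skill["S"] | set(fb.delta_s),
        "A": skill["A"] | set(fb.delta_a),
    }


def refine(skill: Skill, evidence: Skill, budget: int = 3) -> tuple[Skill, int]:
    """Iterate until no error remains or the refinement budget is exhausted."""
    for rounds in range(budget):
        fb = check(skill, evidence)  # parsing is the identity in this toy setup
        if fb.is_empty():
            return skill, rounds
        skill = revise(skill, fb)
    return skill, budget


evidence = {
    "W": {"validate->approve", "validate->reject"},
    "S": {"approve: amount <= limit"},
    "A": {"approve: approval mandatory"},
}
draft = {"W": {"validate->approve"}, "S": set(), "A": set()}
final, rounds = refine(draft, evidence)
print(rounds, final == evidence)
```

Because each delta is applied only to its own component, a W\-level error can never be silently patched by editing attachment text, which mirrors the typed\-repair idea described above\.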
## 6Experiments
We conduct experiments to evaluate whether our framework can faithfully reconstruct skills from workflow evidence\. Our evaluation focuses on replay\-based behavioral fidelity\. We first describe the experimental setup and metrics, and then report the main empirical results\.
### 6\.1Experimental Setup
Dataset\. We evaluate our method on WSASkill, a benchmark organized around the WSA skill taxonomy\. For each reference skill, WSASkill provides the corresponding WSA profile, workflow backbone when applicable, and replay scenarios used to assess behavioral fidelity\. These scenarios instantiate different execution paths, decision outcomes, and attachment\-use cases, enabling us to compare reconstructed skills against their reference behavior under controlled replay settings\.
Baseline\. We use Anthropic Skill Creator \(ASC\) as the baseline\. ASC is an official skill\-construction workflow proposed by Anthropic that converts interaction evidence into a reusable SKILL\.md file through a structured interview and drafting process\. It is directly comparable to our setting because both methods aim to construct reusable agent skills from prior interaction evidence\.
Metrics\. We evaluate reconstructed skills using *replay\-based behavioral fidelity*\. For each WSA type, we replay the same evaluation scenarios with the reconstructed skill and compare its behavior against the reference skill\. The score measures whether the reconstruction preserves the expected execution behavior, including task progression, decision outcomes, tool\-use constraints, validation requirements, and final responses\. Higher scores indicate stronger behavioral consistency with the reference skill\.
### 6\.2Main Results
Table [3](https://arxiv.org/html/2606.06893#S6.T3) reports the replay\-based behavioral fidelity results across WSA skill types\. Overall, W2S outperforms Anthropic Skill Creator \(ASC\) on seven of the eight categories, achieving an average score of 0\.503 compared with 0\.455 for ASC\. This indicates that the WSA\-guided reconstruction more effectively preserves the expected execution behavior of reference skills\.
The main exception is T5, where ASC achieves a higher score than W2S\. T5 represents skills that combine a workflow backbone with runtime attachments but do not include explicit operational semantics\. In this setting, the model must recover how tools, resources, and constraints should be bound to workflow steps without clear semantic guidance\. This makes the generated skill sensitive to attachment placement and tool\-binding details, which can lead to behavior mismatches during replay\. We view this as evidence that attachment\-heavy workflows require stronger mechanisms for scoping resources and constraints in future versions of W2S\.
| Type | W2S | ASC | Gap |
| --- | --- | --- | --- |
| T0 | 0\.623 | 0\.520 | \+0\.103 |
| T1 | 0\.522 | 0\.407 | \+0\.115 |
| T2 | 0\.760 | 0\.638 | \+0\.122 |
| T3 | 0\.573 | 0\.533 | \+0\.040 |
| T4 | 0\.276 | 0\.227 | \+0\.049 |
| T5 | 0\.480 | 0\.550 | \-0\.070 |
| T6 | 0\.515 | 0\.450 | \+0\.065 |
| T7 | 0\.652 | 0\.613 | \+0\.039 |
| Average | 0\.503 | 0\.455 | \+0\.048 |

Table 3: Replay\-based behavioral fidelity results by WSA skill type\. Gap is computed as W2S minus ASC; higher scores indicate stronger behavioral consistency with the reference skill\.

These results suggest that WSA\-guided skill reconstruction generally improves behavioral preservation over ASC, while also revealing that attachment\-heavy workflow skills remain the most challenging category for the current method\.
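The arithmetic behind Table 3 can be rechecked with a short Python snippet\. The numbers are copied from the table; the variable names are ours, and reading the 10\.5% figure quoted in the abstract as the relative gain of the average W2S score over the average ASC score is our interpretation, which the computation below appears to support\.

```python
# Per-type scores copied from Table 3: type -> (W2S, ASC).
scores = {
    "T0": (0.623, 0.520),
    "T1": (0.522, 0.407),
    "T2": (0.760, 0.638),
    "T3": (0.573, 0.533),
    "T4": (0.276, 0.227),
    "T5": (0.480, 0.550),
    "T6": (0.515, 0.450),
    "T7": (0.652, 0.613),
}

# Gap is W2S minus ASC, rounded to the three decimals used in the table.
gaps = {t: round(w2s - asc, 3) for t, (w2s, asc) in scores.items()}
wins = sum(1 for g in gaps.values() if g > 0)
losers = [t for t, g in gaps.items() if g < 0]

# Reported averages from the last row of the table.
avg_w2s, avg_asc = 0.503, 0.455
absolute_gain = round(avg_w2s - avg_asc, 3)
relative_gain = (avg_w2s - avg_asc) / avg_asc

print(f"W2S wins on {wins} of {len(scores)} types; ASC leads on {losers}")
print(f"absolute gain: {absolute_gain:+.3f}, relative gain: {relative_gain:.1%}")
```

The absolute gap of 0\.048 on a 0\.455 baseline corresponds to a relative gain of about 10\.5%, and T5 is the only type on which ASC leads\.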
## 7Conclusion
We present automated agent skill generation as a structured induction problem rather than a trace summarization problem\. Our key idea is that a reusable skill should preserve the runtime information needed for future execution, including the workflow structure, node\-level decision semantics, and operational constraints\. To this end, we introduce Skill\-IR, a structured representation that makes these components explicit and separates executable skill reconstruction from ordinary text compression\. Building on Skill\-IR, our framework reconstructs skills from behavioral evidence\. Our evaluation moves beyond surface\-level text similarity and measures whether generated skills preserve the runtime contract of reference skills\.
## Open Science
In accordance with open science policies, this paper adheres to principles that promote transparency, accessibility, and reproducibility\. All source code and supplementary materials are publicly available at [https://github\.com/Q1ngSong/RWSA](https://github.com/Q1ngSong/RWSA), enabling verification, reuse, and further exploration of our methods\.
## References
- From raw experience to skill consumption: a systematic study of model\-generated agent skills\.arXiv preprint arXiv:2605\.23899\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p1.1)\.
- Y\. Jiang, D\. Li, H\. Deng, B\. Ma, X\. Wang, Q\. Wang, and G\. Yu \(2026\)SoK: agentic skills–beyond tool use in llm agents\.arXiv preprint arXiv:2602\.20867\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.06893#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.06893#S2.SS1.p2.1)\.
- X\. Li, W\. Chen, Y\. Liu, S\. Zheng, X\. Chen, Y\. He, Y\. Li, B\. You, H\. Shen, J\. Sun,et al\.\(2026a\)SkillsBench: benchmarking how well agent skills work across diverse tasks\.arXiv preprint arXiv:2602\.12670\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p1.1)\.
- Y\. Li, R\. Miao, Z\. Qi, and T\. Lan \(2026b\)Arise: agent reasoning with intrinsic skill evolution in hierarchical reinforcement learning\.arXiv preprint arXiv:2603\.16060\.Cited by:[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p1.1),[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p3.1)\.
- Y\. Li, Y\. Dou, J\. Shao, Y\. Lyu, I\. Tsang, and H\. Yin \(2026c\)Skilltracer: structural failure attribution and refinement of agentic skills in long\-horizon web tasks\.InWorkshop on Multi\-Agent Learning and Its Opportunities in the Era of Generative AI,Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p2.1)\.
- Q\. Liang, H\. Wang, Z\. Liang, and Y\. Liu \(2026\)From skill text to skill structure: the scheduling\-structural\-logical representation for agent skills\.arXiv preprint arXiv:2604\.24026\.Cited by:[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p3.1)\.
- G\. Ling, S\. Zhong, and R\. Huang \(2026\)Agent skills: a data\-driven analysis of claude skills for extending large language model functionality\.arXiv preprint arXiv:2602\.08004\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.06893#S2.SS1.p1.1)\.
- H\. Liu, H\. Yang, T\. Jiang, B\. Tang, F\. Xiong, and Z\. Li \(2026a\)Skillsvote: lifecycle governance of agent skills from collection, recommendation to evolution\.arXiv preprint arXiv:2605\.18401\.Cited by:[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p3.1)\.
- X\. Liu, X\. Luo, L\. Li, G\. Huang, J\. Liu, and H\. Qiao \(2026b\)Skillforge: forging domain\-specific, self\-evolving agent skills in cloud technical support\.arXiv preprint arXiv:2604\.08618\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p2.1)\.
- J\. Luo, W\. Zhang, Y\. Yuan, Y\. Zhao, J\. Yang, Y\. Gu, B\. Wu, B\. Chen, Z\. Qiao, Q\. Long,et al\.\(2025\)Large language model agent: a survey on methodology, applications and challenges\.arXiv preprint arXiv:2503\.21460\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p1.1)\.
- Y\. Ma, Y\. Huang, H\. Bao, H\. Zhuang, S\. Shukla, M\. Galley, X\. Zhang, and S\. Feuerriegel \(2026\)Skillgen: verified inference\-time agent skill synthesis\.arXiv preprint arXiv:2605\.10999\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p2.1)\.
- J\. Ni, Y\. Liu, X\. Liu, Y\. Sun, M\. Zhou, P\. Cheng, D\. Wang, E\. Zhao, X\. Jiang, and G\. Jiang \(2026\)Trace2skill: distill trajectory\-local lessons into transferable agent skills\.arXiv preprint arXiv:2603\.25158\.Cited by:[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p1.1),[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p2.1)\.
- Nous Research \(2026\)Hermes agent skills system\.Note:[https://hermes\-agent\.nousresearch\.com/docs/user\-guide/features/skills](https://hermes-agent.nousresearch.com/docs/user-guide/features/skills)Accessed: 2026\-05\-24Cited by:[§2\.1](https://arxiv.org/html/2606.06893#S2.SS1.p2.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever \(2021\)Learning transferable visual models from natural language supervision\.InInternational Conference on Machine Learning,pp\. 8748–8763\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p4.1)\.
- J\. Ruan, Y\. Chen, B\. Zhang, Z\. Xu, T\. Bao, H\. Mao, Z\. Li, X\. Zeng, R\. Zhao,et al\.\(2023\)Tptu: task planning and tool usage of large language model\-based ai agents\.InNeurIPS 2023 foundation models for decision making workshop,Cited by:[§2\.1](https://arxiv.org/html/2606.06893#S2.SS1.p1.1)\.
- Z\. Shi, S\. Gao, L\. Yan, Y\. Feng, X\. Chen, Z\. Chen, D\. Yin, S\. Verberne, and Z\. Ren \(2025\)Tool learning in the wild: empowering language models as automatic tool agents\.InProceedings of the ACM on Web Conference 2025,pp\. 2222–2237\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.06893#S2.SS1.p1.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in neural information processing systems36,pp\. 8634–8652\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p3.1)\.
- L\. Tang, T\. Goyal, A\. Fabbri, P\. Laban, J\. Xu, S\. Yavuz, W\. Kryściński, J\. Rousseau, and G\. Durrett \(2023\)Understanding factual errors in summarization: errors, summarizers, datasets, error detectors\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 11626–11644\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p3.1)\.
- C\. Wang, Z\. Yu, X\. Xie, W\. Yao, R\. Fang, S\. Qiao, K\. Cao, G\. Zheng, X\. Qi, P\. Zhang,et al\.\(2026\)Skillx: automatically constructing skill knowledge bases for agents\.arXiv preprint arXiv:2604\.04804\.Cited by:[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p1.1)\.
- Z\. Z\. Wang, A\. Gandhi, G\. Neubig, and D\. Fried \(2025\)Inducing programmatic skills for agentic tasks\.arXiv preprint arXiv:2504\.06821\.Cited by:[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p2.1)\.
- Z\. Z\. Wang, J\. Mao, D\. Fried, and G\. Neubig \(2024\)Agent workflow memory\.arXiv preprint arXiv:2409\.07429\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p1.1),[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p2.1)\.
- Y\. Wu and Y\. Zhang \(2026\)Agent skills from the perspective of procedural memory: a survey\.Authorea Preprints\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.06893#S2.SS1.p2.1)\.
- P\. Xia, J\. Chen, H\. Wang, J\. Liu, K\. Zeng, Y\. Wang, S\. Han, Y\. Zhou, X\. Zhao, H\. Chen,et al\.\(2026\)Skillrl: evolving agents via recursive skill\-augmented reinforcement learning\.arXiv preprint arXiv:2602\.08234\.Cited by:[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p1.1),[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p2.1)\.
- T\. Xie, D\. Zhang, J\. Chen, X\. Li, S\. Zhao, R\. Cao, T\. J\. Hua, Z\. Cheng, D\. Shin, F\. Lei,et al\.\(2024\)Osworld: benchmarking multimodal agents for open\-ended tasks in real computer environments\.Advances in Neural Information Processing Systems37,pp\. 52040–52094\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p1.1)\.
- R\. Xu and Y\. Yan \(2026\)Agent skills for large language models: architecture, acquisition, security, and the path forward\.arXiv preprint arXiv:2602\.12430\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p1.1)\.
- C\. Yang, X\. Wu, H\. Liu, X\. Lin, C\. Xu, X\. Jiang, Y\. Sun, W\. Zhang, Z\. Shi, Y\. Xu,et al\.\(2026a\)A survey of agent skills: toward procedural infrastructure for llm agents\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p2.1)\.
- Y\. Yang, J\. Li, Q\. Pan, B\. Zhan, Y\. Cai, L\. Du, J\. Zhou, K\. Chen, Q\. Chen, X\. Li, B\. Zhang, and L\. He \(2026b\)AutoSkill: experience\-driven lifelong learning via skill self\-evolution\.arXiv preprint arXiv:2603\.01145\.Cited by:[§2\.2](https://arxiv.org/html/2606.06893#S2.SS2.p2.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2022\)React: synergizing reasoning and acting in language models\.arXiv preprint arXiv:2210\.03629\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p3.1)\.
- Y\. Zhou, Z\. Zhang, Z\. Cheng, S\. Zhang, Q\. Lan, Z\. Chen, Z\. Yang, R\. Chen, H\. Wang, S\. Hu,et al\.\(2026a\)SkillGenBench: benchmarking skill generation pipelines for llm agents\.arXiv preprint arXiv:2605\.18693\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p3.1)\.
- Y\. Zhou, W\. Shu, Y\. Su, W\. Du, Y\. Fang, and X\. Lin \(2026b\)A comprehensive survey on agent skills: taxonomy, techniques, and applications\.arXiv preprint arXiv:2605\.07358\.Cited by:[§1](https://arxiv.org/html/2606.06893#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.06893#S2.SS1.p1.1)\.