MIND-Skill: Quality-Guaranteed Skill Generation via Multi-Agent Induction and Deduction

arXiv cs.AI Papers

Summary

MIND-Skill is a new framework introduced in this research paper that automates the generation of high-quality, reusable agent skills using multi-agent induction and deduction with quality guarantees via TextGrad optimization.

arXiv:2605.08670v1 Announce Type: new Abstract: Large language model (LLM) powered AI agents have emerged as a promising paradigm for autonomous problem-solving, yet they continue to struggle with complex, multi-step real-world tasks that demand domain-specific procedural knowledge. Reusable agent skills, which encapsulate successful problem-solving strategies, offer a natural remedy by enabling agents to build on prior experience. However, curating such skills has largely remained a manual endeavor, requiring human experts to distill rich domain knowledge into actionable guidelines. In this work, we present $\textbf{M}$ulti-agent $\textbf{IN}$duction and $\textbf{D}$eduction for $\textbf{Skill}$s ($\textbf{MIND-Skill}$), a framework that automatically induces generalizable skills from successful trajectories with robust quality guarantees. MIND-Skill consists of an induction agent which is tasked to abstract reusable skills from successful trajectories, and a deduction agent which aims to reconstruct trajectories by following the induced skills. To guarantee the quality of the generated skills, we introduce a reconstruction loss that compares input and reconstructed trajectories, an outcome loss that enforces the correctness of the reconstructed trajectories, and a rubric loss that assesses the documentation quality and regularizes the abstraction level of the generated skills according to predefined criteria. These textual losses are jointly optimized with TextGrad, and the resulting skills are evaluated on held-out tasks unseen during optimization. Experiments on AppWorld and BFCL-v3 show that MIND-Skill consistently outperforms concurrent skill generation methods.
Original Article
View Cached Full Text

Cached at: 05/12/26, 07:19 AM

# MIND-Skill: Quality-Guaranteed Skill Generation via Multi-Agent Induction and Deduction
Source: [https://arxiv.org/html/2605.08670](https://arxiv.org/html/2605.08670)
Yixuan Li1Mingshu Cai211footnotemark:1Ziyang Xiao3Wanyuan Wang4Yanchen Deng1Bo An11Nanyang Technological University2Waseda University3Zhejiang University4Southeast University

###### Abstract

Large language model \(LLM\) powered AI agents have emerged as a promising paradigm for autonomous problem\-solving, yet they continue to struggle with complex, multi\-step real\-world tasks that demand domain\-specific procedural knowledge\. Reusable agent skills, which encapsulate successful problem\-solving strategies, offer a natural remedy by enabling agents to build on prior experience\. However, curating such skills has largely remained a manual endeavor, requiring human experts to distill rich domain knowledge into actionable guidelines\. In this work, we presentMulti\-agentINduction andDeduction forSkills \(MIND\-Skill\), a framework that automatically induces generalizable skills from successful trajectories with robust quality guarantees\. MIND\-Skill consists of an induction agent which is tasked to abstract reusable skills from successful trajectories, and a deduction agent which aims to reconstruct trajectories by following the induced skills\. To guarantee the quality of the generated skills, we introduce a reconstruction loss that compares input and reconstructed trajectories, an outcome loss that enforces the correctness of the reconstructed trajectories, and a rubric loss that assesses the documentation quality and regularizes the abstraction level of the generated skills according to predefined criteria\. These textual losses are jointly optimized with TextGrad, and the resulting skills are evaluated on held\-out tasks unseen during optimization\. Experiments on AppWorld and BFCL\-v3 show that MIND\-Skill consistently outperforms concurrent skill generation methods\.

## 1Introduction

Large language models \(LLMs\) have demonstrated exceptional performance on various challenging reasoning tasks, including theorem proving\(Yanget al\.,[2023](https://arxiv.org/html/2605.08670#bib.bib21); Hubertet al\.,[2026](https://arxiv.org/html/2605.08670#bib.bib20)\), code generation\(Lyuet al\.,[2025](https://arxiv.org/html/2605.08670#bib.bib22); Wanget al\.,[2025a](https://arxiv.org/html/2605.08670#bib.bib23)\), and scientific discovery\(Novikovet al\.,[2025](https://arxiv.org/html/2605.08670#bib.bib24)\)\. Equipped with tools, memory, and harness scaffolding, LLM\-powered AI agents\(Steinberger and OpenClaw Community,[2026](https://arxiv.org/html/2605.08670#bib.bib25); Nous Research,[2026](https://arxiv.org/html/2605.08670#bib.bib26); Anthropic,[2025a](https://arxiv.org/html/2605.08670#bib.bib27),[2026](https://arxiv.org/html/2605.08670#bib.bib28)\)have emerged as a promising paradigm for autonomous problem\-solving in many open\-ended scenarios\. While LLMs inherit extensive declarative knowledge from pretraining, AI agents still struggle with complex, long\-horizon tasks that demand domain\-specificprocedural knowledge, such as using APIs, making multi\-step tool calls, and adapting actions based on workflow feedback\(Trivediet al\.,[2024](https://arxiv.org/html/2605.08670#bib.bib1); Patilet al\.,[2025](https://arxiv.org/html/2605.08670#bib.bib29)\)\.

Agent skills\(Anthropic,[2025b](https://arxiv.org/html/2605.08670#bib.bib6)\), which encapsulate successful problem\-solving strategies and standard operating procedures into bundles of Markdown documents and related scripts, offer an elegant solution by enabling agents to build on prior domain experience\(Tagkopouloset al\.,[2025](https://arxiv.org/html/2605.08670#bib.bib30); Liet al\.,[2026a](https://arxiv.org/html/2605.08670#bib.bib31)\)\. However, curating high\-quality skills has largely remained a manual endeavor, requiring extensive human expertise to distill rich domain knowledge into actionable guidelines\(Liet al\.,[2026b](https://arxiv.org/html/2605.08670#bib.bib14)\)\. Recent research efforts have attempted to generate skills automatically from different sources of knowledge\. Zero\-shot techniques\(Anthropic,[2025c](https://arxiv.org/html/2605.08670#bib.bib32)\)turn task descriptions or user prompts directly into skills by eliciting the prior knowledge of LLMs, though their effectiveness remains limited\(Liet al\.,[2026b](https://arxiv.org/html/2605.08670#bib.bib14)\)\. Trajectory\-distillation methods\(Niet al\.,[2026](https://arxiv.org/html/2605.08670#bib.bib12); Wanget al\.,[2026a](https://arxiv.org/html/2605.08670#bib.bib33); Tuet al\.,[2026](https://arxiv.org/html/2605.08670#bib.bib19)\)derive reusable skills for novel tasks by abstracting existing execution traces into generalizable procedures, typically in an offline fashion\. Lastly, lifelong evolving methods\(Nous Research,[2026](https://arxiv.org/html/2605.08670#bib.bib26); Xiaet al\.,[2026](https://arxiv.org/html/2605.08670#bib.bib13); Wanget al\.,[2025b](https://arxiv.org/html/2605.08670#bib.bib9); Alzubiet al\.,[2026](https://arxiv.org/html/2605.08670#bib.bib11)\)continuously crystallize and refine skills according to agents’ accumulated experiences and memory\.

Unfortunately, a key limitation of existing skill generation methods is the lack of quality guarantees\.First, many techniques directly generate skills from task specifications, trajectories or experiences without a principled closed\-loop pipeline that explicitly validates, corrects, and refines the skills based on execution outcomes\.Second, the documentation quality of the generated skills is largely overlooked\. Skills are intended to be reusable, portable artifacts that can be shared across agents, models and even human practitioners, yet current methods rarely evaluate whether the produced documents adhere to established standards of technical writing, e\.g\., logical flow and troubleshooting guidance\.Third, for trajectory\-distillation methods, the faithfulness of the abstraction process is never verified\. Distilling execution traces into reusable skills necessarily involves lossy compression, which potentially leads to over\-generalization\. Yet there is no established mechanism to guarantee the generated skills faithfully preserve the essential aspects of their source trajectories, such as edge\-case handling and prerequisite checks\.

In light of this, we proposeMulti\-agentINduction andDeduction forSkills \(MIND\-Skill\), a novel framework that synthesizes generalizable skills with quality guarantees from agents’ successful trajectories\. Unlike existing trajectory\-distillation methods that synthesize skills solely from traces, MIND\-Skill features aninduction agentthat derives skills from input trajectories, and adeduction agentthat reconstructs the input trajectories by actively following the generated skills\. The faithfulness of the generated skills is therefore enforced by optimizing areconstruction lossthat measures the discrepancy between the input trajectories and the reconstructed ones\. In addition, we introduce anoutcome lossthat enforces the correctness of the reconstructed trajectories, and arubric lossthat assesses documentation quality and regularizes the abstraction level of the generated skills according to predefined criteria\. These textual losses are jointly optimized with TextGrad\(Yuksekgonulet al\.,[2025](https://arxiv.org/html/2605.08670#bib.bib15)\)to produce high\-quality skills\. Specifically, we make the following contributions:

- •We propose MIND\-Skill, a multi\-agent induction and deduction framework that automatically synthesizes generalizable skills from successful trajectories\. To ensure that the generated skills carry all critical procedural knowledge, we keep the deduction agent frozen so that it receives no guidance beyond the induced skill when reconstructing trajectories\.
- •To guarantee the quality of induced skills, we propose three textual losses and jointly optimize them with TextGrad: a reconstruction loss that measures the discrepancy between the input and reconstructed trajectories, an outcome loss that enforces the execution correctness, and a rubric loss that assesses documentation quality and regularizes the abstraction level of the generated skills\.
- •We evaluate MIND\-Skill on AppWorld\(Trivediet al\.,[2024](https://arxiv.org/html/2605.08670#bib.bib1)\)and BFCL\-v3\(Patilet al\.,[2025](https://arxiv.org/html/2605.08670#bib.bib29)\), and show that the induced skills improve agent performance on both source tasks and held\-out tasks unseen during generation\.

## 2Related Work

### 2\.1Agent Skills

Agent skills encapsulate reusable procedural knowledge into structured documents that can be shared across agents, models, and even human practitioners\(Anthropic,[2025b](https://arxiv.org/html/2605.08670#bib.bib6)\)\. Recent surveys systematize the skill lifecycle and distinguish skills from generic tool use by their procedural, reusable nature\(Jianget al\.,[2026](https://arxiv.org/html/2605.08670#bib.bib35); Xu and Yan,[2026](https://arxiv.org/html/2605.08670#bib.bib36)\)\.Liet al\.\([2026a](https://arxiv.org/html/2605.08670#bib.bib31)\)demonstrate that single agents augmented with in\-depth skills can match the performance of multi\-agent frameworks\. That said, the mere presence of skills does not guarantee improved performance\. SkillsBench\(Liet al\.,[2026b](https://arxiv.org/html/2605.08670#bib.bib14)\)reveals that zero\-shot\-generated skills provide no benefit on average, whereas agents equipped with curated, human\-authored skills consistently outperform the no\-skill baseline\. SWE\-Skills\-Bench\(Hanet al\.,[2026](https://arxiv.org/html/2605.08670#bib.bib37)\)further demonstrates that low\-quality skills can significantly degrade agent performance rather than improve it\. Our work directly tackles this gap by coupling skill induction with deduction\-based verification, providing closed\-loop quality guarantees for generated skills\.

### 2\.2Skill Generation

#### Zero\-shot generation\.

Zero\-shot methods produce skills directly from task descriptions or user prompts by eliciting the parametric knowledge of LLMs\(Anthropic,[2025c](https://arxiv.org/html/2605.08670#bib.bib32)\), without leveraging any execution experience\. While lightweight, these methods are fundamentally limited by the absence of execution experience and therefore cannot capture domain\-specific procedural knowledge that only emerges through step\-by\-step interaction with the environment\(Liet al\.,[2026b](https://arxiv.org/html/2605.08670#bib.bib14)\)\.

#### Trajectory distillation\.

Trajectory\-distillation methods abstract execution traces into reusable agent skills\. WebXSkill\(Wanget al\.,[2026b](https://arxiv.org/html/2605.08670#bib.bib40)\)extracts reusable action subsequences from synthetic agent trajectories and abstracts them into parameterized skills that pair executable action programs with step\-level natural language guidance\. Trace2Skill\(Niet al\.,[2026](https://arxiv.org/html/2605.08670#bib.bib12)\)dispatches parallel sub\-agents to extract trajectory lessons and then hierarchically consolidates them into a skill directory\. SkillX\(Wanget al\.,[2026a](https://arxiv.org/html/2605.08670#bib.bib33)\)extracts a three\-level skill hierarchy from rollout trajectories and refines it via merging and filtering\. D2Skill\(Tuet al\.,[2026](https://arxiv.org/html/2605.08670#bib.bib19)\)reflects on execution trajectories to generate skills at both task and step granularities\. While these methods differ in abstraction strategies, they share two common limitations: the faithfulness of the abstraction process is never explicitly verified, and the documentation quality of the generated skills is largely uncontrolled\. MIND\-Skill addresses both gaps by requiring a frozen deduction agent to reconstruct the source trajectories from the generated skill alone, which provides an explicit faithfulness signal, and by introducing a rubric loss that enforces documentation standards and regularizes the abstraction level\.

#### Lifelong evolving methods\.

Lifelong methods continuously generate and refine skills from accumulated experience\. SAGE\(Wanget al\.,[2025b](https://arxiv.org/html/2605.08670#bib.bib9)\)and SkillRL\(Xiaet al\.,[2026](https://arxiv.org/html/2605.08670#bib.bib13)\)apply reinforcement learning to improve skills from environment feedback, but produce skills tightly coupled to a specific policy\. EvoSkill\(Alzubiet al\.,[2026](https://arxiv.org/html/2605.08670#bib.bib11)\)proposes new skills from execution failures and retains them via Pareto\-frontier selection\. CoEvoSkills\(Zhanget al\.,[2026a](https://arxiv.org/html/2605.08670#bib.bib39)\)co\-evolves a skill generator with a surrogate verifier that provides feedback without ground\-truth tests\. ACE\(Zhanget al\.,[2026b](https://arxiv.org/html/2605.08670#bib.bib8)\)accumulates strategies into an evolving playbook through generation\-reflection\-curation loops\. Although these methods leverage environment feedback, the resulting signal is confounded by the agent’s own reasoning ability: a capable agent may succeed despite a poor skill, while a weaker agent may fail despite adequate guidance\. MIND\-Skill disentangles these factors through controlled reconstruction, isolating skill quality as the sole objective and enabling principled optimization via TextGrad\.

![Refer to caption](https://arxiv.org/html/2605.08670v1/x1.png)Figure 1:Overview of MIND\-Skill\.Theinduction agent𝒜I\\mathcal\{A\}\_\{I\}\(with optimizable prompt𝒫I\\mathcal\{P\}\_\{I\}\) abstracts a successful trajectoryτ\\tauinto a structured skill document\. Thededuction agent𝒜D\\mathcal\{A\}\_\{D\}\(with frozen prompt𝒫D\\mathcal\{P\}\_\{D\}\) then attempts to reconstruct the trajectory by following only the induced skill and the task specification in a live environment\. Three textual losses assess the quality of the generated skill: thereconstruction lossmeasures procedural alignment betweenτ\\tauandτ^\\hat\{\\tau\}, theoutcome lossevaluates the outcome correctness ofτ^\\hat\{\\tau\}against the environment, and therubric lossassesses the documentation quality and regularizes the abstraction level of the skill itself\. The text\-basedoptimizeraggregates their textual feedback to update the induction prompt𝒫I\\mathcal\{P\}\_\{I\}via TextGrad\. Task specificationttis omitted from the figure for visual clarity\.

## 3MIND\-Skill

Successful trajectories contain valuable procedural knowledge, yet mining high\-quality, generalizable agent skills from them is inherently challenging since they often entangle transferable strategies with instance\-level details\. MIND\-Skill addresses this issue with a novel multi\-agent induction and deduction framework\. Specifically, theinduction agent𝒜I\\mathcal\{A\}\_\{I\}, with an optimizable prompt𝒫I\\mathcal\{P\}\_\{I\}, is tasked with deriving a skillssfrom an input \(successful\) trajectoryτ\\tauand the task specificationtt, while thededuction agent𝒜D\\mathcal\{A\}\_\{D\}attempts to reconstructτ\\tausolely according tottandss\. To ensuresspreserves all critical procedural knowledge, we keep the deduction agent’s prompt𝒫D\\mathcal\{P\}\_\{D\}frozen so that it receives no guidance beyond the induced skill during reconstruction and optimization\.

For each input pair\(t,τ\)\(t,\\tau\), we optimize the induction prompt𝒫I\\mathcal\{P\}\_\{I\}with respect to three textual loss functions: areconstruction lossℒrecon\\mathcal\{L\}\_\{\\text\{recon\}\}that measures procedural alignment between the original and reconstructed trajectories, anoutcome lossℒoutcome\\mathcal\{L\}\_\{\\text\{outcome\}\}that enforces the correctness of the reconstructed trajectory, and arubric lossℒrubric\\mathcal\{L\}\_\{\\text\{rubric\}\}that assesses documentation quality and regularizes the abstraction level of the skill\. For each input taskttand trajectoryτ\\tau, we performlexicographic minimizationwhereℒoutcome\\mathcal\{L\}\_\{\\text\{outcome\}\}is the primary objective, withℒrecon\\mathcal\{L\}\_\{\\text\{recon\}\}andℒrubric\\mathcal\{L\}\_\{\\text\{rubric\}\}as successive tiebreakers\. Formally,

𝒫I∗=arg⁡min𝒫I\(\\displaystyle\\mathcal\{P\}\_\{I\}^\{\*\}=\\mathop\{\\arg\\min\}\_\{\\mathcal\{P\}\_\{I\}\}\\;\\bigl\(ℒoutcome\(τ^,t\),ℒrecon\(τ,τ^,t\),ℒrubric\(s,t\)\),\\displaystyle\\mathcal\{L\}\_\{\\text\{outcome\}\}\(\\hat\{\\tau\},t\),\\,\\mathcal\{L\}\_\{\\text\{recon\}\}\(\\tau,\\hat\{\\tau\},t\),\\,\\mathcal\{L\}\_\{\\text\{rubric\}\}\(s,t\)\\bigr\),\(1\)s\.t\.s\\displaystyle\\text\{s\.t\.\}\\quad s=𝒜I​\(t,τ;𝒫I\),τ^=𝒜D​\(t,s;𝒫D\),\\displaystyle=\\mathcal\{A\}\_\{I\}\(t,\\tau;\\mathcal\{P\}\_\{I\}\),\\quad\\hat\{\\tau\}=\\mathcal\{A\}\_\{D\}\(t,s;\\mathcal\{P\}\_\{D\}\),and the final skill is given bys∗=𝒜I​\(t,τ;𝒫I∗\)s^\{\*\}=\\mathcal\{A\}\_\{I\}\(t,\\tau;\\mathcal\{P\}\_\{I\}^\{\*\}\)\.

An overview of MIND\-Skill is illustrated in Figure[1](https://arxiv.org/html/2605.08670#S2.F1)\. In the following, we describe the induction agent \(§[3\.1](https://arxiv.org/html/2605.08670#S3.SS1)\), the deduction agent \(§[3\.2](https://arxiv.org/html/2605.08670#S3.SS2)\), the loss functions \(§[3\.3](https://arxiv.org/html/2605.08670#S3.SS3)\), and the optimization procedure \(§[3\.4](https://arxiv.org/html/2605.08670#S3.SS4)\)\.

### 3\.1Induction Agent

The induction agent𝒜I\\mathcal\{A\}\_\{I\}abstracts a successful trajectory into a reusable agent skill\. The core challenge is controlling the level of abstraction\. An over\-specific skill that retains instance\-level details \(e\.g\., concrete API field paths or entity identifiers\) may ease reconstruction of the source task but fails to generalize across task variations\. Conversely, an over\-abstract skill that merely states high\-level intent provides no procedural guidance beyond the task specification itself\. The induction agent must therefore identify and preserve only thenon\-obvious procedural structurethat occupies the middle ground between these two failure modes\.

Formally, the induction agent𝒜I\\mathcal\{A\}\_\{I\}is parameterized by a system prompt𝒫I\\mathcal\{P\}\_\{I\}, which is the sole variable optimized during refinement \(§[3\.4](https://arxiv.org/html/2605.08670#S3.SS4)\)\. Given a task specificationttand a successful ReAct\(Yaoet al\.,[2023](https://arxiv.org/html/2605.08670#bib.bib2)\)trajectoryτ=\{\(thoughtm,codem,observationm\)\}m=1\|τ\|\\tau=\\\{\(\\text\{thought\}\_\{m\},\\text\{code\}\_\{m\},\\text\{observation\}\_\{m\}\)\\\}\_\{m=1\}^\{\|\\tau\|\}, it produces a structured skill documents=𝒜I​\(t,τ;𝒫I\)s=\\mathcal\{A\}\_\{I\}\(t,\\tau;\\mathcal\{P\}\_\{I\}\)\. To enforce the desired abstraction level,𝒫I\\mathcal\{P\}\_\{I\}encodes a taxonomy that partitions candidate claims into three categories: \(1\)procedural conventionsthat generalize across tasks but are non\-trivial to infer without execution experience \(e\.g\., paginate until the response is empty\), \(2\)instruction\-inferableknowledge derivable fromttalone \(e\.g\., an aggregation task implies a counting or grouping operation\), and \(3\)ground\-truth leakagethat is only knowable fromτ\\tau\(e\.g\., concrete response schemas, library choices, or hard\-coded thresholds\)\. The prompt𝒫I\\mathcal\{P\}\_\{I\}directs𝒜I\\mathcal\{A\}\_\{I\}to retain only non\-obvious patterns from category \(1\), while explicitly suppressing \(2\) and \(3\)\. This taxonomy serves as the primary inductive bias that TextGrad refines across optimization iterations\.

### 3\.2Deduction Agent

The deduction agent𝒜D\\mathcal\{A\}\_\{D\}reconstructs the trajectory from the induced skill alone\. Its prompt𝒫D\\mathcal\{P\}\_\{D\}is frozen throughout optimization and receives no access to the source trajectoryτ\\tau, ensuring that any improvement in reconstruction quality is solely attributable to the skillss\. Concretely, given the skillssand the task specificationtt, the deduction agent executes a multi\-step ReAct loop in a live environment to produce a reconstructed trajectoryτ^=𝒜D​\(t,s;𝒫D\)\\hat\{\\tau\}=\\mathcal\{A\}\_\{D\}\(t,s;\\mathcal\{P\}\_\{D\}\)\. At each step, the agent reasons about the next action, executes code, and observes the environment response\. The skill is injected into the agent’s prompt as a procedural playbook, serving as the only source of strategic guidance\.

### 3\.3Textual Loss Functions

Existing methods typically refine skills by diagnosing errors from failed trajectories against reference solutions and incorporating the lessons back into skills\. However, a capable agent may compensate for skill deficiencies through its own reasoning, masking gaps that should be fixed, while a weak agent may fail despite adequate guidance, producing misleading negative signals\. In either case, task performance becomes an unreliable proxy for skill quality\. Our reconstruction\-based design provides a controlled alternative: rather than diagnosing failures post\-hoc, we directly test whether the skill alone can reproduce the procedural structure of the reference trajectory\. Because the deduction agent is frozen and receives no strategic guidance beyond the induced skill, divergences betweenτ^\\hat\{\\tau\}andτ\\taucan be directly attributed to deficiencies inss, yielding a clean signal for optimizing the induction agent\. We formalize this through three complementary losses:

#### Reconstruction loss\.

The reconstruction loss evaluates whether the induced skillsspreserves the essential problem\-solving strategy of the source trajectoryτ\\tau\. An LLM judge𝒜J\\mathcal\{A\}\_\{J\}takes the reconstructed trajectoryτ^\\hat\{\\tau\}, the source trajectoryτ\\tau, and the task specificationttas inputs, then produces a scalar loss value along with textual feedback:

ℒrecon​\(τ,τ^,t\)=\(ℓrecon,frecon\)=𝒜J​\(τ,τ^,t;𝒫recon\),\\mathcal\{L\}\_\{\\text\{recon\}\}\(\\tau,\\hat\{\\tau\},t\)=\\bigl\(\\,\\ell\_\{\\text\{recon\}\},\\;f\_\{\\text\{recon\}\}\\,\\bigr\)=\\mathcal\{A\}\_\{J\}\(\\tau,\\hat\{\\tau\},t;\\mathcal\{P\}\_\{\\text\{recon\}\}\),\(2\)whereℓrecon∈\[0,10\]\\ell\_\{\\text\{recon\}\}\\in\[0,10\]measures trajectory discrepancy,freconf\_\{\\text\{recon\}\}is a natural\-language critique identifying specific mismatches, and𝒫recon\\mathcal\{P\}\_\{\\text\{recon\}\}is the system prompt instructing𝒜J\\mathcal\{A\}\_\{J\}to compare these two trajectories\. Crucially, the judge evaluates tactic\-level equivalence rather than step\-level similarity: two trajectories that use different API endpoints, loop constructs, or intermediate variables are considered aligned as long as they implement the same procedural logic \(e\.g\., the same retrieval\-then\-aggregation pattern, the same pagination strategy, or the same prerequisite checking order\)\.

#### Outcome loss\.

The outcome loss provides the only ground\-truth signal in our framework by executing the reconstructed trajectoryτ^\\hat\{\\tau\}in a live environment:

ℒoutcome​\(τ^,t\)=\(ℓoutcome,foutcome\)=EnvExec​\(τ^,t\),\\mathcal\{L\}\_\{\\text\{outcome\}\}\(\\hat\{\\tau\},t\)=\\bigl\(\\,\\ell\_\{\\text\{outcome\}\},\\;f\_\{\\text\{outcome\}\}\\,\\bigr\)=\\text\{EnvExec\}\(\\hat\{\\tau\},t\),\(3\)whereℓoutcome∈\[0,1\]\\ell\_\{\\text\{outcome\}\}\\in\[0,1\]measures the degree of task failure andfoutcomef\_\{\\text\{outcome\}\}captures environment feedback such as error messages and execution traces\. Unlike the reconstruction loss, which relies on LLM judgment to assess faithfulness of the skill induction process, this signal is grounded in actual task execution and provides a complementary anchor from the perspective of outcome correctness\.

#### Rubric loss\.

The rubric loss evaluates the skill documentssalong two axes\. The first isdocumentation quality: whether the skill adheres to established standards of technical writing, such as logical flow, troubleshooting guidance, and completeness, ensuring that it serves as a reusable, portable artifact\. The second islevel of abstraction: the reconstruction and outcome losses optimize for faithful and correct reproduction of the source trajectory, but they cannot distinguish a genuinely transferable skill from one that simply memorizes implementation details\. The rubric loss addresses this by detecting statements in the skill that are tied to the specific implementation of the source trajectory rather than to transferable procedural patterns\. Formally,

ℒrubric​\(s,t\)=\(ℓrubric,frubric\)=𝒜J​\(s,t;𝒫rubric\),\\mathcal\{L\}\_\{\\text\{rubric\}\}\(s,t\)=\(\\ell\_\{\\text\{rubric\}\},\\;f\_\{\\text\{rubric\}\}\)=\\mathcal\{A\}\_\{J\}\(s,t;\\mathcal\{P\}\_\{\\text\{rubric\}\}\),\(4\)whereℓrubric∈\[0,10\]\\ell\_\{\\text\{rubric\}\}\\in\[0,10\]denotes rubric violation degree across five dimensions: whether the skill avoids implementation details tied to the source trajectory \(ground\-truth independence\), whether it provides sufficient procedural guidance to act on \(actionability\), whether it applies to structurally similar tasks beyond the source \(transferability\), whether all key procedural stages are covered \(completeness\), and whether it is free of redundant boilerplate \(conciseness\)\.frubricf\_\{\\text\{rubric\}\}provides textual feedback identifying specific issues along these dimensions\. The rubric loss serves as a regularizer on abstraction level: without it, the induction agent can inflate reconstruction and execution performance by injecting instance\-specific details into the skill, making the skill fail to generalize to novel tasks\.

Algorithm 1Multi\-agent Induction and Deduction for Skills \(MIND\-Skill\)0:Task specification

tt, successful trajectory

τ\\tau, initial induction prompt

𝒫I\(0\)\\mathcal\{P\}\_\{I\}^\{\(0\)\}, frozen deduction prompt

𝒫D\\mathcal\{P\}\_\{D\}, maximum number of iterations

QQ
0:Optimized skill

s∗s^\{\*\}
1:

s∗←nil,ℓrecon∗←∞,ℓoutcome∗←∞,ℓrubric∗←∞s^\{\*\}\\leftarrow\\texttt\{nil\},\\quad\\ell\_\{\\text\{recon\}\}^\{\*\}\\leftarrow\\infty,\\quad\\ell\_\{\\text\{outcome\}\}^\{\*\}\\leftarrow\\infty,\\quad\\ell\_\{\\text\{rubric\}\}^\{\*\}\\leftarrow\\infty
2:for

q=0,1,…,Q−1q=0,1,\\ldots,Q\-1do

3:\# Induction: distill trajectory into skill

4:

s←𝒜I​\(t,τ;𝒫I\(q\)\)s\\leftarrow\\mathcal\{A\}\_\{I\}\(t,\\tau;\\;\\mathcal\{P\}\_\{I\}^\{\(q\)\}\)
5:\# Deduction: reconstruct trajectory from skill and task specification

6:

τ^←𝒜D​\(t,s;𝒫D\)\\hat\{\\tau\}\\leftarrow\\mathcal\{A\}\_\{D\}\(t,s;\\;\\mathcal\{P\}\_\{D\}\)
7:\# Compute textual losses \(each returns a loss value and textual feedback\)

8:

\(ℓrecon,frecon\)←ℒrecon​\(τ,τ^,t\)\(\\ell\_\{\\text\{recon\}\},\\;f\_\{\\text\{recon\}\}\)\\leftarrow\\mathcal\{L\}\_\{\\text\{recon\}\}\(\\tau,\\;\\hat\{\\tau\},\\;t\)
9:

\(ℓoutcome,foutcome\)←ℒoutcome​\(τ^,t\)\(\\ell\_\{\\text\{outcome\}\},\\;f\_\{\\text\{outcome\}\}\)\\leftarrow\\mathcal\{L\}\_\{\\text\{outcome\}\}\(\\hat\{\\tau\},\\;t\)
10:

\(ℓrubric,frubric\)←ℒrubric​\(s,t\)\(\\ell\_\{\\text\{rubric\}\},\\;f\_\{\\text\{rubric\}\}\)\\leftarrow\\mathcal\{L\}\_\{\\text\{rubric\}\}\(s,t\)
11:\# Track the best skill with lexicographic comparison

12:if

\(ℓoutcome,ℓrecon,ℓrubric\)<lex\(ℓoutcome∗,ℓrecon∗,ℓrubric∗\)\(\\ell\_\{\\text\{outcome\}\},\\ell\_\{\\text\{recon\}\},\\ell\_\{\\text\{rubric\}\}\)<\_\{\\textbf\{lex\}\}\(\\ell\_\{\\text\{outcome\}\}^\{\*\},\\ell\_\{\\text\{recon\}\}^\{\*\},\\ell\_\{\\text\{rubric\}\}^\{\*\}\)then

13:

s∗←s,ℓrecon∗←ℓrecon,ℓoutcome∗←ℓoutcome,ℓrubric∗←ℓrubrics^\{\*\}\\leftarrow s,\\quad\\ell\_\{\\text\{recon\}\}^\{\*\}\\leftarrow\\ell\_\{\\text\{recon\}\},\\quad\\ell\_\{\\text\{outcome\}\}^\{\*\}\\leftarrow\\ell\_\{\\text\{outcome\}\},\\quad\\ell\_\{\\text\{rubric\}\}^\{\*\}\\leftarrow\\ell\_\{\\text\{rubric\}\}
14:\# TextGrad: compute textual gradient and update induction prompt

15:

g←GradientLLM​\(𝒫I\(q\),t,s,τ^,frecon,foutcome,frubric\)g\\leftarrow\\text\{GradientLLM\}\\\!\\left\(\\mathcal\{P\}\_\{I\}^\{\(q\)\},t,\\;s,\\;\\hat\{\\tau\},\\;f\_\{\\text\{recon\}\},\\;f\_\{\\text\{outcome\}\},\\;f\_\{\\text\{rubric\}\}\\right\)
16:

𝒫I\(q\+1\)←OptimizerLLM​\(𝒫I\(q\),g\)\\mathcal\{P\}\_\{I\}^\{\(q\+1\)\}\\leftarrow\\text\{OptimizerLLM\}\\\!\\left\(\\mathcal\{P\}\_\{I\}^\{\(q\)\},\\;g\\right\)
17:return

s∗s^\{\*\}

### 3\.4Closed\-Loop Optimization

We optimize the induction prompt𝒫I\\mathcal\{P\}\_\{I\}to improve the induced skillssthrough iterative textual gradient descent following TextGrad\(Yuksekgonulet al\.,[2025](https://arxiv.org/html/2605.08670#bib.bib15)\)\. A key design choice is that the gradient LLM observes the reconstructed trajectoryτ^\\hat\{\\tau\}but not the source trajectoryτ\\tau: information aboutτ\\taureaches the gradient only indirectly through the reconstruction feedbackfreconf\_\{\\text\{recon\}\}\. This prevents the optimizer from proposing superficial fixes that copy implementation details fromτ\\tauinto the prompt, and together with the rubric loss forms a dual safeguard against ground\-truth leakage in the optimization process\. Concretely, a gradient LLM consumes the current prompt𝒫I\(q\)\\mathcal\{P\}\_\{I\}^\{\(q\)\}, the task specificationtt, the induced skillss, the reconstructed trajectoryτ^\\hat\{\\tau\}, and the textual feedback from all three losses, and synthesizes a natural\-language gradientggthat diagnoses failure patterns and proposes revisions\. Then an optimizer LLM appliesggto produce an updated prompt𝒫I\(q\+1\)\\mathcal\{P\}\_\{I\}^\{\(q\+1\)\}for the induction agent\.

The full procedure is summarized in Algorithm[1](https://arxiv.org/html/2605.08670#alg1)\. For each input pair\(t,τ\)\(t,\\tau\), we iterate for up toQQiterations: the current prompt𝒫I\(q\)\\mathcal\{P\}\_\{I\}^\{\(q\)\}instructs the induction agent to derive a skillssfromτ\\tau\(line 4\), the frozen deduction agent reconstructs the trajectory in a live environment \(line 6\), and the three losses evaluate the skill and trajectories \(lines 8–10\)\. We track the best skills∗s^\{\*\}across iterations by lexicographic comparison \(lines 12–13\) to ensure the anytime property\(Zilberstein,[1996](https://arxiv.org/html/2605.08670#bib.bib41)\)\. Finally, the textual feedback drives prompt update for the induction agent via TextGrad \(lines 15–16\)\.

## 4Experiments

#### Benchmarks\.

We evaluate on two complex, long\-horizon benchmarks\.AppWorld\(Trivediet al\.,[2024](https://arxiv.org/html/2605.08670#bib.bib1)\)is an interactive coding agent benchmark comprising 9 daily\-life apps and 457 APIs\. Tasks are officially partitioned into train, test\-normal, and test\-challenge splits; we extract skills from the 90 training tasks and evaluate on both test splits \(168 normal, 417 challenge\)\. We report Task Goal Completion \(TGC\), the fraction of tasks where all unit tests pass, and Scenario Goal Completion \(SGC\), which requires all task variations within a scenario to pass\.BFCL\-v3\(Patilet al\.,[2025](https://arxiv.org/html/2605.08670#bib.bib29)\)is a multi\-turn function\-calling benchmark\. We use the base multi\-turn category \(200 instances\), randomly split into 50 training and 150 test instances\.

#### Baselines\.

We consider the following baselines for comparison:\(i\) ReAct\(Yaoet al\.,[2023](https://arxiv.org/html/2605.08670#bib.bib2)\)uses a task prompt with a single demonstration example\. For AppWorld, we follow the official ReAct implementation; for BFCL, we use the benchmark’s native function\-calling mode;\(ii\) In\-Context Learning \(ICL\)\(Agarwalet al\.,[2024](https://arxiv.org/html/2605.08670#bib.bib3)\)provides the model with diverse task demonstrations in the input prompt, allowing it to infer task format and desired output;\(iii\) Skill\-extractuses the same induction agent as MIND\-Skill to extract a skill from the source trajectory in a single pass without any iterative optimization\. This serves as an ablation that isolates the contribution of our closed\-loop optimization;\(iv\) ACE\(Zhanget al\.,[2026b](https://arxiv.org/html/2605.08670#bib.bib8)\)is a recent lifelong evolving method that accumulates strategies into a monolithic playbook through generation\-reflection\-curation loops\. We use the official codebase in the offline adaptation mode with ground\-truth solutions available during training; implementation details are provided in Appendix[B\.1](https://arxiv.org/html/2605.08670#A2.SS1);\(v\) Trace2Skill\(Niet al\.,[2026](https://arxiv.org/html/2605.08670#bib.bib12)\)is a concurrent method that converts execution traces into structured skills through parallel analysis and hierarchical merge\. We use its official codebase; implementation details are provided in Appendix[B\.2](https://arxiv.org/html/2605.08670#A2.SS2)\.

#### MIND\-Skill implementation\.

For each training task, we roll out a successful trajectory with a frontier model, which serves as input to for MIND\-Skill\. We use the same base model for induction agent, gradient LLM, and optimizer LLM\. The maximum number of iterations is set toQ=8Q=8\. For each test task, we prompt the LLM to retrieveK=3K=3skills from the generated skills, and inject them into the LLM’s context before executing ReAct loop\. Further details are provided in Appendix[B\.3](https://arxiv.org/html/2605.08670#A2.SS3)\.

Table 1:Main results on AppWorld and BFCL\-v3\. All methods use Qwen3\.5\-122B\-A10B for inference\.Boldindicates the best andunderlineindicates the second\-best result per group\.
### 4\.1Main Results

Table[1](https://arxiv.org/html/2605.08670#S4.T1)summarizes the main results\. We highlight the following key findings:

#### MIND\-Skill leads consistently across diverse settings\.

When Qwen3\.5\-122B\-A10B is used to generate skills, MIND\-Skill achieves the highest TGC on both AppWorld splits and the highest BFCL\-v3 accuracy, yielding the best average score \(59\.1\), surpassing ACE \(56\.1\) and Trace2Skill \(55\.1\) by clear margins\. Notably, on AppWorld\-Challenge SGC, MIND\-Skill significantly outperforms SOTA baselines \(39\.6vs\. 34\.5 for ACE and 33\.1 for Trace2Skill\)\. We also note that no baseline performs consistently across both AppWorld splits: Trace2Skill scores higher than ACE on AppWorld\-Normal TGC \(67\.3 vs\. 65\.5\) but lower on AppWorld\-Challenge TGC \(46\.8 vs\. 51\.1\), suggesting that their generated skills may overfit to simpler task patterns\. MIND\-Skill is the only method that leads on both splits simultaneously, and its large SGC advantage on AppWorld\-Challenge indicates that the generated skills capture scenario\-level procedural patterns rather than task\-specific shortcuts\.

#### Closed\-loop optimization outperforms one\-shot induction\.

Skill\-extract uses the same induction agent as MIND\-Skill to induce skills from trajectories in a single pass, isolating the contribution of our closed\-loop optimization procedure \(cf\. §[3\.2](https://arxiv.org/html/2605.08670#S3.SS2)–§[3\.4](https://arxiv.org/html/2605.08670#S3.SS4)\)\. The gap is substantial: MIND\-Skill outperforms Skill\-extract by8\.1and7\.2on average when using Qwen3\.5\-122B\-A10B and GPT\-5\.4 as the base model for the induction agent, respectively\. This confirms that one\-shot skill extraction, even when the underlying induction agent is capable, cannot ensure the generated skills are faithful, generalizable, and well\-structured without iterative optimization driven by our three textual losses\.

#### Weak models match frontier ones with MIND\-Skill\.

When skills are generated by GPT\-5\.4, MIND\-Skill again achieves the highest average \(58\.9\), outperforming Trace2Skill \(56\.6\) and ACE \(56\.3\)\. On AppWorld\-Challenge, Trace2Skill leads in terms of TGC, while MIND\-Skill achieves the highest SGC \(37\.4\) and leads on Normal TGC \(70\.8\) as well as BFCL\-v3 \(78\.7\), showing our superiority across benchmarks\. An interesting observation is that MIND\-Skill with the weaker Qwen3\.5\-122B\-A10B as the base model for the induction agent achieves performance \(59\.1on average\) comparable to MIND\-Skill with GPT\-5\.4 \(58\.9on average\)\. This suggests that our induction\-deduction framework can largely compensate for the capability gap between different base models, making high\-quality skill generation accessible without relying on frontier models\. We present a case study comparing the skills generated by Qwen3\.5\-122B\-A10B and GPT\-5\.4 in Appendix[C\.2](https://arxiv.org/html/2605.08670#A3.SS2)\.

### 4\.2Ablation Study and Further Analysis

![Refer to caption](https://arxiv.org/html/2605.08670v1/x2.png)Figure 2:Performance at each iteration and the effect of varying the number of retrieved skills on AppWorld\. Columns 1–2: TGC and SGC over iterations on Normal \(top\) and Challenge \(bottom\)\. Column 3: TGC and SGC acrossKK, with both splits per panel\. Bars \(left axis\) report the aggregate; lines \(right axis\) report per\-difficulty accuracy\.![Refer to caption](https://arxiv.org/html/2605.08670v1/x3.png)Figure 3:Loss values at each iteration on AppWorld\. Shaded areas show ±1 SEM\.#### Skill quality improves steadily across optimization iterations\.

Figure[2](https://arxiv.org/html/2605.08670#S4.F2)\(columns 1–2\) tracks test performance across optimization iterations\. At each iteration, each task’s skill library entry is updated to the best skill found so far via the lexicographic selection described in Algorithm[1](https://arxiv.org/html/2605.08670#alg1)\(lines 12–13\)\. Starting from iteration 0, which is equivalent to Skill\-extract, TGC improves by7\.7on Normal and by6\.7on Challenge over 8 iterations, with the majority of gains concentrated in the first 3 rounds\. Per\-difficulty breakdowns show that easy tasks saturate early while hard tasks continue to benefit from later iterations, suggesting that early rounds fix coarse procedural gaps whereas later rounds resolve subtler edge cases\. Figure[3](https://arxiv.org/html/2605.08670#S4.F3)confirms this dynamic: all three losses decrease steadily with small variance, and the outcome loss drops to near zero within 3 iterations while the reconstruction and rubric losses continue to decrease in later iterations\.

#### The effect of varying the number of retrieved skillsKK\.

Figure[2](https://arxiv.org/html/2605.08670#S4.F2)\(column 3\) shows the effect of varyingKKinjected at inference time\.K=1K\{=\}1underperforms across all metrics, as a single skill may not cover the full procedural scope of a test task\. Performance improves substantially fromK=1K\{=\}1toK=3K\{=\}3, as retrieving multiple complementary skills broadens procedural coverage and reduces the agent’s sensitivity to any single poor match\. Per\-difficulty breakdowns confirm this: easy tasks are near\-ceiling fromK≥2K\{\\geq\}2, while medium and hard tasks benefit most from the additional coverage atK=3K\{=\}3\.K=5K\{=\}5pushes Normal SGC further to 62\.5, indicating that more skills can still help with scenario\-level consistency\. Balancing overall performance, we useK=3K\{=\}3for all main experiments\.

Table 2:Ablation study on AppWorld\.
#### Each loss component contributes to skill quality\.

Table[2](https://arxiv.org/html/2605.08670#S4.T2)ablates each loss function on AppWorld\. All ablated variants outperform Skill\-extract, confirming that each loss is indispensable for high\-quality skill generation\. Removing the reconstruction loss causes the largest Challenge TGC drop \(51\.8→\\to45\.8\), nearly erasing all gains over Skill\-extract \(45\.1\)\. Without comparing reconstructed and source trajectories, the optimizer lacks the fine\-grained procedural feedback needed to identify missing key steps and flawed workflows\. Removing the rubric loss causes the largest Normal TGC drop \(71\.4→\\to64\.3\)\. Without abstraction\-level regularization, the optimizer tends to leak instance\-specific details into skills, which may coincidentally help on certain challenge tasks but hurt generalization across the broader task population\. Removing the outcome loss has the mildest effect\. Notably, even without any ground\-truth execution feedback, the w/o outcome variant \(68\.5Normal TGC,48\.0Challenge TGC\) already outperforms Trace2Skill on both splits\. This highlights that the reconstruction and rubric losses alone provide sufficiently rich signal to surpass concurrent trajectory\-distillation methods\. Nonetheless, outcome loss catches runtime errors and silent API failures that textual judgment alone misses, helping the full MIND\-Skill improve Challenge TGC to51\.8and SGC to39\.6\.

![Refer to caption](https://arxiv.org/html/2605.08670v1/x4.png)Figure 4:Total number of injected tokens\.
#### MIND\-Skill generates compact skills\.

Figure[4](https://arxiv.org/html/2605.08670#S4.F4)compares the total number of injected tokens at inference time\. Although MIND\-Skill retrievesK=3K\{=\}3skills per test task, the number of injected tokens remains3−6×3\{\-\}6\{\\times\}smaller than ACE’s monolithic playbook and Trace2Skill’s single skill directory\. Our rubric loss explicitly penalizes redundant boilerplate, encouraging the optimizer to retain only essential procedural content\. In contrast, ACE and Trace2Skill pack all training\-time knowledge into a single monolithic artifact regardless of task relevance\. MIND\-Skill instead follows a progressive\-disclosure principle, where each retrieved skill covers only the procedural knowledge relevant to its matched task category\. This produces compact yet actionable skills without sacrificing effectiveness\.

## 5Conclusion

In this work, we presentedMIND\-Skill, a multi\-agent induction and deduction framework for automatically synthesizing high\-quality agent skills from successful execution trajectories\. MIND\-Skill departs from prior skill\-generation approaches by introducing a closed\-loop process that explicitly validates and refines generated skills through trajectory reconstruction, execution feedback, and comprehensive rubric assessment\. Specifically, MIND\-Skill combines an induction agent and a frozen deduction agent with three complementary textual losses: reconstruction loss, outcome loss, and rubric loss\. These losses are jointly optimized with TextGrad to iteratively refine the induction prompt, improving generated skills in terms of faithfulness, task correctness, and documentation quality\. Experiments on AppWorld and BFCL\-v3 show that the resulting skills improve agent performance on both source tasks and held\-out tasks unseen during skill generation, demonstrating the effectiveness and generalizability of the proposed framework\.

## References

- R\. Agarwal, A\. Singh, L\. Zhang, B\. Bohnet, L\. Rosias, S\. Chan, B\. Zhang, A\. Anand, Z\. Abbas, A\. Nova,et al\.\(2024\)Many\-shot in\-context learning\.InNeurIPS,pp\. 76930–76966\.Cited by:[§4](https://arxiv.org/html/2605.08670#S4.SS0.SSS0.Px2.p1.1)\.
- S\. Alzubi, N\. Provenzano, J\. Bingham, W\. Chen, and T\. Vu \(2026\)EvoSkill: Automated skill discovery for multi\-agent systems\.arXiv preprint arXiv:2603\.02766\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.08670#S2.SS2.SSS0.Px3.p1.1)\.
- Anthropic \(2025a\)Claude code\.Note:[https://www\.anthropic\.com/claude\-code](https://www.anthropic.com/claude-code)Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p1.1)\.
- Anthropic \(2025b\)Equipping agents for the real world with Agent Skills\.Note:Anthropic Engineering BlogExternal Links:[Link](https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills)Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.08670#S2.SS1.p1.1)\.
- Anthropic \(2025c\)Skill creator: SKILL\.md\.GitHub\.Note:[https://github\.com/anthropics/skills/blob/main/skills/skill\-creator/SKILL\.md](https://github.com/anthropics/skills/blob/main/skills/skill-creator/SKILL.md)A skill for creating, evaluating, and iteratively improving Claude skills, part of the Anthropic Skills repositoryCited by:[§1](https://arxiv.org/html/2605.08670#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.08670#S2.SS2.SSS0.Px1.p1.1)\.
- Anthropic \(2026\)Claude managed agents\.Note:[https://platform\.claude\.com/docs/en/managed\-agents/overview](https://platform.claude.com/docs/en/managed-agents/overview)Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p1.1)\.
- T\. Han, Y\. Zhang, W\. Song, C\. Fang, Z\. Chen, Y\. Sun, and L\. Hu \(2026\)SWE\-Skills\-Bench: Do agent skills actually help in real\-world software engineering?\.arXiv preprint arXiv:2603\.15401\.Cited by:[§2\.1](https://arxiv.org/html/2605.08670#S2.SS1.p1.1)\.
- T\. Hubert, R\. Mehta, L\. Sartran,et al\.\(2026\)Olympiad\-level formal mathematical reasoning with reinforcement learning\.Nature651,pp\. 607–613\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p1.1)\.
- Y\. Jiang, D\. Li, H\. Deng, B\. Ma, X\. Wang, Q\. Wang, and G\. Yu \(2026\)SoK: Agentic skills – beyond tool use in LLM agents\.arXiv preprint arXiv:2602\.20867\.Cited by:[§2\.1](https://arxiv.org/html/2605.08670#S2.SS1.p1.1)\.
- H\. Li, C\. Mu, J\. Chen, S\. Ren, Z\. Cui, Y\. Zhang, L\. Bai, and S\. Hu \(2026a\)Organizing, orchestrating, and benchmarking agent skills at ecosystem scale\.arXiv preprint arXiv:2603\.02176\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.08670#S2.SS1.p1.1)\.
- X\. Li, W\. Chen, Y\. Liu, S\. Zheng, X\. Chen, Y\. He, Y\. Li, B\. You, H\. Shen, J\. Sun,et al\.\(2026b\)SkillsBench: Benchmarking how well agent skills work across diverse tasks\.arXiv preprint arXiv:2602\.12670\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.08670#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2605.08670#S2.SS2.SSS0.Px1.p1.1)\.
- Z\. Lyu, J\. Huang, Y\. Deng, S\. Hoi, and B\. An \(2025\)Let’s revise step\-by\-step: A unified local search framework for code generation with LLMs\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p1.1)\.
- J\. Ni, Y\. Liu, X\. Liu, Y\. Sun, M\. Zhou, P\. Cheng, D\. Wang, E\. Zhao, X\. Jiang, and G\. Jiang \(2026\)Trace2Skill: Distill trajectory\-local lessons into transferable agent skills\.arXiv preprint arXiv:2603\.25158\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.08670#S2.SS2.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2605.08670#S4.SS0.SSS0.Px2.p1.1)\.
- Nous Research \(2026\)Hermes agent: the agent that grows with you\.Note:[https://github\.com/NousResearch/hermes\-agent](https://github.com/NousResearch/hermes-agent)Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p1.1),[§1](https://arxiv.org/html/2605.08670#S1.p2.1)\.
- A\. Novikov, N\. Vũ, M\. Eisenberger, E\. Dupont, P\. Huang, A\. Z\. Wagner, S\. Shirobokov, B\. Kozlovskii, F\. J\. Ruiz, A\. Mehrabian,et al\.\(2025\)AlphaEvolve: A coding agent for scientific and algorithmic discovery\.arXiv preprint arXiv:2506\.13131\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p1.1)\.
- S\. G\. Patil, H\. Mao, F\. Yan, C\. C\. Ji, V\. Suresh, I\. Stoica, and J\. E\. Gonzalez \(2025\)The Berkeley Function Calling Leaderboard \(BFCL\): From tool use to agentic evaluation of large language models\.InICML,pp\. 48371–48392\.Cited by:[3rd item](https://arxiv.org/html/2605.08670#S1.I1.i3.p1.1),[§1](https://arxiv.org/html/2605.08670#S1.p1.1),[§4](https://arxiv.org/html/2605.08670#S4.SS0.SSS0.Px1.p1.1)\.
- P\. Steinberger and OpenClaw Community \(2026\)OpenClaw: your own personal AI assistant\.Note:[https://github\.com/openclaw/openclaw](https://github.com/openclaw/openclaw)Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p1.1)\.
- P\. Tagkopoulos, F\. Li, and I\. Tagkopoulos \(2025\)SkillFlow: Efficient skill and code transfer through communication in adapting AI agents\.arXiv preprint arXiv:2504\.06188\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p2.1)\.
- H\. Trivedi, T\. Khot, M\. Hartmann, R\. Manku, V\. Dong, E\. Li, S\. Gupta, A\. Sabharwal, and N\. Balasubramanian \(2024\)AppWorld: A controllable world of apps and people for benchmarking interactive coding agents\.InACL,pp\. 16022–16076\.Cited by:[3rd item](https://arxiv.org/html/2605.08670#S1.I1.i3.p1.1),[§1](https://arxiv.org/html/2605.08670#S1.p1.1),[§4](https://arxiv.org/html/2605.08670#S4.SS0.SSS0.Px1.p1.1)\.
- S\. Tu, C\. Xu, Q\. Zhang, Y\. Zhang, X\. Lan, L\. Li, and D\. Zhao \(2026\)Dynamic dual\-granularity skill bank for agentic RL\.arXiv preprint arXiv:2603\.28716\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.08670#S2.SS2.SSS0.Px2.p1.1)\.
- C\. Wang, Z\. Yu, X\. Xie, W\. Yao, R\. Fang, S\. Qiao, K\. Cao, G\. Zheng, X\. Qi, P\. Zhang,et al\.\(2026a\)SkillX: Automatically constructing skill knowledge bases for agents\.arXiv preprint arXiv:2604\.04804\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.08670#S2.SS2.SSS0.Px2.p1.1)\.
- E\. Z\. Wang, F\. Cassano, C\. Wu, Y\. Bai, W\. Song, V\. Nath, Z\. Han, S\. M\. Hendryx, S\. Yue, and H\. Zhang \(2025a\)Planning in natural language improves LLM search for code generation\.InICLR,Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p1.1)\.
- J\. Wang, Q\. Yan, Y\. Wang, Y\. Tian, S\. S\. Mishra, Z\. Xu, M\. Gandhi, P\. Xu, and L\. L\. Cheong \(2025b\)Reinforcement learning for self\-improving agent with skill library\.arXiv preprint arXiv:2512\.17102\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.08670#S2.SS2.SSS0.Px3.p1.1)\.
- Z\. Wang, Q\. Wu, X\. Zhang, C\. Zhang, W\. Yao, F\. E\. Faisal, B\. Peng, S\. Qin, S\. Nath, Q\. Lin,et al\.\(2026b\)WebXSkill: Skill learning for autonomous web agents\.arXiv preprint arXiv:2604\.13318\.Cited by:[§2\.2](https://arxiv.org/html/2605.08670#S2.SS2.SSS0.Px2.p1.1)\.
- P\. Xia, J\. Chen, H\. Wang, J\. Liu, K\. Zeng, Y\. Wang, S\. Han, Y\. Zhou, X\. Zhao, H\. Chen,et al\.\(2026\)SkillRL: Evolving agents via recursive skill\-augmented reinforcement learning\.arXiv preprint arXiv:2602\.08234\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.08670#S2.SS2.SSS0.Px3.p1.1)\.
- R\. Xu and Y\. Yan \(2026\)Agent skills for large language models: Architecture, acquisition, security, and the path forward\.arXiv preprint arXiv:2602\.12430\.Cited by:[§2\.1](https://arxiv.org/html/2605.08670#S2.SS1.p1.1)\.
- K\. Yang, A\. Swope, A\. Gu, R\. Chalamala, P\. Song, S\. Yu, S\. Godil, R\. J\. Prenger, and A\. Anandkumar \(2023\)LeanDojo: Theorem proving with retrieval\-augmented language models\.InNeurIPS,pp\. 21573–21612\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: Synergizing reasoning and acting in language models\.InICLR,Cited by:[§3\.1](https://arxiv.org/html/2605.08670#S3.SS1.p2.10),[§4](https://arxiv.org/html/2605.08670#S4.SS0.SSS0.Px2.p1.1)\.
- M\. Yuksekgonul, F\. Bianchi, J\. Boen, S\. Liu, P\. Lu, Z\. Huang, C\. Guestrin, and J\. Zou \(2025\)Optimizing generative AI by backpropagating language model feedback\.Nature639\(8055\),pp\. 609–616\.Cited by:[§1](https://arxiv.org/html/2605.08670#S1.p4.1),[§3\.4](https://arxiv.org/html/2605.08670#S3.SS4.p1.14)\.
- H\. Zhang, S\. Fan, H\. P\. Zou, Y\. Chen, Z\. Wang, J\. Zhou, C\. Li, W\. Huang, Y\. Yao, K\. Zheng, X\. Liu, X\. Li, and P\. S\. Yu \(2026a\)CoEvoSkills: Self\-evolving agent skills via co\-evolutionary verification\.arXiv preprint arXiv:2604\.01687\.Cited by:[§2\.2](https://arxiv.org/html/2605.08670#S2.SS2.SSS0.Px3.p1.1)\.
- Q\. Zhang, C\. Hu, S\. Upasani, B\. Ma, F\. Hong, V\. Kamanuru, J\. Rainton, C\. Wu, M\. Ji, H\. Li, U\. Thakker, J\. Zou, and K\. Olukotun \(2026b\)Agentic context engineering: Evolving contexts for self\-improving language models\.InICLR,Cited by:[§2\.2](https://arxiv.org/html/2605.08670#S2.SS2.SSS0.Px3.p1.1),[§4](https://arxiv.org/html/2605.08670#S4.SS0.SSS0.Px2.p1.1)\.
- S\. Zilberstein \(1996\)Using anytime algorithms in intelligent systems\.AI Magazine17\(3\),pp\. 73–73\.Cited by:[§3\.4](https://arxiv.org/html/2605.08670#S3.SS4.p2.6)\.

## Appendix ALimitations and Broader Impacts

MIND\-Skill requires successful trajectories as input to the skill induction process, since the reconstruction loss relies on a reference trajectory to provide optimization signals\. This couples the scope of the generated skill library to the set of tasks for which successful trajectories can be obtained\. In practice, however, this dependency can be satisfied in multiple ways: besides model rollouts, ground\-truth solution scripts can also serve as surrogate trajectories, as described in our fallback strategy \(Appendix[B\.3](https://arxiv.org/html/2605.08670#A2.SS3)\)\.

MIND\-Skill aims to automate the creation of reusable agent skills, reducing the manual effort required from domain experts and making high\-quality procedural knowledge more accessible\. Our results show that even weaker models can produce competitive skills after optimization, which could democratize access to capable AI agents\. On the other hand, as with any advance in autonomous agent capabilities, automatically generated skills could in principle be used to automate undesirable agent behaviors\. We note that this risk is shared broadly across the agent skill and agent framework literature and is not specific to our method\. The skills produced by MIND\-Skill are human\-readable Markdown documents, which facilitates auditing and oversight before deployment\.

## Appendix BImplementation Details

All experiments use Qwen3\.5\-122B\-A10B with extended thinking disabled as the inference model\. In the cross\-model setting \(Table[1](https://arxiv.org/html/2605.08670#S4.T1), lower block\), GPT\-5\.4 is used for skill generation and optimization while inference remains on Qwen; the exact role assignments per method are detailed below\. All LLM calls are issued through OpenRouter\.111[https://openrouter\.ai/](https://openrouter.ai/)For every Qwen\-3\.5 call we additionally pin the upstream provider to the model’s native vendor Alibaba via OpenRouter’sproviderrouting field,222[https://openrouter\.ai/docs/features/provider\-routing](https://openrouter.ai/docs/features/provider-routing)because open\-weight models on OpenRouter are served by multiple upstream providers \(e\.g\., Alibaba, Novita, AtlasCloud, Venice\) whose deterministic batching and CUDA kernels differ enough to produce cross\-provider variance\. All methods share the same training–test partition\.

### B\.1ACE Implementation Details

We re\-implement ACE on AppWorld and BFCL following the official released code, using Qwen\-122B for all three roles \(Generator, Reflector, Curator\)\. In the cross\-model setting, the Reflector and Curator are replaced with GPT\-5\.4; the Generator and inference agent remain on Qwen\-122B\. We use the offline\-with\-GT mode, which is the most directly analogous setting to our training pipeline: both use ground\-truth solutions during training\. Training proceeds sequentially over the training split: for each task, the Generator produces a ReAct trajectory; if the trajectory fails its unit test, the Reflector compares it against the ground\-truth solution code to diagnose the failure; the Curator then distills the reflection into structured bullets \(e\.g\., strategies and hard rules, API usage patterns, common mistakes\) that are appended to a shared playbook\. We allow up to 5 retries per training task following the released configuration\. At test time, the entire accumulated playbook is injected verbatim into the agent’s system prompt regardless of task relevance\.

### B\.2Trace2Skill Implementation Details

We re\-implement Trace2Skill on AppWorld and BFCL following the open\-source release \(Skill Creation mode\), using Qwen\-122B for all roles\. In the cross\-model setting, the success/error analysts and the hierarchical merge LLM are replaced with GPT\-5\.4; rollout collection and inference remain on Qwen\-122B\. This is by Trace2Skill’s design: the analysts must observe successes and failures of the same model that will be deployed at test time, so the resulting skill targets its failure modes\. We roll out on the training tasks\. The success analyst processes each passing trajectory in a single LLM call\. The error analyst runs an agentic ReAct loop \(max\_turns=40\) with pass\-gating; since AppWorld’s evaluation depends on cumulative database state rather than the single\-shot script\-vs\-file comparison assumed by the original code, we replace the evaluator with a stateful REPL exposingappworld\_execute,appworld\_evaluate, andappworld\_reset\. All other methodology \(causality gate, hierarchical merge, patch vocabulary\) is preserved\. We run the hierarchical merge pipeline, and the resulting skill directory is injected verbatim into the ReAct system prompt at inference time\.

### B\.3MIND\-Skill Implementation Details

Qwen\-122B is used as the base model for every agent \(induction, deduction, judges, gradient, and optimizer\) in the self\-generation setting\. In the cross\-model setting, GPT\-5\.4 replaces all roles except deduction and inference, which remain on Qwen\-122B\. The ablation studies and analyses in §[4\.2](https://arxiv.org/html/2605.08670#S4.SS2)use the self\-generation setting with Qwen\-122B as the induction agent\. For each training task, we obtain a reference trajectory by rolling out strong model in the AppWorld environment; if the rollout fails the task checker, we fall back to wrapping the ground\-truth solution code as a trajectory\. After optimization \(Q=8Q\{=\}8iterations per task\), we retain the best\-so\-far skill for each task \(Algorithm[1](https://arxiv.org/html/2605.08670#alg1), lines 12–13\), yielding a library with one skill per training task\. At test time, only each skill’snameanddescriptionare exposed to a Qwen\-122B retrieval call that selects the top\-KKmost relevant skills; their full markdown bodies are concatenated into the agent’s skill slot \(§[D\.2](https://arxiv.org/html/2605.08670#A4.SS2)\)\. We setK=3K\{=\}3as the default and report ablations overK∈\{1,2,3,4,5\}K\\in\\\{1,2,3,4,5\\\}\. Every LLM call is wrapped in a 3\-attempt retry: empty responses \(rate limits, transient outages\) trigger a same\-message retry, while non\-empty responses that fail schema validation are appended back as an assistant turn followed by a fix instruction, recovering both API failures and format violations\.

## Appendix CCase Studies

We present two case studies that complement the quantitative results\. The first \(§[C\.1](https://arxiv.org/html/2605.08670#A3.SS1)\) examines why baseline\-generated skills suffer from quality issues that MIND\-Skill avoids\. The second \(§[C\.2](https://arxiv.org/html/2605.08670#A3.SS2)\) investigates why skills generated by a weaker model can match those from a stronger one\.

\(a\) ACE: task\-specific memorization

Task 60d0b5b:‘‘The last Venmo payment request I sent to Robert was an accident and they approved it\. Send them the money back\.’’

\# Section: STRATEGIES AND HARD RULES Venmo Refund Workflow: When a task requires refunding an accidental ‘payment request’ that was approved: \(1\) use show\_sent\_payment\_requests with status=‘approved’ \(2\) filter by receiver’s email \(3\) sort by created\_at descending \(4\) take the most recent request \(5\) refund via create\_transaction

\(b\) Trace2Skill: malformed structure

\#\# When to Apply \- Filesystem\-style tasks: navigating \.\.\. \- Social / messaging tasks: posting \.\.\. \- Booking / ticketing / trading: creating \.\.\. \- Multi\-domain composition: tasks \.\.\. \(2 more applicability bullets\) 3a\. Trust tool responses: Treat all values returned by tools \.\.\. 3b\. Calculate derived values and \.\.\. \(3c\-\-3m: 10 more procedural steps\) 3m\. \.\.\. pass them as a list argument \(e\.g\., \[5000, 7000\]\)\.\#\# Procedure 1\. Per\-turn intent extraction: Read \.\.\. 1\.5 Check initial service state: Before \.\.\. 2\. Schema check before emitting: For\.\.\.

\(c\) MIND\-Skill: transferable and well\-structured

\#\# When to Apply\- Target API requires auth token via a separate login step\.\- List endpoint returns data inpages, requiring a loop\.\- Task involves identifying oneitem by alabel or attribute,modifying it, and applying adifferent state change to allremaining items\.\#\# Procedure1\. Authenticate: Retrieve token\.2\. Paginate and Collect: Loop until empty to collect all\.3\. Identify Target: Find item byspecific label\.4\. Apply Specific Update:Modify target \(e\.g\.,time shift\)\.5\. Apply Bulk Update:Disablethe rest\.6\. Verify: Re\-fetch and confirm\.\#\# Key Patterns\- Pagination Loop: while\-loop,break on empty page\.\- Selective Mutation: uniqueupdate for target, uniformfor rest\.\- State Verification: post\-update re\-fetch to validate\.\#\# Common Pitfalls\- Failing to loop through allpages, missing items\.\- Updating target incorrectlyor failing to exclude itfrom the bulk update\.

Figure 5:\(a\)ACE encodes the solution of training task60d0b5bas a five\-step recipe \(blue: memorized steps\); its trigger is a near\-verbatim paraphrase of the task instruction\.\(b\)Trace2Skill misplaces 13 procedural steps \(blue\) under “When to Apply” and concatenates a section header inline \(orange\)\.\(c\)MIND\-Skill uses only conceptual placeholders \(green: e\.g\., “label or attribute”, “time shift”\) instead of memorized app\-specific values, and maintains clean section boundaries throughout\.### C\.1Case Study: Skill Quality Degrades Without Explicit Guarantees

Figure[5](https://arxiv.org/html/2605.08670#A3.F5)contrasts representative skills from all three methods\. ACE’s playbook entry \(a\) memorizes the solution of a single training task as a high\-priority rule, which misfires on any test task that deviates from that scenario\. Trace2Skill’s SKILL\.md \(b\) misplaces procedural steps under the wrong section header, producing malformed structure that downstream agents struggle to parse\. In contrast, the MIND\-Skill entry \(c\), generated from training task302c169\_1, uses only conceptual placeholders \(“label or attribute”, “time shift”, “remaining items”\) rather than memorized app\-specific values, and maintains clean section boundaries where Procedure contains only ordered actions, Key Patterns names only transferable abstractions, and Common Pitfalls lists only failure modes\. These differences directly reflect our rubric loss: itsground\-truth independencedimension penalizes task\-specific memorization as in \(a\), itsactionabilityandcompletenessdimensions enforce structural coherence absent in \(b\), and together they produce skills like \(c\) that are both transferable and well\-structured\.

\(a\) Qwen\-self skill \(Net \+12\)

\#\# Procedure 1\. Authenticate: Obtain access token\. 2\. Paginate and Collect: Loop pages until empty to collect all items\. 3\. Identify Target: Find item by label\. 4\. Apply Specific Update: Modify target\. 5\. Apply Bulk Update: Disable the rest\. 6\. Verify: Re\-fetch and confirm\. \#\# Key Patterns \- Pagination Loop: while\-loop, break on empty page\. \- Selective Mutation: unique update for target, uniform for rest\. \- State Verification: post\-update re\-fetch to validate\.

\(b\) GPT\-teach skill \(Net \+1\)

\#\# Procedure 1\. Inspect API docs for endpoints\. 2\. Authenticate and store token\. 3\. Read listing endpoint docs\. 4\. Paginate until empty, collect all\. 5\. Identify target by attribute\. 6\. Read update endpoint docs\. 7\. Update target with modification\. 8\. Bulk\-update all non\-target items\. 9\. Re\-fetch and verify both conditions\. 10\. Mark task complete\. \#\# Key Patterns \- Doc\-first execution \- Credential bootstrap \- Paginate\-until\-empty \- Target\-then\-bulk \- Verify\-by\-refetch

Figure 6:Paired skills from the same training task \(302c169\_1\)\. Net contribution = test tasks flipped from fail to pass minus pass to fail, relative to the no\-skill baseline\. Both skills encode the same procedural logic, but differ in vocabulary: Qwen\-self uses plain labels \(blue\) while GPT\-teach adopts textbook\-style pattern names \(orange\)\.
### C\.2Case Study: Why Weaker Skill Generators Can Match Stronger Ones

A perhaps counterintuitive finding in Table[1](https://arxiv.org/html/2605.08670#S4.T1)is that MIND\-Skill with the weaker Qwen3\.5\-122B\-A10B as skill generator \(59\.1 average\) achieves comparable performance to MIND\-Skill with GPT\-5\.4 \(58\.9 average\)\. We investigate this through a paired case study on training task302c169\_1, where both pipelines optimize a skill from the same source trajectory\. We measure each skill’snet contributionby tracking all test tasks that retrieved it: among those tasks, we count how many flipped from fail to pass after skill injection, minus how many regressed from pass to fail, relative to the no\-skill baseline\. The Qwen\-self skill achieves a net contribution of \+12, while the GPT\-teach skill achieves only \+1\.

Figure[6](https://arxiv.org/html/2605.08670#A3.F6)compares the two skills side by side\. Both encode the same procedural logic \(authenticate, paginate, identify, update, verify\), yet they differ markedly in style\. The GPT skill adopts textbook\-style pattern names \(“Doc\-first execution”, “Credential bootstrap”, “Target\-then\-bulk”\) and includes defensive caveats \(“check partial\-update behavior”, “confirm paging behavior and returned attributes”\) that reflect GPT\-5\.4’s own reasoning preferences\. The Qwen skill uses plainer vocabulary \(“Pagination Loop”, “Selective Mutation”\) at the abstraction level Qwen naturally operates at\. When injected into Qwen’s prompt at inference time, the self\-authored skill is decoded naturally, whereas the GPT\-authored skill requires implicit style adaptation that can dilute the procedural signal\. This is not a matter of correctness; GPT’s labels are arguably more precise, but precision in a foreign dialect does not help the inference model act on it\. Moreover, the GPT skill is calibrated to its own capability: it recommends fine\-grained checks \(e\.g\., inspecting partial\-update semantics\) that GPT\-5\.4 can execute but Qwen cannot operationalize, consuming attention on sophistication the inference model has no headroom to exploit\. Self\-training avoids both costs by construction, since the writer and reader share the same distribution and capability profile\. This suggests that after quality\-guaranteed optimization, such alignment becomes a more important factor than the raw reasoning capability of the skill generator\.

## Appendix DPrompt Design

This section presents the key prompts used in MIND\-Skill\. For readability, all prompts are abbreviated to their essential structure and instructions\.

### D\.1Induction Prompt and Its Evolution

The induction agent’s system prompt𝒫I\\mathcal\{P\}\_\{I\}is the sole variable optimized by TextGrad\. Figure[7](https://arxiv.org/html/2605.08670#A4.F7)contrasts the universal initial prompt𝒫I\(0\)\\mathcal\{P\}\_\{I\}^\{\(0\)\}with an optimized variant𝒫I∗\\mathcal\{P\}\_\{I\}^\{\*\}obtained after four iterations on a representative training task\. The optimizer inserts domain\-pattern rules and explicit abstraction\-leakage warnings with paired Bad/Good examples, growing the prompt from∼\{\\sim\}530 to∼\{\\sim\}2\.0K tokens\. These additions are textual rules derived from gradient feedback on specific failure modes, not manual engineering\.

Initial Induction Prompt𝒫I\(0\)\\mathcal\{P\}\_\{I\}^\{\(0\)\}\(∼\\sim530 tokens\)Role:You are an expert at extracting reusable procedural strategies from task solutions\.Given a task instruction and its solution code, extract a SKILL that describes the procedure pattern—the structural “how\-to” that is NOT obvious from the instruction alone\.Rules for a good skill:1\.Describe ONLY solving strategy and structural patterns: authentication flow, pagination/iteration, multi\-step data retrieval, data transformation, output construction\.2\.Do NOT include task\-specific info: no specific API names, field names, entity names, thresholds\.Test: if someone can guess the original task from your skill alone, it is too specific\.3\.Focus on NON\-OBVIOUS structural knowledge\.Output:Valid SKILL\.md with YAML frontmatter, followed by sections: Overview, When to Apply, Procedure, Key Patterns, Common Pitfalls\.

Optimized Prompt𝒫I∗\\mathcal\{P\}\_\{I\}^\{\*\}\(∼\\sim2\.0K tokens, iteration 5\)\[Role and Rule 1 prefix unchanged\]\+ Identifier & Scope Resolution:Discover the correct data source via parent\-container endpoint; inspect schema for the unique key field\.\+ Constraint\-Based Data Filtering:Construct subsets by hierarchy traversal; warn that a global “list all” may exceed the task scope\.\+ Output Construction & Completion Signals:Silent completion when no report is requested; describe structural schema \(not values\) when a report IS requested\.\[Rule 2 expanded with:\]\+ STRICT ABSTRACTION RULE:Never include specific endpoints, field names, or boolean values\. Bad:“Callshow\_song\_privatesto check iflikedis true\.” Good:“Call the endpoint that exposes user\-specific state flags\.”\[\+ Common Pitfalls scaffold with solution rules for: hallucinating endpoints, assuming global lists match constraints, parameter consistency\.\]

Figure 7:The induction agent’s system prompt is the sole variable optimized by TextGrad\.\(a\)The universal initial prompt𝒫I\(0\)\\mathcal\{P\}\_\{I\}^\{\(0\)\}used for all training tasks\.\(b\)The optimized prompt𝒫I∗\\mathcal\{P\}\_\{I\}^\{\*\}after 4 TextGrad iterations on source task692c77d\_1\. Highlighted spans \(blue\) are rules TextGrad inserted to address failure modes observed during training, including the explicit Bad/Good examples for the abstraction rule\.
### D\.2Deduction Agent

Both training and evaluation share the same ReAct template, differing only in the content injected into the skill slot\. Figure[8](https://arxiv.org/html/2605.08670#A4.F8)illustrates the template structure and the injection mechanism\. The template consists of: \(i\) framing prose orienting the agent to AppWorld’s API discovery tools, \(ii\) the skill injection slot betweenSKILLS BEGIN/ENDmarkers, \(iii\) in\-context ReAct demonstration trajectories, and \(iv\) the real task instruction\. All baselines in our comparison \(ACE, Trace2Skill\) share this same template and differ only in what fills the skill slot\. At training time, the slot receives one candidate skill being optimized; at evaluation time, it receivesKKretrieved skills\.

Deduction Agent: Template StructureUSER:
I am your supervisor and you are a super intelligent AI assistant whose job is to achieve my tasks autonomously\.
To do this, you will interact with apps using their APIs \.\.\.
\[API discovery orientation:show\_app\_descriptions\(\),show\_api\_doc\(\), etc\.\]You are also provided with a curated set of skills to help you solve the task effectively\.
Read the skills first, then execute the task by explicitly leveraging each relevant section:\#\#\# SKILLS BEGIN
\{\{ skills \}\}training: 1 skill∣\\midevaluation:KKretrieved skills
\#\#\# SKILLS END\[in\-context ReAct demonstrations\]USER:
My name is: \{\{first\_name\}\} \{\{last\_name\}\}\.
Task: \{\{ input\_str \}\}real task instructionFigure 8:Abbreviated structure of the deduction agent’s prompt template\. Skills enter through a single\{\{skills\}\}slot; the only difference between training and evaluation is the number of injected skills \(1 vs\.KK\)\.
### D\.3Textual Loss Prompts

Our three textual losses are implemented as LLM judge calls that return scores on a 0–10 scale \(higher is better\)\. We adopt this convention because LLMs produce more calibrated assessments when prompted to score quality directly—for instance, a rubric score of 8/10 carries clear semantic meaning, whereas a loss value of 2 lacks intuitive grounding\. To conform to the standard minimization convention, we convert scores to losses viaℓ=c−score\\ell=c\-\\text\{score\}, whereccis the upper bound of the scoring range\. Figure[9](https://arxiv.org/html/2605.08670#A4.F9)presents the rubric loss prompt, which instructs the judge to classify each claim in the skill along a GT\-leakage counterfactual and score five quality dimensions\. Figure[10](https://arxiv.org/html/2605.08670#A4.F10)presents the reconstruction loss prompt, which evaluates procedural alignment between the source and reconstructed trajectories\. The outcome loss requires no prompt as it is computed directly from environment execution results\.

Skill Quality Rubric Prompt \(RUBRIC\_SYSTEM\)Role:You judge the quality of a procedural skill that will later be used by a coder who has NOT seen the reference solution\.You are given:\(1\)The task instruction \(the same one the coder will see\)\(2\)The skill that was extracted from the \(hidden\) reference solutionThe Central Question:How much of this skill is useful procedural knowledge vs\. leaked solution details?For each claim in the skill, classify it:\(A\)Standard convention— a general software pattern the coder can assume without context\.\(B\)Inferable from instruction— derivable from the task text alone\.\(C\)Leaked from ground truth— unknowable without seeing the reference solution \(exact API paths, specific algorithm choices, library decisions, hard\-coded thresholds\)\.Score each dimension \(0–10\):1\.GT\-Independence: fraction of content a developer could write from instruction alone\.2\.Actionability: can a coder use this \+ API docs to write working code?3\.Transferability: would this apply to a structurally similar task in a different domain?4\.Completeness: full procedure chain covered?5\.Conciseness: information\-dense, no redundant boilerplate?Output:JSON with per\-dimension scores, leaked claims, and issue summary\.Figure 9:Five\-axis rubric prompt with GT\-leakage counterfactual\. The overall score is gated on GT\-independence to prevent overfit skills from masquerading as actionable\.Trajectory Reconstruction Judge Prompt \(TRAJECTORY\_JUDGE\_SYSTEM\)Role:Evaluate whether a deduction agent’s ReAct trajectory follows the same procedural strategy as a reference trajectory on the same task\.Evaluate on procedural alignment, not literal text match\.Criteria:1\.Same sequence of API\-call families \(e\.g\., auth→\\tolist→\\todetail→\\toaction\), order and dependencies\.2\.Same control flow: pagination, accumulation loops, early\-exit conditions, branching\.3\.Deduction agent’s final environment observation converges to the same outcome\.Ignore:Different variable names, intermediate print statements, step\-count differences, specific IDs/values\.Output:JSON with alignment score \(0–10\), boolean flags for API sequence / control flow / final state match, and list of procedural mismatches\.Figure 10:Trajectory reconstruction judge prompt\. Scores procedural alignment rather than literal text match; tolerates step\-count and variable\-name differences as long as API\-family sequence and control flow agree\.
### D\.4Gradient and Optimizer Prompts

The gradient and optimizer LLMs form the two\-step TextGrad update cycle\. The gradient LLM \(Figure[11](https://arxiv.org/html/2605.08670#A4.F11)\) diagnoses failure patterns from rollout cases and produces textual feedback\. The optimizer LLM \(Figure[12](https://arxiv.org/html/2605.08670#A4.F12)\) then takes the current prompt and this feedback to produce an updated prompt, without access to rollout cases or scores\. This separation ensures a clean diagnostic\-then\-apply workflow\. Two design choices are worth noting: the gradient prompt explicitly instructs the LLM to refuse the naive fix of writing ground\-truth\-specific tokens into the skill even when execution failures seem to call for it, preserving GT\-independence as a hard constraint; the optimizer prompt enforces a format\-preservation rule to prevent the optimizer from deleting the SKILL\.md output specification across iterations\.

Gradient LLM System PromptRole:You are part of an optimization system that improves an induction agent prompt\. The induction agent extracts procedural skills from task solutions\. A deduction agent then uses these skills to reconstruct the solution\.Your job:Analyze cases where quality was low, and give feedback on how to improve the induction agent prompt so it produces better skills\.Each case may include:1\.Rubric scores\(0–10\): GT\-independence, actionability, transferability, conciseness, plus specific leaked claims flagged by the judge\.2\.Reconstruction score/issues: deduction agent’s trajectory vs\. reference—shows what the deduction agent got wrong\.3\.Execution result: pass/fail with error messages\.Critical tradeoff:When execution fails because the deduction agent guessed the wrong field name, the naive fix is to write the correct name into the skill\.Do NOT endorse this\.It makes execution pass but destroys GT\-independence\. Better fix: improve the procedure wording so it guides the deduction agent to inspect API docs for the correct field, rather than hard\-coding the field path\.Output:Describe what to change in the induction agent prompt and why\. Do NOT propose a new prompt\.Figure 11:Gradient LLM prompt\. The LLM receives low\-quality and high\-quality rollout cases and produces a textual diagnosis\. It is explicitly instructed to reject the naive fix of leaking ground\-truth details into skills\.Optimizer LLM System PromptRole:You are part of an optimization system that improves text prompts\. You will receive the current prompt and feedback on its weaknesses\. Produce an improved version\.Rules:1\.Make targeted changes that address the specific feedback\.2\.Do not break things that are already working\.3\.Keep roughly the same length and structure\.4\.Hard constraint:The prompt MUST keep the SKILL\.md output format specification intact, including YAML frontmatter withnameanddescriptionfields\. Do not remove, weaken, or omit the format specification\.Output:The improved prompt wrapped in<IMPROVED\_VARIABLE\>tags\. No explanation outside the tags\.Figure 12:Optimizer LLM prompt receives the current prompt and gradient feedback\.

Similar Articles

SkillGen: Verified Inference-Time Agent Skill Synthesis

arXiv cs.LG

This article introduces SkillGen, a multi-agent framework that synthesizes and verifies reusable inference-time skills for LLM agents by contrasting successful and failed trajectories. The method ensures skills are auditable and empirically verified for their net positive impact on agent performance.

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Hugging Face Daily Papers

SkillOpt introduces a systematic text-space optimizer for agent skills that trains skills as external agent state with stable updates and zero deployment inference overhead, achieving superior performance across multiple benchmarks and execution environments.