AIP: A Graph Representation for Learning and Governing Agent Skills

arXiv cs.AI 06/04/26, 04:00 AM Papers
Summary
The Agent Instruction Protocol (AIP) proposes modeling AI agent skills as directed execution graphs with schema-validated YAML specifications, replacing free-form prose instructions. Experiments show AIP compilation raised Claude Sonnet's task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks.
arXiv:2606.04781v1 Announce Type: new Abstract: Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks from SkillsBench - a statistically significant gain (Wilcoxon signed-rank p = 0.011), winning 12 tasks to 2 with 13 ties - often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.
Cached at: 06/05/26, 02:09 AM
# AIP: A Graph Representation for Learning and Governing Agent Skills
Source: [https://arxiv.org/html/2606.04781](https://arxiv.org/html/2606.04781)
###### Abstract

Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks from SkillsBench, a statistically significant gain (Wilcoxon signed-rank $p = 0.011$), winning 12 tasks to 2 with 13 ties, often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.

## 1. Introduction

*Agent Skills* (Anthropic, [2025](https://arxiv.org/html/2606.04781#bib.bib18)), introduced by Anthropic in 2025 and subsequently released as an open standard, have quickly become a dominant representation for packaging reusable agent capabilities. A skill is a directory built around a `SKILL.md` file: YAML frontmatter declaring a `name` and `description`, followed by natural-language instructions, optional scripts, and reference files that the agent loads on demand through *progressive disclosure*. The appeal of the format is that it lets a broad range of domain experts, not only model developers, transfer procedural knowledge to AI agents in a lightweight, human-readable form (Bakal, [2026](https://arxiv.org/html/2606.04781#bib.bib20)), and curated skills measurably raise task success across diverse domains (Li et al., [2026](https://arxiv.org/html/2606.04781#bib.bib1)).

Yet a `SKILL.md` is a relatively free-form markdown document, and the skill as a whole is consumed as natural-language context. As a result, the representation inherits several limitations that grow more acute as agents are deployed on harder, implementation-heavy tasks and as the field increasingly turns to agents to help maintain and improve them:

1. Skills neither capture nor enforce structure where warranted, leaving speed and reliability on the table. Some context and judgment resist formalization, but a large share of a skill's procedural knowledge can be expressed as runnable code and explicit graphs of workflow steps, which is increasingly the way reliable agents are built (LangChain, [2025](https://arxiv.org/html/2606.04781#bib.bib2); Google, [2025](https://arxiv.org/html/2606.04781#bib.bib3); Schluntz and Zhang, [2024](https://arxiv.org/html/2606.04781#bib.bib4)). Skills allow this logic to remain in free prose for the agent to derive at runtime, so on every new session an agent re-plans the code, commands, and tool calls the task needs. On implementation-heavy tasks this is slow and token-intensive, and it lets the agent take different, sometimes erroneous, paths, costing reliability through both the per-run burden and the run-to-run variance. Offloading deterministic steps to vetted, runnable code (Gao et al., [2023](https://arxiv.org/html/2606.04781#bib.bib11); Chen et al., [2023](https://arxiv.org/html/2606.04781#bib.bib12)) and explicitly structured workflows (Schluntz and Zhang, [2024](https://arxiv.org/html/2606.04781#bib.bib4)) rather than re-deriving them is known to improve reliability. A mechanism that reliably compiles this structurable knowledge into runnable code and execution graphs (Khattab et al., [2023](https://arxiv.org/html/2606.04781#bib.bib5)) therefore stands to lift skill performance.
2. Skill creation and tuning remains a slow, recurring process. A skill must be tuned like a prompt. A skill consisting of mostly free-form prose is read and re-interpreted by the agent on every run, so it inherits the well-documented brittleness of prompts: small, semantics-preserving changes in wording or formatting can swing task accuracy by tens of points (Sclar et al., [2024](https://arxiv.org/html/2606.04781#bib.bib19)). A skill that works is therefore fiddly to get right and easily knocked off course.
3. Agent-assisted and self-improvement of skills remains a challenge. A skill is only worth shipping when it encodes procedural knowledge underrepresented in model training (Li et al., [2026](https://arxiv.org/html/2606.04781#bib.bib1); Bakal, [2026](https://arxiv.org/html/2606.04781#bib.bib20)); otherwise the agent already knows it. Experts therefore author skills but, due to the challenges noted above, increasingly enlist agents to help *revise* them, and the field is pushing toward agent self-improvement and reinforcement learning over skills (Gao et al., [2025](https://arxiv.org/html/2606.04781#bib.bib21); Zweiger et al., [2025](https://arxiv.org/html/2606.04781#bib.bib22); Robeyns et al., [2025](https://arxiv.org/html/2606.04781#bib.bib23); Xu and Yan, [2026](https://arxiv.org/html/2606.04781#bib.bib24)). This is hard for two compounding reasons: the material is unfamiliar by construction (the agent is revising domain-specific procedures it does not itself know), and free-form prose gives it no bounded surface to edit against. With nothing to constrain the edit, the agent's additive bias runs unchecked: outputs skew toward length and verbosity (Singhal et al., [2024](https://arxiv.org/html/2606.04781#bib.bib26); Zhang et al., [2024](https://arxiv.org/html/2606.04781#bib.bib27)), and agents over-engineer and accrete rather than refine (Licorish et al., [2025](https://arxiv.org/html/2606.04781#bib.bib28)). Skills balloon and grow convoluted, intent drifts, and a fix in one place silently breaks another, while intrinsic self-revision without external feedback often leaves quality unchanged or degraded (Huang et al., [2024](https://arxiv.org/html/2606.04781#bib.bib30); Xu et al., [2024](https://arxiv.org/html/2606.04781#bib.bib31)). Without a more structured representation, skill improvement has neither a strong natural feedback loop nor a bounded action space.
4. Cause, effect, and governance are hard to establish. Free-form prose has no clear addressable units: when a skill produces a wrong result there is no node, step, or typed value to which the failure can be precisely attributed, and a corpus of prose skills cannot be systematically audited for missing validation or approval steps. This runs against the traceability and accountability that governing agentic systems at scale demands (Saini, [2026](https://arxiv.org/html/2606.04781#bib.bib32)).

The Agent Instruction Protocol (AIP)[^1] addresses these issues by modeling a skill as a directed execution graph. Discrete steps become nodes, each backed by a deterministic script or, where human-like judgment is required, a natural-language description. Nodes are connected by typed input/output edges, and the whole structure is governed by a schema-validated YAML specification. A compiler meta-skill allows agents to translate existing human-written skills into this form at authoring time, surfacing ambiguities, type errors, and field inconsistencies before they can cause runtime failures. Crucially, AIP does not displace human authorship: experts still write the skills, and the compiler translates their human-written source into a typed, script-backed surface. That surface improves how reliably agents *execute* skills today, which is the result we demonstrate, and we argue its structure is a better substrate for agent-assisted improvement and reinforcement learning tomorrow.

[^1]: Today AIP is realized as a *specification*: a schema-validated execution-graph format that an agent reads into context and follows. We retain the term *protocol* for the typed contract this format defines, and for a runtime that would enforce graph traversal and local and remote skill calls; we set that out as future work (Section [6.3](https://arxiv.org/html/2606.04781#S6.SS3)).

We evaluate AIP against human-curated skills on SkillsBench (Li et al., [2026](https://arxiv.org/html/2606.04781#bib.bib1)), a benchmark of 94 agent tasks across 8 domains, using a stratified 27-task sample with Claude Sonnet as the solver. Compiling skills to AIP produced a statistically significant improvement in task reward (Wilcoxon signed-rank $p = 0.011$), and we found that failures in AIP skills can be diagnosed and corrected at the node level, given coded scripts connected by a clear, typed execution-graph I/O structure, without regressions elsewhere. This suggests that the benefits of AIP may compound for agent self-improvement and reinforcement learning (RL), where the typed execution graph provides a bounded, validity-gated action space for learning over skills (Sutton and Barto, [2018](https://arxiv.org/html/2606.04781#bib.bib15)).
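As a rough sanity check on the reported significance, the win/loss/tie counts given in the abstract (12 wins, 2 losses, 13 ties) can be fed into an exact sign test. This is a cruder test than the Wilcoxon signed-rank test the paper uses, because it ignores the size of each per-task difference, so it is not a reproduction of the reported $p = 0.011$. The function name below is illustrative and not part of any AIP tooling.

```python
from math import comb

# Counts reported in the abstract: 12 wins, 2 losses, 13 ties over 27 tasks.
wins, losses, ties = 12, 2, 13
assert wins + losses + ties == 27


def sign_test_two_sided(wins: int, losses: int) -> float:
    """Exact two-sided sign test; ties are dropped, as is conventional."""
    n = wins + losses
    k = max(wins, losses)
    # P(X >= k) under Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_two_sided(wins, losses)
mean_gain = round(0.705 - 0.599, 3)  # mean rewards reported in the contributions list

print(f"sign-test p = {p_value:.4f}, mean reward gain = {mean_gain}")
```

With only 14 non-tied tasks, the sign test lands at roughly 0.013, the same order of magnitude as the reported Wilcoxon value.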

Contributions. Our work makes the following contributions:

- A multi-mode benchmark harness. An extension to SkillsBench for comparing skill *formats* and skill *authoring methods* under a real solver, enabling controlled head-to-head evaluation.
- Empirical evidence of improved task reward. Compiling human-written skills to AIP significantly improves task reward across 27 tasks (mean reward $0.599 \rightarrow 0.705$, $+0.106$; Wilcoxon $p = 0.011$), with the mechanism localized to executability and procedural consistency.
- A demonstrated mechanism for skill self-improvement. During early experimentation, two compiler-authored failure modes were diagnosed at the node and script level by an agent (Claude Code). In this case they were corrected by editing the AIP specification and recovered with zero regressions. More broadly, the tooling shows strong potential for a diagnose, edit, recompile, evaluate loop with a measurable reward signal: a working substrate for agent-assisted improvement and, prospectively, autonomous self-improvement and reinforcement learning over skills.
- A path to corpus-level governance and inspection. Because each skill is a typed, schema-validated graph, a library of AIP skills is queryable and inspectable: missing validation or approval steps can be audited, shared sub-procedures discovered, and skills composed from reusable node templates. Projected into a graph database, the same structure supports access control and visual introspection, with a skill's steps and their typed input/output rendered as a graph (Figure [2](https://arxiv.org/html/2606.04781#S3.F2)). This moves governance from manual documentation review to a structured query over a typed graph. As agents grow more autonomous and begin running their own self-improvement loops, this inspectable surface is what keeps human oversight and understanding tractable.

## 2. Background

Modern LLM agents carry out tasks by interleaving natural-language reasoning with calls to external tools (running code, querying systems, editing files) and observing the results (Yao et al., [2023](https://arxiv.org/html/2606.04781#bib.bib9); Schick et al., [2023](https://arxiv.org/html/2606.04781#bib.bib34)). Because many tasks recur, these agents increasingly draw on reusable *skills*: packaged procedural knowledge (instructions, and sometimes code) that tells them how to carry out a class of tasks (Wang et al., [2023](https://arxiv.org/html/2606.04781#bib.bib10)). Anthropic's Agent Skills standardize this as a `SKILL.md` file in a directory of optional resources the agent loads on demand (Figure [1](https://arxiv.org/html/2606.04781#S2.F1)) (Anthropic, [2025](https://arxiv.org/html/2606.04781#bib.bib18); Agent Skills, [2025](https://arxiv.org/html/2606.04781#bib.bib17)).

Skill directory

    pdf-processing/
      SKILL.md       # required: YAML frontmatter + instructions
      scripts/       # optional: executable code (e.g. extract.py)
      references/    # optional: docs the agent loads on demand
      assets/        # optional: templates, resources

SKILL.md

    ---
    name: pdf-processing
    description: Extract text and tables from PDFs, fill forms,
      and merge files. Use when working with PDF documents.
    ---
    # PDF Processing
    ## Extracting text
    Run `scripts/extract.py <file>`; it writes one block per page...
    ## Filling forms
    See references/forms.md for the field-mapping convention...

Figure 1. An Agent Skill is a directory built around a `SKILL.md` file, consisting of YAML frontmatter (required `name` and `description`) followed by free-form Markdown instructions, alongside optional `scripts/`, `references/`, and `assets/` that the agent loads only as a task requires (*progressive disclosure*) (Agent Skills, [2025](https://arxiv.org/html/2606.04781#bib.bib17)).

A skill's procedure lives largely in the free-form Markdown body of its `SKILL.md` (Figure [1](https://arxiv.org/html/2606.04781#S2.F1)), which the agent reads and interprets at runtime; although the format permits bundled scripts, in practice much of the procedure stays in prose.

Graphs and structured agent workflows. In industry, agent development frameworks have already started representing agent workflows with graphs whose nodes are discrete steps and whose edges carry the data passed between them. This includes frameworks such as LangGraph and Google's Agent Development Kit (ADK) (LangChain, [2025](https://arxiv.org/html/2606.04781#bib.bib2); Google, [2025](https://arxiv.org/html/2606.04781#bib.bib3)). Earlier still, DSPy, a Python framework for building AI systems, used declarative pipelines that are compiled rather than hand-written, automatically optimizing how language models are prompted (or fine-tuned) and replacing brittle, manually tuned prompt templates (Khattab et al., [2023](https://arxiv.org/html/2606.04781#bib.bib5)). Anthropic, the AI safety and research company behind the Claude family of models and the agent skill spec, also recommends predefined, structured workflows where predictability and reliability matter (Schluntz and Zhang, [2024](https://arxiv.org/html/2606.04781#bib.bib4)). Dedicated benchmarks likewise find that supplying agents with explicit workflow structure improves their planning (Xiao et al., [2024](https://arxiv.org/html/2606.04781#bib.bib6)). Evidence that this matters comes from the same benchmark we use: on SkillsBench, the skills that help most pair focused guidance with executable code and reference files, while sprawling, comprehensive prose can even hurt performance (Li et al., [2026](https://arxiv.org/html/2606.04781#bib.bib1)). Structure, not just content, shapes how well a skill works. AIP brings this graph view to the skill itself: a skill becomes an execution graph of typed steps, each backed by a script or a natural-language description.

Measuring skills. We build on SkillsBench (Li et al., [2026](https://arxiv.org/html/2606.04781#bib.bib1)), a containerized benchmark that isolates the effect of a skill. Each task pairs a natural-language instruction with a sandboxed environment and a programmatic verifier, run through the BenchFlow SDK (BenchFlow, [2026](https://arxiv.org/html/2606.04781#bib.bib7)), and is scored under three conditions: no skill, a human-curated skill, and a skill the agent generates for itself. Measuring task success across these conditions lets a skill's contribution be quantified directly; this is the substrate our experiments extend with AIP.

## 3. The Agent Instruction Protocol (AIP)

The Agent Instruction Protocol (AIP), an extension of the Agent Skills specification (Agent Skills, [2025](https://arxiv.org/html/2606.04781#bib.bib17)), defines a skill as a directed execution graph. Today it is a specification, with the enforcing protocol left to future work (Section [6.3](https://arxiv.org/html/2606.04781#S6.SS3)). Nodes represent discrete operational steps: each step is either backed by a deterministic script for computational work, or described in natural language for steps requiring judgment or interaction. Step nodes are connected by typed input/output edges that make data flow explicit and checkable, and the entire structure is governed by a schema-validated YAML specification. Figure [2](https://arxiv.org/html/2606.04781#S3.F2) shows a real compiled skill rendered as such a graph. The major node and edge types are as follows.

    # Nodes
    Skill            # the procedure: purpose, trigger_when, scope_and_approval
    Step             # a unit of work: name, description; an optional backing
                     # script; typed inputs/outputs; depends_on / parallel / one_of

    # Edges between steps
    inputs/outputs   # a step's typed output (name, type) feeds another
                     # step's input -- the typed data-flow edges
    depends_on       # explicit ordering for DAG-shaped procedures

    # A step also binds package files
    script           # a deterministic body under scripts/
    references       # prose-cited docs under references/

Steps may also carry `parallel` and `one_of` control modifiers. The remaining top-level metadata (trigger and do-not-use conditions, anti-patterns, scenarios, modes, and integrations) attach as typed satellite nodes, keeping the entire specification queryable (Section [6.1](https://arxiv.org/html/2606.04781#S6.SS1)).
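To make the node and edge vocabulary concrete, here is a minimal Python sketch of steps with typed outputs, named inputs, and `depends_on` ordering, from which a valid execution order is derived. The class and function names (`Port`, `Step`, `execution_order`) are illustrative assumptions, not the AIP schema or its reference implementation. The step names come from the compiled `exoplanet-workflows` example in Figure 3.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Optional


@dataclass
class Port:
    """A typed output: the (name, type) pair carried on a data-flow edge."""
    name: str
    type: str


@dataclass
class Step:
    name: str
    inputs: list = field(default_factory=list)      # names of upstream outputs
    outputs: list = field(default_factory=list)     # list of Port
    script: Optional[str] = None                    # None marks a judgment (prose) step
    depends_on: list = field(default_factory=list)  # explicit ordering edges


def execution_order(steps):
    """Derive a valid step order from data-flow edges plus depends_on."""
    producer = {port.name: step.name for step in steps for port in step.outputs}
    graph = {
        step.name: {producer[i] for i in step.inputs} | set(step.depends_on)
        for step in steps
    }
    return list(TopologicalSorter(graph).static_order())


steps = [
    Step("validate-candidate", inputs=["detection"],
         outputs=[Port("verdict", "object")], script="scripts/validate_candidate.py"),
    Step("run-detection-pipeline", inputs=["lc-path", "column-map", "flag-good-value"],
         outputs=[Port("detection", "object")], script="scripts/detect_period.py"),
    Step("choose-search-strategy", outputs=[Port("method", "string")]),
    Step("load-and-inspect", outputs=[Port("lc-path", "string"),
                                      Port("column-map", "object"),
                                      Port("flag-good-value", "float")]),
]
order = execution_order(steps)
print(order)
```

Because the order is recovered from the declared inputs and outputs, the steps can be listed in any sequence; a cycle would raise `graphlib.CycleError` rather than silently producing a bad plan.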

![Refer to caption](https://arxiv.org/html/2606.04781v1/figures/exoplanet-detection-period.png)

Figure 2. A compiled AIP skill (`exoplanet-workflows`) as a directed execution graph, projected into Neo4j. The pink `Skill` node carries the name and description; teal nodes are procedure *steps*; arrows between steps are *typed input/output edges* (e.g. `lc-path`, string; `detection`, object) that make data flow explicit and checkable. A `Runs` edge binds a step to a deterministic *script* (orange); a `References` edge attaches a prose *reference* (yellow) for steps needing judgment. The same typed structure is what makes a skill node-addressable and queryable (Section [6.1](https://arxiv.org/html/2606.04781#S6.SS1)).

Compilation. AIP includes a compiler meta-skill that transforms human-written source material (existing skills, prose instructions, documentation, code, or informal descriptions) into the graph representation. The compilation step acts as a quality gate: schema validation catches type errors, missing fields, and structural inconsistencies at authoring time rather than at runtime. Ambiguities in the source material must be resolved to produce a valid graph, which forces clarity that prose allows to remain implicit. The meta-skill also provides instruction around creating scripts from natural language where appropriate, which we have found can be a load-bearing step for improving reward: some of the largest gains in our evaluation (Section [4](https://arxiv.org/html/2606.04781#S4)) come on prose-only skills where the solver would otherwise re-derive code at run time, such as `mars-clouds-clustering` and `dapt-intrusion-detection`. Figure [3](https://arxiv.org/html/2606.04781#S3.F3) shows this transformation on a real skill, contrasting the human-curated and compiled on-disk packages.
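The quality-gate idea can be illustrated with a small authoring-time check over an already-parsed specification. This is a hand-rolled sketch, not the AIP schema or its validator. The allowed type set only contains the types visible in the paper's example, and the rule that an input must be produced by an earlier step in listing order is a simplifying assumption.

```python
# Types seen in the paper's example; the real schema may well allow more (assumption).
ALLOWED_TYPES = {"string", "object", "float"}


def validate_spec(spec: dict) -> list:
    """Return a list of human-readable errors; an empty list means the spec passes."""
    errors = []
    declared = set()  # output names produced by steps seen so far
    for i, step in enumerate(spec.get("steps", [])):
        name = step.get("name")
        if not name:
            errors.append(f"steps[{i}]: missing required field 'name'")
            name = f"steps[{i}]"
        for ref in step.get("inputs", []):
            if ref not in declared:
                errors.append(f"{name}: input '{ref}' is not produced by an earlier step")
        for out in step.get("outputs", []):
            if out.get("type") not in ALLOWED_TYPES:
                errors.append(
                    f"{name}: output '{out.get('name')}' has unknown type {out.get('type')!r}"
                )
            declared.add(out.get("name"))
    return errors


good = {"steps": [
    {"name": "load-and-inspect",
     "outputs": [{"name": "lc-path", "type": "string"}]},
    {"name": "run-detection-pipeline", "inputs": ["lc-path"],
     "outputs": [{"name": "detection", "type": "object"}]},
]}

# Same skill with two authoring slips: a misspelled type and a misspelled input name.
bad = {"steps": [
    {"name": "load-and-inspect",
     "outputs": [{"name": "lc-path", "type": "strng"}]},
    {"name": "run-detection-pipeline", "inputs": ["lc_path"],
     "outputs": [{"name": "detection", "type": "object"}]},
]}

good_errors = validate_spec(good)
bad_errors = validate_spec(bad)
for message in bad_errors:
    print(message)
```

Both slips in `bad` would pass unnoticed in prose; here each is reported against the step that contains it before any solver runs the skill.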

Execution. At runtime, the agent loads the `SKILL.md` graph into its context window, where it can reason through the execution graph steps. With more scripts and explicit input/output between steps, the agent can re-focus language-model reasoning on the prose nodes that genuinely require judgment and avoid re-deriving well-understood procedures on every run; the same principle of offloading deterministic computation to code is known to improve reliability in program-aided settings (Gao et al., [2023](https://arxiv.org/html/2606.04781#bib.bib11); Chen et al., [2023](https://arxiv.org/html/2606.04781#bib.bib12)).

Addressability. Because each node is individually named, typed, and validatable, the graph is addressable at the component level. A failure in execution can be attributed to a specific node, the corresponding script can be inspected or corrected, and the repair can be validated in isolation. This property underlies both the skill-improvement loop evaluated in Section [4](https://arxiv.org/html/2606.04781#S4) and the governance, learning, and protocol roadmap outlined in Section [6](https://arxiv.org/html/2606.04781#S6).
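A toy runner shows what node-level attribution could look like in practice. AIP as described is read and followed by the agent rather than executed by a runtime, so this is purely an illustration: `run_graph`, `StepFailure`, and the stand-in step bodies are hypothetical, with plain Python functions playing the role of scripts.

```python
class StepFailure(Exception):
    """Wraps an error with the name of the step (node) that raised it."""

    def __init__(self, step_name, cause):
        super().__init__(f"step '{step_name}' failed: {cause}")
        self.step_name = step_name
        self.cause = cause


def run_graph(steps, values):
    """Run (name, body) pairs in order; each body reads and extends named values."""
    for name, body in steps:
        try:
            values.update(body(values))
        except Exception as exc:
            raise StepFailure(name, exc) from exc
    return values


def load(values):
    return {"lc-path": "lightcurve.csv"}


def broken_detect(values):
    raise ValueError("no output produced")  # stand-in for a faulty script


def fixed_detect(values):
    return {"detection": {"source": values["lc-path"]}}


# First run: the failure is attributed to a single named node.
try:
    run_graph([("load-and-inspect", load),
               ("run-detection-pipeline", broken_detect)], {})
    failed_step = None
except StepFailure as failure:
    failed_step = failure.step_name

# Repair only that node's body and re-run; the other step is untouched.
result = run_graph([("load-and-inspect", load),
                    ("run-detection-pipeline", fixed_detect)], {})
print(failed_step, result["detection"])
```

The repair replaces one step body and leaves the rest of the graph as it was, which is the property the paper relies on when it reports fixes with zero regressions.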

Before: human-authored prose (SKILL.md)

    # Exoplanet Detection Workflows
    ## Pipeline Design Principles
    ### Key Stages
    1. Data Loading: format, columns, time system
    2. Quality Control: filter via quality flags
    3. Preprocessing: remove noise, preserve signal
    4. Period Search: choose algorithm for signal
    5. Validation: verify candidate is real
    6. Refinement: improve period precision
    ### Critical Decisions
    Which period search algorithm?
    - TLS: best for transit-shaped dips (box-like)
    - Lomb-Scargle: any periodic signal, fast
    - BLS: alternative to TLS, in Astropy

After: compiled AIP procedure (SKILL.md, YAML)

    purpose: >
      Load, quality-filter, preprocess, search for a transit
      signal, validate vs SDE/SNR thresholds, refine period.
    steps:
      - name: load-and-inspect
        outputs:
          - {name: lc-path, type: string}
          - {name: column-map, type: object}
          - {name: flag-good-value, type: float}
      - name: choose-search-strategy    # judgment node
        description: >
          Default to TLS; load references/method-selection.md
          to justify TLS vs Lomb-Scargle vs BLS.
        one_of: [tls, lomb-scargle, bls]
        outputs: [{name: method, type: string}]
      - name: run-detection-pipeline    # script node
        script: scripts/detect_period.py
        inputs: [lc-path, column-map, flag-good-value]
        outputs: [{name: detection, type: object}]
      - name: validate-candidate        # script node
        script: scripts/validate_candidate.py
        inputs: [detection]
        outputs: [{name: verdict, type: object}]

On-disk skill package (before → after)

    # Before: human-curated (prose-only)
    exoplanet-workflows/
      SKILL.md             # prose guidance only

    # After: compiled AIP (exoplanet-workflows)
    exoplanet-workflows/
      SKILL.md             # AIP procedure (YAML above)
      scripts/             # deterministic step bodies
        detect_period.py
        period_range_guide.py
        validate_candidate.py
      references/          # prose for judgment steps
        method-selection.md
        troubleshooting.md
      source/              # provenance kept by compiler
        original-SKILL.md
        procedure.schema.json

Figure 3. Compiling a prose skill to AIP, for `exoplanet-workflows`. The free-form procedure (top left) becomes a schema-validated YAML graph of typed steps (top right): each step declares typed `inputs`/`outputs`, binds to a deterministic script via `script`, or cites a prose `reference` from its description for judgment. On disk (bottom), the prose-only human skill (`exoplanet-workflows`, one of five modules the task bundles, all shipping no code) expands into a package whose steps are backed by generated `scripts/` and `references/`, with the original prose and JSON schema preserved under `source/`; Figure [2](https://arxiv.org/html/2606.04781#S3.F2) is this same skill as a graph.

## 4. Evaluation

### 4.1. Experimental Setup

Benchmark. We build on SkillsBench (Li et al., [2026](https://arxiv.org/html/2606.04781#bib.bib1)), a containerized agent benchmark of 94 tasks across 8 domains, run through the BenchFlow SDK (BenchFlow, [2026](https://arxiv.org/html/2606.04781#bib.bib7)). Each task bundles a natural-language instruction, a sandboxed environment, and a programmatic verifier that scores the agent's output. SkillsBench is designed to measure how an agent's task success changes when it is given reusable skills, and ships three native conditions per task: 1) *noskill*, 2) one or more *human-curated* skills authored offline by a domain expert, and 3) one or more *self-generated* skills the agent writes for itself at trial time.

We extend SkillsBench with a multi-mode harness that adds two AIP conditions: 1) *aip-from-instruction*, where one or more AIP skills are authored by a Claude Code agent from the task instruction alone, and 2) *aip-from-curated*, where one or more AIP skills are authored by a Claude Code agent using the human-curated skill(s) as input. Both modes use the AIP meta-skill to compile the skills and both use Opus 4.7 as the language model.[^2] Following AIP's author-once, consume-many design, each AIP skill is authored a single time, committed, and mounted unchanged across all trials. The harness thus enables controlled head-to-head comparison of AIP skill *formats* and *authoring methods* under a single solver.

[^2]: The AIP protocol (its specification, schemas, and compiler meta-skill) is open source on GitHub at [https://github.com/zach-blumenfeld/aip](https://github.com/zach-blumenfeld/aip) (the results in this paper use tag v0.3a3). The AIP-SkillBench harness, run data, and analysis are available on GitHub at [https://github.com/zach-blumenfeld/aip-skillbench](https://github.com/zach-blumenfeld/aip-skillbench).

Solver. The agent harness is `claude-agent-acp`, driving the `claude-sonnet-4-6` model, with all executions isolated in a Docker sandbox and five independent trials per cell (one task under one condition).

Conditions. The primary comparison is between the *human-curated* and *aip-from-curated* modes. The harness modes *noskill* and *selfgen* are outside the scope of the main claim; SkillsBench (Li et al., [2026](https://arxiv.org/html/2606.04781#bib.bib1)) already found that *human-curated* outperforms both on average. *aip-from-instruction* was attempted in early trials, but its results are excluded from this report. In the v0.3a2 medium set, *aip-from-instruction* performed substantially worse than the human baseline (0/15 passing, vs. 11/15 for human-curated), with failures traced to incorrect methods in scripts committed into the authored skill: a distorting map projection, a parser that rejected valid inputs, a controller that produced no output. This reiterates SkillsBench's finding that self-authored skills provide no benefit on average, because models cannot reliably author the procedural knowledge they benefit from consuming (Li et al., [2026](https://arxiv.org/html/2606.04781#bib.bib1)), and that effective skills rest on expert curation (Bakal, [2026](https://arxiv.org/html/2606.04781#bib.bib20)). It also suggests that the AIP format, rather than the authoring model (Claude Opus 4.7), is the source of the uplift in *aip-from-curated*, though this does not fully isolate the two (see Limitations, Section [4.5](https://arxiv.org/html/2606.04781#S4.SS5)).

Task sample and strata. We evaluate on 27 tasks, listed in Table [1](https://arxiv.org/html/2606.04781#S4.T1). We characterize each along three axes:

- Difficulty (*easy*, *medium*, or *hard*): the task author's rating, taken directly from SkillsBench task metadata.
- Implementation class (*light* or *heavy*): a task is *heavy* when its declared type includes implementation, simulation, optimization, or control, i.e. it requires substantial code to be written or run; it is *light* otherwise (analysis, calculation, extraction, search, or detection).
- Structure class (*prose-only*, *mixed*, or *script-heavy*): a property of the *human-curated* skill, namely how much runnable code it already ships. A *prose-only* skill has no scripts (all natural language); a *mixed* skill has scripts but more prose than code (measured in LoC); a *script-heavy* skill has at least as much script as prose. This axis indexes how much room an AIP conversion has to add executability: most for prose-only skills, least for script-heavy ones.
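The two derived axes above are simple rules and can be written down directly. The sketch below assumes a task's declared types arrive as a set of lowercase strings and that script and prose sizes are already available as line counts; the function names and the sample numbers are illustrative, not taken from the harness.

```python
# Task types that make a task "heavy", as listed in the definition above.
HEAVY_TYPES = {"implementation", "simulation", "optimization", "control"}


def implementation_class(task_types: set) -> str:
    """'heavy' if any declared type requires substantial code, else 'light'."""
    return "heavy" if task_types & HEAVY_TYPES else "light"


def structure_class(script_loc: int, prose_loc: int) -> str:
    """Classify a human-curated skill by how much runnable code it ships."""
    if script_loc == 0:
        return "prose-only"
    # At least as much script as prose counts as script-heavy.
    return "script-heavy" if script_loc >= prose_loc else "mixed"


# Hypothetical line counts, for illustration only.
examples = {
    "no scripts": structure_class(script_loc=0, prose_loc=240),
    "some scripts": structure_class(script_loc=80, prose_loc=240),
    "mostly scripts": structure_class(script_loc=300, prose_loc=240),
}
print(examples)
print(implementation_class({"simulation", "analysis"}))
```

Note the boundary case: equal script and prose line counts fall under *script-heavy*, following the "at least as much script as prose" wording.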

The 24-task core is stratified across structure class $\times$ implementation class so the evaluation spans the gradient of expected AIP benefit rather than a single operating point, with difficulty and domain as secondary spread; all 24 use AIP v0.3a3. To this we add a three-task medium reference set (*eval-3med*, marked † in Table [1](https://arxiv.org/html/2606.04781#S4.T1)). We report the full 27-task sample as the headline result, with the caveat that those three were compiled against an earlier protocol version (v0.3a2) and randomly selected (from the pool of medium-difficulty tasks) rather than stratified. They come from an earlier round of experimentation. Results restricted to the 24-task v0.3a3 set are consistent in direction and significance (Section [4.5](https://arxiv.org/html/2606.04781#S4.SS5)).

Table 1\. The 27 evaluation tasks, grouped by the structure class of their human\-curated skill \(the axis indexing AIP’s room to add executability\)\. *Diff\.* is the SkillsBench difficulty rating; *Impl\.* is the implementation class \(*heavy* = the task type includes implementation, simulation, optimization, or control\); *Skills* is the number of curated skill modules the task bundles\. The 24\-task core is stratified across structure × implementation class; † marks the three supplementary *eval\-3med* tasks \(v0\.3a2 spec, not stratified\)\.

| Task | Domain | Diff\. | Impl\. | Skills | Description |
|---|---|---|---|---|---|
| *Prose\-only: human skill ships no scripts \(12 tasks; most AIP room\)* | | | | | |
| energy\-unit\-commitment | Industrial/Physical Systems | hard | heavy | 3 | Schedule generator unit commit for day\-ahead demand\. |
| mars\-clouds\-clustering | Natural Science | hard | heavy | 3 | Optimize unsupervised clustering of Mars cloud observations\. |
| adaptive\-cruise\-control | Industrial/Physical Systems | med\. | heavy | 5 | Implement and simulate an adaptive\-cruise\-control law\. |
| bike\-rebalance | Mathematics/Formal Reasoning | med\. | heavy | 4 | Plan the optimal overnight redistribution of shared bikes\. |
| drone\-planning\-control† | Industrial/Physical Systems | med\. | heavy | 6 | Generate drone trajectories and feedback control in simulation\. |
| parallel\-tfidf\-search | Software Engineering | med\. | heavy | 3 | Implement a parallelized TF\-IDF document search\. |
| fix\-build\-google\-auto | Software Engineering | easy | light | 3 | Repair build errors in a Java codebase so it compiles\. |
| offer\-letter\-generator | Office/White\-Collar | easy | light | 1 | Fill a \.docx offer\-letter template with a conditional block\. |
| enterprise\-information\-search | Office/White\-Collar | hard | light | 1 | Answer a retrieval query over enterprise documents\. |
| spring\-boot\-jakarta\-migration | Software Engineering | hard | light | 5 | Migrate a Spring Boot codebase to Jakarta EE namespaces\. |
| earthquake\-plate\-calculation† | Natural Science | med\. | light | 1 | Find the in\-plate quake farthest from the Pacific boundary\. |
| exoplanet\-detection\-period | Natural Science | med\. | light | 5 | Detect an exoplanet and compute its orbital period\. |
| *Mixed: some scripts, but prose dominates \(9 tasks\)* | | | | | |
| energy\-market\-pricing | Industrial/Physical Systems | hard | heavy | 4 | Clear an energy market and compute locational prices\. |
| grid\-dispatch\-operator | Industrial/Physical Systems | med\. | heavy | 3 | Compute an economic unit dispatch for a grid\. |
| jax\-computing\-basics | Software Engineering | med\. | heavy | 1 | Implement numerical routines in JAX\. |
| suricata\-custom\-exfil | Cybersecurity | med\. | heavy | 3 | Author a Suricata rule for a custom exfiltration pattern\. |
| court\-form\-filling | Office/White\-Collar | easy | light | 1 | Extract case data and fill a court form\. |
| powerlifting\-coef\-calc | Office/White\-Collar | easy | light | 3 | Compute powerlifting scoring coefficients\. |
| dapt\-intrusion\-detection | Cybersecurity | hard | light | 2 | Detect advanced\-persistent\-threat intrusion in PCAP traffic\. |
| crystallographic\-wyckoff\-pos\.† | Natural Science | med\. | light | 2 | Wyckoff position analysis from X\-ray CIF files\. |
| protein\-expression\-analysis | Natural Science | med\. | light | 1 | Analyze cancer cell\-line protein\-expression data\. |
| *Script\-heavy: skill is already executable and terse \(6 tasks; least AIP room\)* | | | | | |
| dialogue\-parser | Software Engineering | easy | heavy | 1 | Parse dialogue text into a structured format\. |
| civ6\-adjacency\-optimizer | Mathematics/Formal Reasoning | hard | heavy | 4 | Optimize district adjacency placement on a Civ VI map\. |
| data\-to\-d3 | Software Engineering | med\. | heavy | 1 | Build a D3\.js \(v6\) visualization of stock data\. |
| 3d\-scan\-calc | Industrial/Physical Systems | hard | light | 1 | Calculate the mass of a 3D\-printed part from its geometry\. |
| sec\-financial\-report | Finance/Economics | hard | light | 2 | Search SEC filings and analyze a financial report\. |
| travel\-planning | Mathematics/Formal Reasoning | med\. | light | 6 | Plan an itinerary under scheduling constraints\. |

Metrics\. The primary metric is mean task reward, which is robust to the ceiling and floor effects introduced by all\-or\-nothing verifiers\. Secondary metrics are pass rate, wall\-clock execution time, and tool\-call count\.

### 4\.2\. Experimental Results

Table [2](https://arxiv.org/html/2606.04781#S4.T2) reports the aggregate comparison between human\-curated skills and their AIP\-compiled counterparts\. Compiling to AIP raises mean task reward from 0\.599 to 0\.705 \($+0.106$\), a statistically significant gain under a Wilcoxon signed\-rank test \($p = 0.011$\), winning 12 tasks against 2 losses with 13 ties\. Pass rate rises in parallel, from 53\.3% to 67\.4%\. The 24\-task v0\.3a3 subset, which excludes the three supplementary *eval\-3med* tasks, is consistent in direction and significance \($+0.101$, $p = 0.022$\), confirming that the headline does not hinge on the supplementary set\. Figure [4](https://arxiv.org/html/2606.04781#S4.F4) shows the trial\-level outcomes behind these aggregates: the gains are concentrated in a subset of differentiating tasks, where compiling to AIP converts failing or timed\-out trials into passes, while the many tied tasks reflect mutual ceilings \(both formats pass all five trials\) or mutual floors \(both fail\)\.

Table 2\. Human\-curated vs\. AIP\-compiled skills on SkillsBench \(Claude Sonnet solver, 5 trials/task\)\. Reward is the primary metric; Δ is AIP − human\. The Wilcoxon signed\-rank test is computed on per\-task mean reward \(scipy \(Virtanen et al\., [2020](https://arxiv.org/html/2606.04781#bib.bib14)\) default, tied pairs dropped\)\. Wall\-clock and tool\-call deltas are descriptive: their aggregate differences are not statistically significant\.

| Metric | Human | AIP | Δ |
|---|---|---|---|
| *27\-task headline sample* | | | |
| Mean task reward | 0\.599 | 0\.705 | \+0\.106 |
| Pass rate | 53\.3% | 67\.4% | \+14\.1 pp |
| Mean wall\-clock \(s\) | 585 | 510 | −75† |
| Mean tool calls | 27\.7 | 25\.6 | −2\.1† |
| Win / tie / loss | | | 12 / 13 / 2 |
| Wilcoxon $p$ \(reward\) | | | **0\.011** |
| *24\-task v0\.3a3 subset \(robustness\)* | | | |
| Mean task reward | 0\.567 | 0\.668 | \+0\.101 |
| Pass rate | 50\.8% | 63\.3% | \+12\.5 pp |
| Win / tie / loss | | | 10 / 12 / 2 |
| Wilcoxon $p$ \(reward\) | | | **0\.022** |

† Not statistically significant \(per\-task Wilcoxon $p \approx 0.28$\); the mean reduction is driven by a few tasks\. Pass rate is the fraction of trials with a passing verifier; mean reward is preferred because several verifiers are all\-or\-nothing, producing the high tie count\.
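To make the paired test in Table 2 concrete, the following is a minimal standard\-library sketch of a two\-sided Wilcoxon signed\-rank test on per\-task mean rewards, with tied pairs dropped\. It swaps in brute\-force enumeration of sign patterns for the scipy routine the table cites, so its p\-values may differ from scipy's when scipy uses a normal approximation\. The reward lists are hypothetical and are not the paper's data\.

```python
from itertools import product


def win_tie_loss(human, aip):
    """Count tasks where AIP's mean reward is higher, equal, or lower."""
    wins = sum(a > h for h, a in zip(human, aip))
    ties = sum(a == h for h, a in zip(human, aip))
    return wins, ties, len(human) - wins - ties


def average_ranks(values):
    """Rank values ascending from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1  # mean of 1-based positions
        start = end + 1
    return ranks


def signed_rank_p(human, aip):
    """Two-sided exact p-value; pairs with zero difference are dropped."""
    diffs = [round(a - h, 9) for h, a in zip(human, aip)]
    diffs = [d for d in diffs if d != 0]
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null every sign pattern is equally likely; enumerate them all.
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed + 1e-9:
            hits += 1
    return hits / 2 ** len(ranks)


# Hypothetical per-task mean rewards for five tasks (illustration only).
human = [0.2, 0.4, 1.0, 0.0, 0.6]
aip = [0.8, 1.0, 1.0, 0.0, 0.4]
print(win_tie_loss(human, aip), signed_rank_p(human, aip))
```

Enumeration costs $2^n$ evaluations for $n$ non\-tied pairs, which is cheap at the 12–14 non\-tied pairs reported here but would not scale to much larger task sets\.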

![Refer to caption](https://arxiv.org/html/2606.04781v1/x1.png)

Figure 4\. Per\-trial outcomes for all 27 tasks under the two skill formats \(five trials per task; within each cell the markers are sorted pass, fail, timeout\), with the per\-task mean reward and mean wall\-clock \(seconds\) over those five trials shown to the right of each block\. Each row is a task; the left block is the human\-curated skill and the right block its AIP\-compiled counterpart\. A pass denotes a passing verifier \(reward $= 1$\); all timeouts are trials terminated at the harness wall\-clock cap or idle limit and score as reward 0, so the means are taken over all five trials\. The higher mean reward of the two formats is shown in bold\. Tasks are grouped by the AIP spec version their packs were compiled against and sorted alphabetically within each group: the 24\-task stratified cohort \(v0\.3a3\) and the three\-task supplementary reference set \(v0\.3a2\)\. Differentiating tasks \(e\.g\. dapt\-intrusion\-detection, mars\-clouds\-clustering, exoplanet\-detection\-period\) show failing or timed\-out human\-curated trials converted to passes after compilation; tied tasks are mutual ceilings or floors that no packaging could move\.

![Refer to caption](https://arxiv.org/html/2606.04781v1/x2.png)

Figure 5\. Per\-task wall\-clock change from compiling to AIP \(*aip* minus *human* mean over five trials, in seconds\) within each stratum; below the dashed line means AIP is faster\. Boxes show the interquartile range and median; dots are individual tasks\. The $y$\-axis is clipped to $\pm 600$ s, so mars\-clouds\-clustering \($-1305$ s\) is off\-scale and flagged with a marker in the hard, heavy, and prose\-only panels\.

### 4\.3\. Execution time

AIP skills also run faster on average: mean wall\-clock time falls from 585 s to 510 s, and AIP is the faster format on 16 of 27 tasks\. This aggregate reduction is *not* statistically significant, however \(Wilcoxon $p \approx 0.28$ two\-sided, and $p = 0.14$ under the directional hypothesis that AIP is faster\)\. We therefore treat wall\-clock and tool\-call counts as descriptive, not as significance claims\. The speedups concentrate on a minority of tasks where prose forced the solver to re\-derive code at run time \(e\.g\. dapt\-intrusion\-detection, 2/5 → 5/5 at ∼2\.2× faster; jax\-computing\-basics, ∼2\.3× faster\)\.
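The two p\-values quoted above differ by a factor of two\. As a sketch, assuming the two\-sided value is computed as twice the smaller tail probability \(the usual convention when the null distribution is symmetric, as it is for the signed\-rank statistic\) and that the observed effect lies in the hypothesized direction:

$$
p_{\text{one-sided}} = \tfrac{1}{2}\, p_{\text{two-sided}} \approx \tfrac{1}{2} \times 0.28 = 0.14
$$

Neither value approaches a conventional 0\.05 threshold, which is why the wall\-clock difference is reported as descriptive only\.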

Figure [5](https://arxiv.org/html/2606.04781#S4.F5) breaks the per\-task wall\-clock change down by stratum\. The clearest pattern is along `structure_class`: prose\-only skills, where AIP has room to add executable structure, tend to speed up, whereas already\-terse script\-heavy skills do not\. This mirrors the reward analysis, in which no single structural axis reaches significance\.

Four tasks sit far outside the bulk of the distribution and are individually instructive; together they show that a wall\-clock change is meaningful only when read alongside the reward outcome\.

mars\-clouds\-clustering \($-1305$ s, a genuine win\)\. The human skill is prose\-only \(309 lines of reference text, no scripts\) for a task that requires an 847\-combination grid search over a clustering pipeline, scored by an all\-or\-nothing verifier\. The human\-curated solver re\-derives the full pipeline on every trial and passes only 2 of 5: two completed runs produce a wrong result on an exact\-specification detail, and a third computes the correct answer but is killed after idling at the harness time limit\. The AIP conversion ships the vetted procedure and passes 5 of 5 while running roughly 30% faster at equal load\. The headline $-1305$ s slightly overstates the speedup, as two AIP trials ran under lighter concurrency\.

drone\-planning\-control \($-546$ s, a genuine win\)\. Also prose\-only \(463 lines, no scripts\)\. The human solver re\-derives the trajectory\-and\-control stack from prose and passes 2 of 5, with one run stalling at the idle limit and two earning only partial credit; the AIP version passes 5 of 5 with markedly lower and tighter wall\-clock\.

fix\-build\-google\-auto \($-430$ s, not a real speedup\)\. A build\-repair task that both formats essentially fail \(mean reward $0.20$ each\): both arms thrash through 90–140 tool calls and exhaust the execution budget on multiple trials\. AIP’s lower mean wall\-clock is an artifact of its unsuccessful trials terminating earlier, not a genuine efficiency gain; here the skill is not the bottleneck\.

energy\-market\-pricing \($+511$ s, slower but more accurate\)\. The lone task where AIP is markedly slower\. The conversion added a heavier computational path that lands more correct results \(4 of 5 vs\. 3 of 5\) but at roughly twice the tool calls and wall\-clock\. AIP’s executability lever buys reliability at a compute cost that is not uniform across tasks\.
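The reading rule behind these four cases \(interpret a wall\-clock change only together with the score change\) can be sketched as a small helper\. The function name and labels are assumptions; the deltas and outcomes are the ones quoted above, with scores given as pass fractions except for fix\-build\-google\-auto, where the stated mean reward is used\.

```python
def read_wallclock_change(delta_s, score_human, score_aip):
    """Label a wall-clock change (AIP minus human, seconds) with the score change."""
    if delta_s < 0:
        speed = "faster"
    elif delta_s > 0:
        speed = "slower"
    else:
        speed = "same speed"
    if score_aip > score_human:
        accuracy = "more accurate"
    elif score_aip < score_human:
        accuracy = "less accurate"
    else:
        accuracy = "no score change"
    return f"{speed}, {accuracy}"


# (wall-clock delta in s, human score, AIP score) as quoted in the text.
OUTLIERS = {
    "mars-clouds-clustering": (-1305, 2 / 5, 5 / 5),
    "drone-planning-control": (-546, 2 / 5, 5 / 5),
    "fix-build-google-auto": (-430, 0.20, 0.20),   # mean reward, not pass fraction
    "energy-market-pricing": (511, 3 / 5, 4 / 5),
}

for task, row in OUTLIERS.items():
    print(task, "->", read_wallclock_change(*row))
```

Only the "faster, more accurate" rows count as efficiency wins; a faster run with an unchanged, low score is the early\-termination artifact described for fix\-build\-google\-auto\.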

### 4\.4\. Skill improvement

The results above establish that AIP skills are more *executable*\. They are also more *improvable*, and for the same reason: because an AIP skill is a graph of named, typed, schema\-validated nodes, each backed by a script that can be run and tested in isolation, a failure can be localized to a specific node and repaired there, rather than by rewriting prose and hoping\. We observed this loop directly while iterating the protocol from v0\.3a2 to v0\.3a3\. Two skills that the compiler had authored with latent defects were diagnosed at the script level by an agent \(Claude Code\), corrected by a change to the AIP meta\-skill, recompiled, and re\-evaluated\.

offer\-letter\-generator: 0/5 → 5/5\. The compiled skill contained a frozen conditional\-key bug \(a template lookup keyed on `RELOCATION` rather than `RELOCATION_PACKAGE`\), so every trial failed in the same way\. The defect was localized to a single node and fixed by a specification change \(a functional\-test correctness check, plus a key\-suffix fallback and a default\-keep rule\); after recompilation the skill passed all five trials\.
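The text names the repair rules but not their implementation\. The sketch below is one plausible reading, offered only as an illustration: a key\-suffix fallback that accepts a single key extending the requested one, and a default\-keep rule that retains a conditional block when its key cannot be resolved\. Both function names and the exact matching rule are assumptions\.

```python
_MISSING = object()  # sentinel: distinguishes "no such key" from a stored None


def resolve_condition(key, values):
    """Look up a conditional key, with an assumed key-suffix fallback.

    If `key` is absent, accept exactly one key that extends it with an
    underscore suffix (e.g. RELOCATION -> RELOCATION_PACKAGE).
    """
    if key in values:
        return values[key]
    candidates = [k for k in values if k.startswith(key + "_")]
    if len(candidates) == 1:
        return values[candidates[0]]
    return _MISSING


def keep_block(key, values):
    """Decide whether a conditional template block is kept.

    Assumed default-keep rule: an unresolvable key keeps the block.
    """
    value = resolve_condition(key, values)
    if value is _MISSING:
        return True
    return bool(value)


# The lookup from the text: keyed on RELOCATION, data stored as RELOCATION_PACKAGE.
print(keep_block("RELOCATION", {"RELOCATION_PACKAGE": False}))
```

Under this reading, the fallback repairs the mismatched lookup, and the default\-keep rule ensures that an unresolved key no longer silently drops template content\.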

bike\-rebalance: 0\.40 → 0\.60, timeouts 3 → 1\. The compiler had authored an over\-engineered routing script \(roughly 1,146 lines\) heavy enough to exhaust the agent’s time budget on three of five trials\. A specification change favoring lean scripts produced a smaller routine that fit the budget, cutting timeouts and lifting reward\.

Both repairs were verified to cause *zero regressions* on the remaining tasks: the node\-level edit fixed the target skill without disturbing the rest of the corpus\. We executed these fixes by editing the AIP meta\-skill and recompiling, but the loop is not bound to the meta\-skill\. The diagnosis was already agent\-driven, and because each skill is a bounded, typed, addressable artifact rather than prose, the edit itself is a constrained, checkable action an agent can take directly on the skill, unlike the open\-ended language editing that agents do poorly\. A team adopting AIP can therefore hand more skill maintenance to its own agents\. Diagnosis, specification edit, recompilation, and re\-evaluation thus form a closed feedback step with a measurable reward signal, which we argue in Section [6\.2](https://arxiv.org/html/2606.04781#S6.SS2) is the natural substrate for reinforcement learning over skills\.

### 4\.5\. Limitations

The evaluation, while yielding a statistically significant result, carries several important caveats that qualify the strength of our claims\.

Format–author confound\. The evaluation measures a compile\-then\-run pipeline: an agent compiles a human\-written skill to AIP and then executes it\. A performance gain could therefore reflect the improved graph representation, or it could reflect improvements made to the underlying scripts by the compiler agent; these two mechanisms are not yet separated\. The highest\-value missing experiment is a control arm in which the same converted scripts are delivered as plain Markdown rather than as an AIP graph, isolating the contribution of structure from the contribution of script quality on a subset of tasks\.

Statistical power\. Trials are limited to $n = 5$ per cell, and the Wilcoxon signed\-rank test operates on per\-task means over 12–14 non\-tied pairs\. The result is significant but preliminary; broader claims should await a wider trial budget and a larger task set\.

Verifier characteristics\. Several tasks use all\-or\-nothing verifiers, which inflates per\-task variance and produces the high number of ties \(12–13\)\. A finer\-grained reward signal would provide more discriminative power and reduce ceiling and floor effects\.

Subset caveats\. Three of the 27 headline tasks \(the *eval\-3med* reference set\) were compiled against an earlier protocol version \(v0\.3a2 rather than v0\.3a3\) and were randomly sampled from medium\-difficulty tasks rather than drawn from the stratified sampling procedure\. We therefore report both the full 27\-task figure and the 24\-task v0\.3a3 subset, and the two are consistent in direction and significance: $+0.106$ mean reward at $p = 0.011$ over 27 tasks, versus $+0.101$ at $p = 0.022$ over the 24\-task stratified set\. The headline result thus does not depend on the supplementary tasks\.

Budget\-capped tasks\. Two tasks \(energy\-unit\-commitment and civ6\-adjacency\-optimizer\) hit the execution budget without producing informative results\. They carry no information about the comparison, yet they still count as ties in the aggregate\. These should be re\-budgeted or dropped in future iterations\.

Single model\. All experiments use Claude Sonnet as the solver\. Two replications are needed\. First, a weaker model \(e\.g\., Claude Haiku\) would test the hypothesis that graph structure provides proportionally greater benefit to less capable models\. Second, models from other vendors \(OpenAI’s GPT and Google’s Gemini\) and open\-weight families such as Llama, Qwen, DeepSeek, and Mistral are needed to establish that the gains hold across LLMs rather than being specific to one model family\.

## 5\. Related Work

Agent skills and their representations\. ReAct \(Yao et al\., [2023](https://arxiv.org/html/2606.04781#bib.bib9)\) interleaves language\-model reasoning with tool actions; Voyager \(Wang et al\., [2023](https://arxiv.org/html/2606.04781#bib.bib10)\) accumulates learned behaviors as code snippets indexed by prose; and Toolformer \(Schick et al\., [2023](https://arxiv.org/html/2606.04781#bib.bib34)\) teaches models to invoke external APIs mid\-generation\. Anthropic’s Agent Skills \(Anthropic, [2025](https://arxiv.org/html/2606.04781#bib.bib18); Agent Skills, [2025](https://arxiv.org/html/2606.04781#bib.bib17)\) standardize the packaging of such procedural knowledge, which a growing literature treats as institutional or expert knowledge to be transferred to agents \(Bakal, [2026](https://arxiv.org/html/2606.04781#bib.bib20)\) and surveys along axes of architecture and acquisition \(Xu and Yan, [2026](https://arxiv.org/html/2606.04781#bib.bib24)\)\. Closest to our work, SSL \(Liang et al\., [2026](https://arxiv.org/html/2606.04781#bib.bib25)\) also argues for structuring skill artifacts, but targets skill *discovery and assessment* rather than execution\. These approaches package procedural knowledge differently: as prose \(Agent Skills\), as code retrieved by prose descriptions \(Voyager\), or as structure aimed at skill discovery \(SSL\)\. None gives the skill a typed, schema\-validated graph of scripted and prose steps, nor measures its effect on task execution; that is what AIP contributes\.

Structured execution and workflow graphs\. PAL \(Gao et al\., [2023](https://arxiv.org/html/2606.04781#bib.bib11)\) and Program of Thoughts \(Chen et al\., [2023](https://arxiv.org/html/2606.04781#bib.bib12)\) offload deterministic computation to code while reserving language\-model reasoning for the rest, and chain\-of\-thought prompting \(Wei et al\., [2022](https://arxiv.org/html/2606.04781#bib.bib33)\) externalizes reasoning as explicit intermediate steps\. At the system level, agent frameworks such as LangGraph and Google’s ADK \(LangChain, [2025](https://arxiv.org/html/2606.04781#bib.bib2); Google, [2025](https://arxiv.org/html/2606.04781#bib.bib3)\) represent agent behavior as graphs of steps, DSPy \(Khattab et al\., [2023](https://arxiv.org/html/2606.04781#bib.bib5)\) compiles declarative pipelines instead of hand\-tuning prompts, and predefined workflow structure is both recommended \(Schluntz and Zhang, [2024](https://arxiv.org/html/2606.04781#bib.bib4)\) and benchmarked \(Xiao et al\., [2024](https://arxiv.org/html/2606.04781#bib.bib6)\) as a route to reliability\. AIP brings this graph view to the skill specification itself: scripted nodes carry deterministic work and prose nodes carry judgment, connected by typed input/output edges\.

Benchmarking agentic systems\. AgentBench \(Liu et al\., [2025](https://arxiv.org/html/2606.04781#bib.bib35)\) and SWE\-bench \(Jimenez et al\., [2024](https://arxiv.org/html/2606.04781#bib.bib36)\) evaluate agents on real\-world tasks with programmatic verifiers\. SkillsBench \(Li et al\., [2026](https://arxiv.org/html/2606.04781#bib.bib1)\), run through the BenchFlow SDK \(BenchFlow, [2026](https://arxiv.org/html/2606.04781#bib.bib7)\), instead isolates the incremental effect of a reusable skill by scoring each task with no skill, a curated skill, and a self\-generated skill\. We extend SkillsBench with AIP conditions to compare skill *formats* under a single solver\.

Editing, self\-improvement, and learning over skills\. Editing a skill is unreliable for the reasons set out earlier: the content is unfamiliar and free\-form prose offers no bounded surface to edit against, so model edits skew additive and intrinsic self\-revision without feedback rarely helps \(Sclar et al\., [2024](https://arxiv.org/html/2606.04781#bib.bib19); Singhal et al\., [2024](https://arxiv.org/html/2606.04781#bib.bib26); Huang et al\., [2024](https://arxiv.org/html/2606.04781#bib.bib30)\)\. A parallel line pursues agents that improve themselves \(Gao et al\., [2025](https://arxiv.org/html/2606.04781#bib.bib21); Zweiger et al\., [2025](https://arxiv.org/html/2606.04781#bib.bib22); Robeyns et al\., [2025](https://arxiv.org/html/2606.04781#bib.bib23)\), building on reinforcement learning from a reward signal \(Ouyang et al\., [2022](https://arxiv.org/html/2606.04781#bib.bib13); Sutton and Barto, [2018](https://arxiv.org/html/2606.04781#bib.bib15)\)\. AIP’s typed, schema\-validated graph turns a skill edit into a bounded, checkable action and supplies the reward\-bearing feedback loop these methods need at the level of an individual skill\.

Governing skill corpora\. As agentic systems scale, auditability and accountability become first\-order concerns \(Saini, [2026](https://arxiv.org/html/2606.04781#bib.bib32)\)\. Because an AIP skill is a typed graph, a corpus of skills can be projected into a graph database \(Robinson et al\., [2015](https://arxiv.org/html/2606.04781#bib.bib16)\) and queried, for example for skills missing an approval step, shared sub\-procedures, or reusable templates\. This moves governance from manual documentation review to structured query\.

## 6\. Discussion and Future Work

### 6\.1\. From per\-skill graphs to corpus governance

The same typed graph that makes a skill’s deterministic steps runnable also makes the skill queryable\. A library of AIP skills can be projected into a graph database \(Robinson et al\., [2015](https://arxiv.org/html/2606.04781#bib.bib16)\), enabling audits such as identifying skills that lack an approval or validation step, discovering skills that share a common sub\-procedure, and composing skills from reusable node templates\. This moves skill governance from a manual documentation review to a structured query over a typed graph\. An open question is whether providing an agent with query access to a skill corpus, rather than delivering a skill in\-context as YAML, further improves task performance; we identify a controlled A/B experiment on this as a high\-value next step\.
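The audits listed above reduce to simple queries once skills are typed graphs\. The sketch below runs two of them over an in\-memory Python dictionary rather than a graph database; the corpus, the node kinds, and the template names are all hypothetical and do not reflect the AIP schema\.

```python
from collections import defaultdict

# Hypothetical corpus: skill -> list of (node id, node kind, template name).
CORPUS = {
    "offer-letter": [
        ("load", "script", "read-docx"),
        ("fill", "script", "fill-template"),
        ("sign-off", "approval", "human-approval"),
    ],
    "court-form": [
        ("load", "script", "read-docx"),
        ("extract", "prose", "extract-fields"),
    ],
    "travel-plan": [
        ("search", "script", "query-api"),
        ("check", "validation", "constraint-check"),
    ],
}


def skills_missing(corpus, kinds):
    """Skills that contain no node whose kind is in `kinds`."""
    return sorted(
        skill
        for skill, nodes in corpus.items()
        if not any(kind in kinds for _, kind, _ in nodes)
    )


def shared_templates(corpus):
    """Templates used by more than one skill, with the skills that use them."""
    users = defaultdict(set)
    for skill, nodes in corpus.items():
        for _, _, template in nodes:
            users[template].add(skill)
    return {t: sorted(s) for t, s in users.items() if len(s) > 1}


print(skills_missing(CORPUS, {"approval", "validation"}))
print(shared_templates(CORPUS))
```

A graph database would express the same audits as declarative queries over nodes and edges; the point of the sketch is only that typed nodes make such questions mechanical\.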

### 6\.2\. Reinforcement learning over the skill graph

Left unconstrained, autonomous skill writing tends to accrete prose and code without pressure toward compression or toward the right boundary between scripted and natural\-language nodes\. The AIP execution graph provides a bounded, typed, validity\-gated action space: edits are changes to nodes, scripts, or edges, each of which can be validated against the schema and evaluated against a reward signal\. The manual revision cycle from v0\.3a2 to v0\.3a3, in which failures were diagnosed at the node level, corrections were made to the specification, and the updated skill was re\-evaluated, is a manual policy step in this framework\. Automating the edit–evaluate–edit loop is reinforcement learning over skills \(Sutton and Barto, [2018](https://arxiv.org/html/2606.04781#bib.bib15); Ouyang et al\., [2022](https://arxiv.org/html/2606.04781#bib.bib13)\); the node\-level repair results of Section [4](https://arxiv.org/html/2606.04781#S4) serve as proof\-of\-concept that the feedback signal is both localizable and actionable\. Formalizing this loop is the central agenda for future work\.
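The shape of such a loop can be sketched with a toy example\. This is greedy hill climbing, not a full reinforcement\-learning algorithm, and everything in it is an assumption made for illustration: the skill is reduced to script line counts per node, `is_valid` stands in for schema validation, and the reward function is invented\. Only the 1,146\-line starting size echoes the bike\-rebalance script described in Section 4\.4\.

```python
import random


def propose_edit(skill, rng):
    """Hypothetical edit: rescale the script length of one randomly chosen node."""
    node = rng.choice(sorted(skill))
    factor = rng.choice((0.5, 1.5))
    edited = dict(skill)  # copy, so the input skill is never mutated
    edited[node] = int(skill[node] * factor)
    return edited


def is_valid(skill):
    """Stand-in for schema validation: every node keeps a non-empty script."""
    return all(lines >= 1 for lines in skill.values())


def evaluate(skill, budget=400):
    """Toy reward in (0, 1]: skills that fit the line budget score 1.0."""
    return min(1.0, budget / sum(skill.values()))


def improve(skill, steps=30, seed=0):
    """Greedy loop: keep an edit only if it is valid and raises the reward."""
    rng = random.Random(seed)
    best, best_reward = skill, evaluate(skill)
    for _ in range(steps):
        candidate = propose_edit(best, rng)
        if not is_valid(candidate):
            continue  # validity gate: invalid edits are never evaluated
        reward = evaluate(candidate)
        if reward > best_reward:
            best, best_reward = candidate, reward
    return best, best_reward


start = {"route": 1146, "report": 40}  # node -> script lines (toy values)
final, final_reward = improve(start)
print(final, round(final_reward, 3))
```

The validity gate is what bounds the action space: an edit that fails validation costs nothing to reject, so evaluation budget is spent only on well\-formed candidates\.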

### 6\.3\. From specification to protocol

Today, AIP is closer to a specification than a protocol: the agent loads the entire YAML graph into its context window and follows the traversal logic through its own reasoning, with nothing enforcing adherence to the graph topology\. This keeps AIP compatible with current agent\-skill formats and let us benchmark it immediately, but it leaves headroom\. Because traversal is unenforced, reliable execution still rests on the agent’s own discipline—a ceiling that an enforced protocol could raise, especially for smaller models\. Loading the full specification into context on every run also adds a token burden that raises cost and latency and can erode performance through context rot\. Since AIP already defines a typed action surface, a full protocol for walking the graph is feasible—one that executes nodes through controlled local or remote calls rather than in\-context reasoning, while remaining backward\-compatible with the agent\-skill specification\. We consider this the most important consideration for future work\.
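What enforced traversal could look like is sketched below: a small runner that executes nodes in dependency order and type\-checks every value crossing an edge, instead of relying on the agent to follow the graph\. The `Node` type, the edge tuple layout, and `execute` are assumptions for illustration and are not part of the AIP specification; the sketch uses Python's standard `graphlib` module for ordering\.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable


@dataclass
class Node:
    run: Callable            # scripted step; a prose step would call the model instead
    output_type: type
    input_types: dict = field(default_factory=dict)  # argument name -> expected type


def execute(nodes, edges):
    """Run nodes in dependency order, type-checking every edge value.

    `edges` is a list of (source node, destination node, argument name).
    """
    deps = {name: set() for name in nodes}
    for src, dst, _ in edges:
        deps[dst].add(src)
    results = {}
    for name in TopologicalSorter(deps).static_order():
        node = nodes[name]
        kwargs = {arg: results[src] for src, dst, arg in edges if dst == name}
        for arg, expected in node.input_types.items():
            if not isinstance(kwargs.get(arg), expected):
                raise TypeError(f"{name}.{arg}: expected {expected.__name__}")
        out = node.run(**kwargs)
        if not isinstance(out, node.output_type):
            raise TypeError(f"{name}: output is not {node.output_type.__name__}")
        results[name] = out
    return results


nodes = {
    "load": Node(run=lambda: [3, 1, 2], output_type=list),
    "sort": Node(run=lambda items: sorted(items), output_type=list,
                 input_types={"items": list}),
    "summary": Node(run=lambda items: f"min={items[0]}", output_type=str,
                    input_types={"items": list}),
}
edges = [("load", "sort", "items"), ("sort", "summary", "items")]
print(execute(nodes, edges)["summary"])
```

Because ordering and type checks live in the runner, a node that returns the wrong type fails immediately at its own boundary, and the graph never needs to be loaded into the model's context to be followed\.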

## 7\. Conclusion

Agent skills written largely as free\-form prose leave the agent to re\-derive, in every session, procedure that could have been captured with structure and code, which costs reliability and consistency on implementation\-heavy tasks\. They also resist improvement, since editing prose is something both humans and agents do poorly, for the distinct reasons set out in Section [1](https://arxiv.org/html/2606.04781#S1)\. AIP compiles a human\-written skill into a directed execution graph: scripted where computation is deterministic, natural language where judgment is needed, connected by typed input/output edges and schema\-validated throughout\. Experts still author the skill; AIP makes its deterministic steps runnable and every step addressable\.

The result is a representation that is more executable and improvable\. Compiling human\-written skills to AIP yields a statistically significant gain in task reward on SkillsBench, by handing the agent vetted, runnable units and a fixed procedure rather than asking it to re\-plan from prose\. And because each skill is a graph of named, typed, testable nodes, a failure localizes to a single node and is repaired by a checkable edit—a loop we ran by hand here, but one an adopter’s agent can run directly\. The same structure makes a skill corpus queryable for governance and a bounded, typed action space for reinforcement learning over skills\.

AIP today is a specification an agent reads and follows; enforcing that traversal as a runtime protocol is a potential next step\. But the core lesson already holds: a skill that is a graph is easier to execute, easier to diagnose when it fails, and easier to improve once diagnosed\. The graph is not incidental to these properties—it is their common cause, and the foundation for skill libraries that can be reliably executed, governed, and learned\.

## References

- Agent Skills \(2025\) Agent Skills\. [Link](https://agentskills.io/specification)
- Anthropic \(2025\) Anthropic\. Note: Agent Skills open standard and reference SDK at [https://agentskills\.io](https://agentskills.io/)\. [Link](https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills)
- G\. Bakal \(2026\) Knowledge activation: AI skills as the institutional knowledge primitive for agentic software development\. arXiv:2603\.14805\. [Link](https://arxiv.org/abs/2603.14805)
- BenchFlow \(2026\) BenchFlow\. [Link](https://docs.benchflow.ai/introduction)
- W\. Chen, X\. Ma, X\. Wang, and W\. W\. Cohen \(2023\) Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks\. arXiv:2211\.12588\. [Link](https://arxiv.org/abs/2211.12588)
- H\. Gao, J\. Geng, W\. Hua, M\. Hu, X\. Juan, H\. Liu, S\. Liu, J\. Qiu, X\. Qi, Y\. Wu, H\. Wang, H\. Xiao, Y\. Zhou, S\. Zhang, J\. Zhang, J\. Xiang, Y\. Fang, Q\. Zhao, D\. Liu, Q\. Ren, C\. Qian, Z\. Wang, M\. Hu, H\. Wang, Q\. Wu, H\. Ji, and M\. Wang \(2025\) A survey of self\-evolving agents: what, when, how, and where to evolve on the path to artificial super intelligence\. arXiv:2507\.21046\. [Link](https://arxiv.org/abs/2507.21046)
- L\. Gao, A\. Madaan, S\. Zhou, U\. Alon, P\. Liu, Y\. Yang, J\. Callan, and G\. Neubig \(2023\) PAL: program\-aided language models\. arXiv:2211\.10435\. [Link](https://arxiv.org/abs/2211.10435)
- Google \(2025\) Google\. Note: Open\-source framework with sequential, parallel, and loop workflow agents\. [Link](https://adk.dev/)
- J\. Huang, X\. Chen, S\. Mishra, H\. S\. Zheng, A\. W\. Yu, X\. Song, and D\. Zhou \(2024\) Large language models cannot self\-correct reasoning yet\. Note: ICLR 2024\. arXiv:2310\.01798\. [Link](https://arxiv.org/abs/2310.01798)
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan \(2024\) SWE\-bench: can language models resolve real\-world GitHub issues? arXiv:2310\.06770\. [Link](https://arxiv.org/abs/2310.06770)
- O\. Khattab, A\. Singhvi, P\. Maheshwari, Z\. Zhang, K\. Santhanam, S\. Vardhamanan, S\. Haq, A\. Sharma, T\. T\. Joshi, H\. Moazam, H\. Miller, M\. Zaharia, and C\. Potts \(2023\) DSPy: compiling declarative language model calls into self\-improving pipelines\. arXiv:2310\.03714\. [Link](https://arxiv.org/abs/2310.03714)
- LangChain \(2025\) LangChain, Inc\. Note: Orchestration framework modeling stateful agent workflows as directed graphs of nodes and edges\. [Link](https://docs.langchain.com/langgraph)
- X\. Li, W\. Chen, Y\. Liu, S\. Zheng, X\. Chen, Y\. He, Y\. Li, B\. You, H\. Shen, J\. Sun, S\. Wang, B\. Li, Q\. Zeng, D\. Wang, X\. Zhao, Y\. Wang, R\. B\. Chaim, Z\. Di, Y\. Gao, J\. He, Y\. He, L\. Jing, L\. Kong, X\. Lan, J\. Li, S\. Li, Y\. Li, Y\. Lin, X\. Liu, X\. Liu, H\. Lyu, Z\. Ma, B\. Wang, R\. Wang, T\. Wang, W\. Ye, Y\. Zhang, H\. Xing, Y\. Xue, S\. Dillmann, and H\. Lee \(2026\) SkillsBench: benchmarking how well agent skills work across diverse tasks\. arXiv:2602\.12670\. [Link](https://arxiv.org/abs/2602.12670)
- Q\. Liang, H\. Wang, Z\. Liang, and Y\. Liu \(2026\) From skill text to skill structure: the scheduling\-structural\-logical representation for agent skills\. arXiv:2604\.24026\. [Link](https://arxiv.org/abs/2604.24026)
- S\. A\. Licorish, A\. Bajpai, C\. Arora, F\. Wang, and K\. Tantithamthavorn \(2025\) Comparing human and LLM generated code: the jury is still out\! arXiv:2501\.16857\. [Link](https://arxiv.org/abs/2501.16857)
- X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang, S\. Zhang, X\. Deng, A\. Zeng, Z\. Du, C\. Zhang, S\. Shen, T\. Zhang, Y\. Su, H\. Sun, M\. Huang, Y\. Dong, and J\. Tang \(2025\) AgentBench: evaluating LLMs as agents\. arXiv:2308\.03688\. [Link](https://arxiv.org/abs/2308.03688)
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe \(2022\) Training language models to follow instructions with human feedback\. arXiv:2203\.02155\. [Link](https://arxiv.org/abs/2203.02155)
- M\. Robeyns, M\. Szummer, and L\. Aitchison \(2025\) A self\-improving coding agent\. arXiv:2504\.15228\. [Link](https://arxiv.org/abs/2504.15228)
- I\. Robinson, J\. Webber, and E\. Eifrem \(2015\) Graph databases: new opportunities for connected data\. 2nd edition, O’Reilly Media\. ISBN 978\-1491930892\.
- S\. Saini \(2026\) Governing the agentic enterprise: a new operating model for autonomous AI at scale\. California Management Review\. Note: Published online March 20, 2026\. [Link](https://cmr.berkeley.edu/2026/03/governing-the-agentic-enterprise-a-new-operating-model-for-autonomous-ai-at-scale/)
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\) Toolformer: language models can teach themselves to use tools\. arXiv:2302\.04761\. [Link](https://arxiv.org/abs/2302.04761)
- E\. Schluntz and B\. Zhang \(2024\)Anthropic\.External Links:[Link](https://www.anthropic.com/engineering/building-effective-agents)Cited by:[item 1](https://arxiv.org/html/2606.04781#S1.I1.i1.p1.1),[§2](https://arxiv.org/html/2606.04781#S2.p3.1),[§5](https://arxiv.org/html/2606.04781#S5.p2.1)\.
- M\. Sclar, Y\. Choi, Y\. Tsvetkov, and A\. Suhr \(2024\)Quantifying language models’ sensitivity to spurious features in prompt design or: how i learned to start worrying about prompt formatting\.Note:ICLR 2024External Links:2310\.11324,[Link](https://arxiv.org/abs/2310.11324)Cited by:[item 2](https://arxiv.org/html/2606.04781#S1.I1.i2.p1.1),[§5](https://arxiv.org/html/2606.04781#S5.p4.1)\.
- P\. Singhal, T\. Goyal, J\. Xu, and G\. Durrett \(2024\)A long way to go: investigating length correlations in rlhf\.Note:COLM 2024External Links:2310\.03716,[Link](https://arxiv.org/abs/2310.03716)Cited by:[item 3](https://arxiv.org/html/2606.04781#S1.I1.i3.p1.1),[§5](https://arxiv.org/html/2606.04781#S5.p4.1)\.
- R\. S\. Sutton and A\. G\. Barto \(2018\)Reinforcement learning: an introduction\.2 edition,MIT Press\.External Links:[Link](http://incompleteideas.net/book/the-book-2nd.html)Cited by:[§1](https://arxiv.org/html/2606.04781#S1.p5.1),[§5](https://arxiv.org/html/2606.04781#S5.p4.1),[§6\.2](https://arxiv.org/html/2606.04781#S6.SS2.p1.1)\.
- P\. Virtanen, R\. Gommers, T\. E\. Oliphant, M\. Haberland, T\. Reddy, D\. Cournapeau, E\. Burovski, P\. Peterson, W\. Weckesser, J\. Bright, S\. J\. van der Walt, M\. Brett, J\. Wilson, K\. J\. Millman, N\. Mayorov, A\. R\. J\. Nelson, E\. Jones, R\. Kern, E\. Larson, C\.J\. Carey, I\. Polat, Y\. Feng, E\. W\. Moore, J\. VanderPlas, D\. Laxalde, J\. Perktold, R\. Cimrman, I\. Henriksen, E\. A\. Quintero, C\. R\. Harris, A\. M\. Archibald, A\. H\. Ribeiro, F\. Pedregosa, P\. van Mulbregt, and SciPy 1\.0 Contributors \(2020\)SciPy 1\.0: fundamental algorithms for scientific computing in python\.Nature Methods17,pp\. 261–272\.External Links:[Document](https://dx.doi.org/10.1038/s41592-019-0686-2)Cited by:[Table 2](https://arxiv.org/html/2606.04781#S4.T2)\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2023\)Voyager: an open\-ended embodied agent with large language models\.External Links:2305\.16291,[Link](https://arxiv.org/abs/2305.16291)Cited by:[§2](https://arxiv.org/html/2606.04781#S2.p1.1),[§5](https://arxiv.org/html/2606.04781#S5.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. V\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 24824–24837\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html)Cited by:[§5](https://arxiv.org/html/2606.04781#S5.p2.1)\.
- R\. Xiao, W\. Ma, K\. Wang, Y\. Wu, J\. Zhao, H\. Wang, F\. Huang, and Y\. Li \(2024\)FlowBench: revisiting and benchmarking workflow\-guided planning for llm\-based agents\.External Links:2406\.14884,[Link](https://arxiv.org/abs/2406.14884)Cited by:[§2](https://arxiv.org/html/2606.04781#S2.p3.1),[§5](https://arxiv.org/html/2606.04781#S5.p2.1)\.
- R\. Xu and Y\. Yan \(2026\)Agent skills for large language models: architecture, acquisition, security, and the path forward\.External Links:2602\.12430,[Link](https://arxiv.org/abs/2602.12430)Cited by:[item 3](https://arxiv.org/html/2606.04781#S1.I1.i3.p1.1),[§5](https://arxiv.org/html/2606.04781#S5.p1.1)\.
- W\. Xu, G\. Zhu, X\. Zhao, L\. Pan, L\. Li, and W\. Y\. Wang \(2024\)Pride and prejudice: llm amplifies self\-bias in self\-refinement\.External Links:2402\.11436,[Link](https://arxiv.org/abs/2402.11436)Cited by:[item 3](https://arxiv.org/html/2606.04781#S1.I1.i3.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.External Links:2210\.03629,[Link](https://arxiv.org/abs/2210.03629)Cited by:[§2](https://arxiv.org/html/2606.04781#S2.p1.1),[§5](https://arxiv.org/html/2606.04781#S5.p1.1)\.
- Y\. Zhang, S\. S\. S\. Das, and R\. Zhang \(2024\)Verbosity≠\\neqveracity: demystify verbosity compensation behavior of large language models\.External Links:2411\.07858,[Link](https://arxiv.org/abs/2411.07858)Cited by:[item 3](https://arxiv.org/html/2606.04781#S1.I1.i3.p1.1)\.
- A\. Zweiger, J\. Pari, H\. Guo, E\. Akyürek, Y\. Kim, and P\. Agrawal \(2025\)Self\-adapting language models\.Note:NeurIPS 2025External Links:2506\.10943,[Link](https://arxiv.org/abs/2506.10943)Cited by:[item 3](https://arxiv.org/html/2606.04781#S1.I1.i3.p1.1),[§5](https://arxiv.org/html/2606.04781#S5.p4.1)\.