@omarsar0: New research from Microsoft Research I see a lot of AI engineers handwriting agent skill docs and hope they generalize.…

X AI KOLs Following 05/25/26, 03:40 PM Papers

Summary

Microsoft Research introduces SkillOpt, a method that treats agent skill documents as trainable external state, using an optimizer model to make bounded edits validated by a held-out set. The approach achieves best or tied results across 52 evaluation cells and improves accuracy by over 23 points on GPT-5.5, with zero extra inference cost and transferable skills.

New research from Microsoft Research. I see a lot of AI engineers handwriting agent skill docs and hoping they generalize. Probably not optimal. This work shows why. It treats the skill doc as a trainable external state of a frozen agent instead. It introduces SkillOpt, where an optimizer model makes validation-gated edits to the skill file. It adds, deletes, or replaces instructions, with a textual learning rate that controls how aggressively each round rewrites the doc. The agent itself never changes. SkillOpt is best or tied on all 52 (model, benchmark, harness) cells. On GPT-5.5 it adds 23.5 points in direct chat, 24.8 with Codex, and 19.1 with Claude Code over no skill. It beats human-written skills, TextGrad, GEPA, and EvoSkill, carries zero extra inference-time cost, and the learned skills transfer across models and harnesses. Paper: https://arxiv.org/abs/2605.23904 Learn to build effective AI agents in our academy: https://academy.dair.ai


Introduction

Source: https://arxiv.org/html/2605.23904

May 2026

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Yifan Yang¹*‡, Ziyang Gong²*, Weiquan Huang³*, Qihao Yang²*, Ziwei Zhou⁴*, Zisu Huang⁴*, Yan Li², Xuemei Gao¹, Qi Dai¹, Bei Liu¹, Kai Qiu¹, Yuqing Yang¹, Dongdong Chen¹, Xue Yang²‡, Chong Luo¹

¹Microsoft, ²Shanghai Jiao Tong University, ³Tongji University, ⁴Fudan University

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision. None of these behaves like a deep-learning optimizer for the skill, and none reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT–5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.

Code: https://aka.ms/SkillOpt

Correspondence: [email protected], [email protected]

*Equal contribution. ‡Corresponding authors.

Figure 1: Overview of SkillOpt. The target model executes tasks with a current skill, an additional frontier optimizer model converts trajectories into bounded add/delete/replace skill edits, and a held-out gate accepts only edits that improve validation performance. Accepted edits are exported as a reusable skill artifact, while rejected edits become negative feedback for later updates.

Frontier language models are increasingly deployed as agents, from single-prompt callers to multi-step execution harnesses with tools, files, and verifiers[39,26,32,37]. In such settings, domain adaptation is no longer only about model weights or prompts: it also requires improving the procedures by which the agent gathers evidence, calls tools, follows domain conventions, and formats outputs[36,11]. Agent skills provide a natural interface for this procedural adaptation[12,10]: a skill is a portable natural-language artifact that packages procedures, domain heuristics, tool policies, output constraints, and failure modes, letting a frozen agent adapt through external text.

If the recurring object of adaptation is the agent’s procedure, the skill document itself should be trainable. Yet weight adaptation is often unavailable for closed frontier models and expensive for open ones, while manually written or one-shot skills are brittle under a target domain or harness. Recent systems convert execution experience into reusable textual artifacts—distilling trajectory lessons, refining skill folders via failure analysis, building domain-specific skill libraries, or optimizing prompts from trajectory feedback[19,2,13,27,1]—but leave open a more basic question: if skills are the adaptation layer, how should they be optimized? Our key idea is to treat skill editing as a controllable domain-adaptation process, with the skill document as the external state, an additional frontier model as the optimizer, and training-style controls over evidence, step size, validation, and update direction.

We introduce SkillOpt, a text-space optimizer for agent skills. Given a target domain, an initial skill, and the model being adapted, SkillOpt repeatedly samples trajectory batches, analyzes successes and failures, and asks a frontier optimizer model to propose structured add/delete/replace edits. It then aggregates and ranks candidate edits under a textual learning-rate budget, applies a bounded update to the skill document, and evaluates the candidate skill on a held-out selection split before accepting it. Rejected edits are retained as negative feedback, while the epoch-wise slow/meta update preserves longer-horizon regularities. Figure 1 gives a schematic view of this loop. The deployed output is a compact best_skill.md file of roughly 300–2,000 tokens, with the adapted model and execution harness remaining fixed.

The deep-learning analogy is operational rather than decorative. Rollout and reflection batch sizes control the noise in the evidence used for each edit; the textual learning rate and schedule control how far one skill version is allowed to move from the previous one; the held-out gate plays the role of validation; and the epoch-wise slow/meta update acts like a momentum term, carrying stable editing directions across epochs. This stability is crucial: if consecutive skill revisions move too far or in inconsistent directions, rejected edits and previous accepted edits no longer provide a meaningful optimization history. With bounded, validation-gated updates, each revision remains close enough to the last one that later optimizer calls can learn from what helped, what failed, and what should be preserved.

We conduct, to our knowledge, the first systematic study of skill optimization as a domain-adaptation training method for frontier agents. We evaluate SkillOpt on six benchmarks covering QA, spreadsheets, documents, math, and embodied decision making, across seven target models from frontier-scale GPT to small-scale Qwen, and under three execution modes (direct chat, Codex harness, Claude Code harness). Out of 52 evaluated (model, benchmark, harness) cells, SkillOpt is the best or tied-best measured method on all 52. With GPT–5.5 in direct chat, it lifts SearchQA from 77.7 to 87.3, SpreadsheetBench from 41.8 to 80.7, OfficeQA from 33.1 to 72.1, DocVQA from 78.8 to 91.2, LiveMathematicianBench from 37.6 to 66.9, and ALFWorld from 83.6 to 95.5 (a +23.5 point average gain over no skill), and it also beats the strongest per-cell baseline drawn from human-written, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills by +5.4 points on average. The same optimization interface is effective inside Codex-style and Claude Code-style execution loops, lifting GPT–5.5 by +24.8 and +19.1 points over no skill respectively, and outperforming EvoSkill by +14.0 and +3.2 points.

The learned artifacts also transfer beyond the exact training setting. A SpreadsheetBench skill trained on GPT–5.4 improves every smaller GPT variant we test; a Codex-trained spreadsheet skill transfers to Claude Code with a +59.7 point gain; and an OlympiadBench skill yields positive gains on Omni-MATH[6]. These transfer results are important for the paper’s application value: a skill can be optimized once, audited as text, and reused across related models, harnesses, or tasks without changing model weights. Our ablations explain why this works. Bounded textual learning outperforms uncontrolled rewriting, held-out gating prevents harmful proposals from accumulating, the rejected-step buffer converts failed edits into negative feedback, and the epoch-wise slow/meta update improves long-horizon refinement without bloating the deployed skill. Finally, per-benchmark case studies show that the learned skills remain compact (300–2,000 tokens after only 1–4 accepted edits), inspectable, and procedural rather than instance-specific.

Our contributions are as follows:

• We formulate agent-skill learning as optimization over an external natural-language state and introduce SkillOpt, a harness-agnostic optimizer with rollout batches, reflection minibatches, add/delete/replace edits, textual learning rates, schedules, held-out acceptance, rejected-edit buffers, and epoch-wise slow/meta update.
• We provide a broad empirical study across six benchmarks, seven target models, and three execution harnesses, showing that SkillOpt is best or tied-best on 52 of 52 cells and outperforms no-skill, human-skill, one-shot LLM-skill, prompt-optimization (TextGrad, GEPA), and skill-evolution (Trace2Skill, EvoSkill) baselines under every model.
• We validate the optimization design through component ablations and three forms of transfer (cross-model, cross-harness, cross-benchmark), showing that the exported skill artifact is compact, reusable, and deployable without model-weight updates.

Related Work

Prompt auto tuning and agent-configuration search.

GEPA demonstrates that trajectory feedback can guide reflective prompt evolution and outperform reinforcement learning on several language-agent tasks[1]. ABSTRAL and EvoTest extend this idea from single prompts to multi-agent design documents and test-time agentic system evolution without gradients or fine-tuning[30,9]. By treating language artifacts as optimizable objects, these methods can directly exploit execution feedback, but they mainly target prompts, system designs, or full configurations rather than reusable domain adaptation. SkillOpt instead optimizes a persistent skill document that can be trained, validated, exported, and reused with the adapted model, applying language-level controllability to a stable procedural skill state.

Skill construction and skill evolution.

SkillsBench and the SoK on agentic skills frame skills as reusable procedural knowledge, covering tool policies, applicability conditions, execution routines, and supporting resources[12,10]. Prior systems construct such skills from lifelong experience, trajectory lessons, skill knowledge bases, or heterogeneous domain resources[38,19,31,27,5], and further refine them through failure analysis, creation-evaluation-revision loops, co-evolving generators and verifiers, collective updates, or reinforcement learning[2,13,41,15,35,33,23,18,34]. While these works emphasize skill discovery, repository growth, sharing, evolutionary search, or policy optimization, SkillOpt studies a narrower problem: how to train one compact domain skill with deep-learning-style controls such as trajectory batches, reflection minibatches, textual learning rates, validation gates, rejected-edit buffers, and slow/meta updates. This yields a controlled and auditable procedure for producing a portable best_skill.md without changing model weights.

Method

Figure 2: Pipeline of SkillOpt. A frozen target model executes a rollout batch with the current skill; an optimizer model performs minibatch reflection over successes and failures, proposes bounded add/delete/replace edits, merges and ranks them under a scheduled edit budget, and accepts the candidate skill only through a held-out validation gate. Across epochs, the slow/meta update retains longer-horizon lessons without changing the target model.

Problem Setup

A skill $s$ is a natural-language policy inserted into the agent context before execution, consistent with recent work treating skills as reusable procedural knowledge for agents[12,10]. In direct-chat benchmarks, it is prepended to the system or developer instruction; in tool-use harnesses, it becomes persistent procedural memory. We use $M$ to denote the frozen target model whose behavior is being adapted through skill optimization. For a harness $h$, task $x$, and skill $s$, execution produces a trajectory $\tau$ and a scalar score $r$:

$$(\tau(s), r(s)) = h(M, x, s), \qquad r(s) \in [0, 1]. \tag{1}$$

Given train, selection, and test splits $D_{\mathrm{tr}}, D_{\mathrm{sel}}, D_{\mathrm{test}}$, SkillOpt uses $D_{\mathrm{tr}}$ to generate a set of candidate skills $\mathcal{C}(D_{\mathrm{tr}})$, selects the best skill on $D_{\mathrm{sel}}$, and reports the final performance on $D_{\mathrm{test}}$:

$$s^{\star}_{\mathrm{sel}} = \arg\max_{s \in \mathcal{C}(D_{\mathrm{tr}})} \frac{1}{|D_{\mathrm{sel}}|} \sum_{x \in D_{\mathrm{sel}}} r(s), \tag{2}$$

$$\mathrm{Test}(s^{\star}_{\mathrm{sel}}) = \frac{1}{|D_{\mathrm{test}}|} \sum_{x \in D_{\mathrm{test}}} r(s^{\star}_{\mathrm{sel}}). \tag{3}$$

The training split supplies experience, the selection split gates updates, and the test split is used only for final reporting. The optimizer state contains the current skill, the best validation-gated skill, cached skill hashes, an epoch-local rejected-step buffer, and optional slow/meta-update state. Only the best accepted skill is exported as best_skill.md.
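The selection rule in Eq. (2) can be illustrated with a minimal sketch. The function name `select_best_skill` and the `score` callback are hypothetical stand-ins for running the frozen model in its harness; they are not part of the released code.

```python
from statistics import mean
from typing import Callable, Optional, Sequence, Tuple


def select_best_skill(
    candidates: Sequence[str],
    selection_tasks: Sequence[str],
    score: Callable[[str, str], float],
) -> Tuple[Optional[str], float]:
    """Return the candidate with the highest mean selection score, as in Eq. (2).

    `score(task, skill)` is an assumed callback that returns r in [0, 1].
    """
    best_skill, best_mean = None, float("-inf")
    for skill in candidates:
        candidate_mean = mean(score(task, skill) for task in selection_tasks)
        if candidate_mean > best_mean:  # first maximizer wins on ties
            best_skill, best_mean = skill, candidate_mean
    return best_skill, best_mean


def toy_score(task: str, skill: str) -> float:
    """Toy scorer: a task succeeds when the skill mentions its keyword."""
    return float(task in skill)


best, value = select_best_skill(
    ["", "check units", "check units; cite source"],
    ["units", "source"],
    toy_score,
)
```

The test split never appears in this function, which mirrors the separation between selection and final reporting described above.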

Forward Pass: Rollout Evidence

At each optimization step, the target model runs a rollout batch from $D_{\mathrm{tr}}$ with the current skill. The harness records task metadata, messages, tool calls, observations, command outputs, final answers, verifier feedback, and benchmark-specific context such as spreadsheet previews, document references, or compact execution traces. This batch is the evidence unit: small batches update quickly but noisily, while larger batches expose more recurring patterns before the skill changes. The implementation also supports accumulation, where several rollout batches are reflected on separately and merged into one update, decoupling execution throughput from update frequency.

Backward Pass: Minibatch Reflection

The optimizer model turns trajectories into skill edits, following the broader line of trajectory-driven reflection and prompt evolution[28,16,1]. It first separates failures from successes and partitions each group into reflection minibatches. This matters because single trajectories often produce anecdotal fixes, while minibatches expose reusable procedural errors: the agent consistently searches the wrong source, writes an answer in the wrong format, or fails to verify a tool result. Failure minibatches propose missing or corrective rules; success minibatches preserve behaviors that already work. Each reflection returns structured add/delete/replace edits, or in rewrite mode a small set of rewrite suggestions.

Local proposals are merged hierarchically by first consolidating failure- and success-driven edits separately, then combining them with priority on failure corrections. This step filters duplicate, contradictory, and example-specific suggestions before the optimizer selects the final bounded update.
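A minimal sketch of the partition-then-merge structure described above, under stated assumptions: success is taken to mean a score at or above a threshold, edits are encoded as hypothetical `(operation, text)` tuples, and the merge only removes exact duplicates. In the paper the optimizer model performs the filtering of contradictory and example-specific suggestions, which this sketch does not attempt.

```python
from typing import Dict, List, Tuple

Trajectory = Dict[str, object]
Edit = Tuple[str, str]  # (operation, rule text); a hypothetical edit encoding


def reflection_minibatches(
    trajectories: List[Trajectory],
    minibatch_size: int = 8,
    success_threshold: float = 1.0,  # assumption: binary success at score 1.0
) -> Tuple[List[List[Trajectory]], List[List[Trajectory]]]:
    """Separate failures from successes, then chunk each group into minibatches."""
    failures = [t for t in trajectories if float(t["score"]) < success_threshold]
    successes = [t for t in trajectories if float(t["score"]) >= success_threshold]

    def chunk(items: List[Trajectory]) -> List[List[Trajectory]]:
        return [items[i:i + minibatch_size] for i in range(0, len(items), minibatch_size)]

    return chunk(failures), chunk(successes)


def merge_with_failure_priority(failure_edits: List[Edit], success_edits: List[Edit]) -> List[Edit]:
    """Place failure-driven edits first and drop exact duplicates."""
    merged: List[Edit] = []
    seen = set()
    for edit in list(failure_edits) + list(success_edits):
        if edit not in seen:
            seen.add(edit)
            merged.append(edit)
    return merged
```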

Bounded Text Updates

The learning-rate analogue in SkillOpt is the edit budget $L_t$: the maximum number of skill edits applied at step $t$. After aggregation, the optimizer model ranks the merged edit pool by expected utility and clips it to the top $L_t$ edits. This is the key difference from ad hoc prompt rewriting. Unbounded rewrites can erase useful rules, introduce incompatible instructions, or overfit to a local failure; bounded updates preserve continuity while still allowing the skill to acquire new procedures. SkillOpt supports constant, linear, cosine, and autonomous schedules. The default cosine schedule starts with larger edits and decays toward smaller consolidation steps.
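One way to picture the edit budget is as an integer-valued schedule followed by a top-k clip. The sketch below is an assumption about the shape of such a schedule, using the standard cosine-annealing formula and rounding to whole edits; the function names are hypothetical, and the autonomous schedule is not covered because the text does not specify it.

```python
import math
from typing import List, Sequence


def edit_budget(step: int, total_steps: int, base: int = 4, floor: int = 2,
                schedule: str = "cosine") -> int:
    """Return the maximum number of edits allowed at a 0-indexed step."""
    if schedule == "constant" or total_steps <= 1:
        return base
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    if schedule == "linear":
        value = base - (base - floor) * progress
    elif schedule == "cosine":
        value = floor + 0.5 * (base - floor) * (1.0 + math.cos(math.pi * progress))
    else:
        raise ValueError(f"schedule not covered by this sketch: {schedule}")
    return max(floor, round(value))


def clip_edits(ranked_edits: Sequence[str], budget: int) -> List[str]:
    """Keep only the top `budget` edits from a list already ranked by utility."""
    return list(ranked_edits[:budget])


budgets = [edit_budget(t, 5) for t in range(5)]
```

With a base of 4 and a floor of 2, the budget starts at 4 edits and ends at 2, so early steps can restructure the skill while later steps only consolidate.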

The selected edits produce a candidate skill. In patch mode, edits are localized operations such as append, insert, replace, and delete; in rewrite mode, selected suggestions condition a full skill rewrite. Step-level edits cannot overwrite the protected slow-update field, so fast local changes and slower epoch-wise consolidation remain separated.
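Patch mode can be sketched as list operations on the skill's rules. This is an illustration only: the edit dictionary layout, the `SLOW_FIELD_PREFIX` marker, and the choice to silently skip protected or stale targets are all assumptions, not the paper's file format.

```python
from typing import Dict, List

SLOW_FIELD_PREFIX = "[slow-update]"  # hypothetical marker for the protected field


def apply_patch(rules: List[str], edits: List[Dict[str, str]]) -> List[str]:
    """Apply add/delete/replace edits without touching protected slow-update rules."""
    updated = list(rules)  # work on a copy so the previous skill version is kept
    for edit in edits:
        op = edit["op"]
        if op == "add":
            # Step-level edits may not write into the protected field.
            if not edit["text"].startswith(SLOW_FIELD_PREFIX):
                updated.append(edit["text"])
            continue
        if op not in ("delete", "replace"):
            raise ValueError(f"unknown edit op: {op}")
        target = edit["target"]
        if target.startswith(SLOW_FIELD_PREFIX) or target not in updated:
            continue  # protected or stale target: skip the edit
        index = updated.index(target)
        if op == "delete":
            del updated[index]
        else:
            updated[index] = edit["text"]
    return updated
```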

Validation Gate and Rejected-Edit Buffer

Every candidate skill is evaluated on $D_{\mathrm{sel}}$ with the same frozen target model and harness. If it improves over the current selection score, it becomes the new current skill; if it also exceeds the best score so far, it becomes best_skill.md. Otherwise it is rejected. This gate turns reflection into propose-and-test optimization rather than unconditional self-editing, which is crucial because plausible textual diagnoses can still hurt the actual target model.

Rejected updates are still useful. The optimizer records an epoch-local buffer containing observed failure patterns and, for rejected steps, the edits that were tried and the score drop they caused. Later reflection calls in the same epoch receive this buffer, so the optimizer model can avoid repeating failed edits and focus on unresolved failures. This gives the loop negative feedback during training without adding inference-time cost.
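The gate and the rejected-edit buffer can be sketched together as a small state update. The class and field names here are hypothetical; the strict comparison (ties rejected) follows the default optimizer settings reported in the experiments.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GateState:
    current_skill: str
    current_score: float
    best_skill: str
    best_score: float
    rejected: List[Dict[str, object]] = field(default_factory=list)  # epoch-local buffer


def gate(state: GateState, candidate: str, candidate_score: float, edits: List[str]) -> bool:
    """Accept the candidate only if it strictly improves the current selection score."""
    if candidate_score > state.current_score:
        state.current_skill, state.current_score = candidate, candidate_score
        if candidate_score > state.best_score:
            state.best_skill, state.best_score = candidate, candidate_score
        return True
    # Rejected: remember what was tried and how much it cost, as negative feedback.
    state.rejected.append(
        {"edits": list(edits), "score_drop": state.current_score - candidate_score}
    )
    return False
```

Later reflection prompts in the same epoch could then include `state.rejected`, so the optimizer model sees which edits already failed.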

Table 1: Main results on held-out test splits. Scores are percentages; within each model–harness block, bold marks the best measured entry and underlining marks the second-best entry for each benchmark. Blue cells denote SkillOpt, and small green/red subscripts show the absolute change relative to the No skill row of the same model in the same harness. We omit ALFWorld under Codex and Claude Code harnesses because ALFWorld requires persistent embodied-environment interaction. SkillOpt is the best-or-tied entry on every measured cell of the table, with positive gains over the no-skill baseline throughout.

Epoch-Wise Slow/Meta Update

Fast updates learn from the current batch; the epoch-wise slow/meta update learns from adjacent epochs. At the end of an epoch, SkillOpt samples the same training items under the previous epoch’s skill and the current skill, then groups them into improvements, regressions, persistent failures, and stable successes. The optimizer model writes a concise longitudinal guidance block into a protected slow-update field, and this candidate is still passed through the validation gate. Thus slow update captures durable domain lessons while preserving the same safety check as step-level edits.
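The four-way grouping used by the slow update can be sketched as a comparison of per-task scores under the two skills. The binary success threshold and the function name are assumptions made for illustration; the paper does not state how partial scores are bucketed.

```python
from typing import Dict, List


def group_outcomes(
    prev_scores: Dict[str, float],
    curr_scores: Dict[str, float],
    success_threshold: float = 1.0,  # assumption: a task counts as solved at score 1.0
) -> Dict[str, List[str]]:
    """Group tasks scored under both skills into the four slow-update categories."""
    groups: Dict[str, List[str]] = {
        "improvements": [],
        "regressions": [],
        "persistent_failures": [],
        "stable_successes": [],
    }
    for task_id in sorted(prev_scores.keys() & curr_scores.keys()):
        prev_ok = prev_scores[task_id] >= success_threshold
        curr_ok = curr_scores[task_id] >= success_threshold
        if curr_ok and not prev_ok:
            groups["improvements"].append(task_id)
        elif prev_ok and not curr_ok:
            groups["regressions"].append(task_id)
        elif not prev_ok and not curr_ok:
            groups["persistent_failures"].append(task_id)
        else:
            groups["stable_successes"].append(task_id)
    return groups
```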

The meta skill is optimizer-side only. It summarizes which edit patterns helped, which were rejected, and which failures persisted across epochs. This meta guidance is prepended to future optimizer prompts for reflection, merging, and ranking, but it is not shipped with the target model. The advantage is separation of concerns: the deployed skill remains compact and portable, while training benefits from a richer record of the editing process.

Harness-Agnostic Deployment

SkillOpt is harness-agnostic through a lightweight adapter interface, matching the broader trend toward agents embedded in tool-use and software-execution environments[39,26,37]. An adapter constructs train/evaluation batches, injects the current skill into the agent context, runs the native harness, and returns scored trajectories. The same optimizer therefore works for direct QA, spreadsheet execution, document reasoning, multimodal QA, embodied environments, and Codex-style or Claude Code-style execution loops. This is the main practical advantage of treating skills as the adaptation layer: a stronger optimizer model can train a reusable skill artifact offline, and the resulting best_skill.md can then be deployed or tested across target models, harnesses, and nearby benchmarks without changing model weights.
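The adapter responsibilities listed above suggest an interface along the following lines. This is a hedged sketch: `HarnessAdapter`, `ScoredTrajectory`, and the method names are hypothetical, and `KeywordToyAdapter` is a toy stand-in that does not call any real harness.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Sequence


@dataclass
class ScoredTrajectory:
    task: str
    trace: str
    score: float  # scalar score in [0, 1]


class HarnessAdapter(Protocol):
    """Assumed adapter contract: build batches, inject the skill, return scored runs."""

    def build_batch(self, split: str, size: int) -> Sequence[str]: ...

    def run(self, tasks: Sequence[str], skill: str) -> List[ScoredTrajectory]: ...


class KeywordToyAdapter:
    """Toy adapter: a task succeeds when the skill text mentions its keyword."""

    def __init__(self, splits: Dict[str, List[str]]) -> None:
        self.splits = splits

    def build_batch(self, split: str, size: int) -> Sequence[str]:
        return self.splits[split][:size]

    def run(self, tasks: Sequence[str], skill: str) -> List[ScoredTrajectory]:
        return [ScoredTrajectory(t, f"skill injected for {t}", float(t in skill)) for t in tasks]


def evaluate(adapter: HarnessAdapter, split: str, size: int, skill: str) -> float:
    """Mean score of a skill on one batch, independent of the harness behind the adapter."""
    results = adapter.run(adapter.build_batch(split, size), skill)
    if not results:
        raise ValueError("empty batch")
    return sum(r.score for r in results) / len(results)
```

Because `evaluate` depends only on the protocol, the same optimizer code could in principle score a skill under any harness that implements these two methods.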

Table 2: Hyperparameter analysis for the text optimizer. Panels: (a) Training set size; (b) Mini-batch size; (c) Rollout batch size; (d) Learning rate; (e) Learning-rate scheduler; (f) Slow-update samples. Each panel changes one scalar or scheduling factor from the default setting unless noted. Panel (a) fixes the split to 4:1:5 train/selection/test; the 1-example, 20%, 40%, and 80% rows use subsets of the training partition, and the 100% row reuses the completed 4:1:5 split-ratio run. Panel (b) sweeps the reflection mini-batch size $B_m$; panel (c) sweeps the rollout batch size $B$.

Table 3: Component ablations for learning-rate form, rejected buffer, and epoch-wise slow/meta update. Light-blue rows mark the default setting within each component group; the learning-rate group uses the default lr=4 setting. Bold values mark the best measured result within that group and benchmark. The without-rejected-buffer row uses the matched no-buffer ablation setting.

(a) Cross-model transfer

| Source model | Target model | Benchmark | Baseline | Direct | Transferred |
|---|---|---|---|---|---|
| GPT–5.4 | GPT–5.4 | SpreadsheetBench | 41.4 | 62.5 | 52.1 (+10.7) |
| GPT–5.4 | GPT–5.4-mini | SpreadsheetBench | 36.1 | 47.5 | 45.5 (+9.4) |
| GPT–5.4 | GPT–5.4-nano | SpreadsheetBench | 23.5 | 42.5 | 26.5 (+3.0) |
| GPT–5.4 | GPT–5.4 | LiveMath | 36.8 | 44.0 | 47.2 (+10.4) |
| GPT–5.4 | GPT–5.4-mini | LiveMath | 14.7 | 32.8 | 19.2 (+4.5) |
| GPT–5.4 | GPT–5.4-nano | LiveMath | 23.2 | 27.2 | 28.8 (+5.6) |

(b) Cross-harness transfer

| Source harness | Target harness | Benchmark | Baseline | Direct | Transferred |
|---|---|---|---|---|---|
| Codex | Claude Code | LiveMath | 40.8 | 56.5 | 42.4 (+1.6) |
| Claude Code | Codex | LiveMath | 35.2 | 78.4 | 48.0 (+12.8) |
| Codex | Claude Code | SpreadsheetBench | 22.1 | 80.4 | 81.8 (+59.7) |
| Claude Code | Codex | SpreadsheetBench | 27.5 | 85.0 | 71.1 (+43.6) |

(c) Cross-benchmark transfer

| Source benchmark | Target benchmark | Model | Baseline | Direct | Transferred |
|---|---|---|---|---|---|
| OlympiadBench | Omni-MATH | GPT–5.4 | 56.6 | – | 60.3 (+3.7) |
| OlympiadBench | Omni-MATH | GPT–5.4-mini | 34.8 | – | 36.6 (+1.8) |
| OlympiadBench | Omni-MATH | GPT–5.4-nano | 38.8 | – | 40.1 (+1.3) |

Table 4: Transfer of optimized skills across three axes. (a) Cross-model: a skill optimized for the source model is deployed on the target model. (b) Cross-harness: a skill trained inside the source harness is evaluated inside the target harness, all on GPT–5.5. (c) Cross-benchmark: the source benchmark skill is evaluated on the target benchmark across three target models. Baseline is the target’s no-skill score, Direct is the in-domain SkillOpt score, and Transferred applies the source skill without further optimization. The change shown next to each Transferred score is measured against the target baseline. Every row in (a)–(c) is a positive transfer (no row falls below the target’s no-skill baseline).

Experiments

We evaluate SkillOpt as a text-space optimizer for frozen agents: the target model executes each task with the current skill, while an offline optimizer edits that skill from rollout evidence. The experiments answer four questions. (i) Do optimized skills improve over no-skill, human-skill, one-shot LLM-skill, prompt-optimization (TextGrad, GEPA), and skill-evolution (Trace2Skill, EvoSkill) baselines? (ii) Does the same loop work across direct chat, Codex, and Claude Code harnesses, and across seven target models from frontier-scale GPT to small Qwen? (iii) Which optimizer controls matter? (iv) What do the learned skills look like, and at what cost?

Setting.

We report each benchmark’s native hard score or exact-match accuracy on held-out test splits across SearchQA[4], SpreadsheetBench[14], OfficeQA[22], DocVQA[17], LiveMathematicianBench[8] (abbreviated LiveMath in tables), and ALFWorld[29], using two model families: GPT[21] and Qwen[24,25]. The benchmark suite is intentionally diverse: it spans single-round QA (SearchQA, DocVQA, LiveMathematicianBench MCQ), multi-turn tool loops with up to 24 tool calls (OfficeQA), multi-round codegen with up to 30 turns and a real openpyxl/pandas runtime (SpreadsheetBench, default mode=multi), and persistent embodied interaction with up to 50 steps per episode (ALFWorld). Dataset-backed runs use deterministic train/selection/test splits derived from the same dataset seed (split_seed=42); the selection split is used only to accept or reject candidate skill edits, and all reported scores are computed on the disjoint held-out test split. The reported numbers thus measure generalization, not validation-set fit.
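A seeded split of this kind can be sketched with the standard library. The function below is an assumption about how such a split might be produced, not the released implementation; the 4:1:5 ratio is the one the Table 2 caption reports for its panel (a), and only the `split_seed` name and value come from the text.

```python
import random
from typing import List, Sequence, Tuple


def deterministic_split(
    items: Sequence[str],
    ratios: Tuple[int, int, int] = (4, 1, 5),  # train : selection : test
    split_seed: int = 42,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle with a fixed seed, then cut into disjoint train/selection/test parts."""
    order = list(items)
    random.Random(split_seed).shuffle(order)  # local RNG, so global state is untouched
    total = sum(ratios)
    n_train = len(order) * ratios[0] // total
    n_sel = len(order) * ratios[1] // total
    train = order[:n_train]
    selection = order[n_train:n_train + n_sel]
    test = order[n_train + n_sel:]
    return train, selection, test


train, selection, test = deterministic_split([f"task-{i}" for i in range(100)])
```

Because the three parts are slices of one shuffled list, they are disjoint by construction, which is what lets the test score measure generalization rather than selection-set fit.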

Default optimizer hyperparameters.

Unless noted, SkillOpt uses four epochs, rollout batch size 40 per step, reflection minibatch size 8 (with 16 analyst workers running reflections in parallel and a merge batch size of 8), textual learning rate $L_t = 4$ with cosine decay (floor $L_t = 2$, configurable schedules: constant, linear, cosine, autonomous), held-out validation gating (strictly greater than the current selection score; ties are rejected), slow update with 20 sampled tasks per epoch comparing previous-epoch and current-epoch skill, an optimizer-side meta skill that summarizes accepted/rejected patterns into teacher-only guidance, the patch edit mode (the alternative is rewrite_from_suggestions), and an optional rejected-edit buffer of recent failed proposals. Teacher reflection is allowed up to three refinement rounds per minibatch. Both teacher and student calls default to a medium reasoning effort. For benchmarks with tightly bounded training pools (LiveMathematicianBench: 35 training items per epoch with rollout batch 200; ALFWorld: 39 training tasks with 140 selection and 134 test environments), per-benchmark configs scale the batch sizes accordingly while keeping the same gate, scheduler, and slow/meta-update machinery. Additional benchmark, baseline, and optimizer-protocol details are in Appendix C.
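For reference, the defaults listed above can be gathered into a single configuration object. The class and field names are hypothetical; only the values come from the text. Enabling the rejected-edit buffer by default is an assumption, since the text calls it optional.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillOptDefaults:
    """Default hyperparameters as reported in the text (field names are illustrative)."""

    epochs: int = 4
    rollout_batch_size: int = 40
    reflection_minibatch_size: int = 8
    analyst_workers: int = 16
    merge_batch_size: int = 8
    learning_rate: int = 4            # edit budget L_t at the start of training
    learning_rate_floor: int = 2      # floor reached by the cosine decay
    schedule: str = "cosine"
    strict_gate: bool = True          # ties on the selection score are rejected
    slow_update_samples: int = 20
    edit_mode: str = "patch"
    use_rejected_buffer: bool = True  # assumption: the optional buffer is switched on
    max_refinement_rounds: int = 3
    reasoning_effort: str = "medium"

    def __post_init__(self) -> None:
        if self.schedule not in {"constant", "linear", "cosine", "autonomous"}:
            raise ValueError(f"unknown schedule: {self.schedule}")
        if self.edit_mode not in {"patch", "rewrite_from_suggestions"}:
            raise ValueError(f"unknown edit mode: {self.edit_mode}")
        if not 1 <= self.learning_rate_floor <= self.learning_rate:
            raise ValueError("floor must lie between 1 and the learning rate")


defaults = SkillOptDefaults()
```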

Harnesses.

Direct chat invokes the target model through a single chat completion call with the skill prepended to the system prompt. The Codex harness drives the target through the codex CLI in a workspace-write sandbox[20]; SkillOpt renders the current skill to a per-task SKILL.md alongside task files and reads back a compact execution trace (codex_trace_summary.txt) that is included in the teacher reflection context, so the optimizer learns from what the agent actually did, not just its final answer. The Claude Code harness mirrors the same workspace contract through the claude CLI[3]. All three modes consume the same best_skill.md file format, which is what enables the cross-harness transfer experiments in Section 4.3.

Baselines.

We compare against seven baselines that span the no-adaptation, hand-written, one-shot, and learning families: no skill (frozen target model run with the benchmark’s default system prompt), human skill (an expert-written skill document curated per benchmark), one-shot LLM skill (a single skill generated from a high-level task description by GPT–5.5 and never updated), Trace2Skill[19] (trajectory-level skill distillation), TextGrad[40] (gradient-style natural-language prompt optimization), GEPA[1] (Pareto reflective prompt evolution), and the harness-side competitor EvoSkill[2] (skill-folder evolution under failure analysis). All baselines use the same target model, the same held-out test split, and the same scorer for every benchmark, so the comparison isolates the choice of adaptation procedure rather than secondary factors such as prompt template or scoring pipeline.

Main Results

Table 1 is the main result matrix. Counting every (target model, benchmark, harness) cell as one comparison and the strongest of the no-skill, human-skill, LLM-skill, Trace2Skill, TextGrad, GEPA, and EvoSkill baselines as the per-cell competition, SkillOpt wins or matches the best measured result on 52 of 52 evaluated cells. This dominance is uniform across model scales: SkillOpt is best on every benchmark for GPT–5.5, GPT–5.4, GPT–5.4-mini, GPT–5.4-nano, GPT–5.2, Qwen3.5–4B, and Qwen3.6–35B-A3B in direct chat, and for GPT–5.5 under both Codex and Claude Code harnesses.

The size of the gains is also unusually large for a no-weight-update method. On GPT–5.5 direct chat, the six-benchmark average rises from 58.8 (no skill) to 82.3 (SkillOpt), a +23.5 point absolute improvement, while the best per-cell baseline averages only 76.9, leaving SkillOpt +5.4 points clear of an oracle baseline that picks the best of six competing methods per cell. Per-benchmark deltas over no skill range from +9.6 on SearchQA, where the no-skill model is already near ceiling, to +38.9 on SpreadsheetBench and +39.0 on OfficeQA, where strict procedural and answer-format requirements expose the limits of zero-shot frontier models. Procedural benchmarks see the largest improvements: SpreadsheetBench 41.8→80.7, OfficeQA 33.1→72.1, and LiveMathematicianBench 37.6→66.9 on GPT–5.5; SpreadsheetBench 9.3→23.9 (×2.6) on Qwen3.5–4B; and ALFWorld 34.3→69.4 (×2.0) on GPT–5.4-nano.
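The aggregates in this paragraph can be recomputed from the per-benchmark GPT–5.5 direct-chat scores quoted in the introduction. The short check below uses only those reported numbers; the variable names are illustrative.

```python
# GPT-5.5 direct-chat scores as reported in the text (percent).
no_skill = {
    "SearchQA": 77.7, "SpreadsheetBench": 41.8, "OfficeQA": 33.1,
    "DocVQA": 78.8, "LiveMathematicianBench": 37.6, "ALFWorld": 83.6,
}
skillopt = {
    "SearchQA": 87.3, "SpreadsheetBench": 80.7, "OfficeQA": 72.1,
    "DocVQA": 91.2, "LiveMathematicianBench": 66.9, "ALFWorld": 95.5,
}

# Per-benchmark gains over no skill, rounded to one decimal like the paper.
deltas = {name: round(skillopt[name] - no_skill[name], 1) for name in no_skill}

avg_no_skill = sum(no_skill.values()) / len(no_skill)
avg_skillopt = sum(skillopt.values()) / len(skillopt)
avg_gain = avg_skillopt - avg_no_skill

# Gap to the oracle baseline, using the reported per-cell-best average of 76.9.
oracle_gap = avg_skillopt - 76.9

# Relative improvements quoted for the smaller models.
qwen_ratio = 23.9 / 9.3   # SpreadsheetBench on Qwen3.5-4B
nano_ratio = 69.4 / 34.3  # ALFWorld on GPT-5.4-nano

print(deltas)
print(round(avg_no_skill, 1), round(avg_skillopt, 1), round(avg_gain, 1), round(oracle_gap, 1))
```

The recomputed values agree with the reported 58.8, 82.3, +23.5, and +5.4 after rounding to one decimal place.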

The improvement is not specific to frontier-scale targets. Averaged over the six benchmarks, SkillOpt lifts GPT–5.4 by +12.7 points, GPT–5.4-mini by +15.4, GPT–5.4-nano by +26.7, GPT–5.2 by +16.6, Qwen3.5–4B by +19.2, and Qwen3.6–35B-A3B by +9.1, for an average improvement of approximately +17.6 points per model once the +23.5 gain on GPT–5.5 is included. Small and weak target models benefit the most in relative terms (e.g. GPT–5.4-nano nearly doubles on DocVQA and doubles on ALFWorld, from 34.3 to 69.4), which is consistent with the view that a compact skill artifact can supply procedural knowledge that small models do not yet hold in weights.
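As a worked check of the per-model average, assuming it is taken over all seven direct-chat targets (the six gains listed here plus the +23.5 reported for GPT–5.5):

```latex
\[
\frac{23.5 + 12.7 + 15.4 + 26.7 + 16.6 + 19.2 + 9.1}{7}
  = \frac{123.2}{7}
  = 17.6
\]
```

Averaging only the six models listed in this paragraph would instead give 99.7 / 6 ≈ 16.6, so the reported figure is consistent with the seven-model reading.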

The same optimization interface is also effective under tool-backed execution. On the Codex harness, SkillOpt is best on all five evaluated benchmarks for GPT–5.5, with average gain +24.8 points over no skill and +14.0 over the next-best baseline (EvoSkill). On the Claude Code harness, it is best on all five benchmarks for GPT–5.5, with average gain +19.1 over no skill and +3.2 over EvoSkill, while EvoSkill itself already lifts the five-benchmark average from 57.8 to 73.7. The two ALFWorld cells under harness rows are left blank because ALFWorld requires persistent embodied-environment interaction that is not represented in the standard Codex / Claude Code adapters; we therefore report harness results on search, spreadsheets, document QA, multimodal QA, and math.

Taken together, the table supports a strong empirical claim: across direct chat and two tool-execution harnesses, across seven target models, and on procedural and factual benchmarks alike, optimizing a single compact skill artifact under bounded text-space training is the strongest no-weight-update adaptation strategy among the baselines we consider. The main gains come from feedback-driven skill editing rather than from a better one-shot prompt: human and LLM skills can help when prior instructions happen to match the benchmark, but they cannot correct failures after observing rollouts; Trace2Skill mines trajectory lessons without a held-out gate; TextGrad and GEPA optimize prompts but not a persistent skill artifact; and EvoSkill, the strongest harness-side competitor, lacks both bounded textual learning rates and rejected-edit memory. These comparisons support the central design choice—keep the target model, harness, and evaluator fixed, and optimize only the reusable skill artifact.

Alternative explanations.

The per-cell baselines clarify what drives the gains. The effect is not simply prompt length: human skills are already 145–516 tokens long and often exceed the one-shot LLM skill, yet they are beaten in every direct-chat model row while the learned artifacts remain compact (Table 6). It is also not only optimizer capacity: SkillOpt leads every baseline even for GPT–5.4-nano, and the optimizer-strength analysis in Table 5 shows that a target-matched optimizer recovers much of the gain. Finally, the harness results show that the method is not just exploiting one skill format: EvoSkill already improves the Codex SpreadsheetBench cell from 27.5 to 67.5, but SkillOpt adds another +17.5 points (67.5 → 85.0). The gains are largest on procedural benchmarks, where reusable rules about tool use and output formatting matter most, but they also appear on factual and multimodal benchmarks.

Headline numbers in one place.

For convenience, the headline aggregates over Table 1 are: (i) 52/52 cells best or tied-best; (ii) average per-model improvement of ≈ +17.6 points across the seven direct-chat target models; (iii) average GPT–5.5 improvement of +23.5 (direct chat), +24.8 (Codex), +19.1 (Claude Code) over no skill; (iv) GPT–5.5 oracle-baseline gap of +5.4 points (direct chat), computed as the difference between SkillOpt's six-benchmark average (82.3) and an oracle that picks the best of six competing methods per cell (76.9). The remainder of this section unpacks why these gains appear (Section 4.2), how stable and transferable they are (Section 4.3), and what the learned artifact looks like (Section 4.4).
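
The oracle-baseline gap in item (iv) can be written out explicitly. The notation here is introduced only for illustration and does not come from the paper: s_m(b) denotes the score of method m on benchmark b, B the set of six benchmarks, and M the set of six competing methods.

```latex
\Delta_{\text{oracle}}
  = \frac{1}{|\mathcal{B}|} \sum_{b \in \mathcal{B}} s_{\text{SkillOpt}}(b)
  \;-\; \frac{1}{|\mathcal{B}|} \sum_{b \in \mathcal{B}} \max_{m \in \mathcal{M}} s_m(b)
  = 82.3 - 76.9 = 5.4
```

Because the oracle takes the maximum per cell, it is at least as strong as any single competing method, so a positive gap against it is a stricter comparison than a gap against any one baseline.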

Ablations

Table 2, Figure 3, and Table 3 test the design choices in the optimizer using GPT–5.5 as both the target and the optimizer. The overall message is that SkillOpt benefits from sufficient evidence, a bounded textual learning rate, rejected-edit feedback, and epoch-wise slow/meta update. SearchQA has limited headroom and is therefore stable across many settings (most cells fluctuate inside a ±1.5 point band), while SpreadsheetBench and LiveMathematicianBench expose the trade-off between learning useful procedures and over-editing the skill.

Evidence and batch sizes (panels a, b, c).

Panel (a) shows that procedural benchmarks reward more training evidence: SpreadsheetBench climbs 47.5 → 78.0 and LiveMathematicianBench climbs 59.1 → 70.5 as the optimizer sees 1 → 100% of the training partition, while SearchQA saturates at roughly 84–86 after 20% already. Panel (b) shows the same robustness in the other direction: varying the reflection mini-batch size from 1 to 32 keeps SearchQA inside 85.9–87.1 and SpreadsheetBench inside 75.4–77.9, with the default $B_m = 8$ at or near the top on all three benchmarks. Panel (c) is equally flat in the rollout-batch-size dimension: moving from $B = 8$ to a full epoch keeps SearchQA inside 85.1–87.2 and SpreadsheetBench inside 75.0–77.5. Together this means the headline gains are not the product of a fragile prompt-search batch size, but a genuine effect of having enough scored evidence per update.

Textual learning rate and schedule (panels d, e).

Panels (d) and (e) directly compare bounded textual learning to looser settings. Sweeping $L_t \in \{1, 2, 4, 8, 16\}$ shows that small or moderate edit budgets are competitive throughout: $L_t = 4$ achieves 86.5/78.2/56.5, the highest LiveMath score belongs to $L_t = 8$ at 66.9, and the lowest SearchQA score across all five settings is still 85.5. Panel (e) confirms this on the schedule axis: constant decay scores 87.3/80.7/62.1, cosine 87.1/77.5/61.3, and linear 87.2/72.9/62.9, so the bounded-update story does not depend on a single specific scheduler. The qualitative claim is what matters: any moderate, bounded edit budget already beats baselines that rewrite the skill without a budget (Table 3, “without lr” row, 84.6/75.7/57.3).
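
To make the schedule axis concrete, here is a minimal sketch of how a per-step edit budget could be derived from a base budget $L_t$ under constant, linear, and cosine schedules. This section does not give the formulas SkillOpt uses, so the mapping below simply borrows the usual numeric learning-rate schedules; the function name `edit_budget` and the `min_budget` floor are assumptions made for illustration.

```python
import math


def edit_budget(step: int, total_steps: int, base_budget: int = 4,
                schedule: str = "cosine", min_budget: int = 1) -> int:
    """Hypothetical helper: max number of edits allowed at `step`.

    Assumes the textual learning rate is an integer cap on edits per step
    and that schedules shrink it from `base_budget` toward `min_budget`.
    """
    if total_steps <= 1 or schedule == "constant":
        return base_budget
    progress = step / (total_steps - 1)  # 0.0 at the first step, 1.0 at the last
    if schedule == "linear":
        scale = 1.0 - progress
    elif schedule == "cosine":
        scale = 0.5 * (1.0 + math.cos(math.pi * progress))
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    # Never drop below the floor, so every step may still make one edit.
    return max(min_budget, round(base_budget * scale))


if __name__ == "__main__":
    for name in ("constant", "linear", "cosine"):
        print(name, [edit_budget(s, 5, 8, name) for s in range(5)])
```

The floor reflects one possible design choice: a budget of zero would turn later steps into no-ops, whereas a floor of one still allows small corrective edits late in training.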

Epoch-wise slow/meta update (panel f, Table3, Figure3).

The slow/meta update supplies longer-horizon guidance beyond the current rollout batch. Slow-update sampling (panel f) places the default at 20 examples per epoch (87.1, 77.5, and 61.3), with 5, 10, and 40 each within ±2.7 points. In the matched default component row, removing the rejected-edit buffer lowers scores by 1.6, 4.6, and 2.4 points on SearchQA, SpreadsheetBench, and LiveMath, respectively, supporting its role as a stabilizer for the default loop rather than as an extra deployment-time mechanism. The slow/meta ablation rows are sharper: removing both meta skill and slow update drops SpreadsheetBench from 77.5 to 55.0 (−22.5 points), the largest degradation in the ablation suite. Figure 3 complements these numerical ablations: validation checkpoints track held-out test performance across epochs, confirming that the gate tends to select skills that generalize rather than skills that only fit the selection split.

Figure 3: Performance trends across epoch checkpoints on three benchmarks: (a) SpreadsheetBench, (b) SearchQA, and (c) LiveMath. For each checkpoint, we report the training rollout score, the selection-best score on the validation set, and the final performance on the unseen test set. The results show how skill quality evolves during optimization and whether the checkpoint preferred by validation selection aligns with the checkpoint that yields the best generalization to the test set.

Gate strictness and edit observability.

The validation gate is intentionally strict: a candidate skill is accepted only when its selection-split score is strictly greater than the current selection score, so ties are rejected and the deployed skill never silently drifts. This conservative criterion makes rejected edits informative negative feedback rather than hidden state. Operationally, every step also records an edit_apply_report.json containing per-edit accept/skip status, so the source of every change to best_skill.md is recoverable after the fact. The epoch-wise slow/meta update writes into a markup-fenced protected region of the skill document that step-level edits cannot overwrite, separating the fast intra-epoch update from the slower cross-epoch consolidation; the optimizer-side meta skill lives only in the teacher's reflection context and is never shipped with the deployed artifact. These implementation choices explain why removing both meta skill and slow update is especially damaging on SpreadsheetBench: it removes the long-horizon evidence stream and the protected-region contract that keeps local edits from overwriting durable procedural lessons.
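
The strict acceptance rule can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: `GateState`, `gate_step`, and the `score_on_selection` callback are hypothetical names, and the rejected-edit buffer is modeled as a plain list of rejected candidates.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GateState:
    """Hypothetical container for the current best skill and its score."""
    best_skill: str
    best_score: float
    rejected: List[str] = field(default_factory=list)  # rejected-edit buffer


def gate_step(state: GateState, candidate: str,
              score_on_selection: Callable[[str], float]) -> bool:
    """Accept `candidate` only if it strictly beats the current selection score."""
    score = score_on_selection(candidate)
    if score > state.best_score:  # strict inequality: ties are rejected
        state.best_skill = candidate
        state.best_score = score
        return True
    # Rejected candidates are kept as negative feedback for later reflection.
    state.rejected.append(candidate)
    return False


if __name__ == "__main__":
    state = GateState(best_skill="v0", best_score=0.70)
    print(gate_step(state, "v1", lambda skill: 0.70))  # tie, so rejected
    print(gate_step(state, "v2", lambda skill: 0.72))  # strictly better, so accepted
    print(state.best_skill, state.rejected)
```

Using `>` rather than `>=` is what keeps the deployed skill from drifting: a candidate that merely matches the current score changes nothing except the contents of the rejected buffer.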

Overall, the ablations show that the gains are relatively insensitive to the exact rollout batch, reflection minibatch, or learning-rate schedule, but much more sensitive to the presence of bounded text-space learning, validation gating, rejected-edit feedback, and epoch-wise slow/meta update—the design choices that make skill editing behave like a controlled training loop.

Analysis and Transfer

Tables 4(a)–4(c) ask whether an optimized skill behaves like a reusable artifact rather than a task-specific prompt. We test three shifts: deploying a skill across model scales (Table 4(a)), moving it across execution harnesses (Table 4(b)), and applying it to nearby math benchmarks, including OlympiadBench [7] and Omni-MATH [6] (Table 4(c)). Table 5 then asks how much of the gain depends on optimizer capacity by replacing the frontier optimizer with a target-matched one of the same scale as the deployed model.

Cross-model transfer.

Table 4(a) is uniformly positive: every cross-model row shows a gain over the target's no-skill baseline. SpreadsheetBench skills trained with GPT–5.4 transfer to GPT–5.4 (+10.7), GPT–5.4-mini (+9.4), and GPT–5.4-nano (+3.0); LiveMath skills transfer to GPT–5.4 (+10.4), GPT–5.4-mini (+4.5), and GPT–5.4-nano (+5.6). On two of the six rows the transferred skill surpasses the in-domain SkillOpt reference (LiveMath GPT–5.4: 47.2 transferred vs. 44.0 in-domain; LiveMath GPT–5.4-nano: 28.8 transferred vs. 27.2 in-domain), suggesting that some learned procedures are target-model agnostic. The remaining rows still recover a useful fraction of the in-domain gain (e.g. SpreadsheetBench GPT–5.4 retains roughly half of it, +10.7 of +21.1), and no row falls below the target's no-skill baseline.

Cross-harness transfer.

The harness-shift rows in Table 4(b) are the clearest deployment signal. A SpreadsheetBench skill trained inside the Codex loop transfers to Claude Code with absolute gain +59.7 over the Claude Code no-skill baseline (22.1 → 81.8, slightly exceeding the in-domain Claude Code SkillOpt reference of 80.4), and the symmetric Claude Code → Codex transfer adds +43.6 on top of the Codex baseline (27.5 → 71.1). On LiveMath, the Codex → Claude Code transfer is smaller (+1.6 over a 40.8 baseline) but still positive, while the Claude Code → Codex transfer adds +12.8 (35.2 → 48.0). Because the two harnesses expose different tool/file APIs and command surfaces, these positive transfers suggest that the learned rules are not only harness-specific command recipes. In SpreadsheetBench especially, the transferred skill appears to encode workbook-level procedures such as structure-first inspection, formula-aware verification, and static-value materialization, so the cost of optimizing a skill in one execution environment can be amortized across related deployment environments.

Cross-benchmark transfer.

Cross-benchmark transfer is the strictest of the three shifts: source and target benchmarks share only the broad task family (math). On the OlympiadBench → Omni-MATH direction reported in Table 4(c), the transferred skill is positive on all three model scales we evaluate, with gains of +3.7 on GPT–5.4, +1.8 on GPT–5.4-mini, and +1.3 on GPT–5.4-nano. These gains are smaller than the in-domain and cross-harness transfers, which is unsurprising since they require the optimized skill to retain useful procedural knowledge after both the test instances and the answer-format conventions change, but they remain uniformly positive, supporting the intended interpretation that the optimized skill encodes reusable mathematical procedure rather than memorized benchmark-specific formatting.

Effect of optimizer strength.

Because the optimizer in SkillOpt runs only during the offline training loop and is never invoked at deployment, optimizer choice is a training-time lever: a stronger optimizer can improve the deployed skill without raising the inference cost of using that skill. The deployed artifact is still a static best_skill.md that calls only the target model. Table 5 quantifies this lever by running the same loop with two optimizer regimes (a strong frontier optimizer, GPT–5.5, and a target-matched optimizer that shares the target model) while holding the rollout batches, validation gate, bounded edit budget, rejected-edit buffer, and slow/meta update identical.

Two observations follow. First, the stronger optimizer produces larger absolute gains on every (benchmark, target) cell we test: GPT–5.4-nano lifts by +19.0 vs. +11.9 on SpreadsheetBench and +19.0 vs. +14.1 on SearchQA, and GPT–5.4-mini follows the same ordering (+11.4 vs. +7.1 on SpreadsheetBench, +4.3 vs. +2.4 on SearchQA). The bounded-edit, validation-gated loop is what makes this monotone: without the gate, a stronger optimizer could just as easily push larger but harmful rewrites. Second, the target-matched optimizer is far from collapsed: it recovers 56–74% of the strong-optimizer gain across the four cells, confirming that SkillOpt is not a distillation pipeline from a stronger teacher into a weaker student. The optimization loop itself contributes substantial value on top of whatever the optimizer can already do. The practical implication is that a high-capacity frontier optimizer is the right default whenever it is available (it costs only training-time API calls and adds nothing to deployment), while the same loop remains effective if the budget forces a target-matched optimizer instead.
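
The 56–74% recovery range follows directly from the four pairs of gains quoted above. The short check below recomputes it; the dictionary layout and variable names are illustrative, and the numbers are copied from the text rather than from Table 5 itself.

```python
# (strong-optimizer gain, target-matched gain) in points, as quoted in the text.
gains = {
    ("SpreadsheetBench", "GPT-5.4-nano"): (19.0, 11.9),
    ("SearchQA", "GPT-5.4-nano"): (19.0, 14.1),
    ("SpreadsheetBench", "GPT-5.4-mini"): (11.4, 7.1),
    ("SearchQA", "GPT-5.4-mini"): (4.3, 2.4),
}

# Fraction of the strong-optimizer gain recovered by the target-matched optimizer.
recovery = {cell: matched / strong for cell, (strong, matched) in gains.items()}

if __name__ == "__main__":
    for (benchmark, target), frac in recovery.items():
        print(f"{benchmark:17s} {target:13s} {frac:.0%}")
    print(f"range: {min(recovery.values()):.0%} to {max(recovery.values()):.0%}")
```

The smallest fraction comes from SearchQA on GPT–5.4-mini (2.4 of 4.3 points) and the largest from SearchQA on GPT–5.4-nano (14.1 of 19.0 points).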

Table 5: Effect of optimizer strength. Each (benchmark, target) pair is optimized either by a strong frontier optimizer (GPT–5.5, bolded) or by a target-matched optimizer that shares the target model; everything else in the SkillOpt loop is held fixed. Gains over the target's no-skill baseline are shown as small green subscripts; the same baseline is used for both optimizer settings within a row. The optimizer runs only during offline training, so the stronger-optimizer column adds zero cost at deployment.

Learned Skills: Compactness, Cost, and Examples

A central premise of SkillOpt is that the trainable object should remain a small, inspectable text document. Tables 1–5 demonstrate that the optimizer is effective; this subsection asks what its output actually looks like and what it costs. We characterize the learned artifact on three axes (compactness, edit economy, and cost-per-point) and then show one representative learned rule per benchmark to illustrate what kind of procedural knowledge survives the bounded-update loop.

Table 6: Cost and edit economy of the GPT–5.5 / GPT–5.5 (student / teacher) skill runs. Initial and final best_skill.md lengths are in tokens; Edits is the number of accepted bounded updates; Cost / pt is training tokens per absolute test-point gain. One representative learned rule per benchmark is shown in Figure 4.

Compactness.

The final skills are uniformly small. Across the six benchmarks in Table 6, the final best_skill.md ranges from 379 tokens (LiveMathematicianBench) to 1,995 tokens (SpreadsheetBench), with a median of roughly 920 tokens. Even the longest learned skill is well below a typical system-prompt budget for modern frontier models, and the shortest one fits inside a single screen. The growth from initial to final skill is moderate (×2.5 to ×53 depending on whether the initial skill was a one-liner or a paragraph), but the final size in absolute tokens stays small enough that a domain practitioner can read, audit, and edit the deployed artifact in minutes.

Edit economy.

A second striking pattern is that the gains come from very few accepted edits. Across the six benchmarks, the number of edits actually committed to best_skill.md during optimization is between 1 and 4 (median 2.5). LiveMathematicianBench's +29.3 point gain over no skill arises from a single accepted edit, and OfficeQA's +39.0 point gain similarly arises from one accepted edit. This is direct evidence that the validation gate is doing real work: the optimizer model proposes many more edits per epoch, but only a handful pass the held-out check and survive into the deployed skill. The bulk of the optimizer's text-space search is thus rejected, captured by the rejected-edit buffer (Section 3.5) for future use, and never reaches the target model. The deployed skill is correspondingly compact rather than the union of every reflection.

Cost per point of test-set gain.

The training-token column quantifies the cost of operating the loop. Two regimes are visible. Procedural benchmarks where rollouts are short and cheap (SpreadsheetBench, OfficeQA, LiveMathematicianBench) reach 0.6–3.6M training tokens per absolute test-set point, even though the absolute gains on these benchmarks are the largest (e.g. +39.0 points on OfficeQA at 1.1M tokens / point, total 20.8M tokens). Benchmarks with longer trajectories or richer multimodal context, namely SearchQA (37.9M / pt) and DocVQA (46.4M / pt), cost an order of magnitude more per point. The important deployment distinction is that this cost is paid once during skill training; after export, the optimized best_skill.md adds no optimizer calls, no weight updates, and only a compact text artifact to the target agent.

What does a learned skill actually say?

Figure 4 reproduces one representative learned rule per benchmark, taken verbatim from the final best_skill.md of each case study in Table 6. Three observations stand out. First, the rules are procedural rather than instance-specific: none of them name a specific question, file, or entity. Second, they consistently encode the discipline that frontier models lack zero-shot: answer-format constraints (OfficeQA, LiveMathematicianBench), evidence binding to a specific visual region (DocVQA), workbook-structure-first reasoning (SpreadsheetBench), search-frontier discipline (ALFWorld), and canonical-entity choice (SearchQA). Third, they read like rules a thoughtful human practitioner would write after a day with the benchmark, except that they are produced automatically by the optimizer and validated edit-by-edit on held-out data.

SearchQA. “Infer the expected answer type from clue wording, then choose the shortest canonical entity supported by co-occurring distinctive evidence.”

SpreadsheetBench. “Inspect workbook structure and formulas, then write evaluated static values across the full requested target range instead of relying on Excel recalculation.”

OfficeQA. “Treat oracle parsed pages as primary evidence, lock table/date/unit context, and output exactly the requested rounded value without extra labels.”

DocVQA. “For tables, forms, charts, and legends, first bind the question to the exact visual row/header/field, then copy only the aligned answer span.”

LiveMathematicianBench. “In strongest-statement MCQs, rank choices by theorem strength and prefer a justified stronger-result option over true but weaker corollaries.”

ALFWorld. “Keep a horizon-aware visited/frontier ledger, diversify search after repeated same-type failures, and avoid revisiting the destination until holding the target.”

Figure 4: Representative learned rules, one per benchmark, extracted from the final best_skill.md of the GPT–5.5 / GPT–5.5 runs in Table 6. Each rule is verbatim from the deployed skill. Notably, every rule is procedural rather than instance-specific, and several encode forms of discipline (answer formatting, evidence binding, search-frontier management) that frontier models do not apply zero-shot.

Implications.

Together, the four observations above support a stronger version of the central claim. Compactness (< 2,000 tokens) and edit economy (1–4 accepted edits) mean the deployed artifact is interpretable. Cost-per-point (0.6M–46.4M tokens / point) shows that the training cost is measurable and paid before deployment. The shape of the learned rules (procedural, generalizable, and consistent with what a thoughtful human practitioner would write) is evidence that text-space optimization with bounded updates and validation gating discovers transferable procedural knowledge rather than merely overfitting to the training split. This complements the cross-model, cross-harness, and cross-benchmark transfer evidence in Section 4.3: the artifact transfers because many of the rules it encodes are intrinsically transferable.

Qualitative Skill Evolution

We inspect two representative runs to understand what the optimized skill actually learns. The ALFWorld case uses GPT–5.4-nano as the student and GPT–5.5 as the teacher, while the SpreadsheetBench case uses GPT–5.5 as both the frozen student and optimizer model. In both cases, SkillOpt does not replace the initial skill with an unrelated prompt. Instead, accepted edits add compact procedural constraints around recurring failure modes observed in rollout trajectories.

ALFWorld.

The initial ALFWorld skill gives a generic household plan: search for the target object, pick it up, transform it if needed, and place it at the destination. The accepted edits make this plan more stateful and less loop-prone. The optimized skill learns exact object-name matching, so related objects such as mugs, cups, pans, and pots are not substituted for one another. It adds visited-location memory, so unvisited receptacles and surfaces are preferred over repeatedly checking likely but exhausted locations. It also adds destination memory, pick-two progress locks, and direct completion rules: once the agent can clean, heat, cool, place, or otherwise complete the next subgoal, it should take that admissible action instead of examining, closing, or verifying again. Qualitatively, the skill evolves from a general search-transform-place strategy into a finite-state execution policy with object identity, search memory, progress locks, and loop breakers. In this representative run, the selected skill improves ALFWorld held-out test performance from 49.3 to 74.6.

SpreadsheetBench.

The initial SpreadsheetBench skill already instructs the agent to use Python spreadsheet libraries and preserve unrelated workbook content. The accepted edits turn this generic automation workflow into a workbook-forensics policy. The optimized skill learns to inspect the actual workbook rather than rely on previews, locate headers and target ranges across multiple sheets, normalize keys and cell types before lookup or aggregation, and preserve formatting during structural edits. It also adds a key rule for formula-style prompts: when the grader reads cell values, the agent should compute and write evaluated static values, even if the prompt mentions formulas such as INDEX/MATCH or XLOOKUP. Later edits further require filling complete target ranges, including currently blank result cells, keeping helper computations in Python rather than adding workbook artifacts, and reopening the saved workbook to check boundary rows and remaining blanks. In this representative run, the selected skill improves SpreadsheetBench held-out test performance from 40.4 to 78.9.

Conclusion

We presented SkillOpt, a text-space optimizer that treats an external skill document as the trainable state for frozen LLM agents. By separating the target model that executes tasks from the optimizer that edits skills, and by using bounded edit budgets, minibatch reflection, held-out validation gates, rejected-edit buffers, and epoch-wise slow/meta update, SkillOpt turns skill improvement into a controlled learning process rather than ad hoc prompt revision. Across six benchmarks, seven target models, and three execution modes, SkillOpt is best or tied-best on 52 of 52 evaluated cells, lifts GPT–5.5 by +23.5 points on average over no skill in direct chat and by +24.8 / +19.1 points under Codex and Claude Code harnesses, and beats the strongest per-cell baseline from human, LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills by +5.4 points on average. Per-benchmark case studies show that these gains arise from compact (< 2,000 token), interpretable skill artifacts assembled from only 1–4 accepted edits, and that the deployed skills transfer across model scales, harnesses, and nearby benchmarks. These results suggest that compact natural-language skills can serve as a practical domain-adaptation layer for frontier agents, enabling reusable improvement without modifying model weights.

Outlook.

SkillOpt optimizes a single skill artifact for a single target domain; natural extensions include skill libraries that share infrastructure across domains, reuse of optimizer-side meta skills across benchmarks, reward-free or preference-driven validation gates for open-ended tasks, and self-distillation of optimized skills back into the target model as a stepping stone toward weight-level adaptation. We hope that treating the skill itself as the trainable object, rather than as a side artifact of prompting, will let future work apply the full toolkit of optimization (learning rates, schedules, regularization, curricula, validation) to a part of the agent stack that has so far been hand-engineered.

References

[1] L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, et al. (2025) Gepa: reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457.
[2] (2026) Evoskill: automated skill discovery for multi-agent systems. arXiv preprint arXiv:2603.02766.
[3] Anthropic (2025) Claude code: an ai coding agent system. Accessed: 2026-05-06.
[4] M. Dunn, L. Sagun, M. Higgins, V. U. Guney, V. Cirik, and K. Cho (2017) Searchqa: a new q&a dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
[5] R. Fang, Y. Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang (2025) Memp: exploring agent procedural memory. arXiv preprint arXiv:2508.06433.
[6] B. Gao, F. Song, Z. Yang, Z. Cai, Y. Miao, Q. Dong, L. Li, C. Ma, L. Chen, R. Xu, Z. Tang, B. Wang, D. Zan, S. Quan, G. Zhang, L. Sha, Y. Zhang, X. Ren, T. Liu, and B. Chang (2024) Omni-math: a universal olympiad level mathematic benchmark for large language models. arXiv:2410.07985.
[7] C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, et al. (2024) OlympiadBench: a challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008.
[8] L. He, Q. Yu, H. Dong, B. Liao, X. Xu, M. Goldblum, J. Bian, and N. Mesgarani (2026) LiveMathematicianBench: a live benchmark for mathematician-level reasoning with proof sketches. arXiv:2604.01754.
[9] Y. He, J. Liu, Y. Liu, Y. Li, T. Cao, Z. Hu, X. Xu, and B. Hooi (2025) Evotest: evolutionary test-time learning for self-improving agentic systems. arXiv preprint arXiv:2510.13220.
[10] Y. Jiang, D. Li, H. Deng, B. Ma, X. Wang, Q. Wang, and G. Yu (2026) SoK: agentic skills–beyond tool use in llm agents. arXiv preprint arXiv:2602.20867.
[11] O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, et al. (2023) Dspy: compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.
[12] X. Li, W. Chen, Y. Liu, S. Zheng, X. Chen, Y. He, Y. Li, B. You, H. Shen, J. Sun, et al. (2026) SkillsBench: benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670.
[13] X. Liu, X. Luo, L. Li, G. Huang, J. Liu, and H. Qiao (2026) SkillForge: forging domain-specific, self-evolving agent skills in cloud technical support. arXiv preprint arXiv:2604.08618.
[14] Z. Ma, B. Zhang, J. Zhang, J. Yu, X. Zhang, X. Zhang, S. Luo, X. Wang, and J. Tang (2024) Spreadsheetbench: towards challenging real world spreadsheet manipulation. Advances in Neural Information Processing Systems 37, pp. 94871–94908.
[15] Z. Ma, S. Yang, Y. Ji, X. Wang, Y. Wang, Y. Hu, T. Huang, and X. Chu (2026) SkillClaw: let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377.
[16] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. (2023) Self-refine: iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36, pp. 46534–46594.
[17] M. Mathew, D. Karatzas, and C. Jawahar (2021) Docvqa: a dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 2200–2209.
[18] Q. Mi, Z. Ma, M. Yang, H. Li, Y. Wang, H. Zhang, and J. Wang (2026) ProcMEM: learning reusable procedural memory from experience via non-parametric ppo for llm agents. arXiv preprint arXiv:2602.01869.
[19] J. Ni, Y. Liu, X. Liu, Y. Sun, M. Zhou, P. Cheng, D. Wang, X. Jiang, and G. Jiang (2026) Trace2skill: distill trajectory-local lessons into transferable agent skills. arXiv preprint arXiv:2603.25158.
[20] OpenAI (2025) Codex: a cloud-based software engineering agent. Accessed: 2026-05-06.
[21] OpenAI (2026-03) Introducing GPT-5.4.
[22] K. Opsahl-Ong, A. Singhvi, J. Collins, I. Zhou, C. Wang, A. Baheti, O. Oertell, J. Portes, S. Havens, E. Elsen, et al. (2026) Officeqa pro: an enterprise benchmark for end-to-end grounded reasoning. arXiv preprint arXiv:2603.08655.
[23] L. Qiu, Z. Gao, J. Chen, Y. Ye, W. Huang, X. Xue, W. Qiu, and S. Tang (2026) AutoRefine: from trajectories to reusable expertise for continual llm agent refinement. arXiv preprint arXiv:2601.22758.
[24] Qwen Team (2026-02) Qwen3.5: towards native multimodal agents.
[25] Qwen Team (2026-04) Qwen3.6-35B-A3B: agentic coding power, now open to all.
[26] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023) Toolformer: language models can teach themselves to use tools. Advances in Neural Information Processing Systems 36, pp. 68539–68551.
[27] S. Shen, W. Cheng, M. Ma, A. Turcan, M. J. Zhang, and J. Ma (2026) SKILLFOUNDRY: building self-evolving agent skill libraries from heterogeneous scientific resources. arXiv preprint arXiv:2604.03964.
[28] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, pp. 8634–8652.
[29] M. Shridhar, X. Yuan, M. Cote, Y. Bisk, A. Trischler, and M. Hausknecht (2021) ALFWorld: aligning text and embodied environments for interactive learning. In International Conference on Learning Representations.
[30] W. Song, J. Yue, and Z. Pang (2026) ABSTRAL: automatic design of multi-agent systems through iterative refinement and topology optimization. arXiv preprint arXiv:2603.22791.
[31] C. Wang, Z. Yu, X. Xie, W. Yao, R. Fang, S. Qiao, K. Cao, G. Zheng, X. Qi, P. Zhang, et al. (2026) SkillX: automatically constructing skill knowledge bases for agents. arXiv preprint arXiv:2604.04804.
[32] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023) Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
[33] J. Wang, Q. Yan, Y. Wang, Y. Tian, S. S. Mishra, Z. Xu, M. Gandhi, P. Xu, and L. L. Cheong (2025) Reinforcement learning for self-improving agent with skill library. arXiv preprint arXiv:2512.17102.
[34] R. Wu, X. Wang, J. Mei, P. Cai, D. Fu, C. Yang, L. Wen, X. Yang, Y. Shen, Y. Wang, et al. (2025) Evolver: self-evolving llm agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079.
[35] P. Xia, J. Chen, H. Wang, J. Liu, K. Zeng, Y. Wang, S. Han, Y. Zhou, X. Zhao, H. Chen, et al. (2026) Skillrl: evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234.
[36] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen (2023) Large language models as optimizers. In The Twelfth International Conference on Learning Representations.
[37] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press (2024) Swe-agent: agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37, pp. 50528–50652.
[38] Y. Yang, J. Li, Q. Pan, B. Zhan, Y. Cai, L. Du, J. Zhou, K. Chen, Q. Chen, X. Li, et al. (2026) Autoskill: experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145.
[39] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2022) React: synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
[40] M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, Z. Huang, C. Guestrin, and J. Zou (2024) Textgrad: automatic “differentiation” via text. arXiv preprint arXiv:2406.07496.
[41] H. Zhang, S. Fan, H. P. Zou, Y. Chen, Z. Wang, J. Zhou, C. Li, W. Huang, Y. Yao, K. Zheng, et al. (2026) EvoSkills: self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687.

Appendix A Additional Method Details and Optimizer Prompts

This appendix gives the executable details behind SkillOpt. The optimization loop keeps the task-execution model fixed and trains only a text skill document. A separate optimizer model reads rollout evidence, proposes patch-style edits, merges and ranks the edits, and submits each candidate skill to a held-out selection gate. The task-execution model only receives the current skill and the benchmark task; it does not see the optimizer prompts below.

Appendix B Limitations

SkillOpt studies skill optimization as a lightweight alternative to model-weight adaptation, but it still has several practical limitations. First, the optimization loop relies on scored trajectories and a held-out selection split, so it is most directly applicable when the target task has automatic verifiers, exact-match metrics, executable checks, or otherwise reliable feedback signals. For open-ended domains where success is subjective, multi-dimensional, or costly to judge, the validation gate may require stronger human or model-based evaluation. Second, although the deployed artifact is only a compact best_skill.md, training the skill requires additional rollout computation and calls to an optimizer model; this cost is amortized when the same skill is reused, but may be less attractive for one-off tasks. Third, SkillOpt intentionally optimizes a single portable skill rather than growing a large skill library or changing model weights. This design improves deployment simplicity, but a single skill may be insufficient for highly heterogeneous domains that require many disjoint procedures. Finally, optimized skills can encode domain-specific heuristics from the training distribution, so careful held-out evaluation remains necessary before transferring them to substantially different models, harnesses, or task settings.

Appendix C Experimental Protocol Details

Benchmarks and metrics.

We use each benchmark’s native evaluator and report hard success or exact-match accuracy on held-out test examples. SearchQA measures extractive question answering; SpreadsheetBench evaluates spreadsheet-oriented code and tool use; OfficeQA and DocVQA test local-document and multimodal-document reasoning; SealQA stresses noisy retrieval; LiveMathematicianBench evaluates mathematical multiple-choice reasoning; and ALFWorld tests sequential decision making. Dataset-backed benchmarks use deterministic train/selection/test splits, with a default 2:1:7 split when no benchmark-specific split is stated. The selection split is used only for model selection over candidate skills; all headline scores are computed on held-out test data.
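A deterministic 2:1:7 split can be sketched as below. This is only an illustration: the function name `split_examples`, the seed, and the seeded-shuffle approach are assumptions, since the text states the ratio but not how the splits are generated.

```python
import random

def split_examples(examples, ratios=(2, 1, 7), seed=0):
    """Deterministically partition examples into train/selection/test by ratio."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded, so the split is reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_sel = len(items) * ratios[1] // total
    train = items[:n_train]
    sel = items[n_train:n_train + n_sel]
    test = items[n_train + n_sel:]  # the remainder goes to the test split
    return train, sel, test

train, sel, test = split_examples(range(100))
print(len(train), len(sel), len(test))  # 20 10 70
```

Because the shuffle is seeded, repeated calls return the same partition, which keeps the selection split disjoint from the test data across runs.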

Baselines.

The no-skill baseline evaluates the frozen student without an optimized skill document. Human-skill and LLM-skill baselines use manually written and one-shot generated skills under the same evaluation protocol. Trace2Skill mines skill artifacts from training trajectories and evaluates the frozen student without SkillOpt’s iterative validation gate. TextGrad and GEPA are reflective prompt-optimization baselines for direct-chat settings. EvoSkill is included for the harness-backed comparison where a matched completed run is available. Entries not measured under the final aligned protocol are marked as "–" rather than mixed with incompatible runs.

Optimization protocol.

Unless otherwise stated, SkillOpt runs for four epochs with rollout batch size 40, reflection minibatch size 8, textual learning rate 4, cosine learning-rate decay with minimum rate 2, held-out validation gating, slow update enabled with 20 sampled examples, and optimizer-side meta skill enabled. The optimizer analyzes successes and failures separately, proposes patch-style skill edits, merges duplicate or contradictory proposals, ranks edits under the current learning-rate cap, and applies the selected edits to form a candidate skill. The candidate is evaluated on the selection split and is accepted only if it improves the current selection score; the best accepted skill is exported as best_skill.md. The student model, backend, harness, and benchmark evaluator remain fixed during optimization.
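One way to read "textual learning rate 4, cosine decay with minimum rate 2" is as an integer edit budget that follows a half-cosine from 4 down to 2. The sketch below is an assumption, not the paper's formula: the exact schedule, the rounding rule, and whether the budget decays per step or per epoch are not stated in this section, and `edit_budget` is a hypothetical name.

```python
import math

def edit_budget(step, total_steps, max_edits=4, min_edits=2):
    """Cosine-decayed edit budget: max_edits at step 0, min_edits at the last step."""
    if total_steps <= 1:
        return max_edits
    progress = step / (total_steps - 1)                    # runs from 0.0 to 1.0
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))    # runs from 1.0 to 0.0
    return round(min_edits + (max_edits - min_edits) * cosine)

schedule = [edit_budget(t, 8) for t in range(8)]
print(schedule)  # starts at 4, never increases, ends at 2
```

Under this reading, early steps may apply up to four edits and late steps at most two, matching the stated intent of larger early changes and smaller late refinements.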

Ablation protocol.

One-factor ablations vary a single scalar or component while holding the remaining optimizer configuration fixed. The train-size ablation fixes the train/selection/test split to 2:1:7 and varies how much of the training partition is exposed to the optimizer. The 100% row uses the full training partition under the same split, so it is directly comparable to the smaller-subset rows. Component ablations remove or alter one mechanism at a time, including the edit budget, rejected-edit buffer, and epoch-wise slow/meta update.

Optimization Procedure

Algorithm 1 expands the procedure used in the experiments. The central state variables are the current skill $s_{\mathrm{cur}}$, the best validation-gated skill $s_{\mathrm{best}}$, a selection-score cache $\mathcal{C}$, a step buffer $\mathcal{B}$ containing rejected edits and observed failure patterns, and an optimizer-side meta skill $m_{\mathrm{meta}}$ used only to guide future edit generation.

Algorithm 1 SkillOpt skill optimization
1: Require: Frozen training model $M$, optimizer model $O$, harness $h$, splits $D_{\mathrm{train}}, D_{\mathrm{sel}}, D_{\mathrm{test}}$, initial skill $s_{0}$, epochs $E$, edit-budget schedule $L_{t}$, rollout batch size $B$, accumulation factor $A$, reflection minibatch size $B_{m}$
2: Ensure: Best validation-gated skill $s_{\mathrm{best}}$ and held-out test score
3: $s_{\mathrm{cur}} \leftarrow s_{0}$, $s_{\mathrm{best}} \leftarrow s_{0}$, $\mathcal{C} \leftarrow \emptyset$, $\mathcal{B} \leftarrow [\ ]$, $m_{\mathrm{meta}} \leftarrow \emptyset$
4: $\mathrm{score}_{\mathrm{cur}} \leftarrow \textsc{Evaluate}(M, h, s_{0}, D_{\mathrm{sel}})$; $\mathrm{score}_{\mathrm{best}} \leftarrow \mathrm{score}_{\mathrm{cur}}$
5: $\mathcal{C}[\textsc{Hash}(s_{0})] \leftarrow \mathrm{score}_{\mathrm{cur}}$
6: for $e = 1$ to $E$ do
7:   Shuffle $D_{\mathrm{train}}$ into rollout batches; reset $\mathcal{B} \leftarrow [\ ]$
8:   for each optimization step in epoch $e$ do
9:     Collect $A$ rollout batches by executing $h(M, x, s_{\mathrm{cur}})$ for sampled tasks $x$
10:     Split rollout evidence into failures and successes, then into minibatches of size $B_{m}$
11:     Ask $O$ to analyze failure minibatches and produce failure patch proposals
12:     Ask $O$ to analyze success minibatches and produce success patch proposals
13:     Ask $O$ to merge failure proposals, merge success proposals, and perform a final failure-prioritized merge
14:     Ask $O$ to rank merged edits and keep at most $L_{t}$ edits
15:     Apply the selected edits to obtain a candidate skill $\tilde{s}$
16:     if $\textsc{Hash}(\tilde{s}) \in \mathcal{C}$ then
17:       $\mathrm{score}_{\mathrm{cand}} \leftarrow \mathcal{C}[\textsc{Hash}(\tilde{s})]$
18:     else
19:       $\mathrm{score}_{\mathrm{cand}} \leftarrow \textsc{Evaluate}(M, h, \tilde{s}, D_{\mathrm{sel}})$
20:       $\mathcal{C}[\textsc{Hash}(\tilde{s})] \leftarrow \mathrm{score}_{\mathrm{cand}}$
21:     end if
22:     if $\mathrm{score}_{\mathrm{cand}} > \mathrm{score}_{\mathrm{cur}}$ then
23:       $s_{\mathrm{cur}} \leftarrow \tilde{s}$; $\mathrm{score}_{\mathrm{cur}} \leftarrow \mathrm{score}_{\mathrm{cand}}$
24:       if $\mathrm{score}_{\mathrm{cand}} > \mathrm{score}_{\mathrm{best}}$ then
25:         $s_{\mathrm{best}} \leftarrow \tilde{s}$; $\mathrm{score}_{\mathrm{best}} \leftarrow \mathrm{score}_{\mathrm{cand}}$
26:       end if
27:     else
28:       Add rejected edits and observed failure patterns to $\mathcal{B}$
29:     end if
30:   end for
31:   if $e \geq 2$ and slow update is enabled then
32:     Compare the same sampled tasks under the previous and current epoch-end skills
33:     Ask $O$ for protected longitudinal guidance; validate the injected guidance through $D_{\mathrm{sel}}$
34:   end if
35:   if $e \geq 2$ and optimizer memory is enabled then
36:     Ask $O$ to update $m_{\mathrm{meta}}$ for future edit generation and selection
37:   end if
38: end for
39: $\mathrm{score}_{\mathrm{test}} \leftarrow \textsc{Evaluate}(M, h, s_{\mathrm{best}}, D_{\mathrm{test}})$
40: return $s_{\mathrm{best}}$, $\mathrm{score}_{\mathrm{test}}$
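The cache-and-gate portion of the procedure (the hash lookup, the selection-score comparison, and the rejected-edit buffer) can be sketched in a few lines. Everything here is illustrative: `skill_hash`, `init_state`, `gate_step`, the dictionary layout, and the toy evaluator are assumptions, and the buffer stores the rejected candidate text as a stand-in for the rejected edits and failure patterns the algorithm describes.

```python
import hashlib

def skill_hash(skill):
    """Content hash used as the cache key for a skill document."""
    return hashlib.sha256(skill.encode("utf-8")).hexdigest()

def init_state(initial_skill, evaluate):
    """Score the initial skill once and seed the current/best state and cache."""
    score = evaluate(initial_skill)
    return {
        "cur": initial_skill, "score_cur": score,
        "best": initial_skill, "score_best": score,
        "cache": {skill_hash(initial_skill): score},
        "buffer": [],
    }

def gate_step(state, candidate, evaluate):
    """Accept the candidate only if it strictly improves the current selection score."""
    key = skill_hash(candidate)
    if key not in state["cache"]:          # evaluate each distinct skill only once
        state["cache"][key] = evaluate(candidate)
    score = state["cache"][key]
    if score > state["score_cur"]:
        state["cur"], state["score_cur"] = candidate, score
        if score > state["score_best"]:
            state["best"], state["score_best"] = candidate, score
        return True
    state["buffer"].append(candidate)      # remember what was rejected
    return False

# Toy evaluator (assumption): the score is how often the word "rule" appears.
evaluated = []

def toy_evaluate(skill):
    evaluated.append(skill)
    return skill.count("rule")

state = init_state("rule", toy_evaluate)
accepted = [gate_step(state, s, toy_evaluate) for s in ("rule rule", "none", "none")]
print(accepted, state["best"], len(evaluated))  # [True, False, False] rule rule 3
```

The second `"none"` candidate is rejected from the cache without a new evaluation, which is the purpose of hashing candidates before scoring them.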

Optimizer Prompt Contracts

The following blocks reproduce the operational prompt contracts used by the optimizer model, with terminology normalized to the paper’s optimizer/training-model framing. The prompts require JSON outputs so that edits can be parsed, filtered, applied, and validated without manual intervention.

Failure analysis: analyst_error.md

You are an expert failure-analysis agent for AI agent tasks.

You will be given MULTIPLE failed agent trajectories from a single minibatch
and the current skill document.
Your job is to identify the most important COMMON failure patterns across
the batch and propose a concise set of skill edits.

## Analysis Process
1. Read ALL trajectories in the minibatch.
2. Identify the most prevalent, systematic failure patterns across them.
3. For each pattern, classify its failure type.
4. Propose skill edits that address the COMMON patterns, not individual edge cases.
5. Edits must be generalizable; do not hardcode task-specific values.
6. Only patch gaps in the skill; do not duplicate existing content.

You will be told the maximum number of edits (the budget L). Produce AT MOST L edits,
focusing on the highest-impact patterns. You may produce fewer if warranted.

Respond ONLY with a valid JSON object (no markdown fences, no extra text):
{
  "batch_size": <number of trajectories analysed>,
  "failure_summary": [
    {"failure_type": "<type>", "count": <int>, "description": "<one-line>"}
  ],
  "patch": {
    "reasoning": "<why these edits address the batch’s common failures>",
    "edits": [
      {"op": "append",       "content": "<markdown to add at end of skill>"},
      {"op": "insert_after", "target": "<exact heading/text to insert after>",
       "content": "<markdown>"},
      {"op": "replace",      "target": "<exact text to replace>",
       "content": "<replacement>"},
      {"op": "delete",       "target": "<exact text to remove>"}
    ]
  }
}
Only include edits that are needed. "edits" can be an empty list if no patch is warranted.

IMPORTANT: The skill document may contain a section between
<!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers.
This is a PROTECTED section managed by a separate slow-update process.
Do NOT propose any edits that target, modify, or delete content within these markers.
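Since the analyst must reply with JSON only, the reply can be parsed and filtered mechanically before any edit is applied. The sketch below is a hypothetical parser, not the paper's code: `parse_patch` and `REQUIRED_FIELDS` are assumed names, and the field requirements are inferred from the schema shown in the prompt.

```python
import json

# Fields each operation needs, inferred from the schema in the prompt above.
REQUIRED_FIELDS = {
    "append": ("content",),
    "insert_after": ("target", "content"),
    "replace": ("target", "content"),
    "delete": ("target",),
}

def parse_patch(raw, budget):
    """Parse an analyst reply and return at most `budget` well-formed edits."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return []                          # malformed reply: propose nothing
    patch = reply.get("patch") if isinstance(reply, dict) else None
    edits = patch.get("edits", []) if isinstance(patch, dict) else []
    if not isinstance(edits, list):
        return []
    valid = []
    for edit in edits:
        if not isinstance(edit, dict):
            continue
        op = edit.get("op")
        if not isinstance(op, str) or op not in REQUIRED_FIELDS:
            continue                       # unknown operation
        if all(isinstance(edit.get(f), str) and edit[f] for f in REQUIRED_FIELDS[op]):
            valid.append(edit)
    return valid[:budget]

raw = json.dumps({
    "batch_size": 3,
    "failure_summary": [],
    "patch": {"reasoning": "example", "edits": [
        {"op": "append", "content": "Check units before answering."},
        {"op": "replace", "target": "old rule"},        # missing content: dropped
        {"op": "delete", "target": "stale rule"},
        {"op": "rewrite", "content": "x"},              # unknown op: dropped
    ]},
})
print([e["op"] for e in parse_patch(raw, budget=4)])  # ['append', 'delete']
```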

Success analysis: analyst_success.md

You are an expert success-pattern analyst for AI agents.

You will be given MULTIPLE successful agent trajectories from a single minibatch
and the current skill document. Your job is to identify generalizable behavior
patterns that are COMMON across the batch and worth encoding in the skill.

## Rules
- Only propose patches for patterns NOT already covered in the skill.
- Focus on patterns that appear across MULTIPLE trajectories in the batch.
- Be concise. Patterns must generalize beyond specific tasks.
- Prefer reinforcing existing sections over adding new top-level sections.

You will be told the maximum number of edits (the budget L). Produce AT MOST L edits,
focusing on the most broadly applicable patterns. You may produce fewer if warranted.

Respond ONLY with a valid JSON object:
{
  "batch_size": <number of trajectories analysed>,
  "success_patterns": ["<pattern 1>", "<pattern 2>"],
  "patch": {
    "reasoning": "<why these patterns are worth encoding>",
    "edits": [
      {"op": "append",       "content": "<markdown>"},
      {"op": "insert_after", "target": "<heading/text>", "content": "<markdown>"},
      {"op": "replace",      "target": "<old text>",     "content": "<new text>"},
      {"op": "delete",       "target": "<exact text to remove>"}
    ]
  }
}
"edits" may be empty if the skill already covers all observed patterns.

IMPORTANT: The skill document may contain a section between
<!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers.
This is a PROTECTED section managed by a separate slow-update process.
Do NOT propose any edits that target, modify, or delete content within these markers.

Failure merge: merge_failure.md

You are a skill-edit coordinator. You receive multiple independently-proposed patches
from FAILURE analysis of agent trajectories. Merge them into ONE coherent,
non-redundant patch.

Merge guidelines:
1. Deduplicate: keep the best-worded version of similar edits.
2. Resolve conflicts: if patches contradict on the same point,
   choose the one with stronger justification or synthesize both.
3. Preserve unique insights: include all non-redundant corrective edits.
4. Prevalent-pattern bias: edits appearing consistently across multiple patches
   address systematic failures; preserve them with HIGH priority.
   Edits from only one patch may be discarded if task-specific.
5. Independence: no two edits in the merged patch may target the same text region.
6. Support count: for each merged edit, estimate how many source patches support it.
7. PROTECTED SECTION: The skill may contain a section between
   <!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers.
   Do NOT merge or produce any edits that target content within these markers.

Respond ONLY with a valid JSON object:
{
  "reasoning": "<summary of key consolidation decisions>",
  "edits": [
    {
      "op": "append|insert_after|replace|delete",
      "target": "<if insert_after or replace or delete>",
      "content": "<markdown>",
      "support_count": <integer>,
      "source_type": "failure"
    }
  ]
}

Success merge: merge_success.md

You are a skill-edit coordinator. You receive multiple independently-proposed patches
from SUCCESS analysis of agent trajectories. Merge them into ONE coherent patch
that reinforces effective patterns.

Merge guidelines:
1. Deduplicate: keep only the most generalizable version of similar patterns.
2. Be conservative: success-driven patches reinforce existing behavior.
   Only include edits for patterns NOT already in the skill.
3. Prevalent-pattern bias: patterns seen across many successful trajectories
   are most worth encoding.
4. Support count: estimate how many source patches support each merged edit.
5. PROTECTED SECTION: The skill may contain a section between
   <!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers.
   Do NOT merge or produce any edits that target content within these markers.

Respond ONLY with a valid JSON object:
{
  "reasoning": "<summary>",
  "edits": [
    {
      "op": "append|insert_after|replace|delete",
      "target": "<if needed>",
      "content": "<markdown>",
      "support_count": <integer>,
      "source_type": "success"
    }
  ]
}

Final merge: merge_final.md

You are a skill-edit coordinator performing the FINAL merge. You receive two
pre-merged patch groups:
1. Failure-driven patches (corrective, high priority)
2. Success-driven patches (reinforcement, lower priority)

Merge guidelines:
1. FAILURE PATCHES TAKE PRIORITY: the primary goal of skill reflection is to
   fix failures. Failure-driven edits should be preserved unless they directly
   conflict with a well-supported success pattern.
2. Deduplicate: if a failure edit and success edit cover the same point,
   keep the failure version.
3. Preserve success insights: include success edits that cover patterns
   NOT addressed by failure edits.
4. Higher-level merges represent broader consensus: edits that survived
   previous merge rounds should be given priority.
5. Carry forward support_count and source_type for each edit.
6. PROTECTED SECTION: The skill may contain a section between
   <!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers.
   Do NOT merge or produce any edits that target content within these markers.

Respond ONLY with a valid JSON object:
{
  "reasoning": "<summary of priority decisions>",
  "edits": [
    {
      "op": "append|insert_after|replace|delete",
      "target": "<if needed>",
      "content": "<markdown>",
      "support_count": <integer>,
      "source_type": "failure|success"
    }
  ]
}

Ranking and selection: ranking.md

You are an expert edit-ranking optimizer for a skill optimization system. You receive
a skill document and a pool of proposed edits. Your job is to RANK the edits by
importance and select the top ones.

Ranking criteria (in order of priority):
1. Systematic impact: edits that address widespread, recurring failure patterns
   across many tasks should rank highest. A rule that fixes 50% of failures beats
   one that fixes a single edge case.
2. Complementarity: edits that fill gaps in the current skill, not duplicate
   existing content, rank higher.
3. Generality: edits phrased as general principles rank higher than those
   tied to specific question types or entities.
4. Actionability: edits with clear, concrete guidance rank higher than vague advice.

You will be told how many edits to select (the budget).

Respond ONLY with a valid JSON object:
{
  "reasoning": "<brief justification for your ranking decisions>",
  "selected_indices": [<0-based indices of the top edits, in priority order>]
}
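The ranking reply only names indices, so the caller still has to turn `selected_indices` into an edit list that respects the budget. A minimal sketch follows, assuming a hypothetical helper `select_edits` that ignores out-of-range, duplicate, or non-integer indices; the paper does not describe how such cases are handled.

```python
def select_edits(edits, selected_indices, budget):
    """Return the ranked edits in priority order, capped at `budget`."""
    chosen, seen = [], set()
    for idx in selected_indices:
        if len(chosen) >= budget:
            break                                   # the budget caps applied edits
        if isinstance(idx, bool) or not isinstance(idx, int):
            continue                                # ignore non-integer indices
        if 0 <= idx < len(edits) and idx not in seen:
            seen.add(idx)
            chosen.append(edits[idx])
    return chosen

pool = ["edit-a", "edit-b", "edit-c", "edit-d"]
print(select_edits(pool, [2, 2, 9, 0, 1], budget=2))  # ['edit-c', 'edit-a']
```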

Slow update: slow_update.md

You are a strategic skill advisor for an AI agent optimization system.

Your role is different from the per-step analyst. The per-step analyst sees
individual trajectories and proposes local patches. YOU see how the skill has
evolved across an entire epoch by comparing the SAME tasks under two consecutive
skill versions. This longitudinal view lets you identify systemic drift,
regressions, and persistent blind spots that step-level edits cannot catch.

## What You Receive

1. Previous epoch’s skill and current epoch’s skill, to see what changed.
2. Longitudinal comparison: the same 20 training tasks rolled out under both skills,
   categorized into regressions, persistent failures, improvements, and stable successes.
3. Previous slow update guidance, if any: the guidance written at the end of the
   last epoch.

## Your Process

1. Reflect on the previous guidance, if provided:
   - Which parts of the previous guidance were effective?
   - Which parts failed or backfired?
   - Were there blind spots the previous guidance missed entirely?

2. Write updated guidance that:
   - Retains and strengthens parts of the previous guidance that proved effective.
   - Revises or removes parts that were ineffective or counterproductive.
   - Adds new instructions to address newly observed regressions and persistent failures.

## Output Requirements

Write a strategic guidance block that will OVERWRITE the previous guidance
in the protected section of the skill document. This section is READ-ONLY to
all subsequent step-level optimization; only this epoch-boundary process can
overwrite it at the next epoch boundary.

Your guidance must:
- Be written as direct, actionable instructions to the training model.
- Prioritize: (1) preventing regressions, (2) fixing persistent failures,
  (3) reinforcing successful patterns.
- NOT duplicate content already in the main skill body; complement it.
- Address the training model directly, for example: "When you encounter X, always do Y."

Respond ONLY with a valid JSON object:
{
  "reasoning": "<reflection on previous guidance AND analysis of longitudinal comparison>",
  "slow_update_content": "<the exact guidance text to insert into the protected section>"
}
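The longitudinal comparison that this prompt receives sorts the same tasks into four groups based on their outcome under the previous and current skills. A sketch of that bucketing, assuming each outcome is a boolean keyed by task id (the function name `categorize` and the data layout are hypothetical):

```python
def categorize(prev_success, cur_success):
    """Bucket tasks by outcome under the previous vs. the current epoch-end skill."""
    buckets = {
        "regressions": [],          # succeeded before, fails now
        "persistent_failures": [],  # failed before, still fails
        "improvements": [],         # failed before, succeeds now
        "stable_successes": [],     # succeeded before and now
    }
    # Only tasks rolled out under both skills are comparable.
    for task_id in sorted(prev_success.keys() & cur_success.keys()):
        before, after = prev_success[task_id], cur_success[task_id]
        if before and not after:
            buckets["regressions"].append(task_id)
        elif not before and not after:
            buckets["persistent_failures"].append(task_id)
        elif not before and after:
            buckets["improvements"].append(task_id)
        else:
            buckets["stable_successes"].append(task_id)
    return buckets

prev = {"t1": True, "t2": False, "t3": False, "t4": True}
cur = {"t1": False, "t2": False, "t3": True, "t4": True}
print(categorize(prev, cur))
```

For the toy inputs above, `t1` is a regression, `t2` a persistent failure, `t3` an improvement, and `t4` a stable success.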

Optimizer memory: meta_skill.md

You are an optimizer coach for an AI agent skill optimization system.

Your job is not to solve tasks directly and not to write training-model-facing
skill rules. Your job is to write a compact optimizer-side meta skill that helps
future optimizer calls produce better skill edits in this environment.

## What You Receive

1. The previous epoch’s last-step skill.
2. The current epoch’s last-step skill.
3. A longitudinal comparison on the SAME sampled tasks under those two skills.
4. The previous optimizer memory, if one existed.

## Your Goal

Write a concise optimizer memory that improves future optimizer behavior in stages
such as failure analysis, success analysis, patch merging, and edit ranking.

This optimizer memory should capture things like:
- Which kinds of edits tend to help in this environment.
- Which kinds of edits tend to be too vague, redundant, brittle, or harmful.
- What level of abstraction works best for rules here.
- What failure-repair patterns should be prioritized.
- What regression risks future optimizer calls should guard against.

## Important Constraints

- Address the FUTURE OPTIMIZER directly, not the training model.
- Focus on how to write better edits and organize better skill updates.
- Use evidence from the adjacent-epoch comparison, not generic advice.
- Keep it compact and high-signal. Prefer a few durable principles.
- Revise or remove parts of the previous optimizer memory if they did not help.
- Do not output training-model-facing task instructions.

Respond ONLY with a valid JSON object:
{
  "reasoning": "<brief reflection on what editing directions helped or hurt>",
  "meta_skill_content": "<compact optimizer guidance for future edits>"
}

Patch Representation and Safeguards

Patch-mode optimization restricts each update to four atomic operations: append, insert_after, replace, and delete. Each merged edit also records a support count and a source type, allowing ranking to prefer edits that survive independent analyses and hierarchical merges. The edit budget $L_{t}$ acts as a textual learning rate: it limits how many proposed edits can be applied at a step, preserving continuity between adjacent skills.
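The four operations map naturally onto string edits over the skill text. The sketch below is one possible interpretation rather than the paper's implementation: `apply_edit` and `apply_patch` are hypothetical names, targets are matched at their first exact occurrence, and an edit whose target is missing is silently skipped.

```python
def apply_edit(skill, edit):
    """Apply one atomic edit; an edit whose target is missing leaves the skill unchanged."""
    op = edit["op"]
    if op == "append":
        return skill.rstrip("\n") + "\n\n" + edit["content"] + "\n"
    target = edit["target"]
    pos = skill.find(target)               # first exact match (assumption)
    if pos == -1:
        return skill
    end = pos + len(target)
    if op == "insert_after":
        return skill[:end] + "\n" + edit["content"] + skill[end:]
    if op == "replace":
        return skill[:pos] + edit["content"] + skill[end:]
    if op == "delete":
        return skill[:pos] + skill[end:]
    raise ValueError(f"unknown op: {op}")

def apply_patch(skill, edits, budget):
    """Apply at most `budget` edits in order, mirroring the edit-budget cap."""
    for edit in edits[:budget]:
        skill = apply_edit(skill, edit)
    return skill

skill = "# Skill\n\n## Search\nUse one query.\n"
edits = [
    {"op": "replace", "target": "Use one query.", "content": "Try two query phrasings."},
    {"op": "insert_after", "target": "## Search", "content": "Check the date first."},
    {"op": "append", "content": "## Output\nAnswer briefly."},
]
print(apply_patch(skill, edits, budget=2))
```

With a budget of 2, the third edit is never applied, so the candidate differs from the current skill by only two local changes.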

The protected slow-update section, delimited by SLOW_UPDATE_START and SLOW_UPDATE_END, is off limits to all step-level prompts. Only the epoch-boundary slow-update process may rewrite that section, and the rewritten skill still passes through the same held-out selection gate before it can become the current skill. Rejected candidates are not discarded entirely: their failure patterns and rejected edits are stored in the step buffer so that later optimizer calls can avoid repeating harmful changes.
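The prompts ask the optimizer not to touch the protected section, but a caller can also enforce this mechanically. The following sketch assumes the marker strings shown in the prompt contracts; the helper names (`protected_span`, `touches_protected`, `write_slow_update`) are hypothetical, and only the first occurrence of an edit target is checked, which is a simplification.

```python
START = "<!-- SLOW_UPDATE_START -->"
END = "<!-- SLOW_UPDATE_END -->"

def protected_span(skill):
    """Return (start, end) character offsets of the protected block, or None."""
    start = skill.find(START)
    if start == -1:
        return None
    end = skill.find(END, start + len(START))
    if end == -1:
        return None
    return start, end + len(END)

def touches_protected(skill, edit):
    """True if the edit's target overlaps the protected block (step-level edits only)."""
    span = protected_span(skill)
    target = edit.get("target")
    if span is None or not target:
        return False
    pos = skill.find(target)
    if pos == -1:
        return False
    return pos < span[1] and pos + len(target) > span[0]

def write_slow_update(skill, guidance):
    """Overwrite the protected block, or append one if the skill has none yet."""
    block = f"{START}\n{guidance}\n{END}"
    span = protected_span(skill)
    if span is None:
        return skill.rstrip("\n") + "\n\n" + block + "\n"
    return skill[:span[0]] + block + skill[span[1]:]

skill = "# Skill\nRule A.\n" + START + "\nOld guidance.\n" + END + "\n"
print(touches_protected(skill, {"op": "delete", "target": "Old guidance."}))  # True
print(touches_protected(skill, {"op": "replace", "target": "Rule A.", "content": "Rule B."}))  # False
updated = write_slow_update(skill, "New guidance.")
```

In this sketch a step-level edit that returns `True` from `touches_protected` would be dropped before application, while `write_slow_update` is reserved for the epoch-boundary process.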

Design Principles

The implementation follows five design principles. First, the task-execution model is fixed; only the text skill changes. Second, every candidate skill is evaluated on a selection split before acceptance, which prevents unvalidated reflection from accumulating. Third, minibatch analyses are merged hierarchically so that the final edits represent recurring evidence rather than single examples. Fourth, the edit budget serves as a learning-rate analogue, allowing larger early changes and smaller late refinements. Fifth, the deployed skill remains lightweight and inspectable, while the optimizer-side meta skill stays separate from the skill shown to the task-execution model.
