@Xudong07452910: This SkillOpt paper is quite interesting—it actually addresses a very important point: AI agents in the future won't just rely on humans writing prompts; they can train their own 'job descriptions'. Currently, many skills/prompts are written one-off, and when real tasks pile up, various edge cases start to fail...


Summary

SkillOpt introduces a systematic controllable text-space optimizer that enables AI agents to train and improve their own skills (like 'work instructions') through iterative edits and validation, outperforming human-crafted and one-shot prompts across multiple benchmarks and models.

This SkillOpt paper is quite interesting—it actually addresses a very important point: AI agents in the future won't just rely on humans writing prompts; they can train their own 'job descriptions'. Currently, many skills/prompts are written one-off, and when real tasks pile up, various edge cases start to fail. SkillOpt's approach is more like training an employee: first, let the agent work according to the current description, record successes and failures, then have another AI summarize what should be changed, and finally validate with test tasks—only keep the changes if they actually improve performance. This means that an agent's evolution doesn't necessarily require modifying model weights; it can also be achieved through continuously optimizing external skills. In the past, people collected 'god-tier prompts'. In the future, what may be more valuable is a skill system that can continuously self-evolve. https://arxiv.org/pdf/2605.23904

Cached at: 05/26/26, 01:09 PM


Introduction

Source: https://arxiv.org/html/2605.23904

May 2026

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Yifan Yang (1,∗,‡), Ziyang Gong (2,∗), Weiquan Huang (3,∗), Qihao Yang (2,∗), Ziwei Zhou (4,∗), Zisu Huang (4,∗), Yan Li (2), Xuemei Gao (1), Qi Dai (1), Bei Liu (1), Kai Qiu (1), Yuqing Yang (1), Dongdong Chen (1), Xue Yang (2,‡), Chong Luo (1)

(1) Microsoft, (2) Shanghai Jiao Tong University, (3) Tongji University, (4) Fudan University

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.

Code: https://aka.ms/SkillOpt
Correspondence: [email protected], [email protected]
∗ Equal contribution. ‡ Corresponding authors.

Figure 1: Overview of SkillOpt. The target model executes tasks with a current skill, an additional frontier optimizer model converts trajectories into bounded add/delete/replace skill edits, and a held-out gate accepts only edits that improve validation performance. Accepted edits are exported as a reusable skill artifact, while rejected edits become negative feedback for later updates.

Frontier language models are increasingly deployed as agents, from single-prompt callers to multi-step execution harnesses with tools, files, and verifiers [39, 26, 32, 37]. In such settings, domain adaptation is no longer only about model weights or prompts: it also requires improving the procedures by which the agent gathers evidence, calls tools, follows domain conventions, and formats outputs [36, 11]. Agent skills provide a natural interface for this procedural adaptation [12, 10]: a skill is a portable natural-language artifact that packages procedures, domain heuristics, tool policies, output constraints, and failure modes, letting a frozen agent adapt through external text. If the recurring object of adaptation is the agent's procedure, the skill document itself should be trainable. Yet weight adaptation is often unavailable for closed frontier models and expensive for open ones, while manually written or one-shot skills are brittle under a target domain or harness. Recent systems convert execution experience into reusable textual artifacts (distilling trajectory lessons, refining skill folders via failure analysis, building domain-specific skill libraries, or optimizing prompts from trajectory feedback) [19, 2, 13, 27, 1], but leave open a more basic question: if skills are the adaptation layer, how should they be optimized?
Our key idea is to treat skill editing as a controllable domain-adaptation process, with the skill document as the external state, an additional frontier model as the optimizer, and training-style controls over evidence, step size, validation, and update direction. We introduce SkillOpt, a text-space optimizer for agent skills. Given a target domain, an initial skill, and the model being adapted, SkillOpt repeatedly samples trajectory batches, analyzes successes and failures, and asks a frontier optimizer model to propose structured add/delete/replace edits. It then aggregates and ranks candidate edits under a textual learning-rate budget, applies a bounded update to the skill document, and evaluates the candidate skill on a held-out selection split before accepting it. Rejected edits are retained as negative feedback, while the epoch-wise slow/meta update preserves longer-horizon regularities. Figure 1 gives a schematic view of this loop. The deployed output is a compact best_skill.md file of roughly 300–2,000 tokens, with the adapted model and execution harness remaining fixed.

The deep-learning analogy is operational rather than decorative. Rollout and reflection batch sizes control the noise in the evidence used for each edit; the textual learning rate and schedule control how far one skill version is allowed to move from the previous one; the held-out gate plays the role of validation; and the epoch-wise slow/meta update acts like a momentum term, carrying stable editing directions across epochs. This stability is crucial: if consecutive skill revisions move too far or in inconsistent directions, rejected edits and previous accepted edits no longer provide a meaningful optimization history. With bounded, validation-gated updates, each revision remains close enough to the last one that later optimizer calls can learn from what helped, what failed, and what should be preserved.
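The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Edit`, `apply_edits`, `train_skill`, `toy_rollout`, and `toy_propose` are hypothetical, the optimizer model is replaced by a toy rule-based proposer, the skill is a list of text lines, and the epoch-wise slow/meta update is left out entirely.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    op: str      # "add", "delete", or "replace"
    target: str  # existing line to delete or replace ("" for add)
    text: str    # new line for add or replace ("" for delete)


def apply_edits(skill, edits, budget):
    """Apply at most `budget` edits; the cap stands in for the textual learning rate."""
    lines = list(skill)
    for e in edits[:budget]:
        if e.op == "add" and e.text not in lines:
            lines.append(e.text)
        elif e.op == "delete" and e.target in lines:
            lines.remove(e.target)
        elif e.op == "replace" and e.target in lines:
            lines[lines.index(e.target)] = e.text
    return lines


def mean_score(rollout, skill, tasks):
    """Average scalar score of one skill over a task split."""
    return sum(rollout(skill, x) for x in tasks) / len(tasks)


def train_skill(skill, rollout, propose, train, sel, steps, batch_size, budget, seed=0):
    rng = random.Random(seed)
    best, best_score = list(skill), mean_score(rollout, skill, sel)
    rejected = []  # rejected-edit buffer, passed back to the proposer
    for _ in range(steps):
        batch = rng.sample(train, batch_size)               # rollout batch
        evidence = [(x, rollout(best, x)) for x in batch]   # scored rollouts
        edits = propose(best, evidence, rejected)[:budget]
        if not edits:
            continue
        candidate = apply_edits(best, edits, budget)
        score = mean_score(rollout, candidate, sel)
        if score > best_score:   # accept only strict held-out improvement
            best, best_score = candidate, score
        else:
            rejected.extend(edits)
    return best, best_score


# Toy stand-ins (assumptions): a task passes only if the skill names its check.
def toy_rollout(skill, task):
    return 1.0 if f"check {task}" in skill else 0.0


def toy_propose(skill, evidence, rejected):
    edits = [Edit("add", "", f"check {task}") for task, score in evidence if score < 1.0]
    return [e for e in edits if e not in rejected]


best, score = train_skill(
    [], toy_rollout, toy_propose,
    train=["units", "header", "rounding", "noise"],
    sel=["units", "header", "rounding"],
    steps=6, batch_size=4, budget=1,
)
print(best, score)
```

In this toy run the "noise" edit helps a training task but does not move the held-out score, so the gate rejects it and the buffer keeps the proposer from suggesting it again.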
We conduct, to our knowledge, the first systematic study of skill optimization as a domain-adaptation training method for frontier agents. We evaluate SkillOpt on six benchmarks covering QA, spreadsheets, documents, math, and embodied decision making, across seven target models from frontier-scale GPT to small-scale Qwen, and under three execution modes (direct chat, Codex harness, Claude Code harness). Out of 52 evaluated (model, benchmark, harness) cells, SkillOpt is the best or tied-best measured method on all 52. With GPT-5.5 in direct chat, it lifts SearchQA from 77.7 to 87.3, SpreadsheetBench from 41.8 to 80.7, OfficeQA from 33.1 to 72.1, DocVQA from 78.8 to 91.2, LiveMathematicianBench from 37.6 to 66.9, and ALFWorld from 83.6 to 95.5 (a +23.5 point average gain over no skill), and it also beats the strongest per-cell baseline drawn from human-written, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills by +5.4 points on average. The same optimization interface is effective inside Codex-style and Claude Code-style execution loops, lifting GPT-5.5 by +24.8 and +19.1 points over no skill respectively, and outperforming EvoSkill by +14.0 and +3.2 points.

The learned artifacts also transfer beyond the exact training setting. A SpreadsheetBench skill trained on GPT-5.4 improves every smaller GPT variant we test; a Codex-trained spreadsheet skill transfers to Claude Code with a +59.7 point gain; and an OlympiadBench skill yields positive gains on Omni-MATH [6]. These transfer results are important for the paper's application value: a skill can be optimized once, audited as text, and reused across related models, harnesses, or tasks without changing model weights.

Our ablations explain why this works. Bounded textual learning outperforms uncontrolled rewriting, held-out gating prevents harmful proposals from accumulating, the rejected-step buffer converts failed edits into negative feedback, and the epoch-wise slow/meta update improves long-horizon refinement without bloating the deployed skill. Finally, per-benchmark case studies show that the learned skills remain compact (300–2,000 tokens after only 1–4 accepted edits), inspectable, and procedural rather than instance-specific.

Our contributions are as follows:

  • We formulate agent-skill learning as optimization over an external natural-language state and introduce SkillOpt, a harness-agnostic optimizer with rollout batches, reflection minibatches, add/delete/replace edits, textual learning rates, schedules, held-out acceptance, rejected-edit buffers, and epoch-wise slow/meta update.
  • We provide a broad empirical study across six benchmarks, seven target models, and three execution harnesses, showing that SkillOpt is best or tied-best on 52 of 52 cells and outperforms no-skill, human-skill, one-shot LLM-skill, prompt-optimization (TextGrad, GEPA), and skill-evolution (Trace2Skill, EvoSkill) baselines under every model.
  • We validate the optimization design through component ablations and three forms of transfer (cross-model, cross-harness, cross-benchmark), showing that the exported skill artifact is compact, reusable, and deployable without model-weight updates.
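As a quick sanity check on the headline number, the +23.5 point average for GPT-5.5 in direct chat can be recomputed from the six per-benchmark scores quoted in the introduction. The scores are copied from the text; the variable names are only illustrative.

```python
# Direct-chat GPT-5.5 accuracies quoted in the introduction.
no_skill = {
    "SearchQA": 77.7, "SpreadsheetBench": 41.8, "OfficeQA": 33.1,
    "DocVQA": 78.8, "LiveMathematicianBench": 37.6, "ALFWorld": 83.6,
}
with_skill = {
    "SearchQA": 87.3, "SpreadsheetBench": 80.7, "OfficeQA": 72.1,
    "DocVQA": 91.2, "LiveMathematicianBench": 66.9, "ALFWorld": 95.5,
}

# Per-benchmark gain in accuracy points, rounded to one decimal place.
gains = {name: round(with_skill[name] - no_skill[name], 1) for name in no_skill}
avg_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: +{gain}")
print(f"average gain: +{avg_gain:.1f}")  # prints "average gain: +23.5"
```

The gains are uneven: SpreadsheetBench and OfficeQA each improve by about 39 points, while SearchQA improves by under 10, so the average is driven largely by the benchmarks with the weakest no-skill baseline.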

Related Work

Prompt auto tuning and agent-configuration search.

GEPA demonstrates that trajectory feedback can guide reflective prompt evolution and outperform reinforcement learning on several language-agent tasks [1]. ABSTRAL and EvoTest extend this idea from single prompts to multi-agent design documents and test-time agentic system evolution without gradients or fine-tuning [30, 9]. By treating language artifacts as optimizable objects, these methods can directly exploit execution feedback, but they mainly target prompts, system designs, or full configurations rather than reusable domain adaptation. SkillOpt instead optimizes a persistent skill document that can be trained, validated, exported, and reused with the adapted model, applying language-level controllability to a stable procedural skill state.

Skill construction and skill evolution.

SkillsBench and the SoK on agentic skills frame skills as reusable procedural knowledge, covering tool policies, applicability conditions, execution routines, and supporting resources [12, 10]. Prior systems construct such skills from lifelong experience, trajectory lessons, skill knowledge bases, or heterogeneous domain resources [38, 19, 31, 27, 5], and further refine them through failure analysis, creation-evaluation-revision loops, co-evolving generators and verifiers, collective updates, or reinforcement learning [2, 13, 41, 15, 35, 33, 23, 18, 34]. While these works emphasize skill discovery, repository growth, sharing, evolutionary search, or policy optimization, SkillOpt studies a narrower problem: how to train one compact domain skill with deep-learning-style controls such as trajectory batches, reflection minibatches, textual learning rates, validation gates, rejected-edit buffers, and slow/meta updates. This yields a controlled and auditable procedure for producing a portable best_skill.md without changing model weights.

Method

Figure 2: Pipeline of SkillOpt. A frozen target model executes a rollout batch with the current skill; an optimizer model performs minibatch reflection over successes and failures, proposes bounded add/delete/replace edits, merges and ranks them under a scheduled edit budget, and accepts the candidate skill only through a held-out validation gate. Across epochs, the slow/meta update retains longer-horizon lessons without changing the target model.
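The "merges and ranks them under a scheduled edit budget" step could look roughly like the sketch below. The text available here does not say how edits are ranked or how the budget is scheduled, so both choices are assumptions made for illustration: edits are ranked by how many reflection minibatches proposed them, and the budget decays linearly over steps. The function names `edit_budget` and `merge_and_rank` are hypothetical.

```python
from collections import Counter


def edit_budget(step, total_steps, max_edits=4, min_edits=1):
    """Assumed linear decay: allow more edits early in training, fewer late."""
    frac = step / max(total_steps - 1, 1)
    return max_edits - int(frac * (max_edits - min_edits))


def merge_and_rank(proposals, budget):
    """Merge duplicate edits across minibatches and keep the most-supported ones.

    `proposals` is a list of minibatch proposals; each edit is an
    (op, target, text) tuple so that identical edits compare equal.
    """
    # Count each edit once per minibatch that proposed it.
    votes = Counter(edit for minibatch in proposals for edit in set(minibatch))
    # Sort by descending support, breaking ties by the tuple itself.
    ranked = sorted(votes, key=lambda e: (-votes[e], e))
    return ranked[:budget]


add_units = ("add", "", "State units in the final answer.")
add_header = ("add", "", "Read the header row before editing cells.")
drop_old = ("delete", "Always answer in one word.", "")

proposals = [[add_units, add_header], [add_units, drop_old], [add_units, add_header]]
budget = edit_budget(step=4, total_steps=5, max_edits=2, min_edits=2)
print(merge_and_rank(proposals, budget))
```

With these inputs the edit proposed by all three minibatches ranks first and the one proposed by two ranks second, so a budget of two drops the single-vote deletion.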

Problem Setup

A skill $s$ is a natural-language policy inserted into the agent context before execution, consistent with recent work treating skills as reusable procedural knowledge for agents [12, 10]. In direct-chat benchmarks, it is prepended to the system or developer instruction; in tool-use harnesses, it becomes persistent procedural memory. We use $M$ to denote the frozen target model whose behavior is being adapted through skill optimization. For a harness $h$, task $x$, and skill $s$, execution produces a trajectory $\tau$ and a scalar score $r$:

$$(\tau(s), r(s)) = h(M, x, s), \qquad r(s) \in [0, 1]. \tag{1}$$

Given train, selection, and test splits $D_{\mathrm{tr}}, D_{\mathrm{sel}}, D_{\mathrm{test}}$, SkillOpt uses $D_{\mathrm{tr}}$ to generate a set of candidate skills $\mathcal{C}(D_{\mathrm{tr}})$, selects the best skill on $D_{\mathrm{sel}}$, and reports the final performance on $D_{\mathrm{test}}$:

$$s^{\star}_{\mathrm{sel}} = \arg\max_{s \in \mathcal{C}(D_{\mathrm{tr}})} \frac{1}{|D_{\mathrm{sel}}|} \sum_{x \in D_{\mathrm{sel}}} r(s), \tag{2}$$

$$\mathrm{Test}(s^{\star}_{\mathrm{sel}}) = \frac{1}{|D_{\mathrm{test}}|} \sum_{x \in D_{\mathrm{test}}} r(s^{\star}_{\mathrm{sel}}). \tag{3}$$

The training split supplies experience, the selection split gates updates, and the test split is used only for final reporting. The optimizer state contains the current skill, the best validation-gated skill, cached skill hashes, an epoch-local rejected-step buffer, and optional slow/meta-update state. Only the best accepted skill is exported as best_skill.md.
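Equations (2) and (3) amount to an argmax over candidate skills on the selection split followed by a single evaluation on the test split. The sketch below shows that protocol with a toy scoring function standing in for the harness $h(M, x, s)$; the names `mean_score`, `select_and_test`, and `toy_score` are hypothetical, and the toy keyword matching is only there to make the example runnable.

```python
def mean_score(score_fn, skill, tasks):
    """Average of r(s) over a split, as in the sums of Eq. (2) and Eq. (3)."""
    return sum(score_fn(skill, x) for x in tasks) / len(tasks)


def select_and_test(candidates, score_fn, d_sel, d_test):
    """Pick the best candidate on D_sel (Eq. 2), then score it once on D_test (Eq. 3)."""
    best = max(candidates, key=lambda s: mean_score(score_fn, s, d_sel))
    return best, mean_score(score_fn, best, d_test)


# Toy stand-in for the harness: a task succeeds if the skill mentions its keyword.
def toy_score(skill, task):
    return 1.0 if task in skill else 0.0


candidates = ["", "check units", "check units and headers"]
d_sel = ["units", "headers"]
d_test = ["units", "headers", "rounding"]

best, test_score = select_and_test(candidates, toy_score, d_sel, d_test)
print(repr(best), round(test_score, 3))
```

The test split never influences which candidate is chosen, which is the point of keeping selection and reporting separate. Note that Python's `max` returns the first maximal element, so ties on the selection split resolve to the earliest candidate in the list.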

Forward Pass: Rollout Evidence

At each optimization step, the target model runs a rollout batch from $D_{\mathrm{tr}}$ with the current skill. The harness records task metadata, messages, tool calls, observations, command outputs, final answers, verifier feedback, and benchmark-specific context such as spreadsheet previews, document references
