@Yif_Yang: Introducing SkillOpt — an optimizer for agent skills. Instead of finetuning model weights, we treat a natural-language …
Summary
Introducing SkillOpt, an optimizer that treats natural-language skills as trainable external parameters instead of finetuning model weights. It uses bounded edits and validation gating to enable stable, controllable skill updates, achieving best or tied-best results across 52 settings on 6 benchmarks with 7 models.
View Cached Full Text
Cached at: 05/26/26, 05:02 AM
Introducing SkillOpt — an optimizer for agent skills.
Instead of finetuning model weights, we treat a natural-language skill as a trainable external parameter.
Think of it as deep learning for the frontier-model + agent era: learning rate, LR schedule, mini-batch, batch size, epoch, momentum — all in text-space optimization. SkillOpt enables stable, controllable skill updates through bounded edits, allowing the optimizer to summarize “gradient directions” from agent experience and continuously improve procedural capability. We evaluate SkillOpt across 6 benchmarks and 7 models, under both direct model calls and real agent execution loops with Codex + Claude Code. SkillOpt achieves best or tied-best results in 52/52 settings.
Train the skill, not the model.
https://aka.ms/skillopt https://huggingface.co/papers/2605.23904…
Executive Strategy for Self-Evolving Agent Skills
Source: https://microsoft.github.io/SkillOpt/ Project Video
SkillOpt in motion.
A short visual overview of how SkillOpt treats natural-language skills as trainable artifacts: roll out, reflect, edit, validate, and export.
Promotional video for the SkillOpt project page. The static paper teaser is shown below for high-resolution inspection.
Paper Teaser
The core loop at a glance.
The teaser summarizes the SkillOpt training loop: rollout evidence, optimizer-side reflection, bounded skill edits, validation gating, and the exported reusable skill.
Figure from the SkillOpt paper. On small screens, the figure area scrolls horizontally to preserve the original details.
A skill is external state for an agent.
Instead of fine-tuning a model or hand-maintaining prompts, SkillOpt runs the frozen agent on scored batches, asks a separate optimizer model to propose structured edits, and accepts a candidate only when validation performance improves.
Frozen target modelOptimizer modelAdd / delete / replace editsHeld-out gate
Rollout
The target model executes tasks with the current skill and records scored trajectories.
Reflect
The optimizer analyzes success and failure minibatches to find reusable procedures.
Edit
Candidate add, delete, and replace operations are merged and ranked under a budget.
Gate
The candidate skill is kept only if it improves held-out selection performance.
Evidence
Rollout batches capture messages, tool calls, verifier feedback, task metadata, and final scores.
Minibatches
Failures and successes are reflected separately so edits correct recurring errors while preserving working behavior.
Bounded Edits
An edit budget functions as a textual learning rate, preventing useful rules from being overwritten by broad rewrites.
Memory
Rejected edits, slow update, and optimizer-side meta skill provide longer-horizon feedback without bloating deployment.
SkillOpt pipeline from the paper. The frozen target model executes with the current skill; the optimizer model proposes bounded edits; held-out validation decides whether the candidate becomes the new current skill.
Method comparison
SkillOpt clears the strongest baseline on every benchmark.
ComponentSettingSearchQASpreadsheetLiveMathLearning ratelr=4 default87.177.561.3Learning ratewithout lr84.675.757.3Rejected bufferwith buffer87.177.561.3Rejected bufferwithout buffer85.572.958.9Update memorymeta skill + slow update87.177.561.3Update memorywithout both86.355.059.7
What the ablations say
BoundedTextual learning rates prevent destructive rewrites while keeping enough plasticity to learn new procedures.
GatedHeld-out selection turns reflection into propose-and-test optimization rather than unconditional self-editing.
BufferedRejected edits become negative feedback, helping the optimizer avoid repeating harmful directions.
Epoch checkpoint trends from the paper. Selection-best checkpoints are compared with train rollout score and unseen test performance.
ALFWorld skill evolution scoresSelection score rises from 68.6 percent to 81.4 percent, while rejected edits are visible as downward candidate points.85%80%75%70%65%basestep 1step 2step 3slowstep 4
Accepted edits become the current skill only after held-out selection improves.Step 3 is rescued by a slow update; Step 4 trains higher but fails selection.
Cross-model+15.2GPT-5.4 LiveMath skill transferred to GPT-5.4-nano on LiveMathBench.
Cross-harness+31.8Codex-trained SpreadsheetBench skill transferred into Claude Code.
Self-optimizer+10.4GPT-5.4-nano used as its own optimizer improved SpreadsheetBench over baseline.
Deployment1 fileThe target model consumes only the final skill, not optimizer memory.
A stronger optimizer model gives the largest gains, but the loop is not merely distillation from a stronger model. Even matched target-as-optimizer settings can discover useful edits when the update is constrained, buffered, and validated.
@misc{yang2026skilloptexecutivestrategyselfevolving,
title={SkillOpt: Executive Strategy for Self-Evolving Agent Skills},
author={Yifan Yang and Ziyang Gong and Weiquan Huang and Qihao Yang and Ziwei Zhou and Zisu Huang and Yan Li and Xuemei Gao and Qi Dai and Bei Liu and Kai Qiu and Yuqing Yang and Dongdong Chen and Xue Yang and Chong Luo},
year={2026},
eprint={2605.23904},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.23904},
}
Similar Articles
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
SkillOpt introduces a systematic text-space optimizer for agent skills that trains skills as external agent state with stable updates and zero deployment inference overhead, achieving superior performance across multiple benchmarks and execution environments.
SkillOpt treats markdown skill files as trainable parameters with proper optimization machinery
A new paper formalizes skill optimization for agents by treating markdown skill files as trainable parameters, using bounded edits validated against holdout sets. The approach transfers well between models and improves performance on procedural benchmarks.
@omarsar0: New research from Microsoft Research I see a lot of AI engineers handwriting agent skill docs and hope they generalize.…
Microsoft Research introduces SkillOpt, a method that treats agent skill documents as trainable external state, using an optimizer model to make bounded edits validated by a held-out set. The approach achieves best or tied results across 52 evaluation cells and improves accuracy by over 23 points on GPT-5.5, with zero extra inference cost and transferable skills.
@DAIEvolutionHub: MICROSOFT JUST OPEN-SOURCED A WAY TO “TRAIN” AI AGENTS WITHOUT TOUCHING MODEL WEIGHTS SkillOpt treats a simple markdown…
Microsoft open-sourced SkillOpt, a method that treats markdown skill files like neural network parameters to train AI agents without modifying model weights, using learning rates, validation checks, minibatches, and epochs.
@Xudong07452910: This SkillOpt paper is quite interesting—it actually addresses a very important point: AI agents in the future won't just rely on humans writing prompts; they can train their own 'job descriptions'. Currently, many skills/prompts are written one-off, and when real tasks pile up, various edge cases start to fail...
SkillOpt introduces a systematic controllable text-space optimizer that enables AI agents to train and improve their own skills (like 'work instructions') through iterative edits and validation, outperforming human-crafted and one-shot prompts across multiple benchmarks and models.