A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization
Summary
This paper presents an automated pipeline for optimizing natural language skill descriptions in enterprise AI agents to resolve skill collisions, achieving performance matching manual tuning with a 32× speedup. Ablation studies show that a single LLM rewrite using error cases captures most improvements, while other design choices have minimal impact.
# A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization
Source: [https://arxiv.org/html/2606.30775](https://arxiv.org/html/2606.30775)
Yangqiaoyu Zhou, Mohammad Alqudah, Kwei\-Herng Lai, Aaron Halfaker, Yingqi Xiong, Yaar Harari Microsoft \{yangqzhou, yaarharari\}@microsoft\.com
###### Abstract
Enterprise AI agents route user queries to specialized skills by matching queries against natural language skill descriptions\. When two skills share overlapping descriptions, the routing LLM misroutes queries, a failure we term skill collision\. As agents scale to dozens of skills, manually tuning descriptions to maintain routing accuracy becomes a significant engineering bottleneck\. We deploy an automated description optimization pipeline on a production enterprise group chat agent \(9 skills, 372 regression cases\)\. The pipeline produces descriptions averaging 79\.2% F1, matching manually tuned descriptions at 79\.4% F1 \(average per\-skill difference \-0\.20%, within the ±0\.78% multi\-seed noise floor\), while reducing per\-skill engineering effort from 120 minutes to 3\.8 minutes \(32× speedup\)\. We then examine which pipeline components actually drive this match\. Systematic ablation on both the production system and ToolBench \(∼16k tools\) reveals that a single LLM rewrite using any available false\-positive and false\-negative cases captures most of the available improvement\. Other design choices we tested \(iteration budget, feedback signal composition, dual editing of confused pairs, and training set size\) each affect final F1 by less than 0\.5%\. Description optimization addresses skill collisions caused by overlapping descriptions but cannot resolve cases where two skills’ intended scopes genuinely overlap\. We identify a diagnostic \(a large train–validation F1 gap\) that flags the latter cases for architectural rather than text\-level intervention\.
## 1 Introduction
Figure 1: Skill collision and automated resolution via error feedback\. \(a\) Before optimization: two skills share nearly identical descriptions \(PeopleSearch: “Looks up profile information about a person in the organization\.”; GetUserManager: “Retrieves organizational information about people in the company\.”\); the LLM planner cannot distinguish them and misroutes the query “Who is Sarah’s manager?” to PeopleSearch \(incorrect, ×\)\. \(b\) After error\-feedback\-guided rewriting: the false\-positive case is fed back to an LLM, which produces a discriminative description for GetUserManager \(“Returns the direct manager of a named employee; use when the query asks who someone reports to\.”\); the planner now routes correctly \(✓\)\.
Modern enterprise AI assistants are built as orchestrated agent systems where an LLM\-based planner routes user queries to specialized skills \(Yao et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib172); Wu et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib176)\)\. Routing decisions are driven by comparing the query against each skill’s natural language description\. When descriptions are insufficiently discriminative, the planner misroutes queries, causing what we term *skill collision*: semantically overlapping skills compete for the same query population \(Figure [1](https://arxiv.org/html/2606.30775#S1.F1)\)\.
Skill collision is most acute during onboarding\. Adding a new skill to a deployed agent forces its description to be precisely positioned relative to all incumbent skills\. In practice, this requires iterative manual tuning: a developer writes a description, deploys the skill, observes routing errors from real or synthetic traffic, and edits accordingly\. This loop is slow, requires expertise in both the skill’s functionality and the planner’s decision boundaries, and does not scale as agent capabilities grow\.
We deploy an automated description optimization pipeline that replaces this manual loop\. The pipeline initializes a candidate description with an LLM, then iteratively refines it using false\-positive and false\-negative cases from a labeled training set\. On a production enterprise group chat agent \(9 skills, 372 regression cases\), the automated descriptions match manually tuned ones in routing quality \(79\.2% vs\. 79\.4% average F1, per\-skill differences within the ±0\.78% multi\-seed noise floor\) while reducing per\-skill engineering effort from 120 minutes to 3\.8 minutes \(32× speedup\)\.
Given that the automated pipeline reaches manual quality, we ask which of its components actually matter\. As originally implemented, the pipeline combined several mechanisms motivated by intuition: contrastive feedback presenting FP, FN, and TP cases simultaneously; iterative refinement up to a fixed budget; dual editing of the most\-confused rival skill; and tuned training set sizes\. Through systematic ablation on the production system and at scale on ToolBench \(Qin et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib170)\) \(∼16k tools\), spanning closed\-world \(per\-query candidate pool\) and open\-world \(retrieval over the full corpus\) routing regimes, we find that most of these design choices change final routing F1 by less than 0\.5%\. A single LLM rewrite using any available FP/FN cases captures the bulk of available improvement: on production, single\-shot achieves 79\.2% F1 \(per\-skill \-0\.12% from human\-written descriptions, matching iterative refinement within noise\); on ToolBench open\-world, single\-shot achieves a \+4\.45% F1 gain, within 0\.2% of iterative refinement\.
Not all routing failures are description failures\. Two further findings characterize where description optimization applies, and where it does not\. First, closed\-world \(fixed candidate pool\) and open\-world \(retrieval\-based\) routing require separate optimization: descriptions tuned in one regime transfer poorly to the other, and LLM initialization helps in one setting but hurts the other\. Second, when two skills’ intended scopes genuinely overlap \(as opposed to merely sharing under\-discriminative wording\), no description rewrite can resolve the collision; these skills exhibit large training\-validation F1 gaps regardless of optimization signal quality\. We identify this gap as a diagnostic signal that flags skills for architectural intervention \(scope restructuring, intent\-specific routing rules\) rather than continued description refinement\.
Taken together, these findings yield a compact operational picture for skill onboarding: optimize with a single\-shot LLM rewrite, match the optimization regime to the deployment regime, and use the train\-validation F1 gap to triage which skills need architectural intervention instead of more description tuning\. We make three contributions: \(1\) a deployed production system that replaces manual description tuning at enterprise scale, validated as non\-inferior to manually tuned baselines with a 32× engineering speedup; \(2\) systematic ablation across production and ToolBench showing that pipeline complexity \(feedback type, iteration budget, dual editing, training size\) has negligible impact, and a single LLM rewrite suffices; \(3\) characterization of where description optimization applies, including a diagnostic signal for skills requiring architectural intervention and the separation of closed\-world and open\-world routing into distinct optimization regimes\.
## 2 Related Work
Tool selection for LLM agents\. LLM tool selection encompasses routing models and planners over large API collections \(Qin et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib170); Shen et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib177); Li et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib178)\), generating accurate calls from documentation \(Patil et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib173)\), and retrieving candidate tools for orchestrators \(Liu et al\., [2025](https://arxiv.org/html/2606.30775#bib.bib182); Jia and Li, [2025](https://arxiv.org/html/2606.30775#bib.bib185); Lumer et al\., [2025](https://arxiv.org/html/2606.30775#bib.bib186)\)\. These upstream retrieval methods complement our work; we optimize the descriptions used within the established candidate pools\.
Tool description optimization\. Recent methods use LLMs and execution traces to rewrite tool documentation, aiming to improve downstream task completion \(Yuan et al\., [2024](https://arxiv.org/html/2606.30775#bib.bib189); Qu et al\., [2025](https://arxiv.org/html/2606.30775#bib.bib190); Fang et al\., [2025](https://arxiv.org/html/2606.30775#bib.bib191); Guo et al\., [2026](https://arxiv.org/html/2606.30775#bib.bib192)\)\. Instead of targeting downstream execution success via execution traces, we target upstream routing decisions using objective false positive and negative cases to diagnose skill collisions\. Furthermore, we rigorously evaluate against manually tuned descriptions in a production deployment and demonstrate that a single LLM rewrite captures the vast majority of available improvements\.
Prompt optimization & self\-refinement\. Numerous frameworks optimize overarching task prompts or refine outputs using textual gradients, demonstrations, or LLM self\-criticism \(Khattab et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib169); Opsahl\-Ong et al\., [2024](https://arxiv.org/html/2606.30775#bib.bib187); Yang et al\., [2024](https://arxiv.org/html/2606.30775#bib.bib179); Zhou et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib180); Yuksekgonul et al\., [2025](https://arxiv.org/html/2606.30775#bib.bib184); Pryzant et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib183); Madaan et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib174); Shinn et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib175); Agrawal et al\., [2026](https://arxiv.org/html/2606.30775#bib.bib188)\)\. Instead of optimizing aggregate task prompts, we focus exclusively on per\-skill descriptions evaluated via discrete routing errors\. Furthermore, while many of these methods rely on multi\-step evolutionary search or iterative closed\-loop refinement, we demonstrate that a single\-shot rewrite using objective routing feedback captures the vast majority of available improvements for tool description optimization\.
## 3 Method
The skill onboarding pipeline takes as input a skill name and outputs an optimized routing description for use by the LLM planner\. The pipeline has two stages: LLM initialization and error\-feedback refinement\. We additionally describe a single\-shot variant, which Section [5\.2](https://arxiv.org/html/2606.30775#S5.SS2) shows is sufficient to match the full iterative pipeline\.
Stage 1: LLM initialization\. We prompt an LLM with the skill name to generate a candidate routing description\. The initialization prompt requests a description of what the skill does\. The initialized description serves as both the starting point for refinement and a standalone baseline for evaluation\.
Stage 2: Error\-feedback refinement\. The initialized description is evaluated on a labeled training set of queries with ground\-truth skill labels\. We collect false positives \(FP: queries routed to the skill that should not be\), false negatives \(FN: queries the skill misses\), and true positives \(TP: correctly routed queries\)\. At each iteration, we prompt the LLM with the current description, up to 5 FP and 5 FN cases \(a practical token\-budget choice; §[5\.2](https://arxiv.org/html/2606.30775#S5.SS2) confirms performance is insensitive to this limit\), and a TP set matched to the number of negative cases \(to balance the prompt context across positive and negative examples\)\. The LLM is asked to identify routing failure patterns and revise the description\. The refined description replaces the current one, and the process repeats up to a fixed iteration budget or until per\-skill training F1 exceeds a 90% threshold \(in practice, the iteration budget is the effective stopping criterion; see §[5\.2](https://arxiv.org/html/2606.30775#S5.SS2)\)\. At each iteration t, the routing evaluation produces error cases ℰ\_t = \(FP\_t, FN\_t, TP\_t\); the description with the highest training F1 across iterations is selected\. Algorithm [1](https://arxiv.org/html/2606.30775#alg1) summarizes the loop\.
Algorithm 1: Error\-Feedback Refinement
1: Input: skill s, training queries Q, budget T, threshold τ
2: Output: optimized description d̂
3: d\_0 ← Initialize\(s\) ▷ Stage 1
4: d̂ ← d\_0, f̂ ← 0
5: for t = 1, …, T:
6:     f\_t, ℰ\_t ← Evaluate\(d\_\{t−1\}, Q\)
7:     if f\_t > f̂: d̂ ← d\_\{t−1\}, f̂ ← f\_t
8:     if f\_t ≥ τ: break
9:     d\_t ← LLMRewrite\(d\_\{t−1\}, Sample\(ℰ\_t\)\)
10: return d̂
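The loop in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the deployed implementation: `evaluate` and `rewrite` are hypothetical callables standing in for the routing evaluation and the LLM rewrite call, and `sample` takes the first k cases rather than sampling randomly. Matching the TP count to the combined number of FP and FN cases is one reading of the text above and is an assumption.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: each case is a query string, and an
# evaluation returns (FP, FN, TP) lists for the skill being optimized.
Errors = Tuple[List[str], List[str], List[str]]
Evaluate = Callable[[str], Tuple[float, Errors]]  # description -> (train F1, errors)
Rewrite = Callable[[str, Errors], str]            # (description, errors) -> new description


def sample(errors: Errors, k: int = 5) -> Errors:
    """Keep up to k FP and k FN cases, plus TPs matched to that total (assumption)."""
    fp, fn, tp = errors
    fp, fn = fp[:k], fn[:k]
    return fp, fn, tp[: len(fp) + len(fn)]


def refine(
    d0: str,
    evaluate: Evaluate,
    rewrite: Rewrite,
    budget: int = 10,
    tau: float = 0.90,
) -> Tuple[str, float]:
    """Return the description with the highest training F1 seen within the budget."""
    best_d, best_f = d0, 0.0
    d = d0
    for _ in range(budget):
        f, errors = evaluate(d)          # score the current description
        if f > best_f:                   # track the best description so far
            best_d, best_f = d, f
        if f >= tau:                     # early stop at the F1 threshold
            break
        d = rewrite(d, sample(errors))   # one LLM rewrite from sampled cases
    return best_d, best_f
```

As in Algorithm 1, the description produced by the final rewrite is not evaluated once the budget is exhausted, so it can never be selected.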
Single\-shot variant\. We additionally evaluate a single\-shot configuration where the LLM is given the initialized description and all available FP and FN cases in a single prompt, producing one revised description without further iteration\. This simulates the approach a developer would take with a complete training set, and §5\.2 shows it is sufficient to match iterative refinement\.
Open\-world adaptation\. In the retrieval setting, all tools are indexed by description and candidates are retrieved per query via hybrid sparse\-dense retrieval \(BM25 \(Robertson and Zaragoza, [2009](https://arxiv.org/html/2606.30775#bib.bib181)\) combined with text\-embedding\-ada\-002 \(OpenAI, [2022](https://arxiv.org/html/2606.30775#bib.bib171)\) cosine similarity, with per\-query min\-max normalization\)\. The refinement loop runs identically, but FP cases now correspond to tools retrieved into the top\-20 candidate pool that should not be invoked, providing a targeted disambiguation signal that reflects retrieval errors\. Only tools that appear in at least one FP query’s retrieved pool are eligible for refinement\.
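One plausible way to combine the two retrieval signals is sketched below. The text specifies per-query min-max normalization and a BM25/embedding weighting, but not the exact fusion formula, so the weighted sum here is an assumption. The function names are hypothetical, and the inputs are assumed to be precomputed BM25 and cosine scores for one query, keyed by the same non-empty set of tool identifiers.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one query's scores to [0, 1]; constant scores map to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {tool: 0.0 for tool in scores}
    return {tool: (s - lo) / (hi - lo) for tool, s in scores.items()}


def hybrid_top_k(
    bm25: Dict[str, float],
    dense: Dict[str, float],
    w_bm25: float = 0.2,
    w_dense: float = 0.8,
    k: int = 20,
) -> List[str]:
    """Weighted sum of normalized sparse and dense scores, highest first."""
    nb, nd = min_max(bm25), min_max(dense)
    fused = {tool: w_bm25 * nb[tool] + w_dense * nd[tool] for tool in bm25}
    return sorted(fused, key=fused.get, reverse=True)[:k]
```

The default weights mirror the BM25=0.2, EMB=0.8 setting used as the default in the experiments. Normalizing per query keeps the unbounded BM25 scores from dominating the bounded cosine similarities.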
Table 1: Per\-skill skill\-selection F1 on the production agent\. Each row corresponds to replacing a single skill’s description with an automatically generated variant, while keeping all other skill descriptions fixed at their human\-written versions\. HUMAN: all skills use the currently deployed, manually tuned descriptions \(hence identical F1 across rows\); INIT: the target skill’s description is LLM\-initialized with no error feedback; SS: single\-shot LLM rewrite given all training FP/FN cases; Iter: iterative refinement loop \(max 10 iterations\)\. Average Δ\(HUMAN→SS\) of −0\.12% and Δ\(HUMAN→Iter\) of −0\.20% are both well within the multi\-seed per\-skill noise floor of ±0\.78% \(Appendix [E](https://arxiv.org/html/2606.30775#A5)\)\. Skills sorted by Δ\(HUMAN→Iter\) in descending order\.
## 4 Experimental Setup
Production setting\. The production system is an enterprise group chat agent whose LLM\-based planner routes queries to one of 9 skills spanning people search, web search, calendar scheduling, internal knowledge retrieval, organizational hierarchy lookup, email, and document generation\. We use 372 synthetic test cases with ground\-truth skill labels, created by product and engineering experts\. Each query may target one or more skills; all targeted skills are labeled as positives\. Per\-skill training queries are synthesized separately from the 372\-case test set; positive examples range from 10 to 119 per skill depending on skill scope, with negative examples sampled from queries targeting other skills to ensure label balance\. We compare four conditions: HUMAN \(currently deployed manually tuned descriptions, our primary comparison baseline\), OLD \(skill disabled, system\-level pre\-deployment baseline\), INIT \(LLM\-initialized description, no error feedback\), and Iter \(optimized description after refinement\)\. Training set is the full set of available examples per skill \(referred to as trainmax\); a smaller train20 variant is reported in Appendix [C](https://arxiv.org/html/2606.30775#A3)\.
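Because a query can have several positive skills, per-skill F1 is naturally computed by treating each skill as its own binary decision. The sketch below illustrates this under the assumption that the planner's output can be represented as a set of predicted skills per query; the function name and data layout are hypothetical and not taken from the production system.

```python
from typing import List, Set, Tuple

# One case per query: (gold skill set, predicted skill set).
Case = Tuple[Set[str], Set[str]]


def skill_f1(skill: str, cases: List[Case]) -> float:
    """Binary F1 for one skill over multi-label routing cases."""
    tp = sum(1 for gold, pred in cases if skill in gold and skill in pred)
    fp = sum(1 for gold, pred in cases if skill not in gold and skill in pred)
    fn = sum(1 for gold, pred in cases if skill in gold and skill not in pred)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when the skill never appears.
    return 2 * tp / denom if denom else 0.0
```

The same FP and FN counts that enter this score are the cases fed back to the LLM during refinement.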
ToolBench setting\. ToolBench \(Qin et al\., [2023](https://arxiv.org/html/2606.30775#bib.bib170)\) covers ∼16k RESTful API tools across I1, I2, I3 splits\. I2 is the primary split for skill collision evaluation, because it requests multiple tools from the same API category to fulfill the query\. ToolBench descriptions are developer\-facing API stubs \(e\.g\., *“Search Book by its name”, “This endpoint will return back all news about Climate Change from all over the world”*\) rather than routing instructions, which makes description quality directly measurable but also inflates the absolute gain numbers relative to systems with already\-curated descriptions\. We frame the task as single\-turn tool routing: given a query and a candidate pool, predict the correct tool, with no API execution\. We treat the first non\-terminal tool call from the official ToolBench trajectories as ground truth; the original trajectories are LLM\-generated and we did not manually verify each label\. We study two candidate pool regimes\. Closed\-world uses the per\-query pre\-defined available\_tools list \(3 to 15 tools\) from the ToolBench dataset as the full candidate pool\. Open\-world retrieves the top\-20 candidates per query from the full ∼16k corpus via hybrid retrieval\. For each split, we select the top\-100 tools by positive sample count \(requiring at least 25 positive and 25 negative queries\); these subsets are fixed across all ablations\. Per tool, we allocate 20 positive and 20 negative queries as the training set and hold out a balanced validation set of up to 50 positive and 50 negative queries, identical across all training\-size ablation runs so that comparisons reflect only the change in training signal\. We track three description states: ORIG \(iteration 0, original placeholder\), INIT \(iteration 1, LLM\-generated with no error feedback\), and Iter \(best description across up to 10 refinement iterations\)\.
Ablation dimensions\. On ToolBench we ablate: error signal type \(FP\+FN\+TP / FP\-only / FN\-only / FP\+FN without TP\), iteration budget \(\{5, 10, 30\}\), per\-iteration sampling \(5 cases per category vs\. all available\), training size \(\{5, 10, 20\} examples per class\), dual editing \(simultaneously refining the rival tool’s description\), and three BM25\-to\-dense retrieval weight combinations: \(0\.2, 0\.8\), \(0\.5, 0\.5\), \(0\.8, 0\.2\)\. Default settings unless specified are train=20, max iterations=10, BM25=0\.2, EMB=0\.8\.
Table 2: Five design choices originally motivating the pipeline complexity, evaluated on ToolBench I2 open\-world \(BM25=0\.2, EMB=0\.8, max\-10 iterations, train=20 unless ablated\)\. Δ is the gain from orig to best across variants of each design choice\. n is the number of tools eligible for optimization \(those with at least one retrieval false positive in the candidate pool\); n varies across rows because eligibility depends on the configuration\. Detailed per\-variant numbers in Appendix [A](https://arxiv.org/html/2606.30775#A1)\.
## 5 Results
### 5\.1 Automated optimization matches manual tuning at production scale
Table [1](https://arxiv.org/html/2606.30775#S3.T1) reports per\-skill HUMAN, INIT, single\-shot \(SS\), and iterative \(Iter\) F1 on the production agent\. The average Δ\(HUMAN→Iter\) across the 9 skills is −0\.20% and Δ\(HUMAN→SS\) is −0\.12%\. To characterize run\-to\-run variance of the live regression API, we ran 3 independent seeds of the trainmax pipeline; the average per\-skill standard deviation of ΔF1 is ±0\.78%, with 8 of 9 skills below ±1% std \(Appendix [E](https://arxiv.org/html/2606.30775#A5)\)\. Both averages are well below this per\-skill noise floor; the largest per\-skill differences are ±1\.7% for Iter and ±2\.5% for SS \(Table [1](https://arxiv.org/html/2606.30775#S3.T1)\)\. Per\-skill outcomes split nearly evenly under both methods: 5 skills favor Iter and 4 favor manual tuning; SS splits 4 vs\. 5\. We interpret this as both single\-shot and iterative automated optimization being non\-inferior to manual tuning at production scale, but not exceeding it\.
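The noise-floor comparison can be sketched as follows. This assumes the floor is the mean, over skills, of the across-seed standard deviation of ΔF1, and it uses the sample standard deviation; the text does not say whether the sample or population form was used. The function names are hypothetical, and the check is a simple magnitude comparison rather than a formal non-inferiority test.

```python
from statistics import mean, stdev
from typing import Dict, List


def noise_floor(delta_by_skill: Dict[str, List[float]]) -> float:
    """Mean over skills of the across-seed std of delta-F1 (percentage points)."""
    return mean(stdev(deltas) for deltas in delta_by_skill.values())


def within_noise(avg_delta: float, floor: float) -> bool:
    """True when the average difference is no larger than the noise floor."""
    return abs(avg_delta) <= floor
```

With the figures reported above, both `within_noise(-0.20, 0.78)` and `within_noise(-0.12, 0.78)` hold.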
The value of automated onboarding is therefore operational\. An applied scientist manually tuned descriptions for one skill using the standard process \(write, regression test, observe failures, edit\), recorded wall\-clock time, and compared against the automated pipeline\. Manual tuning required 120 minutes for this skill, whereas the automated pipeline produced comparable descriptions in 3\.8 minutes, which is a 32× speedup\.
Notably, the single\-shot variant achieves this match without iteration or any of the additional pipeline mechanisms\. The next section examines which components of the pipeline actually contribute to this result\.
### 5\.2 A single LLM rewrite captures most of the improvement
Table [2](https://arxiv.org/html/2606.30775#S4.T2) summarizes ablations of five design choices originally motivating the elaborate pipeline: feedback signal composition, iteration budget, dual editing of confused tool pairs, per\-iteration case sampling limit, and training set size\. All five vary final F1 within 0\.5% on ToolBench I2 open\-world\. Specifically: feedback signal type \(FP\+FN\+TP vs\. FP\-only vs\. FN\-only vs\. FP\+FN without TP\) yields gains within 0\.1% of each other; dual editing provides negligible benefit; sampling 5 cases per category per iteration achieves a similar gain to passing all available cases; training size effects \(5 vs\. 10 vs\. 20 examples per tool\) are within 0\.5%, with train=10 marginally best\. The implication is that the per\-example error feedback signal is the dominant driver of optimization quality; how that signal is packaged into the LLM rewrite prompt has less effect\. Detailed per\-variant numbers are in Appendix [A](https://arxiv.org/html/2606.30775#A1)\.
Figure [2](https://arxiv.org/html/2606.30775#S5.F2) shows the iteration budget sweep\. On ToolBench open\-world, gains increase monotonically with iteration budget: \+4\.24% at max\-5, \+4\.65% at max\-10, \+5\.07% at max\-30\. Approximately 90% of tools hit the iteration cap without reaching the 90% stopping criterion at max\-10, indicating optimization is still progressing when truncated\. However, single\-shot rewriting \(one LLM call given all available training FP and FN cases\) achieves \+4\.45%, matching max\-10 within 0\.2% and exceeding max\-5\. Iterative refinement with sampled feedback adds little when single\-shot has access to the full training signal\. The production pattern reported in Section [5\.1](https://arxiv.org/html/2606.30775#S5.SS1) confirms this finding at production scale\.
For practitioners trading compute against performance, single\-shot is the dominant choice across both settings\. Long iteration budgets capture marginal additional gains in open\-world ToolBench but the additional cost is rarely justified\.
Figure 2: Average F1 gain on validation across different max iteration budgets on ToolBench I2 open\-world \(BM25=0\.2, EMB=0\.8\)\. Single\-shot rewriting matches max\-10 iterative refinement within 0\.2%\.
### 5\.3 Optimization regimes diverge between retrieval\-based and fixed\-pool routing
Closed\-world baseline F1 is high \(∼91% on I2\) and gains from optimization are correspondingly modest \(0\.6\-0\.9% across I1, I2, I3; Appendix [H](https://arxiv.org/html/2606.30775#A8)\)\. Open\-world baseline F1 is lower \(∼70% on I2\) and gains are substantially larger \(\+4\.5%, Appendix [B](https://arxiv.org/html/2606.30775#A2)\)\. More striking than the magnitude difference is a directional one: LLM\-initialized descriptions *hurt* closed\-world F1 by 0\.8\-1\.4% compared to the original placeholder text, while *improving* open\-world F1 by 1\.5\-2\.1%\.
The two regimes require separate optimization\. We re\-evaluated closed\-world\-optimized descriptions in the open\-world setting without further refinement: gains are modest \(0\.4\-1\.0%\), well below in\-setting open\-world optimization \(3\.7\-4\.5%\)\. Descriptions tuned against a fixed candidate pool do not generalize to the dynamic retrieved pool: closed\-world optimization improves routing precision among a small known set of competitors, while open\-world optimization additionally improves retrieval recall\.
Description optimization is qualitatively different in retrieval\-based vs\. fixed\-pool routing\. Practitioners should determine which regime their deployment matches and optimize within that regime: LLM initialization is a useful starting point for retrieval\-based routing but should be avoided when the candidate pool is fixed and small\.
### 5\.4 Initial training F1 as a diagnostic signal
Figure [3](https://arxiv.org/html/2606.30775#S5.F3) examines whether iter\-0 training F1 \(the routing F1 of the original placeholder description on the training set, before any optimization\) predicts the value of optimization on held\-out data\. On ToolBench open\-world, tools with iter\-0 training F1 below 65% \(n=33\) achieve an average held\-out gain of \+6\.27%, approximately 10 times the gain of tools with iter\-0 training F1 at or above 65% \(n=15, \+0\.63%; t\-test p<0\.001\)\. The pattern is robust to threshold choice \(Appendix [G](https://arxiv.org/html/2606.30775#A7)\), and all tools with validation gains above 10% fall in the low iter\-0 region\. Iter\-0 training F1 is a useful prioritization signal: tools below 65% are the candidates where optimization adds substantial value\.
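The threshold split described above amounts to grouping tools by iter-0 training F1 and comparing mean held-out gains. A minimal sketch follows, with a hypothetical function name and input layout; the significance test is omitted.

```python
from statistics import mean
from typing import List, Optional, Tuple


def split_by_iter0(
    tools: List[Tuple[float, float]], threshold: float = 65.0
) -> Tuple[Optional[float], Optional[float]]:
    """tools: (iter-0 training F1, held-out F1 gain), both in percentage points.

    Returns (mean gain below threshold, mean gain at or above threshold);
    an empty group yields None.
    """
    low = [gain for f1, gain in tools if f1 < threshold]
    high = [gain for f1, gain in tools if f1 >= threshold]
    return (mean(low) if low else None, mean(high) if high else None)
```

For the reported group means, the ratio 6.27 / 0.63 is about 9.95, which is the "approximately 10 times" figure quoted above.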
On ToolBench, most training improvements transfer to held\-out performance\. Per\-tool training and validation F1 gains on ToolBench are positively correlated \(Spearman ρ=0\.66\); only 2 of 48 tools \(4%\) exhibit an overfitting signature where training gains substantially exceed validation gains \(Appendix [I](https://arxiv.org/html/2606.30775#A9)\)\.
A subset of skills cannot be resolved by description optimization alone\. On the production agent, two skills \(IntKnowledge and WorkbackPlan\) follow this pattern: low iter\-0 training F1, large training gains, but failure to exceed the HUMAN baseline on the held\-out test \(Table [1](https://arxiv.org/html/2606.30775#S3.T1); multi\-seed confirmation in Appendix [E](https://arxiv.org/html/2606.30775#A5)\)\. Both skills carry inherent semantic overlap with other skills in the deployment: IntKnowledge retrieves person\-related organizational data, overlapping PeopleSearch; WorkbackPlan produces structured plans, overlapping StatusReport in surface form\. Description text alone cannot disambiguate skills whose intended scopes overlap, regardless of training signal quality\. Practitioners encountering a skill with low iter\-0 training F1 and a large Δtrain versus Δval gap should consider architectural intervention \(e\.g\., intent\-specific routing rules, restructured skill scopes\) rather than description refinement\.
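The triage rule in the last sentence can be expressed as a simple flag. The sketch below is illustrative only: the `SkillStats` fields and both thresholds are assumptions. The 65% cutoff is borrowed from the ToolBench analysis above, and the 10-point gap is a placeholder that would need calibration per deployment, since the text gives no numeric value for a "large" gap.

```python
from dataclasses import dataclass


@dataclass
class SkillStats:
    name: str
    iter0_train_f1: float  # training F1 before optimization, in percent
    delta_train: float     # training F1 gain from optimization, in points
    delta_val: float       # held-out F1 gain from optimization, in points


def needs_architectural_review(
    s: SkillStats, low_f1: float = 65.0, max_gap: float = 10.0
) -> bool:
    """Flag skills with low initial F1 whose training gains do not transfer."""
    gap = s.delta_train - s.delta_val
    return s.iter0_train_f1 < low_f1 and gap > max_gap
```

A flagged skill is a candidate for scope restructuring or intent-specific routing rules; an unflagged skill with low iter-0 F1 remains a good target for description rewriting.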
Figure 3: Each dot is one ToolBench I2 tool \(n=48\)\. Tools with iter\-0 training F1 below 65% \(vertical dashed line\) achieve substantially higher validation F1 gains\.
## 6 Conclusion
We deployed an automated description optimization pipeline on a production enterprise group chat agent and validated findings at scale on ToolBench\. Systematic ablation showed most of the originally designed mechanisms are unnecessary; a single\-pass LLM rewrite given any FP/FN cases captures the bulk of available improvement and is non\-inferior to manually tuned descriptions at production scale\.
## Limitations
We acknowledge several limitations\. Our production system has 9 skills and 372 synthetic test cases, limiting statistical power for per\-skill claims\. The ToolBench original descriptions are placeholder text, which inflates the absolute gain numbers compared to systems with already\-curated descriptions; the system’s value in such settings remains to be characterized\. Open\-world ablations are conducted on I2 only; I1 and I3 open\-world experiments are left for future work\. The semantic overfitting observation in production is based on 2 of 9 skills; while ToolBench data falsifies the original “low iter\-0 predicts failure” hypothesis at scale, the production\-specific pattern in low\-data settings warrants validation on additional production deployments\.
## References
- L\. A\. Agrawal, S\. Tan, D\. Soylu, N\. Ziems, R\. Khare, K\. Opsahl\-Ong, A\. Singhvi, H\. Shandilya, M\. J\. Ryan, M\. Jiang, C\. Potts, K\. Sen, A\. G\. Dimakis, I\. Stoica, D\. Klein, M\. Zaharia, and O\. Khattab \(2026\) GEPA: reflective prompt evolution can outperform reinforcement learning\. External Links: 2507\.19457, [Link](https://arxiv.org/abs/2507.19457)\. Cited by: [§2](https://arxiv.org/html/2606.30775#S2.p3.1)\.
- Fang et al\. \(2025\) PLAY2PROMPT: zero\-shot tool instruction optimization for llm agents via tool play\. External Links: 2503\.14432, [Link](https://arxiv.org/abs/2503.14432)\. Cited by: [§2](https://arxiv.org/html/2606.30775#S2.p2.1)\.
- R\. Guo, K\. Dong, X\. Gao, and K\. Das \(2026\) Learning to rewrite tool descriptions for reliable llm\-agent tool use\. External Links: 2602\.20426, [Link](https://arxiv.org/abs/2602.20426)\. Cited by: [§2](https://arxiv.org/html/2606.30775#S2.p2.1)\.
- J\. Jia and Q\. Li \(2025\) AutoTool: efficient tool selection for large language model agents\. External Links: 2511\.14650, [Link](https://arxiv.org/abs/2511.14650)\. Cited by: [§2](https://arxiv.org/html/2606.30775#S2.p1.1)\.
- O\. Khattab, A\. Singhvi, P\. Maheshwari, Z\. Zhang, K\. Santhanam, S\. Vardhamanan, S\. Haq, A\. Sharma, T\. T\. Joshi, H\. Moazam, H\. Miller, M\. Zaharia, and C\. Potts \(2023\) DSPy: compiling declarative language model calls into self\-improving pipelines\. ArXiv abs/2310\.03714\. External Links: [Link](https://api.semanticscholar.org/CorpusID:263671701)\. Cited by: [§2](https://arxiv.org/html/2606.30775#S2.p3.1)\.
- M\. Li, Y\. Zhao, B\. Yu, F\. Song, H\. Li, H\. Yu, Z\. Li, F\. Huang, and Y\. Li \(2023\) API\-Bank: a comprehensive benchmark for tool\-augmented LLMs\. External Links: 2304\.08244\. Cited by: [§2](https://arxiv.org/html/2606.30775#S2.p1.1)\.
- M\. M\. Liu, D\. Garcia, F\. Parllaku, V\. Upadhyay, S\. F\. A\. Shah, and D\. Roth \(2025\) ToolScope: enhancing llm agent tool use through tool merging and context\-aware filtering\. External Links: 2510\.20036, [Link](https://arxiv.org/abs/2510.20036)\. Cited by: [§2](https://arxiv.org/html/2606.30775#S2.p1.1)\.
- E\. Lumer, F\. Nizar, A\. Gulati, P\. H\. Basavaraju, and V\. K\. Subbiah \(2025\) Tool\-to\-agent retrieval: bridging tools and agents for scalable llm multi\-agent systems\. External Links: 2511\.01854, [Link](https://arxiv.org/abs/2511.01854)\. Cited by: [§2](https://arxiv.org/html/2606.30775#S2.p1.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Prabhumoye, B\. P\. Majumder, X\. Lu, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark \(2023\) Self\-refine: iterative refinement with self\-feedback\. In Advances in Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2606.30775#S2.p3.1)\.
- OpenAI \(2022\) New and improved embedding model\. Note: [https://openai\.com/blog/new\-and\-improved\-embedding\-model](https://openai.com/blog/new-and-improved-embedding-model)\. Cited by: [§3](https://arxiv.org/html/2606.30775#S3.p5.1)\.
- K\. Opsahl\-Ong, M\. J\. Ryan, J\. Purtell, D\. Broman, C\. Potts, M\. Zaharia, and O\. Khattab \(2024\) Optimizing instructions and demonstrations for multi\-stage language model programs\. External Links: 2406\.11695, [Link](https://arxiv.org/abs/2406.11695)\. Cited by: [§2](https://arxiv.org/html/2606.30775#S2.p3.1)\.
- S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez \(2023\) Gorilla: large language model connected with massive APIs\. External Links: 2305\.15334\. Cited by: [§2](https://arxiv.org/html/2606.30775#S2.p1.1)\.
- R\. Pryzant, D\. Iter, J\. Li, Y\. Lee, C\. Zhu, and M\. Zeng \(2023\) Automatic prompt optimization with “gradient descent” and beam search\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 7957–7968\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.494/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.494)\. Cited by: [§2](https://arxiv.org/html/2606.30775#S2.p3.1)\.
- Y\. Qin, S\. Liang, Y\. Ye, K\. Zhang, Y\. Lin, X\. Sun, Q\. Li, Z\. Liu, Q\. Sun, Q\. Zeng, Q\. Wei, S\. Hu, Z\. Liu, S\. Han, W\. Chen, J\. Yi, W\. Zhao, T\. Gui, Z\. Zhang, et al\. \(2023\) ToolLLM: facilitating large language models to master 16000\+ real\-world APIs\. External Links: 2307\.16789\. Cited by: [§1](https://arxiv.org/html/2606.30775#S1.p4.1), [§2](https://arxiv.org/html/2606.30775#S2.p1.1), [§4](https://arxiv.org/html/2606.30775#S4.p2.2)\.
- C\. Qu, S\. Dai, X\. Wei, H\. Cai, S\. Wang, D\. Yin, J\. Xu, and J\. Wen \(2025\)From exploration to mastery: enabling llms to master tools via self\-driven interactions\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=QKBu1BOAwd)Cited by:[§2](https://arxiv.org/html/2606.30775#S2.p2.1)\.
- S\. E\. Robertson and H\. Zaragoza \(2009\)The probabilistic relevance framework: bm25 and beyond\.Found\. Trends Inf\. Retr\.3,pp\. 333–389\.External Links:[Link](https://api.semanticscholar.org/CorpusID:207178704)Cited by:[§3](https://arxiv.org/html/2606.30775#S3.p5.1)\.
- Y\. Shen, K\. Song, X\. Tan, D\. Li, W\. Lu, and Y\. Zhuang \(2023\)HuggingGPT: solving AI tasks with ChatGPT and its friends in Hugging Face\.External Links:2303\.17580Cited by:[§2](https://arxiv.org/html/2606.30775#S2.p1.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.External Links:2303\.11366Cited by:[§2](https://arxiv.org/html/2606.30775#S2.p3.1)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu, A\. H\. Awadallah, R\. W\. White, D\. Burger, and C\. Wang \(2023\)AutoGen: enabling next\-gen LLM applications via multi\-agent conversation\.External Links:2308\.08155Cited by:[§1](https://arxiv.org/html/2606.30775#S1.p1.1)\.
- C\. Yang, X\. Wang, Y\. Lu, H\. Liu, Q\. V\. Le, D\. Zhou, and X\. Chen \(2024\)Large language models as optimizers\.External Links:2309\.03409Cited by:[§2](https://arxiv.org/html/2606.30775#S2.p3.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InThe Eleventh International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.30775#S1.p1.1)\.
- S\. Yuan, K\. Song, J\. Chen, X\. Tan, Y\. Shen, R\. Kan, D\. Li, and D\. Yang \(2024\)EASYTOOL: enhancing llm\-based agents with concise tool instruction\.External Links:2401\.06201,[Link](https://arxiv.org/abs/2401.06201)Cited by:[§2](https://arxiv.org/html/2606.30775#S2.p2.1)\.
- M\. Yuksekgonul, F\. Bianchi, J\. Boen, S\. Liu, P\. Lu, Z\. Huang, C\. Guestrin, and J\. Zou \(2025\)Optimizing generative ai by backpropagating language model feedback\.Nature639\(8055\),pp\. 609–616\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-08661-4)Cited by:[§2](https://arxiv.org/html/2606.30775#S2.p3.1)\.
- Y\. Zhou, A\. I\. Muresanu, Z\. Han, K\. Paster, S\. Pitis, H\. Chan, and J\. Ba \(2023\)Large language models are human\-level prompt engineers\.InThe Eleventh International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.30775#S2.p3.1)\.
## Appendix A Ablation Tables
We ablate on the error signal type, iterative refinement, dual editing, and training size in Tables [3](https://arxiv.org/html/2606.30775#A1.T3), [4](https://arxiv.org/html/2606.30775#A1.T4), [5](https://arxiv.org/html/2606.30775#A1.T5), [6](https://arxiv.org/html/2606.30775#A1.T6), [7](https://arxiv.org/html/2606.30775#A1.T7), and [8](https://arxiv.org/html/2606.30775#A1.T8)\.
Table 3: Error signal type ablation on I2 open\-world \(48\-tool intersection, BM25=0\.2/EMB=0\.8, train\_size=20, max10\)\. All conditions within $0.09\%$\.
Table 4: Error signal type ablation on the production agent \(9 skills, trainmax\)\. All conditions within $0.69\%$, inside production API non\-determinism noise\. $\Delta$ = avg\. NEW F1 $-$ avg\. OLD F1 across all 9 skills\.
Table 5: Single\-shot vs\. iterative refinement on I2 open\-world \(47\-tool overlap, BM25=0\.2/EMB=0\.8, train\_size=20\)\.
Table 6: Dual editing ablation on ToolBench I2 open\-world \(BM25=0\.2/EMB=0\.8, train\_size=20, max10\)\. $n$ differs because dual editing fails on one tool whose training cases trigger the Azure content filter\. Difference of $0.03\%$ is well within experimental noise\.
Table 7: Dual editing ablation on the production agent \(trainmax\)\. $\Delta$ = Iter F1 $-$ Avg OLD baseline \(averaged across 4 settings, $75.6\%$ overall\)\. Average effect of dual editing is $-0.11\%$, well within the per\-skill $\pm 0.78\%$ noise floor\.
Table 8: Training size ablation on ToolBench I2 open\-world \(BM25=0\.2/EMB=0\.8, max10\)\. $n$ varies because tool eligibility \(at least one retrieval false positive in the train set\) increases with train size\. $\Delta$ values within $0.5\%$\.
## Appendix B Open\-World Detail \(ToolBench\)
Table [9](https://arxiv.org/html/2606.30775#A2.T9) reports the full open\-world results referenced in §[5\.3](https://arxiv.org/html/2606.30775#S5.SS3)\. *Transfer* re\-evaluates closed\-world\-optimized descriptions in the open\-world retrieval setting without further refinement\. *In\-setting* runs the full optimization loop using the retrieved candidate pool\. Embedding\-heavy retrieval \(BM25=0\.2, EMB=0\.8\) consistently outperforms BM25\-heavy across both transfer and in\-setting; BM25 fails to differentiate tools with generic placeholder names\. $n$ is the number of tools with at least one retrieval false positive; transfer evaluates all 100 tools \(no FP requirement\), while in\-setting requires at least one FP per tool to be optimizable\.
Table 9: Open\-world results on ToolBench I2\. Transfer re\-evaluates closed\-world descriptions; in\-setting re\-optimizes against the retrieved candidate pool\.

| Setting | BM25/EMB | $n$ | Orig | Best | $\Delta_{0\to F}$ |
| --- | --- | --- | --- | --- | --- |
| Transfer \(re\-evaluate closed\-world descriptions\) | 0\.2/0\.8 | 100 | 70\.9 | 71\.8 | \+1\.0 |
| Transfer \(re\-evaluate closed\-world descriptions\) | 0\.5/0\.5 | 100 | 71\.8 | 72\.6 | \+0\.7 |
| Transfer \(re\-evaluate closed\-world descriptions\) | 0\.8/0\.2 | 100 | 70\.0 | 70\.5 | \+0\.4 |
| In\-setting optimization \(max10\) | 0\.2/0\.8 | 48 | 70\.6 | 75\.1 | **\+4\.5** ↑ |
| In\-setting optimization \(max10\) | 0\.5/0\.5 | 45 | 68\.9 | 72\.8 | \+3\.9 ↑ |
| In\-setting optimization \(max10\) | 0\.8/0\.2 | 43 | 70\.0 | 73\.7 | \+3\.7 ↑ |
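The BM25/EMB weights in this appendix can be read as a convex combination of a lexical score and an embedding score. The sketch below illustrates that reading only: the min-max normalization, the function names, and the toy scores are assumptions, since the text states the weights but not how the two scores are normalized or fused.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score list maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


def hybrid_scores(
    bm25: Dict[str, float],
    emb: Dict[str, float],
    w_bm25: float = 0.2,
    w_emb: float = 0.8,
) -> Dict[str, float]:
    """Weighted sum of normalized lexical and embedding scores (assumed fusion rule)."""
    b, e = min_max(bm25), min_max(emb)
    return {tool: w_bm25 * b[tool] + w_emb * e[tool] for tool in bm25}


# Hypothetical per-tool scores for a single query.
bm25 = {"tool_a": 3.0, "tool_b": 1.0, "tool_c": 2.0}
emb = {"tool_a": 0.10, "tool_b": 0.90, "tool_c": 0.50}

fused = hybrid_scores(bm25, emb)
ranking = [tool for tool, _ in sorted(fused.items(), key=lambda kv: kv[1], reverse=True)]
print(ranking)  # ['tool_b', 'tool_c', 'tool_a']
```

With the embedding-heavy weights, `tool_b` ranks first even though it has the lowest lexical score, which is the behavior one would expect when tool names are generic placeholders.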
## Appendix C Production train20 Results
Trainmax \(used in the main paper, §[5\.1](https://arxiv.org/html/2606.30775#S5.SS1)\) uses all available positive examples per skill \(10 to 119, varying by skill\)\. For comparison, train20 caps the training set at 20 positive plus 20 negative examples per skill \(or fewer when the skill has fewer positives available; SendEmail, generate\_status\_report, and WorkbackPlan fall back to all available data\)\. Table [10](https://arxiv.org/html/2606.30775#A3.T10) reports per\-skill F1 for the train20\-standard run\. On average, train20 yields $\Delta$\(OLD→Iter\) $= +1.71\%$ versus trainmax\-standard’s $+3.65\%$, indicating that more training data helps in aggregate but not uniformly: 5 skills are within $1\%$ of their trainmax result, while GetUserManager regresses sharply at train20 \(an outlier we attribute to API non\-determinism, since the multi\-seed trainmax results show GetUserManager is otherwise stable; Appendix [E](https://arxiv.org/html/2606.30775#A5)\)\.
Table 10: train20\-standard production results \(per\-skill F1; train pos/neg = 20/20, capped by availability\)\. $\Delta$ = Iter $-$ OLD\.
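The train20 cap described above can be sketched as follows. This is a minimal illustration, assuming uniform random sampling without replacement; the function name, the seed handling, and the synthetic case identifiers are hypothetical, and the appendix does not say how the 20 examples are selected.

```python
import random
from typing import List, Sequence, Tuple


def cap_training_set(
    positives: Sequence[str],
    negatives: Sequence[str],
    cap: int = 20,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Keep at most `cap` examples per class; use everything when fewer exist."""
    rng = random.Random(seed)

    def take(cases: Sequence[str]) -> List[str]:
        cases = list(cases)
        # Fall back to all available data when the class is at or below the cap.
        return cases if len(cases) <= cap else rng.sample(cases, cap)

    return take(positives), take(negatives)


# A skill with many examples is capped at 20/20.
large_pos, large_neg = cap_training_set(
    [f"pos-{i}" for i in range(119)], [f"neg-{i}" for i in range(300)]
)
# A skill with few examples keeps all of them.
small_pos, small_neg = cap_training_set(
    [f"pos-{i}" for i in range(10)], [f"neg-{i}" for i in range(15)]
)
print(len(large_pos), len(large_neg), len(small_pos), len(small_neg))  # 20 20 10 15
```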
## Appendix D OLD Baseline \(Production Agent\)
OLD measures the system\-level F1 with the new skill disabled, evaluated on the subset of test cases that do not target the skill\. Because OLD uses a different test subset than HUMAN/INIT/Iter \(which evaluate on the full 372 cases with the skill enabled\), OLD→Iter gains conflate skill availability with description quality and are not directly comparable to HUMAN→Iter\. We report OLD here for completeness; §[5\.1](https://arxiv.org/html/2606.30775#S5.SS1) uses HUMAN→Iter as the primary comparison\.
Table 11: Production results \(trainmax, standard optimization\) including the OLD baseline\. OLD evaluates the system with the new skill disabled on a different test subset; it is not directly comparable to HUMAN/INIT/NEW\. $\Delta$ Deploy = OLD→NEW \(deployment \+ quality\)\. Averaged OLD baseline across runs\.
## Appendix E Multi\-Seed Variance \(Production Agent\)
To characterize run\-to\-run variance, we ran 3 independent seeds of the trainmax\-standard pipeline\. Each seed includes a fresh optimization run \(LLM calls at temperature $> 0$\) and a fresh regression evaluation \(live API calls at temperature $> 0$\)\. The HUMAN baseline \($79.4\%$\) was evaluated in a single regression run; we did not characterize HUMAN\-specific run variance\. The reported per\-skill noise floor of $\Delta$F1 \($\pm 0.78\%$ std on average; Table [12](https://arxiv.org/html/2606.30775#A5.T12)\) bounds Iter variance directly; $\Delta$\(HUMAN→Iter\) variance includes additional HUMAN evaluation noise we did not measure\. The mean $\Delta$\(OLD→Iter\) is $+3.27 \pm 0.78\%$ across the 3 seeds, with 8 of 9 skills below $\pm 1\%$ std\. IntKnowledge regresses against OLD in all 3 seeds \($-1.90\%$, $-3.72\%$, $-2.04\%$\), confirming this is a robust failure rather than a single\-run artifact\.
Table 12: Per\-skill $\Delta$F1 \(NEW $-$ OLD\) across 3 independent seeds of the trainmax\-standard pipeline\. Std is sample standard deviation across seeds\. The average per\-skill std is $\pm 0.78\%$\.
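As a worked example of the noise-floor computation, the snippet below takes the sample standard deviation of the three IntKnowledge seed values quoted above and averages per-skill stds. Only the IntKnowledge numbers come from the text; the helper names are illustrative, and the other skills' per-seed values are not reproduced here.

```python
import statistics
from typing import Dict, List


def per_skill_std(deltas_by_skill: Dict[str, List[float]]) -> Dict[str, float]:
    """Sample standard deviation (n - 1 denominator) of delta-F1 across seeds."""
    return {skill: statistics.stdev(d) for skill, d in deltas_by_skill.items()}


def noise_floor(deltas_by_skill: Dict[str, List[float]]) -> float:
    """Average of the per-skill standard deviations."""
    return statistics.mean(list(per_skill_std(deltas_by_skill).values()))


# Delta(OLD -> Iter) for IntKnowledge in the 3 seeds, in percentage points.
int_knowledge = [-1.90, -3.72, -2.04]
mean_delta = statistics.mean(int_knowledge)
std_delta = statistics.stdev(int_knowledge)
print(f"{mean_delta:.2f} +/- {std_delta:.2f}")  # -2.55 +/- 1.01
```

The resulting std of about 1.01 points sits just above the 1% mark, which is consistent with the statement that 8 of the 9 skills fall below it.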
## Appendix F Production Training F1 Dynamics
Table [13](https://arxiv.org/html/2606.30775#A6.T13) reports per\-skill training\-set F1 at iteration 0 \(the initial LLM\-generated description\) and the best across iterations on the trainmax\-standard production setting\. Two skills exhibit large training F1 gains: IntKnowledge \($+30.8$ pp\) and WorkbackPlan \($+28.1$ pp\)\. Both also start with low iter\-0 training F1 \($\leq 67\%$\), and neither translates the training gain into a held\-out F1 improvement above the HUMAN baseline \(Table [1](https://arxiv.org/html/2606.30775#S3.T1)\)\. The other 7 skills show no training F1 movement: each begins at or above the $90\%$ stop criterion and the optimization loop terminates at iter\-0\. The pattern matches the small\-data overfitting characterization in §[5\.4](https://arxiv.org/html/2606.30775#S5.SS4)\.
Table 13: Per\-skill training\-set F1 at iter\-0 \(initial LLM description\) vs\. best across iterations on the production agent \(trainmax\-standard\)\. $\Delta$train is the iterative refinement gain on the training subset\. Skills ordered as in Table [1](https://arxiv.org/html/2606.30775#S3.T1)\.
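The early-termination behavior described above can be sketched as a loop with a training-F1 stop criterion. This is an illustration under stated assumptions rather than the production pipeline: `evaluate` and `rewrite` are hypothetical callables standing in for the regression evaluation and the LLM rewrite, the 10-iteration budget is an assumed reading of "max10", and the toy scores are invented.

```python
from typing import Callable, List, Tuple

EvalResult = Tuple[float, List[str]]  # (training F1, error cases)


def refine_description(
    initial: str,
    evaluate: Callable[[str], EvalResult],
    rewrite: Callable[[str, List[str]], str],
    stop_f1: float = 0.90,
    max_iters: int = 10,
) -> Tuple[str, float, List[float]]:
    """Rewrite until training F1 reaches stop_f1 or the iteration budget runs out."""
    desc = initial
    f1, errors = evaluate(desc)
    best_desc, best_f1 = desc, f1
    history = [f1]
    for _ in range(max_iters):
        if f1 >= stop_f1:
            break  # already at or above the stop criterion
        desc = rewrite(desc, errors)
        f1, errors = evaluate(desc)
        history.append(f1)
        if f1 > best_f1:
            best_desc, best_f1 = desc, f1
    return best_desc, best_f1, history


# Toy stand-ins: each "+" in the description marks one rewrite.
TOY_SCORES = [0.60, 0.75, 0.92]


def toy_evaluate(desc: str) -> EvalResult:
    revisions = min(desc.count("+"), len(TOY_SCORES) - 1)
    return TOY_SCORES[revisions], ["example misrouted query"]


def toy_rewrite(desc: str, errors: List[str]) -> str:
    return desc + "+"


_, low_best, low_history = refine_description("draft", toy_evaluate, toy_rewrite)
_, high_best, high_history = refine_description("draft++", toy_evaluate, toy_rewrite)
print(low_history)   # [0.6, 0.75, 0.92]
print(high_history)  # [0.92]
```

The second call mirrors the 7 skills that start at or above the criterion: no rewrite is attempted, so training F1 shows no movement.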
## Appendix G Iter\-0 Training F1 as a Failure Predictor \(ToolBench\)
We test whether the production observation in §[5\.4](https://arxiv.org/html/2606.30775#S5.SS4), that low iter\-0 training F1 predicts failure, generalizes to the 48 ToolBench open\-world tools \(I2, BM25=0\.2/EMB=0\.8, train\_size=20, max10\)\. For each tool, we extract iter\-0 training F1 \(train\_f1@0\) and the held\-out validation gain $\Delta$val from the existing optimization logs; no new LLM calls are required\.
The 65% threshold derived from production is highly significant on ToolBench \(Welch’s $t$, $p = 0.0008$, $N = 48$\): tools below 65% iter\-0 training F1 gain $+6.3\%$ on validation on average, vs\. $+0.6\%$ for tools above \(Table [14](https://arxiv.org/html/2606.30775#A7.T14)\)\. However, the direction is opposite to the production failure mode: at ToolBench scale, low iter\-0 F1 predicts *larger* held\-out gains, not failure to generalize\. Spearman correlation between $\Delta$train and $\Delta$val is $+0.66$ \($p < 10^{-4}$\): in the open\-world retrieval setting, large training gains generalize\. We interpret the production overfitting \(IntKnowledge, WorkbackPlan\) as a small\-data\-regime artifact of low\-positive\-sample skills \(10–119 positives\), not a general property of the pipeline\.
Table 14: Threshold sweep: mean $\Delta$val for tools below vs\. above each iter\-0 training F1 threshold $\tau$\. The 65% threshold from production is highly significant \($p = 0.0008$\)\. At ToolBench scale, low iter\-0 F1 predicts *larger* validation gains, not failure\.
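The threshold comparison above amounts to splitting tools at a value of $\tau$ and running Welch's unequal-variance test on the two groups of $\Delta$val. The sketch below computes the statistic and the Welch–Satterthwaite degrees of freedom with the standard library only. The records are synthetic and the function names are illustrative; they are not the ToolBench logs.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def split_by_threshold(
    records: Sequence[Tuple[float, float]], tau: float
) -> Tuple[List[float], List[float]]:
    """Split (iter-0 train F1, delta-val) records into below-tau and at-or-above-tau groups."""
    below = [dv for f1, dv in records if f1 < tau]
    above = [dv for f1, dv in records if f1 >= tau]
    return below, above


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = statistics.variance(a) / len(a), statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Synthetic (train_f1@0, delta-val in points) pairs, for illustration only.
records = [(0.50, 7.0), (0.55, 5.0), (0.60, 6.0), (0.70, 1.0), (0.80, 0.0), (0.90, 2.0)]
below, above = split_by_threshold(records, tau=0.65)
t_stat, dof = welch_t(below, above)
print(f"mean below={statistics.mean(below):.1f}, mean above={statistics.mean(above):.1f}")
print(f"t={t_stat:.3f}, df={dof:.1f}")
```

Turning the statistic into a p-value requires the Student t distribution, which the standard library does not provide; if SciPy is available, `scipy.stats.ttest_ind(below, above, equal_var=False)` performs the same test and returns the p-value.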
## Appendix H Closed\-World Results
Table [15](https://arxiv.org/html/2606.30775#A8.T15) shows validation F1 in the closed\-world setting, where the candidate pool is the per\-query available\_tools list \(3–15 tools\) rather than the full retrieved corpus\. A consistent pattern emerges across all splits: LLM initialization \(init\) *reduces* F1 by $0.8$–$1.4\%$ relative to the original placeholder descriptions \(orig\), while iterative refinement recovers and surpasses the original\. Despite being completely uninformative text, placeholder descriptions perform comparably to zero\-shot LLM\-generated ones in this setting; the gains come entirely from error feedback, not initialization\. This contrasts with the open\-world setting \(§[5\.3](https://arxiv.org/html/2606.30775#S5.SS3)\) where initialization helps, suggesting description quality matters more when retrieval recall is the bottleneck\.
On I2, the training size sweet spot is 10 examples per class \($+1.0\%$\); additional examples do not consistently improve the refinement signal\. Iterative refinement \($+0.9\%$\) modestly outperforms single\-shot \($+0.6\%$\), consistent with per\-example signal being weaker in the fixed small candidate pool\.
Table 15: Closed\-world results\. Top: I1/I2/I3 under standard optimization \(train\_size=20, max10\)\. Bottom: I2 training size ablation and single\-shot comparison\. Val F1 averaged over 100 tools\.

| Condition | $n$ | Orig | Init | Best | $\Delta_{0\to 1}$ | $\Delta_{0\to F}$ |
| --- | --- | --- | --- | --- | --- | --- |
| I1 | 100 | 90\.6 | 89\.4 | 91\.5 | −1\.2 | \+0\.9 ↑ |
| I2 | 100 | 90\.6 | 89\.2 | 91\.5 | −1\.4 | \+0\.9 ↑ |
| I3 | 100 | 82\.8 | 82\.0 | 83\.5 | −0\.8 | \+0\.6 ↑ |
| *I2 training size ablation* | | | | | | |
| train=5 | 100 | 90\.7 | 89\.5 | 91\.1 | −1\.2 | \+0\.4 ↑ |
| train=10 | 100 | 90\.7 | 89\.3 | 91\.7 | −1\.4 | **\+1\.0** ↑ |
| train=20 | 100 | 90\.6 | 89\.2 | 91\.5 | −1\.4 | \+0\.9 ↑ |
| *I2 single\-shot vs\. iterative \(train=20, max10\)* | | | | | | |
| Single\-shot | 100 | 90\.6 | — | 91\.3 | — | \+0\.6 ↑ |
| Iterative | 100 | 90\.6 | 89\.2 | 91\.5 | −1\.4 | \+0\.9 ↑ |
## Appendix I Training versus validation gain at ToolBench scale
Figure [4](https://arxiv.org/html/2606.30775#A9.F4) plots per\-tool training F1 gain against validation F1 gain for the 48 tools optimized in the open\-world setting\. Training and validation gains are positively correlated \(Spearman $\rho = 0.66$\); 2 tools \($4\%$\) exhibit an overfitting signature \($\Delta$train $> 20\%$, $\Delta$val $< 5\%$\)\.
Figure 4: Per\-tool $\Delta$train versus $\Delta$val on ToolBench I2 open\-world \($n = 48$\)\. Spearman $\rho = 0.66$\. Two outliers in the lower\-right exhibit overfitting\.
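The two quantities in this appendix, the rank correlation between training and validation gains and the overfitting signature, can be computed from per-tool gains as sketched below. The gains are synthetic and the helper names are hypothetical; only the 20% and 5% cut-offs come from the text.

```python
import math
from typing import List, Sequence


def ranks(xs: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def overfit_flags(
    d_train: Sequence[float],
    d_val: Sequence[float],
    train_min: float = 20.0,
    val_max: float = 5.0,
) -> List[bool]:
    """Flag tools with a large training gain but a small validation gain."""
    return [dt > train_min and dv < val_max for dt, dv in zip(d_train, d_val)]


# Synthetic per-tool gains in percentage points.
d_train = [2.0, 10.0, 25.0, 30.0, 5.0]
d_val = [0.5, 4.0, 1.0, 12.0, 2.0]
rho = spearman(d_train, d_val)
flags = overfit_flags(d_train, d_val)
print(round(rho, 2), flags)  # 0.7 [False, False, True, False, False]
```

Only the third synthetic tool is flagged: its training gain exceeds 20 points while its validation gain stays under 5, the lower-right pattern described for the two ToolBench outliers.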