SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing
Summary
SkillAudit introduces a framework for evolving LLM agent skills without ground-truth feedback by using paired trajectory auditing and contrastive evaluation. It achieves 73.9% average task reward across 89 tasks, outperforming baseline methods.
Cached at: 06/15/26, 09:11 AM
# Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing
Source: [https://arxiv.org/html/2606.14239](https://arxiv.org/html/2606.14239)
Haowen Gao¹˒², Haoran Chen³, Can Wang³, Shasha Guo¹, Liang Pang¹, Zhaoyang Liu³, Huawei Shen¹, Xueqi Cheng¹
¹State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, Beijing, China; ²University of Chinese Academy of Sciences, Beijing, China; ³Tongyi Lab, Alibaba Group, Beijing, China
gaohaowen23s@ict.ac.cn, congling.chr@alibaba-inc.com, xiaocan.wc@alibaba-inc.com, guoshasha@ict.ac.cn, pangliang@ict.ac.cn, jingmu.lzy@alibaba-inc.com
###### Abstract
Agent skills are structured procedural instruction packages that guide frozen large language model agents in specialized professional workflows. However, skills rarely remain sufficient after deployment: new edge cases, changes in tools and APIs, and deployment constraints often become visible only through use. This makes skill evolution a practical necessity. Existing methods typically depend on privileged feedback such as held-out validation scores, hidden test outcomes, environment rewards, or expert reference responses. Such signals are often unavailable when a practitioner has only a task description and workspace data. This raises a central challenge: how can agent skills be improved without access to external ground-truth feedback during optimization? We introduce SkillAudit, a framework for evolving agent skills without ground-truth feedback. The key idea is paired trajectory auditing: at each iteration, the same task is executed with and without the candidate skill, allowing the system to isolate how the skill changes agent behavior without external labels. To turn these behavioral differences into edit guidance, SkillAudit uses Process-Aligned Contrastive Evaluation (PACE), a cluster of evaluators that maps trajectory divergences to diagnostic signals linked to specific passages in the skill document. A structural verifier, compiled once from the task specification and then fixed, provides a stable check on task constraints and rolls back updates that harm execution. SkillAudit further routes edits through two complementary pipelines: Refine removes noisy or irrelevant guidance from broadly useful skills, while Repair replaces skill passages that conflict with the task. Across 89 containerized tasks spanning 8 professional domains, SkillAudit achieves a 73.9% average task reward, outperforming both an agent without skills (40.9%) and the static expert skill included in the benchmark (56.7%). These gains are obtained without accessing hidden tests, reference solutions, or external scoring functions during evolution.
## 1 Introduction
Large language model (LLM) agents are increasingly used for long-horizon professional tasks, including software engineering, scientific analysis, and enterprise data pipelines (Yao et al., [2023](https://arxiv.org/html/2606.14239#bib.bib26); Hong et al., [2024](https://arxiv.org/html/2606.14239#bib.bib28); Jimenez et al., [2024](https://arxiv.org/html/2606.14239#bib.bib22)). These tasks require procedural reliability: agents must invoke tools in the right order, satisfy strict output constraints, and recover from domain-specific edge cases. To provide such procedural knowledge without updating model parameters, *agent skills*, structured multi-file instruction packages that combine natural-language guidance with optional supporting artifacts, have emerged as a practical interface for frozen models (Anthropic, [2025](https://arxiv.org/html/2606.14239#bib.bib31); Wang et al., [2024a](https://arxiv.org/html/2606.14239#bib.bib24); Zhao et al., [2024](https://arxiv.org/html/2606.14239#bib.bib27); Shinn et al., [2023](https://arxiv.org/html/2606.14239#bib.bib25); Xu and Yan, [2026](https://arxiv.org/html/2606.14239#bib.bib33)). Recent large-scale evaluations confirm their value: curated skills substantially improve task completion across diverse professional agent benchmarks (Li et al., [2026b](https://arxiv.org/html/2606.14239#bib.bib21)).
However, a useful skill rarely remains sufficient after deployment\. As practitioners reuse a skill, new edge cases appear, tools and APIs change, data formats shift, and deployment\-specific constraints become visible only through use\. A skill that was once helpful may therefore become incomplete, misaligned with the task, or even actively misleading\. The challenge is not merely to author a strong skill once, but to enable skills to evolve into more reliable procedural knowledge through continued interaction with the tasks they are meant to support\.
Recent work has begun to study skill evolution by iteratively refining skill documents through execution feedback (Zhang et al., [2026b](https://arxiv.org/html/2606.14239#bib.bib1); [a](https://arxiv.org/html/2606.14239#bib.bib2); Yang et al., [2026a](https://arxiv.org/html/2606.14239#bib.bib5); Alzubi et al., [2026](https://arxiv.org/html/2606.14239#bib.bib6); Liu et al., [2026](https://arxiv.org/html/2606.14239#bib.bib7); Ma et al., [2026](https://arxiv.org/html/2606.14239#bib.bib4)). Yet existing methods typically rely on feedback unavailable in the deployment settings of interest. As illustrated in Figure [1](https://arxiv.org/html/2606.14239#S1.F1), these methods fall into two broad paradigms. *Oracle-Gated Evolution* methods (e.g., SkillOpt (Yang et al., [2026a](https://arxiv.org/html/2606.14239#bib.bib5)) and CoEvoSkills (Zhang et al., [2026b](https://arxiv.org/html/2606.14239#bib.bib1))) accept or reject skill updates using external validation signals such as held-out scores, hidden test outcomes, or oracle pass/fail feedback. *Failure-Signal Driven* methods (e.g., SkillForge (Liu et al., [2026](https://arxiv.org/html/2606.14239#bib.bib7)) and SkillClaw (Ma et al., [2026](https://arxiv.org/html/2606.14239#bib.bib4))) instead use richer external supervision such as enterprise knowledge bases, historical support tickets, cross-user interaction logs, or task-outcome rewards. In many realistic settings, however, a practitioner has only a task description and workspace data, not hidden tests, reference solutions, deployment logs, or ground-truth scoring functions. This leaves open a practical question: how can skills be improved when external ground-truth feedback is unavailable during evolution?
We address this question through *paired trajectory auditing*. The key idea is to execute the same task twice: once with the candidate skill and once without it. The resulting trajectory pair isolates how the skill changes agent behavior, providing a self-contained signal about where the skill helps, where it is ignored, and where it misleads the agent. Raw trajectory differences are evidence but not ready-made diagnosis: they neither identify which passages in the skill caused a behavioral change nor supply a stable criterion for accepting or rejecting edits across iterations. We therefore combine two complementary components. First, PACE (Process-Aligned Contrastive Evaluation) maps trajectory divergences to localized diagnostic signals anchored to specific passages in the skill document. Second, a structural verifier is compiled once from the task specification and then fixed throughout evolution; it encodes task constraints derivable from the task description and workspace alone, guarding against evaluator drift and execution regressions.
Figure 1: Three skill evolution paradigms. *Oracle-Gated* (left) and *Failure-Signal Driven* (center) require external ground-truth signals. SkillAudit (right) requires only $T$, $W$, and $S_0$: paired execution produces $\tau_w$ and $\tau_{wo}$, which PACE and the Anchor Verifier evaluate internally to yield a verdict of *helped*, *hurt*, or *inert*, with no ground-truth signal accessed.

Based on this design, we introduce SkillAudit, a framework for skill evolution without ground-truth feedback during optimization. By this we mean the evolution loop never accesses hidden tests, reference solutions, task rewards, oracle pass/fail feedback, or human-authored validation scripts; it uses only the task description, workspace data, candidate skills, execution trajectories, generated artifacts, and constraints derivable from the task specification. At each iteration, SkillAudit executes the task with and without the current skill, aggregates the resulting PACE diagnostics and structural checks, and decides whether to commit, defer, or roll back an update. Updates that harm execution are vetoed unconditionally. To handle different forms of skill–task mismatch, SkillAudit routes edits through two pipelines: Refine removes noisy or irrelevant guidance from broadly useful skills, while Repair replaces passages whose guidance conflicts with the task.
Our contributions are:
- Ground-truth-free skill evolution. We formulate skill evolution under a realistic deployment setting in which hidden tests, reference solutions, task rewards, and oracle feedback are unavailable during optimization, formalized as the ground-truth-free constraint $\mathcal{C}_{\mathrm{gtf}}$. We introduce SkillAudit, a framework that improves agent skills without relying on external ground-truth signals.
- Paired trajectory auditing. We propose paired trajectory auditing as the core mechanism for deriving optimization signals. By executing the same task with and without a candidate skill, SkillAudit isolates how the skill changes agent behavior and converts these differences into self-contained, label-free evidence for skill editing.
- Process-aligned diagnosis and guarded editing. We develop a dual-axis evaluation architecture that combines a fixed structural verifier, compiled from the task specification, with PACE, a process-aligned contrastive evaluator cluster. PACE produces segment-anchored diagnostic signals across four dimensions: Process Adherence, Artifact Evidence, Consistency, and Effectiveness Delta. Based on these diagnostics, edits are routed through complementary Refine and Repair pipelines, while verifier-based checks veto updates that violate task constraints or degrade execution.
- Empirical gains and boundary analysis. We evaluate SkillAudit on 89 containerized tasks spanning 8 professional domains. SkillAudit achieves 73.9% average task reward, outperforming both the baseline agent without skills (40.9%) and the static expert skill included in the benchmark (56.7%) by +33.0 and +17.2 percentage points, respectively. Further analysis identifies an observability boundary that helps explain when ground-truth-free skill evolution succeeds and when it fails.
## 2 Problem Formulation
We consider a deployment setting in which a practitioner faces a professional task with three externally available inputs: a natural-language task description $T$, workspace data $W$, and an initial skill $S_0$. The task description specifies the objective, deliverables, and constraints. The workspace contains the files, data directories, and configuration on which the task operates. The initial skill $S_0$ is a structured procedural document, optionally accompanied by helper scripts, authored by human practitioners or retrieved from an existing skill library as the closest available match to the task.
The goal is to produce an evolved skill $S^{*}$ that improves the agent's task performance when injected into the agent context at inference time. Skills are external artifacts rather than model parameters: the agent's weights remain frozen throughout evolution, and an evolved skill can be reused by other models without retraining.
In this setting, the ground-truth reward is not observable during evolution. The practitioner does not have access to hidden test scripts, reference solutions, held-out validation sets, scoring functions, environment rewards, or oracle pass/fail feedback. We therefore formulate skill evolution as a constrained optimization problem. Let $\tau \sim \pi(\cdot \mid S, T, W)$ denote an execution trajectory produced by the frozen agent $\pi$ when the skill $S$ is injected into its context, and let $\mathcal{R}(\tau) \in [0,1]$ be the (latent, unobserved) terminal task reward. The objective is
$$
S^{*} = \arg\max_{S}\; \mathbb{E}_{\tau \sim \pi(\cdot \mid S, T, W)}\!\left[\, \mathcal{R}(\tau) \,\right], \qquad \text{subject to the ground-truth-free constraint } \mathcal{C}_{\mathrm{gtf}}. \tag{1}
$$

The constraint $\mathcal{C}_{\mathrm{gtf}}$ permits the evolution procedure to use only $T$, $W$, $S_0$, candidate skills, and the observable execution traces and artifacts produced during interaction with the workspace; it rules out any access to $\mathcal{R}$ itself or to any of its usual proxies (hidden tests, reference solutions, held-out validation scores, environment rewards, or oracle pass/fail signals) at every point during optimization.
Because $\mathcal{R}(\tau)$ is never observed under $\mathcal{C}_{\mathrm{gtf}}$, we estimate the *direction* of skill change from observable differences between paired executions with and without the candidate skill (§[3.2](https://arxiv.org/html/2606.14239#S3.SS2)). The paired trajectories are the primary evidence: PACE extracts segment-anchored diagnostic signals from their behavioral divergences, and these signals directly drive the content of skill edits. A three-way verdict (*skill_helped*, *skill_hurt*, or *skill_inert*) acts as the decision gate determining whether to commit, roll back, or defer each update, while the actual modifications are grounded in the full trajectory evidence rather than in the verdict alone. The next section instantiates this formulation as a concrete evolution loop, detailing how the paired trajectories, the PACE evaluators, and the structural verifier interact to drive ground-truth-free edits.
## 3 Method
### 3.1 Overview
The constrained optimization in Eq. [1](https://arxiv.org/html/2606.14239#S2.E1) poses a fundamental challenge: how can a skill be improved when the very signals used by all prior evolution methods are explicitly prohibited? We address this by designing an evolution loop that derives its update signal entirely from the task itself, without any external infrastructure. The central mechanism is paired trajectory auditing: by executing the task with and without the candidate skill, the system directly observes the skill's effect on agent behavior and uses the resulting trajectory evidence as the basis for editing, with the three-way verdict gating each commit or rollback. Figure [1](https://arxiv.org/html/2606.14239#S1.F1) illustrates the resulting system, which consists of four cooperating components described in the sections that follow.
Given $T$, $W$, and an initial skill $S_0$, the system begins with a one-time setup phase. A task interpreter first analyzes $T$ and $S_0$ in depth, examining the task's requirements, data schema, and workflow structure alongside the initial skill's coverage and potential conflicts. The resulting structured task specification drives the two remaining setup steps: compiling an Anchor Verifier that encodes objective constraints derivable from the task description alone and is locked for the remainder of evolution, and running a compatibility pre-assessment that routes the task to either a Refine or a Repair pipeline based on the nature of the detected misalignment between $S_0$ and $T$ (§[3.4](https://arxiv.org/html/2606.14239#S3.SS4)).
The two pipelines share the same evaluation infrastructure but apply different constraint gates to how the skill may be modified. Figure [2](https://arxiv.org/html/2606.14239#S3.F2) details the anatomy of a single iteration. Each iteration executes the task in parallel under with-skill and without-skill conditions, producing a trajectory pair $(\tau_w, \tau_{wo})$. PACE decomposes the trajectory differences into diagnostic signals anchored to specific passages in the skill document; the Anchor Verifier independently enforces hard structural constraints. Their combined verdict determines the loop's next action: a *skill_helped* verdict commits the proposed update; *skill_hurt* triggers an immediate rollback; *skill_inert* defers to the next iteration. The loop terminates when the skill reaches a stable state, defined by three jointly satisfied criteria: no *skill_hurt* in the two most recent iterations, the Anchor Verifier passing, and no actionable surgery targets remaining. The loop runs for at most five iterations. The complete procedure is formalized in Algorithm [1](https://arxiv.org/html/2606.14239#alg1) (Appendix [A](https://arxiv.org/html/2606.14239#A1)).
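The verdict-gated control flow described above can be sketched roughly as follows. This is a minimal illustration, not the paper's Algorithm 1: the names `AuditResult`, `is_stable`, `evolve`, `audit`, and `propose_edit` are hypothetical, rollback is assumed to mean reverting to the last committed version, and the stability check is assumed to need at least two recorded verdicts.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_ITERATIONS = 5  # iteration cap stated in the text


@dataclass
class AuditResult:
    verdict: str          # "skill_helped" | "skill_hurt" | "skill_inert"
    anchor_passed: bool   # outcome of the fixed Anchor Verifier
    surgery_targets: List[str] = field(default_factory=list)


def is_stable(verdicts: List[str], result: AuditResult) -> bool:
    """Three jointly required stop criteria from the text."""
    # Assumption: at least two iterations must have been observed.
    no_recent_hurt = len(verdicts) >= 2 and "skill_hurt" not in verdicts[-2:]
    return no_recent_hurt and result.anchor_passed and not result.surgery_targets


def evolve(initial_skill: str,
           audit: Callable[[str], AuditResult],
           propose_edit: Callable[[str, AuditResult], str]) -> str:
    committed = initial_skill  # last version that was not rolled back
    candidate = initial_skill  # version audited in the current iteration
    verdicts: List[str] = []
    for _ in range(MAX_ITERATIONS):
        result = audit(candidate)  # stands in for paired runs + PACE + verifier
        verdicts.append(result.verdict)
        if result.verdict == "skill_hurt" or not result.anchor_passed:
            candidate = committed  # rollback (assumed: revert to last commit)
            continue
        if result.verdict == "skill_helped":
            committed = candidate  # commit
        if is_stable(verdicts, result):
            break
        if result.verdict == "skill_helped" and result.surgery_targets:
            candidate = propose_edit(committed, result)
        # skill_inert: defer and re-audit the same candidate next iteration
    return committed
```

Here `audit` hides the paired execution and evaluation, so the sketch only shows how the three verdicts and the stop criteria could interact.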
### 3.2 Paired Trajectory Auditing
Figure 2: One SkillAudit iteration. *Setup* (left): task interpreter, Anchor Verifier generation, and pre-assessment routing. *Paired Trajectory Auditing* (center): parallel with-skill ($\tau_w$) and without-skill ($\tau_{wo}$) execution, PACE across four dimensions, and Anchor Verifier check. *Decision & Edit* (right): verdict-driven commit, defer, or rollback, followed by Refine or Repair edits to produce $S^{*}$.

A single execution trajectory mixes two sources of variation: the inherent difficulty of the task and the effect of the skill. Paired execution separates them by running the same task twice under identical conditions: once with the candidate skill $S$ injected into the agent context, producing trajectory $\tau_w$, and once without any skill, producing trajectory $\tau_{wo}$. The behavioral differences between $\tau_w$ and $\tau_{wo}$ expose the skill's effect against the same baseline task difficulty. This contrast is informative whenever the without-skill run $\tau_{wo}$ executes the task coherently enough to serve as a reference; when it does not, PACE marks the comparison uninformative and the iteration falls back to single-trajectory evaluation on $\tau_w$ (Appendix [E](https://arxiv.org/html/2606.14239#A5)), a failure mode we analyze in §[4.3](https://arxiv.org/html/2606.14239#S4.SS3).
Raw trajectory pairs are evidence, not diagnosis. To convert them into actionable modification signals, we introduce PACE, a process-reward evaluator cluster (Choudhury, [2025](https://arxiv.org/html/2606.14239#bib.bib20)) that assesses whether each trajectory step is trustworthy rather than only whether the final answer is correct. PACE is built from twelve templates spanning four dimensions: Process Adherence, Artifact Evidence, Consistency, and Effectiveness Delta. Table [1](https://arxiv.org/html/2606.14239#S3.T1) summarizes the mapping. Each evaluator reads both trajectories, compares agent behavior at divergence points, and outputs a structured judgment: a result (Pass/Fail/Warning), a set of `action_signals`, each anchored to a specific passage in the skill document via a verbatim `segment_quote`, and a set of `protected_hints` marking passages whose removal would cause regression. This segment-level anchoring is the essential difference from pass/fail feedback: rather than telling the evolution loop “the skill is bad,” PACE tells it “with this paragraph present, the agent took this wrong action at this step, whereas the without-skill run did not.”
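As a rough illustration of the structured judgment described above, the following sketch models one evaluator output as plain data. Only `action_signals`, `segment_quote`, `protected_hints`, and the Pass/Fail/Warning result come from the text; the class names, the `diagnosis` field, and the `unanchored_signals` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionSignal:
    segment_quote: str  # verbatim passage from the skill document
    diagnosis: str      # hypothetical field: what went wrong at the divergence


@dataclass
class Judgment:
    evaluator: str      # evaluator name, e.g. "eval-output-evidence-check"
    dimension: str      # one of the four PACE dimensions
    result: str         # "Pass" | "Fail" | "Warning"
    action_signals: List[ActionSignal] = field(default_factory=list)
    protected_hints: List[str] = field(default_factory=list)


def unanchored_signals(judgment: Judgment, skill_text: str) -> List[ActionSignal]:
    """Return signals whose quote does not occur verbatim in the skill text.

    Assumed sanity check: a signal that cannot be located in the skill
    document cannot guide a segment-level edit.
    """
    return [s for s in judgment.action_signals
            if s.segment_quote not in skill_text]
```

A check like `unanchored_signals` is one plausible way to enforce the verbatim anchoring requirement before any signal reaches the editing stage.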
Table 1: PACE evaluation dimensions, with one representative evaluator per dimension. Each evaluator compares $\tau_w$ and $\tau_{wo}$ along one dimension and outputs segment-anchored diagnostic signals. The complete inventory of all twelve evaluators is given in Appendix [C](https://arxiv.org/html/2606.14239#A3) (Table [6](https://arxiv.org/html/2606.14239#A3.T6)).
### 3.3 Evaluation: PACE and the Anchor Verifier
The twelve PACE evaluators produce soft, segment\-anchored signals, but a skill update needs a single accept/reject decision\. We combine the signals by priority\. First, a single*skill\_hurt*report from any evaluator vetoes the update: the overall verdict is*skill\_hurt*and the change is rolled back, regardless of every other signal\. Absent any hurt signal, the Artifact Evidence evaluator \(eval\-output\-evidence\-check\) is decisive, since it inspects actual file\-system outputs rather than reasoning traces; any remaining disagreement is settled by majority vote into*skill\_helped*or*skill\_inert*\. This asymmetry is deliberate: the cost of accepting a harmful update far exceeds the cost of blocking a beneficial one\.
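The priority rule above can be written down as a small function. This is a sketch under stated assumptions: each evaluator is taken to report one of the three verdict labels or `None` when it abstains, the Artifact Evidence evaluator is taken to decide whenever it reports, and a tie in the majority vote is resolved to *skill_inert* as the conservative choice. The function name `aggregate` and the input shape are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional

ARTIFACT_EVALUATOR = "eval-output-evidence-check"  # name given in the text


def aggregate(reports: Dict[str, Optional[str]]) -> str:
    """Combine per-evaluator verdicts into one three-way verdict."""
    votes = [v for v in reports.values() if v is not None]
    # 1. A single hurt report vetoes the update.
    if "skill_hurt" in votes:
        return "skill_hurt"
    # 2. Otherwise the Artifact Evidence evaluator decides, if it reported.
    decisive = reports.get(ARTIFACT_EVALUATOR)
    if decisive is not None:
        return decisive
    # 3. Remaining disagreement: majority vote (assumed: ties go to inert).
    counts = Counter(votes)
    if counts["skill_helped"] > counts["skill_inert"]:
        return "skill_helped"
    return "skill_inert"
```

Putting the veto first means that no number of favorable reports can outvote one *skill_hurt*, which mirrors the asymmetry the text argues for.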
The aggregated `action_signals` are collected into a `surgery_targets` list specifying which skill passages to modify and how; this list, together with the full trajectory evidence from $(\tau_w, \tau_{wo})$, constitutes the editing brief passed to the Skill Iterator. The aggregated `protected_hints` form a `protected_segments` list of passages the Skill Iterator must not touch. The three-way verdict acts solely as a commit/rollback gate: *skill_helped* authorizes the Skill Iterator to proceed, *skill_hurt* cancels the update regardless of the surgery targets, and *skill_inert* defers to the next iteration. The actual content of every edit is determined by the segment-anchored trajectory evidence, not by the verdict category alone.
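A minimal sketch of how the two lists might be assembled from the evaluator outputs is given below. The dictionary layout, the function name `build_editing_brief`, and the conflict rule (a passage that is both targeted and protected stays protected) are assumptions; the text only states that protected segments must not be touched.

```python
from typing import Dict, List, Tuple


def build_editing_brief(judgments: List[Dict]) -> Tuple[List[Dict], List[str]]:
    """Collect surgery_targets and protected_segments across evaluators."""
    # Gather protected passages first, without duplicates, preserving order.
    protected: List[str] = []
    for judgment in judgments:
        for quote in judgment.get("protected_hints", []):
            if quote not in protected:
                protected.append(quote)

    # Keep one target per quoted passage, skipping protected passages
    # (assumed conflict rule: protection wins over a surgery request).
    targets: List[Dict] = []
    seen = set()
    for judgment in judgments:
        for signal in judgment.get("action_signals", []):
            quote = signal["segment_quote"]
            if quote in protected or quote in seen:
                continue
            seen.add(quote)
            targets.append(signal)
    return targets, protected
```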
PACE signals are soft and, as LLM-generated judgments, prone to drift: the same trajectory quality may receive a more lenient assessment in later rounds as the evaluator's implicit reference shifts. The Anchor Verifier provides a deterministic, static counterpart. It is compiled once from $T$ by extracting only constraints checkable without any ground truth (file existence, format compliance, values recomputable from workspace data, and required companion files), and then locked for the remainder of evolution. Its coverage is intentionally narrow so that false `FAIL`s (which trigger forced rollbacks) remain unlikely.
The Anchor Verifier serves two roles. As an evolution target, it encodes hard structural requirements the skill must satisfy. As a drift guard, it ensures that even if all PACE dimensions report improvement, a regression on the Anchor Verifier forces an automatic rollback. PACE's soft signals drive the direction of evolution; the Anchor Verifier's hard constraints define its boundaries. Because it is emitted as a deterministic check script and never regenerated or re-queried from an LLM during iteration, its verdicts are fully reproducible and immune to the non-determinism that affects PACE evaluators. Generation details are in Appendix [E](https://arxiv.org/html/2606.14239#A5).
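To make the idea of a deterministic check script concrete, here is a small sketch covering three of the constraint kinds listed above: file existence, format compliance, and a value recomputable from workspace data. The file names `report.json` and `data.csv`, the `total` and `amount` fields, and the function name `anchor_verify` are all invented for illustration; a real verifier would be compiled from the task description.

```python
import csv
import json
from pathlib import Path
from typing import List


def anchor_verify(workspace: Path) -> List[str]:
    """Return violated constraints; an empty list means PASS."""
    report = workspace / "report.json"  # hypothetical deliverable
    data = workspace / "data.csv"       # hypothetical workspace input

    # Constraint 1: the deliverable must exist.
    if not report.is_file():
        return ["missing report.json"]

    # Constraint 2: the deliverable must be well-formed JSON with a numeric total.
    try:
        payload = json.loads(report.read_text())
    except json.JSONDecodeError:
        return ["report.json is not valid JSON"]
    if not isinstance(payload, dict) or not isinstance(payload.get("total"), (int, float)):
        return ["report.json lacks a numeric 'total'"]

    # Constraint 3: the total must be recomputable from workspace data.
    with data.open(newline="") as handle:
        expected = sum(float(row["amount"]) for row in csv.DictReader(handle))
    if abs(payload["total"] - expected) > 1e-6:
        return ["total does not match data.csv"]
    return []
```

Nothing in the script consults a reference answer: every check is derived from files the practitioner already has, which is what keeps a verifier of this kind inside the ground-truth-free constraint.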
### 3.4 Dual-Strategy Evolution
Not all skill deficiencies respond to the same intervention\. A skill that is broadly effective but contains noisy or redundant passages needs subtraction: removing distractions so that the effective core can be followed more reliably\. A skill whose core workflow conflicts with the task, references outdated APIs, or actively misleads the agent needs replacement: locating the harmful passages and substituting them with correct guidance\. Applying a uniform strategy to both cases leads to two systematic failure modes: aggressive modification of an effective skill destroys content that was already working, while conservative patching of a harmful skill never reaches the root cause\.
As part of its initial analysis (§[3.1](https://arxiv.org/html/2606.14239#S3.SS1)), the task interpreter assesses the compatibility between $S_0$ and $T$ and routes the task to one of two pipelines. The decision turns on a single question: does the skill's core workflow actively *conflict* with the task (prescribing steps, interfaces, or outputs that contradict what the task requires), or is the skill broadly on-target but *imprecise*, carrying noise, redundancy, or minor gaps around a sound core? A genuine conflict routes the task to the Repair pipeline, which may replace or delete the offending content; otherwise the task enters the conservative Refine pipeline. Both pipelines share the full paired auditing infrastructure (PACE, Anchor Verifier, git-based version control) and differ only in the constraint gates they impose on the Skill Iterator. Table [2](https://arxiv.org/html/2606.14239#S3.T2) summarizes the key differences.
Table 2: Constraint gates for the Refine and Repair pipelines. Both share paired trajectory auditing and the Anchor Verifier; they differ in modification scope and strategy.

In both pipelines, the Skill Iterator receives the `surgery_targets` and `protected_segments` produced by PACE together with the paired trajectory evidence $(\tau_w, \tau_{wo})$, and executes targeted edits: each modification must be anchored to a specific surgery target and grounded in the behavioral divergence evidence, while protected segments cannot be altered. The Refine pipeline's Skill Iterator operates under a subtraction-first mandate and halts entirely if any edit causes a *skill_hurt* verdict, preserving the effective baseline. The Repair pipeline's Skill Iterator has access to a swap protocol that can replace a harmful passage with a verified alternative extracted from the without-skill trajectory $\tau_{wo}$, and may fill diagnosed knowledge gaps after the harmful content has been removed. Operational details of the Skill Iterator, including the priority ordering of edit operations and the regression protection mechanism, are provided in Appendix [E](https://arxiv.org/html/2606.14239#A5).
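One way to picture the constraint gates is as a filter over proposed edit operations. The sketch below is an assumption-heavy simplification: the operation names, the `Edit` class, and the `gate_edits` function are hypothetical, and Refine's subtraction-first mandate is reduced to deletion only, which is stricter than the text necessarily implies.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Assumed operation sets; the actual gates are those listed in Table 2.
ALLOWED_OPS: Dict[str, Set[str]] = {
    "refine": {"delete"},                    # simplification of subtraction-first
    "repair": {"delete", "replace", "add"},  # swap protocol and gap filling
}


@dataclass
class Edit:
    op: str             # "delete" | "replace" | "add"
    segment_quote: str  # passage the edit is anchored to
    new_text: str = ""


def gate_edits(pipeline: str, edits: List[Edit],
               surgery_targets: Set[str], protected: Set[str]) -> List[Edit]:
    """Keep only edits the given pipeline is allowed to apply."""
    allowed = ALLOWED_OPS[pipeline]
    return [
        e for e in edits
        if e.op in allowed                      # pipeline-specific scope
        and e.segment_quote in surgery_targets  # must be anchored to a target
        and e.segment_quote not in protected    # protected segments are untouchable
    ]
```

Under these assumptions the same proposed edit list yields a narrower result in Refine than in Repair, while the anchoring and protection rules apply to both.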
## 4 Experiments
### 4.1 Setup
#### Benchmark\.
We evaluate on SkillsBench (Li et al., [2026b](https://arxiv.org/html/2606.14239#bib.bib21)), using its latest release combined with the v1.1 expansion, retaining all tasks that execute without environment errors and yielding a working set of 89 runnable tasks. SkillsBench's 11 fine-grained domains are consolidated into 8 by merging small domains: *Healthcare* into *Natural Science*; *Energy*, *Manufacturing*, and *Robotics* into *Industrial & Physical Systems*; and *Mathematics* expanded to *Mathematics & OR*. Each task runs inside a Harbor container (Merrill et al., [2026](https://arxiv.org/html/2606.14239#bib.bib23)); the verifier executes after the agent finishes and returns a reward in $[0,1]$. Our primary metric is the *average task reward*: the mean of the per-task rewards over all 89 tasks; tasks that time out receive a reward of 0. Evolution and evaluation are strictly separated: the evolution loop runs in a stub container with no access to the pytest verifier or any test content; the real verifier executes only after the evolution loop terminates, ensuring our ground-truth-free constraint holds end-to-end.
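The metric is simple enough to state in a few lines. In this sketch, the function name `average_task_reward` and the convention of marking a timed-out task with `None` are assumptions; the zero-for-timeout rule and the mean over all tasks follow the description above.

```python
from typing import Dict, Optional


def average_task_reward(rewards: Dict[str, Optional[float]]) -> float:
    """Mean per-task reward over all tasks; None (timeout) counts as 0."""
    if not rewards:
        raise ValueError("no tasks to average over")
    total = sum(0.0 if r is None else r for r in rewards.values())
    # Divide by the full task count, so timeouts lower the average.
    return total / len(rewards)
```

For example, three tasks with rewards 1.0, a timeout, and 0.5 average to 0.5, because the timed-out task stays in the denominator.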
#### Baselines\.
We compare against two reference points. *No skill*: the agent executes the task with no skill injected. *With skill*: the expert-authored skills shipped with SkillsBench; these represent the best static skill a practitioner can produce without iterative refinement.
#### Models\.
The evolution loop uses Claude Code with Claude Opus 4\.8 as the execution model for all agent runs, PACE evaluations, and Skill Iterator calls\. Post\-evolution evaluation also uses Claude Opus 4\.8\.
### 4.2 Main Results
Table 3: Average task reward (%) on 89 SkillsBench tasks across 8 domains, all evaluated with Claude Opus 4.8. Each cell averages per-task rewards in $[0,1]$; missing or timed-out tasks count as 0. With skill is the static benchmark-shipped skill; SkillAudit is the evolved output. The Average row is the mean over all 89 tasks (a single micro-average, not the mean of the eight domain rows). The final column reports SkillAudit's gain over the static with-skill baseline; SkillAudit improves on it in seven of eight domains (matching it in the last) and by +17.2 pp overall. Bold marks the best result per row.

Table [3](https://arxiv.org/html/2606.14239#S4.T3) presents the main results. SkillAudit achieves 73.9% average task reward on 89 tasks, exceeding the no-skill baseline (40.9%) by +33.0 pp and the with-skill baseline (56.7%) by +17.2 pp. These gains are obtained without accessing any test script, reference solution, or oracle signal at any point during evolution. Evolution outperforms the with-skill baseline in seven of eight domains; the only exception is Finance & Economics, where evolution matches but does not exceed the static skill. The two largest gains are Software Engineering (+38.5 pp) and Office & White Collar (+26.7 pp), both domains where the with-skill baseline performs at or below the no-skill baseline; the static skill is neutral or actively harmful there, and the skill-hurt veto described in §[3.2](https://arxiv.org/html/2606.14239#S3.SS2) recovers a substantial margin over both baselines. Mathematics & OR shows a notable gain of +21.9 pp despite carrying the weakest with-skill baseline across all domains (35.6%). Cybersecurity and Industrial & Physical Systems see more modest improvements (+2.9 and +10.2 pp respectively).
### 4.3 Success Patterns and Limits of Skill Evolution
#### Evolution protects strong skills and partially recovers weak ones\.
Splitting the 89 tasks by whether their *initial* (with-skill) reward reaches 0.5 yields a high-quality group ($\geq 0.5$, $n=59$) and a low-quality group ($<0.5$, $n=30$); the split is on the initial skill's reward, not on the evolved outcome. Among the 59 tasks whose initial skill already worked, evolution preserves the result on 54 (92%), each retaining or improving its reward, consistent with the Refine pipeline's non-degradation mandate. The low-quality group presents the greater challenge: evolution lifts 13 of 30 (43%) to a passing state, while the rest stay at reward 0. The system thus reliably *protects* working skills but *recovers* only about 43% of failing ones, precisely the cases that demand *adding* domain knowledge rather than removing noise. A striking special case is skill-hurt recovery: three tasks ship a skill that is strictly worse than no skill at all (`court-form-filling`, `flink-query`, `jax-computing-basics`, each with-skill reward 0.0 vs. no-skill reward 1.0), and the skill-hurt veto restores all three to reward 1.0 by detecting that $\tau_w$ underperforms $\tau_{wo}$ and removing the harmful guidance.
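The grouping used in this analysis can be reproduced with a short helper. This is a sketch on toy data, not the paper's evaluation code: the function name `summarize_groups` is hypothetical, "preserved" is read as the evolved reward being at least the initial reward, and "passing" is assumed to mean reaching the same 0.5 threshold used for the split.

```python
from typing import Dict


def summarize_groups(initial: Dict[str, float],
                     evolved: Dict[str, float],
                     threshold: float = 0.5) -> Dict[str, int]:
    """Split tasks by initial with-skill reward and count outcomes."""
    high = [t for t, r in initial.items() if r >= threshold]
    low = [t for t, r in initial.items() if r < threshold]
    return {
        "high": len(high),
        "low": len(low),
        # Preserved: evolved reward retained or improved (high group).
        "preserved": sum(evolved[t] >= initial[t] for t in high),
        # Recovered: low-group task lifted to the threshold (assumed meaning of "passing").
        "recovered": sum(evolved[t] >= threshold for t in low),
    }


# The reported rates follow directly from the reported counts.
preserved_pct = round(100 * 54 / 59)  # 92
recovered_pct = round(100 * 13 / 30)  # 43
```

Splitting on the initial reward rather than the evolved one keeps the two groups fixed before evolution runs, so the preserved and recovered counts are not affected by the outcome they measure.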
#### What evolves well is governed by observability, not by domain\.
Examining success rates across task and knowledge labels, we find the decisive factor is not the task's *domain* but its *verifiability structure*: whether correctness leaves a trace that some observable signal (the Anchor Verifier, a produced artifact, a runtime result) can read. We tag every task with multi-label task-type and knowledge-type labels; per-label counts are over the tasks bearing each label and do not partition the 89 tasks. The cleanest cut is by *knowledge type*: what kind of knowledge the skill encodes. Skills built on *executable, observable* knowledge evolve well: library-API usage reaches 79.2% and mathematical methods 80.7% average reward, a combined 79.9% over the 61 tasks whose skills rest on either, because deleting a load-bearing API call or formula changes an observable result (a stack trace, a wrong number) that the auditor can read. By contrast, skills encoding *domain procedure* (69.2% average reward, $n=58$), knowledge that prescribes *what to do* in a way that leaves no structural trace when removed, evolve markedly worse, and they dominate the failure set: of the tasks evolution leaves at reward 0, 77% carry a domain-procedure label, far above their 65% base rate in the full benchmark. The same task can be easy for a capable agent yet unevolvable for us, because the bottleneck is not task difficulty but whether the *knowledge* the skill carries is observable to the auditor.
The task\-type cut supports the same conclusion\. Structurally checkable work evolves well: formatting \(100%\), generation \(100%\), transformation \(88\.9%\), and optimization \(80\.0%\) all reach at least 80% average reward, because the auditor can confirm whether a deleted passage is load\-bearing by observing compilation, a produced file, or a numerical result\. Semantic and multi\-step judgment tasks evolve worst: search \(25\.0%\), planning \(57\.5%\), and repair \(45\.0%\) average reward\. We highlight the knowledge\-type cut as primary because task type is a noisier proxy: some apparently checkable types still host failures \(e\.g\. several calculation and analysis tasks fail not because the arithmetic is hard but because the skill they need is a piece of*domain procedure*the auditor cannot validate\), whereas the knowledge\-type boundary tracks the failures directly\.
*Paired auditing yields a real gradient only when task correctness leaves an observable structural trace, or when the without-skill run produces a reusable correct fragment; when neither holds, evolution optimizes a proxy that is blind to correctness.* The same boundary explains both our gains and our failures. The clearest illustration is the CVE-repair task fix-druid-loophole-cve: the auditor can confirm that a patch file exists and that Maven compiles, but not whether the patch blocks the exploit at runtime, so the skill is pruned to a structurally valid form that passes these checks while losing the codebase-specific knowledge of *where* to apply the fix. The system sometimes perceives its own blind spot: on financial-modeling-qa the evolved index records a gap note “Current Gaps (Not Covered): Dice game scoring algorithms.” The system knows the knowledge is missing but cannot synthesize it from a signal that only confirms “answer.txt exists and contains a number,” not whether the number is right. The practical implication for skill authors and skill-learning systems is the same: *a skill is only as evolvable as its weakest claim is observable*. Domain procedure that no test can witness must be supplied by a human or a stronger oracle, because ground-truth-free optimization will neither protect nor repair what it cannot see.
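To make the blind spot concrete, the sketch below implements a purely structural check of the kind quoted above (“answer.txt exists and contains a number”). It is an illustration only: the function names `structural_check` and `run_with_answer` and the example values are hypothetical and are not taken from the SkillAudit code. The point is that a right answer and a wrong but well-formed answer are indistinguishable to such a check.

```python
import tempfile
from pathlib import Path


def structural_check(workspace: Path) -> bool:
    """Pass if answer.txt exists and parses as a single number.

    Hypothetical stand-in for a structural constraint; it never looks at
    whether the number is correct.
    """
    answer = workspace / "answer.txt"
    if not answer.is_file():
        return False
    try:
        float(answer.read_text().strip())
    except ValueError:
        return False
    return True


def run_with_answer(text: str) -> bool:
    """Write `text` to answer.txt in a fresh directory and run the check."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        (workspace / "answer.txt").write_text(text)
        return structural_check(workspace)


right = run_with_answer("42.0")     # suppose this is the true answer
wrong = run_with_answer("17.5")     # wrong, but structurally identical
malformed = run_with_answer("n/a")  # the only case the check can reject
```

Both `right` and `wrong` pass, so a signal built on this check alone cannot tell the optimizer which skill edit moved the answer toward correctness.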
### 4.4 Structural Analysis of Evolved Skills
Reading the initial and evolved skill documents for all 89 tasks reveals what ground\-truth\-free evolution actually does to a skill—and, by extension, what distinguishes a good agent skill from a poor one\.
#### Evolution edits cluster into a few repeatable moves\.
Across successfully evolved tasks, the edits are not arbitrary rewrites but a small vocabulary of recurring operations: (i) *prune off-domain skills* bundled with the relevant one: pedestrian-traffic-counting deletes three of four sibling skills (alternative-model menus and cost calculators), keeping only the one the task uses; (ii) *strip tutorial prose while keeping the executable core*: software-dependency-audit removes the “What is CVSS?” severity table, version explainer, and reference URLs but preserves the JSON schema and extraction function the verifier inspects; (iii) *de-hardcode paths, versions, and parameters*, replacing them with “follow the task” plus an explicit anti-hardcoding rule (data-to-d3, threejs-structure-parser); (iv) *inline a constraint next to the step it governs* rather than in a distant notes section (weighted-gdp-calc places “use the exact row range, not the whole sheet” directly above the formula); (v) *add verbatim reminders* for copy-sensitive outputs; (vi) *supply the missing I/O contract*: gravitational-wave-detection adds the input-loading call and the exact output header and row count; and (vii) *fix a one-character path or name typo* that silently breaks execution (simpo-code-reproduction: python_int.txt $\to$ python_info.txt). Every one of these moves shifts text *toward* what the environment can observe and *away* from what only a human reader values; even as documents shrink, the density of imperative, verifier-checkable instructions rises.
#### Evolution invents a navigation layer that no author wrote\.
The single most systematic addition is structural, not textual. None of the 89 initial skill sets contains a top-level index; 80 of the 89 evolved sets do, and most carry a keyword $\to$ skill routing table. For several already-passing tasks (r2r-mpc-control, energy-ac-optimal-power-flow) the *only* change evolution makes is adding this index, and some indices honestly flag their own gaps (“metrics computation is not covered by any skill; implement from the task spec directly”). The system independently rediscovers that a multi-file skill needs a dispatcher, suggesting that navigability, not just content, is a first-class property of a usable skill.
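A keyword-to-skill routing table of the kind described above might look like the following sketch. The skill paths, keywords, and gap list are hypothetical placeholders (the paper does not specify the index format here); the sketch only illustrates the dispatcher idea: map task keywords to the skill file that covers them, and say so explicitly when nothing does.

```python
# Hypothetical routing index: keyword -> skill file. Not the paper's format.
ROUTING_TABLE = {
    "power flow": "skills/ac-opf/SKILL.md",
    "mpc": "skills/mpc-control/SKILL.md",
    "csv export": "skills/io-contract/SKILL.md",
}

# Topics the index openly declares as uncovered (hypothetical example,
# modeled on the gap note quoted in the text).
KNOWN_GAPS = ["metrics computation"]


def route(task_description: str) -> list[str]:
    """Return the skill files whose keywords appear in the task text."""
    text = task_description.lower()
    return sorted({path for kw, path in ROUTING_TABLE.items() if kw in text})


def uncovered(task_description: str) -> list[str]:
    """Return declared gaps that the task mentions, so the agent knows
    to implement them from the task specification instead."""
    text = task_description.lower()
    return [gap for gap in KNOWN_GAPS if gap in text]


task = "Solve the AC optimal power flow case and report metrics computation results."
selected = route(task)
gaps = uncovered(task)
```

Plain substring matching is a deliberate simplification; a real index would need a less brittle matcher, but even this form lets the agent both find a skill and recognize what is absent.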
#### Skills converge to a dense middle, from both directions\.
Line-count change correlates negatively with initial size ($\mathrm{corr}=-0.378$ over 89 tasks). Small skills grow and large ones are pruned: tasks starting below 300 lines gain a median of $+15$ lines (the missing contract), while tasks starting above 1,000 lines lose a mean of $616$ (off-domain bundles and explanation), with the bulk of evolved skills settling into a compact band. Evolution is therefore not “make it shorter”; it is regression toward an information-dense middle: starve the agent and it adds the missing contract, flood it and it deletes the noise. Table [4](https://arxiv.org/html/2606.14239#S4.T4) summarizes the per-pipeline dynamics; the full per-task distribution appears in Appendix [D](https://arxiv.org/html/2606.14239#A4) (Figure [5](https://arxiv.org/html/2606.14239#A4.F5)). The two pipelines behave as designed: Refine carries a heavy left tail of large subtractions (median $+19$ on passing tasks but a long deletion tail down to $-7{,}375$), while Repair makes more targeted swaps with fewer extreme deletions.
Table 4: Skill line-count change ($\Delta=\text{evolved}-\text{initial}$, all .md files) by routing pipeline. Refine’s long deletion tail reflects its subtraction-first mandate; Repair makes more targeted edits. Full distribution in Appendix [D](https://arxiv.org/html/2606.14239#A4).
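The convergence statistic above is a correlation between a skill's initial size and its line-count change $\Delta$. The sketch below shows how such a number could be computed; it assumes a Pearson correlation (the text says only "corr"), and all data points are invented for illustration except the last pair, which reuses the 7,380-line to 5-line flink-query collapse mentioned in the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# (initial_lines, evolved_lines); hypothetical except the final pair.
skills = [(120, 140), (250, 262), (480, 430), (900, 610), (1600, 700), (7380, 5)]

initial = [before for before, _ in skills]
delta = [after - before for before, after in skills]  # Delta = evolved - initial

r = pearson(initial, delta)  # negative: larger skills shrink more
```

A negative `r` says only that large skills lose more lines than small ones; it does not by itself show convergence to a common size, which is why the paper also reports the per-band medians and means.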
#### Two lessons for authoring agent skills\.
Across these edits, a consistent prescriptive picture emerges. *(1) A good skill is a verifier-observable execution contract, not a tutorial.* Every successful edit moves text toward what an environment can check (exact calls, schemas, paths, headers, formulas) and away from definitions, rationale, and best-practice prose; in short, *write skills the way a test would read them*. *(2) More content is often negative value; aim for one task-scoped, navigable, contract-dense unit.* Large generic or persona bundles actively distract a capable agent (flink-query passes after collapsing a 7,380-line bundle to a 5-line note, because the bloat was net-harmful), while under-specified skills mostly need the missing contract added; and when several skills coexist, a routing index lets the agent find the right one and recognize what is absent.
#### The edit vocabulary serves different ends in the two quality groups\.
The two lessons above describe what a good evolved skill looks like; examining the initial–evolved diffs separately for the high-quality and low-quality groups of §[4.3](https://arxiv.org/html/2606.14239#S4.SS3) shows how the same vocabulary achieves those properties through two distinct operating modes. On the protected group (skills that already worked), the dominant mode is restraint. For a substantial portion of these tasks the evolved skill is structurally unchanged: the convergence gate found no edit that improves an already-passing result. Where the system does act on a working skill, it stays conservative, applying safe subtraction (deleting non-load-bearing bloat) or small targeted corrections (fixing a wrong column name or unit, inlining an I/O guardrail) while leaving the validated main procedure intact. No protected task undergoes a destructive rewrite of working logic; when the system acts on a strong skill it trims or corrects at the margin, never replacing content that already functions. On the repaired group (skills in the low-quality group that evolution lifted to passing), the same vocabulary turns constructive. The most common repair supplies the missing execution contract, adding the exact deliverable specification the verifier reads (output schema, filename, header, units, input-parsing recipe) to a skill whose method was sound but whose product was never specified. The remaining repairs split between subtraction that removes off-domain or net-harmful guidance (the same bundle-collapse mechanism illustrated for flink-query above) and correction of a single mechanical defect, such as a wrong output extension or a transposed array axis, that silently routed a correct computation to the wrong place; that one-token fixes recover full credit in such cases shows that a portion of with-skill failures are not reasoning gaps but mechanical mismatches the auditor can localize precisely.
The protected group demonstrates that ground\-truth\-free evolution does not break what works; the repaired group shows it recovers failures by making a skill’s contract observable, not by supplying domain knowledge the auditor cannot verify—the same observability boundary that governs §[4\.3](https://arxiv.org/html/2606.14239#S4.SS3)\.
#### Two limits of observable evidence\.
The same reliance on observable evidence that powers these moves also bounds them in two opposite directions. *Over-pruning* strikes semantic tasks where structural signals poorly predict correctness: with no observable trace to mark a passage as load-bearing, subtraction can delete domain knowledge the verifier can never justify restoring. *Under-pruning* is the opposite failure: among tasks that pass after evolution, ten retain bloated skills above 1,500 lines (up to lean4-proof at 12,036), each within roughly $\pm 60$ lines of its initial size. Because they already pass the Anchor Verifier under both conditions, the convergence gate treats them as “no change needed” and never accumulates the evidence that would authorize compressing the redundant background. Both stem from the same rule that an edit must be justified by an observable change, leaving subtraction over-firing where signals are too weak and under-firing where they are already satisfied.
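The under-pruning pattern is easy to screen for after the fact. The sketch below flags passing tasks whose skill is still large and barely changed, using the two thresholds quoted above. It is a diagnostic sketch under stated assumptions, not part of SkillAudit: the record layout is hypothetical, and every number is invented except lean4-proof's evolved size of 12,036 lines.

```python
BLOAT_LINES = 1500  # "bloated skills above 1,500 lines" (from the text)
STABLE_BAND = 60    # "roughly +/-60 lines of its initial size" (from the text)


def under_pruned(records):
    """Names of passing tasks whose large skill barely changed.

    `records` holds (name, initial_lines, evolved_lines, passes) tuples;
    this layout is hypothetical.
    """
    return [
        name
        for name, initial, evolved, passes in records
        if passes
        and evolved > BLOAT_LINES
        and abs(evolved - initial) <= STABLE_BAND
    ]


# Only lean4-proof's evolved size is reported; all other values are invented.
records = [
    ("lean4-proof", 12_000, 12_036, True),
    ("toy-pruned", 7_380, 5, True),
    ("toy-small", 180, 195, True),
    ("toy-failing", 2_000, 2_010, False),
]
flagged = under_pruned(records)
```

A list like `flagged` identifies candidates for compression, but authorizing the deletion would still require some signal beyond the Anchor Verifier, since these skills already pass under both conditions.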
### 4.5 Case Studies
#### Case A: Refine (software-dependency-audit).
This cybersecurity task requires offline Trivy scanning of a 1,282-package package-lock.json and writing a fixed-schema CSV of HIGH/CRITICAL vulnerabilities. The with-skill baseline was functionally correct but bloated with CVSS tutorial text and a hardcoded cache path (./trivy-cache) that conflicted with the workspace layout. After paired execution, PACE anchored two divergences, both on the Consistency dimension: *eval-task-alignment* flagged the hardcoded cache-path mismatch, and *eval-method-adherence* showed that $\tau_{w}$ used the full offline Trivy flag set while $\tau_{wo}$ omitted critical flags. The Refine pipeline deleted $\sim$35% of skill lines while protecting the CVSS source-priority loop; on iteration 2, $\tau_{w}$ passed the Anchor Verifier while $\tau_{wo}$ failed, yielding a *skill_helped* verdict. Post-evolution evaluation reaches reward $1.0$ on Opus (Appendix [B](https://arxiv.org/html/2606.14239#A2)).
#### Case B: Repair (data-to-d3).
This media task requires a D3.js force-simulation bubble chart with a bidirectionally linked data table and verbatim column labels checked by 13 structural constraints. The initial 188-line skill pinned a specific D3 version, prescribed dist/ output paths, and discouraged live force simulation, all contradicting the task specification. Pre-audit surgery removed the conflicting segments, but iteration 1 triggered a mandatory regression revert when $\tau_{wo}$ passed Anchor checks while $\tau_{w}$ failed on column-name casing. The Repair pipeline then deleted distracting boilerplate that split the agent’s attention, retained verified tooltip and click-handler patterns, and added a two-line verbatim-label reminder (188 $\to$ 50 lines); iteration 5 converged at 13/13 Anchor passes. Benchmark evaluation improves from $0.0$ to $1.0$ on Opus (Appendix [B](https://arxiv.org/html/2606.14239#A2)).
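The column-name casing failure in Case B is the kind of defect an exact-match structural constraint catches. The following sketch shows one way such a verbatim-label check could work; the function names and the example labels are hypothetical, and the actual 13 Anchor constraints are not reproduced here.

```python
def missing_labels(required, produced):
    """Required labels absent from the output under exact, case-sensitive match."""
    produced_set = set(produced)
    return [label for label in required if label not in produced_set]


def casing_only_mismatches(required, produced):
    """Missing labels that would match if letter case were ignored.

    These are the failures a short verbatim-label reminder is meant to fix.
    """
    lowered = {label.lower() for label in produced}
    return [
        label
        for label in missing_labels(required, produced)
        if label.lower() in lowered
    ]


# Hypothetical labels; the task's real column names are not given in the text.
required = ["Ticker", "Market Cap", "Sector"]
produced = ["Ticker", "market cap", "Sector"]

missing = missing_labels(required, produced)
casing = casing_only_mismatches(required, produced)
```

Separating casing-only mismatches from genuinely absent labels matters for edit guidance: the former call for a verbatim reminder, the latter for a missing step in the skill.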
## 5 Related Work
#### Adapting agents through external knowledge\.
Rather than fine-tuning model weights, a growing body of work adapts frozen LLM agents by injecting structured external knowledge at inference time. Methods such as Voyager, ExpeL, and Reflexion accumulate reusable procedural knowledge from execution history through code libraries, trajectory distillation, or verbal self-reflection (Wang et al., [2024a](https://arxiv.org/html/2606.14239#bib.bib24); Zhao et al., [2024](https://arxiv.org/html/2606.14239#bib.bib27); Shinn et al., [2023](https://arxiv.org/html/2606.14239#bib.bib25); Madaan et al., [2023](https://arxiv.org/html/2606.14239#bib.bib29)). At a higher level of structure, the Anthropic Agent Skills specification defines portable multi-file packages as an adaptation interface for professional workflows (Anthropic, [2025](https://arxiv.org/html/2606.14239#bib.bib31)). Several recent methods automate their construction from diverse sources: Trace2Skill from execution traces (Ni et al., [2026](https://arxiv.org/html/2606.14239#bib.bib3)), SkillWeaver from web demonstrations (Zheng et al., [2025](https://arxiv.org/html/2606.14239#bib.bib32)), AutoSkill from dialogue histories (Yang et al., [2026b](https://arxiv.org/html/2606.14239#bib.bib9)), and SkillFoundry from heterogeneous scientific resources (Shen et al., [2026](https://arxiv.org/html/2606.14239#bib.bib8)). The executable scripts in these packages follow the code-action paradigm, in which agent behavior is expressed as runnable code (Wang et al., [2024b](https://arxiv.org/html/2606.14239#bib.bib30)). At ecosystem scale, a separate line of work addresses how thousands of such skills are organized, retrieved, and orchestrated (Li et al., [2026a](https://arxiv.org/html/2606.14239#bib.bib34)).
These methods treat the resulting artifact as a finished product; none revisits it once deployed, and generating skills in a single pass without iterative refinement provides no improvement on average (Li et al., [2026b](https://arxiv.org/html/2606.14239#bib.bib21)). Anthropic’s own skill-creator tool introduces a human-in-the-loop refinement cycle that does execute with-skill and without-skill runs in parallel (Anthropic, [2025](https://arxiv.org/html/2606.14239#bib.bib31)). In this cycle, a practitioner provides evaluation test cases with explicit assertions, reviews outputs in a browser-based interface, and decides what to change; the paired contrast serves as a measurement aid for human decision-making rather than an autonomous optimization signal. SkillAudit begins where these methods stop: given an initial skill, it asks how that skill can improve autonomously through interaction with the task, without predefined test cases or human review.
#### Feedback\-driven skill refinement and its oracle dependency\.
The gap above has motivated iterative skill refinement, but existing methods typically close their evolution loop with an externally sourced ground-truth signal. As illustrated in Figure [1](https://arxiv.org/html/2606.14239#S1.F1), these methods fall into two paradigms.
*Oracle-Gated Evolution* methods gate each update on an external pass/fail signal. SkillOpt turns scored rollouts into bounded edits on a single skill document, accepting a candidate only when it strictly improves a held-out validation score (Yang et al., [2026a](https://arxiv.org/html/2606.14239#bib.bib5)). EvoSkill maintains a Pareto frontier of skill candidates, retaining only those that improve held-out validation performance (Alzubi et al., [2026](https://arxiv.org/html/2606.14239#bib.bib6)). CoEvoSkills and SkillEvolver loosen this dependence but do not escape it. CoEvoSkills co-evolves a Skill Generator with a Surrogate Verifier, yet still closes its loop on an opaque pass/fail bit from hidden tests (Zhang et al., [2026b](https://arxiv.org/html/2606.14239#bib.bib1)). SkillEvolver refines a meta-skill from failures observed when a deployed skill is reused by another agent, which removes the fixed test suite but requires an active deployment pipeline as the signal source (Zhang et al., [2026a](https://arxiv.org/html/2606.14239#bib.bib2)).
*Failure-Signal Driven* methods instead consume richer diagnostic information from external sources. SkillForge diagnoses execution failures against enterprise knowledge bases and historical support tickets through an automated analyzer–diagnostician–optimizer pipeline (Liu et al., [2026](https://arxiv.org/html/2606.14239#bib.bib7)). SkillClaw aggregates cross-user interaction trajectories across a multi-user agent ecosystem as its evolution signal (Ma et al., [2026](https://arxiv.org/html/2606.14239#bib.bib4)). RL-based approaches optimize against task-outcome rewards or composite signals (Wang et al., [2025](https://arxiv.org/html/2606.14239#bib.bib13); Xia et al., [2026](https://arxiv.org/html/2606.14239#bib.bib10); Vishe et al., [2026](https://arxiv.org/html/2606.14239#bib.bib11); Wang et al., [2026](https://arxiv.org/html/2606.14239#bib.bib12); Shi et al., [2026](https://arxiv.org/html/2606.14239#bib.bib14); Ouyang et al., [2026](https://arxiv.org/html/2606.14239#bib.bib15)), and the same evolve-from-feedback recipe has been applied to the agent’s memory operations themselves (Zhang et al., [2026c](https://arxiv.org/html/2606.14239#bib.bib35)). In each case, the evolution loop depends on signals that are unavailable when the required infrastructure does not exist.
A related line of work evolves prompts or context playbooks rather than structured skill packages (Khattab et al., [2023](https://arxiv.org/html/2606.14239#bib.bib19); Agrawal et al., [2026](https://arxiv.org/html/2606.14239#bib.bib16); Zhang et al., [2026d](https://arxiv.org/html/2606.14239#bib.bib17); Yuksekgonul et al., [2024](https://arxiv.org/html/2606.14239#bib.bib18)); these methods target a different abstraction layer and all rely on evaluation functions or execution success signals. SkillAudit operates without any such infrastructure, deriving its evolution signal entirely from the behavioral contrast between with-skill and without-skill executions.
## 6 Conclusion
We introduced SkillAudit, a skill evolution framework that operates without any hidden test, reference solution, or oracle signal. Its core mechanism, paired trajectory auditing, runs each task under with-skill and without-skill conditions and uses the behavioral contrast as the primary evolution signal: a three-way verdict gates commit/rollback decisions, while the segment-anchored trajectory evidence drives the actual content of skill edits. A ground-truth-free evaluation architecture pairs a locked Anchor Verifier, which encodes objective structural constraints, with PACE, a process-aligned evaluator cluster spanning Process Adherence, Artifact Evidence, Consistency, and Effectiveness Delta; together they provide stable, reproducible guidance without ground-truth labels. A dual-strategy routing mechanism matches the intervention to the nature of the deficiency, applying subtraction-first Refine to broadly effective skills and diagnosis-driven Repair to skills whose core guidance is wrong. On 89 containerized tasks across 8 professional domains, SkillAudit achieves 73.9% average task reward, exceeding the no-skill baseline (40.9%) by +33.0 pp and the with-skill baseline (56.7%) by +17.2 pp.
#### Future Work\.
Several directions remain open\. Most urgently, a systematic ablation study is needed to establish which components are load\-bearing: comparing the full system against variants that omit the Anchor Verifier, collapse dual\-routing to a single pipeline, or replace structured PACE output with raw trajectory diffs would clarify whether the full architecture is justified by its component\-level contributions\. A complementary question is the reliability of the pre\-assessment routing heuristic—quantifying how often Refine versus Repair assignment matches the post\-hoc appropriate choice, and characterizing the downstream cost of misrouting, would sharpen understanding of where the dual\-strategy mechanism adds the most value\. Beyond these, cross\-model transfer \(evolving a skill with one model and deploying it on another\) would test whether the captured knowledge is genuinely procedural rather than model\-specific\.
## References
- L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, C. Potts, K. Sen, A. G. Dimakis, I. Stoica, D. Klein, M. Zaharia, and O. Khattab (2026) GEPA: reflective prompt evolution can outperform reinforcement learning. In International Conference on Learning Representations (ICLR), Note: Oral. Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p4.1).
- S. Alzubi, N. Provenzano, J. Bingham, W. Chen, and T. Vu (2026) EvoSkill: automated skill discovery for multi-agent systems. External Links: 2603.02766, [Link](https://arxiv.org/abs/2603.02766). Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p3.1), [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p2.1).
- Anthropic (2025) Agent skills. Note: [https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/skills](https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/skills). Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p1.1), [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px1.p1.1).
- S. Choudhury (2025) Process reward models for LLM agents: practical framework and directions. External Links: 2502.10325, [Link](https://arxiv.org/abs/2502.10325). Cited by: [§3.2](https://arxiv.org/html/2606.14239#S3.SS2.p2.1).
- S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber (2024) MetaGPT: meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR). Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p1.1).
- C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan (2024) SWE-bench: can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR). Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p1.1).
- O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts (2023) DSPy: compiling declarative language model calls into self-improving pipelines. External Links: 2310.03714, [Link](https://arxiv.org/abs/2310.03714). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p4.1).
- H. Li, C. Mu, J. Chen, S. Ren, Z. Cui, Y. Zhang, L. Bai, and S. Hu (2026a) Organizing, orchestrating, and benchmarking agent skills at ecosystem scale. External Links: 2603.02176, [Link](https://arxiv.org/abs/2603.02176). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px1.p1.1).
- X. Li, W. Chen, Y. Liu, S. Zheng, X. Chen, Y. He, Y. Li, B. You, H. Shen, J. Sun, S. Wang, B. Li, Q. Zeng, D. Wang, X. Zhao, Y. Wang, R. B. Chaim, Z. Di, Y. Gao, J. He, Y. He, L. Jing, L. Kong, X. Lan, J. Li, S. Li, Y. Li, Y. Lin, X. Liu, X. Liu, H. Lyu, Z. Ma, B. Wang, R. Wang, T. Wang, W. Ye, Y. Zhang, H. Xing, Y. Xue, S. Dillmann, and H. Lee (2026b) SkillsBench: benchmarking how well agent skills work across diverse tasks. External Links: 2602.12670, [Link](https://arxiv.org/abs/2602.12670). Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p1.1), [§4.1](https://arxiv.org/html/2606.14239#S4.SS1.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px1.p1.1).
- X. Liu, X. Luo, L. Li, G. Huang, J. Liu, and H. Qiao (2026) SkillForge: forging domain-specific, self-evolving agent skills in cloud technical support. External Links: 2604.08618, [Document](https://dx.doi.org/10.1145/3805712.3808466), [Link](https://arxiv.org/abs/2604.08618). Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p3.1), [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p3.1).
- Z. Ma, S. Yang, Y. Ji, X. Wang, Y. Wang, Y. Hu, T. Huang, and X. Chu (2026) SkillClaw: let skills evolve collectively with agentic evolver. External Links: 2604.08377, [Link](https://arxiv.org/abs/2604.08377). Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p3.1), [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p3.1).
- A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark (2023) Self-refine: iterative refinement with self-feedback. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36. Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px1.p1.1).
- M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, et al. (2026) Terminal-bench: benchmarking agents on hard, realistic tasks in command line interfaces. External Links: 2601.11868, [Link](https://arxiv.org/abs/2601.11868). Cited by: [§4.1](https://arxiv.org/html/2606.14239#S4.SS1.SSS0.Px1.p1.1).
- J. Ni, Y. Liu, X. Liu, Y. Sun, M. Zhou, P. Cheng, D. Wang, X. Jiang, and G. Jiang (2026) Trace2Skill: distill trajectory-local lessons into transferable agent skills. External Links: 2603.25158, [Link](https://arxiv.org/abs/2603.25158). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px1.p1.1).
- S. Ouyang, J. Yan, Y. Chen, R. Han, Z. Wang, B. D. Mishra, R. Meng, C. Li, Y. Jiao, K. Zha, M. Shen, V. Tirumalashetty, G. Lee, J. Han, T. Pfister, and C. Lee (2026) SkillOS: learning skill curation for self-evolving agents. External Links: 2605.06614, [Link](https://arxiv.org/abs/2605.06614). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p3.1).
- S. Shen, W. Cheng, M. Ma, A. Turcan, M. J. Zhang, and J. Ma (2026) SkillFoundry: building self-evolving agent skill libraries from heterogeneous scientific resources. External Links: 2604.03964, [Link](https://arxiv.org/abs/2604.03964). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px1.p1.1).
- Y. Shi, Y. Chen, Z. Lu, Y. Miao, S. Liu, Q. Gu, X. Cai, X. Wang, and A. Zhang (2026) Skill1: unified evolution of skill-augmented agents via reinforcement learning. External Links: 2605.06130, [Link](https://arxiv.org/abs/2605.06130). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p3.1).
- N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36. Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p1.1), [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px1.p1.1).
- Y. Vishe, R. Surana, X. Jiang, Z. Huang, X. Li, N. L. Kuang, T. Yu, R. A. Rossi, J. Shang, J. McAuley, and J. Wu (2026) Skill-r1: agent skill evolution via reinforcement learning. External Links: 2605.09359, [Link](https://arxiv.org/abs/2605.09359). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p3.1).
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2024a) Voyager: an open-ended embodied agent with large language models. Transactions on Machine Learning Research. Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p1.1), [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px1.p1.1).
- H. Wang, G. Wang, H. Xiao, Y. Zhou, Y. Pan, J. Wang, K. Xu, Y. Wen, X. Ruan, X. Chen, and H. Qi (2026) Skill-sd: skill-conditioned self-distillation for multi-turn LLM agents. External Links: 2604.10674, [Link](https://arxiv.org/abs/2604.10674). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p3.1).
- J. Wang, Q. Yan, Y. Wang, Y. Tian, S. S. Mishra, Z. Xu, M. Gandhi, P. Xu, and L. L. Cheong (2025) Reinforcement learning for self-improving agent with skill library. External Links: 2512.17102, [Link](https://arxiv.org/abs/2512.17102). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p3.1).
- X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji (2024b) Executable code actions elicit better LLM agents. In International Conference on Machine Learning (ICML). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px1.p1.1).
- P. Xia, J. Chen, H. Wang, J. Liu, K. Zeng, Y. Wang, S. Han, Y. Zhou, X. Zhao, H. Chen, Z. Zheng, C. Xie, and H. Yao (2026) SkillRL: evolving agents via recursive skill-augmented reinforcement learning. External Links: 2602.08234, [Link](https://arxiv.org/abs/2602.08234). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p3.1).
- R. Xu and Y. Yan (2026) Agent skills for large language models: architecture, acquisition, security, and the path forward. External Links: 2602.12430, [Link](https://arxiv.org/abs/2602.12430). Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p1.1).
- Y. Yang, Z. Gong, W. Huang, Q. Yang, Z. Zhou, Z. Huang, Y. Li, X. Gao, Q. Dai, B. Liu, K. Qiu, Y. Yang, D. Chen, X. Yang, and C. Luo (2026a) SkillOpt: executive strategy for self-evolving agent skills. External Links: 2605.23904, [Link](https://arxiv.org/abs/2605.23904). Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p3.1), [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p2.1).
- Y. Yang, J. Li, Q. Pan, B. Zhan, Y. Cai, L. Du, J. Zhou, K. Chen, Q. Chen, X. Li, B. Zhang, and L. He (2026b) AutoSkill: experience-driven lifelong learning via skill self-evolution. External Links: 2603.01145, [Link](https://arxiv.org/abs/2603.01145). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px1.p1.1).
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023) ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR). Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p1.1).
- M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, Z. Huang, C. Guestrin, and J. Zou (2024) TextGrad: automatic “differentiation” via text. External Links: 2406.07496, [Link](https://arxiv.org/abs/2406.07496). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p4.1).
- G. Zhang, E. Zhu, J. Zhou, C. Jia, and H. Wang (2026a) SkillEvolver: skill learning as a meta-skill. External Links: 2605.10500, [Link](https://arxiv.org/abs/2605.10500). Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p3.1), [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p2.1).
- H. Zhang, S. Fan, H. P. Zou, Y. Chen, Z. Wang, J. Zhou, C. Li, W. Huang, Y. Yao, K. Zheng, X. Liu, X. Li, and P. S. Yu (2026b) CoEvoSkills: self-evolving agent skills via co-evolutionary verification. External Links: 2604.01687, [Link](https://arxiv.org/abs/2604.01687). Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p3.1), [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p2.1).
- H. Zhang, Q. Long, J. Bao, T. Feng, W. Zhang, H. Yue, and W. Wang (2026c) MemSkill: learning and evolving memory skills for self-evolving agents. External Links: 2602.02474, [Link](https://arxiv.org/abs/2602.02474). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p3.1).
- Q. Zhang, C. Hu, S. Upasani, B. Ma, F. Hong, V. Kamanuru, J. Rainton, C. Wu, M. Ji, H. Li, U. Thakker, J. Zou, and K. Olukotun (2026d) Agentic context engineering: evolving contexts for self-improving language models. In International Conference on Learning Representations (ICLR). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px2.p4.1).
- A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang (2024) ExpeL: LLM agents are experiential learners. In AAAI Conference on Artificial Intelligence. Cited by: [§1](https://arxiv.org/html/2606.14239#S1.p1.1), [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px1.p1.1).
- B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srinivasa, G. Liu, G. Neubig, and Y. Su (2025) SkillWeaver: web agents can self-improve by discovering and honing skills. External Links: 2504.07079, [Link](https://arxiv.org/abs/2504.07079). Cited by: [§5](https://arxiv.org/html/2606.14239#S5.SS0.SSS0.Px1.p1.1).
## Appendix A Evolution Loop Algorithm

**Algorithm 1** SkillAudit Evolution Loop

**Input:** Task description $T$, workspace $W$, initial skill $S_0$

**Output:** Evolved skill $S^{*}$

1. $\mathcal{I} \leftarrow \textsc{TaskInterpreter}(T, S_0)$ {structured task specification}
2. $V \leftarrow \textsc{LockAnchorVerifier}(\mathcal{I})$; $\textit{mode} \leftarrow \textsc{PreAssess}(\mathcal{I})$ {one-time init}
3. $S \leftarrow S_0$
4. **for** $t = 1, \ldots, 5$ **do**
5. $\tau_w, \tau_{wo} \leftarrow \textsc{PairedExecution}(T, W, S)$
6. $\textit{verdict} \leftarrow \textsc{Aggregate}\big(\textsc{PACE}(\tau_w, \tau_{wo}, S),\; V(\tau_w)\big)$ {hurt has veto}
7. **if** $\textit{verdict} = \textit{helped}$ **then**
8. $S \leftarrow \textsc{SkillIterator}(S, \tau_w, \tau_{wo}, \textit{mode})$
9. **else if** $\textit{verdict} = \textit{hurt}$ **then**
10. rollback $S$
11. **end if**
12. **if** $\textsc{Converged}(S, V, t)$ **then**
13. **break**
14. **end if**
15. **end for**
16. **return** $S$ {$S^{*} \leftarrow S$}
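The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: every callable (`paired_execution`, `pace`, `verifier`, `skill_iterator`, `converged`) is a hypothetical stand-in supplied by the caller, the verdict strings are assumed labels, and the rollback is modeled as a simple stack of committed skill states. Treating an Anchor Verifier failure as a `hurt` verdict is an assumption based on the statement that a verifier FAIL forces a rollback.

```python
from typing import Any, Callable, List, Tuple

MAX_ITERS = 5  # iteration cap taken from Algorithm 1 (t = 1, ..., 5)


def aggregate(pace_verdict: str, anchor_passed: bool) -> str:
    """Combine the PACE verdict with the Anchor Verifier result; 'hurt' has veto."""
    if pace_verdict == "hurt" or not anchor_passed:
        return "hurt"
    return pace_verdict


def evolve(
    task: Any,
    workspace: Any,
    skill0: Any,
    *,
    paired_execution: Callable[[Any, Any, Any], Tuple[Any, Any]],
    pace: Callable[[Any, Any, Any], str],
    verifier: Callable[[Any], bool],
    skill_iterator: Callable[[Any, Any, Any, str], Any],
    converged: Callable[[Any, Any, int], bool],
    mode: str,
) -> Any:
    """Run the evolution loop and return the final skill."""
    skill = skill0
    committed: List[Any] = [skill0]  # committed states, used for rollback
    for t in range(1, MAX_ITERS + 1):
        # Same task, executed with and without the candidate skill.
        tau_w, tau_wo = paired_execution(task, workspace, skill)
        verdict = aggregate(pace(tau_w, tau_wo, skill), verifier(tau_w))
        if verdict == "helped":
            skill = skill_iterator(skill, tau_w, tau_wo, mode)
            committed.append(skill)
        elif verdict == "hurt":
            # Assumed rollback semantics: drop the latest commit, if any.
            if len(committed) > 1:
                committed.pop()
            skill = committed[-1]
        if converged(skill, verifier, t):
            break
    return skill
```

Passing the components as keyword arguments keeps the loop itself free of any model or sandbox dependency, so it can be exercised with scripted stubs before real executors are attached.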
## Appendix B Detailed Case Studies

Table [5](https://arxiv.org/html/2606.14239#A2.T5) summarizes four evolution runs selected to illustrate Refine and Repair behavior. Cases A and B are the primary narratives in §[4.5](https://arxiv.org/html/2606.14239#S4.SS5). Case C (`lab-unit-harmonization`) shows a Refine run that reaches a clear paired contrast at iteration 2. Case D (`exceltable-in-ppt`) shows a Repair run that detects and rolls back a *skill_hurt* signal across four iterations before converging. Figure [3](https://arxiv.org/html/2606.14239#A2.F3) shows the per-iteration verdict timeline for all four cases, and Figure [4](https://arxiv.org/html/2606.14239#A2.F4) shows their skill line counts before and after evolution. Evolved skills are archived for reproducibility.

Table 5: Case-study tasks: routing mode, skill size change, paired contrast signal, and post-evolution Opus reward.

Figure 3: Evolution timelines for the four case-study tasks. Green boxes mark commits or *skill_helped* verdicts; red marks a regression revert; blue marks PACE-driven edits or fill-gap injections.

Figure 4: Skill document line counts before and after evolution for the four case-study tasks. Every case shrinks: the Repair task `data-to-d3` undergoes the largest relative reduction ($-73\%$), the Refine tasks `software-dependency-audit` ($-35\%$) and `lab-unit-harmonization` ($-37\%$) strip bulk while preserving the load-bearing core, and the Repair task `exceltable-in-ppt` makes a targeted removal ($-3\%$) to eliminate a single harmful instruction.

### B.1 Case A: `software-dependency-audit` (Refine, primary)
#### Task\.
Scan `/root/package-lock.json` offline with Trivy, filter HIGH/CRITICAL CVEs, and write `/root/security_audit.csv` with header `Package,Version,CVE_ID,Severity,CVSS_Score,Fixed_Version,Title,Url`. The Anchor Verifier checks file existence, exact header, severity enum, and CVSS format without accessing hidden pytest tests.
#### Evolution run.
Baseline: 935 lines across three skill files. Iteration 1: both $\tau_w$ and $\tau_{wo}$ pass Anchor checks; PACE aggregates `surgery_targets` from *eval-portfolio-quality* (overlapping CVSS definitions across files) and *eval-task-alignment* (WC-001: hardcoded `TRIVY_CACHE_PATH = './trivy-cache'` vs. actual DB at `/root/.cache/trivy/db/`). The Skill Iterator uses the trajectory evidence from $\tau_{wo}$ showing the cache path mismatch to delete tutorial sections and replace the hardcoded cache constant with a `cache_dir` parameter; `protected_segments` preserve the CVSS source-priority loop `for source in ['nvd','ghsa','redhat']` and the offline flags `--skip-db-update`, `--offline-scan`. Result: 610 lines. Iteration 2: $\tau_w$ PASS, $\tau_{wo}$ FAIL (the without-skill run omits `--skip-db-update` and writes a malformed CSV row); verdict *skill_helped*; fast exit.
#### Representative edit.
In `trivy-offline-vulnerability-scanning/SKILL.md`, the evolved skill replaces a fixed `./trivy-cache` constant with:
> `"--cache-dir", cache_dir  # Path to pre-downloaded database`

and removes $\sim$50 lines of “What is CVSS?” scoring tables unrelated to the deliverable.
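To make the edit concrete, the sketch below shows what a parameterized command builder and the protected source-priority loop might look like after evolution. It is an illustration only: the function names are hypothetical, the `fs` subcommand and the `--severity` and `--format json` options are assumptions beyond what the case study states, and the `V3Score` field name is assumed from Trivy's JSON report layout.

```python
from typing import Dict, List, Optional

# Source priority preserved by the protected segment in the case study.
CVSS_SOURCE_PRIORITY = ["nvd", "ghsa", "redhat"]


def build_trivy_command(target: str, cache_dir: str) -> List[str]:
    """Build an offline scan command; cache_dir replaces the hardcoded constant."""
    return [
        "trivy", "fs",                  # assumed subcommand for a lockfile scan
        "--cache-dir", cache_dir,       # Path to pre-downloaded database
        "--skip-db-update",             # offline flag named in the case study
        "--offline-scan",               # offline flag named in the case study
        "--severity", "HIGH,CRITICAL",  # assumed way to filter severities
        "--format", "json",             # assumed output format
        target,
    ]


def pick_cvss_score(cvss: Dict[str, Dict[str, float]]) -> Optional[float]:
    """Return the first available score following the source-priority order."""
    for source in CVSS_SOURCE_PRIORITY:
        score = cvss.get(source, {}).get("V3Score")  # assumed field name
        if score is not None:
            return score
    return None
```

Returning the argument list rather than running the command keeps the example side-effect free; a caller would pass the list to `subprocess.run` in an environment where Trivy and its database are installed.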
#### PACE dimensions triggered\.
Consistency (WC-001 path conflict via *eval-task-alignment*; full offline command via *eval-method-adherence*); Effectiveness Delta (duplicate helpers via *eval-portfolio-quality*).
### B.2 Case B: `data-to-d3` (Repair, primary)
#### Task.
Build `/root/output/index.html`: a D3.js force simulation (sector-clustered bubbles sized by market cap, ETF tooltips excluded) plus a 50-row table with columns `Ticker symbol`, `Full company name`, `Sector`, `Market cap` linked bidirectionally to bubble clicks. The Anchor Verifier enforces 13 hard constraints including the D3 bundle path and verbatim column strings.
#### Evolution run.
The initial skill (188 lines) pins a specific D3 version, recommends `dist/` output paths, and prescribes fixed-tick deterministic force layouts. Pre-audit surgery writes four `delete_or_swap` targets with `without_verified_snippet` replacements drawn from $\tau_{wo}$ behavior (e.g., “Use the D3 version specified in the task”). Iteration 1: $\tau_{wo}$ 13/13 PASS, $\tau_w$ 12/13 FAIL (Title Case column headers); mandatory `regression_revert_to_baseline`. Iteration 2: aggressive deletion to 48 lines (tooltip and click patterns only), guided by the column-name divergence evidence in $\tau_w$ vs. $\tau_{wo}$. Iteration 3: $\tau_w$ 13/13 PASS, $\tau_{wo}$ 12/13 FAIL $\Rightarrow$ *skill_helped*. Iteration 5: add verbatim-label reminder (50 lines); 13/13 PASS; convergence declared. Benchmark: Opus $0.0$ (initial skill) $\to$ $1.0$ (evolved).
#### Representative evolved skill (50 lines).
> Critical: Use the task’s exact column names and labels verbatim—copy the precise wording and casing from the task description\.
The skill no longer pins a D3 version or output directory; it delegates version and path requirements to the task text while retaining reusable interaction snippets\.
#### Mechanism note\.
This case illustrates *attention fragmentation*: even factually useful patterns (tooltips, click handlers) harm performance when bundled with 140 lines of conflicting defaults. Repair responds with deletion-first surgery plus a minimal boundary reminder, not a single-paragraph swap.
### B.3 Case C: `lab-unit-harmonization` (Refine)
#### Task.
Harmonize laboratory measurement units across a clinical dataset (Natural Science domain): read heterogeneous unit strings from input CSV files, convert all values to SI base units, and write a normalized output with a conformance report. The Anchor Verifier checks file existence, required column headers, and numeric range plausibility without accessing hidden test scripts.
#### Evolution run.
Baseline: 408 lines across three skill files. Iteration 1: both $\tau_w$ and $\tau_{wo}$ pass Anchor checks; PACE returns *skill_inert* with no high-confidence surgery targets. Iteration 2 is the decisive round: $\tau_w$ passes all 8 Anchor checks (score $1.0$); $\tau_{wo}$ fails 1 of 8 (score $0.88$, missing SI conversion for a legacy `mg/dL` field). The behavioral divergence is directly traceable to the skill’s unit-mapping table, which the without-skill agent omits; this trajectory evidence drives the Skill Iterator to consolidate and protect that table while removing surrounding noise. Verdict: *skill_helped*; convergence declared. Final skill: 259 lines ($-37\%$).
#### Mechanism note.
This case illustrates the diagnostic value of paired execution even when neither run fails Anchor checks in the first iteration: a *skill_inert* verdict at iteration 1 correctly defers rather than acting on weak evidence, and the decisive behavioral contrast emerges at iteration 2.
#### PACE dimensions triggered.
Effectiveness Delta (*eval-incremental-value*: unit-mapping table absent in $\tau_{wo}$); Artifact Evidence (*eval-output-evidence-check*: SI compliance in $\tau_w$ only).
### B.4 Case D: `exceltable-in-ppt` (Repair)
#### Task.
Insert a formatted Excel table into a PowerPoint slide deck (Office & White Collar domain): read `/root/input.xlsx`, extract the target sheet, and embed it as a live table in `/root/output/results.pptx` while preserving all cell formulas. The Anchor Verifier enforces output path, slide count, and, critically, formula preservation (6 formulas must survive the round-trip).
#### Evolution run.
Baseline: 1825 lines. Iteration 1: both $\tau_w$ and $\tau_{wo}$ pass Anchor checks (6/6); *eval-safety-compliance* reports no issue; verdict *skill_inert*. Iteration 2: $\tau_{wo}$ PASS (6/6); $\tau_w$ FAIL (0/6). The with-skill agent follows the skill’s “MANDATORY IF USING FORMULAS: Use the recalc.py script” directive, which silently zeroes all formulas during the recalculation pass. *eval-safety-compliance* catches the formula destruction; verdict *skill_hurt*. The Repair pipeline rolls back to baseline and, using the trajectory evidence showing exactly where `recalc.py` destroys formulas in $\tau_w$, surgically deletes the `recalc.py` workflow step and its associated section ($-52$ lines). Iteration 3: both PASS; formula preservation confirmed (6/6); verdict *skill_helped*. Iteration 4 (verification): both PASS; 8/11 evaluators PASS; convergence declared. Final skill: 1773 lines ($-3\%$). Benchmark: Opus $0.0$ (with-skill) $\to$ $1.0$ (evolved).
#### Mechanism note.
This case illustrates *latent harm*: the `recalc.py` instruction is factually correct in other contexts but actively destructive for this task’s formula-preservation requirement. The Anchor Verifier did not check formula counts at iteration 1, so the harm was invisible; PACE’s *eval-safety-compliance* evaluator surfaced it at iteration 2 through the behavioral contrast. The Repair pipeline’s rollback-then-targeted-delete response is the intended behavior: the trajectory evidence from the paired runs pinpoints the exact harmful passage, and a single surgical removal of 52 lines converts a harmful skill into a passing one.
## Appendix C PACE Evaluator Inventory
PACE is built from twelve evaluator templates spanning the four dimensions introduced in §[3.2](https://arxiv.org/html/2606.14239#S3.SS2). Table [1](https://arxiv.org/html/2606.14239#S3.T1) in the main text lists one representative evaluator per dimension; Table [6](https://arxiv.org/html/2606.14239#A3.T6) below gives the complete inventory. Eight evaluators run on every iteration; the remaining four (*eval-tool-use-rationality*, *eval-error-robustness*, *eval-method-adherence*, *eval-safety-compliance*) are triggered conditionally, i.e., only when the task specification or trajectory makes them applicable (e.g., *eval-method-adherence* fires only when the task explicitly mandates a specific method or tool). Each evaluator reads both $\tau_w$ and $\tau_{wo}$, compares behavior at divergence points, and emits segment-anchored `action_signals` and `protected_hints` as described in §[3.2](https://arxiv.org/html/2606.14239#S3.SS2).

Table 6: Complete PACE evaluator inventory (12 templates across four dimensions). $\dagger$ marks the four conditionally triggered evaluators; the other eight run every iteration.
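The conditional triggering described above could be expressed as a small selection step. The sketch below is illustrative only: `select_evaluators` and the `applicable` flag map are hypothetical names, and the always-on list is passed in by the caller because the full inventory of Table 6 is not reproduced here.

```python
from typing import Dict, List

# The four conditionally triggered evaluators named in this appendix.
CONDITIONAL_EVALUATORS = (
    "eval-tool-use-rationality",
    "eval-error-robustness",
    "eval-method-adherence",
    "eval-safety-compliance",
)


def select_evaluators(always_on: List[str], applicable: Dict[str, bool]) -> List[str]:
    """Return always-on evaluators plus conditional ones flagged as applicable."""
    selected = list(always_on)
    selected += [name for name in CONDITIONAL_EVALUATORS if applicable.get(name, False)]
    return selected
```

For example, a task that explicitly mandates a tool would set `applicable["eval-method-adherence"] = True`, while the other three conditional evaluators stay off unless their own triggers fire.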
## Appendix D Skill Length Dynamics (Full Distribution)
Figure [5](https://arxiv.org/html/2606.14239#A4.F5) gives the full per-task distribution of skill line-count change summarized in Table [4](https://arxiv.org/html/2606.14239#S4.T4), as empirical CDFs stratified by routing pipeline and post-evolution Opus pass/fail. The horizontal axis uses a symmetric log scale so that large deletions (e.g., `flink-query`, $-7{,}375$ lines) remain visible without compressing the central mass near zero. Refine carries a heavier left tail of large subtractions, consistent with its subtraction-first mandate; Repair tasks that fail cluster near small positive $\Delta$, reflecting incomplete convergence rather than uniform shrinkage.

(a) Refine pipeline ($n=43$): pass $n=39$, fail $n=4$.

(b) Repair pipeline ($n=46$): pass $n=28$, fail $n=18$.

Figure 5: Empirical CDFs of skill line-count change ($\Delta = \text{evolved} - \text{initial}$) for all 89 tasks, stratified by routing mode and Opus pass/fail. Solid: pass; dashed: fail. Dotted line marks $\Delta = 0$; symmetric log scale (linear for $|\Delta| \leq 50$).
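For readers who want to reproduce a plot of this kind, the sketch below computes an empirical CDF and applies one common form of symmetric-log transform with a linear region of $|\Delta| \leq 50$. The exact transform used for Figure 5 is not specified, so the formula in `symlog` is an assumption, and plotting libraries may define their symmetric-log scale slightly differently.

```python
import math
from typing import List, Tuple


def symlog(delta: float, linthresh: float = 50.0) -> float:
    """Identity inside [-linthresh, linthresh], log10 growth outside (assumed form)."""
    if abs(delta) <= linthresh:
        return delta
    magnitude = linthresh * (1.0 + math.log10(abs(delta) / linthresh))
    return math.copysign(magnitude, delta)


def ecdf(values: List[float]) -> List[Tuple[float, float]]:
    """Return (value, cumulative fraction) pairs sorted by value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]
```

The transform is continuous at the threshold, so points just inside and just outside $|\Delta| = 50$ stay adjacent on the axis while very large deletions are compressed.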
## Appendix E Implementation Details
#### Anchor Verifier generation.
The Anchor Verifier is compiled once from the structured task specification produced by the task interpreter (§[3.1](https://arxiv.org/html/2606.14239#S3.SS1)). From the specification it extracts only constraints checkable without any ground truth: required output files and their paths, exact headers or schemas declared in the task, enumerated value sets, numeric fields recomputable from workspace data, and required companion files. These are emitted as a deterministic check script and then locked for the remainder of evolution. Coverage is kept intentionally narrow so that a `FAIL`, which forces a rollback, reflects a genuine structural regression rather than an overly strict check.
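As an illustration of what such a deterministic check script might contain, the sketch below encodes three of the constraint types for the Case A deliverable: file existence, exact header, and an enumerated severity set. The function name and the returned check labels are hypothetical, and restricting severities to HIGH and CRITICAL is an assumption drawn from the task's filtering requirement rather than from the actual verifier.

```python
import csv
import os
from typing import List

# Header declared in the Case A task description.
EXPECTED_HEADER = [
    "Package", "Version", "CVE_ID", "Severity",
    "CVSS_Score", "Fixed_Version", "Title", "Url",
]
ALLOWED_SEVERITIES = {"HIGH", "CRITICAL"}  # assumed enum


def anchor_check(csv_path: str) -> List[str]:
    """Return the names of failed checks; an empty list means PASS."""
    if not os.path.isfile(csv_path):
        return ["file_exists"]
    failures: List[str] = []
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader, None)
        if header != EXPECTED_HEADER:
            return ["exact_header"]
        severity_idx = header.index("Severity")
        for row in reader:
            # A row with the wrong width or an unknown severity fails the enum check.
            if len(row) != len(header) or row[severity_idx] not in ALLOWED_SEVERITIES:
                failures.append("severity_enum")
                break
    return failures
```

None of these checks needs a reference answer: each one compares the output only against constraints stated in the task, which is what allows the script to be locked after compilation.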
#### Skill Iterator edit operations.
Given the `surgery_targets` and `protected_segments` from PACE together with the paired trajectory evidence, the Skill Iterator applies edits in a fixed priority order: (i) remove or correct any passage flagged as harmful; (ii) for Refine, delete noisy or redundant passages under a subtraction-first budget; (iii) for Repair, swap a harmful passage for a verified alternative extracted from $\tau_{wo}$, then fill diagnosed gaps. Every edit must be anchored to a specific surgery target and grounded in the behavioral divergence evidence, and protected segments are never altered. After each commit the change is versioned with git, so any edit that later produces a *skill_hurt* verdict or an Anchor regression can be rolled back to the previous committed state.
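The priority order can be pictured as a filter followed by a stable sort. The sketch below is a simplified model under stated assumptions: the edit kinds, the `Edit` record, and identifying protected segments by target id are all hypothetical conventions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Set

# Hypothetical edit kinds, ranked by the priority order described above.
PRIORITY = {"fix_harmful": 0, "delete_noise": 1, "swap": 2, "fill_gap": 3}


@dataclass(frozen=True)
class Edit:
    kind: str       # one of the PRIORITY keys
    target_id: str  # surgery target this edit is anchored to


def plan_edits(edits: List[Edit], mode: str, protected_ids: Set[str]) -> List[Edit]:
    """Keep edits allowed for the routing mode, skip protected targets, sort by priority."""
    allowed = {"fix_harmful"}  # harmful passages are handled in both modes
    if mode == "refine":
        allowed.add("delete_noise")
    elif mode == "repair":
        allowed.update({"swap", "fill_gap"})
    eligible = [
        e for e in edits
        if e.kind in allowed and e.target_id not in protected_ids
    ]
    # sorted() is stable, so edits of the same kind keep their input order.
    return sorted(eligible, key=lambda e: PRIORITY[e.kind])
```

Because every `Edit` carries a `target_id`, an edit without a surgery target cannot be constructed, which mirrors the requirement that each change be anchored to a specific target.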
#### Overlapping edit conflicts.
A surgery target and a protected segment can physically overlap, e.g., a code block that mixes a removable tutorial comment with a load-bearing API call. The Skill Iterator resolves such conflicts conservatively: protection takes precedence at the finest available granularity. The overlapping span is split at the protected boundary, only the unprotected remainder is eligible for deletion or replacement, and if a clean split is not possible the entire span is retained. This biases the system toward under-editing rather than destroying load-bearing content, consistent with the asymmetric cost of a harmful update.
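The split rule amounts to interval subtraction. The sketch below models spans as half-open line ranges and returns the part of a surgery target that remains editable; the representation, the function name, and the `can_split` flag (standing in for whatever test decides that a clean split is possible) are assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open line range [start, end)


def editable_remainder(
    target: Span, protected: List[Span], can_split: bool = True
) -> List[Span]:
    """Return the sub-spans of target that do not overlap any protected span."""
    start, end = target
    overlapping = sorted(p for p in protected if p[0] < end and p[1] > start)
    if overlapping and not can_split:
        return []  # no clean split: retain the entire span unedited
    remainder: List[Span] = []
    cursor = start
    for p_start, p_end in overlapping:
        if p_start > cursor:
            remainder.append((cursor, p_start))  # unprotected gap before this span
        cursor = max(cursor, p_end)
    if cursor < end:
        remainder.append((cursor, end))  # unprotected tail after the last span
    return remainder
```

For instance, a target covering lines 10 to 29 with a protected call on lines 15 to 19 yields two editable pieces, one on each side of the protected range, while an empty result signals that nothing in the target may be touched.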
#### Degenerate without-skill runs.
When the without-skill trajectory $\tau_{wo}$ fails to provide a coherent reference (§[3.2](https://arxiv.org/html/2606.14239#S3.SS2)), the Effectiveness Delta evaluators record a null incremental signal rather than a fabricated comparison, the swap protocol (which would import a verified snippet from $\tau_{wo}$) is disabled, and the iteration relies on the Anchor Verifier and the single-trajectory evaluators applied to $\tau_w$. The skill can still be pruned or corrected, but gap-filling from the contrast is suspended until a future iteration yields an informative pair.