Not All Skills Help: Measuring and Repairing Agent Knowledge

arXiv cs.CL 06/16/26, 04:00 AM Papers
llm-agents skill-curation causal-attribution appworld tau-bench deepseek-v3 gpt-4.1
Summary
This paper shows that naive skill accumulation in LLM agents can cause performance regressions, because skills that help on some tasks hurt on others. The authors propose Assay, a framework that measures per-skill causal contributions and applies per-task masking, reaching a new state of the art on AppWorld's hardest split and moving GPT-4.1 past o4-mini, o1, and GPT-4.5 on the τ-bench retail leaderboard, all without weight updates.
arXiv:2606.15390v1 Announce Type: new
# Not All Skills Help: Measuring and Repairing Agent Knowledge
Source: [https://arxiv.org/html/2606.15390](https://arxiv.org/html/2606.15390)
Yixuan Wang¹\*, Yiyang Zhou¹\*, Yiming Liang², Congyu Zhang¹, Fuxiao Liu³, Jiawei Zhou¹, Huaxiu Yao¹

¹UNC Chapel Hill, ²Purdue, ³NVIDIA

###### Abstract

LLM agents can improve without weight updates by accumulating natural-language skills from experience, but current systems entrust every decision about which skills to keep and how to apply them to LLM judgment alone. We argue that this conflates two distinct roles: generating a skill from experience is a creative act that judgment handles well, while deciding whether that skill actually helps requires empirical evidence across many tasks. Measuring per-skill causal contributions via randomized masking, we find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, yet their opposing effects cancel in aggregate, making them invisible to global curation methods. We propose Assay, a framework that separates generation from curation: it computes a per-skill causal attribution on a small development set, restructures the library offline, and suppresses skills with negative predicted effect for each test task. Across seven base models spanning four providers and two benchmarks (AppWorld and τ-bench), Assay consistently improves over prior skill-curation approaches. On AppWorld's hardest split, DeepSeek-V3 achieves 69.3% task-goal completion (47.4% relative improvement), a new state of the art among all published methods including weight-tuned approaches. On τ-bench retail, GPT-4.1 improves by 8.7% relative, advancing past o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablation traces the dominant gain to per-task masking, confirming that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally. Code is available at [https://github.com/aiming-lab/assay](https://github.com/aiming-lab/assay).

\*Equal contribution.

## 1 Introduction

The past two years have seen rapid progress in LLM agents that improve without weight updates. The recipe is simple: let the agent attempt tasks, distill successful trajectories into natural-language *skills* (short rules, heuristics, procedural templates), and inject them into the context window for future tasks \[[9](https://arxiv.org/html/2606.15390#bib.bib1), [23](https://arxiv.org/html/2606.15390#bib.bib2), [22](https://arxiv.org/html/2606.15390#bib.bib3), [3](https://arxiv.org/html/2606.15390#bib.bib4)\]. On interactive benchmarks such as AppWorld \[[11](https://arxiv.org/html/2606.15390#bib.bib8)\], skill-based methods have produced double-digit gains, rivaling methods that fine-tune model weights. An assumption runs quietly through this work: LLM judgment is a sufficient supervisory signal for the entire skill lifecycle. Generation, retention, and retrieval are all delegated to the same LLM, and nothing ever checks whether a retained skill actually helps.

We find that this unchecked accumulation has a systematic downside. Across seven models and two benchmarks, skills essential on the tasks from which they were learned become pure overhead on tasks where they do not apply. On AppWorld, rules for multi-step purchases exhaust the step budget on simple single-action tasks; on τ-bench retail \[[17](https://arxiv.org/html/2606.15390#bib.bib9)\], rules from complex exchanges derail straightforward cancellations. Figure [1](https://arxiv.org/html/2606.15390#S1.F1) traces an instance end to end: a Spotify playback verification rule distracts the agent from a dimensional constraint in an Amazon purchasing task; per-task masking suppresses it and the agent succeeds. When we trace failures across hundreds of tasks, we find that a small set of skills accounts for a disproportionate share of regressions, and that the same skill can help on one task type while hurting on another.

![Refer to caption](https://arxiv.org/html/2606.15390v1/figure1.jpg)

Figure 1: Skills designed for one domain can harm tasks in another; per-task masking fixes this. The task is to purchase a coffee grinder on Amazon that fits a 6.3 × 6.3-inch countertop and has a seller rating ≥ 4.5. *Left*: without any skill library, the agent fails to enforce the size constraint. *Centre*: the full skill library makes things worse: off-topic rules, including a Spotify playback verification rule (`vc-00043`) and a side-effect checking rule (`vc-00021b`), distract the agent from the dimensional requirement. *Right*: per-task masking suppresses these two skills and the agent succeeds. The causal effect of `vc-00043` averages −0.067 across development tasks (bottom), confirming consistent harm.

To understand these failures, we measure per-skill causal effects via randomized masking \[[2](https://arxiv.org/html/2606.15390#bib.bib14)\] on held-out tasks. The resulting attribution reveals *causal heterogeneity*: many skills reverse sign across task types, helping on some and hurting on others. No single-task judge can detect this, because the reversal is visible only when evidence is aggregated across many tasks. We therefore separate the two roles: judgment generates skills, and measurement curates them. We present Assay, a framework that operationalises this separation in three stages: measure per-skill causal effects on held-out tasks via randomized masking, restructure the library offline by splitting heterogeneous skills and retiring inert ones, and personalise the library per test instance by suppressing skills with negative predicted impact.

In summary, our primary contribution is Assay, a framework that provides the first empirical characterisation of causal heterogeneity in agent skill libraries and a practical method for resolving it. Across seven models and two benchmarks, Assay consistently improves over prior curation methods, with DeepSeek-V3 achieving a new state of the art on AppWorld (69.3% TGC, 47.4% relative improvement) and GPT-4.1 advancing past o4-mini, o1, and GPT-4.5 on τ-bench without weight modification. Ablation traces the dominant gain to per-task masking, confirming that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally.

## 2 Assay: Attribution-Based Skill Selection and Assembly

Generating a useful skill from a single task requires creativity; deciding whether that skill actually helps across many tasks requires empirical evidence that no single task can provide. We operationalise this separation in a framework we call Assay. The entire framework flows from a single object: a per-skill, per-task causal attribution matrix $\mathbf{C} \in \mathbb{R}^{N \times M}$, computed once on a small held-out set, from which all curation decisions derive. We describe how $\mathbf{C}$ is measured (§[2.1](https://arxiv.org/html/2606.15390#S2.SS1)), how it guides offline library restructuring (§[2.2](https://arxiv.org/html/2606.15390#S2.SS2)), and how it personalises the library for each test task at inference time (§[2.3](https://arxiv.org/html/2606.15390#S2.SS3)). Figure [2](https://arxiv.org/html/2606.15390#S2.F2) gives an overview.

**Preliminaries.** We assume access to a skill library $\mathcal{S} = \{s_1, \ldots, s_N\}$ produced by an existing curation pipeline and a held-out development set $\mathcal{D} = \{d_1, \ldots, d_M\}$ disjoint from both training and test splits. During upstream curation, we apply difficulty-aware ordering: each training task is run twice with the bare agent to estimate difficulty, and tasks are sorted hardest-first so that the curator encounters high-signal failure cases early. The resulting library contains skills automatically distilled from training trajectories.
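The difficulty-aware ordering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `run_bare_agent` callable are hypothetical, and the sketch assumes that difficulty is estimated as the failure rate over the two bare-agent runs, which the text does not spell out.

```python
from typing import Callable, Sequence


def order_hardest_first(
    tasks: Sequence[str],
    run_bare_agent: Callable[[str], bool],  # hypothetical: True if the bare agent solves the task
    runs: int = 2,                          # the text states each task is run twice
) -> list[str]:
    """Return tasks sorted by estimated difficulty, hardest first."""
    # Assumed difficulty proxy: success rate over `runs` attempts (lower = harder).
    success_rate = {
        task: sum(bool(run_bare_agent(task)) for _ in range(runs)) / runs
        for task in tasks
    }
    # sorted() is stable, so tasks with equal success rates keep their input order.
    return sorted(tasks, key=lambda task: success_rate[task])


# Toy usage with a deterministic stand-in for the agent.
solved = {"easy"}
ordered = order_hardest_first(["easy", "hard", "medium"], lambda t: t in solved)
```

With only two runs per task the estimate takes one of three values (0, 0.5, 1), so ties are common; the stable sort keeps their original order.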

In addition to these learned skills, we append five hand-written operational templates (prefix `tpl-`) that capture common procedural patterns such as pagination, data validation, and cross-app identity resolution. These templates are derived from each benchmark's public documentation and training-split failure analysis, and are marked as protected: they are exempt from all subsequent modification and masking (full text in Appendix [B](https://arxiv.org/html/2606.15390#A2)).

![Refer to caption](https://arxiv.org/html/2606.15390v1/figure2.png)

Figure 2: Framework overview. *Stage 1*: randomized masking produces a causal attribution matrix $\mathbf{C} \in \mathbb{R}^{N \times M}$; each cell records whether a skill helps (green) or hurts (red) on a given task. *Stage 2*: three operations driven by $\mathbf{C}$ (*split*, *retire*, *merge*) restructure the library offline, subject to a development gate. *Stage 3*: at inference time, per-task masking suppresses skills with negative predicted causal effect, with a fallback to the full library.

### 2.1 Measuring Causal Effects

A skill library is a collection of natural\-language instructions that jointly shape the agent’s behaviour\. The challenge in evaluating any single skill is that its effect depends on which other skills are co\-present: a verification rule may be harmless alone but harmful alongside a pagination rule that already performs the same check\. To disentangle these interactions, we turn to the simplest tool in causal inference: a randomized experiment\.

**Randomized masking protocol.** For each of $K$ independent trials, we construct a random mask $m_k \subseteq \mathcal{S}$ by including each skill independently with probability $f$ (Bernoulli sampling). Let $\mathbf{1}[s_j \in m_k]$ denote the inclusion indicator for skill $s_j$ in mask $m_k$, and let $o_k(d_i) \in \{0,1\}$ denote the binary outcome on development task $d_i$ when the agent operates under mask $m_k$. Define the sets of masks that include and exclude skill $s_j$ as $\mathcal{M}_j^{+} = \{k : s_j \in m_k\}$ and $\mathcal{M}_j^{-} = \{k : s_j \notin m_k\}$, respectively. The causal score of skill $s_j$ on task $d_i$ is the difference-in-means estimator:

$$
\mathbf{C}[j,i] \;=\; \frac{1}{|\mathcal{M}_j^{+}|} \sum_{k \in \mathcal{M}_j^{+}} o_k(d_i) \;-\; \frac{1}{|\mathcal{M}_j^{-}|} \sum_{k \in \mathcal{M}_j^{-}} o_k(d_i). \tag{1}
$$

Under Bernoulli sampling, skills are included independently, so $\mathbf{C}[j,i]$ is an unbiased estimator of the average treatment effect (ATE) of including skill $s_j$ on task $d_i$, marginalised over the distribution of co-occurring skills. The variance of this estimator is bounded by

$$
\mathrm{Var}\bigl(\mathbf{C}[j,i]\bigr) \;\leq\; \frac{1}{4}\left(\frac{1}{|\mathcal{M}_j^{+}|} + \frac{1}{|\mathcal{M}_j^{-}|}\right), \tag{2}
$$

since $o_k(d_i) \in \{0,1\}$ implies $\mathrm{Var}(o_k(d_i)) \leq \tfrac{1}{4}$. Per-cell variance decreases with more masks; per-task masking further reduces effective noise by averaging over nearest development tasks (Eq. [6](https://arxiv.org/html/2606.15390#S2.E6)). All skills participate in attribution without exception; prefix-based protection applies only at the downstream masking stage (§[2.3](https://arxiv.org/html/2606.15390#S2.SS3)). Specific hyperparameter choices and statistical validation are reported in §[3](https://arxiv.org/html/2606.15390#S3).
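The masking protocol and the estimator of Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the released code: the function names, array layout, and the choice to return NaN for a skill that is never included (or never excluded) are assumptions made here.

```python
import numpy as np


def sample_masks(n_skills: int, n_trials: int, f: float, rng: np.random.Generator) -> np.ndarray:
    """Bernoulli masks: entry [k, j] is True when skill j is included in trial k."""
    return rng.random((n_trials, n_skills)) < f


def attribution_matrix(masks: np.ndarray, outcomes: np.ndarray) -> np.ndarray:
    """Difference-in-means estimator of Eq. (1).

    masks:    (K, N) boolean inclusion indicators
    outcomes: (K, M) binary outcomes o_k(d_i)
    returns:  (N, M) matrix C; a row is NaN if its include or exclude set is empty
    """
    inc = np.asarray(masks, dtype=bool).T.astype(float)   # (N, K)
    exc = 1.0 - inc
    out = np.asarray(outcomes, dtype=float)               # (K, M)
    n_inc = inc.sum(axis=1, keepdims=True)                # |M_j^+|
    n_exc = exc.sum(axis=1, keepdims=True)                # |M_j^-|
    with np.errstate(invalid="ignore", divide="ignore"):
        mean_inc = (inc @ out) / n_inc
        mean_exc = (exc @ out) / n_exc
    return mean_inc - mean_exc


# Toy check: 4 trials, 2 skills, 1 development task.
toy_masks = np.array([[1, 0], [1, 1], [0, 1], [0, 0]], dtype=bool)
toy_outcomes = np.array([[1], [1], [0], [0]])
C_toy = attribution_matrix(toy_masks, toy_outcomes)

masks_demo = sample_masks(n_skills=5, n_trials=12, f=0.4, rng=np.random.default_rng(0))
```

In the toy data the task succeeds exactly when skill 0 is present, so its score is 1, while skill 1 is present in one success and one failure and scores 0.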

**Row statistics.** Given the full matrix $\mathbf{C}$, we derive two summary statistics from each row that drive all subsequent curation decisions. The *global causal score* of skill $s_j$ is the row mean:

$$
\bar{C}(j) \;=\; \frac{1}{M} \sum_{i=1}^{M} \mathbf{C}[j,i], \tag{3}
$$

capturing the average marginal contribution of $s_j$ across all development tasks. The *causal heterogeneity* of skill $s_j$ is the row range:

$$
H(s_j) \;=\; \max_{i}\, \mathbf{C}[j,i] \;-\; \min_{i}\, \mathbf{C}[j,i], \tag{4}
$$

measuring the degree to which the skill's effect varies across tasks.

###### Definition 1 (Causally heterogeneous skill).

A skill $s_j$ is *causally heterogeneous* at threshold $\tau$ if $H(s_j) \geq \tau$, i.e., its causal effect reverses or substantially varies across development tasks.

A skill with high $H(s_j)$ but near-zero $\bar{C}(j)$ is the most dangerous: it helps on some tasks and hurts on others, but its positive and negative effects cancel in aggregate, making it invisible to any curation method that evaluates skills by their global score alone. The full attribution is computed once per base model, because the causal structure of a skill library depends on the model that interprets it.
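To make the cancellation effect concrete, the two row statistics of Eqs. (3) and (4) can be computed on a small made-up matrix. The numbers below are invented for illustration and do not come from the paper.

```python
import numpy as np


def row_statistics(C: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (global causal score, causal heterogeneity) for each skill row of C."""
    C = np.asarray(C, dtype=float)
    global_score = C.mean(axis=1)                    # Eq. (3): row mean
    heterogeneity = C.max(axis=1) - C.min(axis=1)    # Eq. (4): row range
    return global_score, heterogeneity


# Two illustrative skills over four development tasks.
C_example = np.array([
    [0.5, -0.5, 0.5, -0.5],   # sign-reversing skill: effects cancel in aggregate
    [0.1,  0.1, 0.2,  0.0],   # mildly and consistently helpful skill
])
scores, ranges = row_statistics(C_example)
```

The first skill has a global score of 0 and a heterogeneity of 1.0, so a curator looking only at the row mean would see nothing wrong with it.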

### 2.2 Offline Library Restructuring

The attribution matrix $\mathbf{C}$ reveals which skills are problematic, but measurement alone does not fix the library. A heterogeneous skill that helps on some tasks and hurts on others cannot simply be removed without losing its beneficial effects. Instead, the library must be restructured so that each skill's applicability conditions are made explicit. The row statistics $\bar{C}(j)$ and $H(s_j)$ partition skills into three regimes: uniformly beneficial ($\bar{C}(j)$ positive, $H(s_j)$ small), negligible ($|\bar{C}(j)|$ small, $H(s_j)$ small), and causally heterogeneous ($H(s_j) \geq \tau_{\text{split}}$). We apply three operations targeting each regime in turn: *split* resolves heterogeneous skills into conditional variants, *retire* removes negligible ones, and *merge* deduplicates near-identical skills introduced by splitting. The order is chosen to prevent information loss.
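The regime partition above can be expressed as a small decision function. This is a sketch under stated assumptions: the threshold values are placeholders (the paper defers its hyperparameters to an appendix), the label strings are made up here, and the final `"other"` branch covers skills that fall outside the three regimes named in the text, such as a consistently harmful skill.

```python
def classify_skill(global_score: float, heterogeneity: float,
                   tau_split: float, tau_retire: float) -> str:
    """Assign a skill to a regime from its row statistics.

    Heterogeneity is tested first, mirroring the split-before-retire order:
    a sign-reversing skill can have a near-zero mean and would otherwise
    be classified as negligible.
    """
    if heterogeneity >= tau_split:
        return "heterogeneous"      # candidate for split
    if abs(global_score) < tau_retire:
        return "negligible"         # candidate for retire
    if global_score > 0:
        return "beneficial"         # kept as is
    return "other"                  # not one of the three regimes named in the text


# Placeholder thresholds, chosen only for this illustration.
TAU_SPLIT, TAU_RETIRE = 0.3, 0.02
labels = [
    classify_skill(0.00, 0.80, TAU_SPLIT, TAU_RETIRE),
    classify_skill(0.01, 0.05, TAU_SPLIT, TAU_RETIRE),
    classify_skill(0.15, 0.10, TAU_SPLIT, TAU_RETIRE),
    classify_skill(-0.15, 0.10, TAU_SPLIT, TAU_RETIRE),
]
```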

**Step 1: Split.** For each skill $s_j$ with $H(s_j) \geq \tau_{\text{split}}$, we use the base LLM to rewrite $s_j$ into two conditional variants, each with an explicit trigger condition specifying when it should apply. The rewriting is guided by the causal score vector $(\mathbf{C}[j,1], \ldots, \mathbf{C}[j,M])$: one variant targets tasks where the skill helps, and the other targets tasks where it hurts. Each rewritten pair must pass a *development gate*: the restructured library must achieve pass rate ≥ that of the original on all $M$ attribution tasks. If it does not, the original skill is kept. Splitting runs first because a causally heterogeneous skill has $\bar{C}(j) \approx 0$ (positive and negative effects cancel) and would be incorrectly retired if retirement ran first. At most $\tau_{\text{max\_split}}$ candidates are processed; this is the one point where LLM judgment re-enters curation, bounded in scope and empirically validated.

**Step 2: Retire.** Once heterogeneous skills have been resolved, retirement targets the remaining low-signal skills. Any skill with $|\bar{C}(j)| < \tau_{\text{retire}}$ is removed.

**Step 3: Merge.** Splitting may introduce near-duplicate variants. Remaining skills are embedded; pairs exceeding cosine similarity $\tau_{\text{merge}}$ are clustered, and the highest-scoring member of each cluster (by $\bar{C}$) is retained. Merging runs last precisely because it must operate on the library that splitting and retirement have already shaped.
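One way to realise the merge step is sketched below. The text does not say how pairs above the threshold are grouped into clusters; this sketch assumes connected components (via a small union-find), which is one reasonable reading and not necessarily the authors' choice. The function name and the toy embeddings are hypothetical.

```python
import numpy as np


def merge_near_duplicates(embeddings, global_scores, tau_merge: float) -> list[int]:
    """Return indices of skills to keep after merging near-duplicates.

    Skills whose pairwise cosine similarity exceeds tau_merge are linked;
    each connected group keeps only its member with the highest global score.
    """
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalise rows
    sim = E @ E.T                                      # cosine similarity matrix
    n = len(E)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]              # path halving
            i = parent[i]
        return i

    for a in range(n):
        for b in range(a + 1, n):
            if sim[a, b] > tau_merge:
                parent[find(a)] = find(b)              # union the two groups

    best: dict[int, int] = {}
    for i in range(n):
        root = find(i)
        if root not in best or global_scores[i] > global_scores[best[root]]:
            best[root] = i
    return sorted(best.values())


# Skills 0 and 1 are near-duplicates; skill 1 has the higher global score.
kept = merge_near_duplicates(
    embeddings=[[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]],
    global_scores=[0.1, 0.3, 0.2],
    tau_merge=0.95,
)
```

Connected components can chain together skills that are not directly similar to each other; a stricter clustering rule would avoid that at the cost of more bookkeeping.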

### 2.3 Per-Task Causal Masking

Offline restructuring produces a single repaired library $\mathcal{S}'$, but a static library cannot account for the full diversity of test conditions. The core limitation of existing skill-application methods, whether they inject all skills or retrieve by semantic similarity, is that they cannot distinguish a *helpful* skill from a *harmful* one: the two look identical in embedding space. We address this by framing skill selection at inference time as a per-task risk minimization problem: for each test task, predict which skills would help and which would hurt, then suppress the harmful ones.

**Predicted causal effect.** The key idea is to transfer causal evidence from similar development tasks to the new test task. Given a test task $t$ with instruction embedding $\mathbf{e}_t \in \mathbb{R}^{d}$, we identify the $k$ nearest development tasks $\mathcal{N}(t) \subset \mathcal{D}$ by cosine similarity and compute attention weights via a temperature-scaled softmax:

$$
w_i(t) \;=\; \frac{\exp\bigl(\tau \cdot \cos(\mathbf{e}_t,\, \mathbf{e}_{d_i})\bigr)}{\displaystyle\sum_{i' \in \mathcal{N}(t)} \exp\bigl(\tau \cdot \cos(\mathbf{e}_t,\, \mathbf{e}_{d_{i'}})\bigr)}, \qquad i \in \mathcal{N}(t), \tag{5}
$$

where $\tau$ is a temperature parameter that concentrates weight on the closest neighbours. The predicted causal effect of skill $s_j$ on task $t$ is a kernel-weighted projection of the $j$-th row of $\mathbf{C}$ onto the task:

$$
\hat{C}(s_j,\, t) \;=\; \sum_{i \in \mathcal{N}(t)} w_i(t) \cdot \mathbf{C}[j,i] \;=\; \mathbf{C}[j,:]\,\mathbf{w}(t), \tag{6}
$$

where $\mathbf{w}(t) \in \mathbb{R}^{M}$ is the weight vector (zero outside $\mathcal{N}(t)$). This can be read as a soft retrieval over the attribution matrix: each test task induces a different linear combination of the development-task columns of $\mathbf{C}$, producing a task-specific causal profile for every skill.
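Eqs. (5) and (6) translate directly into NumPy. The sketch below is illustrative: the function name, argument names, and toy embeddings are assumptions, and the embedding model itself is left out. Note that $\tau$ multiplies the cosine similarity, so larger values sharpen the weights.

```python
import numpy as np


def predicted_effects(C, dev_emb, task_emb, k: int, tau: float):
    """Return (c_hat, w): predicted per-skill effects and the weight vector w(t).

    C:        (N, M) attribution matrix
    dev_emb:  (M, d) development-task embeddings
    task_emb: (d,)   test-task embedding
    """
    C = np.asarray(C, dtype=float)
    D = np.asarray(dev_emb, dtype=float)
    e = np.asarray(task_emb, dtype=float)
    cos = (D @ e) / (np.linalg.norm(D, axis=1) * np.linalg.norm(e))
    neighbours = np.argsort(-cos)[:k]          # k most similar development tasks
    logits = tau * cos[neighbours]
    logits = logits - logits.max()             # stabilise the softmax
    local = np.exp(logits)
    local /= local.sum()                       # Eq. (5)
    w = np.zeros(C.shape[1])
    w[neighbours] = local                      # zero outside N(t)
    return C @ w, w                            # Eq. (6)


# One skill, three development tasks; the test task matches the first one exactly.
C_one = [[1.0, -1.0, 0.0]]
dev = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
flat, w_flat = predicted_effects(C_one, dev, [1.0, 0.0], k=2, tau=0.0)
sharp, w_sharp = predicted_effects(C_one, dev, [1.0, 0.0], k=2, tau=100.0)
```

With `tau=0.0` the two neighbours are weighted equally and the opposing effects cancel; with `tau=100.0` nearly all weight sits on the closest task and the prediction approaches that task's score.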

**Risk-minimizing masking rule.** Given the predicted causal effect for each skill, we suppress skills that are predicted to hurt while retaining all potentially helpful ones. Under the approximation that skills contribute independently to task success, the expected harm of including skill $s_j$ for task $t$ is proportional to $\max(0, -\hat{C}(s_j, t))$. A removal-only design minimises this expected harm:

$$
\mathcal{S}_t \;=\; \bigl\{ s_j \in \mathcal{S}' : \hat{C}(s_j, t) \geq \tau_{\text{mask}} \;\;\text{or}\;\; s_j \in \mathcal{S}_{\text{protected}} \bigr\}, \tag{7}
$$

where $\mathcal{S}_{\text{protected}}$ denotes skills with protected prefixes (`tpl-`, `shr-`, `api-`). If the filtered set contains fewer than $\tau_{\text{min}}$ skills, the full library $\mathcal{S}'$ is used instead, ensuring graceful degradation. The asymmetry of Eq. ([7](https://arxiv.org/html/2606.15390#S2.E7)) is deliberate: missing a critical skill (e.g., a pagination template) causes catastrophic failure, while retaining a mildly harmful skill among many has a diluted effect. Algorithm [1](https://arxiv.org/html/2606.15390#alg1) summarises the procedure.

**Algorithm 1** Per-Task Causal Masking (inference time)

Input: test task $t$, restructured library $\mathcal{S}'$, attribution matrix $\mathbf{C} \in \mathbb{R}^{N \times M}$, development embeddings $\{\mathbf{e}_{d_i}\}_{i=1}^{M}$, parameters $k$, $\tau$, $\tau_{\text{mask}}$, $\tau_{\text{min}}$

Output: task-specific skill library $\mathcal{S}_t \subseteq \mathcal{S}'$

1. // Stage A: Predict per-skill causal effect
2. Compute embedding $\mathbf{e}_t$ of task instruction
3. $\mathcal{N}(t) \leftarrow$ $k$ nearest development tasks by $\cos(\mathbf{e}_t, \mathbf{e}_{d_i})$
4. $\mathbf{w}(t) \leftarrow \text{softmax}\bigl(\tau \cdot [\cos(\mathbf{e}_t, \mathbf{e}_{d_i})]_{i \in \mathcal{N}(t)}\bigr)$ ▷ Eq. ([5](https://arxiv.org/html/2606.15390#S2.E5))
5. **for** each skill $s_j \in \mathcal{S}'$ **do**
6. $\hat{C}(s_j, t) \leftarrow \mathbf{C}[j,:]\,\mathbf{w}(t)$ ▷ Eq. ([6](https://arxiv.org/html/2606.15390#S2.E6))
7. **end for**
8. // Stage B: Risk-minimizing filtering
9. $\mathcal{S}_t \leftarrow \{ s_j \in \mathcal{S}' : \hat{C}(s_j, t) \geq \tau_{\text{mask}} \;\text{or}\; s_j \in \mathcal{S}_{\text{protected}} \}$ ▷ Eq. ([7](https://arxiv.org/html/2606.15390#S2.E7))
10. // Stage C: Graceful fallback
11. **if** $|\mathcal{S}_t| < \tau_{\text{min}}$ **then**
12. $\mathcal{S}_t \leftarrow \mathcal{S}'$ ▷ Revert to full library
13. **end if**
14. **return** $\mathcal{S}_t$
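Stages B and C of Algorithm 1 (filtering and fallback) can be sketched as a single function that takes the predicted effects as given. This is a hedged illustration: the function name, the skill IDs `tpl-00001` and `vc-00007`, and all numeric values are made up for the example; only the prefixes and the IDs `vc-00043` and `vc-00021b` appear in the text.

```python
from typing import Sequence

PROTECTED_PREFIXES = ("tpl-", "shr-", "api-")   # protected prefixes named in the text


def select_skills(skill_ids: Sequence[str], c_hat: Sequence[float],
                  tau_mask: float, tau_min: int) -> list[str]:
    """Removal-only masking (Eq. 7) with fallback to the full library."""
    kept = [
        sid for sid, effect in zip(skill_ids, c_hat)
        if effect >= tau_mask or sid.startswith(PROTECTED_PREFIXES)
    ]
    # Stage C: if too few skills survive, revert to the full library.
    if len(kept) < tau_min:
        return list(skill_ids)
    return kept


# Illustrative library and predicted effects for one test task.
library = ["tpl-00001", "vc-00043", "vc-00021b", "vc-00007"]
effects = [-0.20, -0.067, -0.01, 0.05]

masked = select_skills(library, effects, tau_mask=0.0, tau_min=2)
fallback = select_skills(library, effects, tau_mask=0.0, tau_min=3)
```

The protected template survives despite its negative predicted effect, and raising `tau_min` above the number of survivors triggers the fallback to the full library.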

**Summary.** The three stages form a coherent pipeline unified by a single principle: judgment generates candidate skills, and measurement curates them. Every stage includes a structural fallback (offline restructuring rolls back failing splits; per-task masking reverts to the full library when too aggressive), so the pipeline cannot degrade performance below any prefix of stages. Because the framework operates entirely at inference time and requires only a skill library and a small development set, it can be applied on top of any existing skill-generation method.

## 3 Experiments

We evaluate on two benchmarks, seven base models spanning four providers, and two agent architectures, applying the same pipeline and hyperparameters throughout\. Our experiments address the following questions: \(1\) Does measurement\-driven curation generalise across models and benchmarks? \(2\) Where do the gains concentrate, and where does uncurated skill injection cause harm? \(3\) Which stages of the pipeline matter most?

### 3.1 Experimental Setup

**Benchmarks.** AppWorld \[[11](https://arxiv.org/html/2606.15390#bib.bib8)\] simulates nine consumer applications (email, calendar, Venmo, Spotify, Amazon, etc.) in which the agent composes multi-step Python API calls via a code REPL. We evaluate on two official test splits: `test_normal` (168 tasks) and `test_challenge` (417 tasks), both spanning difficulty levels 1–3. `test_challenge` has a higher concentration of level-3 tasks (47% vs. 37%), making it substantially harder on average. The primary metric is Task Goal Completion (TGC); we also report Sub-Goal Completion (SGC; Appendix [J](https://arxiv.org/html/2606.15390#A10)). The agent is a ReAct agent \[[18](https://arxiv.org/html/2606.15390#bib.bib10)\] with a 40-step budget.

τ-bench \[[17](https://arxiv.org/html/2606.15390#bib.bib9)\] simulates retail customer service through function-calling tools. The agent converses with an LLM-simulated customer (GPT-4o, temperature 0) and must satisfy requests while adhering to a policy document. We evaluate on the retail domain (115 tasks) with exact database-state match; partial credit is not awarded. The agent is the benchmark's native `ToolCallingAgent` with a 30-turn budget. In both benchmarks, the skill library is injected as part of the system prompt and all agent calls use temperature 0.

**Models and data.** We evaluate seven models in standard (non-reasoning) mode: GPT-5.4, GPT-5.1, GPT-4.1, and GPT-4o (OpenAI); DeepSeek-V3 (DeepSeek, open-weight); Claude Sonnet 4.5 (Anthropic); and Gemini 2.5 Pro (Google). We deliberately exclude reasoning variants to isolate the effect of skill curation from chain-of-thought reasoning. Each model runs the complete pipeline independently; the attribution matrix $\mathbf{C}$ is recomputed per model because the causal structure of a skill library depends on the model that interprets it (§[2.1](https://arxiv.org/html/2606.15390#S2.SS1)). Data is partitioned into strictly disjoint sets: AppWorld uses 90 training, 15 development (of 57), and 168 + 417 test tasks; τ-bench uses 500 training, 15 development (of 20), and 115 test.

**Baselines.** For each model on AppWorld, we compare against the best available published method: ACE \[[22](https://arxiv.org/html/2606.15390#bib.bib3)\] for GPT-5.1 and DeepSeek-V3, CUGA \[[8](https://arxiv.org/html/2606.15390#bib.bib5)\] for GPT-4.1, and Gupta et al. \[[3](https://arxiv.org/html/2606.15390#bib.bib4)\] for GPT-4o. We additionally report the current AppWorld leaderboard leader.† On τ-bench, we compare against each model's unaugmented baseline, since no prior skill-based method has reported results on this benchmark under the same evaluation protocol. For GPT-5.4, Claude Sonnet 4.5, and Gemini 2.5 Pro, no prior skill-based method has reported results on AppWorld either; we compare against the unaugmented ReAct baseline for these models.

†Alibaba Cloud ApsaraLab, AppWorld leaderboard submission (February 2026, Qwen3-14B, weight-tuned). No accompanying publication is available as of this writing.

**Attribution parameters.** The attribution uses $K = 12$ random masks with inclusion probability $f = 0.4$, yielding approximately 5 masks per skill and a per-cell standard deviation of at most 0.29. Per-task masking averages over $k = 8$ nearest development tasks, reducing effective per-decision noise to $\sigma \approx 0.10$. Bootstrap analysis confirms that 97.2% of masking decisions are directionally stable under mask resampling (Appendix [K](https://arxiv.org/html/2606.15390#A11)). The full attribution requires 180 agent rollouts per model ($K = 12$ masks × $M = 15$ development tasks). All remaining hyperparameters are listed in Appendix Table [5](https://arxiv.org/html/2606.15390#A7.T5) (Appendix [G](https://arxiv.org/html/2606.15390#A7)); no per-cell tuning is performed.
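The noise figures quoted above can be reproduced from Eq. (2) with a few lines of arithmetic. The sketch rests on assumptions not stated in the text: a typical split of 5 including and 7 excluding masks, equal weights over the $k = 8$ neighbours, and independent errors across development tasks. Under those assumptions the numbers line up with the reported 0.29 and 0.10.

```python
import math

K, f, M, k = 12, 0.4, 15, 8           # values reported in the text

expected_included = K * f             # expected masks containing a given skill
n_inc, n_exc = 5, 7                   # assumed typical split of the 12 masks

# Eq. (2): variance bound for one cell of the attribution matrix.
var_bound = 0.25 * (1 / n_inc + 1 / n_exc)
sd_bound = math.sqrt(var_bound)

# Averaging k independent cells with equal weights divides the sd by sqrt(k).
sd_averaged = sd_bound / math.sqrt(k)

rollouts = K * M                      # one agent rollout per (mask, task) pair

print(f"expected included masks: {expected_included:.1f}")
print(f"per-cell sd bound:       {sd_bound:.3f}")
print(f"averaged sd (k={k}):      {sd_averaged:.3f}")
print(f"rollouts per model:      {rollouts}")
```

The softmax weights of Eq. (5) are generally not uniform, so the effective noise after averaging would sit somewhere between the per-cell bound and this equal-weight figure.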

### 3.2 Main Results

Table 1: AppWorld results (TGC, %). Δ is computed relative to bare ReAct. Bold: best prompt-based method per column. Red: degrades over ReAct. SGC is reported in Appendix [J](https://arxiv.org/html/2606.15390#A10).

| Model | Method | `test_normal` (168) TGC | Δ | `test_challenge` (417) TGC | Δ |
|---|---|---|---|---|---|
| GPT-5.1 | ReAct | 61.9 | | 52.5 | |
| | ACE | 67.3 | +5.4 | 49.9 | −2.6 |
| | Ours | 77.4 | +15.5 | 66.4 | +13.9 |
| DeepSeek-V3 | ReAct | 69.1 | | 47.0 | |
| | ACE | 78.0 | +8.9 | 63.1 | +16.1 |
| | Ours | 83.3 | +14.2 | 69.3 | +22.3 |
| GPT-4.1 | ReAct | 66.7 | | 50.4 | |
| | CUGA | 73.2 | +6.5 | 57.6 | +7.2 |
| | Ours | 75.6 | +8.9 | 64.0 | +13.6 |
| GPT-4o | ReAct | 48.8 | | 30.2 | |
| | Gupta et al. | 68.5 | +19.7 | 38.9 | +8.7 |
| | Ours | 71.4 | +22.6 | 41.0 | +10.8 |
| GPT-5.4 | ReAct | 82.1 | | 81.1 | |
| | Ours | 88.7 | +6.6 | 85.4 | +4.3 |
| Sonnet 4.5 | ReAct | 83.9 | | 70.3 | |
| | Ours | 89.3 | +5.4 | 75.3 | +5.0 |
| Gemini 2.5 | ReAct | 72.6 | | 49.4 | |
| | Ours | 81.0 | +8.4 | 54.9 | +5.5 |
| Leaderboard best∗ | (Qwen3-14B, wt-tuned) | 86.9 | | 67.6 | |

∗No accompanying publication; see footnote in §[3](https://arxiv.org/html/2606.15390#S3).

**AppWorld.** Table [1](https://arxiv.org/html/2606.15390#S3.T1) presents results on both AppWorld splits. Every model improves over its respective baseline on both splits. On `test_normal`, GPT-5.1 gains 10.1 points over ACE (67.3 → 77.4) and DeepSeek-V3 gains 5.3 points (78.0 → 83.3), both from the same upstream library with only curation changed. GPT-4.1 and GPT-4o improve over CUGA and [Gupta et al.](https://arxiv.org/html/2606.15390#bib.bib4) respectively, baselines that already incorporate structured retrieval or demonstration selection. Three additional models (GPT-5.4, Sonnet 4.5, Gemini 2.5 Pro) confirm generality across four providers with gains ranging from +4.3 to +8.4 across both splits, with no per-provider tuning. On the harder split, DeepSeek-V3 achieves 69.3% TGC, a new state of the art among all published methods including weight-tuned approaches, representing a 47.4% relative improvement over bare ReAct.

![Refer to caption](https://arxiv.org/html/2606.15390v1/figure3.png)

Figure 3: Per-difficulty breakdown on AppWorld `test_challenge` (GPT-5.1, 417 tasks). The uncurated skill library (red) improves Level 1 but degrades Level 2 and 3. Our method (green) recovers at every level.

The most revealing result is not the gain but the regression that precedes it. On `test_challenge`, the upstream skill library *decreases* GPT-5.1's TGC from 52.5% to 49.9%: a library designed to help has made the agent strictly worse. Our method reverses this regression and improves by 26.5% relative beyond bare ReAct. Figure [3](https://arxiv.org/html/2606.15390#S3.F3) reveals where the harm concentrates: on Level 2 and Level 3 tasks, the uncurated library degrades performance, while Level 1 tasks already benefit. Per-task masking recovers the degraded levels, with the largest gain on Level 3 (43.1 → 71.3, a 65.4% relative improvement), confirming that the method's value is greatest where uncurated libraries do the most damage. A reverse-masking control (suppressing the most *positively*-scoring skills instead) degrades performance, confirming that the direction of masking, not mere context reduction, drives the improvement.
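The results mix two kinds of gain: absolute point differences (the Δ columns of Table 1) and relative improvements (the percentages quoted in the prose). A short check, using only numbers reported above, shows how the relative figures follow from the table; the helper name is made up for this illustration.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain in percent: 100 * (improved - baseline) / baseline."""
    return 100.0 * (improved - baseline) / baseline


# (label, baseline TGC, improved TGC) taken from Table 1 and the text above.
cases = [
    ("DeepSeek-V3, test_challenge, ReAct -> Ours", 47.0, 69.3),
    ("GPT-5.1, test_challenge, ReAct -> Ours", 52.5, 66.4),
    ("GPT-5.1, Level 3", 43.1, 71.3),
]
results = {label: round(relative_improvement(b, a), 1) for label, b, a in cases}

for label, value in results.items():
    print(f"{label}: {value}% relative")
```

For example, DeepSeek-V3 gains 22.3 absolute points on `test_challenge`, which is 47.4% of its 47.0 baseline.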

τ\\tau\-bench\.Table[2](https://arxiv.org/html/2606.15390#S3.T2)evaluates whether the framework transfers to a qualitatively different setting: conversational rather than code\-based, function\-calling rather than REPL, and with a simulated human in the loop\.

| Rank | Method | Score |
|---|---|---|
| 1–5 | Claude family | 80.5–86.2 |
| | Ours (GPT-5.4) ↑7.0 | 80.9 |
| 6 | GLM-4.5 | 79.7 |
| 7 | GLM-4.5-Air | 77.9 |
| 8 | Qwen3-Coder 480B | 77.5 |
| | Ours (GPT-4.1) ↑5.9 | 73.9 |
| | Ours (Gemini 2.5) ↑8.7 | 73.9 |
| 9 | o4-mini | 71.8 |
| 10 | o1 | 70.8 |
| 13 | GPT-4.5 | 68.4 |
| 14 | GPT-4.1 (raw) | 68.0 |
| | Ours (DS-V3) ↑2.2 | 66.1 |
| | Ours (GPT-4o) ↑2.2 | 62.6 |
| | GPT-4o (raw) | 60.3 |

Table 2: $\tau$-bench retail positioning on the public leaderboard. Green rows with ↑ show our method’s gain over the raw baseline. GPT-4.1 advances from rank 14 to rank 8–9; GPT-5.4 reaches the top-5 range. Two models (GPT-5.1, Sonnet 4.5) show zero gain (§[A.3](https://arxiv.org/html/2606.15390#A1.SS3)).

The framework transfers successfully. GPT-4.1 improves by 8.7% relative (68.0% → 73.9%), advancing from rank 14 to the rank 8–9 range on the leaderboard, past o4-mini, o1, and GPT-4.5, without any weight modification. GPT-5.4 gains 9.5% relative (73.9% → 80.9%) and Gemini 2.5 Pro gains 13.3% relative (65.2% → 73.9%). Two models show zero gain: GPT-5.1 (62.6% → 62.6%) and Sonnet 4.5 (73.0% → 73.0%); we analyse these boundary conditions in Appendix [A.3](https://arxiv.org/html/2606.15390#A1.SS3) and attribute them to high baseline competence that saturates the benefit of prompt-time skill injection. Gains concentrate on multi-step return and cancel tasks (Appendix Table [4](https://arxiv.org/html/2606.15390#A6.T4)), where the skill library encodes procedural knowledge that the unaugmented agent must otherwise rediscover from the policy document on every task.

### 3.3 Ablation Study

Table [7](https://arxiv.org/html/2606.15390#A8.T7) (Appendix [H](https://arxiv.org/html/2606.15390#A8)) isolates each component’s contribution via sequential ablation on GPT-5.1 / AppWorld `test_normal`. Templates provide a 9.7% relative improvement, confirming that domain-agnostic operational scaffolding carries measurable value. Offline restructuring adds a further 2.9% relative gain by resolving heterogeneous skills and retiring inert ones. Per-task masking contributes the largest single increment (10.7% relative improvement), consistent with the central finding of this paper: the bottleneck is not which skills are in the library, but which skills each task should see. The full pipeline achieves a 25.0% relative improvement over bare ReAct.
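
The three stage-wise figures do not sum to 25.0%, because sequential relative gains multiply rather than add. The sketch below checks this, assuming (as the phrase "sequential ablation" suggests, though Table 7 is the authority) that each stage's relative improvement is measured against the preceding stage.

```python
# Relative gains (%) of each sequential ablation stage, as quoted in the text.
stage_gains = {
    "templates": 9.7,
    "offline restructuring": 2.9,
    "per-task masking": 10.7,
}

def compound(gains_percent):
    """Multiply sequential relative gains; return the total relative gain in percent."""
    factor = 1.0
    for gain in gains_percent:
        factor *= 1.0 + gain / 100.0
    return (factor - 1.0) * 100.0

total = compound(stage_gains.values())
print(f"sum of stages:   {sum(stage_gains.values()):.1f}%")
print(f"compounded gain: {total:.1f}%")
```

Under that assumption the compounded gain rounds to 25.0%, matching the full-pipeline figure, whereas the naive sum is 23.3%.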

### 3.4 Causal Heterogeneity in Individual Skills

The aggregate results above show that curation helps; this section examines *why* by tracing causal heterogeneity to individual skills. Two examples from the GPT-5.1 attribution on AppWorld illustrate the phenomenon.

Example 1: Contact Validation Rule (`vc-00054`). Skill. “Validate that note names map unambiguously to contacts before creating Venmo requests.” Global score. $\bar{C} = -0.03$ (near zero, invisible to global curation). Helps ($+0.50$): shared-expense reconciliation tasks, where the validation catches real name ambiguities. Hurts ($-0.67$): single-app tasks, where it forces the agent to cross-reference contacts irrelevant to the task.

Example 2: Response Checking Rule (`vc-00021a`). Skill. “For repeated side-effecting API calls, capture and check each response.” Global score. $\bar{C} = +0.05$ (near zero, invisible to global curation). Helps ($+0.71$): mutation-heavy tasks, where unchecked API calls cause silent failures. Hurts ($-0.23$): read-only tasks, where the checks are pure overhead.

Both skills would survive any global curation threshold. Only per-task measurement reveals their conditional nature, and only per-task masking can suppress them selectively. These examples confirm that causal heterogeneity is concrete and interpretable: skills encode assumptions about the task context, and when those assumptions are violated, the skill becomes harmful. Additional examples are in Appendix [D](https://arxiv.org/html/2606.15390#A4).
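
To make the cancellation concrete, the sketch below uses hypothetical per-task scores for a single skill (the task names, values, and the global pruning threshold are illustrative, not taken from the attribution matrix). It assumes the global score is a plain mean over tasks and that heterogeneity is the per-task range compared against the split threshold of 0.40 mentioned in Appendix A.1.

```python
from statistics import mean

# Hypothetical per-task causal scores for one skill (positive = helps).
per_task_scores = {
    "shared_expense_1": 0.50,
    "shared_expense_2": 0.45,
    "single_app_1": -0.67,
    "single_app_2": -0.40,
}

GLOBAL_THRESHOLD = -0.10  # hypothetical cut-off for global pruning
SPLIT_THRESHOLD = 0.40    # tau_split as quoted in Appendix A.1

values = list(per_task_scores.values())
global_score = mean(values)              # opposing effects cancel here
score_range = max(values) - min(values)  # assumed heterogeneity measure

survives_global_curation = global_score > GLOBAL_THRESHOLD
is_heterogeneous = score_range > SPLIT_THRESHOLD

print(f"global score: {global_score:+.2f}, range: {score_range:.2f}")
print("survives global curation:", survives_global_curation)
print("flagged as heterogeneous:", is_heterogeneous)
```

A curator that looks only at the mean keeps this skill, even though it is strongly harmful on half of the tasks in the toy data.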

## 4 Related Work

Skill generation and curation. A growing line of work improves agent performance by accumulating experience into the agent’s context rather than its weights. The methods differ in mechanism: Reflexion [[9](https://arxiv.org/html/2606.15390#bib.bib1)] maintains verbal self-reflections in an episodic memory buffer, ExpeL [[23](https://arxiv.org/html/2606.15390#bib.bib2)] extracts reusable insights by comparing successful and failed trajectories, ACE [[22](https://arxiv.org/html/2606.15390#bib.bib3)] grows evolving playbooks through a modular generate-reflect-curate loop, and CUGA [[8](https://arxiv.org/html/2606.15390#bib.bib5)] adopts a hierarchical planner-executor architecture with context enrichment. More recently, SkillNet [[5](https://arxiv.org/html/2606.15390#bib.bib21)] provides end-to-end tooling to create, evaluate, and connect skills within a unified ontology, and CoEvoSkills [[20](https://arxiv.org/html/2606.15390#bib.bib22)] co-evolves a skill generator and surrogate verifier without access to ground-truth tests. These methods share a structural property: every decision in the skill lifecycle, from generation to retention to application, is made by LLM judgment operating within individual tasks. No method aggregates evidence across tasks to verify that a retained skill actually helps. Our work treats the output of any such pipeline as raw material for a second, empirical curation stage.

Skill retrieval and application. The same reliance on single-task judgment extends to how skills are surfaced at test time. Retrieval-augmented generation [[4](https://arxiv.org/html/2606.15390#bib.bib11)] and its agent-oriented variants [[21](https://arxiv.org/html/2606.15390#bib.bib12)] select context by embedding similarity, implicitly equating topical relevance with helpfulness. Gupta et al. [[3](https://arxiv.org/html/2606.15390#bib.bib4)] extend BERTScore-Recall to set-level demonstration selection for in-context learning in agentic tasks, and Su et al. [[10](https://arxiv.org/html/2606.15390#bib.bib23)] study skill retrieval augmentation at scale as agent skill libraries grow to thousands of entries. Our causal attribution reveals that this equation can be precisely wrong: a skill about “verifying list contents before iteration” is semantically close to a single-record lookup yet hurts on such tasks by inducing unnecessary verification steps. Per-task filtering addresses this by conditioning on predicted causal effect rather than similarity.

Skill optimization beyond judgment. Several methods go beyond single-task judgment to optimize skill libraries. Voyager [[12](https://arxiv.org/html/2606.15390#bib.bib17)] builds an expanding skill library but only adds skills, never removing or conditioning them. SkillRL [[15](https://arxiv.org/html/2606.15390#bib.bib15)] uses reinforcement learning to recursively refine skills through failure analysis, treating the library as a dynamic component co-evolving with the agent policy. EvolveR [[14](https://arxiv.org/html/2606.15390#bib.bib16)] closes an experience-driven evolution loop, and Agentic Memory [[19](https://arxiv.org/html/2606.15390#bib.bib18)] optimizes memory management with GRPO. SkillClaw [[7](https://arxiv.org/html/2606.15390#bib.bib19)] enables collective skill evolution by aggregating trajectories across users to identify recurring behavioral patterns, and GraSP [[16](https://arxiv.org/html/2606.15390#bib.bib20)] compiles flat skill sets into typed directed acyclic graphs, observing that providing agents with more skills does not monotonically improve performance. Weight-based methods such as SAGE [[13](https://arxiv.org/html/2606.15390#bib.bib6)] and FireAct [[1](https://arxiv.org/html/2606.15390#bib.bib7)] internalise skills into model parameters via RL or trajectory fine-tuning, avoiding context-window limitations but requiring retraining for each base model. Our approach is complementary to all of the above: it operates at inference time, requires no weight updates, and can be layered on top of any skill-generation pipeline. The key distinction is that we measure per-skill causal effects via randomized masking, following the logic of Shapley-value attribution [[6](https://arxiv.org/html/2606.15390#bib.bib13)] and randomized ablation [[2](https://arxiv.org/html/2606.15390#bib.bib14)], but applied to natural-language skill instructions rather than model components.

## 5 Conclusion

We presented Assay, a framework that separates skill generation from skill curation by measuring per-skill causal effects via randomized masking. The causal attribution reveals pervasive heterogeneity in skill libraries, and per-task masking resolves it: on AppWorld, all seven models improve across four providers, with DeepSeek-V3 achieving a new state of the art (69.3%, 47.4% relative improvement); on $\tau$-bench, GPT-4.1 advances past o4-mini, o1, and GPT-4.5 without weight modification. Ablation confirms that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally. Two null results (GPT-5.1, Sonnet 4.5 on $\tau$-bench) highlight a boundary: as base model competence strengthens, prompt-time skill injection yields diminishing returns. Extending the framework to online settings where skills are continuously added is a promising direction for future work.

## References

- [1] B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan, and S. Yao (2023) FireAct: toward language agent fine-tuning. arXiv preprint arXiv:2310.05915. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p3.1).
- [2] I. Covert, S. Lundberg, and S. Lee (2021) Explaining by removing: a unified framework for model explanation. Journal of Machine Learning Research 22(209), pp. 1–90. Cited by: [§1](https://arxiv.org/html/2606.15390#S1.p3.1), [§4](https://arxiv.org/html/2606.15390#S4.p3.1).
- [3] S. Gupta, S. Singh, A. Sabharwal, T. Khot, and B. Bogin (2025) Leveraging in-context learning for language model agents. arXiv preprint arXiv:2506.13109. Cited by: [Table 8](https://arxiv.org/html/2606.15390#A10.T8.2.15.1), [Table 6](https://arxiv.org/html/2606.15390#A7.T6.3.2), [§1](https://arxiv.org/html/2606.15390#S1.p1.1), [§3.1](https://arxiv.org/html/2606.15390#S3.SS1.p3.1), [§3.2](https://arxiv.org/html/2606.15390#S3.SS2.p1.4), [Table 1](https://arxiv.org/html/2606.15390#S3.T1.18.16.3), [§4](https://arxiv.org/html/2606.15390#S4.p2.1).
- [4] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33, pp. 9459–9474. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p2.1).
- [5] Y. Liang, R. Zhong, H. Xu, C. Jiang, Y. Zhong, R. Fang, J. Gu, S. Deng, Y. Yao, M. Wang, et al. (2026) SkillNet: create, evaluate, and connect AI skills. arXiv preprint arXiv:2603.04448. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p1.1).
- [6] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p3.1).
- [7] Z. Ma, S. Yang, Y. Ji, X. Wang, Y. Wang, Y. Hu, T. Huang, and X. Chu (2026) SkillClaw: let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p3.1).
- [8] S. Marreed, A. Oved, A. Yaeli, S. Shlomov, I. Levy, O. Akrabi, A. Sela, A. Adi, and N. Mashkif (2025) Towards enterprise-ready computer using generalist agent. arXiv preprint arXiv:2503.01861. Cited by: [§3.1](https://arxiv.org/html/2606.15390#S3.SS1.p3.1), [§4](https://arxiv.org/html/2606.15390#S4.p1.1).
- [9] N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, pp. 8634–8652. Cited by: [§1](https://arxiv.org/html/2606.15390#S1.p1.1), [§4](https://arxiv.org/html/2606.15390#S4.p1.1).
- [10] W. Su, J. Long, Q. Ai, Y. Tang, C. Wang, Y. Tu, and Y. Liu (2026) Skill retrieval augmentation for agentic AI. arXiv preprint arXiv:2604.24594. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p2.1).
- [11] H. Trivedi, T. Khot, M. Hartmann, R. Manku, V. Dong, E. Li, S. Gupta, A. Sabharwal, and N. Balasubramanian (2024) AppWorld: a controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16022–16076. Cited by: [§1](https://arxiv.org/html/2606.15390#S1.p1.1), [§3.1](https://arxiv.org/html/2606.15390#S3.SS1.p1.1).
- [12] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2024) Voyager: an open-ended embodied agent with large language models. Transactions on Machine Learning Research. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p3.1).
- [13] J. Wang, Q. Yan, Y. Wang, Y. Tian, S. S. Mishra, Z. Xu, M. Gandhi, P. Xu, and L. L. Cheong (2025) Reinforcement learning for self-improving agent with skill library. arXiv preprint arXiv:2512.17102. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p3.1).
- [14] R. Wu, X. Wang, J. Mei, P. Cai, D. Fu, C. Yang, L. Wen, X. Yang, Y. Shen, Y. Wang, and B. Shi (2025) EvolveR: self-evolving LLM agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p3.1).
- [15] P. Xia, J. Chen, H. Wang, J. Liu, K. Zeng, Y. Wang, S. Han, Y. Zhou, X. Zhao, H. Chen, Z. Zheng, C. Xie, and H. Yao (2026) SkillRL: evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p3.1).
- [16] T. Xia, L. Hu, Y. Sun, M. Xu, L. Xu, S. Wang, W. Xu, and J. Jiang (2026) GraSP: graph-structured skill compositions for LLM agents. arXiv preprint arXiv:2604.17870. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p3.1).
- [17] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan (2024) $\tau$-Bench: a benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045. Cited by: [§1](https://arxiv.org/html/2606.15390#S1.p2.1), [§3.1](https://arxiv.org/html/2606.15390#S3.SS1.p1.1).
- [18] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2022) ReAct: synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations. Cited by: [§3.1](https://arxiv.org/html/2606.15390#S3.SS1.p1.1).
- [19] Y. Yu, L. Yao, Y. Xie, Q. Tan, J. Feng, Y. Li, and L. Wu (2026) Agentic memory: learning unified long-term and short-term memory management for large language model agents. arXiv preprint arXiv:2601.01885. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p3.1).
- [20] H. Zhang, S. Fan, H. P. Zou, Y. Chen, Z. Wang, J. Zhou, C. Li, W. Huang, Y. Yao, K. Zheng, X. Liu, X. Li, and P. S. Yu (2026) CoEvoSkills: self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p1.1).
- [21] J. Zhang, T. Lan, R. Murthy, Z. Liu, W. Yao, M. Zhu, J. Tan, T. Hoang, Z. Liu, L. Yang, et al. (2024) AgentOhana: design unified data and training pipeline for effective agent learning. arXiv preprint arXiv:2402.15506. Cited by: [§4](https://arxiv.org/html/2606.15390#S4.p2.1).
- [22] Q. Zhang, C. Hu, S. Upasani, B. Ma, F. Hong, V. Kamanuru, J. Rainton, C. Wu, M. Ji, H. Li, U. Thakker, J. Zou, and K. Olukotun (2025) Agentic context engineering: evolving contexts for self-improving language models. arXiv preprint arXiv:2510.04618. Cited by: [§1](https://arxiv.org/html/2606.15390#S1.p1.1), [§3.1](https://arxiv.org/html/2606.15390#S3.SS1.p3.1), [§4](https://arxiv.org/html/2606.15390#S4.p1.1).
- [23] A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang (2024) ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 19632–19642. Cited by: [§1](https://arxiv.org/html/2606.15390#S1.p1.1), [§4](https://arxiv.org/html/2606.15390#S4.p1.1).

## Appendix A Extended Analysis and Discussion

### A.1 The Landscape of Skill Effects

The causal attribution matrix, introduced in §[2.1](https://arxiv.org/html/2606.15390#S2.SS1) as a computational tool, doubles as a diagnostic lens, and what it reveals is, to our knowledge, a finding that no prior work on agent skill libraries has reported. On the GPT-5.1 attribution (15 development tasks, 103 curated skills), over 90% of skills have a per-task causal range exceeding $\tau_{\text{split}} = 0.40$, meaning that nearly every skill in the library helps on some tasks and hurts on others. Rather than relying solely on this descriptive statistic, we subject the heterogeneity claim to a stricter test: six independent lines of outcome-level evidence, each probing a different consequence that *must* hold if causal heterogeneity is real. For completeness, Appendix [K](https://arxiv.org/html/2606.15390#A11) (Table [9](https://arxiv.org/html/2606.15390#A11.T9)) reports that per-cell permutation tests lack adequate power at $M = 12$, which is precisely why outcome-level validation provides a more rigorous foundation:

1. Masking diversity. Per-task masking produces genuinely diverse skill subsets: 96% of dropped skills are selective (not universally dropped), with mean pairwise Jaccard similarity of only 0.30, ruling out a fixed pruning artefact.
2. Attribution alignment. Dropped skills score significantly worse than kept skills on per-task attribution (gap $= 0.147$, Mann-Whitney $p < 10^{-98}$), confirming that masking decisions track measured causal signal.
3. Sign-reversing cases. Recovered tasks consistently exhibit skills with positive aggregate but negative per-task scores (e.g., `vc-00109a`: $\bar{C} = +0.007$ vs. $\hat{C} = -0.107$), the hallmark of causal heterogeneity.
4. Split separation. 7 of 13 splits produce daughter variants with $\geq 0.05$ systematic score divergence, confirming that the split operation captures real conditional structure.
5. Reverse masking. Suppressing the most *positively*-scoring skills instead of the most negatively-scoring ones degrades performance by 4.7 pp, confirming that the direction of the attribution, not mere context reduction, drives the gain.
6. Bootstrap sign stability. Sign stability averages 71%/70% for positive/negative cells among the top-26 heterogeneous skills, well above the 50% chance baseline.
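
The masking-diversity statistic in item 1 can be computed as sketched below. This is a minimal illustration of mean pairwise Jaccard similarity over per-task suppressed sets; the skill IDs and sets are made up, and the convention for two empty sets is our assumption.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; taken to be 1.0 if both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(sets):
    """Average Jaccard similarity over all unordered pairs of sets."""
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical suppressed-skill sets for three test tasks.
dropped = [
    {"s1", "s2", "s3"},
    {"s3", "s4"},
    {"s5"},
]
print(round(mean_pairwise_jaccard(dropped), 3))
```

A fixed pruning rule would drop the same skills everywhere and push this value towards 1.0; low values indicate that the suppressed sets really differ from task to task.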

The remaining $\sim$9% of skills have global scores near zero; skills with uniformly positive or uniformly negative effects are effectively absent. The pattern is qualitatively similar across all seven models (library sizes range from 71 to 126). Figure [4](https://arxiv.org/html/2606.15390#A3.F4) visualises this structure directly: nearly every row of the attribution matrix contains both green and red cells.

This finding reframes the curation problem. The conventional assumption, implicit in all prior work, is that skills are either good or bad, and the curator’s job is to separate the two. Our attribution reveals a different landscape: most skills are *conditionally* good or bad depending on the task, and the real challenge is determining *when* each skill should be active. This is why per-task masking dominates offline restructuring in the ablation (+7.5 vs. +2.0 pp): the bottleneck is not removing bad skills from the library, but matching skills to tasks at inference time. Concrete examples of sign-reversing skills are provided in Appendix [D](https://arxiv.org/html/2606.15390#A4).

### A.2 Masking Behaviour

The landscape analysis explains *why* per-task masking helps; we now examine *how* it operates in practice. Across all 417 `test_challenge` tasks (GPT-5.1), the method suppresses an average of 5.5 skills per task (5.3% of the library) among the 329 tasks where masking is active, with a standard deviation of 2.8. On the remaining 88 tasks (21.1%), the fallback to the full library activates. Two properties of the suppressed sets are notable. First, the sets are highly task-dependent: only 2 skills are suppressed on more than 80% of non-fallback tasks, confirming that masking produces genuinely different libraries for different tasks rather than a fixed pruning. Second, the suppressed skills are not random: they consistently correspond to skills whose predicted causal scores are strongly negative on the nearest development tasks, confirming that the masking rule selects on measured harm rather than surface features.

A representative example ties the mechanism back to Figure [1](https://arxiv.org/html/2606.15390#S1.F1). Task `6474048_2` (the coffee grinder from §[1](https://arxiv.org/html/2606.15390#S1)) fails both without skills (incorrect dimension filtering) and with the full library (off-topic Spotify and side-effect-checking rules distract the agent). Per-task masking drops exactly these two irrelevant skills, and the agent succeeds, a microcosm of the broader pattern: measurement identifies the interference, and masking removes it.

### A.3 Limitations

Several limitations merit discussion. GPT-5.1’s null result on $\tau$-bench (62.6% → 62.6%) highlights a boundary of context engineering. Two factors compound. First, GPT-5.1 already achieves 100% on read-only queries without any skill library, yet trails GPT-4.1 by over 10 percentage points on multi-step actions (exchange, modify), suggesting that its internal prior saturates easy tasks but is insufficiently aligned to absorb external procedural knowledge on hard ones. Second, the skill library is curated on AppWorld, a code-generation environment whose operational patterns transfer only partially to $\tau$-bench’s conversational customer-service setting; GPT-5.1, with the strongest internal priors, is the most sensitive to this domain gap. These two hypotheses make distinguishable predictions. If the bottleneck is capability saturation, all action types should plateau; Table [4](https://arxiv.org/html/2606.15390#A6.T4) shows otherwise: GPT-5.1 achieves 100% on read-only but only 56.7% on exchanges, indicating uneven rather than uniform saturation. If the bottleneck is domain transfer, the model with the strongest internal priors should be most resistant to external skill injection; GPT-5.1 does show zero gain, consistent with this prediction. Claude Sonnet 4.5 independently exhibits the same null result on $\tau$-bench (73.0% → 73.0%). Its base performance (73.0%) is comparable to GPT-4.1’s augmented result (73.9%), placing it in the high-competence regime where prompt-time skill injection yields diminishing returns. Two models from different providers exhibiting the same null result strengthens the interpretation: the boundary is determined by the model’s baseline competence on the target domain, not by provider-specific idiosyncrasies. More broadly, the marginal value of prompt-time skill injection appears to diminish as base model competence strengthens.

The attribution matrix is computed on $M = 15$ development tasks and extrapolated via nearest-neighbour weighting. On `test_normal`, mean cosine similarity to the top-8 nearest development tasks is 0.578, with 21.4% of tasks below 0.5; on `test_challenge` these figures are 0.469 and 69.1% (Appendix [K](https://arxiv.org/html/2606.15390#A11), Figure [5](https://arxiv.org/html/2606.15390#A11.F5)). Despite limited coverage, the method gains +4.8 pp even in the lowest-coverage quartile of `test_normal`, supported by the structural fallback which activates on 21.1% of tasks. The split step invokes LLM judgment to produce conditional skill variants, reintroducing the subjectivity we aim to reduce; we bound this by limiting candidates and validating through the development gate, but a fully measurement-driven splitting procedure remains open. Finally, our framework assumes a fixed skill library; extending it to online settings where skills are continuously added would require incremental attribution updates.
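
The extrapolation step described above can be sketched as follows. This is one plausible reading, not the paper's implementation: it assumes similarity-weighted averaging over the top-k nearest development tasks (k = 8 as in the text), a drop threshold of zero, and a fallback that triggers when no development task is similar enough. The parameters `drop_below` and `min_sim` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_skills(task_vec, dev_vecs, C, skills, k=8, drop_below=0.0, min_sim=0.2):
    """Return the skills kept for one test task.

    C[j][i] is the measured causal score of skill j on development task i.
    `drop_below` and `min_sim` are hypothetical knobs, not values from the paper.
    """
    sims = [cosine(task_vec, d) for d in dev_vecs]
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    weights = [max(sims[i], 0.0) for i in nearest]
    # Assumed fallback: without a sufficiently similar development task,
    # keep the full library instead of trusting the extrapolation.
    if not nearest or sims[nearest[0]] < min_sim or sum(weights) == 0:
        return list(skills)
    kept = []
    for j, skill in enumerate(skills):
        predicted = sum(w * C[j][i] for w, i in zip(weights, nearest)) / sum(weights)
        if predicted >= drop_below:  # suppress only predicted-harmful skills
            kept.append(skill)
    return kept

# Toy data: two development tasks, two skills.
dev_vecs = [[1.0, 0.0], [0.0, 1.0]]
C = [[0.5, -0.6],  # skill "a": helps dev task 0, hurts dev task 1
     [0.2, 0.3]]   # skill "b": helps on both
skills = ["a", "b"]

print(select_skills([0.1, 1.0], dev_vecs, C, skills))  # near dev task 1
print(select_skills([1.0, 0.1], dev_vecs, C, skills))  # near dev task 0
print(select_skills([0.0, 0.0], dev_vecs, C, skills))  # no neighbour: fallback
```

In the toy data, skill "a" is suppressed only for the test task that resembles the development task it hurt, which is the per-task behaviour the section describes.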

## Appendix B Operational Templates

The five AppWorld templates and five $\tau$-bench retail templates below are appended to every skill library as a protected bedrock layer. They are exempt from all modification and masking operations. Each template instantiates one of five domain-agnostic operational principles: complete data collection before acting, verify intermediate state before proceeding, confirm before irreversible actions, resolve ambiguous entities from authoritative sources, and sequence dependent operations correctly. In code-generation agents (AppWorld), templates take the form of executable code scaffolds; in conversational agents ($\tau$-bench), they take the form of procedural checklists derived from the benchmark’s policy document.

#### AppWorld Templates\.

[tpl-00001] PAGINATION TEMPLATE. Use for any API that returns a list. Always paginate with a `while True` loop over `page_index`; never use `range(N)` or fetch only one page. Accumulate results into `all_items` and break when the page is empty.
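
A minimal sketch of the loop this template describes is shown below. The callable `fetch_page` and the fake backend are hypothetical stand-ins for a paginated API call; only the `while True` / `page_index` / `all_items` structure comes from the template text.

```python
def fetch_all(fetch_page):
    """Collect every item from a paginated endpoint.

    `fetch_page(page_index)` is a hypothetical stand-in for an API call that
    returns a list of items, and an empty list once the pages run out.
    """
    all_items = []
    page_index = 0
    while True:
        page = fetch_page(page_index)
        if not page:  # empty page: nothing left to fetch
            break
        all_items.extend(page)
        page_index += 1
    return all_items

# Fake three-page backend used only to exercise the loop.
PAGES = [[1, 2], [3, 4], [5]]

def fake_fetch(page_index):
    return PAGES[page_index] if page_index < len(PAGES) else []

print(fetch_all(fake_fetch))  # [1, 2, 3, 4, 5]
```

Looping until an empty page avoids the two failure modes the template warns about: guessing the page count with `range(N)` and silently reading only the first page.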

[tpl-00002] DATA VALIDATION TEMPLATE. Before acting on any data, print and inspect intermediate results (item count, sample keys, first item). Only proceed after confirming the data looks correct.

[tpl-00003] SAFE COMPLETE_TASK TEMPLATE. Before submitting an answer, print it alongside the number of source data points from which it was derived. Only then call `complete_task`.

[tpl-00004] LOGIN + ACCESS TOKEN TEMPLATE. Standard login flow: retrieve account passwords via `supervisor.show_account_passwords`, match by app name, call the app’s `login` endpoint, and store the access token.

[tpl-00005] CROSS-APP IDENTITY RESOLUTION TEMPLATE. When resolving a person across apps, always start from the phone contacts directory (paginated). Match by email or phone number, never by name alone.
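
The matching rule in this template can be illustrated with a short sketch. The dictionary keys (`name`, `email`, `phone`) and the sample contacts are hypothetical; in practice the contact list would come from the paginated phone directory.

```python
def resolve_contact(candidate, contacts):
    """Return the contact matching `candidate` by email or phone, else None.

    Names are deliberately ignored, since two contacts can share a name.
    The dictionary keys used here are hypothetical.
    """
    for contact in contacts:
        same_email = bool(candidate.get("email")) and candidate["email"] == contact.get("email")
        same_phone = bool(candidate.get("phone")) and candidate["phone"] == contact.get("phone")
        if same_email or same_phone:
            return contact
    return None

contacts = [
    {"name": "Alex Kim", "email": "alex.k@example.com", "phone": "555-0101"},
    {"name": "Alex Kim", "email": "akim@example.org", "phone": "555-0199"},
]

# A payment-app user who shares a name with both contacts.
payer = {"name": "Alex Kim", "email": "akim@example.org"}
print(resolve_contact(payer, contacts)["phone"])  # 555-0199

# A name-only candidate is not resolved at all.
print(resolve_contact({"name": "Alex Kim"}, contacts))  # None
```

Returning `None` for a name-only candidate is a design choice of this sketch: it forces the caller to fetch an email or phone number rather than guess between namesakes.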

#### $\tau$-bench Retail Templates.

[tpl-tb-001] ITEM MODIFICATION CHECKLIST. Before calling `modify_pending_order_items` or `exchange_delivered_order_items`: (1) call `get_product_details` for each item, (2) present options and confirm exact choice, (3) ask if these are all the items, (4) collect all changes into a single list, (5) make one tool call. These tools can only be called once.

[tpl-tb-002] EXCHANGE ITEM RESOLUTION. Six-step procedure: get current item IDs, fetch product details for available variants, match customer requirements, present ambiguous options, compare prices if requested, confirm exact new item ID.

[tpl-tb-003] MULTI-ORDER SEQUENCING. For requests involving multiple orders: get all order IDs, check each order’s status, process sequentially, confirm each action individually, track completed orders.

[tpl-tb-004] STATUS-GATED ACTION. Before any order action, verify status via `get_order_details`. Cancel/modify require “pending”; return/exchange require “delivered”. If status does not match, inform the customer of available actions.
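
The gating rule above amounts to a small lookup table. The sketch below encodes it directly; the function name `check_action` is ours, and the order status is passed in as a plain string rather than fetched through `get_order_details`, which is outside the scope of this example.

```python
# Required order status for each action, as stated in the template.
REQUIRED_STATUS = {
    "cancel": "pending",
    "modify": "pending",
    "return": "delivered",
    "exchange": "delivered",
}

def check_action(action: str, order_status: str):
    """Return (allowed, alternatives) for a requested order action.

    `alternatives` lists the actions that are valid for the current status,
    so the agent can tell the customer what is possible instead.
    """
    if action not in REQUIRED_STATUS:
        raise ValueError(f"unknown action: {action}")
    allowed = REQUIRED_STATUS[action] == order_status
    alternatives = sorted(a for a, s in REQUIRED_STATUS.items() if s == order_status)
    return allowed, alternatives

print(check_action("return", "pending"))      # (False, ['cancel', 'modify'])
print(check_action("exchange", "delivered"))  # (True, ['exchange', 'return'])
```

Any status other than "pending" or "delivered" yields an empty list of alternatives in this sketch, which the agent would report as no order action being available.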

[tpl-tb-005] CONFIRMATION PROTOCOL. Before any destructive action, present a summary of: order ID, specific items and changes, payment method for refund, expected outcome. Wait for explicit customer confirmation before executing.

## Appendix C Causal Attribution Matrix

![Refer to caption](https://arxiv.org/html/2606.15390v1/figure4_1.png)

Figure 4: Causal attribution matrix for GPT-5.1 on AppWorld (top 40 skills by heterogeneity, 7 informative development tasks). Each cell shows the causal score $\mathbf{C}[j,i]$: green indicates the skill helps on that task, red indicates harm. *Left*: the matrix itself, sorted by heterogeneity $H(s_j)$ descending. Nearly every row contains both positive and negative cells, and outcome-level validation (§[A.1](https://arxiv.org/html/2606.15390#A1.SS1), Appendix [K](https://arxiv.org/html/2606.15390#A11)) confirms this pattern reflects genuine task-dependent effects rather than sampling noise. *Centre*: the heterogeneity score $H(s_j)$; the dashed orange line marks the split threshold $\tau_{\text{split}} = 0.40$. *Right*: the global causal score $\bar{C}(j)$. Skills with near-zero global scores but wide per-task ranges are the most dangerous, as they are invisible to any curation method that evaluates skills globally.

Figure [4](https://arxiv.org/html/2606.15390#A3.F4) visualises the causal attribution matrix for GPT-5.1 on AppWorld, restricted to the 40 most heterogeneous skills and the 7 most informative development tasks. Each cell encodes the difference-in-means causal score $\mathbf{C}[j,i]$ (Eq. [1](https://arxiv.org/html/2606.15390#S2.E1)): green indicates that including the skill improves task performance, red indicates harm. The pervasive coexistence of green and red within nearly every row confirms the central empirical finding of this paper (over 90% of skills are causally heterogeneous, validated through six outcome-level tests in §[A.1](https://arxiv.org/html/2606.15390#A1.SS1) and Appendix [K](https://arxiv.org/html/2606.15390#A11)) and motivates per-task masking as the primary curation mechanism.
The right-hand panels display the heterogeneity score $H(s_j)$ and the global causal score $\bar{C}(j)$; skills with high heterogeneity but near-zero global scores are the most dangerous, as they evade any curation method that evaluates skills only in aggregate.
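
A difference-in-means score of the kind described above can be sketched with a toy simulation. Eq. 1 is not reproduced in this section, so the estimator below is a generic version: mean outcome over runs where the skill was included minus mean outcome over runs where it was masked. The Bernoulli masking scheme, the number of runs, and the synthetic success rule are all assumptions made for illustration.

```python
import random
from statistics import mean

def difference_in_means(runs, skill):
    """Mean outcome with `skill` included minus mean outcome with it masked.

    `runs` is a list of (included_skills, outcome) pairs for a single task.
    Returns None if the skill was always included or always masked.
    """
    with_skill = [y for included, y in runs if skill in included]
    without_skill = [y for included, y in runs if skill not in included]
    if not with_skill or not without_skill:
        return None
    return mean(with_skill) - mean(without_skill)

def random_mask(skills, keep_prob, rng):
    """Independently keep each skill with probability `keep_prob` (assumed scheme)."""
    return {s for s in skills if rng.random() < keep_prob}

# Toy task: succeeds only when "helpful" is present and "harmful" is absent.
# Real outcomes would come from agent rollouts, not from a rule like this.
rng = random.Random(0)
skills = ["helpful", "harmful", "inert"]
runs = []
for _ in range(400):
    included = random_mask(skills, 0.5, rng)
    outcome = 1.0 if "helpful" in included and "harmful" not in included else 0.0
    runs.append((included, outcome))

scores = {s: difference_in_means(runs, s) for s in skills}
for skill, score in scores.items():
    print(f"{skill:8s} {score:+.2f}")
```

Because the masks are randomized, the other skills are balanced on average between the two groups, so the difference isolates the effect of the skill in question; repeating this per development task fills one column of the matrix.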

## Appendix D Sign-Reversing Skill Examples

Two examples from the GPT\-5\.1 attribution on AppWorld illustrate causal heterogeneity at the individual skill level\.

Example 1: `vc-00054`. “Validate that note names map unambiguously to contacts before creating Venmo requests.” Global causal score: $\bar{C} = -0.03$ (near zero). Per-task range: $+0.50$ on shared-expense reconciliation tasks (catches real name ambiguities) to $-0.67$ on single-app tasks (forces irrelevant cross-referencing). A judgment-based curator would retain this skill because its global score is harmless; only per-task measurement reveals that it helps half the time and hurts the other half.

Example 2: `vc-00021a`. “For repeated side-effecting API calls, capture and check each response.” Global causal score: $\bar{C} = +0.05$ (near zero). Per-task range: $+0.71$ on mutation-heavy tasks (catches silent API failures) to $-0.23$ on read-only tasks (adds pure overhead). Both examples share the signature of causal heterogeneity: near-zero global score, large per-task range, and invisible to any curation method that does not disaggregate by task.

## Appendix E Per-Difficulty Breakdown

Table [3](https://arxiv.org/html/2606.15390#A5.T3) disaggregates Task Goal Completion by difficulty level (L1–L3) for six base models on both AppWorld test splits. Level 1 tasks are largely saturated across methods, while the most substantial improvements from our pipeline appear on Level 2 and Level 3 tasks, precisely the multi-step scenarios where uncurated skill libraries cause the most interference. For GPT-5.1 on `test_challenge`, the upstream ACE library degrades L2 and L3 performance relative to bare ReAct (50.0 → 44.0 and 48.2 → 43.1), while our method recovers and substantially exceeds baseline performance at every difficulty level.

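Table 3: Per-difficulty breakdown (TGC). Gains concentrate on Level 2 and Level 3 tasks. GPT-4o per-level breakdown is not available at single-attempt granularity.

| Model | Method | test_normal L1 | test_normal L2 | test_normal L3 | test_normal All | test_challenge L1 | test_challenge L2 | test_challenge L3 | test_challenge All |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5.1 | ReAct | 86.0 | 68.8 | 44.4 | 61.9 | 69.4 | 50.0 | 48.2 | 52.5 |
| | ACE | 86.0 | 72.9 | 46.0 | 67.3 | 80.6 | 44.0 | 43.1 | 49.9 |
| | Ours | 93.0 | 93.8 | 60.3 | 77.4 | 88.9 | 68.7 | 71.3 | 66.4 |
| DS-V3 | ReAct | 89.5 | 75.0 | 46.0 | 69.1 | 61.1 | 45.3 | 43.1 | 47.0 |
| | ACE | 93.0 | 85.4 | 58.7 | 78.0 | 77.8 | 61.3 | 59.0 | 63.1 |
| | Ours | 93.0 | 91.7 | 68.3 | 83.3 | 86.1 | 72.7 | 60.5 | 69.3 |
| GPT-4.1 | ReAct | 84.2 | 79.2 | 41.3 | 66.7 | 79.2 | 43.3 | 45.1 | 50.4 |
| | CUGA | 91.2 | 77.1 | 54.0 | 73.2 | 91.7 | 58.7 | 44.1 | 57.6 |
| | Ours | 91.2 | 85.4 | 54.0 | 75.6 | 75.0 | 61.3 | 62.1 | 64.0 |
| GPT-5.4 | ReAct | 85.7 | 82.1 | 78.6 | 82.1 | 83.5 | 81.3 | 78.4 | 81.1 |
| | Ours | 91.1 | 87.5 | 87.5 | 88.7 | 87.1 | 84.2 | 84.9 | 85.4 |
| Sonnet 4.5 | ReAct | 87.5 | 80.4 | 83.9 | 83.9 | 71.2 | 71.9 | 67.6 | 70.3 |
| | Ours | 94.6 | 85.7 | 87.5 | 89.3 | 75.5 | 76.3 | 74.1 | 75.3 |
| Gemini 2.5 | ReAct | 76.8 | 69.6 | 71.4 | 72.6 | 48.2 | 46.8 | 53.2 | 49.4 |
| | Ours | 80.4 | 80.4 | 82.1 | 81.0 | 56.1 | 54.0 | 54.7 | 54.9 |
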
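
As a worked check on how absolute and relative gains relate, the sketch below recomputes both from the "All" column of Table 3 on test_challenge. The helper name `gains` is illustrative; the DS-V3 row reproduces the 47.4% relative improvement quoted in the abstract.

```python
def gains(baseline, ours):
    """Absolute gain in percentage points and relative gain in percent."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative

# "All" column of Table 3 on test_challenge: (ReAct, Ours).
rows = {
    "GPT-5.1": (52.5, 66.4),
    "DS-V3": (47.0, 69.3),
    "GPT-4.1": (50.4, 64.0),
}

for model, (react, ours) in rows.items():
    absolute, relative = gains(react, ours)
    print(f"{model}: {absolute:+.1f} pp, {relative:+.1f}% relative")
```
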
## Appendix F Per-Action Breakdown (τ-bench)

Table [4](https://arxiv.org/html/2606.15390#A6.T4) reports per-action accuracy on τ-bench retail for all seven models under our method. The five action categories (exchange, return, modify, cancel, and read-only) differ substantially in procedural complexity: read-only queries require no state mutation, while exchanges involve multi-step item resolution, price comparison, and confirmation. All models achieve near-perfect accuracy on read-only queries (except GPT-4o), and the largest gains from skill-based curation concentrate on the more complex action types (return and cancel), where procedural templates and curated skills provide the strongest marginal value over unaugmented baselines.

Table 4: τ-bench per-action breakdown (our method). Task counts in parentheses. Models sorted by overall score.

| Model | Exchange (30) | Return (31) | Modify (34) | Cancel (11) | Read-only (9) | All |
|---|---|---|---|---|---|---|
| GPT-5.4 | 73.3 | 80.6 | 79.4 | 87.5 | 100 | 80.9 |
| GPT-4.1 | 70.0 | 83.9 | 67.6 | 81.8 | 100 | 73.9 |
| Gemini 2.5 | 76.7 | 71.0 | 64.7 | 87.5 | 91.7 | 73.9 |
| Sonnet 4.5 | 66.7 | 71.0 | 70.6 | 87.5 | 91.7 | 73.0 |
| DS-V3 | 70.0 | 64.5 | 61.8 | 63.6 | 100 | 66.1 |
| GPT-5.1 | 56.7 | 64.5 | 55.9 | 63.6 | 100 | 62.6 |
| GPT-4o | 66.7 | 77.4 | 61.8 | 27.3 | 44.4 | 62.6 |

## Appendix G Reproducibility Details

This section provides the complete set of hyperparameters, model identifiers, and software versions needed to reproduce all experiments. Table [5](https://arxiv.org/html/2606.15390#A7.T5) lists every hyperparameter used across the pipeline; all values are fixed across every (model, benchmark) configuration with no per-cell tuning. Table [6](https://arxiv.org/html/2606.15390#A7.T6) lists the exact API identifiers for all models.

Table 5: Hyperparameters. All values are fixed across every (model, benchmark) cell.

| Stage | Parameter | Value |
|---|---|---|
| Upstream curation | Difficulty-estimation rollouts | 2 |
| | Task ordering | Hardest-first |
| Causal attribution | Development tasks $M$ | 15 |
| | Random masks $K$ | 12 |
| | Keep probability $f$ | 0.4 |
| Offline restructuring | Split threshold $\tau_{\text{split}}$ | 0.40 |
| | Max split candidates | 15 |
| | Retire threshold $\tau_{\text{retire}}$ | 0.10 |
| | Merge similarity $\tau_{\text{merge}}$ | 0.85 |
| Per-task masking | Nearest neighbours $k$ | 8 |
| | Softmax temperature $\tau$ | 5 |
| | Mask threshold $\tau_{\text{mask}}$ | $-0.10$ |
| | Min retained skills | 30 |
| Shared | Embedding model | Qwen3-Emb-0.6B |

Model identifiers. All models are accessed via API; Table [6](https://arxiv.org/html/2606.15390#A7.T6) lists the exact identifiers returned by each provider.

Table 6: Model API identifiers.

| Model | API identifier | Provider |
|---|---|---|
| GPT-5.1 | gpt-5.1-2025-11-13 | Azure OpenAI |
| GPT-5.4 | gpt-5.4-2026-03-05 | Azure OpenAI |
| GPT-4.1 | gpt-4.1-2025-04-14 | Azure OpenAI |
| GPT-4o | gpt-4o-2024-11-20† | Azure OpenAI |
| DeepSeek-V3 | deepseek-chat (V3.2) | DeepSeek direct |
| Sonnet 4.5 | claude-sonnet-4-5-20250929 | Anthropic direct |
| Gemini 2.5 Pro | gemini-2.5-pro | Google AI Studio |
| Embedding | Qwen3-Embedding-0.6B | HuggingFace (local) |
| τ-bench user sim | gpt-4o-2024-11-20 | Azure OpenAI |

† Our agent and the Gupta et al. \[[3](https://arxiv.org/html/2606.15390#bib.bib4)\] baseline use this version. The ReAct baseline (48.8%) is from the AppWorld leaderboard entry using gpt-4o-2024-05-13.

Benchmark versions. AppWorld: appworld==0.1.4.dev0 (pip, aligned with the ACE codebase). τ-bench: commit 4754e6b of sierra-research/tau-bench.

Determinism. All agent calls use temperature 0 and seed 100. Attribution masks are generated with Python’s random.Random(seed), where seed = 42 for all (model, benchmark) cells except GPT-5.1 on AppWorld, which uses seed = 45. Attribution configuration is otherwise uniform: $K = 12$ masks, $M = 15$ development tasks, keep probability $f = 0.4$ (Bernoulli).
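
A minimal sketch of seeded mask generation under the settings above. The source states only that masks come from `random.Random(seed)` with Bernoulli keep probability $f$; the per-skill draw `rng.random() < keep_prob`, the function name `sample_masks`, and the skill identifiers are assumptions made for illustration.

```python
import random

K = 12          # number of random masks (Table 5)
F_KEEP = 0.4    # Bernoulli keep probability (Table 5)

def sample_masks(skill_ids, k=K, keep_prob=F_KEEP, seed=42):
    """Draw k independent keep/drop masks over the skill library.

    Assumed procedure: one uniform draw per (mask, skill) pair.
    """
    rng = random.Random(seed)
    return [
        {sid: rng.random() < keep_prob for sid in skill_ids}
        for _ in range(k)
    ]

# Hypothetical identifiers for an 87-skill library (the GPT-4.1 AppWorld size).
skills = [f"skill-{i:05d}" for i in range(87)]
masks = sample_masks(skills)
kept_fraction = sum(sum(m.values()) for m in masks) / (len(masks) * len(skills))
print(len(masks), round(kept_fraction, 2))
```

Because the generator is seeded, calling `sample_masks` again with the same arguments returns identical masks, which is the property the fixed seeds rely on.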

Library sizes. After upstream curation (before offline restructuring), the skill library contains 103 skills for GPT-5.1, 87 for GPT-4.1, 126 for GPT-4o, 80 for DeepSeek-V3, 95 for GPT-5.4, 71 for Claude Sonnet 4.5, and 88 for Gemini 2.5 Pro on AppWorld. On τ-bench, library sizes are 76, 72, 61, 161, 81, 94, and 68 respectively. Variation arises because each model’s curation run produces a different set of skills from the same training tasks. DeepSeek-V3’s notably larger τ-bench library (161 vs. 61–76 for the OpenAI models) reflects more aggressive skill generation during curation: DeepSeek-V3 proposed 115 ADD actions over 500 training tasks, compared to 15–30 for the other models under the same curation prompt. Our downstream attribution and masking mechanisms are agnostic to library size.

## Appendix H Sequential Ablation

Table [7](https://arxiv.org/html/2606.15390#A8.T7) isolates the contribution of each pipeline component by sequentially adding them to the bare ReAct baseline on GPT-5.1 / AppWorld test_normal (168 tasks). This additive design ensures that each row’s $\Delta$ reflects the marginal value of the newly added component on top of all preceding ones. The results confirm that per-task masking contributes the largest single increment (+7.5 pp), consistent with the finding that the dominant challenge is not removing globally harmful skills but selecting the right skill subset for each individual task.

Table 7: Sequential ablation (GPT-5.1, AppWorld test_normal, 168 tasks). Each row adds one component to the preceding configuration. Per-task masking contributes the largest single increment (+7.5 pp), consistent with the finding that over 90% of skills are causally heterogeneous (§[A.1](https://arxiv.org/html/2606.15390#A1.SS1)).

| Configuration | TGC | $\Delta$ |
|---|---|---|
| (a) ReAct baseline | 61.9 | – |
| (b) + Templates | 67.9 | +6.0 |
| (c) + Offline Restructuring | 69.9 | +2.0 |
| (d) + Per-Task Masking (full) | 77.4 | +7.5 |

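
The marginal increments in Table 7 follow directly from the TGC column. The short sketch below recomputes them and the cumulative gain over the baseline; it uses only the values printed in the table.

```python
# TGC (%) for each configuration in Table 7, in the order components are added.
stages = [
    ("ReAct baseline", 61.9),
    ("+ Templates", 67.9),
    ("+ Offline Restructuring", 69.9),
    ("+ Per-Task Masking (full)", 77.4),
]

# Marginal gain of each newly added component over the preceding row.
deltas = [
    (name, round(tgc - prev_tgc, 1))
    for (_, prev_tgc), (name, tgc) in zip(stages, stages[1:])
]
total_gain = round(stages[-1][1] - stages[0][1], 1)
largest = max(deltas, key=lambda d: d[1])

for name, delta in deltas:
    print(f"{name}: {delta:+.1f} pp")
print(f"total: {total_gain:+.1f} pp; largest step: {largest[0]}")
```
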
## Appendix I Offline Restructuring: Split Example

To illustrate what offline restructuring produces in practice, we show one of the 13 splits performed on the GPT-5.1 AppWorld library (bifurcation score $H = 1.56$).

Before (single skill). \[psw-00007\] “Many APIs return items in pages. Make sure to run through all the pages by looping over page_index.”

This rule is essential for tasks that require scanning many items \(e\.g\., browsing products, enumerating emails\), but harmful for targeted purchases where the agent already knows which item it wants: exhaustive pagination wastes the step budget\.

After (two conditional variants). \[psw-00007a\] \[IF task involves browsing or comparing multiple product options or scanning through multiple emails/messages\]: “Many APIs return items in pages. Make sure to run through all the pages by looping over page_index.”

\[psw-00007b\] \[IF task is a targeted purchase of a specific known item or constrained to prior sellers\]: “Do NOT run through all pages or loop over page_index; instead, focus only on the specific item or on items from previously used sellers without exhaustive pagination.”

Both variants pass the development gate (no regression on any of the $M = 15$ attribution tasks). The trigger conditions are generated by the base LLM from the per-task causal score vector; the development gate ensures they do not introduce regressions.
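
One way to read the development gate is as a per-task comparison of outcomes before and after the split. The sketch below assumes binary task outcomes and a strict "no task gets worse" rule; the function name `passes_dev_gate` and the outcome vectors are hypothetical, and the actual gate in the released code may differ.

```python
def passes_dev_gate(before, after):
    """True if no development task scores lower after the change."""
    if len(before) != len(after):
        raise ValueError("score vectors must cover the same tasks")
    return all(a >= b for b, a in zip(before, after))

# Hypothetical per-task outcomes (1 = solved, 0 = failed) on M = 15 tasks.
before = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
improved = [1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]
regressed = [1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]

print(passes_dev_gate(before, improved))   # two tasks newly solved, none lost
print(passes_dev_gate(before, regressed))  # one previously solved task now fails
```
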

## Appendix J Sub-Goal Completion (SGC)

Table [8](https://arxiv.org/html/2606.15390#A10.T8) reports Sub-Goal Completion on AppWorld, a finer-grained metric that awards partial credit for completing individual sub-goals within each task even when the overall task fails. SGC complements the primary TGC metric by revealing whether improvements reflect more tasks being fully solved or deeper progress on partially solved ones. The pattern mirrors the TGC results: our method achieves the highest SGC for six of seven models on both splits, with the largest gains on test_challenge where uncurated libraries cause the most interference.

Table 8: AppWorld SGC results (%). Bold: best prompt-based method per column. Red: degrades over ReAct.

| Model | Method | test_normal (168) SGC | test_challenge (417) SGC |
|---|---|---|---|
| GPT-5.1 | ReAct | 46.4 | 30.9 |
| | ACE | 55.4 | 28.8 |
| | Ours | 71.4 | 48.9 |
| DeepSeek-V3 | ReAct | 48.2 | 25.2 |
| | ACE | 66.1 | 43.9 |
| | Ours | 71.4 | 52.5 |
| GPT-4.1 | ReAct | 46.4 | 32.4 |
| | CUGA | 62.5 | 48.2 |
| | Ours | 60.7 | 48.2 |
| GPT-4o | ReAct | 32.1 | 13.0 |
| | [Gupta et al.](https://arxiv.org/html/2606.15390#bib.bib4) | 57.1 | 23.0 |
| | Ours | 58.9 | 30.2 |
| GPT-5.4 | ReAct | 91.9 | 91.7 |
| | Ours | 96.7 | 95.4 |
| Sonnet 4.5 | ReAct | 94.0 | 88.7 |
| | Ours | 97.2 | 93.5 |
| Gemini 2.5 | ReAct | 87.8 | 80.5 |
| | Ours | 95.5 | 91.8 |
| Leaderboard best∗ | (Qwen3-14B, wt-tuned) | 80.4 | 50.4 |

∗ No accompanying publication; see footnote in §[3](https://arxiv.org/html/2606.15390#S3).

## Appendix K Statistical Validation

Statistical validation is conducted on GPT-5.1; the same attribution protocol ($K = 12$, $M = 15$, $f = 0.4$) applies to all seven models.

### K.1 Power Analysis

Table [9](https://arxiv.org/html/2606.15390#A11.T9) reports the statistical power of a per-skill permutation test for detecting a true $\pm 0.30$ causal effect at the $\alpha = 0.05$ level, as a function of the number of random masks $K$. At $K = 12$ (our setting), per-cell permutation tests achieve only 38.5% power, and the descriptive threshold $H \geq 0.40$ cannot distinguish real heterogeneity from noise at the single-skill level. This is why we adopt a higher evidentiary standard: rather than relying on per-cell significance, we validate heterogeneity through six independent outcome-level tests (§[A.1](https://arxiv.org/html/2606.15390#A1.SS1)) that are robust to the per-cell noise level. The table also shows that increasing $K$ to 30–50 masks would bring per-cell power to 80–98%, providing a clear path for future work to strengthen the per-skill analysis.

Table 9: Power analysis for per-skill heterogeneity detection. Power is computed for a true $\pm 0.30$ effect via 1000-trial permutation simulation.

| Masks $K$ | Per-cell $\sigma$ | Power ($\pm 0.30$) | Null $H \geq 0.40$ FPR |
|---|---|---|---|
| 12 (ours) | 0.290 | 38.5% | ∼100% |
| 20 | 0.225 | 59.5% | – |
| 30 | 0.183 | 80.0% | – |
| 50 | 0.142 | 98.0% | – |

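
The per-cell noise levels in Table 9 track $1/\sqrt{n}$ in the number of masks $n$ quite closely. The sketch below checks this against the printed values; it is an observation about the table, not a statement of how the authors computed $\sigma$.

```python
import math

# (number of masks, per-cell sigma) as reported in Table 9.
table9 = [(12, 0.290), (20, 0.225), (30, 0.183), (50, 0.142)]

for n_masks, sigma in table9:
    implied = 1.0 / math.sqrt(n_masks)
    print(f"masks={n_masks:2d}  reported={sigma:.3f}  1/sqrt(n)={implied:.3f}")

# Largest gap between the reported sigma and the 1/sqrt(n) curve.
max_gap = max(abs(sigma - 1.0 / math.sqrt(n)) for n, sigma in table9)
print(f"max gap = {max_gap:.4f}")
```

Under this scaling, halving the per-cell noise requires roughly four times as many masks, which matches the table's suggestion that 30 to 50 masks are needed for well-powered per-cell tests.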
### K.2 Bootstrap Decision Stability

To assess the robustness of individual curation decisions, we resample the $K = 12$ attribution masks with replacement 1000 times and recompute each decision. Table [10](https://arxiv.org/html/2606.15390#A11.T10) summarises the results. Among the 287 (skill, task) cells where per-task masking suppresses a skill ($\hat{C} < -0.10$), 97.2% remain below the threshold at the bootstrap median, indicating strong directional stability: the method consistently identifies the same skills as harmful for each task. The $k = 8$ averaging in Eq. [6](https://arxiv.org/html/2606.15390#S2.E6) is key to this stability: it reduces the effective per-decision noise from $\sigma \approx 0.29$ (per cell) to $\sigma \approx 0.10$ (per decision), consistent with $0.29/\sqrt{8} \approx 0.10$, bringing the signal-to-noise ratio close to 1 for the masking threshold. At the stricter 95% CI level, 4.2% of cells are confirmed, reflecting the inherent conservatism of cell-level confidence intervals at $K = 12$; the directional stability rate is the operationally relevant metric since the fallback mechanism (§[2.3](https://arxiv.org/html/2606.15390#S2.SS3)) ensures graceful degradation for borderline cases.

Table 10: Bootstrap decision stability (1000 resamples).

| Decision type | Total | Directionally stable (median) | Strictly stable (95% CI) |
|---|---|---|---|
| Mask ($\hat{C} < -0.10$) | 287 cells | 97.2% | 4.2% |
| Retire ($\lvert \bar{C} \rvert < 0.10$) | 100 skills | 88.0% ($\geq$ 80% boots) | – |

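
The two stability notions in Table 10 can be illustrated for a single (skill, task) cell. This is a sketch under stated assumptions: the cell score is taken to be the difference in mean task outcome between runs that kept the skill and runs that dropped it, which may not match the paper's estimator, and the $k = 8$ neighbour averaging of Eq. 6 is not reproduced. The names `cell_score` and `bootstrap_decision` and the 12 runs are hypothetical.

```python
import random
from statistics import mean, median

def cell_score(runs):
    """Mean outcome with the skill kept minus mean outcome with it dropped.

    Returns None when a resample contains only kept or only dropped runs.
    """
    kept = [y for keep, y in runs if keep]
    dropped = [y for keep, y in runs if not keep]
    if not kept or not dropped:
        return None
    return mean(kept) - mean(dropped)

def bootstrap_decision(runs, threshold=-0.10, n_boot=1000, seed=0):
    """Resample the runs with replacement and summarise the mask decision."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        sample = [rng.choice(runs) for _ in runs]
        score = cell_score(sample)
        if score is not None:
            scores.append(score)
    if not scores:
        raise ValueError("no usable bootstrap resamples")
    scores.sort()
    # Approximate upper end of a 95% percentile interval.
    upper = scores[min(len(scores) - 1, int(0.975 * len(scores)))]
    return {
        "directionally_stable": median(scores) < threshold,
        "strictly_stable": upper < threshold,
    }

# Hypothetical cell: 12 masked runs as (skill kept?, task outcome in {0, 1}).
runs = [(True, 0), (True, 0), (True, 1), (True, 0), (True, 0),
        (False, 1), (False, 1), (False, 0), (False, 1),
        (False, 1), (False, 1), (False, 1)]
print(cell_score(runs), bootstrap_decision(runs))
```

The directional criterion asks only that the bootstrap median stay below the threshold, whereas the strict criterion requires the whole upper tail to do so, which is why far fewer cells satisfy it with 12 masks.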
### K.3 Outcome-Level Heterogeneity Evidence

Since per-cell significance testing lacks power at $K = 12$ (§[K.1](https://arxiv.org/html/2606.15390#A11.SS1)), we validate the heterogeneity claim through six independent lines of outcome-level evidence, summarised in Table [11](https://arxiv.org/html/2606.15390#A11.T11). Each line tests a different prediction of the causal heterogeneity hypothesis; together they provide convergent support that the attribution matrix captures genuine task-dependent structure.

Table 11: Outcome-level evidence for causal heterogeneity.

| Evidence | Statistic |
|---|---|
| Mask diversity | 96% selective, mean Jaccard = 0.30 |
| Drop vs. Keep attribution | gap = 0.147, Mann-Whitney $p < 10^{-98}$ |
| Sign-reversing cases | vc-00109a: $\bar{C} = +0.007$, $\hat{C} = -0.107$ |
| Split separation | 7/13 splits $\geq 0.05$ divergence |
| Reverse masking | −4.7 pp (causal direction confirmed) |
| Sign stability | 71%/70% pos/neg (vs. 50% null) |

### K.4 Development Set Coverage

Figure [5](https://arxiv.org/html/2606.15390#A11.F5) shows the cumulative distribution of mean cosine similarity between each test task and its $k = 8$ nearest development tasks. On test_normal, 78.6% of tasks have similarity $\geq 0.50$; on test_challenge, only 30.9% do. Table [12](https://arxiv.org/html/2606.15390#A11.T12) stratifies performance by coverage quartile on test_normal (GPT-5.1). The method improves over ReAct even in the lowest-coverage quartile (+4.8 pp), though gains are largest in the mid-range where the attribution signal is strongest.

![Refer to caption](https://arxiv.org/html/2606.15390v1/figure5.png)

Figure 5: Coverage of test tasks by the attribution development set. CDF of mean cosine similarity to the top-8 nearest development tasks. The dashed line marks similarity = 0.50.

Table 12: Performance by coverage quartile (GPT-5.1, test_normal, 168 tasks).

| Quartile | Sim range | ReAct | Ours | $\Delta$ |
|---|---|---|---|---|
| Q1 (low) | 0.29–0.51 | 61.9% | 66.7% | +4.8 |
| Q2 | 0.52–0.58 | 59.5% | 78.6% | +19.0 |
| Q3 | 0.58–0.62 | 69.0% | 85.7% | +16.7 |
| Q4 (high) | 0.62–0.83 | 71.4% | 78.6% | +7.1 |
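
The coverage statistic used in this subsection, mean cosine similarity between a test task and its $k$ nearest development tasks, can be sketched as follows. The 2-D vectors are toy stand-ins for real task embeddings, and the function names are illustrative; the released code may compute this differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def coverage(test_vec, dev_vecs, k=8):
    """Mean cosine similarity between a test task and its k nearest dev tasks."""
    sims = sorted((cosine(test_vec, d) for d in dev_vecs), reverse=True)
    top = sims[:k]  # uses all dev tasks if fewer than k are available
    return sum(top) / len(top)

# Toy 2-D embeddings standing in for real task embeddings (hypothetical values).
dev = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(round(coverage((1.0, 0.0), dev, k=2), 3))
```

A test task would count as well covered in the sense of Figure 5 when this value is at least 0.50.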