Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

arXiv cs.CL Papers

Summary

Proposes SelSkill, a dual-granularity preference-learning framework that learns when to invoke skills in agentic tasks, improving task success by 10.9 percentage points on ALFWorld (with Qwen3-8B) and by 5.7 percentage points on BFCL.

arXiv:2606.00510v1

Cached at: 06/02/26, 03:37 PM

# Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning
Source: [https://arxiv.org/html/2606.00510](https://arxiv.org/html/2606.00510)
Chishui Chen (1,2), Jiaye Lin (1), Te Sun (3), Junxi Wang (2), Yi Yang (1,4), Cong Qin (1,5), Yangen Hu (1), Lu Pan (1), Ke Zeng (1)

(1) Meituan, (2) Fudan University, (3) Shanghai Jiao Tong University, (4) Nanjing University, (5) Peking University

{chenchishui, linjiaye}@meituan.com

###### Abstract

Agent skills are callable procedural modules that provide reusable knowledge and execution policies for complex agentic tasks. However, existing methods mainly focus on selecting relevant skills or improving the skills themselves, while overlooking whether a relevant skill should actually be invoked at the current decision point. Unhelpful invocations may introduce irrelevant context and disrupt an otherwise correct execution process. To address this issue, we propose SelSkill, a dual-granularity preference-learning framework for selective skill invocation. SelSkill formulates skill use as a skill-or-skip decision, uses predictive uncertainty to prioritize candidate decision points, and constructs controlled invoke-skip preference pairs from shared trajectory prefixes. It further combines episode-level outcome preferences with step-level invocation preferences to capture both overall trajectory quality and the local effectiveness of skill invocation. On ALFWorld with Qwen3-8B, SelSkill improves task success by 10.9 percentage points and execution precision by 29.1 percentage points. On BFCL, it improves task success by 5.7 percentage points and execution precision by 29.5 percentage points. Zero-shot results on Tau-bench and PopQA further suggest that the learned invocation policy transfers to new domains with previously unseen skills.

Chishui Chen and Jiaye Lin contributed equally. Jiaye Lin is the corresponding author.

Code: [GitHub Repository](https://github.com/ChenChiShui/selective-skill-invocation)

## 1 Introduction

As agent systems are increasingly applied to long-horizon, highly interactive tasks, relying on the model to plan and execute from scratch at each step can underuse prior experience and lead to inefficient exploration (Wang et al., [2025b](https://arxiv.org/html/2606.00510#bib.bib58)). In this context, agent skills have received growing attention in settings such as web interaction and software engineering (Wang et al., [2025a](https://arxiv.org/html/2606.00510#bib.bib57); Ling et al., [2026](https://arxiv.org/html/2606.00510#bib.bib56); Li et al., [2026](https://arxiv.org/html/2606.00510#bib.bib64)). As callable procedural modules, agent skills encapsulate domain knowledge, applicability conditions, and reusable execution policies, providing agents with reusable support for complex task solving (Jiang et al., [2026](https://arxiv.org/html/2606.00510#bib.bib30); Wang et al., [2026](https://arxiv.org/html/2606.00510#bib.bib31)).

Existing methods mainly focus on either identifying useful skills or improving their construction and use. Some retrieve relevant skills from a library based on the current task context (Zheng et al., [2026](https://arxiv.org/html/2606.00510#bib.bib32); Su et al., [2026](https://arxiv.org/html/2606.00510#bib.bib42)), while others construct or refine skills from external knowledge and interaction trajectories, sometimes integrating them into agent policy optimization (Xia et al., [2026](https://arxiv.org/html/2606.00510#bib.bib2); Tu et al., [2026](https://arxiv.org/html/2606.00510#bib.bib3)). However, these methods largely assume that relevant skills should be invoked, while overlooking a more fundamental question: even if a skill is relevant, should the agent actually invoke it at the current decision point? During task execution, unhelpful invocations may introduce irrelevant context, thereby disrupting an otherwise correct execution process. Our analysis reveals two important characteristics of skills during task execution:

(I) Highly Concentrated Skill Benefits. Figures [1](https://arxiv.org/html/2606.00510#S1.F1)(a)–(b) show that effective skill use does not mean invoking a skill whenever it appears relevant. The case study in Figure [1](https://arxiv.org/html/2606.00510#S1.F1)(a) illustrates that a seemingly relevant skill call can still produce an unnecessarily broad and suboptimal response. Across multiple benchmarks, the counterfactual results in Figure [1](https://arxiv.org/html/2606.00510#S1.F1)(b) show that enabling skill access improves the final outcome in only about 14% of paired trajectories, has no clear effect in about 78%, and worsens the outcome in about 8%. Further analysis shows that harmful invocations are often semantically close to effective skill uses in the same context. Thus, the value of a skill is concentrated in a narrow set of states, requiring precise invocation rather than calling skills whenever they appear relevant.

![Refer to caption](https://arxiv.org/html/2606.00510v1/x1.png)

Figure 1: Motivation for selective skill invocation. (a) A representative skill-or-skip case illustrates that a relevant skill may still be unnecessary for the current request. (b) Counterfactual analysis shows that beneficial effects of skill access are concentrated in only a small fraction of paired trajectories. (c) Episode-level feedback cannot directly identify the local contribution of each invocation.

(II) Trajectory-Level Ambiguity. Figure [1](https://arxiv.org/html/2606.00510#S1.F1)(c) suggests that episode-level feedback alone makes it difficult to determine the effect of an individual skill invocation. For example, a skill call may help complete the task, act as an unhelpful step, or have its negative effect corrected by later actions. The final outcome does not reliably indicate whether invoking a skill was helpful at the current decision point. It is therefore insufficient to learn skill invocation only from episode-level feedback or to address this problem with simple retrieval-based rules. Thus, the effect of a skill invocation should be assessed at the decision-point level.

Therefore, learning an effective skill invocation policy requires accounting for both the concentrated benefits of skill use and the ambiguity of trajectory-level feedback. This calls for learning signals that capture not only the overall utility of skill use for task completion, but also the local effectiveness of invoking a skill at a specific decision point. To this end, we propose SelSkill, a preference-learning framework for selective skill invocation. SelSkill uses the model's predictive uncertainty to guide the selection of candidate skill-decision points, and constructs contrastive training pairs by comparing skill invocation with skipping at these points. Furthermore, SelSkill combines episode-level outcome preferences with step-level invocation preferences, enabling the agent to more accurately determine when to invoke a skill and when to skip it. In summary, the main contributions of this paper are as follows:

- Systematic Analysis. We provide a detailed analysis of the limitations of existing skill invocation methods and formulate selective skill invocation as a skill-or-skip problem at each decision point, determining whether the agent should invoke a skill under the current state.
- Novel Optimization Framework. We propose SelSkill, a preference-learning framework for selective skill invocation, which optimizes the skill invocation policy by constructing invoke–skip contrastive pairs and combining episode-level and step-level preferences.
- Strong Empirical Results. On the ALFWorld benchmark, SelSkill improves the task success rate by 10.9 percentage points and execution precision by 29.1 percentage points. On the BFCL benchmark, SelSkill improves the task success rate by 5.7 percentage points and execution precision by 29.5 percentage points.

## 2 Related Work

### 2.1 From Tools and Experience to Skills

Language agents often use external tools and past experience to extend the base model. Prior work studies API invocation, function selection, and argument generation (Schick et al., [2023](https://arxiv.org/html/2606.00510#bib.bib36); Patil et al., [2024](https://arxiv.org/html/2606.00510#bib.bib37)). Later work abstracts tool chains or interaction traces into reusable procedural representations (Chen et al., [2026](https://arxiv.org/html/2606.00510#bib.bib38)). Building on these abstractions, skills provide a compact form of reusable experience while retaining part of the executability of tools. They package domain knowledge, applicability conditions, and executable or textual procedures (Jiang et al., [2026](https://arxiv.org/html/2606.00510#bib.bib30); Ling et al., [2026](https://arxiv.org/html/2606.00510#bib.bib56)). They can also be organized into structured libraries to support retrieval and controlled injection during agent execution (Wang et al., [2026](https://arxiv.org/html/2606.00510#bib.bib31)).

### 2.2 Skill Integration and Optimization

Existing skill-based agent methods mainly utilize skills in three ways. First, routing-based methods address skill selection in large libraries by matching the current context to a small set of candidate skills using routers, retrievers, or graph-based representations (Zheng et al., [2026](https://arxiv.org/html/2606.00510#bib.bib32); Liang et al., [2026](https://arxiv.org/html/2606.00510#bib.bib33); Liu et al., [2026](https://arxiv.org/html/2606.00510#bib.bib39)). Second, skill-library management methods maintain and expand the library by adding, revising, or pruning skills based on environment interaction (Yang et al., [2026](https://arxiv.org/html/2606.00510#bib.bib40); Ni et al., [2026](https://arxiv.org/html/2606.00510#bib.bib4); Mi et al., [2026](https://arxiv.org/html/2606.00510#bib.bib35); Ouyang et al., [2026](https://arxiv.org/html/2606.00510#bib.bib41)). Third, building on dynamically maintained skill or experience libraries, reinforcement-learning methods use retrieved reusable knowledge to guide exploration and provide behavior priors during policy optimization (Xia et al., [2026](https://arxiv.org/html/2606.00510#bib.bib2); Tu et al., [2026](https://arxiv.org/html/2606.00510#bib.bib3); Lu et al., [2026](https://arxiv.org/html/2606.00510#bib.bib1); Shi et al., [2026](https://arxiv.org/html/2606.00510#bib.bib51)). However, existing work mainly studies how to obtain, maintain, or use skills, while a relevant skill may still be unnecessary or harmful at a specific decision point, a concern also noted in recent analyses of skill-based agents (Li et al., [2026](https://arxiv.org/html/2606.00510#bib.bib64); Su et al., [2026](https://arxiv.org/html/2606.00510#bib.bib42)).

### 2.3 Selectivity in Related Agent Settings

Related studies have identified selectivity as a concern in agents that access external resources. In tool-augmented agents, models may invoke tools when they are not helpful or fail to use tool results effectively (Chen et al., [2025](https://arxiv.org/html/2606.00510#bib.bib7); Xu et al., [2025](https://arxiv.org/html/2606.00510#bib.bib46); Ross et al., [2025](https://arxiv.org/html/2606.00510#bib.bib44)). In memory-augmented agents, retrieved experience may not match the current task context (Xiong et al., [2025](https://arxiv.org/html/2606.00510#bib.bib47)). Concurrent work further learns proactive retrieval over an evolving experience base through paired retrieval/no-retrieval rollouts (Cai et al., [2026](https://arxiv.org/html/2606.00510#bib.bib63)). These studies collectively suggest that external assistance should not be used indiscriminately in practice.

## 3 Preliminary

We consider an agent that performs tasks in a multi-step environment. At step $t$, the agent conditions on a trajectory prefix $h_t$ and generates an action $a_t$. The prefix $h_t$ may include the task instruction, interaction history, environment observations, and previously returned tool or skill outputs.

In addition to ordinary actions, the agent has access to a fixed skill library $\mathcal{S}$. Each skill $s \in \mathcal{S}$ is a callable procedural module with lightweight metadata $m_s$ and full skill content $c_s$. The metadata includes the skill name and a short description, while the full content contains reusable knowledge, constraints, procedures, or action policies. We denote the visible metadata of the skill library as $M_{\mathcal{S}} = \{m_s : s \in \mathcal{S}\}$. At decision time, the agent policy is written as $\pi_\theta(a_t \mid h_t, M_{\mathcal{S}})$, where $a_t$ can be either an ordinary environment action or a skill invocation. The full skill content $c_s$ is not injected into the model context by default; it is loaded or executed only after the model explicitly invokes the corresponding skill. Specifically, *memory skills* return textual information such as strategy hints or API documentation, while *executable skills* encapsulate action or tool-call sequences. This paper does not study how to generate, modify, or improve the skills themselves. Instead, given a fixed skill library, we study *selective skill invocation*: deciding whether and when a relevant skill should intervene in a multi-step trajectory.
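The separation between visible metadata and on-demand skill content can be illustrated with a minimal sketch. The class names, the `build_context` helper, the prompt layout, and the `heat_object` skill are all hypothetical and chosen only for illustration; the paper does not specify how the library is stored or serialized.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Skill:
    name: str         # part of the metadata m_s
    description: str  # part of the metadata m_s
    content: str      # full skill content c_s, hidden until the skill is invoked


class SkillLibrary:
    """A fixed library S exposing metadata by default and content on demand."""

    def __init__(self, skills: List[Skill]) -> None:
        self._skills: Dict[str, Skill] = {s.name: s for s in skills}

    def visible_metadata(self) -> List[str]:
        """M_S: one 'name: description' line per skill."""
        return [f"{s.name}: {s.description}" for s in self._skills.values()]

    def invoke(self, name: str) -> str:
        """Return c_s for an explicit invocation (KeyError if the skill is unknown)."""
        return self._skills[name].content


def build_context(history: List[str], library: SkillLibrary) -> str:
    """Assemble the policy input from h_t and M_S only; no skill content is included."""
    return "\n".join(history + ["Available skills:"] + library.visible_metadata())


# Hypothetical example skill, not taken from the paper's library.
library = SkillLibrary([
    Skill(
        name="heat_object",
        description="How to heat an item with the microwave.",
        content="Step 1: pick up the item. Step 2: go to the microwave.",
    ),
])
context = build_context(["Task: heat the egg."], library)
```

Under this layout the policy sees only the skill name and description until it emits an invocation, at which point `invoke` supplies the content for the next step.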

![Refer to caption](https://arxiv.org/html/2606.00510v1/x2.png)

Figure 2: The overview of SelSkill. We construct episode-level trajectory preferences and entropy-guided decision-point preferences, and jointly optimize the policy for selective skill invocation.

For each benchmark, the skill library is constructed offline from the training split and remains fixed throughout training and evaluation; construction details are provided in Appendix [A](https://arxiv.org/html/2606.00510#A1).

## 4 Methodology

### 4.1 Overview

The overview of SelSkill is illustrated in Figure [2](https://arxiv.org/html/2606.00510#S3.F2). Our framework constructs two complementary preference signals, namely episode-level preferences and local decision-point preferences. These two signals guide the agent to balance overall task utility with the local effectiveness of skill invocation, as detailed in the following subsections.

### 4.2 Preference Construction

#### Episode\-level preferences\.

Episode\-level preferences provide a global task\-outcome signal\. For the same task, we sample multiple complete trajectories and group them according to final task success\. If one trajectory succeeds and another fails, we construct a preference pair:

$$(\tau^{+}, \tau^{-}), \tag{1}$$

where $\tau^{+}$ denotes a successful trajectory and $\tau^{-}$ denotes a failed trajectory. This pair indicates that the model should prefer the complete behavior sequence that solves the task.

This signal constrains the overall downstream utility of skill invocation\. It does not directly determine whether an individual skill call is necessary, but it identifies which complete trajectories are ultimately more effective\.
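A minimal sketch of episode-level pair construction follows. The `Trajectory` record and its fields are hypothetical, and pairing every success with every failure on the same task is an assumption: the text only states that a pair is formed when one trajectory succeeds and another fails.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    text: str      # serialized behavior sequence (hypothetical format)
    success: bool  # final task outcome


def episode_pairs(trajectories: List[Trajectory]) -> List[Tuple[Trajectory, Trajectory]]:
    """Return (tau_plus, tau_minus) pairs of successful vs. failed rollouts per task."""
    by_task: Dict[str, List[Trajectory]] = {}
    for traj in trajectories:
        by_task.setdefault(traj.task_id, []).append(traj)

    pairs: List[Tuple[Trajectory, Trajectory]] = []
    for group in by_task.values():
        wins = [t for t in group if t.success]
        losses = [t for t in group if not t.success]
        # Assumption: take the full cross product of successes and failures.
        pairs.extend(product(wins, losses))
    return pairs


rollouts = [
    Trajectory("task-a", "rollout 1", True),
    Trajectory("task-a", "rollout 2", False),
    Trajectory("task-a", "rollout 3", False),
    Trajectory("task-b", "rollout 1", True),  # no failure to compare against
]
pairs = episode_pairs(rollouts)
```

Tasks whose rollouts all succeed or all fail contribute no pairs, which is why `task-b` is dropped in this example.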

#### Local decision\-point preferences\.

A limitation of episode-level preferences is that they only provide trajectory-level success or failure feedback, making it difficult to assign credit to a specific skill invocation decision. To directly optimize local invocation decisions, we further construct local decision-point preferences. Specifically, for each rollout, we record token-level log-probabilities during generation and compute the predictive entropy at candidate skill-decision points. These candidate points cover uncertain states both after skill invocation and during ordinary generation. Given a trajectory prefix $h_t$, the predictive entropy is defined as:

$$H(h_t) = -\sum_{v} p_\theta(v \mid h_t) \log p_\theta(v \mid h_t), \tag{2}$$

where $v$ denotes a token in the vocabulary. A higher entropy indicates greater uncertainty in the model's subsequent generation. Motivated by prior findings that tool interactions can produce high-entropy decision points suitable for targeted branching (Dong et al., [2025](https://arxiv.org/html/2606.00510#bib.bib55); Chen et al., [2025](https://arxiv.org/html/2606.00510#bib.bib7)), we use entropy to prioritize candidate positions for local invoke/skip comparison during pair construction.

We further examine this heuristic through an entropy-fork analysis on ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2606.00510#bib.bib24)). As a diagnostic analysis, we create invoke/skip forks at actual skill-call positions and compare the token-level entropy of the two paths after the fork, as shown in Figure [3](https://arxiv.org/html/2606.00510#S4.F3). The invoke path with skill injection exhibits higher average token entropy in subsequent action prediction. This suggests that skill injection often increases the model's uncertainty when integrating the returned skill information into the next actions. We therefore use predictive entropy as a lightweight signal to prioritize candidate branch points that are more likely to produce informative invoke/skip comparisons.
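The entropy computation and the prioritization step can be sketched as follows. This assumes each candidate point comes with a full, normalized next-token distribution expressed as log-probabilities; the function names and the top-`k` selection rule are illustrative assumptions, since the text does not state how many points are kept or how ties are handled.

```python
import math
from typing import List, Sequence


def predictive_entropy(logprobs: Sequence[float]) -> float:
    """Entropy in nats of one next-token distribution given as log-probabilities."""
    total = 0.0
    for lp in logprobs:
        if lp == float("-inf"):
            continue  # zero-probability tokens contribute nothing
        total -= math.exp(lp) * lp
    return total


def rank_branch_points(step_logprobs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k highest-entropy candidate points, most uncertain first."""
    entropies = [predictive_entropy(lps) for lps in step_logprobs]
    # Ties are broken by position (an assumption, for determinism).
    order = sorted(range(len(entropies)), key=lambda i: (-entropies[i], i))
    return order[:k]


# Three toy candidate points over tiny vocabularies.
steps = [
    [math.log(0.5), math.log(0.5)],                                    # moderately uncertain
    [math.log(0.97), math.log(0.01), math.log(0.01), math.log(0.01)],  # confident
    [math.log(0.25)] * 4,                                              # most uncertain
]
top = rank_branch_points(steps, k=2)
```

With these toy distributions the uniform four-way step ranks first and the confident step is left out, which matches the intent of spending branching budget where the model is least certain.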

![Refer to caption](https://arxiv.org/html/2606.00510v1/x3.png)

Figure 3: Token entropy around invoke/skip points.

For each selected skill-decision point $(h_t, s)$, we construct two continuations: one that invokes skill $s$ and one that skips it. Both continuations start from the same prefix $h_t$ and differ only in the forced local decision. We roll out both branches to the end of the episode and assign labels by an outcome-efficiency rule. A successful continuation is preferred over a failed one. If both continuations succeed, we prefer the shorter one, measured by the number of environment steps after the branch. If both continuations fail, we discard the branch. This keeps redundant-but-successful skill calls in the training signal and encourages the model to skip unnecessary skills. Each retained branch yields a local preference pair:

$$(c_t^{+}, c_t^{-} \mid h_t, s), \tag{3}$$

where $c_t^{+}$ and $c_t^{-}$ denote the preferred and dispreferred continuations, respectively, under this outcome-efficiency ordering.
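The outcome-efficiency rule can be written as a small labeling function. The `Continuation` record is hypothetical, and discarding branches where both continuations succeed with the same number of steps is an assumption: the text does not say how such ties are treated.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Continuation:
    invoked: bool  # True for the forced-invoke branch, False for the forced-skip branch
    success: bool  # final task outcome of the rollout
    steps: int     # environment steps taken after the branch point


def label_branch(
    a: Continuation, b: Continuation
) -> Optional[Tuple[Continuation, Continuation]]:
    """Return (preferred, dispreferred), or None if the branch is discarded."""
    if a.success != b.success:
        return (a, b) if a.success else (b, a)  # success beats failure
    if not a.success:
        return None  # both failed: no usable signal
    if a.steps == b.steps:
        return None  # assumption: equal-length successes carry no preference
    return (a, b) if a.steps < b.steps else (b, a)  # shorter success wins


invoke = Continuation(invoked=True, success=True, steps=8)
skip = Continuation(invoked=False, success=True, steps=5)
pair = label_branch(invoke, skip)
```

In this example both branches succeed but skipping finishes in fewer steps, so the skip continuation is preferred; this is the redundant-but-successful case the rule is meant to capture.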

This construction differs from episode-level pairing because the two continuations share the same history $h_t$ and are generated by an explicit invoke/skip intervention at the branch point. Thus, although the label is still evaluated by downstream outcome and efficiency, the comparison controls for the pre-branch trajectory and isolates the immediate skill-or-skip choice more directly than pairing independently sampled complete trajectories.

To focus this signal on the local invoke/skip decision, we compute the DPO loss only within a local window after the branching point. Specifically, we apply a local loss mask $M_t^{(n)}$ to the continuation and keep only the next $n$ assistant turns after the branch for local DPO. This makes the gradients more directly target the short-term consequences of the invoke/skip decision rather than the entire continuation of each branch.
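A token-level version of the local loss mask might look like the sketch below. The turn representation (a role label plus a token count per turn) and the role string `"assistant"` are assumptions for illustration; `n=None` stands in for the "all" setting that keeps every post-branch assistant turn.

```python
from typing import List, Optional, Sequence


def local_loss_mask(
    turn_roles: Sequence[str],
    turn_lengths: Sequence[int],
    n: Optional[int],
) -> List[int]:
    """Token mask over a post-branch continuation.

    Tokens in the first n assistant turns get 1; all other tokens get 0.
    Passing n=None keeps every assistant turn.
    """
    mask: List[int] = []
    assistant_turns_seen = 0
    for role, length in zip(turn_roles, turn_lengths):
        is_assistant = role == "assistant"
        keep = is_assistant and (n is None or assistant_turns_seen < n)
        if is_assistant:
            assistant_turns_seen += 1
        mask.extend([1 if keep else 0] * length)
    return mask


# Post-branch continuation: assistant, tool output, assistant, user, assistant.
roles = ["assistant", "tool", "assistant", "user", "assistant"]
lengths = [2, 1, 3, 1, 2]
mask_n2 = local_loss_mask(roles, lengths, n=2)
mask_all = local_loss_mask(roles, lengths, n=None)
```

Non-assistant tokens such as tool or environment outputs are always masked out here, so only model-generated tokens inside the window contribute to the local loss.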

Local decision\-point preferences concentrate the learning signal around specific invoke/skip decisions, while episode\-level preferences provide full\-trajectory constraints\.

### 4.3 Preference Optimization

We optimize the constructed preference data using Direct Preference Optimization (DPO) (Rafailov et al., [2023](https://arxiv.org/html/2606.00510#bib.bib50)). For a conditioning input $z$ and an output $y$, we define $r_\theta(z, y) = \log \pi_\theta(y \mid z) - \log \pi_{\mathrm{ref}}(y \mid z)$, where $\pi_\theta$ is the trainable model and $\pi_{\mathrm{ref}}$ is the reference model. Given a preference pair $(y^{+}, y^{-})$, the DPO loss is

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(z, y^{+}, y^{-}) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta r_\theta(z, y^{+}) - \beta r_\theta(z, y^{-}) \Big) \Big], \tag{4}$$

where $\beta$ is the DPO temperature.

We merge episode-level preference pairs and local decision-point preference pairs into the training data and optimize them with the same DPO objective. For episode-level pairs, $z$ is the task context, while $y^{+}$ and $y^{-}$ are the successful and failed trajectories. For local pairs, the model is conditioned on the pre-branch history $h_t$ and the full skill metadata list, while $s$ only identifies the candidate skill used to create the forced invoke/skip fork. The paired continuations $(y^{+}, y^{-})$ are then ordered by the outcome-efficiency rule. During optimization, $M_t^{(n)}$ masks out tokens outside the first $n$ assistant turns after the branch, so local DPO trains only on the selected assistant-generated tokens.
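For a single preference pair, the DPO objective and the effect of the mask can be sketched in plain Python. The value `beta=0.1` is an assumed default rather than the paper's setting, and treating the sequence log-probability as a masked sum of token log-probabilities is an assumption about how the mask enters the loss.

```python
import math
from typing import Optional, Sequence


def sequence_logp(token_logps: Sequence[float], mask: Optional[Sequence[int]] = None) -> float:
    """Sum of token log-probabilities, restricted to mask == 1 when a mask is given."""
    if mask is None:
        return sum(token_logps)
    return sum(lp for lp, m in zip(token_logps, mask) if m)


def dpo_loss(
    pi_pos: float, ref_pos: float, pi_neg: float, ref_neg: float, beta: float = 0.1
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (r(z, y+) - r(z, y-)))."""
    r_pos = pi_pos - ref_pos  # r_theta(z, y+)
    r_neg = pi_neg - ref_neg  # r_theta(z, y-)
    margin = beta * (r_pos - r_neg)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Local pair: only the first two tokens fall inside the loss window.
window = [1, 1, 0]
pi_pos = sequence_logp([-1.0, -2.0, -3.0], window)
loss_at_init = dpo_loss(pi_pos, ref_pos=pi_pos, pi_neg=-4.0, ref_neg=-4.0)
```

When the policy equals the reference, both implicit rewards are zero and the loss is $\log 2$; it falls below that value once the policy raises the preferred continuation relative to the dispreferred one.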

## 5 Experiments

### 5.1 Experimental Setup

#### Benchmarks.

We evaluate SelSkill on four benchmarks. ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2606.00510#bib.bib24)) evaluates multi-step embodied decision making, and BFCL (Patil et al., [2025](https://arxiv.org/html/2606.00510#bib.bib60)) evaluates multi-turn function calling. We further use PopQA (Mallen et al., [2023](https://arxiv.org/html/2606.00510#bib.bib62)) and Tau-bench (Yao et al., [2024](https://arxiv.org/html/2606.00510#bib.bib61)) to evaluate out-of-domain transfer.

#### Backbones\.

We use models with basic task-solving ability and valid skill-call formats, allowing SelSkill to focus on selective invocation. For ALFWorld, since raw models are unstable in environment interaction, we train no-skill Qwen3-4B/8B (Yang et al., [2025](https://arxiv.org/html/2606.00510#bib.bib29)) policies with GRPO (Shao et al., [2024](https://arxiv.org/html/2606.00510#bib.bib20)) as RL-Init, and enable skills afterward. For BFCL, we use Qwen3-14B as Base, which already supports reliable function-style skill calls through prompting without additional RL initialization. Details are provided in Appendix [G](https://arxiv.org/html/2606.00510#A7).

#### Baselines\.

We compare SelSkill with No-Skill, skill-enabled baselines without selective-invocation training (RL-Init w/ Skill for ALFWorld and Base w/ Skill for BFCL), and signal ablations (Episode only, Entropy-local only, and Skill-call-local only). For out-of-domain evaluation, we compare No-Skill with +Skill, which enables target-benchmark skills without additional training. For completeness, Appendix [D](https://arxiv.org/html/2606.00510#A4) reports additional engineering baselines to examine whether context optimization or rule-based invocation can address this problem.

#### Metrics\.

We report the following metrics. SR measures the task success rate. Exec. Prec. measures whether a skill invocation is valid and successfully completed. For executable skills, this means the skill can interact with the environment without execution errors caused by unmet preconditions, invalid arguments, or malformed calls; for memory skills, this means the call returns valid content for subsequent generation. SR@Invoke and SR@Skip report success rates conditioned on whether an episode invokes at least one skill. Their changes provide an indirect view of how selectively the model invokes skills across different trajectory states. We additionally report Skill/ep as an indicator of invocation frequency and Avg. Steps as an indicator of trajectory efficiency.
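One possible way to compute these metrics from per-episode logs is sketched below. The `Episode` record is hypothetical, and pooling Exec. Prec. over all skill calls (rather than averaging per episode) is an assumption, since the aggregation is not specified in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Episode:
    success: bool
    skill_calls: int        # number of skill invocations in the episode
    valid_skill_calls: int  # invocations that were valid and completed
    steps: int              # environment steps taken


def _pct(numerator: int, denominator: int) -> Optional[float]:
    """Percentage, or None when the conditioning set is empty."""
    return 100.0 * numerator / denominator if denominator else None


def summarize(episodes: List[Episode]) -> Dict[str, Optional[float]]:
    if not episodes:
        raise ValueError("need at least one episode")
    invoked = [e for e in episodes if e.skill_calls > 0]
    skipped = [e for e in episodes if e.skill_calls == 0]
    total_calls = sum(e.skill_calls for e in episodes)
    return {
        "SR": _pct(sum(e.success for e in episodes), len(episodes)),
        # Assumption: precision is pooled over every call in the evaluation set.
        "Exec. Prec.": _pct(sum(e.valid_skill_calls for e in episodes), total_calls),
        "SR@Invoke": _pct(sum(e.success for e in invoked), len(invoked)),
        "SR@Skip": _pct(sum(e.success for e in skipped), len(skipped)),
        "Skill/ep": total_calls / len(episodes),
        "Avg. Steps": sum(e.steps for e in episodes) / len(episodes),
    }


log = [
    Episode(success=True, skill_calls=1, valid_skill_calls=1, steps=10),
    Episode(success=False, skill_calls=2, valid_skill_calls=1, steps=20),
    Episode(success=True, skill_calls=0, valid_skill_calls=0, steps=12),
    Episode(success=False, skill_calls=0, valid_skill_calls=0, steps=30),
]
report = summarize(log)
```

A No-Skill run has no invocations, so the invocation-conditioned entries come back as `None`, mirroring the empty cells for No-Skill rows in the results table.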

| Method | SR (↑) | Exec. Prec. (↑) | SR@Invoke (↑) | SR@Skip (↑) | Skill/ep | Avg. Steps (↓) |
| --- | --- | --- | --- | --- | --- | --- |
| *ALFWorld benchmark* | | | | | | |
| Qwen3-4B (No-Skill) | 66.4 | - | - | - | - | 26.2 |
| +RL-Init w/ Skill | 69.5 | 87.8 | 81.4 | 63.5 | 0.58 | 23.8 |
| +SelSkill Round1 | 73.4 | 81.6 | 86.8 | 64.0 | 0.68 | 21.5 |
| +SelSkill Round2 | 77.3 | 96.0 | 92.6 | 66.2 | 0.59 | 21.3 |
| Qwen3-8B (No-Skill) | 78.9 | - | - | - | - | 22.3 |
| +RL-Init w/ Skill | 75.8 | 70.9 | 73.7 | 78.8 | 2.55 | 24.0 |
| +SelSkill Round1 | 82.8 | 94.1 | 90.7 | 78.8 | 0.66 | 19.7 |
| +SelSkill Round2 | 85.9 | 96.6 | 91.4 | 83.9 | 0.46 | 16.3 |
| +SelSkill Round3 | 86.7 | 100.0 | 97.0 | 83.2 | 0.44 | 16.9 |
| *BFCL benchmark* | | | | | | |
| Qwen3-14B (No-Skill) | 14.1 | - | - | - | - | 24.2 |
| +Base w/ Skill | 18.5 | 44.0 | 13.4 | 22.8 | 0.73 | 18.2 |
| +SelSkill Round1 | 22.6 | 69.7 | 14.8 | 30.8 | 0.92 | 14.1 |
| +SelSkill Round2 | 24.2 | 73.5 | 18.2 | 32.4 | 1.01 | 14.2 |

Table 1: Performance comparison of different baselines on ALFWorld and BFCL benchmarks.

### 5.2 Main Results

#### Performance on ALFWorld.

Table [1](https://arxiv.org/html/2606.00510#S5.T1) reports the main results on ALFWorld. The experimental results can be summarized in three points:

(i) Necessity of selective invocation. Simply enabling skills on RL-Init does not lead to consistent performance gains: SR rises from 66.4% to 69.5% for Qwen3-4B but falls from 78.9% to 75.8% for Qwen3-8B. This inconsistency suggests that the model cannot yet reliably determine when and how to invoke skills.

(ii) Improved invocation reliability. Compared with skill-enabled initialization, SelSkill substantially improves execution precision and shortens trajectories overall. For Qwen3-8B, Exec. Prec. increases from 70.9% to 100.0%, while Avg. Steps decreases from 24.0 to 16.9 by Round3.

(iii) A narrow but reliable invocation boundary. For Qwen3-8B, SelSkill Round3 achieves 97.0% SR@Invoke while reducing Skill/ep from 2.55 to 0.44, suggesting that the model learns a more selective and reliable invocation policy. Qwen3-4B likewise improves SR and SR@Invoke, while keeping its invocation frequency close to the skill-enabled initialization.

#### Performance on BFCL.

Table [1](https://arxiv.org/html/2606.00510#S5.T1) also reports the main results on BFCL. The experimental results can be summarized in three points:

(i) Incremental value of skill access. Enabling skills improves SR from 14.1% to 18.5%, showing that skills can provide additional value in multi-turn function-calling tasks, while leaving room for learning more reliable invocation behavior.

(ii) Execution precision as a key bottleneck. Base w/ Skill achieves only 44.0% Exec. Prec., indicating that many skill calls are not validly and successfully executed.

(iii) Higher-quality invocation. Compared with Base w/ Skill, SelSkill Round2 improves SR from 18.5% to 24.2% and Exec. Prec. from 44.0% to 73.5%, while reducing Avg. Steps from 18.2 to 14.2. Notably, Skill/ep increases from 0.73 to 1.01 rather than decreasing. This suggests that SelSkill does not simply suppress skill use; instead, it enables more reliable invocations, higher task success, and shorter trajectories.

### 5.3 Ablation Study

| Setting | SR (↑) | Skill/ep | Exec. Prec. (↑) |
| --- | --- | --- | --- |
| *Episode-level* | | | |
| Episode only (standard DPO) | 75.0 | 1.30 | 75.4 |
| *Step-level* | | | |
| Entropy-local only | 70.3 | 1.77 | 81.0 |
| Skill-call-local only | 80.5 | 4.29 | 41.7 |
| *Episode-level + Step-level* | | | |
| SelSkill ($n=1$) | 79.7 | 0.80 | 91.2 |
| SelSkill ($n=3$) | 82.8 | 0.66 | 94.1 |
| SelSkill ($n=\mathrm{all}$) | 82.0 | 0.72 | 92.6 |

Table 2: Ablation study on ALFWorld benchmark using Qwen3-8B. The Episode-level + Step-level rows correspond to mixed-signal training, and $n$ denotes the number of post-branch assistant turns covered by the loss mask.

| Method | Overall EM (↑) | High-pop EM (↑) | High-pop Skill Rate | Mid-pop EM (↑) | Mid-pop Skill Rate | Low-pop EM (↑) | Low-pop Skill Rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base | 20.1 | 42.2 | - | 8.4 | - | 9.6 | - |
| +Skill | 61.0 | 58.4 | 78% | 62.0 | 93% | 62.7 | 94% |
| SelSkill | 62.9 | 60.2 | 57% | 63.9 | 87% | 64.5 | 87% |

Table 3: OOD transfer on PopQA benchmark.

We conduct ablation experiments on ALFWorld with Qwen3-8B to examine the contributions of different preference signals. All variants use the same one-round training setting as SelSkill Round1. SelSkill combines episode-level preferences, which compare successful and failed trajectories, with entropy-guided step-level preferences, which supervise local invoke/skip decisions. The loss-mask parameter $n$ controls how many post-branch assistant turns are included in the local training objective. Table [2](https://arxiv.org/html/2606.00510#S5.T2) reports the results, which can be summarized in three points:

(i) Episode-level supervision is insufficient. Episode only, which corresponds to standard DPO using only episode-level preference pairs, achieves 75.0% SR and 75.4% Exec. Prec. This suggests that trajectory-level preference learning alone cannot reliably supervise local skill invocation decisions.

(ii) Local supervision alone is unbalanced. Entropy-local only achieves relatively high Exec. Prec. but low SR, while Skill-call-local only improves SR to 80.5% but reduces Exec. Prec. to 41.7% with 4.29 Skill/ep. This indicates that local supervision alone may improve one aspect of invocation behavior while sacrificing others.

(iii) Mixed signals achieve the best balance. SelSkill with $n=3$ achieves the highest SR of 82.8% and Exec. Prec. of 94.1%, with 0.66 Skill/ep. These results show that combining episode-level and step-level preferences improves both task success and invocation quality, rather than relying on more frequent skill calls.

### 5\.4Out\-of\-Domain Generalization

A key question is whether the selective invocation ability learned on BFCL can generalize to new domains with previously unseen skills\. We evaluate BFCL SelSkill Round2 in a zero\-shot manner on two OOD benchmarks that are not seen during training\. PopQA uses Wikipedia retrieval skills for knowledge\-intensive question answering, while Tau\-bench evaluates multi\-turn service\-oriented agent tasks with benchmark\-specific skills and a GPT\-4\.1 user simulator\. Tables[3](https://arxiv.org/html/2606.00510#S5.T3)and[4](https://arxiv.org/html/2606.00510#S5.T4)report the results on PopQA and Tau\-bench, respectively\. The results can be summarized in three points:

(i) Generalization across domains and unseen skills. Neither benchmark uses the BFCL skills available during training, so this evaluation tests the transfer of *invocation judgment* rather than memorization of specific skill knowledge. On both OOD benchmarks, enabling benchmark-specific skills improves performance over the no-skill baseline, showing that the newly provided skills are useful in their target domains. SelSkill further improves over the corresponding +Skill baseline, suggesting that its learned invocation criterion transfers to new domains and previously unseen skills.

(ii) Selective retrieval on PopQA. In this setting, Skill Rate denotes the percentage of examples where the model invokes a Wikipedia retrieval skill. Since PopQA is single-turn, Skill Rate is equivalent to Skill/ep. We use entity popularity as a rough proxy for how likely the answer is to be covered by the model’s parametric knowledge: high-popularity entities are more likely to be internalized by the model, while mid- and low-popularity entities typically require external retrieval more often. Table [3](https://arxiv.org/html/2606.00510#S5.T3) shows that SelSkill improves EM across all popularity groups while reducing Skill Rate in an intuitive way. For high-popularity entities, Skill Rate drops substantially from 78% to 57%, while EM improves from 58.4 to 60.2. In contrast, for mid- and low-popularity entities, SelSkill only slightly reduces Skill Rate, from 93% and 94% to 87%, while improving EM from 62.0/62.7 to 63.9/64.5. This indicates that SelSkill does not simply suppress skill use; instead, it tends to preserve retrieval for examples that likely require external knowledge and skip unhelpful calls when parametric knowledge is more likely to suffice.
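The per-bucket quantities in Table 3 can be computed as in the sketch below. The record format (`bucket`, `correct`, `invoked`) and the function name `em_and_skill_rate` are hypothetical; the paper does not describe its evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable


def em_and_skill_rate(examples: Iterable[dict]) -> Dict[str, Dict[str, float]]:
    """Aggregate exact match (EM) and Skill Rate per popularity bucket.

    Each example is a dict with (assumed) keys:
      bucket  -- popularity group, e.g. "high", "mid", "low"
      correct -- True if the answer is an exact match
      invoked -- True if a retrieval skill was called for this example
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "invoked": 0})
    for ex in examples:
        t = totals[ex["bucket"]]
        t["n"] += 1
        t["correct"] += int(ex["correct"])
        t["invoked"] += int(ex["invoked"])
    # Both metrics are percentages over the examples in the bucket.
    return {
        bucket: {
            "em": 100.0 * t["correct"] / t["n"],
            "skill_rate": 100.0 * t["invoked"] / t["n"],
        }
        for bucket, t in totals.items()
    }
```

Because PopQA is single-turn, each example contributes at most one invocation, which is why Skill Rate and Skill/ep coincide in this setting.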

| Method | Avg. (↑) | Air. (↑) | Ret. (↑) | SR@Inv./Skip (↑) |
| --- | --- | --- | --- | --- |
| Base | 31.9 | 22.0 | 41.7 | — / — |
| +Skill | 39.6 | 34.0 | 45.2 | 41.7 / 47.8 |
| SelSkill | 41.4 | 34.0 | 48.7 | 50.0 / 49.2 |

Table 4: OOD transfer on Tau-bench. Avg. is mean pass@1 over airline (Air.) and retail (Ret.) domains.

(iii) More reliable invocation on Tau-bench. On Tau-bench, SelSkill transfers to interactive service tasks with different APIs, policy constraints, and unseen skills. Compared with +Skill, it improves average pass@1 from 39.6 to 41.4. Both Avg. and SR@Invoke are macro-averaged across airline and retail; the latter increases from 41.7 to 50.0, indicating that invoked episodes succeed more often in new settings.

## 6 Analyses

#### Gradient localization\.

To examine whether step-level preferences provide more localized supervision for skill invocation, we compare gradient peak positions between ALFWorld episode-level pairs from complete trajectories and entropy-guided step-level pairs branched at high-uncertainty skill-decision points. For each pair, we align the skill-call position to 0, record the token position with the largest gradient norm, and visualize its distribution using kernel density estimation. The vertical axis in Figure [4](https://arxiv.org/html/2606.00510#S6.F4) represents the estimated density of gradient peak positions.
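The measurement described above can be sketched in a few lines. The sketch assumes that per-token gradient norms have already been computed for each preference pair; the helper names `peak_offset` and `gaussian_kde`, and the fixed bandwidth, are illustrative choices rather than details taken from the paper.

```python
import math
from typing import List, Sequence


def peak_offset(grad_norms: Sequence[float], skill_call_pos: int) -> int:
    """Token position of the largest gradient norm, with the
    skill-call token aligned to position 0."""
    peak = max(range(len(grad_norms)), key=lambda i: grad_norms[i])
    return peak - skill_call_pos


def gaussian_kde(
    samples: Sequence[float], grid: Sequence[float], bandwidth: float
) -> List[float]:
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


# Hypothetical per-pair gradient norms and skill-call positions.
pairs = [([0.1, 0.4, 2.0, 0.3], 2), ([0.2, 1.5, 0.9, 0.1], 2)]
offsets = [peak_offset(norms, pos) for norms, pos in pairs]
density = gaussian_kde(offsets, grid=[-2, -1, 0, 1, 2], bandwidth=1.0)
```

A distribution of offsets concentrated near 0 corresponds to supervision that lands on the skill-call region; a flatter distribution corresponds to gradient peaks scattered across the trajectory.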

Figure [4](https://arxiv.org/html/2606.00510#S6.F4) shows that episode-level preferences produce more dispersed gradient peaks, whereas step-level preferences concentrate them around the skill-call region. This suggests that episode-level preferences provide broad trajectory-level guidance, while step-level preferences more directly supervise local invoke/skip decisions. The two signals thus provide complementary supervision for selective skill invocation, consistent with the advantage of mixed-signal training in Table [2](https://arxiv.org/html/2606.00510#S5.T2).

![Refer to caption](https://arxiv.org/html/2606.00510v1/x4.png)

Figure 4: Step-level preferences produce more localized gradient peaks around the skill-call token.

#### Additional analyses.

Appendix [B](https://arxiv.org/html/2606.00510#A2) uses counterfactual comparisons to examine when skill calls are useful. Skill invocation improves only a small fraction of trajectories, and harmful calls can still be semantically plausible. Appendix [C](https://arxiv.org/html/2606.00510#A3) examines episode-level reinforcement learning and finds that it does not reliably calibrate local invocation decisions. Skill usage remains unstable even when overall task success improves. Appendix [D](https://arxiv.org/html/2606.00510#A4) compares SelSkill with context injection, conservative prompting, and an explicit skip option. These simple alternatives do not replace selectivity training. Appendix [E](https://arxiv.org/html/2606.00510#A5) evaluates entropy-guided branching. It selects more informative local comparisons than random branching and achieves comparable performance to all-skill branching at lower sampling cost. Appendix [F](https://arxiv.org/html/2606.00510#A6) tests robustness under injected distractor skills. SelSkill almost never invokes noise skills and shows only modest degradation with the largest skill listing. Finally, Appendix [H](https://arxiv.org/html/2606.00510#A8) presents cases in which skills are unnecessary, invoked before their preconditions are met, or beneficial when invoked at the appropriate time.

## 7 Conclusion

We formulate selective skill invocation as a *skill-or-skip* decision and propose SelSkill, which learns selective invocation policies from combined episode- and step-level preferences. On ALFWorld and BFCL, SelSkill improves task success over skill-enabled baselines by 10.9 and 5.7 percentage points, and execution precision by 29.1 and 29.5 percentage points, respectively. Zero-shot results on PopQA and Tau-bench further suggest that this selectivity transfers to new domains with previously unseen skills.

## Limitations

This work studies selective skill invocation with a fixed, offline\-constructed skill library; its construction and evolution are beyond our scope\. We use predictive entropy as a lightweight heuristic for prioritizing candidate decision points, leaving alternative selection criteria to future study\. Due to computational constraints, we do not evaluate substantially larger models\. Our benchmarks do not fully capture deployments with irreversible external effects; safer and more controllable skill invocation remains future work\.

## References

- Y. Cai, J. Zhou, Q. Chen, and L. He (2026). Ask only when needed: proactive retrieval from memory and skills for experience-driven lifelong agents. CoRR abs/2604.20572. [Link](https://doi.org/10.48550/arXiv.2604.20572).
- S. Chen, J. Gai, R. Zhou, J. Zhang, T. Zhu, J. Li, K. Wang, Z. Wang, Z. Chen, K. Kaleb, N. Miao, S. Gao, C. Lu, M. Li, J. He, and Y. W. Teh (2026). SkillCraft: can LLM agents learn to use tools skillfully? CoRR abs/2603.00718. [Link](https://doi.org/10.48550/arXiv.2603.00718).
- Y. Chen, G. Dong, and Z. Dou (2025). Toward effective tool-integrated reasoning via self-evolved preference learning. CoRR abs/2509.23285. [Link](https://doi.org/10.48550/arXiv.2509.23285).
- G. Dong, H. Mao, K. Ma, L. Bao, Y. Chen, Z. Wang, Z. Chen, J. Du, H. Wang, F. Zhang, G. Zhou, Y. Zhu, J. Wen, and Z. Dou (2025). Agentic reinforced policy optimization. CoRR abs/2507.19849. [Link](https://doi.org/10.48550/arXiv.2507.19849).
- Y. Jiang, D. Li, H. Deng, B. Ma, X. Wang, Q. Wang, and G. Yu (2026). SoK: agentic skills - beyond tool use in LLM agents. CoRR abs/2602.20867. [Link](https://doi.org/10.48550/arXiv.2602.20867).
- X. Li, W. Chen, Y. Liu, S. Zheng, X. Chen, Y. He, Y. Li, B. You, H. Shen, J. Sun, S. Wang, B. Li, Q. Zeng, D. Wang, X. Zhao, Y. Wang, R. B. Chaim, Z. Di, Y. Gao, J. He, Y. He, L. Jing, L. Kong, X. Lan, J. Li, S. Li, Y. Li, Y. Lin, X. Liu, X. Liu, H. Lyu, Z. Ma, B. Wang, R. Wang, T. Wang, W. Ye, Y. Zhang, H. Xing, Y. Xue, S. Dillmann, and H. Lee (2026). SkillsBench: benchmarking how well agent skills work across diverse tasks. arXiv:2602.12670. [Link](https://arxiv.org/abs/2602.12670).
- Y. Liang, R. Zhong, H. Xu, C. Jiang, Y. Zhong, R. Fang, J. Gu, S. Deng, Y. Yao, M. Wang, S. Qiao, X. Xu, T. Wu, K. Wang, Y. Liu, Z. Bi, J. Lou, Y. E. Jiang, H. Zhu, G. Yu, H. Hong, L. Huang, H. Xue, C. Wang, Y. Wang, Z. Shan, X. Chen, Z. Tu, F. Xiong, X. Xie, P. Zhang, Z. Gui, L. Liang, J. Zhou, C. Wu, J. Shang, Y. Gong, J. Lin, C. Xu, H. Deng, W. Zhang, K. Ding, Q. Zhang, F. Huang, N. Zhang, J. Z. Pan, G. Qi, H. Wang, and H. Chen (2026). SkillNet: create, evaluate, and connect AI skills. CoRR abs/2603.04448. [Link](https://doi.org/10.48550/arXiv.2603.04448).
- G. Ling, S. Zhong, and R. Huang (2026). Agent skills: A data-driven analysis of claude skills for extending large language model functionality. CoRR abs/2602.08004. [Link](https://doi.org/10.48550/arXiv.2602.08004).
- D. Liu, Z. Li, H. Du, X. Wu, S. Gui, Y. Kuang, and L. Sun (2026). Graph of skills: dependency-aware structural retrieval for massive agent skills. CoRR abs/2604.05333. [Link](https://doi.org/10.48550/arXiv.2604.05333).
- Z. Lu, Z. Yao, J. Wu, C. Han, Q. Gu, X. Cai, W. Lu, J. Xiao, Y. Zhuang, and Y. Shen (2026). SKILL0: in-context agentic reinforcement learning for skill internalization. CoRR abs/2604.02268. [Link](https://doi.org/10.48550/arXiv.2604.02268).
- A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi (2023). When not to trust language models: investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, A. Rogers, J. L. Boyd-Graber, and N. Okazaki (Eds.), pp. 9802–9822. [Link](https://doi.org/10.18653/v1/2023.acl-long.546).
- Q. Mi, Z. Ma, M. Yang, H. Li, Y. Wang, H. Zhang, and J. Wang (2026). Skill-pro: learning reusable skills from experience via non-parametric ppo for llm agents. arXiv:2602.01869. [Link](https://arxiv.org/abs/2602.01869).
- J. Ni, Y. Liu, X. Liu, Y. Sun, M. Zhou, P. Cheng, D. Wang, E. Zhao, X. Jiang, and G. Jiang (2026). Trace2Skill: distill trajectory-local lessons into transferable agent skills. CoRR abs/2603.25158. [Link](https://doi.org/10.48550/arXiv.2603.25158).
- S. Ouyang, J. Yan, Y. Chen, R. Han, Z. Wang, B. D. Mishra, R. Meng, C. Li, Y. Jiao, K. Zha, M. Shen, V. Tirumalashetty, G. Lee, J. Han, T. Pfister, and C. Lee (2026). SkillOS: learning skill curation for self-evolving agents. arXiv:2605.06614. [Link](https://arxiv.org/abs/2605.06614).
- S. G. Patil, H. Mao, F. Yan, C. C. Ji, V. Suresh, I. Stoica, and J. E. Gonzalez (2025). The berkeley function calling leaderboard (BFCL): from tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu (Eds.), Proceedings of Machine Learning Research. [Link](https://proceedings.mlr.press/v267/patil25a.html).
- S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez (2024). Gorilla: large language model connected with massive apis. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang (Eds.). [Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/e4c61f578ff07830f5c37378dd3ecb0d-Abstract-Conference.html).
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023). Direct preference optimization: your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.). [Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html).
- H. Ross, A. S. Mahabaleshwarkar, and Y. Suhara (2025). When2Call: when (not) to call tools. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 - Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 - May 4, 2025, L. Chiruzzo, A. Ritter, and L. Wang (Eds.), pp. 3391–3409. [Link](https://doi.org/10.18653/v1/2025.naacl-long.174).
- T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023). Toolformer: language models can teach themselves to use tools. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.). [Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html).
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300. [Link](https://arxiv.org/abs/2402.03300).
- Y. Shi, Y. Chen, Z. Lu, Y. Miao, S. Liu, Q. GU, X. Cai, X. Wang, and A. Zhang (2026). Skill1: unified evolution of skill-augmented agents via reinforcement learning. arXiv:2605.06130. [Link](https://arxiv.org/abs/2605.06130).
- M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. J. Hausknecht (2021). ALFWorld: aligning text and embodied environments for interactive learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. [Link](https://openreview.net/forum?id=0IOX0YcCdTn).
- W. Su, J. Long, Q. Ai, Y. Tang, C. Wang, Y. Tu, and Y. Liu (2026). Skill retrieval augmentation for agentic AI. CoRR abs/2604.24594. [Link](https://doi.org/10.48550/arXiv.2604.24594).
- S. Tu, C. Xu, Q. Zhang, Y. Zhang, X. Lan, L. Li, D. Li, and D. Zhao (2026). Dynamic dual-granularity skill bank for agentic rl. arXiv:2603.28716. [Link](https://arxiv.org/abs/2603.28716).
- C. Wang, Z. Yu, X. Xie, W. Yao, R. Fang, S. Qiao, K. Cao, G. Zheng, X. Qi, P. Zhang, and S. Deng (2026). SkillX: automatically constructing skill knowledge bases for agents. CoRR abs/2604.04804. [Link](https://doi.org/10.48550/arXiv.2604.04804).
- L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, and F. Wei (2022). Text embeddings by weakly-supervised contrastive pre-training. CoRR abs/2212.03533. [Link](https://doi.org/10.48550/arXiv.2212.03533).
- Z. Z. Wang, A. Gandhi, G. Neubig, and D. Fried (2025a). Inducing programmatic skills for agentic tasks. CoRR abs/2504.06821. [Link](https://doi.org/10.48550/arXiv.2504.06821).
- Z. Z. Wang, J. Mao, D. Fried, and G. Neubig (2025b). Agent workflow memory. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu (Eds.), Proceedings of Machine Learning Research. [Link](https://proceedings.mlr.press/v267/wang25bx.html).
- P. Xia, J. Chen, H. Wang, J. Liu, K. Zeng, Y. Wang, S. Han, Y. Zhou, X. Zhao, H. Chen, Z. Zheng, C. Xie, and H. Yao (2026). SkillRL: evolving agents via recursive skill-augmented reinforcement learning. arXiv:2602.08234. [Link](https://arxiv.org/abs/2602.08234).
- Z. Xiong, Y. Lin, W. Xie, P. He, Z. Liu, J. Tang, H. Lakkaraju, and Z. Xiang (2025). How memory management impacts llm agents: an empirical study of experience-following behavior. arXiv:2505.16067. [Link](https://arxiv.org/abs/2505.16067).
- H. Xu, Z. Wang, Z. Zhu, L. Pan, X. Chen, S. Fan, L. Chen, and K. Yu (2025). Alignment for efficient tool calling of large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), pp. 17776–17792. [Link](https://doi.org/10.18653/v1/2025.emnlp-main.898).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025). Qwen3 technical report. arXiv:2505.09388. [Link](https://arxiv.org/abs/2505.09388).
- Y. Yang, J. Li, Q. Pan, B. Zhan, Y. Cai, L. Du, J. Zhou, K. Chen, Q. Chen, X. Li, B. Zhang, and L. He (2026). AutoSkill: experience-driven lifelong learning via skill self-evolution. CoRR abs/2603.01145. [Link](https://doi.org/10.48550/arXiv.2603.01145).
- S. Yao, N. Shinn, P. Razavi, and K. Narasimhan (2024). τ-bench: A benchmark for tool-agent-user interaction in real-world domains. CoRR abs/2406.12045. [Link](https://doi.org/10.48550/arXiv.2406.12045).
- Y. Zheng, Z. Zhang, C. Ma, Y. Yu, J. Zhu, Y. Wu, T. Xu, B. Dong, H. Zhu, R. Huang, and G. Yu (2026). SkillRouter: skill routing for LLM agents at scale. CoRR abs/2603.22455. [Link](https://doi.org/10.48550/arXiv.2603.22455).

###### Appendix

1. [1 Introduction](https://arxiv.org/html/2606.00510#S1)
2. [2 Related Work](https://arxiv.org/html/2606.00510#S2)
   1. [2.1 From Tools and Experience to Skills](https://arxiv.org/html/2606.00510#S2.SS1)
   2. [2.2 Skill Integration and Optimization](https://arxiv.org/html/2606.00510#S2.SS2)
   3. [2.3 Selectivity in Related Agent Settings](https://arxiv.org/html/2606.00510#S2.SS3)
3. [3 Preliminary](https://arxiv.org/html/2606.00510#S3)
4. [4 Methodology](https://arxiv.org/html/2606.00510#S4)
   1. [4.1 Overview](https://arxiv.org/html/2606.00510#S4.SS1)
   2. [4.2 Preference Construction](https://arxiv.org/html/2606.00510#S4.SS2)
   3. [4.3 Preference Optimization](https://arxiv.org/html/2606.00510#S4.SS3)
5. [5 Experiments](https://arxiv.org/html/2606.00510#S5)
   1. [5.1 Experimental Setup](https://arxiv.org/html/2606.00510#S5.SS1)
   2. [5.2 Main Results](https://arxiv.org/html/2606.00510#S5.SS2)
   3. [5.3 Ablation Study](https://arxiv.org/html/2606.00510#S5.SS3)
   4. [5.4 Out-of-Domain Generalization](https://arxiv.org/html/2606.00510#S5.SS4)
6. [6 Analyses](https://arxiv.org/html/2606.00510#S6)
7. [7 Conclusion](https://arxiv.org/html/2606.00510#S7)
8. [References](https://arxiv.org/html/2606.00510#bib)
9. [A Skill Library](https://arxiv.org/html/2606.00510#A1)
10. [B Counterfactual Skill Benefit](https://arxiv.org/html/2606.00510#A2)
11. [C Episode-Level RL Does Not Reliably Calibrate Skill Use](https://arxiv.org/html/2606.00510#A3)
12. [D Engineering Baselines](https://arxiv.org/html/2606.00510#A4)
13. [E Ablation on Entropy-Guided Branching](https://arxiv.org/html/2606.00510#A5)
14. [F Robustness to Distractor Skills](https://arxiv.org/html/2606.00510#A6)
15. [G Experimental Details](https://arxiv.org/html/2606.00510#A7)
    1. [G.1 Benchmark-Specific Setup](https://arxiv.org/html/2606.00510#A7.SS1)
    2. [G.2 Preference Pair Collection](https://arxiv.org/html/2606.00510#A7.SS2)
    3. [G.3 Training Rounds and Loss Masking](https://arxiv.org/html/2606.00510#A7.SS3)
    4. [G.4 Compute Cost](https://arxiv.org/html/2606.00510#A7.SS4)
16. [H Trajectory Analysis](https://arxiv.org/html/2606.00510#A8)

## Appendix A Skill Library

We study selective skill invocation under a fixed skill library\. All skills are constructed offline before policy optimization, using only training resources such as training trajectories, task instructions, API schemas, and environment documentation\. These skills remain unchanged throughout SelSkill training and evaluation, and no evaluation examples are used during skill construction\.

For each benchmark, we identify recurring procedural patterns from training resources and consolidate them into reusable callable skills\. Each skill consists of lightweight metadata and full skill content\. The metadata is visible to the model before invocation and supports the skill\-or\-skip decision, while the full content is injected into the context or executed only after the model explicitly chooses to invoke the skill\.
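The split between metadata and full content can be pictured with a small data structure. This is an illustrative sketch only: the `Skill` class, its field names, and the helpers `render_listing` and `invoke` are hypothetical, and the example text is taken from the shortened ALFWorld entry in Table 5.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # lightweight metadata, visible before invocation
    content: str      # full skill content, revealed only after invocation


def render_listing(skills: List[Skill]) -> str:
    """Build the skill listing shown to the model: metadata only."""
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)


def invoke(library: Dict[str, Skill], name: str) -> str:
    """Return the full content once the model explicitly chooses a skill."""
    return library[name].content


HEAT_OBJECT = Skill(
    name="heat_object",
    description="Use when a task requires a heated object.",
    content=(
        "Check object and microwave availability, "
        "then perform the heating procedure."
    ),
)
LIBRARY = {HEAT_OBJECT.name: HEAT_OBJECT}
LISTING = render_listing(list(LIBRARY.values()))
```

Keeping `content` out of the listing means that skipping a skill costs only the short description in context, which is what makes the skill-or-skip decision meaningful.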

Skill construction follows two principles\. First, each skill should capture a reusable procedure that can apply across multiple task instances, such as checking preconditions, following environment constraints, retrieving supporting evidence, or executing a common action sequence\. It should not encode an instance\-specific solution or a shortcut tailored to a particular example\. Second, skill content should avoid any evaluation leakage\. It must not contain evaluation answers, evaluation trajectories, target states, case identifiers, or other information that would allow the model to solve an evaluation instance by memorization rather than by deciding when to invoke a fixed skill\.

Table [5](https://arxiv.org/html/2606.00510#A1.T5) shows representative skills from each benchmark. The entries in the table are summarized and shortened; full skill definitions are provided in the supplementary materials.

| Benchmark | Skills | Example | Before invocation | After invocation |
| --- | --- | --- | --- | --- |
| ALFWorld | 10 (4 / 6) | heat object | Use when a task requires a heated object. | Check object and microwave availability, then perform the heating procedure. |
| BFCL | 18 (10 / 8) | place stock order | Use for stock order placement or verification. | Check ticker, side, quantity, order type, and price, while preserving user-specified limit prices. |
| Tau-bench airline | 11 (4 / 7) | cancel flight | Use when a user requests flight cancellation. | Check cancellation eligibility under airline policy before calling the cancellation API. |
| Tau-bench retail | 6 (2 / 4) | cancel order | Use when a user requests order cancellation. | Check order status, cancellation window, and item eligibility before calling the cancellation API. |
| PopQA | 4 (4 / 0) | lookup person fact | Use when uncertain about a person’s biographical attribute. | Retrieve Wikipedia evidence and extract the requested attribute. |

Table 5: Representative skills from each benchmark. Skill counts are reported as total (memory / executable). Entries are summarized and shortened.

## Appendix B Counterfactual Skill Benefit

We conduct counterfactual experiments to characterize when skill access actually changes the execution trajectory\. For each benchmark and model, we deliberately construct a counterfactual pair for the same task instance: a skill\-disabled trajectory and a skill\-enabled trajectory under the same task setting\. We then align the two runs by case ID and compare whether enabling skills changes the final outcome\. This design compares outcome changes under skill access on the same case, rather than comparing different tasks or different sampled instances\.

The analysis covers 513 paired runs across multiple benchmarks and model families: BFCL and Tau\-bench with Qwen3\-14B, ALFWorld with Qwen3\-8B, and BFCL with Gemini\-2\.5\-flash\-lite\. For BFCL, we use the base and long\-context multi\-turn splits and exclude missing\-function and missing\-parameter categories, where the required function or parameter information is unavailable and skill use is therefore structurally blocked\. For Tau\-bench and ALFWorld, we align paired runs by task ID/trial and game file, respectively\.

Panel A: Counterfactual outcome changes

| Outcome | BFCL Qwen3 ($n=124$) | ALFWorld Qwen3 ($n=84$) | Tau-bench Qwen3 ($n=181$) | BFCL Gemini ($n=124$) | Total ($n=513$) |
| --- | --- | --- | --- | --- | --- |
| Positive | 21 (16.9%) | 9 (10.7%) | 34 (18.8%) | 5 (4.0%) | 69 (13.5%) |
| Negative | 6 (4.8%) | 1 (1.2%) | 20 (11.0%) | 16 (12.9%) | 43 (8.4%) |
| No clear effect | 97 (78.2%) | 74 (88.1%) | 127 (70.2%) | 103 (83.1%) | 401 (78.2%) |

Panel B: Semantic relevance of invoked skills

| Outcome | BM25 | Embedding cosine (E5) | Interpretation |
| --- | --- | --- | --- |
| Positive | 4.58 | 0.721 | Relatively relevant |
| Negative | 4.19 | 0.719 | Relatively relevant |
| No clear effect | 3.62 | 0.703 | Relatively less relevant |

Table 6: Counterfactual analysis of skill benefit. Panel A reports outcome changes after enabling skills. Panel B reports similarity between task instructions and invoked skill metadata, showing that both helpful and harmful invocations can appear semantically relevant.

Table [6](https://arxiv.org/html/2606.00510#A2.T6) shows that skill benefits are highly concentrated. Across 513 paired runs, enabling skills improves the trajectory in only 13.5% of cases, harms an otherwise correct trajectory in 8.4%, and has no clear effect in 78.2%. Thus, skill access is not uniformly beneficial: most cases either do not require the skill or cannot be changed by it, while a smaller but non-negligible set of cases is sensitive to the invocation decision.
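The Panel A categories can be derived from paired runs roughly as follows. The sketch assumes one plausible reading of the labels (Positive: fails without skills and succeeds with them; Negative: the reverse; No clear effect: same final outcome) and a hypothetical input format mapping case IDs to success flags; the authors' exact criteria may differ.

```python
from typing import Dict


def classify_pairs(
    without_skill: Dict[str, bool], with_skill: Dict[str, bool]
) -> Dict[str, int]:
    """Count outcome changes between skill-disabled and skill-enabled runs.

    Both arguments map a case ID to whether that run succeeded. Only case
    IDs present in both dictionaries are compared, mirroring the alignment
    by case ID described in the text.
    """
    counts = {"positive": 0, "negative": 0, "no_clear_effect": 0}
    for case_id in without_skill.keys() & with_skill.keys():
        before, after = without_skill[case_id], with_skill[case_id]
        if after and not before:
            counts["positive"] += 1  # skill access turned a failure into a success
        elif before and not after:
            counts["negative"] += 1  # skill access broke a correct trajectory
        else:
            counts["no_clear_effect"] += 1  # same final outcome either way
    return counts


# Hypothetical paired outcomes for four cases.
without_skill = {"c1": False, "c2": True, "c3": True, "c4": False}
with_skill = {"c1": True, "c2": False, "c3": True, "c4": False}
summary = classify_pairs(without_skill, with_skill)
```

Applied to the totals in Panel A, the same bookkeeping gives 69, 43, and 401 cases out of 513, which correspond to the reported 13.5%, 8.4%, and 78.2%.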

Panel B further examines whether harmful invocations are simply caused by choosing semantically unrelated skills. We compute BM25 and embedding-based cosine similarity between the task instruction and the invoked skill metadata, using E5 (Wang et al., [2022](https://arxiv.org/html/2606.00510#bib.bib48)) as the embedding model. Negative cases have similarity scores close to Positive cases and higher than No-clear-effect, indicating that harmful invocations are often semantically plausible. The BM25 gap between effective or harmful invocations and unnecessary invocations is significant ($p=0.042$), while embedding similarity shows a consistent but non-significant trend. This suggests that many failures arise after the model has already selected a plausible skill: the harder problem is deciding whether the current state provides the right conditions and timing for invoking it effectively.

| Step | w/o Skill SR (↑) | w/ Skill SR (↑) | Δ | Skill/ep | SR@Invoke (↑) | SR@Skip (↑) |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 19.5 | 21.4 | +1.9 | 11.13 | 20.7 | 50.0 |
| 5 | 24.2 | 19.0 | -5.2 | 1.25 | 24.3 | 14.3 |
| 10 | 26.6 | 39.1 | +12.5 | 3.26 | 30.8 | 63.6 |
| 15 | 28.9 | 45.8 | +16.9 | 12.25 | 33.8 | 100.0 |
| 20 | 42.2 | 54.0 | +11.8 | 1.84 | 13.5 | 84.0 |
| 25 | 44.5 | 66.3 | +21.8 | 2.84 | 31.7 | 100.0 |
| 30 | 52.3 | 48.3 | -4.0 | 0.97 | 9.8 | 82.6 |
| 35 | 63.3 | 63.9 | +0.6 | 3.18 | 19.4 | 97.9 |
| 40 | 79.7 | 59.3 | -20.4 | 2.45 | 22.2 | 100.0 |

Table 7: Comparison of GRPO+KL Performance on ALFWorld benchmark. All SR values are percentages. SR@Invoke and SR@Skip are computed on validation episodes with and without skill calls, respectively.

## Appendix C Episode-Level RL Does Not Reliably Calibrate Skill Use

We compare two GRPO+KL variants on ALFWorld to examine whether a policy trained with episode-level reward can learn selective skill invocation. Both variants start from Qwen3-8B base and use the same training hyperparameters: learning rate $1\mathrm{e}{-6}$, group size 8, train batch size 32, mini-batch size 64, KL coefficient 0.01, maximum 50 steps per episode, and validation temperature 0.4. GRPO-w/oSkill does not receive any skill listing and is trained with task reward only: $+10$ for success and $-0.1$ for invalid actions. GRPO-w/Skill receives the skill listing at every step and uses the same task reward plus a skill-success bonus of $+1.0$ whenever a skill executes without error.
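The reward described above can be written down as a short sketch. Several details are assumptions rather than statements from the paper: that the $-0.1$ penalty applies per invalid action, that a failed episode receives no success reward, and that advantages are normalized within a group by mean and population standard deviation, as is common for GRPO. The function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def episode_reward(
    success: bool,
    n_invalid_actions: int,
    n_skill_ok: int = 0,
    skill_bonus: bool = False,
) -> float:
    """Episode-level reward for the two GRPO variants.

    skill_bonus=False corresponds to GRPO-w/oSkill (task reward only);
    skill_bonus=True adds +1.0 per skill call that executes without error.
    """
    reward = 10.0 if success else 0.0    # assumption: failure earns 0
    reward -= 0.1 * n_invalid_actions    # assumption: penalty is per action
    if skill_bonus:
        reward += 1.0 * n_skill_ok
    return reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 rollouts for the same task.
group = [
    episode_reward(True, 0, n_skill_ok=0, skill_bonus=True),
    episode_reward(True, 1, n_skill_ok=3, skill_bonus=True),
    episode_reward(False, 2, n_skill_ok=1, skill_bonus=True),
    episode_reward(False, 0, n_skill_ok=0, skill_bonus=True),
]
advantages = group_relative_advantages(group)
```

Under these assumptions the whole episode shares one scalar, and a successful rollout with several executed skill calls scores higher than a successful rollout with none, whether or not those calls were needed.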

Table [7](https://arxiv.org/html/2606.00510#A2.T7) shows that episode\-level RL improves task success at some checkpoints, but does not produce stable skill\-use behavior\. The advantage of GRPO\-w/Skill over GRPO\-w/oSkill fluctuates across training, and skill calls per episode do not converge to a consistent pattern\. Episodes without skill calls often achieve higher success rates than episodes with skill calls, suggesting that the learned policy still invokes skills in many low\-yield states\. This does not mean that skill calls directly cause failure, since invoked episodes may be harder or more uncertain\. Rather, the result shows that sparse task reward, even with a simple skill\-success bonus, is too coarse to supervise the local skill\-or\-skip decision\. The policy can improve final task success while still failing to learn when a skill should be skipped\.
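For reference, SR@Invoke and SR@Skip can be computed by splitting validation episodes on whether any skill call occurred. The sketch below assumes each episode is summarized as a hypothetical `(success, num_skill_calls)` pair; the record format is not taken from the paper.

```python
from typing import List, Optional, Tuple


def success_rate(outcomes: List[bool]) -> Optional[float]:
    """Success rate in percent, or None when the group is empty."""
    if not outcomes:
        return None
    return 100.0 * sum(outcomes) / len(outcomes)


def split_success_rates(
    episodes: List[Tuple[bool, int]],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (SR@Invoke, SR@Skip) from (success, num_skill_calls) records."""
    invoked = [ok for ok, calls in episodes if calls > 0]   # episodes with skill calls
    skipped = [ok for ok, calls in episodes if calls == 0]  # episodes without skill calls
    return success_rate(invoked), success_rate(skipped)
```

Because the two groups contain different episodes, a gap between the two rates is descriptive rather than causal, as the paragraph above notes.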

Table [8](https://arxiv.org/html/2606.00510#A3.T8) illustrates this issue with a representative ALFWorld trajectory\. The model eventually succeeds, but it invokes heat\_object for a task that only requires placing a tomato on the microwave\. The redundant calls waste steps, introduce precondition errors, and make the trajectory much longer than necessary\. Thus, the final success reward cannot distinguish a skill that helped from a skill that the model succeeded despite\. This is the credit assignment problem at the individual skill\-call level\.

RL\-Init model: successful trajectory with redundant skill calls

*Task: Put some tomato on microwave\.* Success in 18 steps; optimal ≈ 6\.

| Phase | Trajectory and notes |
|---|---|
| Correct prefix | go to fridge 1 → open fridge 1 → take tomato 3 from fridge 1. The agent correctly obtains the tomato. |
| Redundant invocation | Skill(heat\_object) is invoked and executes, although the task only requires placing the tomato in the microwave. |
| State-breaking action | go to countertop 1 → move tomato 3 to countertop 1. The unnecessary skill invocation changes the subsequent behavior, and the agent puts down the tomato. |
| Failed invocations | The model invokes Skill(heat\_object) twice more. Both calls fail because the agent is no longer holding an object, violating the skill precondition. |
| Recovery | go to countertop 1 → take tomato 3 from countertop 1 → Skill(heat\_object) → move tomato 3 to microwave 1. The task is eventually completed, but only after several unnecessary steps. |

Table 8: Success Despite Redundant Skill Calls\. An ALFWorld trajectory where the model succeeds on a placement\-only task despite unnecessary heat\_object calls that introduce precondition errors and extra steps\.

## Appendix D Engineering Baselines

We evaluate whether simple engineering changes can replace selectivity training\. All conditions use the ALFWorld RL\-Init model as the baseline\. We compare three engineering interventions with the baseline and our SelSkill Round 3 model\.

*Skill\-as\-Context* prepends all skill\-body text to the system prompt before each step and removes the skill tool\. The model can read skill knowledge as plain text, but cannot invoke skills during task execution\. This setting tests whether information availability alone is sufficient for effective skill use\.

*Conservative Prompt* keeps the original skill listing and skill tool, but adds a stronger instruction that the model should prefer direct environment actions and invoke a skill only when it is clearly necessary\. This setting tests whether prompt\-level constraints alone can suppress unhelpful calls without harming overall task behavior\.

*Explicit Skip Option* adds a universal skip skill, self\_reasoning\(\), to the skill listing\. Its when\-to\-use description states that it should be used when no other skill is applicable\. This setting tests whether the skip decision can be represented as an explicit invocable option\.

Table [9](https://arxiv.org/html/2606.00510#A4.T9) shows that none of the engineering interventions improves over the RL\-Init baseline\. Skill\-as\-Context reaches 71\.9% SR, suggesting that providing skill knowledge as context alone cannot replace explicit invocation decisions\. Conservative Prompt reduces Skill/ep from 2\.55 to 1\.41, but its overall SR also drops to 71\.9%\. This suggests that stronger prompt constraints mainly suppress skill use at a coarse level, rather than improving the model’s state\-specific invocation judgment\. Explicit Skip Option shows a similar pattern: making “skip” an explicit callable option does not enable an untrained model to reliably determine whether the current state truly requires a skill\. In contrast, SelSkill Round 3 reaches 86\.7% SR with lower Skill/ep and higher invocation precision, indicating that selective skill invocation requires training signal beyond context injection, prompt constraints, or skill\-listing design\.

| Condition | SR (↑) | Skill/ep | Inv./Skip (↑) |
|---|---|---|---|
| RL-Init w/ Skill | 75.8 | 2.55 | 73.7 / 78.8 |
| Skill-as-Context | 71.9 | – | – / – |
| Conservative | 71.9 | 1.41 | 78.0 / 66.7 |
| Explicit Skip | 70.3 | 2.34 | 72.4 / 67.3 |
| SelSkill | 86.7 | 0.44 | 97.0 / 83.2 |

Table 9: Engineering baselines on the ALFWorld benchmark\. Inv\./Skip denotes SR@Invoke / SR@Skip\.

## Appendix E Ablation on Entropy\-Guided Branching

We ablate the branching\-point selection strategy used for local decision\-point preference construction\. All conditions use the same base model, Qwen3\-8B RL\-Init on ALFWorld, and the same $K=4$ free\-sampling procedure\. To isolate the effect of branching\-point selection, we match the final amount of training data to the entropy\-guided setting across all strategies\.

| Strategy | Pairs/game (↑) | Inv.-pref. (%) |
|---|---|---|
| All-skill | 0.18 | 31.0 |
| Random | 0.68 | 15.7 |
| Entropy | 0.57 | 24.1 |

Table 10: Quality of retained local preference pairs before size matching\.

Table [10](https://arxiv.org/html/2606.00510#A5.T10) reports the retained local preference pairs before final size matching\. Random branching retains many pairs, but only a relatively small portion of them prefer the invoke continuation\. Entropy\-guided branching retains a comparable number of pairs while yielding a higher invoke\-preferred ratio\. All\-skill branching has the highest invoke\-preferred ratio, but it retains substantially fewer pairs per game\. This is because it attempts forks at all skill\-invocation positions, while many of these positions lead the invoke and skip continuations to the same final outcome and therefore cannot form a clear outcome\-efficiency preference\. Thus, trying more branch points does not necessarily produce more valid preference pairs\. These results suggest that entropy\-guided selection can more efficiently identify positions where valid local invoke/skip comparisons can be constructed\.

| Condition | SR (↑) | Skill/ep | Exec. Prec. (↑) |
|---|---|---|---|
| RL-Init w/ Skill | 75.8 | 2.55 | 70.9 |
| All-skill | 82.8 | 0.95 | 77.0 |
| Random | 80.5 | 0.59 | 96.1 |
| Entropy-guided | 82.8 | 0.66 | 94.1 |

Table 11: Experimental results on the ALFWorld benchmark with different branching strategies\. All values except Skill/ep are percentages\.

Table [11](https://arxiv.org/html/2606.00510#A5.T11) further compares the downstream results after size\-matched training\. Entropy\-guided branching achieves the same SR as all\-skill branching, while the latter requires roughly three times the sampling cost\. It also yields slightly higher SR than random branching\. This indicates that entropy is not an exact causal criterion, but it provides useful guidance for branching\-point selection, achieving training performance close to exhaustive branching with much lower sampling cost\.

## Appendix F Robustness to Distractor Skills

We evaluate whether the invocation decisions of SelSkill Round 2 remain robust as the skill listing expands\. Starting from the standard 18\-skill BFCL setting, we inject noise skills to create listings of 28, 38, and 68 skills\. The noise skills are synthetically constructed to cover unrelated domains, such as calendar management, music streaming, e\-commerce, and fitness tracking, with realistic when\-to\-use conditions that do not overlap with any BFCL evaluation task\. They provide no task\-relevant information and serve only as listing distractors\. All other conditions remain identical to the main experiment\.

| Skill listing size | SR (↑) | Noise skill calls |
|---|---|---|
| 18 skills (standard) | 24.2 | – |
| 28 skills (+10 noise) | 24.2 | 0 / 301 (0.0%) |
| 38 skills (+20 noise) | 25.4 | 5 / 250 (2.0%) |
| 68 skills (+50 noise) | 21.4 | 0 / 263 (0.0%) |

Table 12: Robustness of SelSkill Round 2 under expanded skill listings\.

Table [12](https://arxiv.org/html/2606.00510#A6.T12) shows that SR remains stable when the skill listing expands from 18 to 38 skills, and only drops modestly when the listing is further expanded to 68 skills\. This decline is likely due to additional context noise from the longer skill listing, rather than incorrect invocations of the injected noise skills\. The model almost never invokes these noise skills, indicating strong robustness to irrelevant or low\-quality skills\. This behavior is consistent with the skill\-or\-skip formulation: the model learns not only to identify potentially relevant skills, but also to skip skills that should not intervene in the current task state\. These results suggest that SelSkill Round 2 remains applicable under larger and noisier skill libraries\.
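The "Noise skill calls" column can be reproduced from a log of invoked skill names. The sketch below is illustrative only: the log format is an assumption, and `calendar_add_event` is a made-up distractor name, not one of the noise skills used in the experiment.

```python
from typing import Iterable, List, Tuple


def noise_call_rate(
    called_skills: List[str], noise_skills: Iterable[str]
) -> Tuple[int, int, float]:
    """Return (noise calls, total skill calls, noise share in percent)."""
    noise = set(noise_skills)
    hits = sum(1 for name in called_skills if name in noise)
    total = len(called_skills)
    rate = 100.0 * hits / total if total else 0.0
    return hits, total, rate


# Hypothetical log: 245 calls to a task skill and 5 calls to a distractor.
calls = ["place_stock_order"] * 245 + ["calendar_add_event"] * 5
print(noise_call_rate(calls, ["calendar_add_event"]))
```

With 5 distractor calls out of 250 total calls, the share is 2.0%, which matches the arithmetic behind the 38-skill row.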

| Setting | ALFWorld | BFCL |
|---|---|---|
| Backbone | Qwen3-8B/4B | Qwen3-14B |
| Mode | Thinking-style | Non-thinking |
| Initialization | RL-Init | Base |
| Eval decoding | Greedy | Greedy |
| Eval metric | Task success | Exact match |
| SelSkill rounds | 3 (8B) / 2 (4B) | 2 |
| Learning rate | $1\times10^{-6}$ | $5\times10^{-6}$ |
| Max length | 4096 | 12288 |
| Tuning | Full-parameter fine-tuning | Full-parameter fine-tuning |
| $\beta$ | 0.1 | 0.1 |
| Optimizer | AdamW with cosine schedule | AdamW with cosine schedule |
| Warmup | 0.1 | 0.1 |
| Epochs | 3 | 3 |
| Local mask | $n=3$ post-branch assistant turns | $n=3$ post-branch assistant turns |

Table 13: Model, evaluation, and training settings\.

## Appendix G Experimental Details

### G\.1 Benchmark\-Specific Setup

#### ALFWorld\.

We keep the benchmark’s thinking\-style interaction format, because the agent needs to reason over environment observations before producing executable actions\. After RL initialization, skills are enabled through the system prompt, which contains skill metadata and few\-shot skill\-call examples\. The full skill body is loaded only after the model explicitly invokes the corresponding skill\. Evaluation uses greedy decoding on a fixed held\-out split\.

#### BFCL\.

We use Qwen3\-14B in non\-thinking mode for BFCL\. BFCL contains long multi\-turn function\-calling conversations, and enabling thinking substantially increases the context length during rollout collection and preference training\. We therefore use non\-thinking mode to keep the interaction within the context budget\. Skills are enabled by adding skill metadata and few\-shot examples to the system prompt\. The skill metadata follows the BFCL tool\-calling format and includes the skill\-use condition in the description\. Evaluation uses exact\-match scoring with greedy decoding\.

### G\.2 Preference Pair Collection

#### Episode\-level collection\.

For episode\-level data, we sample $K=10$ complete trajectories or task outputs for the same input and label them using the final benchmark outcome\. When constructing episode\-level pairs, we remove malformed positive samples so that the chosen side does not contain invalid action or call formats\.
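A minimal sketch of this pairing step is given below. The `Rollout` record is hypothetical, and pairing every well-formed success with every failure for the same input is an assumption; the text does not specify how chosen and rejected samples are matched.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass
class Rollout:
    """Hypothetical record for one sampled trajectory or task output."""
    text: str
    success: bool      # final benchmark outcome
    well_formed: bool  # no invalid action or call formats


def episode_pairs(rollouts: List[Rollout]) -> List[Tuple[Rollout, Rollout]]:
    """Build (chosen, rejected) pairs from the rollouts of a single input."""
    # Malformed positives are dropped so the chosen side is always clean.
    chosen = [r for r in rollouts if r.success and r.well_formed]
    rejected = [r for r in rollouts if not r.success]
    # Assumption: all success-by-failure combinations are kept.
    return list(product(chosen, rejected))
```

If all sampled rollouts for an input succeed, or all of them fail, the function returns no pairs for that input.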

#### Local branching collection\.

For local data, we first run the current policy and record token\-level log probabilities during generation\. For each sampled instance, we form two candidate pools: the top\-3 high\-entropy positions following skill invocations and the top\-3 high\-entropy positions during ordinary generation\. The former captures uncertain states after skill intervention, while the latter captures uncertain states in ordinary execution\. For each selected branch point, we use interrupted rollout with $K=4$\. Specifically, we use a temporary intervention prompt only during data collection to elicit invoke and skip continuations from the same trajectory prefix at that branch point\. Apart from this local collection prompt, all continuations use the same original prompt, skill listing, decoding setting, and evaluation protocol\. After the local invoke/skip choice is made, generation continues normally under the current policy\. During training, we restore the original prompt so that the chosen and rejected continuations are conditioned on the same original prefix\.

The resulting continuations are then filtered and ordered by the outcome\-efficiency rule in Section 4\. As with episode\-level data, we remove malformed positive samples from local pairs\.
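The branch-point selection described in this subsection can be sketched as follows. Two points are assumptions rather than details from the paper: entropy is approximated from the recorded candidate log probabilities at each position (renormalized over those candidates), and each position is represented by a hypothetical `(index, entropy, after_skill)` tuple.

```python
import math
from typing import Dict, List, Tuple


def entropy_from_logprobs(logprobs: List[float]) -> float:
    """Shannon entropy (nats) over the recorded candidates, renormalized."""
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def select_branch_points(
    positions: List[Tuple[int, float, bool]], top_k: int = 3
) -> Dict[str, List[int]]:
    """Pick the top_k highest-entropy positions in each of the two pools."""
    after_skill = [(i, h) for i, h, flag in positions if flag]
    ordinary = [(i, h) for i, h, flag in positions if not flag]

    def pick(pool: List[Tuple[int, float]]) -> List[int]:
        ranked = sorted(pool, key=lambda item: item[1], reverse=True)
        return [i for i, _ in ranked[:top_k]]

    return {"after_skill": pick(after_skill), "ordinary": pick(ordinary)}
```

With `top_k=3`, each instance yields at most six branch points, three per pool, and a pool with fewer than three candidates simply contributes fewer.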

#### Sampling budget\.

The rollout hyperparameters above are chosen based on preliminary runs\. We set episode\-level collection to $K=10$, and local collection to two groups of top\-3 high\-entropy positions with $K=4$ rollouts at each branch point, so that episode\-level and local collection produce comparable numbers and proportions of valid preference pairs under a similar rollout\-time budget\. This avoids having one type of preference signal dominate the training data and allows the global outcome signal and the local invoke/skip signal to be combined more evenly\.

### G\.3 Training Rounds and Loss Masking

We use an iterative preference\-training schedule for both training benchmarks\. At each round, rollout data are collected with the previous round’s model, and the next model is trained only on preference pairs collected in that round\. Preference data are not accumulated across successive rounds\. Qwen3\-8B on ALFWorld runs three SelSkill training rounds from RL\-Init, while Qwen3\-4B runs two rounds\. BFCL runs two rounds from the Qwen3\-14B model with skill prompting enabled\.

We do not continue to additional rounds because the number of valid preference pairs drops substantially as the model improves under the same data\-collection setup\. For episode\-level data, higher task success makes it harder to collect both successful and failed outputs for the same input, reducing the number of clear outcome comparisons available for training\. For local data, the model’s invocation behavior becomes more deterministic, so many invoke/skip continuations lead to the same final outcome and cannot form a clear outcome\-efficiency preference\. As a result, later rounds produce much smaller preference datasets for continued optimization\. In preliminary runs, training on such small datasets led to severe overfitting rather than further performance gains\.

For local pairs, the loss is applied only to selected assistant\-generated tokens around the branching decision\. Environment observations, skill returns, tool outputs, and tokens outside the local window are masked out\. The local mask covers the first $n=3$ assistant turns after the branch, so the optimization focuses on the short\-term consequences of the invoke/skip decision\.
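The local mask can be illustrated with a short sketch. It assumes a hypothetical representation of a trajectory as `(role, num_tokens)` turns and treats the assistant turn at the branch index as the first post-branch turn; the paper does not state whether the branching turn itself is counted.

```python
from typing import List, Tuple


def local_loss_mask(
    turns: List[Tuple[str, int]], branch_turn: int, n_assistant_turns: int = 3
) -> List[int]:
    """Per-token 0/1 mask: 1 only on the first n assistant turns from the branch."""
    mask: List[int] = []
    seen = 0  # assistant turns already covered by the local window
    for index, (role, num_tokens) in enumerate(turns):
        active = (
            index >= branch_turn
            and role == "assistant"
            and seen < n_assistant_turns
        )
        if active:
            seen += 1
        # Observations, tool outputs, and out-of-window tokens stay at 0.
        mask.extend([1 if active else 0] * num_tokens)
    return mask
```

The resulting mask would multiply the per-token loss, so prefix tokens, environment observations, and tool outputs contribute nothing to the gradient.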

All training uses full\-parameter fine\-tuning with the hyperparameters shown in Table [13](https://arxiv.org/html/2606.00510#A6.T13)\.

### G\.4 Compute Cost

The training and preference\-data collection runs reported in this paper are conducted on 8 NVIDIA A100 80GB GPUs\. Table [14](https://arxiv.org/html/2606.00510#A7.T14) reports the wall\-clock time for the main SelSkill experiments and the additional GRPO comparison\. The reported time is measured from training logs or file timestamps\.

| Experiment / Component | Time |
|---|---|
| *ALFWorld 8B* | |
| GRPO (50 steps) | ∼30.0 h |
| SelSkill rollout ×3 | ∼11.7 h |
| SelSkill training ×3 | ∼2.7 h |
| Total | ∼44.4 h |
| *ALFWorld 4B* | |
| GRPO (60 steps) | ∼27.0 h |
| SelSkill rollout ×2 | ∼7.8 h |
| SelSkill training ×2 | ∼1.2 h |
| Total | ∼36.0 h |
| *BFCL 14B* | |
| SelSkill rollout ×2 | ∼4.6 h |
| SelSkill training ×2 | ∼1.8 h |
| Total | ∼6.4 h |
| *GRPO comparison* | |
| Without skill (40 steps) | ∼27.0 h |
| With skill (40 steps) | ∼64.0 h |

Table 14: Wall\-clock time for the main SelSkill experiments and the additional GRPO comparison\.

## Appendix H Trajectory Analysis

We provide qualitative cases illustrating three typical skill\-invocation patterns: a relevant skill can pollute the context when triggered at an inappropriate time, a useful skill can fail when invoked before its preconditions are met, and a skill can help when invoked after its preconditions are satisfied\.

#### BFCL context pollution\.

Table [15](https://arxiv.org/html/2606.00510#A8.T15) shows a BFCL case where the user asks to buy 100 shares of AAPL at a limit price of $150\. The trajectory without skill invocation preserves this explicit constraint and places the order correctly\. In contrast, invoking place\_stock\_order at this point introduces a retrieved live price into the context, causing the model to override the specified limit price and later operate on the wrong order\. This case shows that a relevant skill triggered at an inappropriate time can pollute the context and derail an otherwise correct trajectory\.

| Turn | Without skill (✓) | With skill (×) |
|---|---|---|
| 1 | place\_order(price=150, amount=100) → order 12446 ✓ | Skill(place\_stock\_order) → get\_stock\_info(AAPL) → \$227.16 → place\_order(price=227.16, amount=100) → insufficient balance × |
| 2 | get\_order\_details(12446) → correct recent order ✓ | get\_order\_history() → old completed order 12345; get\_order\_details(12345) × |
| 3 | cancel\_order(12446) ✓ | cancel\_order(12345) → already completed × |
| 4 | trading\_logout() ✓ | trading\_logout() ✓ |

Table 15: Context pollution in BFCL\. A mistimed skill call introduces an additional live price into the context, causing the model to override the user\-specified limit price and derail subsequent execution\.

#### ALFWorld premature call\.

Table [16](https://arxiv.org/html/2606.00510#A8.T16) shows an ALFWorld failure where the task is to examine an alarm clock with a desk lamp\. The skill examine\_with\_light is useful only after the agent holds the desk lamp, but the model invokes it too early\. The skill therefore returns no effect repeatedly, and the agent exhausts the step budget\. This case illustrates that even a useful skill should be skipped until its preconditions are satisfied\.

| Without skill (✓) | With skill (×) |
|---|---|
| go to sidetable 1 | go to sidetable 1 |
| examine alarmclock 1 | examine alarmclock 1 |
| go to desk 1 | Skill(examine\_with\_light, alarmclock 1) → “Nothing happens.” × |
| pick up desklamp 1 | Skill(examine\_with\_light, alarmclock 1) → “Nothing happens.” × |
| go to sidetable 1 | go to desk 1 |
| use desklamp 1 with alarmclock 1 → task complete ✓ | Skill(examine\_with\_light, alarmclock 1) → “Nothing happens.” × |
| | *repeats failing skill attempts* → step budget exhausted × |

Table 16: Premature invocation in ALFWorld\. The model invokes examine\_with\_light before acquiring the lamp, so the skill precondition is not satisfied\.

#### ALFWorld correct call\.

Table [17](https://arxiv.org/html/2606.00510#A8.T17) shows a successful ALFWorld trajectory where the task is to put a hot egg in the fridge\. After finding and picking up the egg, the SelSkill model invokes heat\_object at an appropriate state, when its precondition is satisfied\.

| With skill (✓) | Without skill (×) |
|---|---|
| go to countertop 1 | go to countertop 1 |
| go to countertop 2 | go to countertop 2 |
| go to cabinet 1 | go to fridge 1 |
| go to fridge 1 | go to stoveburner 1 |
| open fridge 1 | go to fridge 1 |
| take egg 3 from fridge 1 | go to sinkbasin 1 |
| Skill(heat\_object) → go to microwave → put egg in microwave → heat egg → retrieve hot egg | *repeats navigation among countertops, stove burners, microwave, cabinets, sink, coffeemachine, and garbage can* |
| go to fridge 1 | go to microwave 1 |
| move egg 3 to fridge 1 → task complete ✓ | *visits microwave but never executes the heat sequence* → step budget exhausted × |

Table 17: Correct invocation in ALFWorld\. For a task requiring a heated egg, SelSkill invokes heat\_object only after the egg has been picked up, when the skill precondition is satisfied\.
