Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

arXiv cs.CL Papers

Summary

This paper proposes MASA, a framework that adapts skills to each LLM backbone without modifying weights, using hierarchical evolution and a model-conditioned rewriter, achieving gains of up to 25.8 points over baselines.

arXiv:2605.30723v1 Announce Type: new Abstract: LLM agents increasingly retrieve externally curated skills (procedural instructions retrieved at decision time) to improve performance on long-horizon interactive tasks. Existing skill libraries are typically treated as model-agnostic, reusing the same skill formulations across backbones with substantially different capacities and behaviors. However, our controlled experiments across multiple model scales show that skill effectiveness is strongly model-dependent: a skill that benefits one backbone can harm another. Motivated by this observation, we propose MASA (Model-Aware Skill Alignment), a framework that adapts skills to each target backbone without modifying agent weights. MASA operates in two stages: (1) a hierarchical skill evolution pipeline that iteratively rewrites general and task-specific skills using hill climbing and UCB-driven tree search, guided by environment feedback and model capability profiles; and (2) a lightweight model-conditioned skill rewriter trained on evolution trajectories to reproduce the adaptation in a single forward pass. Experiments across three interactive environments and four backbones show that MASA consistently achieves the best overall performance, with gains of up to 25.8 points over the strongest baseline. The learned rewriter further generalizes to unseen tasks and environments without additional search, consistently outperforming a much larger teacher LLM at a fraction of the inference cost.

Cached at: 06/01/26, 09:28 AM

# Model-Aware Skill Alignment for LLM Agents
Source: [https://arxiv.org/html/2605.30723](https://arxiv.org/html/2605.30723)
## Skill is Not One\-Size\-Fits\-All: Model\-Aware Skill Alignment for LLM Agents

Jianxiang Yu, Jiapeng Zhu, Bochen Lin, Qier Cui, Zichen Ding, Xiang Li \(corresponding author\)
East China Normal University, Shanghai, China
[jianxiangyu@stu\.ecnu\.edu\.cn](mailto:jianxiangyu@stu.ecnu.edu.cn)

###### Abstract

LLM agents increasingly retrieve externally curated *skills* \(procedural instructions retrieved at decision time\) to improve performance on long\-horizon interactive tasks\. Existing skill libraries are typically treated as model\-agnostic, reusing the same skill formulations across backbones with substantially different capacities and behaviors\. However, our controlled experiments across multiple model scales show that skill effectiveness is strongly model\-dependent: a skill that benefits one backbone can harm another\. Motivated by this observation, we propose MASA \(*Model\-Aware Skill Alignment*\), a framework that adapts skills to each target backbone without modifying agent weights\. MASA operates in two stages: \(1\) a hierarchical skill evolution pipeline that iteratively rewrites general and task\-specific skills using hill climbing and UCB\-driven tree search, guided by environment feedback and model capability profiles; and \(2\) a lightweight model\-conditioned skill rewriter trained on evolution trajectories to reproduce the adaptation in a single forward pass\. Experiments across three interactive environments and four backbones show that MASA consistently achieves the best overall performance, with gains of up to 25\.8 points over the strongest baseline\. The learned rewriter further generalizes to unseen tasks and environments without additional search, consistently outperforming a much larger teacher LLM at a fraction of the inference cost\. Our code is publicly available at [https://github\.com/jianxiangyu/MASA\_](https://github.com/jianxiangyu/MASA_)\.


## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.30723v1/x1.png)
Figure 1: Skill granularity is not one\-size\-fits\-all\. ALFWorld success rate \(%\) for four Qwen3 backbones under a No\-Skill control and three granularity levels \(Concise, Moderate, Detailed\)\. The optimal level differs across backbones\.

LLM agents increasingly solve long\-horizon interactive tasks, including web navigation Ouyang et al\. \([2026](https://arxiv.org/html/2605.30723#bib.bib63)\), embodied control Lu et al\. \([2026](https://arxiv.org/html/2605.30723#bib.bib67)\), and tool use Schick et al\. \([2023](https://arxiv.org/html/2605.30723#bib.bib11)\); Jiang et al\. \([2026](https://arxiv.org/html/2605.30723#bib.bib65)\); Wang et al\. \([2024a](https://arxiv.org/html/2605.30723#bib.bib53)\); Hsiao et al\. \([2025](https://arxiv.org/html/2605.30723#bib.bib66)\)\. A common approach to steer these agents without modifying model weights is to retrieve short pieces of procedural knowledge, which we call *skills*, from an external library at each step Wang et al\. \([2023](https://arxiv.org/html/2605.30723#bib.bib12), [2024b](https://arxiv.org/html/2605.30723#bib.bib13), [2024c](https://arxiv.org/html/2605.30723#bib.bib49)\); Chen et al\. \([2024](https://arxiv.org/html/2605.30723#bib.bib44)\); Zhao et al\. \([2024](https://arxiv.org/html/2605.30723#bib.bib45)\); Ma et al\. \([2026](https://arxiv.org/html/2605.30723#bib.bib68)\)\. Existing skill\-library systems, whether hand\-crafted Zhu et al\. \([2023](https://arxiv.org/html/2605.30723#bib.bib14)\) or distilled from agent trajectories Zhao et al\. \([2024](https://arxiv.org/html/2605.30723#bib.bib45)\); Chen et al\. \([2024](https://arxiv.org/html/2605.30723#bib.bib44)\); Xia et al\. \([2026](https://arxiv.org/html/2605.30723#bib.bib15)\); Wang et al\. \([2025a](https://arxiv.org/html/2605.30723#bib.bib16)\), typically construct a single shared library and reuse it across different LLM backbones\.
In practice, deployment constraints such as latency budgets, inference cost, and hardware availability mean that real\-world agent systems must operate with backbones of vastly different scales rather than simply relying on the strongest available model Yao et al\. \([2025](https://arxiv.org/html/2605.30723#bib.bib54)\); Zheng et al\. \([2025](https://arxiv.org/html/2605.30723#bib.bib56)\)\. This deployment heterogeneity raises a critical question for skill\-library design: can a single skill formulation serve models with substantially different capacities equally well?

To examine this, we experiment on ALFWorld Shridhar et al\. \([2020](https://arxiv.org/html/2605.30723#bib.bib9)\) \(full setup and analysis in §[2](https://arxiv.org/html/2605.30723#S2)\): keeping the principles of a skill library fixed, we vary only its granularity and evaluate four Qwen3 backbones \(4B–32B\) Yang et al\. \([2025](https://arxiv.org/html/2605.30723#bib.bib51)\)\. As Figure [1](https://arxiv.org/html/2605.30723#S1.F1) shows, the optimal granularity varies across models; indeed, a skill that boosts one backbone can actively degrade another\. A parallel experiment on the Gemma3 family \(Appendix [C\.3](https://arxiv.org/html/2605.30723#A3.SS3)\) confirms that the same pattern holds across families, and that models of the same size but from different families also prefer different skill formulations\. This observation suggests that the effectiveness of a skill library depends not only on what knowledge it encodes, but also on how that knowledge is expressed relative to the target model’s capacity: when the expression is misaligned, retrieved skills distract rather than help\. A well\-designed skill library should amplify the strengths of its target backbone, unlocking capabilities that generic, model\-agnostic skills cannot\.

We pursue this goal with MASA \(Model\-Aware Skill Alignment\), a framework that aligns the formulation of a skill library with each target backbone without modifying agent weights\. MASA treats skill alignment as a hierarchical search problem driven by environment feedback\. It first runs a *hierarchical model\-conditioned skill evolution*: a stronger teacher LLM iteratively rewrites skills guided by a capability profile of the target model, applying hill climbing over general skills and UCB\-driven tree search over task\-specific skills\. To eliminate the costly teacher at deployment, the discovered rewrites train a lightweight model\-conditioned skill rewriter that adapts skills in a single forward pass, outperforming the teacher while being orders of magnitude cheaper\.

Our main contributions are as follows:

- We empirically demonstrate that different models require different skill formulations: the same skill library that benefits one backbone can actively degrade another\. This finding challenges the one\-size\-fits\-all assumption and motivates model\-aware skill alignment\.
- We propose MASA, a framework that aligns skill formulations with each target backbone\. It combines iterative search to evolve optimal skills with a lightweight rewriter that transforms unaligned skills into model\-appropriate ones\.
- We evaluate MASA across three diverse environments and four Qwen3 backbones, achieving the highest success rate with gains of up to \+25\.8 points\. MASA\-Rewriter further generalizes to unseen tasks and environments in a single forward pass, outperforming a much larger teacher LLM at negligible cost\.

![Refer to caption](https://arxiv.org/html/2605.30723v1/x2.png)
Figure 2: The overall framework of MASA\.

## 2 Preliminary Study: One Skill Library Does Not Fit All

Before introducing MASA, we ask whether a single skill library serves all model scales equally\. To isolate the effect of skill form from skill content, we keep the underlying *principles* fixed and vary only the *granularity* of their textual expression\.

### 2\.1 Setup

We use ALFWorld Shridhar et al\. \([2020](https://arxiv.org/html/2605.30723#bib.bib9)\), a text\-based household task suite spanning six task types, and evaluate on the validation split\. We compare four Qwen3 backbones \(4B/8B/14B/32B\) Yang et al\. \([2025](https://arxiv.org/html/2605.30723#bib.bib51)\) that differ primarily in capacity while sharing the same architecture and training regimen\. We design one No Skill control and three skill\-granularity levels that encode identical behavioral principles but differ in representational depth\. Following prior work, we adopt the skill library of Xia et al\. \([2026](https://arxiv.org/html/2605.30723#bib.bib15)\) as the Moderate variant and construct the Concise and Detailed variants through controlled rewriting that preserves the underlying principles while adjusting granularity \(see Table [4](https://arxiv.org/html/2605.30723#A3.T4) in Appendix [C\.1](https://arxiv.org/html/2605.30723#A3.SS1) for side\-by\-side examples\)\. All three levels use the same retrieval pipeline, ensuring that observed differences are attributable to granularity alone\.

### 2\.2 Findings

Figure [1](https://arxiv.org/html/2605.30723#S1.F1) reports overall ALFWorld success rates\.

#### Finding 1: The optimal skill form is model\-dependent, and mismatches can hurt\.

No single granularity level is uniformly optimal across models\. Qwen3\-4B performs best with Moderate skills, while Qwen3\-14B and Qwen3\-32B achieve their highest scores with Detailed skills\. Notably, Qwen3\-8B performs best under the No Skill condition \(32\.1%\), and all three skill variants reduce performance\. Importantly, this does not suggest that skills are inherently incompatible with Qwen3\-8B\. Trajectory inspection reveals that, without external guidance, Qwen3\-8B often follows short and effective action chains that directly solve the task\. Misaligned skills instead introduce procedural reasoning patterns that override these naturally concise action chains, causing the model to over\-explore or deliberate unnecessarily\. This suggests that the effectiveness of a skill depends not only on its content but also on whether its expression is compatible with the model’s default problem\-solving strategy\.

#### Finding 2: The granularity–performance relationship is non\-monotonic and defies simple heuristics\.

It is unclear how skill granularity should scale with model capability: smaller models may benefit from concise guidance due to limited context utilization capacity, yet they may also require more explicit procedural supervision because of weaker reasoning abilities\. Our results show that neither direction holds consistently\. Qwen3\-32B underperforms Qwen3\-14B by 4\.6 points under Detailed despite being more than twice the size, inverting the usual scaling trend\. For Qwen3\-4B, performance does not monotonically improve in either direction: Moderate outperforms both Concise and Detailed, indicating that the optimum lies at an intermediate level that cannot be reached by simply “adding more detail” or “stripping to minimum\.” This complexity necessitates search\-based rather than rule\-based skill adaptation\.

#### Finding 3: Performance varies sharply across task types\.

The per\-task breakdown \(Appendix [C\.2](https://arxiv.org/html/2605.30723#A3.SS2)\) reveals that within a given model–granularity pairing, success rates can vary by over 60 points across task types, a spread far exceeding the differences between granularity levels for any single task\. For example, Qwen3\-14B with Concise skills scores 74\.2 on Pick but only 13\.7 on Cool\. Some task types benefit from detailed skills regardless of model size, while others are insensitive or even harmed\. This heterogeneity suggests that global optimization alone is insufficient: skill alignment must also operate at the task\-type level to address the distinct demands of each task type\.

A parallel experiment on the Gemma3 family \(4B/12B/27B\) reveals the same scale\-dependent trend \(Appendix [C\.3](https://arxiv.org/html/2605.30723#A3.SS3)\), suggesting the phenomenon generalizes across model families\.

#### Implications\.

Together, the three findings impose concrete design requirements:

1. *Model\-conditioned:* the optimal skill form varies per backbone, so alignment must be explicitly conditioned on the target model’s capacity \(Finding 1\)\.
2. *Search\-based rather than heuristic:* the relationship between skill granularity and performance is non\-monotonic and model\-specific, ruling out simple alignment rules \(Finding 2\)\.
3. *Task\-type\-specific:* within the same backbone, different task types respond differently to the same skills, requiring per\-task\-type adaptation in addition to global optimization \(Finding 3\)\.

We further note that our controlled study varies only one axis of skill form \(textual granularity\) while holding content fixed\. In practice, misalignment can also arise from differences in decision strategy, framing, or format, suggesting that a complete solution must perform open\-ended, model\-aware rewriting\. MASA is designed to address all three requirements\.

## 3 Method: MASA

We present MASA, a framework that *conditions* skill evolution on the capability profile of a target backbone, yielding skill libraries specifically adapted to each model rather than relying on a universal, model\-agnostic formulation\. MASA comprises two complementary components: a search\-time skill evolution pipeline \(Section [3\.2](https://arxiv.org/html/2605.30723#S3.SS2)\) that evolves skills under explicit capability conditioning provided by a structured *model card*, and a deployment\-time skill rewriter \(Section [3\.3](https://arxiv.org/html/2605.30723#S3.SS3)\) that learns this model\-conditioned rewriting policy and adapts new skills in a single forward pass\. An overview of the framework is shown in Figure [2](https://arxiv.org/html/2605.30723#S1.F2)\.

### 3\.1 Problem Formulation and Skill Library

#### Agent setup\.

A frozen LLM agent $F$ interacts with environment $\mathcal{E}$\. At each step $t$, the agent receives observation $o_t$, retrieves relevant skills from a skill library $\mathcal{S}$, and produces an action $a_t$:

$$a_t \sim F\!\left(\cdot \mid \tau_{<t},\; \hat{\mathcal{S}}_t\right), \quad \hat{\mathcal{S}}_t = \mathrm{TopK}(\mathcal{S}, o_t, k), \tag{1}$$

where $\tau_{<t} = (o_1, a_1, \ldots, o_{t-1}, a_{t-1})$ is the interaction history and $\mathrm{TopK}$ retrieves the $k$ most relevant skills by cosine similarity\. The backbone $F$ remains frozen throughout, and the sole optimization variable is the skill library $\mathcal{S}$\.
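The retrieval step in Eq. (1) can be sketched as a plain top-k selection by cosine similarity. This is a minimal illustration, not the authors' implementation: the names `cosine` and `top_k_skills` and the toy two-dimensional embeddings are assumptions, and a real system would obtain the vectors from an embedding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_skills(
    library: List[Tuple[str, Sequence[float]]],  # (skill text, embedding) pairs
    observation_vec: Sequence[float],            # embedding of the observation o_t
    k: int,
) -> List[str]:
    """Return the texts of the k skills most similar to the observation."""
    ranked = sorted(
        library,
        key=lambda item: cosine(item[1], observation_vec),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy library with hypothetical embeddings, for illustration only.
library = [
    ("verify the action parsed correctly", [1.0, 0.0]),
    ("check the fridge before the counter", [0.0, 1.0]),
    ("re-read the goal after each failure", [1.0, 1.0]),
]
retrieved = top_k_skills(library, observation_vec=[1.0, 0.1], k=2)
```

Because the backbone is frozen, everything the search later changes is the text stored in `library`; the retrieval routine itself stays fixed.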

#### Hierarchical skill library\.

Following Xia et al\. \([2026](https://arxiv.org/html/2605.30723#bib.bib15)\), we structure $\mathcal{S}$ into two levels: *general skills* $\mathcal{S}^{G}$ \(cross\-task strategy principles\) and *task\-specific skills* $\mathcal{S}^{T} = \{\mathcal{S}^{T_c}\}_{c \in \mathcal{C}}$, where each $\mathcal{S}^{T_c}$ contains action guidelines tailored to task type $c$ and $\mathcal{C}$ is the set of all task types\. At inference, a lightweight encoder \(Qwen3\-Embedding\-0\.6B\) separately retrieves the top\-$k_G$ general skills and the top\-$k_T$ task\-specific skills for the current observation\.

#### Model card\.

The key conditioning signal in MASA is the *model card* $\mathcal{M}_F$, a structured profile of a target backbone $F$\. Each card contains: \(i\) *architecture metadata* \(model family, parameter count, layer/attention configuration, context window\), \(ii\) *training provenance* \(training data scale, multilingual support\), and \(iii\) *capability profile* \(strengths and weaknesses of the backbone\)\. The construction protocol is detailed in Appendix [D](https://arxiv.org/html/2605.30723#A4)\.
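One way to picture the three parts of a model card is as a small structured record that is rendered to text before being shown to the teacher or the rewriter. The sketch below is an assumption-laden illustration: the class name `ModelCard`, its field names, the `render` format, and all example values are hypothetical and are not taken from the paper's Appendix D.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelCard:
    # (i) architecture metadata (hypothetical field names)
    family: str
    parameter_count: str
    context_window: int
    # (ii) training provenance
    training_notes: str
    # (iii) capability profile
    strengths: List[str] = field(default_factory=list)
    weaknesses: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Serialize the card as plain text for use as a conditioning prompt."""
        lines = [
            f"Family: {self.family}",
            f"Parameters: {self.parameter_count}",
            f"Context window: {self.context_window} tokens",
            f"Training: {self.training_notes}",
            "Strengths: " + "; ".join(self.strengths),
            "Weaknesses: " + "; ".join(self.weaknesses),
        ]
        return "\n".join(lines)


# Placeholder values for illustration only; not a description of any real model.
card = ModelCard(
    family="ExampleLM",
    parameter_count="4B",
    context_window=8192,
    training_notes="multilingual corpus (scale unspecified)",
    strengths=["follows short instructions"],
    weaknesses=["loses track of long procedures"],
)
card_text = card.render()
```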

#### Objective\.

We define a per\-episode adjusted reward $R$ that balances task completion against skill\-induced stalling:

$$R(F, \mathcal{S}, e) = \mathrm{SR}(F, \mathcal{S}, e) - \lambda \cdot \mathrm{NHR}(F, \mathcal{S}, e), \tag{2}$$

where $e$ denotes a single episode, $\mathrm{SR} \in \{0, 1\}$ is task success, $\mathrm{NHR}$ is the *nothing\-happens rate* \(the fraction of steps after which the environment state remains unchanged, serving as a proxy for skill\-induced stalling, e\.g\., the agent repeatedly issuing ineffective or invalid actions\), and $\lambda \in [0, 1]$ controls the penalty strength\. The overall optimization objective seeks the skill library maximizing expected adjusted reward over a set of evaluation episodes $\mathcal{D}$:

$$\mathcal{S}^{\star}_{F} = \arg\max_{\mathcal{S}} \; \mathbb{E}_{e \sim \mathcal{D}}\!\left[R(F, \mathcal{S}, e)\right], \tag{3}$$

where $\mathcal{S}^{\star}_{F}$ denotes the optimal skill library adapted to backbone $F$\.
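As a worked illustration of Eqs. (2) and (3), the sketch below computes the adjusted reward for individual episodes and averages it over an evaluation set. The function names, the per-step `state_changed` flags, and the value `lam=0.5` are assumptions chosen for the example; the text above only constrains the penalty weight to lie in [0, 1].

```python
from typing import List, Tuple


def nothing_happens_rate(state_changed: List[bool]) -> float:
    """Fraction of steps after which the environment state did not change."""
    if not state_changed:
        return 0.0
    unchanged = sum(1 for changed in state_changed if not changed)
    return unchanged / len(state_changed)


def adjusted_reward(success: bool, state_changed: List[bool], lam: float) -> float:
    """Eq. (2): R = SR - lambda * NHR for a single episode."""
    return float(success) - lam * nothing_happens_rate(state_changed)


def expected_reward(episodes: List[Tuple[bool, List[bool]]], lam: float) -> float:
    """Empirical version of the expectation in Eq. (3) over a set of episodes."""
    rewards = [adjusted_reward(s, steps, lam) for s, steps in episodes]
    return sum(rewards) / len(rewards)


# A successful episode in which half of the steps changed nothing.
r_single = adjusted_reward(True, [True, False, True, False], lam=0.5)

# One clean success and one failed episode that stalled on every step.
r_mean = expected_reward(
    [(True, [True, True]), (False, [False, False])],
    lam=0.5,
)
```

With these inputs, the first episode scores 1 - 0.5 * 0.5 = 0.75, and the two-episode average is (1.0 + (-0.5)) / 2 = 0.25, showing how a stalled failure is penalized below zero.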

### 3\.2 Hierarchical Model\-Conditioned Skill Evolution

The skill evolution pipeline is a teacher\-driven search over skill texts\. A stronger *teacher* LLM $T$ \(i\) analyzes failure trajectories of $F$ to produce a structured failure attribution and \(ii\) rewrites skills conditioned on the model card $\mathcal{M}_F$, steering edits toward formulations compatible with $F$’s observed characteristics\.

#### Two\-stage optimization\.

The pipeline optimizes $\mathcal{S}^{G}$ and $\{\mathcal{S}^{T_c}\}$ in separate stages, motivated by both computational efficiency and conceptual separation\. From a computational perspective, a single edit to $\mathcal{S}^{G}$ requires evaluation over the full task suite, whereas edits to $\mathcal{S}^{T_c}$ affect only a single task type\. From a modeling perspective, the two skill levels encode fundamentally different forms of knowledge\. General skills capture backbone\-level behavioral guidance that is intended to transfer across tasks \(e\.g\., “always verify your action parsed correctly”\), while task\-specific skills encode domain procedures tailored to particular environments \(e\.g\., “check the fridge before the counter”\)\. Therefore, separating the two stages simplifies credit assignment across the two skill levels while substantially reducing search cost\.

#### Stage 1: General skills via iterative hill climbing\.

General skills encode high\-level behavioral priors that affect agent behavior across many task types\. Evaluating a candidate general skill requires running the agent across the full task suite and aggregating feedback over diverse environments, making exhaustive search prohibitively expensive\. We therefore optimize $\mathcal{S}^{G}$ via iterative hill climbing Russell \([2010](https://arxiv.org/html/2605.30723#bib.bib50)\), which provides a simple and effective strategy for progressively improving the current skill set under environment feedback\.

Each iteration proceeds as follows\. *Rollout*: the target model $F$ equipped with the current best general skills is rolled out across all task types to compute the reward across episodes\. *Analysis*: the teacher collects failed trajectories from these rollouts and produces a structured failure attribution focusing on general behavioral deficiencies rather than task\-specific procedural gaps\. *Rewrite*: given the current best skill set, the failure attribution, the model card $\mathcal{M}_F$, and the $K$ highest\-reward skill sets from all previous iterations \(which help the teacher learn from the full optimization trajectory rather than only the most recent failure\), the teacher outputs a revised general skill set\. *Accept/Reject*: the new candidate is evaluated on the full task suite and accepted only if it achieves a higher reward than the current best\. Search terminates after at most $I$ iterations or after $p$ consecutive iterations without improvement\.
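The accept/reject loop described above can be sketched as follows. This is a simplified reading rather than the paper's Algorithm 1: `evaluate` stands in for full-suite rollouts, `rewrite` stands in for the teacher's analysis and rewrite steps (including conditioning on the model card), and both are hypothetical callables supplied by the caller.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")  # a skill set; any representation works for this sketch


def hill_climb(
    initial: S,
    evaluate: Callable[[S], float],                       # stand-in for rollouts
    rewrite: Callable[[S, List[Tuple[float, S]]], S],     # stand-in for the teacher
    max_iters: int,   # I in the text
    patience: int,    # p in the text
    top_k: int,       # K in the text
) -> Tuple[S, float]:
    best, best_reward = initial, evaluate(initial)
    history: List[Tuple[float, S]] = [(best_reward, best)]
    stall = 0
    for _ in range(max_iters):
        # Show the teacher the K highest-reward skill sets seen so far.
        exemplars = sorted(history, key=lambda pair: pair[0], reverse=True)[:top_k]
        candidate = rewrite(best, exemplars)
        reward = evaluate(candidate)
        history.append((reward, candidate))
        if reward > best_reward:
            # Accept only strict improvements over the current best.
            best, best_reward, stall = candidate, reward, 0
        else:
            stall += 1
            if stall >= patience:
                break
    return best, best_reward


# Toy problem: the "skill set" is an integer and the reward peaks at 3.
found, found_reward = hill_climb(
    initial=0,
    evaluate=lambda x: -float((x - 3) ** 2),
    rewrite=lambda current, exemplars: current + 1,
    max_iters=10,
    patience=2,
    top_k=3,
)
```

In the toy run the candidates 1, 2, and 3 are accepted in turn, the candidate 4 is rejected twice, and the search stops early on the patience criterion with the best value 3.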

#### Stage 2: Task\-specific skills via per\-type tree search\.

Unlike general skills, task\-specific skills encode domain procedures where multiple structurally different strategies may be effective for the same task type\. This motivates a tree\-structured search that can explore diverse branches rather than committing to a single refinement path\. We run an independent tree search per task type $c$, where each node holds a candidate task\-specific skill set $\mathcal{S}^{T_c}$ and each edge corresponds to a teacher rewrite\.

Each iteration proceeds in four steps\. *Selection*: starting from the root, UCB1 Kocsis and Szepesvári \([2006](https://arxiv.org/html/2605.30723#bib.bib25)\) is applied recursively to select the most promising leaf node, balancing exploitation of high\-reward nodes with exploration of less\-visited ones\. *Expansion*: the target model $F$ is rolled out on type\-$c$ tasks using the selected node’s skill set, and the teacher collects failed trajectories, produces a failure attribution, and outputs a revised task\-specific skill set, forming a new child node\. *Evaluation*: the new child’s skill set is evaluated on type\-$c$ tasks to obtain its average reward\. *Backpropagation*: the reward is propagated from the new node back to the root, updating visit counts and value estimates along the path\. Per\-type trees are independent and fully parallelizable\.
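The four steps can be sketched with a small UCB1 tree. Several details here are assumptions rather than the paper's Algorithm 2: the `max_children` width (a node is expanded until it has that many children and only then descended through), the exploration constant `c`, and the `evaluate` and `rewrite` callables, which stand in for type-specific rollouts and the teacher rewrite.

```python
import math
from typing import Callable, List, Optional, Tuple


class Node:
    """One candidate task-specific skill set in a per-type search tree."""

    def __init__(self, skills: str, parent: Optional["Node"] = None) -> None:
        self.skills = skills
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0        # number of backpropagated paths through this node
        self.value_sum = 0.0   # sum of rewards propagated through this node
        self.reward = 0.0      # this node's own evaluated reward


def ucb1(child: Node, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """UCB1 score: mean value plus an exploration bonus for rarely visited nodes."""
    if child.visits == 0:
        return math.inf
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select(root: Node, max_children: int) -> Node:
    """Descend by UCB1 until reaching a node with room for another child."""
    node = root
    while len(node.children) >= max_children:  # assumes max_children >= 1
        parent = node
        node = max(parent.children, key=lambda ch: ucb1(ch, parent.visits))
    return node


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Update visit counts and value sums from a node up to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent


def tree_search(
    root_skills: str,
    evaluate: Callable[[str], float],   # stand-in for rollouts on one task type
    rewrite: Callable[[str], str],      # stand-in for the teacher rewrite
    iterations: int,
    max_children: int = 2,
) -> Tuple[Node, Node]:
    root = Node(root_skills)
    root.reward = evaluate(root_skills)
    backpropagate(root, root.reward)
    best = root
    for _ in range(iterations):
        parent = select(root, max_children)                 # Selection
        child = Node(rewrite(parent.skills), parent=parent)  # Expansion
        parent.children.append(child)
        child.reward = evaluate(child.skills)               # Evaluation
        backpropagate(child, child.reward)                  # Backpropagation
        if child.reward > best.reward:
            best = child
    return best, root


# Toy run: longer skill strings score higher; each rewrite appends one character.
best_node, root_node = tree_search(
    "a", evaluate=lambda s: len(s) / 10, rewrite=lambda s: s + "x", iterations=3
)
```

Since each task type gets its own tree, independent calls to `tree_search` could be dispatched in parallel, matching the note above that per-type trees do not interact.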

Overall, the two stages run sequentially: $\mathcal{S}^{G\star}_{F}$ obtained in Stage 1 is held fixed throughout Stage 2, and the final output is a *model\-specific* skill library $\mathcal{S}^{\star}_{F} = (\mathcal{S}^{G\star}_{F}, \{\mathcal{S}^{T_c\star}_{F}\}_{c \in \mathcal{C}})$\. The detailed procedures are given in Algorithms [1](https://arxiv.org/html/2605.30723#alg1) and [2](https://arxiv.org/html/2605.30723#alg2), and further details of the two\-stage search are provided in Appendix [E](https://arxiv.org/html/2605.30723#A5)\.

### 3\.3 Model\-Conditioned Skill Rewriter

The skill evolution pipeline delivers strong skill libraries but requires substantial compute \(hundreds to thousands of full\-environment rollouts\) and an environment\-provided reward signal\. MASA\-Rewriter addresses this by learning the rewriting policy that the evolution pipeline implicitly executes, enabling cheap adaptation to new domains and tasks without further environment interaction\.

#### Training data\.

Each training instance maps an input skill set to the corresponding evolved optimum:

$$(\mathcal{M}_F,\; \mathcal{S}_{F_{\text{in}}},\; d) \longrightarrow \mathcal{S}^{\star}_{F}, \tag{4}$$

where $\mathcal{M}_F$ is the model card, $d$ is the task description, and $\mathcal{S}_{F_{\text{in}}}$ is an input skill set \(either general $\mathcal{S}^{G}$ or task\-specific $\mathcal{S}^{T_c}$\)\. The output $\mathcal{S}^{\star}_{F}$ is always drawn from the evolution pipeline’s high\-scoring skill sets\. To ensure the rewriter learns to improve skills regardless of their initial quality, $\mathcal{S}_{F_{\text{in}}}$ is deliberately sampled from sources spanning a wide quality range: \(i\) *search intermediates* at early, mid, and late stages of the evolution pipeline; \(ii\) *cross\-model transfers*, i\.e\., optimal skills from a different backbone; \(iii\) *one\-shot teacher rewrites* without iterative search; and \(iv\) *augmented variants* \(noisy, partial, or verbose perturbations of existing skills\)\. This diversity exposes the rewriter to the full range of input conditions it may encounter at deployment\. Additional details are provided in Appendix [F](https://arxiv.org/html/2605.30723#A6)\.
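The pairing in Eq. (4) can be pictured as assembling one training record per input skill set, all pointing at the same evolved target. The sketch below is illustrative only: the record type `RewriteExample`, the source labels in `SOURCES`, and the string representation of skills are assumptions, not the data format described in Appendix F.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RewriteExample:
    model_card: str        # rendered model card for the target backbone
    input_skills: str      # the input skill set of varying quality
    task_description: str  # d in Eq. (4)
    target_skills: str     # evolved high-scoring skill set
    source: str            # which of the four input sources produced the input


# Hypothetical labels for the four input sources listed in the text.
SOURCES = (
    "search_intermediate",
    "cross_model_transfer",
    "one_shot_teacher",
    "augmented_variant",
)


def build_examples(
    model_card: str,
    task_description: str,
    target_skills: str,
    inputs_by_source: Dict[str, List[str]],
) -> List[RewriteExample]:
    """Pair every available input skill set with the evolved target."""
    examples: List[RewriteExample] = []
    for source in SOURCES:
        for skills in inputs_by_source.get(source, []):
            examples.append(
                RewriteExample(model_card, skills, task_description, target_skills, source)
            )
    return examples


examples = build_examples(
    model_card="Family: ExampleLM",
    task_description="household tasks in a text environment",
    target_skills="evolved skill text",
    inputs_by_source={
        "search_intermediate": ["early draft", "late draft"],
        "augmented_variant": ["verbose perturbation"],
    },
)
```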

#### Training\.

We instantiate the skill rewriter with Qwen3\-4B, a lightweight backbone chosen to keep deployment cost minimal while retaining sufficient capacity for structured rewriting\. The model is trained via supervised fine\-tuning \(SFT\) with cross\-entropy loss:

$$\mathcal{L} = -\mathbb{E}_{\mathcal{D}_{\text{train}}}\!\left[\log p_{\theta}\!\left(\mathcal{S}^{\star}_{F} \;\middle|\; \mathcal{M}_F,\, \mathcal{S}_{F_{\text{in}}},\, d\right)\right]. \tag{5}$$
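For intuition about Eq. (5), the sequence log-probability of a target skill text factorizes into a sum of per-token log-probabilities, and the loss is the negative average over training examples. The sketch below works through that arithmetic with made-up token probabilities; the function names and the numbers are assumptions for illustration, and a real setup would take these probabilities from the rewriter model.

```python
import math
from typing import List


def sequence_log_prob(token_probs: List[float]) -> float:
    """log p(target | inputs) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def sft_loss(batch: List[List[float]]) -> float:
    """Negative mean sequence log-likelihood over a batch, as in Eq. (5)."""
    total = sum(sequence_log_prob(token_probs) for token_probs in batch)
    return -total / len(batch)


# Two hypothetical targets: one with two tokens at p = 0.5 each,
# and one with a single token at p = 0.25.
loss = sft_loss([[0.5, 0.5], [0.25]])
```

Both toy targets have sequence probability 0.25, so the loss equals log 4 (about 1.386); raising any token probability lowers it.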

#### Inference\.

At deployment, given the target backbone’s model card $\mathcal{M}_F$, an input skill set $\mathcal{S}_{F_{\text{in}}}$, and the task description $d$, the skill rewriter produces an adapted skill set in a single forward pass:

$$\mathcal{S}^{\star}_{F} = f_{\theta}\!\left(\mathcal{M}_F,\; \mathcal{S}_{F_{\text{in}}},\; d\right), \tag{6}$$

without requiring any environment interaction or iterative search\.

| Model | Method | Pick | Look | Clean | Heat | Cool | Pick2 | ALFWorld SR ↑ | ALFWorld Steps ↓ | WebShop SR ↑ | WebShop Score ↑ | WebShop Steps ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-4B | No Skill | 20.0 | 15.4 | 18.5 | 18.8 | 16.0 | 12.5 | 17.1 | 44.6 | 23.0 | 42.2 | 9.5 |
| | + Base Skill | 20.0 | 30.8 | 29.6 | 12.5 | 20.0 | 8.3 | 20.0 | 42.3 | 19.4 | 34.8 | 11.0 |
| | + DS-Adapter | 28.6 | 30.8 | 37.0 | 18.8 | 32.0 | 12.5 | 27.1 | 40.0 | 19.2 | 25.7 | 12.4 |
| | + MASA | 25.7 | 53.8 | 40.7 | 37.5 | 24.0 | 20.8 | 31.4 | 38.4 | 26.4 | 49.1 | 8.4 |
| Qwen3-8B | No Skill | 54.3 | 46.2 | 29.6 | 6.2 | 24.0 | 20.8 | 32.1 | 39.1 | 4.6 | 32.7 | 10.0 |
| | + Base Skill | 17.1 | 38.5 | 40.7 | 31.2 | 20.0 | 12.5 | 25.0 | 40.5 | 6.0 | 32.6 | 9.3 |
| | + DS-Adapter | 25.7 | 38.5 | 44.4 | 25.0 | 16.0 | 16.7 | 27.1 | 39.6 | 4.4 | 18.2 | 12.6 |
| | + MASA | 62.9 | 38.5 | 70.4 | 75.0 | 56.0 | 37.5 | 57.9 | 29.2 | 28.6 | 60.1 | 4.7 |
| Qwen3-14B | No Skill | 65.7 | 38.5 | 25.9 | 43.8 | 16.0 | 29.2 | 37.9 | 36.7 | 2.8 | 19.9 | 12.7 |
| | + Base Skill | 68.6 | 46.2 | 44.4 | 25.0 | 20.0 | 33.3 | 42.1 | 34.1 | 1.6 | 14.8 | 13.5 |
| | + DS-Adapter | 68.6 | 53.8 | 40.7 | 18.8 | 40.0 | 29.2 | 44.3 | 34.8 | 2.0 | 12.6 | 13.6 |
| | + MASA | 85.7 | 53.8 | 81.5 | 56.2 | 44.0 | 45.8 | 64.3 | 25.7 | 29.2 | 54.4 | 8.0 |
| Qwen3-32B | No Skill | 48.6 | 46.2 | 44.4 | 25.0 | 32.0 | 16.7 | 36.4 | 37.0 | 6.6 | 35.2 | 9.9 |
| | + Base Skill | 48.6 | 46.2 | 40.7 | 50.0 | 44.0 | 20.8 | 41.4 | 35.6 | 7.2 | 24.2 | 12.0 |
| | + DS-Adapter | 51.4 | 38.5 | 59.3 | 37.5 | 44.0 | 29.2 | 45.0 | 32.2 | 3.6 | 14.3 | 13.3 |
| | + MASA | 57.1 | 46.2 | 77.8 | 81.3 | 76.0 | 54.2 | 65.7 | 24.3 | 34.6 | 59.9 | 7.3 |

Table 1: Performance on ALFWorld and WebShop\. ALFWorld reports per\-task and average success rate \(SR %\), and average interaction steps across all task types; WebShop reports average SR \(%\), score, and average steps\. Bold marks the best within each backbone\.

#### Complementary roles\.

The skill evolution pipeline provides per\-backbone upper bounds via explicit search and produces the training signal for the skill rewriter\. MASA\-Rewriter amortizes this search into a single forward pass, enabling rapid adaptation without environment interaction\. The evolution pipeline is preferred when rollout budget permits thorough optimization, whereas the skill rewriter is better suited to compute\-constrained deployment scenarios\.

## 4 Experiments

We evaluate whether model\-conditioned skill evolution outperforms model\-agnostic baselines across diverse environments and backbones, and whether MASA\-Rewriter can generalize the learned rewriting policy to held\-out task types and unseen environments without additional search\.

### 4\.1 Experimental Setup

#### Environments\.

We evaluate on three environments spanning distinct action spaces and reasoning demands\. \(i\) ALFWorld Shridhar et al\. \([2020](https://arxiv.org/html/2605.30723#bib.bib9)\) is a text\-based embodied environment where agents complete household tasks \(e\.g\., heating, cleaning, picking up objects\) by issuing text commands to navigate rooms and interact with objects\. It contains six task types with varying difficulty\. \(ii\) WebShop Yao et al\. \([2022a](https://arxiv.org/html/2605.30723#bib.bib8)\) simulates online shopping: agents navigate a realistic web interface, search for products, compare attributes, and make purchase decisions that satisfy natural\-language user specifications\. \(iii\) Search\-augmented QA requires agents to retrieve and synthesize information from web search results\. We include seven benchmarks covering both single\-hop \(NQ Kwiatkowski et al\. \([2019](https://arxiv.org/html/2605.30723#bib.bib37)\), TriviaQA Joshi et al\. \([2017](https://arxiv.org/html/2605.30723#bib.bib38)\), PopQA Mallen et al\. \([2023](https://arxiv.org/html/2605.30723#bib.bib39)\)\) and multi\-hop reasoning \(HotpotQA Yang et al\. \([2018](https://arxiv.org/html/2605.30723#bib.bib40)\), 2Wiki Ho et al\. \([2020](https://arxiv.org/html/2605.30723#bib.bib41)\), MuSiQue Trivedi et al\. \([2022](https://arxiv.org/html/2605.30723#bib.bib42)\), Bamboogle Press et al\. \([2023](https://arxiv.org/html/2605.30723#bib.bib43)\)\)\.

#### Backbones and baselines\.

Target agents are Qwen3\-\{4B, 8B, 14B, 32B\} Yang et al\. \([2025](https://arxiv.org/html/2605.30723#bib.bib51)\); the teacher LLM is DeepSeek\-V4\-Pro DeepSeek\-AI \([2026](https://arxiv.org/html/2605.30723#bib.bib60)\)\. All Qwen3 backbones are used in non\-thinking mode\. This choice reflects typical deployment scenarios where latency and token budgets are constrained, and ensures that observed performance differences are attributable to skill design rather than reasoning\-mode configuration\. We compare against three baselines: \(1\) *No Skill* \(the raw backbone without any skill augmentation\), \(2\) *Base Skill* \(the initial skill library from SkillRL Xia et al\. \([2026](https://arxiv.org/html/2605.30723#bib.bib15)\), shared across all backbones without model\-specific adaptation\), and \(3\) *DS\-Adapter* \(a one\-shot teacher rewrite that adapts the Base Skill library conditioned on the model card $\mathcal{M}_F$, without iterative search\)\.

The Base Skill library also serves as the initialization $\mathcal{S}^{G_0}_{F}$ and $\mathcal{S}^{T_{c_0}}_{F}$ for MASA’s evolution pipeline\.

### 4\.2Skill Evolution Evaluation

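| Model | Method | NQ† | TriviaQA⋆ | PopQA⋆ | HotpotQA† | 2Wiki⋆ | MuSiQue⋆ | Bamboogle⋆ | Avg\. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3\-4B | No Skill | 29\.4 | 51\.0 | 37\.2 | 27\.7 | 22\.8 | 6\.4 | 9\.3 | 32\.9 |
| | \+ Base Skill | 34\.5 | 57\.4 | 38\.2 | 28\.5 | 24\.4 | 7\.8 | 10\.1 | 35\.5 |
| | \+ DS\-Adapter | 33\.0 | 56\.5 | 41\.8 | 28\.6 | 23\.9 | 9\.3 | 12\.9 | 36\.2 |
| | \+ MASA | 35\.5 | 55\.3 | 38\.9 | 27\.4 | 27\.0 | 9\.4 | 61\.3 | 36\.9 |
| Qwen3\-8B | No Skill | 19\.1 | 46\.5 | 30\.3 | 24\.8 | 30\.6 | 6\.7 | 68\.1 | 31\.3 |
| | \+ Base Skill | 34\.0 | 58\.5 | 38\.8 | 28\.6 | 25\.9 | 6\.2 | 10\.1 | 36\.1 |
| | \+ DS\-Adapter | 33\.2 | 57\.6 | 38\.7 | 27\.8 | 22\.9 | 5\.6 | 7\.7 | 35\.0 |
| | \+ MASA | 36\.4 | 56\.7 | 39\.0 | 28\.6 | 25\.7 | 10\.0 | 62\.5 | 37\.2 |
| Qwen3\-14B | No Skill | 33\.8 | 60\.2 | 40\.6 | 31\.7 | 26\.8 | 7\.6 | 10\.5 | 37\.6 |
| | \+ Base Skill | 35\.3 | 60\.5 | 39\.5 | 32\.7 | 30\.3 | 11\.4 | 15\.3 | 38\.5 |
| | \+ DS\-Adapter | 33\.9 | 60\.2 | 39\.5 | 31\.6 | 28\.5 | 9\.2 | 12\.5 | 37\.7 |
| | \+ MASA | 35\.6 | 61\.8 | 40\.7 | 32\.8 | 30\.0 | 9\.7 | 8\.9 | 39\.0 |
| Qwen3\-32B | No Skill | 29\.1 | 59\.8 | 38\.3 | 32\.2 | 29\.3 | 8\.6 | 64\.5 | 38\.1 |
| | \+ Base Skill | 33\.8 | 61\.4 | 39\.3 | 33\.8 | 26\.0 | 11\.7 | 67\.7 | 38\.7 |
| | \+ DS\-Adapter | 34\.4 | 61\.5 | 40\.6 | 34\.0 | 32\.0 | 11\.6 | 64\.1 | 40\.6 |
| | \+ MASA | 37\.0 | 61\.6 | 40\.0 | 34\.2 | 35\.6 | 11\.8 | 66\.1 | 41\.5 |

Columns NQ through PopQA are Single\-Hop QA; columns HotpotQA through Bamboogle are Multi\-Hop QA\.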

Table 2: Search\-augmented QA results \(success rate %\)\. Skill evolution is conducted on NQ and HotpotQA; † and ⋆ indicate in\-domain and out\-of\-domain datasets, respectively\. Bold marks the best within each backbone\.

Table [1](https://arxiv.org/html/2605.30723#S3.T1) reports ALFWorld and WebShop results across all four backbones\.

#### ALFWorld\.

MASA achieves the highest average success rate for every backbone: 31\.4 \(4B\), 57\.9 \(8B\), 64\.3 \(14B\), and 65\.7 \(32B\), with gains of \+4\.3, \+25\.8, \+20\.0, and \+20\.7 over the strongest baseline, respectively\. We highlight several observations:

\(1\) Per\-task dominance\. Beyond the aggregate, MASA achieves the best per\-task SR in most individual task types\. For Qwen3\-14B and 32B, MASA ranks first on *all six* task types simultaneously, indicating that the evolved skills improve overall performance without sacrificing coverage across tasks\.

\(2\) Model\-agnostic skills can hurt\. Base Skill and DS\-Adapter exhibit severe performance drops on individual tasks, indicating that generic or one\-shot adapted skills can introduce model\-specific conflicts\. In contrast, MASA avoids these regressions through iterative model\-conditioned search\.

\(3\) Scaling behavior\. For 8B and above, the backbones have sufficient capacity to leverage model\-specific skills, yielding substantial improvements\. The gain on 4B is comparatively modest, likely due to the backbone’s inherent capability ceiling limiting how much skill guidance can help\.

\(4\) Inference efficiency\. MASA consistently reduces average interaction steps \(e\.g\., 8B: 39\.1 → 29\.2; 14B: 36\.7 → 25\.7\)\. By tailoring skills to each backbone’s specific behavior patterns, MASA helps the agent locate target objects and execute correct action sequences more precisely, reducing redundant exploration and failed attempts\.
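As a quick worked check of the efficiency observation, the relative step reduction implied by the two quoted examples can be computed directly\. The sketch below uses only the step counts quoted above; the name `relative_reduction` is an illustrative choice, and the starting counts are labelled generically because the passage does not say which baseline they belong to\.

```python
# Average ALFWorld interaction steps quoted in the text: (before, with MASA).
# Which baseline the "before" value comes from is not stated in this passage.
steps = {"Qwen3-8B": (39.1, 29.2), "Qwen3-14B": (36.7, 25.7)}


def relative_reduction(before: float, after: float) -> float:
    """Return the fraction of steps saved relative to the starting count."""
    return (before - after) / before


# Percentage reduction per backbone, rounded to one decimal place.
reductions = {
    model: round(100 * relative_reduction(before, after), 1)
    for model, (before, after) in steps.items()
}

for model, pct in reductions.items():
    print(f"{model}: {pct}% fewer steps")
```

Under these figures the reduction is about 25% on 8B and about 30% on 14B\.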

#### WebShop\.

MASA again achieves the highest success rate and score for every backbone, substantially outperforming all baselines\. WebShop reveals a critical challenge for larger Qwen3 models:

\(1\) Larger models perform worse than 4B without adaptation\. Notably, 8B/14B/32B baselines all underperform 4B on WebShop \(e\.g\., 14B No Skill: 2\.8% vs\. 4B No Skill: 23\.0%\)\. We trace this to excessive chain\-of\-thought generation: larger models produce verbose reasoning preambles before each action, inflating action length and exhausting the step budget on deliberation rather than environment interaction \(detailed statistics in Appendix [G](https://arxiv.org/html/2605.30723#A7)\)\. Since model\-agnostic skills are not designed to address this model\-specific behavioral pattern, they provide limited benefit and in some cases further degrade performance\.

\(2\) MASA addresses this challenge\. By evolving skills conditioned on each backbone’s behavioral profile, MASA guides models toward effective action patterns, achieving SR of 26\.4 \(4B\), 28\.6 \(8B\), 29\.2 \(14B\), and 34\.6 \(32B\), far surpassing all baselines\. The efficiency gain is also notable: baselines that do succeed average 12–13 steps, whereas MASA achieves higher SR in only 7–8 steps, dropping to just 4\.7 steps on 8B\.

#### Search\-augmented QA\.

Table [2](https://arxiv.org/html/2605.30723#S4.T2) shows that MASA achieves the highest average SR for every backbone\. Skill evolution is conducted only on NQ and HotpotQA, yet the gains generalize strongly to out\-of\-domain benchmarks \(⋆\); e\.g\., on 4B, MASA improves Bamboogle from 12\.9 \(best baseline\) to 61\.3\. On the largest backbone \(32B\), MASA ranks first on 5 out of 7 datasets\. These results demonstrate that the evolved skills capture transferable strategies for retrieval and reasoning, rather than overfitting to the datasets used during skill evolution\.
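The per\-dataset ranking claim for the 32B backbone can be checked against the Qwen3\-32B rows of Table 2\. The following is a small sketch that copies those four rows and counts the datasets on which MASA has the highest score; the helper name `datasets_won` and the data layout are assumptions made for illustration only\.

```python
# Qwen3-32B rows of Table 2 (success rate %), in dataset order.
datasets = ["NQ", "TriviaQA", "PopQA", "HotpotQA", "2Wiki", "MuSiQue", "Bamboogle"]
qwen3_32b = {
    "No Skill":   [29.1, 59.8, 38.3, 32.2, 29.3, 8.6, 64.5],
    "Base Skill": [33.8, 61.4, 39.3, 33.8, 26.0, 11.7, 67.7],
    "DS-Adapter": [34.4, 61.5, 40.6, 34.0, 32.0, 11.6, 64.1],
    "MASA":       [37.0, 61.6, 40.0, 34.2, 35.6, 11.8, 66.1],
}


def datasets_won(rows: dict, method: str) -> list:
    """List the datasets where `method` reaches the best score (ties count)."""
    wins = []
    for i, name in enumerate(datasets):
        best = max(scores[i] for scores in rows.values())
        if rows[method][i] == best:
            wins.append(name)
    return wins


masa_wins = datasets_won(qwen3_32b, "MASA")
print(len(masa_wins), masa_wins)
```

With these rows MASA leads on every dataset except PopQA \(DS\-Adapter is higher\) and Bamboogle \(Base Skill is higher\), which matches the count of 5 out of 7\.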

### 4\.3Skill Rewriter Generalization

![Refer to caption](https://arxiv.org/html/2605.30723v1/x3.png)

Figure 3: OOD generalization of MASA\-Rewriter on held\-out ALFWorld task types\. Pink bars denote baselines and blue bars denote MASA\-Rewriter variants\.

We evaluate whether MASA\-Rewriter can adapt skills for task types not seen during its training, by holding out three ALFWorld task types \(Clean, Heat, Cool\) and asking MASA\-Rewriter to produce task\-specific skills for these types\. The general skills remain unchanged\.

We compare against three baselines: *Base Skill*, *4B\-Rewrite* \(Qwen3\-4B used as a rewriter without SFT, i\.e\., the same architecture as MASA\-Rewriter but without learned rewriting ability\), and *DS\-Adapter* \(one\-shot teacher rewrite targeting the specific held\-out task\)\. The two MASA\-Rewriter variants differ in training data composition:

#### Cross\-environment transfer\.

MASA\-Rewriter is trained exclusively on skill evolution traces from Search and WebShop, then applied to ALFWorld without any in\-environment data\. Despite the substantial environment gap \(different action spaces and observation formats\), Cross\-env MASA\-Rewriter outperforms DS\-Adapter on all four backbones \(Figure [3](https://arxiv.org/html/2605.30723#S4.F3)\), with gains of \+1\.5 \(4B\), \+3\.0 \(8B\), \+2\.9 \(14B\), and \+3\.0 \(32B\)\. This demonstrates that the learned rewriting policy captures model\-specific adaptation patterns that transfer across environments: the rewriter can produce useful skills even without exposure to the target environment during training\.

#### Cross\-task transfer\.

Building on Cross\-env, this variant additionally uses evolution traces from three other ALFWorld task types \(Pick, Look, Pick2\), excluding the held\-out evaluation types\. Cross\-task MASA\-Rewriter achieves substantially larger gains: \+8\.8 \(8B\), \+13\.2 \(14B\), and \+7\.4 \(32B\) over DS\-Adapter\. The gains over Cross\-env are especially large on 8B \(\+5\.8\) and 14B \(\+10\.3\), suggesting that even skills from unrelated task types provide valuable supervision for adapting to environment\-specific interaction patterns and observation structures\.
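The two sets of gains are both measured against DS\-Adapter, so the additional gain of Cross\-task over Cross\-env is simply their difference\. A short worked sketch using only the gains quoted in the two paragraphs above \(variable names are illustrative\):

```python
# Gains over DS-Adapter (points) quoted in the text for the shared backbones.
cross_env_gain = {"8B": 3.0, "14B": 2.9, "32B": 3.0}
cross_task_gain = {"8B": 8.8, "14B": 13.2, "32B": 7.4}

# Both gains share the same reference, so subtracting them gives the gain
# of Cross-task over Cross-env directly.
extra_gain = {
    size: round(cross_task_gain[size] - cross_env_gain[size], 1)
    for size in cross_task_gain
}

for size, gain in extra_gain.items():
    print(f"{size}: +{gain} over Cross-env")
```

The 8B and 14B differences reproduce the \+5\.8 and \+10\.3 stated in the text; the same arithmetic gives \+4\.4 on 32B\.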

Notably, MASA\-Rewriter \(4B parameters\) consistently surpasses DS\-Adapter powered by DeepSeek\-V4, demonstrating that a small trained rewriter can outperform a much larger teacher at a fraction of the inference cost\.

Due to space constraints, additional materials including extended related work discussion, ablations, supplementary validation on Gemma3, qualitative examples, and full hyperparameter details are reported in Appendices [A](https://arxiv.org/html/2605.30723#A1)–[I](https://arxiv.org/html/2605.30723#A9)\.

## 5Conclusion

We presented MASA, motivated by the observation that the one\-size\-fits\-all assumption in agent skill design breaks down across model scales\. MASA addressed this through hierarchical skill evolution and a lightweight model\-conditioned rewriter that amortizes search into a single forward pass\. Across three environments and four Qwen3 backbones, MASA achieved the best success rate in all settings, with gains up to \+25\.8 points\. The rewriter further generalized to unseen tasks and environments at negligible deployment cost\.

We hope this work motivates treating skills as model\-aware artifacts that should be adapted to their target backbone rather than shared uniformly across models of different capacities\. With proper alignment, even compact models can exhibit behaviors traditionally associated with frontier\-scale systems, enabling more accessible and resource\-efficient deployment\. Looking forward, we envision MASA\-Rewriter as a lightweight plug\-and\-play middleware that automatically rewrites existing skill libraries for new backbones, requiring no environment rollouts, retraining, or manual prompt engineering\. This positions skill alignment as infrastructure rather than a per\-deployment engineering effort\.

## Limitations

Our empirical evidence is currently restricted to the Qwen3 family \(4B/8B/14B/32B\)\. Extending the skill evolution and rewriter to more model families, both open\-weight \(e\.g\., Llama Grattafiori et al\. \([2024](https://arxiv.org/html/2605.30723#bib.bib57)\), Mistral Liu et al\. \([2026](https://arxiv.org/html/2605.30723#bib.bib58)\)\) and proprietary \(e\.g\., GPT\-o3 OpenAI \([2025](https://arxiv.org/html/2605.30723#bib.bib61)\), Claude Anthropic \([2024](https://arxiv.org/html/2605.30723#bib.bib59)\)\), and to more diverse environments would further strengthen the generality of the skill rewriter, though it requires substantially more compute\. In particular, applying the evolution pipeline to closed\-source models demands hundreds of environment rollouts through paid APIs, making the per\-backbone search cost significantly higher than for locally hosted models; the skill rewriter offers a partial remedy by amortizing this cost once trajectories from a few backbones are available\.

Additionally, the skill rewriter is trained on skill\-evolution trajectories collected from ALFWorld, WebShop, and Search\-QA, and the evolution pipeline itself relies on environments that provide automatic success/failure signals \(e\.g\., task completion flags\) to judge whether a rewritten skill is effective\. Incorporating domains without such built\-in reward signals \(e\.g\., open\-ended web tasks or real\-world applications Sun et al\. \([2025](https://arxiv.org/html/2605.30723#bib.bib62)\)\) would require designing external evaluators or human annotations, but would enable the framework to serve an even broader range of agent applications\.

## Ethical Considerations

Data and Licensing\. MASA does not introduce new data collection from human subjects; all experiments use standard public benchmarks \(ALFWorld, WebShop, and open\-domain QA datasets\) and publicly released models accessed in accordance with their respective licenses\.

Safety of Agent Empowerment\. By improving the effectiveness of LLM agents through skill adaptation, MASA may also increase the capability of agents operating in interactive environments\. In safety\-critical or high\-risk settings, the framework should therefore be deployed with additional monitoring, policy constraints, and human oversight\.

Bias and Reliability of Evolved Skills\. The skill evolution pipeline may inherit biases or unsafe heuristics from the trajectories and feedback used during optimization\. Evolved skill libraries should therefore be inspected and validated before deployment\.

## References

- Anthropic \(2024\) The claude 3 model family: opus, sonnet, haiku\. Technical report, Anthropic\. [Link](https://www.anthropic.com/news/claude-3-family)\.
- L\. Chen, E\. Feng, Y\. Xia, and H\. Chen \(2026\) SkVM: revisiting language vm for skills across heterogenous llms and harnesses\. arXiv preprint arXiv:2604\.03088\.
- M\. Chen, Y\. Li, Y\. Yang, S\. Yu, B\. Lin, and X\. He \(2024\) Automanual: constructing instruction manuals by llm agents via interactive environmental learning\. Advances in Neural Information Processing Systems 37, pp\. 589–631\.
- Y\. Chen, Z\. Wen, G\. Fan, Z\. Chen, W\. Wu, D\. Liu, Z\. Li, B\. Liu, and Y\. Xiao \(2023\) Mapo: boosting large language model performance with model\-adaptive prompt optimization\. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp\. 3279–3304\.
- DeepSeek\-AI \(2026\) DeepSeek\-v4 technical report\. Technical report, DeepSeek\-AI\. [Link](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\.
- Q\. Guo, R\. Wang, J\. Guo, B\. Li, K\. Song, X\. Tan, G\. Liu, J\. Bian, and Y\. Yang \(2024\) Connecting large language models with evolutionary algorithms yields powerful prompt optimizers\. In International Conference on Learning Representations, Vol\. 2024, pp\. 34133–34156\.
- X\. Ho, A\. D\. Nguyen, S\. Sugawara, and A\. Aizawa \(2020\) Constructing a multi\-hop qa dataset for comprehensive evaluation of reasoning steps\. In Proceedings of the 28th International Conference on Computational Linguistics, pp\. 6609–6625\.
- V\. Hsiao, M\. Roberts, and L\. Smith \(2025\) Procedural knowledge improves agentic llm workflows\. arXiv preprint arXiv:2511\.07568\.
- Y\. Jiang, D\. Li, H\. Deng, B\. Ma, X\. Wang, Q\. Wang, and G\. Yu \(2026\) SoK: agentic skills–beyond tool use in llm agents\. arXiv preprint arXiv:2602\.20867\.
- M\. Joshi, E\. Choi, D\. S\. Weld, and L\. Zettlemoyer \(2017\) Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension\. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 1601–1611\.
- A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière, L\. Rouillard, et al\. \(2025\) Gemma 3 technical report\. arXiv preprint arXiv:2503\.19786\.
- L\. Kocsis and C\. Szepesvári \(2006\) Bandit based monte\-carlo planning\. In European conference on machine learning, pp\. 282–293\.
- T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee, et al\. \(2019\) Natural questions: a benchmark for question answering research\. Transactions of the Association for Computational Linguistics 7, pp\. 453–466\.
- A\. H\. Liu, K\. Khandelwal, S\. Subramanian, V\. Jouault, A\. Rastogi, A\. Sadé, A\. Jeffares, A\. Jiang, A\. Cahill, A\. Gavaudan, et al\. \(2026\) Ministral 3\. arXiv preprint arXiv:2601\.08584\.
- Z\. Lu, Z\. Yao, J\. Wu, C\. Han, Q\. Gu, X\. Cai, W\. Lu, J\. Xiao, Y\. Zhuang, and Y\. Shen \(2026\) Skill0: in\-context agentic reinforcement learning for skill internalization\. arXiv preprint arXiv:2604\.02268\.
- Z\. Ma, S\. Yang, Y\. Ji, X\. Wang, Y\. Wang, Y\. Hu, T\. Huang, and X\. Chu \(2026\) Skillclaw: let skills evolve collectively with agentic evolver\. arXiv preprint arXiv:2604\.08377\.
- A\. Mallen, A\. Asai, V\. Zhong, R\. Das, D\. Khashabi, and H\. Hajishirzi \(2023\) When not to trust language models: investigating effectiveness of parametric and non\-parametric memories\. In Proceedings of the 61st annual meeting of the association for computational linguistics \(volume 1: Long papers\), pp\. 9802–9822\.
- OpenAI \(2025\) Introducing openai o3 and o4\-mini\. Technical report, OpenAI\. [Link](https://openai.com/index/introducing-o3-and-o4-mini/)\.
- S\. Ouyang, J\. Yan, Y\. Chen, R\. Han, Z\. Wang, B\. D\. Mishra, R\. Meng, C\. Li, Y\. Jiao, K\. Zha, et al\. \(2026\) SkillOS: learning skill curation for self\-evolving agents\. arXiv preprint arXiv:2605\.06614\.
- O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis \(2023\) Measuring and narrowing the compositionality gap in language models\. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp\. 5687–5711\.
- S\. J\. Russell \(2010\) Artificial intelligence a modern approach\. Pearson Education, Inc\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\) Toolformer: language models can teach themselves to use tools\. Advances in neural information processing systems 36, pp\. 68539–68551\.
- M\. Sclar, Y\. Choi, Y\. Tsvetkov, and A\. Suhr \(2024\) Quantifying language models’ sensitivity to spurious features in prompt design or: how i learned to start worrying about prompt formatting\. In International Conference on Learning Representations, Vol\. 2024, pp\. 25055–25083\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. Advances in neural information processing systems 36, pp\. 8634–8652\.
- M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht \(2020\) Alfworld: aligning text and embodied environments for interactive learning\. arXiv preprint arXiv:2010\.03768\.
- A\. Singh, J\. D\. Co\-Reyes, R\. Agarwal, A\. Anand, P\. Patil, X\. Garcia, P\. J\. Liu, J\. Harrison, J\. Lee, K\. Xu, et al\. \(2023\) Beyond human data: scaling self\-training for problem\-solving with language models\. arXiv preprint arXiv:2312\.06585\.
- Q\. Sun, K\. Cheng, Z\. Ding, C\. Jin, Y\. Wang, F\. Xu, Z\. Wu, C\. Jia, L\. Chen, Z\. Liu, et al\. \(2025\) Os\-genesis: automating gui agent trajectory construction via reverse task synthesis\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 5555–5579\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\) MuSiQue: multihop questions via single\-hop question composition\. Transactions of the Association for Computational Linguistics 10, pp\. 539–554\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2023\) Voyager: an open\-ended embodied agent with large language models\. arXiv preprint arXiv:2305\.16291\.
- J\. Wang, Q\. Yan, Y\. Wang, Y\. Tian, S\. S\. Mishra, Z\. Xu, M\. Gandhi, P\. Xu, and L\. L\. Cheong \(2025a\) Reinforcement learning for self\-improving agent with skill library\. arXiv preprint arXiv:2512\.17102\.
- L\. Wang, C\. Ma, X\. Feng, Z\. Zhang, H\. Yang, J\. Zhang, Z\. Chen, J\. Tang, X\. Chen, Y\. Lin, et al\. \(2024a\) A survey on large language model based autonomous agents\. Frontiers of Computer Science 18\(6\), pp\. 186345\.
- Y\. Wang, Q\. Liu, Z\. Wang, Z\. Li, W\. Wei, Y\. Liu, and Y\. Bao \(2025b\) PromptBridge: cross\-model prompt transfer for large language models\. arXiv preprint arXiv:2512\.01420\.
- Z\. Wang, S\. Cai, A\. Liu, Y\. Jin, J\. Hou, B\. Zhang, H\. Lin, Z\. He, Z\. Zheng, Y\. Yang, et al\. \(2024b\) Jarvis\-1: open\-world multi\-task agents with memory\-augmented multimodal language models\. IEEE Transactions on Pattern Analysis and Machine Intelligence 47\(3\), pp\. 1894–1907\.
- Z\. Z\. Wang, J\. Mao, D\. Fried, and G\. Neubig \(2024c\) Agent workflow memory\. arXiv preprint arXiv:2409\.07429\.
- P\. Xia, J\. Chen, H\. Wang, J\. Liu, K\. Zeng, Y\. Wang, S\. Han, Y\. Zhou, X\. Zhao, H\. Chen, et al\. \(2026\) Skillrl: evolving agents via recursive skill\-augmented reinforcement learning\. arXiv preprint arXiv:2602\.08234\.
- Y\. Xu, D\. Lu, Z\. Shen, J\. Wang, Z\. Wang, Y\. Mao, C\. Xiong, and T\. Yu \(2025\) Agenttrek: agent trajectory synthesis via guiding replay with web tutorials\. In International Conference on Learning Representations, Vol\. 2025, pp\. 79822–79843\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\.
- C\. Yang, X\. Wang, Y\. Lu, H\. Liu, Q\. V\. Le, D\. Zhou, and X\. Chen \(2024\) Large language models as optimizers\. In International Conference on Learning Representations, Vol\. 2024, pp\. 12028–12068\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\) HotpotQA: a dataset for diverse, explainable multi\-hop question answering\. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp\. 2369–2380\.
- S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2022a\) Webshop: towards scalable real\-world web interaction with grounded language agents\. Advances in Neural Information Processing Systems 35, pp\. 20744–20757\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2022b\) React: synergizing reasoning and acting in language models\. arXiv preprint arXiv:2210\.03629\.
- Z\. Yao, Y\. Xu, H\. Xu, Y\. Liao, and Z\. Xie \(2025\) Efficient deployment of large language models on resource\-constrained devices\. arXiv preprint arXiv:2501\.02438\.
- A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024\) Expel: llm agents are experiential learners\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 38, pp\. 19632–19642\.
- Y\. Zheng, Y\. Chen, B\. Qian, X\. Shi, Y\. Shu, and J\. Chen \(2025\) A review on edge large language models: design, execution, and applications\. ACM Computing Surveys 57\(8\), pp\. 1–35\.
- X\. Zhu, Y\. Chen, H\. Tian, C\. Tao, W\. Su, C\. Yang, G\. Huang, B\. Li, L\. Lu, X\. Wang, et al\. \(2023\) Ghost in the minecraft: generally capable agents for open\-world environments via large language models with text\-based knowledge and memory\. arXiv preprint arXiv:2305\.17144\.

## Appendix ARelated Work

#### LLM agents and skill libraries\.

Equipping LLM agents with reusable procedural knowledge is a scalable approach to improve agent performance without modifying model weights\. Early efforts such as ReAct Yao et al\. \([2022b](https://arxiv.org/html/2605.30723#bib.bib10)\) and Reflexion Shinn et al\. \([2023](https://arxiv.org/html/2605.30723#bib.bib18)\) leverage textual feedback as in\-context skill; Voyager Wang et al\. \([2023](https://arxiv.org/html/2605.30723#bib.bib12)\) maintains a growing skill library for Minecraft; JARVIS\-1 Wang et al\. \([2024b](https://arxiv.org/html/2605.30723#bib.bib13)\) and Ghost\-in\-the\-Minecraft Zhu et al\. \([2023](https://arxiv.org/html/2605.30723#bib.bib14)\) cache successful behaviors for replay at inference; AgentTrek Xu et al\. \([2025](https://arxiv.org/html/2605.30723#bib.bib17)\) bootstraps web agents with synthesized trajectories; AutoManual Chen et al\. \([2024](https://arxiv.org/html/2605.30723#bib.bib44)\) induces an *operating manual* from interaction traces; and ExpeL Zhao et al\. \([2024](https://arxiv.org/html/2605.30723#bib.bib45)\) distills cross\-trial experiences into a reusable insights library\. More recent systems further elevate skills into first\-class agent components: SkillRL Xia et al\. \([2026](https://arxiv.org/html/2605.30723#bib.bib15)\) distills trajectories into a hierarchical SkillBank and recursively evolves skills with the agent policy; and SkillOS Ouyang et al\. \([2026](https://arxiv.org/html/2605.30723#bib.bib63)\) learns a long\-horizon curator that inserts, updates, and deletes skills in an external SkillRepo\. Notably, SkVM Chen et al\. \([2026](https://arxiv.org/html/2605.30723#bib.bib64)\) also identifies the model\-skill mismatch problem \(reporting that 87% of tasks have at least one LLM that gains no benefit from the same skill\) and addresses it by compiling skills into optimized runtime formats \(e\.g\., code solidification, parallelization\) to reduce latency and token cost\.
MASA shares the same motivation but pursues a complementary direction: rather than compiling skills for execution efficiency, we *rewrite the natural\-language expression* of skills to match each backbone’s comprehension and reasoning style, directly improving task success rate\.

#### Model\-aware adaptation and prompt optimization\.

LLM behavior is highly sensitive to instruction phrasing even under semantically equivalent prompts Sclar et al\. \([2024](https://arxiv.org/html/2605.30723#bib.bib48)\), motivating methods that tailor prompts to specific backbones\. Teacher\-driven search methods such as OPRO Yang et al\. \([2024](https://arxiv.org/html/2605.30723#bib.bib20)\) and EvoPrompt Guo et al\. \([2024](https://arxiv.org/html/2605.30723#bib.bib22)\) iteratively refine a single instruction for a given task, yet treat the target model as fixed context: the same output applies regardless of backbone\. MAPO Chen et al\. \([2023](https://arxiv.org/html/2605.30723#bib.bib23)\) and PromptBridge Wang et al\. \([2025b](https://arxiv.org/html/2605.30723#bib.bib24)\) further account for model identity by optimizing or transferring individual task instructions across backbones, yet they operate on single monolithic prompts in non\-agent settings rather than on retrievable multi\-entry skill libraries used at agent decision time\. MASA differs in two key respects: \(i\) the optimization target is a dynamically retrieved *skill library* rather than a monolithic prompt, and \(ii\) the evolutionary search is jointly steered by a structured model capability profile and environment reward signals, explicitly conditioning skill expression on target\-model characteristics\. Furthermore, we train a lightweight skill rewriter that amortizes the expensive search process into a single forward pass \(conceptually related to distilling costly inference\-time computation into efficient learned models Singh et al\. \([2023](https://arxiv.org/html/2605.30723#bib.bib32)\)\), enabling skill adaptation to new backbones without repeated search\.
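For readers unfamiliar with the UCB\-driven tree search mentioned in the abstract, the following is a minimal sketch of the textbook UCB1 selection rule, which shows how reward signals can steer which candidate is expanded next\. It is a generic illustration and not MASA's scoring function: the exploration constant `c`, the dictionary fields, and the example numbers are all assumptions made here\.

```python
import math


def ucb1_score(total_reward: float, visits: int, parent_visits: int,
               c: float = math.sqrt(2)) -> float:
    """Textbook UCB1: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited candidates are always tried first
    mean = total_reward / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean + bonus


def select_child(children: list, c: float = math.sqrt(2)) -> dict:
    """Pick the candidate with the highest UCB1 score.

    Each child is a dict with hypothetical fields 'name', 'reward', 'visits'.
    """
    parent_visits = sum(ch["visits"] for ch in children)
    return max(
        children,
        key=lambda ch: ucb1_score(ch["reward"], ch["visits"], parent_visits, c),
    )


# Hypothetical skill variants: A is well explored, B is barely explored.
candidates = [
    {"name": "variant_A", "reward": 9.0, "visits": 10},
    {"name": "variant_B", "reward": 1.0, "visits": 2},
]
print(select_child(candidates)["name"])         # exploration bonus favours B
print(select_child(candidates, c=0.0)["name"])  # pure exploitation favours A
```

Setting `c` to zero reduces the rule to greedy selection on mean reward, which is the behaviour of plain hill climbing over the current candidates\.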

## Appendix BAblations

#### Two\-stage evolution pipeline \(Table [3\(a\)](https://arxiv.org/html/2605.30723#A2.T3.st1)\)\.

We ablate the two\-stage search structure by replacing each stage’s evolved skills with one\-shot teacher \(DeepSeek\-V4\) rewrites, isolating the contribution of each search stage\. *w/o Task\-specific* retains MASA\-evolved general skills but substitutes teacher\-written task\-specific skills \(i\.e\., Stage 1 only\), while *w/o General* retains MASA\-evolved task\-specific skills but uses teacher\-written general skills \(i\.e\., Stage 2 only\)\. Both stages contribute to the full pipeline, but their relative importance is environment\- and scale\-dependent\. On ALFWorld, removing task\-specific search causes the largest drops for Qwen3\-8B \(−25\.0\) and Qwen3\-32B \(−15\.7\), indicating that per\-task\-type procedural guidance is critical for these backbones\. Conversely, removing general search most severely affects Qwen3\-14B \(−16\.4\), suggesting that high\-level behavioral priors are essential when the model has sufficient capacity to follow them but still benefits from strategic framing\. On WebShop, removing general skills is catastrophic for 8B/14B/32B \(SR drops to roughly 10 or below\), while removing task\-specific skills has a comparatively modest effect\. This asymmetry reflects the nature of each environment: WebShop demands consistent high\-level decision strategies that general skills encode, whereas ALFWorld requires fine\-grained procedural sequences that task\-specific skills address\.

#### Rewriter model card \(Table [3\(b\)](https://arxiv.org/html/2605.30723#A2.T3.st2)\)\.

We ablate the model card input to the MASA\-Rewriter by comparing performance with and without the target model’s capability card, using both training data variants: *Cross\-env* \(trained on Search \+ WebShop\) and *Cross\-task* \(trained on Search \+ WebShop \+ ALFWorld Pick/Look/Pick2\)\. All results are average SR on three held\-out ALFWorld tasks \(Clean/Heat/Cool\)\. Removing the model card degrades performance on every backbone in both variants\. For Cross\-task, the gap is largest on Qwen3\-14B \(48\.5 to 20\.6\) and Qwen3\-4B \(32\.5 to 8\.8\), indicating that the card is a critical conditioning signal, including for the smallest backbone\. For Cross\-env, model card removal also causes substantial drops, ranging from 4\.4 points on 14B to 17\.7 points on 4B\. Overall, these results suggest that the model card provides useful backbone\-specific conditioning signals that help the rewriter generate more appropriate skill adaptations\.

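| Environment | Variant | 4B | 8B | 14B | 32B |
|---|---|---|---|---|---|
| ALFWorld | Full pipeline | 31.4 | 57.9 | 64.3 | 65.7 |
| ALFWorld | w/o Task-specific | 25.0 | 32.9 | 63.6 | 50.0 |
| ALFWorld | w/o General | 25.0 | 50.0 | 47.9 | 64.3 |
| WebShop | Full pipeline | 26.4 | 28.6 | 29.2 | 34.6 |
| WebShop | w/o Task-specific | 22.4 | 25.6 | 24.2 | 31.8 |
| WebShop | w/o General | 23.4 | 7.2 | 10.2 | 9.6 |

(a) The search-based evolution pipeline.

| Training data | Variant | 4B | 8B | 14B | 32B |
|---|---|---|---|---|---|
| Cross-env | w/ Model Card | 32.4 | 32.4 | 38.2 | 51.5 |
| Cross-env | w/o Model Card | 14.7 | 23.5 | 33.8 | 39.7 |
| Cross-task | w/ Model Card | 32.5 | 38.2 | 48.5 | 55.9 |
| Cross-task | w/o Model Card | 8.8 | 30.9 | 20.6 | 42.5 |

(b) Model card conditioning in the MASA-Rewriter.

Table 3: Ablation studies of MASA.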
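The drops quoted in the search-structure ablation can be recomputed directly from Table 3(a). The snippet below is only a bookkeeping sketch: the dictionary layout and the function name `ablation_drops` are illustrative choices, not part of MASA, and the numbers are copied from the table.

```python
# Success rates (%) copied from Table 3(a); column order is 4B, 8B, 14B, 32B.
SIZES = ["4B", "8B", "14B", "32B"]
TABLE_3A = {
    "ALFWorld": {
        "full": [31.4, 57.9, 64.3, 65.7],
        "w/o Task-specific": [25.0, 32.9, 63.6, 50.0],
        "w/o General": [25.0, 50.0, 47.9, 64.3],
    },
    "WebShop": {
        "full": [26.4, 28.6, 29.2, 34.6],
        "w/o Task-specific": [22.4, 25.6, 24.2, 31.8],
        "w/o General": [23.4, 7.2, 10.2, 9.6],
    },
}


def ablation_drops(env):
    """Return {variant: {size: ablated SR - full SR}} for one environment."""
    rows = TABLE_3A[env]
    full = rows["full"]
    return {
        variant: {
            size: round(value - base, 1)
            for size, value, base in zip(SIZES, values, full)
        }
        for variant, values in rows.items()
        if variant != "full"
    }


alfworld = ablation_drops("ALFWorld")
webshop = ablation_drops("WebShop")
print(alfworld["w/o Task-specific"]["8B"])  # drop for Qwen3-8B on ALFWorld
print(webshop["w/o General"])               # per-size drops on WebShop
```

Negative values are drops relative to the full pipeline, so the most negative entry per row identifies the backbone that depends most on the removed stage.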

## Appendix C Preliminary Study: Supplementary Details

### C.1 Skill Variant Comparison

Table [4](https://arxiv.org/html/2605.30723#A3.T4) shows concrete examples of the three non-empty ALFWorld skill variants used in the preliminary study. All variants keep the same skill IDs and task coverage; what changes is how much procedural text is exposed to the agent. We use bold text to highlight trigger conditions, executable steps, and failure-prevention cues added by the more detailed variants.

(a) General Skill: *Systematic Exploration*

| Granularity | Skill Text |
|---|---|
| Concise | Principle: Search all surfaces and containers once before revisiting. |
| Moderate | Principle: Search every plausible surface or container exactly once before revisiting; prioritize unopened or unseen locations to cover the whole room methodically.<br>When to apply: Anytime the goal object count is not yet met and unexplored locations remain. |
| Detailed | Principle: When searching for an object, follow these steps exactly:<br>Step 1: Make a mental list of ALL possible locations in the room (countertop 1, countertop 2, shelf 1, drawer 1, cabinet 1, fridge 1, etc.).<br>Step 2: Visit each location one by one using 'go to [location] [number]' (e.g., 'go to countertop 1').<br>Step 3: For closed containers (drawer, cabinet, fridge, safe, microwave), always use 'open [container] [number]' to check inside.<br>Step 4: Read the observation carefully: look for the exact name of the target object.<br>Step 5: Mark each location as 'checked' mentally and do NOT go back to it.<br>Step 6: Only after checking ALL locations in the room, consider that the object may not be present.<br>EXAMPLE: Looking for a mug → 'go to countertop 1' → check → 'go to countertop 2' → check → 'go to shelf 1' → check → 'open cabinet 1' → check inside → continue until found.<br>When to apply: At the VERY START of every task that involves finding or locating any object. This is always your first action; never skip the systematic search. |

(b) Task-Specific Skill: *Open Then Heat*

| Granularity | Skill Text |
|---|---|
| Concise | Principle: Open microwave, put object in, heat it. |
| Moderate | Principle: Upon reaching the microwave with the target in hand, always open the door, place the object inside, and execute the heat action before leaving.<br>When to apply: Immediately after navigating to the microwave with the target object held. |
| Detailed | Principle: The microwave heating sequence must be executed in this EXACT order:<br>(1) 'go to microwave 1': navigate to the microwave.<br>(2) 'open microwave 1': the door must be open to put things in.<br>(3) 'put [object] in/on microwave 1': place the object inside.<br>(4) 'heat [object] with microwave 1': execute the heating action.<br>(5) 'open microwave 1': open the door again to retrieve (if needed).<br>(6) 'take [object] from microwave 1': take the now-heated object.<br>COMMON MISTAKE: Trying to 'heat' without first putting the object in the microwave → fails.<br>ANOTHER MISTAKE: Forgetting to open the microwave before putting the object in → fails.<br>When to apply: When you are holding the target object and ready to heat it. Execute this exact 6-step sequence. |

Table 4: Examples from the ALFWorld skill-library variants used in the preliminary study. We show the same cross-task skill and task-specific skill under three granularity levels: Concise, Moderate, and Detailed. The empty-bank control (No Skill) is omitted for brevity.
### C.2 Per-Task Breakdown: Qwen3

Table [5](https://arxiv.org/html/2605.30723#A3.T5) expands the overall numbers visualized in Figure [1](https://arxiv.org/html/2605.30723#S1.F1) into per-task success rates for each (model, skill) cell. The breakdown is computed on the same ALFWorld validation set. Note the large within-condition swings across task types (e.g., Qwen3-14B Concise: 74.2 on Pick vs. 13.7 on Cool; Qwen3-4B Detailed: 1.6 on Pick vs. 46.7 on Look), which are substantially larger than cross-condition differences and motivate the task-specific tree-search stage of the evolution pipeline.

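| Model | Skill | Pick | Clean | Heat | Cool | Pick2 | Look | Overall |
|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | No Skill | 20.0 | 18.5 | 18.8 | 16.0 | 12.5 | 15.4 | 17.1 |
| Qwen3-4B | Concise | 3.2 | 17.9 | 37.5 | 9.1 | 10.7 | 40.0 | 16.4 |
| Qwen3-4B | Moderate | 20.0 | 29.6 | 12.5 | 20.0 | 8.3 | 30.8 | 20.0 |
| Qwen3-4B | Detailed | 1.6 | 16.1 | 9.3 | 6.8 | 10.7 | 46.7 | 12.8 |
| Qwen3-8B | No Skill | 54.3 | 29.6 | 6.2 | 24.0 | 20.8 | 46.2 | 32.1 |
| Qwen3-8B | Concise | 32.2 | 33.9 | 25.0 | 11.3 | 25.0 | 40.0 | 27.9 |
| Qwen3-8B | Moderate | 17.1 | 40.7 | 31.2 | 20.0 | 12.5 | 38.5 | 25.0 |
| Qwen3-8B | Detailed | 6.5 | 17.9 | 21.9 | 13.6 | 10.7 | 50.0 | 17.1 |
| Qwen3-14B | No Skill | 65.7 | 25.9 | 43.8 | 16.0 | 29.2 | 38.5 | 37.9 |
| Qwen3-14B | Concise | 74.2 | 26.8 | 18.8 | 13.7 | 30.3 | 43.4 | 36.8 |
| Qwen3-14B | Moderate | 68.6 | 44.4 | 25.0 | 20.0 | 33.3 | 46.2 | 42.1 |
| Qwen3-14B | Detailed | 64.5 | 46.4 | 18.8 | 34.1 | 46.4 | 66.7 | 47.5 |
| Qwen3-32B | No Skill | 48.6 | 44.4 | 25.0 | 32.0 | 16.7 | 46.2 | 36.4 |
| Qwen3-32B | Concise | 56.5 | 41.0 | 37.5 | 18.2 | 33.9 | 56.6 | 40.7 |
| Qwen3-32B | Moderate | 48.6 | 40.7 | 50.0 | 44.0 | 20.8 | 46.2 | 41.4 |
| Qwen3-32B | Detailed | 54.8 | 46.4 | 31.2 | 36.4 | 28.6 | 60.0 | 42.9 |

Table 5: Per-task ALFWorld success rate (%) for the four Qwen3 backbones under each skill granularity condition.
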
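To make the "within-condition swing" comparison concrete, the following sketch contrasts the per-task spread of the two rows cited above with the spread of overall SR across skill conditions for the same backbones. The values are copied from Table 5; the helper name `spread` and the max-minus-min definition of a swing are assumptions made for illustration.

```python
# Rows copied from Table 5; columns are Pick, Clean, Heat, Cool, Pick2, Look.
PER_TASK = {
    ("Qwen3-14B", "Concise"): [74.2, 26.8, 18.8, 13.7, 30.3, 43.4],
    ("Qwen3-4B", "Detailed"): [1.6, 16.1, 9.3, 6.8, 10.7, 46.7],
}

# Overall SR per skill condition (No Skill, Concise, Moderate, Detailed).
OVERALL = {
    "Qwen3-14B": [37.9, 36.8, 42.1, 47.5],
    "Qwen3-4B": [17.1, 16.4, 20.0, 12.8],
}


def spread(values):
    """Max-minus-min range, rounded to one decimal like the table."""
    return round(max(values) - min(values), 1)


within = {key: spread(row) for key, row in PER_TASK.items()}
across = {model: spread(row) for model, row in OVERALL.items()}

for (model, skill), value in within.items():
    print(f"{model} {skill}: within-condition {value}, "
          f"cross-condition {across[model]}")
```

Under this definition, the task-to-task range inside a single condition is several times the range of overall SR across conditions for both backbones.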
### C.3 Supplementary Validation: Gemma3

To verify that the scale-dependent granularity pattern is not unique to Qwen3, we repeat the fixed-granularity sweep on Gemma3 backbones (4B/12B/27B) Kamath et al. ([2025](https://arxiv.org/html/2605.30723#bib.bib52)). The sweep supports the motivating conclusion: the best skill form is model-dependent rather than universally transferable. Gemma3-4B and Gemma3-12B are strongest with Concise skills, while Gemma3-27B reaches its best success rate with Detailed skills. Figure [4](https://arxiv.org/html/2605.30723#A3.F4) shows the overall results and Table [6](https://arxiv.org/html/2605.30723#A3.T6) gives the per-task breakdown.

Comparing models of the same parameter count across families further isolates the effect of architecture and training from that of scale alone. Gemma3-4B achieves its best performance with Concise skills, whereas Qwen3-4B peaks under Moderate skills (Figure [1](https://arxiv.org/html/2605.30723#S1.F1) vs. Figure [4](https://arxiv.org/html/2605.30723#A3.F4)). Despite identical parameter budgets, the two models respond to skill granularity in qualitatively different ways, indicating that the optimal skill form is determined by a model’s overall characteristics (architecture, training data, alignment procedure) rather than parameter count alone. This observation reinforces the necessity of conditioning skill adaptation on a rich model profile rather than relying on scale as a proxy.

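![Refer to caption](https://arxiv.org/html/2605.30723v1/x4.png)

Figure 4: Supplementary validation on the Gemma3 family.

| Model | Skill | Pick | Clean | Heat | Cool | Pick2 | Look | Overall |
|---|---|---|---|---|---|---|---|---|
| Gemma3-4B | No Skill | 22.6 | 0.0 | 0.0 | 0.0 | 7.1 | 40.0 | 10.7 |
| Gemma3-4B | Concise | 22.6 | 7.1 | 6.2 | 0.0 | 7.1 | 33.3 | 12.1 |
| Gemma3-4B | Moderate | 3.2 | 7.1 | 0.0 | 4.5 | 7.1 | 40.0 | 8.6 |
| Gemma3-4B | Detailed | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Gemma3-12B | No Skill | 16.1 | 3.6 | 0.0 | 0.0 | 3.6 | 26.7 | 7.9 |
| Gemma3-12B | Concise | 32.3 | 10.7 | 12.5 | 0.0 | 7.1 | 33.3 | 15.7 |
| Gemma3-12B | Moderate | 9.7 | 10.7 | 0.0 | 0.0 | 7.1 | 33.3 | 9.3 |
| Gemma3-12B | Detailed | 38.7 | 21.4 | 6.2 | 0.0 | 3.6 | 6.7 | 15.0 |
| Gemma3-27B | No Skill | 22.6 | 14.3 | 18.8 | 13.6 | 25.0 | 40.0 | 21.4 |
| Gemma3-27B | Concise | 61.3 | 21.4 | 31.2 | 9.1 | 42.9 | 46.7 | 36.4 |
| Gemma3-27B | Moderate | 51.6 | 32.1 | 18.8 | 4.5 | 42.9 | 53.3 | 35.0 |
| Gemma3-27B | Detailed | 41.9 | 60.7 | 31.2 | 31.8 | 35.7 | 66.7 | 44.3 |

Table 6: Per-task ALFWorld success rate (%) for the three Gemma3 backbones under each skill granularity condition.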
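The model-dependence claim can be checked mechanically by taking the best-scoring condition per backbone from the Overall columns of Tables 5 and 6. The sketch below does only that; the data layout and the name `best_condition` are illustrative, and the numbers are copied from the two tables.

```python
# Overall ALFWorld SR (%) per skill condition, copied from Tables 5 and 6.
CONDITIONS = ["No Skill", "Concise", "Moderate", "Detailed"]
OVERALL_SR = {
    "Qwen3-4B": [17.1, 16.4, 20.0, 12.8],
    "Qwen3-8B": [32.1, 27.9, 25.0, 17.1],
    "Qwen3-14B": [37.9, 36.8, 42.1, 47.5],
    "Qwen3-32B": [36.4, 40.7, 41.4, 42.9],
    "Gemma3-4B": [10.7, 12.1, 8.6, 0.0],
    "Gemma3-12B": [7.9, 15.7, 9.3, 15.0],
    "Gemma3-27B": [21.4, 36.4, 35.0, 44.3],
}


def best_condition(model):
    """Return the skill condition with the highest overall SR for a model."""
    scores = OVERALL_SR[model]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return CONDITIONS[best_index]


best = {model: best_condition(model) for model in OVERALL_SR}
for model, condition in best.items():
    print(f"{model}: {condition}")
```

The two 4B models land on different conditions, and Qwen3-8B is the one backbone in these tables whose best overall condition is the empty skill bank.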

## Appendix D Model Card Construction

Each model card is constructed from a fixed rubric combining public documentation and automated analysis:

1. *Architecture metadata.* Model family, variant name, parameter count, architecture type, layer/attention configuration, context window, and vocabulary size, sourced directly from the published model card or config files.
2. *Training provenance.* Whether the checkpoint is base or instruction-tuned, the alignment pipeline (e.g., SFT + DPO + GRPO), training data scale, and multilingual support, sourced from official documentation.
3. *Capability profile.* Strengths are extracted from the model’s official release notes (e.g., “strong at math and code generation”). Weaknesses are generated by the teacher LLM summarizing behavioral patterns observed during a small set of preliminary rollouts (Section [2](https://arxiv.org/html/2605.30723#S2)), produced automatically without human annotation.

Note that the card does not include any downstream evaluation results (e.g., ALFWorld success rates) or oracle style labels (e.g., *prefers\_concise*). The preliminary rollouts used for weakness summarization are disjoint from the evaluation set.

Below is the card for Qwen3\-4B; cards for the remaining backbones follow the same template\.

\# Model Card: Qwen3-4B

\# Source: https://huggingface.co/Qwen/Qwen3-4B

\# === Architecture Metadata ===

family: "Qwen3"

variant: "4B"

parameter\_count: "4B"

architecture: "dense-transformer"

num\_layers: 36

hidden\_size: 2560

num\_attention\_heads: 32

num\_kv\_heads: 8

context\_window: 32768

vocab\_size: 151936

\# === Training Provenance ===

base\_or\_instruct: "instruct"

alignment\_method: "SFT+DPO+GRPO"

training\_data\_size: "36T tokens"

multilingual: true

\# === Official Capabilities (from release notes) ===

strengths: "math, code generation, instruction following, multilingual, tool use, thinking mode support"

\# === Observed Weaknesses (teacher-summarized) ===

weaknesses: "limited reasoning depth due to small parameter count, may struggle with complex multi-step planning"
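One way to picture how such a card could be fed to a rewriter is as a small typed record that is rendered to text and prepended to the skill being adapted. The sketch below is hypothetical: the `ModelCard` class, the subset of fields kept, and the `### Model card` / `### Skill` prompt layout are assumptions for illustration, not the format used by MASA. Only the field values are taken from the Qwen3-4B card above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ModelCard:
    """Hypothetical container for a subset of the card fields shown above."""
    family: str
    variant: str
    architecture: str
    context_window: int
    base_or_instruct: str
    alignment_method: str
    strengths: str
    weaknesses: str


def render_card(card):
    """Render the card as 'key: value' lines, one per field."""
    return "\n".join(f"{f.name}: {getattr(card, f.name)}" for f in fields(card))


def build_rewriter_input(card, skill_text):
    """Assumed prompt layout: card first, then the skill to adapt."""
    return f"### Model card\n{render_card(card)}\n\n### Skill\n{skill_text}"


qwen3_4b = ModelCard(
    family="Qwen3",
    variant="4B",
    architecture="dense-transformer",
    context_window=32768,
    base_or_instruct="instruct",
    alignment_method="SFT+DPO+GRPO",
    strengths="math, code generation, instruction following, "
              "multilingual, tool use, thinking mode support",
    weaknesses="limited reasoning depth due to small parameter count, "
               "may struggle with complex multi-step planning",
)

prompt = build_rewriter_input(
    qwen3_4b, "Search all surfaces and containers once before revisiting."
)
print(prompt)
```

Keeping the record free of evaluation results or style labels mirrors the constraint stated above: the rewriter sees only documentation-level metadata and teacher-summarized weaknesses.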

## Appendix E Skill Evolution Pipeline Details

This appendix provides the full algorithmic procedures (Algorithms [1](https://arxiv.org/html/2605.30723#alg1) and [2](https://arxiv.org/html/2605.30723#alg2)) and hyperparameters for the two-stage skill evolution pipeline described in Section [3.2](https://arxiv.org/html/2605.30723#S3.SS2). All evolution experiments use the original training split of each environment for exploration; the evaluation results reported in the main paper are obtained on the held-out test split.

**Algorithm 1** Stage 1: General Skill Search (Hill Climbing)

**Input:** Target model $F$, model card $\mathcal{M}_{F}$, teacher $T$, eval set $\mathcal{D}$ (sampled from training episodes), initial general skills $\mathcal{S}^{G_{0}}_{F}$, max iterations $I$, patience $p$, history size $K$

**Output:** Optimized general skills $\mathcal{S}^{G\star}_{F}$

1. $\mathcal{S}^{G\star}_{F} \leftarrow \mathcal{S}^{G_{0}}_{F}$ {current best general skill set}
2. $R^{\star} \leftarrow \mathrm{Eval}(F, \mathcal{S}^{G\star}_{F}, \mathcal{D})$ {its average reward across all task types}
3. $\mathcal{H} \leftarrow \{(\mathcal{S}^{G_{0}}_{F}, R^{\star})\}$ {search history: (skill set, reward) pairs}
4. **for** $i = 1$ **to** $I$ **do**
5. // Rollout & Analysis
6. $\mathcal{F}_{i} \leftarrow \mathrm{CollectFailures}(F, \mathcal{S}^{G\star}_{F}, \mathcal{D})$
7. $\mathrm{attr}_{i} \leftarrow T.\mathrm{Analyze}(\mathcal{F}_{i})$ {structured failure attribution}
8. // Rewrite
9. $\mathcal{S}^{G_{i}}_{F} \leftarrow T.\mathrm{Rewrite}(\mathcal{S}^{G\star}_{F}, \mathrm{attr}_{i}, \mathrm{TopK}(\mathcal{H}, K), \mathcal{M}_{F})$
10. // Accept / Reject
11. $R_{i} \leftarrow \mathrm{Eval}(F, \mathcal{S}^{G_{i}}_{F}, \mathcal{D})$
12. $\mathcal{H} \leftarrow \mathcal{H} \cup \{(\mathcal{S}^{G_{i}}_{F}, R_{i})\}$
13. **if** $R_{i} > R^{\star}$ **then**
14. $\mathcal{S}^{G\star}_{F} \leftarrow \mathcal{S}^{G_{i}}_{F}$; $R^{\star} \leftarrow R_{i}$ {accept}
15. **end if**
16. **if** no improvement for $p$ consecutive iterations **then**
17. **break**
18. **end if**
19. **end for**
20. **return** $\mathcal{S}^{G\star}_{F}$

**Algorithm 2** Stage 2: Task-Specific Skill Search (Per-Type Tree Search)

**Input:** Target model $F$, model card $\mathcal{M}_{F}$, teacher $T$, fixed general skills $\mathcal{S}^{G\star}_{F}$, initial task-specific skills $\{\mathcal{S}^{T_{c_{0}}}_{F}\}_{c\in\mathcal{C}}$, iterations $J$

**Output:** Optimized task-specific skills $\{\mathcal{S}^{T_{c}\star}_{F}\}_{c\in\mathcal{C}}$

1. **for each** task type $c \in \mathcal{C}$ **in parallel do**
2. Initialize tree root with $\mathcal{S}^{T_{c_{0}}}_{F}$
3. **for** $j = 1$ **to** $J$ **do**
4. // Selection
5. $n \leftarrow \mathrm{UCB1Select}(\text{root})$ {select leaf via Eq. [7](https://arxiv.org/html/2605.30723#A5.E7)}
6. // Expansion
7. $\mathcal{F} \leftarrow \mathrm{CollectFailures}(F, \mathcal{S}^{G\star}_{F}, \mathcal{S}^{T_{c}}_{F,n}, c)$
8. $\mathrm{attr} \leftarrow T.\mathrm{Analyze}(\mathcal{F})$ {failure attribution}
9. $\mathcal{S}^{\prime T_{c}}_{F} \leftarrow T.\mathrm{Rewrite}(\mathcal{S}^{T_{c}}_{F,n}, \mathrm{attr}, \mathcal{M}_{F})$
10. Add $\mathcal{S}^{\prime T_{c}}_{F}$ as child of node $n$
11. // Evaluation
12. $R^{\prime} \leftarrow \mathrm{Eval}(F, \mathcal{S}^{G\star}_{F}, \mathcal{S}^{\prime T_{c}}_{F}, c)$
13. // Backpropagation
14. Update visit counts and value estimates from new node to root
15. **end for**
16. $\mathcal{S}^{T_{c}\star}_{F} \leftarrow$ skill set of the highest-value node
17. **end for**
18. **return** $\{\mathcal{S}^{T_{c}\star}_{F}\}_{c\in\mathcal{C}}$

### E.1 Stage 1: Hill Climbing

Maximum iterations $I = 10$; patience $p = 3$ (early stopping after 3 consecutive iterations without improvement); the top $K = 5$ highest-reward historical skill sets are provided to the teacher at each iteration. A candidate general skill set is accepted if and only if its average adjusted reward strictly exceeds the current best.
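The Stage 1 loop with these hyperparameters can be sketched as a generic hill climber. This is an illustrative reading of Algorithm 1, not the authors' code: `evaluate`, `collect_failures`, `analyze`, and `rewrite` are hypothetical callables standing in for environment rollouts and teacher-LLM calls, and the toy usage replaces skill sets with integers so the control flow can be run end to end.

```python
def hill_climb(initial, model_card, evaluate, collect_failures, analyze,
               rewrite, max_iters=10, patience=3, top_k=5):
    """Greedy search: keep a candidate only if it strictly beats the best."""
    best, best_reward = initial, evaluate(initial)
    history = [(initial, best_reward)]   # (skill set, reward) pairs
    stale = 0                            # iterations since last improvement

    for _ in range(max_iters):
        # Rollout and analysis on the current best skill set.
        attribution = analyze(collect_failures(best))

        # Rewrite, conditioned on the top-K history and the model card.
        top = sorted(history, key=lambda pair: pair[1], reverse=True)[:top_k]
        candidate = rewrite(best, attribution, top, model_card)

        # Accept / reject with a strict improvement test.
        reward = evaluate(candidate)
        history.append((candidate, reward))
        if reward > best_reward:
            best, best_reward, stale = candidate, reward, 0
        else:
            stale += 1
            if stale >= patience:
                break

    return best, best_reward, history


# Toy stand-ins (assumptions): a "skill set" is an integer, the reward
# peaks at 7, and the rewriter always proposes the next integer.
best, reward, history = hill_climb(
    initial=0,
    model_card={"family": "toy"},
    evaluate=lambda s: -(s - 7) ** 2,
    collect_failures=lambda s: [],
    analyze=lambda failures: "no failures",
    rewrite=lambda s, attr, top, card: s + 1,
)
print(best, reward, len(history))
```

In the toy run the search accepts seven consecutive improvements and then stops once three proposals in a row fail to beat the incumbent, which is the patience rule described above.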

### E.2 Stage 2: UCB-Driven Tree Search

At each iteration, the node $n$ maximizing the following UCB1 score is selected:

$$\mathrm{UCB1}(n) = \bar{R}(n) + C\sqrt{\frac{\ln N_{\mathrm{parent}}}{N_{n}}}, \tag{7}$$

where $\bar{R}(n)$ is the mean adjusted reward of node $n$ and all its descendants, $N_{n}$ is the visit count of node $n$, $N_{\mathrm{parent}}$ is the visit count of its parent, and $C = 1.4$ is the exploration constant. We run $J = 10$ iterations per task type with $N = 100$ episodes per node evaluation.
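A minimal sketch of the selection and backpropagation steps around Eq. 7 follows. Several details are assumptions rather than facts from the paper: any existing node may be chosen for expansion (one reading of "the node maximizing the UCB1 score"; a root-to-leaf descent is another), the root uses its own visit count in place of $N_{\mathrm{parent}}$, and unvisited nodes score infinity. The class and function names are hypothetical.

```python
import math


class Node:
    """One candidate task-specific skill set in the search tree."""

    def __init__(self, skills, parent=None):
        self.skills = skills
        self.parent = parent
        self.children = []
        self.visits = 0          # N_n
        self.total_reward = 0.0  # sum of rewards of this node and descendants

    def mean_reward(self):
        return self.total_reward / self.visits if self.visits else 0.0


def add_child(parent, skills):
    child = Node(skills, parent)
    parent.children.append(child)
    return child


def backpropagate(node, reward):
    """Credit a node's evaluation reward to it and every ancestor."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


def ucb1(node, c=1.4):
    if node.visits == 0:
        return math.inf
    # Assumption: the root has no parent, so it falls back to its own count.
    parent_visits = node.parent.visits if node.parent else node.visits
    return node.mean_reward() + c * math.sqrt(math.log(parent_visits) / node.visits)


def select_node(root, c=1.4):
    """Return the node with the highest UCB1 score anywhere in the tree."""
    stack, nodes = [root], []
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(node.children)
    return max(nodes, key=lambda n: ucb1(n, c))


# Toy usage: the root was evaluated at 0.5 and two rewrites at 0.9 and 0.1.
root = Node("initial skills")
backpropagate(root, 0.5)
good = add_child(root, "rewrite A")
backpropagate(good, 0.9)
weak = add_child(root, "rewrite B")
backpropagate(weak, 0.1)
chosen = select_node(root)
print(chosen.skills, round(ucb1(chosen), 3))
```

Because rewards are accumulated up the tree, `mean_reward` of a node averages over the node and all of its descendants, matching the definition of $\bar{R}(n)$ given above.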

## Appendix F Skill Rewriter Training Details

We perform full-parameter SFT on Qwen3-4B in BF16 precision. Training uses AdamW (learning rate 1e-5, cosine schedule, warmup ratio 0.1, gradient checkpointing), effective batch size 4 (per-device batch size 1 × gradient accumulation 4), 5 epochs, and max sequence length 4096. We select the best checkpoint based on training loss convergence.

The training data consists of pairs for in-domain tasks, with data augmentation including noisy inputs (noise ratio 0.3), partial inputs (keep ratio 0.6), and cross-model transfer pairs. We train two rewriter variants: a combined rewriter on 769 samples from all three environments (WebShop, Search, and ALFWorld restricted to Pick/Look/Pick2, i.e., excluding the held-out task types), and an environment-specific rewriter on 499 samples (WebShop + Search only).
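For a sense of scale, the step counts and learning-rate curve implied by these settings can be worked out in a few lines. The sketch assumes a single device, no dropped final batch, linear warmup rounded up to a whole step, and cosine decay to zero; these are common defaults but are not stated in the text, and the function names are illustrative.

```python
import math


def total_steps(num_samples, effective_batch=4, epochs=5):
    """Optimizer steps, assuming the last partial batch is kept."""
    return math.ceil(num_samples / effective_batch) * epochs


def lr_at(step, steps, peak_lr=1e-5, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup = math.ceil(steps * warmup_ratio)
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


combined_steps = total_steps(769)   # combined rewriter
env_steps = total_steps(499)        # environment-specific rewriter
warmup_steps = math.ceil(combined_steps * 0.1)

print(combined_steps, env_steps, warmup_steps)
print(lr_at(0, combined_steps), lr_at(warmup_steps, combined_steps))
```

With fewer than a thousand optimizer steps in either variant, the warmup phase covers on the order of a hundred steps under these assumptions.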

## Appendix G WebShop Supplementary Results

### G.1 Trajectory Analysis: Why Larger Models Fail

We analyze failed WebShop trajectories to understand why larger Qwen3 models (8B/14B/32B) perform worse than 4B under baseline conditions (Section [4.2](https://arxiv.org/html/2605.30723#S4.SS2)).

Table [7](https://arxiv.org/html/2605.30723#A7.T7) reveals a striking pattern: Qwen3-4B produces concise, action-only outputs (0% of steps with chain-of-thought, ~73 chars per action), while 8B/14B/32B prepend extensive reasoning preambles before each action command. Qwen3-14B is the most severe case, with 97% of steps containing verbose reasoning. This behavior exhausts the fixed step budget on deliberation rather than environment interaction: the agent “thinks” through multiple options but never completes enough purchase actions to succeed.

| Model | CoT (%) | Action Len. |
|---|---|---|
| Qwen3-4B | 0 | 73 chars |
| Qwen3-8B | 57 | 1,021 chars |
| Qwen3-14B | 97 | 574 chars |
| Qwen3-32B | 66 | 491 chars |

Table 7: WebShop trajectory statistics. CoT: fraction of steps containing reasoning preambles.

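Statistics of this kind could be gathered with a simple pass over logged step outputs. The sketch below is a guess at one workable heuristic, not the procedure used for Table 7: it assumes actions look like `search[...]` or `click[...]`, treats any non-whitespace text before the first action (or a step with no action at all) as a reasoning preamble, and measures length as characters of the whole step output.

```python
import re

# Assumed action syntax; adjust to the actual agent output format.
ACTION_RE = re.compile(r"(?:search|click)\[[^\]]*\]")


def step_stats(outputs):
    """Return (percent of steps with a preamble, mean characters per step)."""
    if not outputs:
        raise ValueError("need at least one step output")
    with_preamble = 0
    total_chars = 0
    for text in outputs:
        total_chars += len(text)
        match = ACTION_RE.search(text)
        # No action found, or text precedes the action: count as preamble.
        if match is None or text[:match.start()].strip():
            with_preamble += 1
    count = len(outputs)
    return 100.0 * with_preamble / count, total_chars / count


steps = [
    "search[red shoes]",
    "I should compare prices first. click[buy now]",
]
cot_percent, mean_chars = step_stats(steps)
print(cot_percent, mean_chars)
```

Counting characters per step rather than per action string is what lets a verbose preamble inflate the length figure, which is the effect the table is meant to expose.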
### G.2 Per-Category Breakdown

Table [8](https://arxiv.org/html/2605.30723#A7.T8) provides the full per-category success rate breakdown. Several observations stand out:

- For 8B/14B/32B baselines, most categories have near-zero SR, consistent with the verbose-reasoning bottleneck identified above.
- MASA achieves the best SR in the vast majority of categories across all backbones, with particularly large gains on Other and Electronics.
- The improvement is broad rather than category-specific: MASA does not exploit a single easy category to inflate the average but improves performance across the board.

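Per-Category SR (%)

| Model | Method | Apparel | Other | Footwear | Home | Elec. | Access. | Beauty | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | No Skill | 18.6 | 40.0 | 11.1 | 19.0 | 71.4 | 20.0 | 16.7 | 23.0 |
| Qwen3-4B | + Base Skill | 16.1 | 27.0 | 26.7 | 19.0 | 28.6 | 10.0 | 16.7 | 19.4 |
| Qwen3-4B | + DS-Adapter | 20.3 | 20.0 | 13.3 | 14.3 | 28.6 | 10.0 | 16.7 | 19.2 |
| Qwen3-4B | + MASA | 23.2 | 38.0 | 28.9 | 19.0 | 28.6 | 20.0 | 16.7 | 26.4 |
| Qwen3-8B | No Skill | 2.6 | 11.0 | 0.0 | 4.8 | 14.3 | 20.0 | 0.0 | 4.6 |
| Qwen3-8B | + Base Skill | 5.5 | 9.0 | 0.0 | 0.0 | 14.3 | 20.0 | 16.7 | 6.0 |
| Qwen3-8B | + DS-Adapter | 4.8 | 5.0 | 0.0 | 0.0 | 0.0 | 10.0 | 16.7 | 4.4 |
| Qwen3-8B | + MASA | 27.7 | 41.0 | 13.3 | 14.3 | 57.1 | 10.0 | 33.3 | 28.6 |
| Qwen3-14B | No Skill | 1.0 | 6.0 | 6.7 | 4.8 | 0.0 | 10.0 | 0.0 | 2.8 |
| Qwen3-14B | + Base Skill | 0.3 | 2.0 | 4.4 | 4.8 | 0.0 | 20.0 | 0.0 | 1.6 |
| Qwen3-14B | + DS-Adapter | 0.3 | 6.0 | 2.2 | 4.8 | 0.0 | 10.0 | 0.0 | 2.0 |
| Qwen3-14B | + MASA | 32.8 | 33.0 | 6.7 | 9.5 | 28.6 | 30.0 | 16.7 | 29.2 |
| Qwen3-32B | No Skill | 4.2 | 16.0 | 2.2 | 9.5 | 0.0 | 10.0 | 0.0 | 6.6 |
| Qwen3-32B | + Base Skill | 4.8 | 14.0 | 4.4 | 9.5 | 14.3 | 20.0 | 0.0 | 7.2 |
| Qwen3-32B | + DS-Adapter | 2.3 | 8.0 | 2.2 | 4.8 | 0.0 | 10.0 | 0.0 | 3.6 |
| Qwen3-32B | + MASA | 32.2 | 48.0 | 26.7 | 19.0 | 42.9 | 40.0 | 33.3 | 34.6 |

Table 8: WebShop per-category success rate (%). Bold marks the best within each backbone.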
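The average-SR margin of MASA over the other methods listed in Table 8 can be read off with a short helper. This is a bookkeeping sketch over the Avg. column only; the name `masa_gain` is illustrative, and "strongest alternative" here means the best of the three non-MASA rows in this table, which need not coincide with the baselines compared in the main paper.

```python
# Avg. column of Table 8 (WebShop SR, %), per backbone and method.
AVG_SR = {
    "Qwen3-4B": {"No Skill": 23.0, "+ Base Skill": 19.4,
                 "+ DS-Adapter": 19.2, "+ MASA": 26.4},
    "Qwen3-8B": {"No Skill": 4.6, "+ Base Skill": 6.0,
                 "+ DS-Adapter": 4.4, "+ MASA": 28.6},
    "Qwen3-14B": {"No Skill": 2.8, "+ Base Skill": 1.6,
                  "+ DS-Adapter": 2.0, "+ MASA": 29.2},
    "Qwen3-32B": {"No Skill": 6.6, "+ Base Skill": 7.2,
                  "+ DS-Adapter": 3.6, "+ MASA": 34.6},
}


def masa_gain(model):
    """Return (strongest non-MASA method, MASA margin over it in points)."""
    row = AVG_SR[model]
    others = {name: sr for name, sr in row.items() if name != "+ MASA"}
    strongest = max(others, key=others.get)
    return strongest, round(row["+ MASA"] - others[strongest], 1)


gains = {model: masa_gain(model) for model in AVG_SR}
for model, (method, margin) in gains.items():
    print(f"{model}: +{margin} over {method}")
```

The margin is small for Qwen3-4B, whose no-skill baseline is already comparatively strong, and much larger for the three bigger backbones.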

## Appendix H Skill Rewriter OOD: Per-Task Breakdown

Figure [5](https://arxiv.org/html/2605.30723#A8.F5) shows the per-task SR breakdown for the OOD generalization experiment (Section [4.3](https://arxiv.org/html/2605.30723#S4.SS3)).

![Refer to caption](https://arxiv.org/html/2605.30723v1/x5.png)

Figure 5: Per-task OOD generalization of MASA-Rewriter. Rows: target backbones (4B–32B). Columns: held-out task types (Clean, Heat, Cool).

#### Cross-task transfer.

Adding ALFWorld Pick/Look/Pick2 traces to the training set (dark blue) yields consistent improvements over Cross-env on all three tasks. The gains are most pronounced on Cool (e.g., 14B: 24.0 → 44.0; 32B: 48.0 → 52.0) and Clean (e.g., 8B: 29.6 → 51.9; 14B: 40.7 → 51.9), indicating that in-environment traces help the rewriter learn ALFWorld-specific action patterns such as navigation sequences and object interaction protocols. On Heat, Cross-task improves for 4B and 14B but is comparable to Cross-env for 8B and 32B, suggesting that Heat-specific patterns are already partially captured by the cross-environment signal.

#### Cross\-environment transfer\.

Trained only on Search and WebShop traces, the Cross-env rewriter (light blue) shows notable strengths on Heat across all backbones, particularly 14B (56.2%) and 32B (56.2%), substantially exceeding DS-Adapter. On Clean and Cool, Cross-env performance is more mixed: it matches or slightly exceeds DS-Adapter for 4B and 8B, but only ties or falls short on some larger-backbone cells (e.g., 14B Clean: 40.7, tied with DS-Adapter at 40.7). This suggests that cross-environment transfer is most effective when the target task involves decision patterns (e.g., sequential verification in Heat) that overlap with those in the training environments.

## Appendix I Qualitative Analysis: Evolved Skill Examples

| Model | Evolved Skill Text | Failure Mode | Strategy |
|---|---|---|---|
| 4B | When selecting a color variant, match the EXACT string from the task requirement to the available options. ‘navy blue’ ≠ ‘light blue’ ≠ ‘navy’. ‘c3-black’ ≠ ‘c-black’. If the exact color name is not available in the options list, this product CANNOT satisfy your requirement---leave immediately. Do NOT select an approximate or similar color. | Picks visually similar colors by guessing | Strict binary match: exact or leave |
| 8B | Scan ALL color options. If the EXACT color name appears, click it. ‘c3-black’ is NOT the same as ‘a6-black’. If the exact color is NOT available but a SIMILAR one exists (e.g., goal says ‘green’, options have ‘e-green’), select the CLOSEST match and proceed to buy. A close color match gives partial credit which is better than 0. Even if the product doesn’t perfectly match---BUY IT. | Abandons products too easily (0 credit) | Flexible match: buy anyway for partial credit |
| 14B | If your required color is NOT in the admissible actions list, click ‘back to search’ immediately. Do not try similar colors. Do not try similar sizes. One glance at options → if exact match missing → back to search. Takes 1 step, not 5. | Wastes steps deliberating on bad products | Fast-fail: 1-step exit if no exact match |
| 32B | Match color by checking if the task’s required color name appears as a SUBSTRING in any admissible action, or vice versa. ‘patina green’ matches ‘patina green’ (exact) but NOT ‘yellow’. ‘green’ matches ‘a1-green’ or ‘d01green’ (contains). When multiple options contain the color word, prefer the one that matches more of the full color name. | Mishandles coded names (e.g. d01green) | Algorithmic: substring matching with preference rule |

Table 9: WebShop color-matching skill evolved by MASA for four backbones. Red: rigid rejection rule (4B); Blue: flexible buy-anyway heuristic (8B); Teal: 1-step fast-fail exit (14B); Violet: algorithmic substring matching (32B). Each strategy targets the dominant failure mode of its target model.

Table [9](https://arxiv.org/html/2605.30723#A9.T9) presents a case study of how MASA adapts skills differently for each backbone on the same subtask: WebShop’s color-matching decision, identified as the highest-failure-rate subtask during skill evolution.

Rather than producing minor wording variations, the evolution pipeline discovers qualitatively distinct strategies tailored to each model’s dominant failure mode:

- Qwen3-4B tends to guess visually similar colors. The evolved skill imposes a strict binary rule: match exactly or leave immediately.
- Qwen3-8B abandons products too easily, scoring zero. The evolved skill encourages buying approximate matches for partial credit.
- Qwen3-14B wastes steps deliberating on bad products. The evolved skill enforces a one-step fast-fail exit when no exact match exists.
- Qwen3-32B mishandles coded color names (e.g., d01green). The evolved skill provides an algorithmic substring-matching procedure with tie-breaking rules.

This demonstrates that model\-conditioned adaptation operates at the level of*decision strategy*—the same problem requires fundamentally different solutions depending on how each backbone fails\.
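The difference between these decision strategies is easier to see when each is written as a small matching function. The sketch below is an interpretation of the natural-language skills in Table 9, not code produced or used by MASA: the function names are hypothetical, `None` stands for "leave or go back to search", the 8B rule uses `difflib` similarity as a stand-in for "closest match", and the 32B preference rule is approximated by counting how many words of the required color appear in an option.

```python
import difflib


def exact_or_none(required, options):
    """4B / 14B style: accept only an exact string match, otherwise exit."""
    return required if required in options else None


def exact_or_closest(required, options):
    """8B style: prefer an exact match, else fall back to the closest option."""
    if required in options:
        return required
    close = difflib.get_close_matches(required, options, n=1, cutoff=0.6)
    return close[0] if close else None


def substring_match(required, options):
    """32B style: substring match in either direction, preferring fuller matches."""
    req = required.lower()
    hits = [o for o in options if req in o.lower() or o.lower() in req]
    if not hits:
        return None
    words = req.split()

    def coverage(option):
        low = option.lower()
        # Exact matches win; otherwise prefer options covering more words.
        return (low == req, sum(word in low for word in words))

    return max(hits, key=coverage)


print(exact_or_none("navy blue", ["navy", "light blue"]))
print(exact_or_closest("green", ["e-green", "red"]))
print(substring_match("green", ["a1-green", "yellow"]))
print(substring_match("patina green", ["green", "patina green", "yellow"]))
```

The 4B and 14B skills share the same accept-or-exit decision here; what separates them in Table 9 is emphasis (avoiding approximate picks versus avoiding wasted steps), which a pure matching function does not capture.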

## Appendix J The Use of Large Language Models (LLMs)

In this paper, large language models were utilized exclusively for grammatical polishing and stylistic refinement, aimed at enhancing the clarity and readability of our presentation of results\.

*The following pages contain supplementary tables and figures\.*
