Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

arXiv cs.CL 06/11/26, 04:00 AM Papers
agent-skills evaluation benchmarks evolution llm-agents survey skill-ecosystems
Summary
This survey systematically examines skill evolution and evaluation for agentic systems, categorizing evolution into four paradigms and analyzing six skill-centric benchmark categories to identify structural gaps and open directions.
arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in real-world applications. Consequently, the field is undergoing an emerging paradigm shift from isolated skill creation to automated, evaluation-driven skill evolution. In this survey, we systematically examine the landscape of skill evolution and evaluation beyond foundational skill creation. We categorize evolution into four distinct paradigms, spanning execution feedback, trajectory distillation, compression, and reinforcement learning, showing how each element contributes to improving skill utility and reliability. We also provide an analysis of six skill-centric benchmark categories, identifying structural gaps in benchmark coverage, trade-offs, and metric richness to advance skill research. Finally, we identify open directions for building skill ecosystems that are generalizable, efficient, and verifiably safe. The project URL is https://github.com/Cassie07/AgentSkill_Survey
Cached at: 06/11/26, 01:37 PM
# Agent Skill Evaluation and Evolution: Frameworks and Benchmarks
Source: [https://arxiv.org/html/2606.11435](https://arxiv.org/html/2606.11435)
Kexin Ding¹, Yang Zhou¹, Can Jin¹, Feng Tong², Mu Zhou¹, Dimitris N. Metaxas¹

¹Rutgers University, ²University of North Carolina at Charlotte. Correspondence: dnm@cs.rutgers.edu

###### Abstract

The growth of *agent skills* has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in real-world applications. Consequently, the field is undergoing an emerging paradigm shift from isolated skill creation to automated, evaluation-driven skill evolution. In this survey, we systematically examine the landscape of skill evolution and evaluation beyond foundational skill creation. We categorize evolution into four distinct paradigms, spanning execution feedback, trajectory distillation, compression, and reinforcement learning, showing how each element contributes to improving skill utility and reliability. We also provide an analysis of six skill-centric benchmark categories, identifying structural gaps in benchmark coverage, trade-offs, and metric richness to advance skill research. Finally, we identify open directions for building skill ecosystems that are generalizable, efficient, and verifiably safe. The project URL is [https://github.com/Cassie07/AgentSkill_Survey](https://github.com/Cassie07/AgentSkill_Survey)


## 1 Introduction

Agent skills equip LLM agents with domain-specific knowledge at inference time, enabling agents to perceive and interact with environments through diverse external tools Zhang et al. ([2025](https://arxiv.org/html/2606.11435#bib.bib47)). Unlike prompt engineering Wei et al. ([2022](https://arxiv.org/html/2606.11435#bib.bib112)); Brown et al. ([2020](https://arxiv.org/html/2606.11435#bib.bib102)), agent skills encode reusable, portable, multi-step solutions that guide agents through complex tasks via coordinated decision sequences, thereby substantially reducing manual effort.

As the scale and diversity of agent skills continue to grow Liang et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib53)); Li et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib106)), the absence of robust evaluation frameworks has become a critical bottleneck for skill-guided agent deployment. Meanwhile, the diversity of skills makes manual refinement infeasible, a challenge compounded by the lack of evolution approaches that capture real-world feedback Zhang et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib74)). An outdated or unsafe skill can propagate errors across downstream tasks, turning skill assessment into an open problem of diagnosis, maintenance, and alignment. It is thus essential to establish automated and continuous mechanisms for agent skills, rather than relying on static pipelines, so that skills generalize across tasks and are verifiably safe for public use. In this survey, we position skill evolution and evaluation as the central focus of this emerging paradigm (Figure [1](https://arxiv.org/html/2606.11435#S2.F1)). Concretely, we introduce a four-paradigm taxonomy of skill evolution strategies ([Section 3](https://arxiv.org/html/2606.11435#S3)) and draw insights for designing evolution strategies that enhance skill creation, utility, and refinement with less human effort. We further provide a critical analysis of skill-centric benchmarks ([Section 4](https://arxiv.org/html/2606.11435#S4)) to assess their opportunities in multimodal skills, trajectory distillation, and skill security, toward better real-world agent deployment.

## 2 What is Skill?

![Refer to caption](https://arxiv.org/html/2606.11435v1/x1.png)

Figure 1: We map the landscape of agent skill evolution strategies (§3) through comparative analysis and design recommendations. We offer evaluation insights (§4) through structural gaps and benchmark limitations, and open challenges (§5) for robust real-world skill deployment.

An agent skill is a structured package Li et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib106)): $\mathcal{S} = (C, \pi, T, \mathcal{R})$, where $C: \mathcal{O} \times \mathcal{G} \rightarrow \{0, 1\}$ is the condition, mapping the agent observation ($\mathcal{O}$) and goal ($\mathcal{G}$) to the skill relevance; $\pi$ is the execution policy to encode the procedures; $T$ is the termination criterion, specifying when skill execution is completed; and $\mathcal{R}$ is the reusable interface to indicate the composition with other skills.
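To make the four components of the skill tuple concrete, the following is a minimal sketch in Python. The `Skill` class, the `run_skill` helper, the `max_steps` cap, and the use of plain strings for observations and goals are all illustrative assumptions and are not taken from the survey or from any skill framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Skill:
    """Hypothetical container mirroring S = (C, pi, T, R)."""
    name: str
    condition: Callable[[str, str], bool]   # C: (observation, goal) -> {0, 1}
    policy: Callable[[str], str]            # pi: observation -> next action
    terminated: Callable[[str], bool]       # T: True once execution is complete
    composes_with: List[str] = field(default_factory=list)  # R: compatible skills


def run_skill(skill: Skill, observation: str, goal: str,
              step: Callable[[str, str], str], max_steps: int = 10) -> Tuple[str, int]:
    """Apply the policy while the skill is relevant and T has not fired.

    `step` is an assumed environment function: (observation, action) -> observation.
    Returns the final observation and the number of steps taken.
    """
    if not skill.condition(observation, goal):
        return observation, 0
    steps = 0
    while not skill.terminated(observation) and steps < max_steps:
        observation = step(observation, skill.policy(observation))
        steps += 1
    return observation, steps


# Toy usage: a skill that appends "x" until the observation has three characters.
fill = Skill("fill", lambda o, g: g == "fill", lambda o: "x", lambda o: len(o) >= 3)
final_obs, n_steps = run_skill(fill, "", "fill", lambda o, a: o + a)
```

Keeping the condition separate from the policy means relevance can be checked cheaply before any execution cost is paid, which is the role the definition assigns to $C$.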

Human-authored skills encapsulate domain expertise as machine-interpretable procedural knowledge Li et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib106)). To expedite this process, automated skill creation enables an agent to generate skills with less human authoring effort. For example, Skill Creator Anthropic ([2026](https://arxiv.org/html/2606.11435#bib.bib5)) can automatically create a full skill directory and test cases from a minimal human-written description. Similarly, Voyager Wang et al. ([2023](https://arxiv.org/html/2606.11435#bib.bib110)) creates skills as executable code through a loop of proposing tasks, refining code via environment feedback, self-verification, and updating the skill library. To better create reusable skills, reinforcement learning (RL) is integrated into the training loop, where rewards earned from reusing a skill on later tasks are propagated back to update the policy. Inspired by Group Relative Policy Optimization (GRPO) Shao et al. ([2024](https://arxiv.org/html/2606.11435#bib.bib93)), SAGE Wang et al. ([2025](https://arxiv.org/html/2606.11435#bib.bib92)) leverages reuse rewards from groups of tasks, encouraging the agent to learn and create reusable skills. ARISE Li et al. ([2026c](https://arxiv.org/html/2606.11435#bib.bib94)) preserves successful reasoning patterns to train agents toward generating reusable skills, overcoming GRPO's limitation of treating rollouts independently.

Efficient skill usage strategies involve retrieval, routing, and management. For a given task, the agent usually cannot load every candidate skill to assess its usability, because doing so would consume excessive time and tokens. To address this, (a) Retrieval selects a small set of skills from a large skill pool; (b) Routing decides which retrieved skill should be executed at which step; (c) Management keeps the skills organized, up-to-date, and safe to use ([Appendix A](https://arxiv.org/html/2606.11435#A1)). These usage mechanisms establish the foundation upon which the evolution and evaluation frameworks outlined below operate.
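The retrieval step can be illustrated with a deliberately simple sketch: rank skills by word overlap between the task and each skill description, then keep the top k. The function names, the Jaccard scoring, and the toy skill pool are assumptions made for illustration. The systems surveyed here may well use learned embeddings or rerankers instead.

```python
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization (a stand-in for a real encoder)."""
    return set(text.lower().split())


def retrieve(task: str, skill_descriptions: Dict[str, str], k: int = 2) -> List[str]:
    """Return the names of the k skills whose descriptions best overlap the task."""
    task_tokens = tokenize(task)

    def score(name: str) -> float:
        desc_tokens = tokenize(skill_descriptions[name])
        union = task_tokens | desc_tokens
        return len(task_tokens & desc_tokens) / len(union) if union else 0.0

    # Sort by descending score, breaking ties by name for determinism.
    ranked = sorted(skill_descriptions, key=lambda n: (-score(n), n))
    return ranked[:k]


# Hypothetical skill pool.
pool = {
    "pdf_merge": "merge pdf files",
    "pdf_split": "split pdf files",
    "csv_plot": "plot csv data",
}
top = retrieve("merge two pdf files", pool, k=2)
```

Only the shortlisted skills would then be loaded in full, which is how retrieval avoids the token cost described above.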

## 3 Skill Evolution

Skill evolution is a continuous process that improves skill quality by learning from past success and failure patterns, keeping capabilities up to date Zhang et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib74)). As the number of skills continues to grow, scaling manual refinement becomes increasingly impractical. This hurdle motivates automatic strategies that leverage skill execution records, including rich feedback signals and task-solving trajectories Ni et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib56)). Such raw execution signals and trajectories are often noisy, mixing successful steps with irrelevant or failed ones. Therefore, reliable skill evolution requires capturing reusable execution patterns across multiple trajectories rather than individual behaviors Zhang et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib74)). Moreover, growing skill libraries are likely to introduce conflicting content, leading to redundant storage, excessive token consumption, and poor generalizability Wang et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib57)); Gao et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib91)); Zhang et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib58)). To address these hurdles, we organize evolution strategies by the source and granularity of the learning signal: *Execution feedback* operates on single-run, step-level signals; *Trajectory distillation* operates on multi-run, sequence-level patterns; *Compression and augmentation* operate on library-level structures; *Reinforcement learning* operates on task-level rewards. These paradigms are not mutually exclusive, but they represent the dominant design choices in the community. We further analyze how the current evolution paradigms align with benchmarks, highlighting trade-offs and practical guidelines that motivate future research ([Appendix C](https://arxiv.org/html/2606.11435#A3)).

| Evolution Strategy | Methods |
| --- | --- |
| Execution Feedback | SkillForge Liu et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib73)); CoEvoSkills Zhang et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib74)); Skills-Coach Tian et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib39)); Ctx2Skill Si et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib40)); AutoSkill Yang et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib54)); SkillClaw Ma et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib65)); EmbodiSkill Ju et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib67)) |
| Trajectory Distillation | SPARK Zhou et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib116)); Trace2Skill Ni et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib56)); Memento-Skills Zhou et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib52)); XSkill Jiang et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib66)) |
| Compression & Augmentation | SkillNet Liang et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib53)); SkillX Wang et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib57)); SkillReducer Gao et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib91)); SkillFoundry Shen et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib41)) |
| Reinforcement Learning | D2Skill Tu et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib72)); SkillRL Xia et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib113)); SkillOS Ouyang et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib8)); Skill1 Shi et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib42)) |

Table 1: Summary of skill evolution strategies in [Section 3](https://arxiv.org/html/2606.11435#S3). Categories: Execution Feedback, Trajectory Distillation, Compression & Augmentation, Reinforcement Learning.

### Execution Feedback

The record of skill execution can reveal valuable feedback signals for skill improvement, including runtime errors, incorrect outputs, unmet task specifications, and execution paths. Inspired by how humans rewrite skills, an intuitive approach is an automated loop that executes the existing skill, observes failure patterns in the execution feedback, and then rewrites the skill to prevent those failures from recurring. Execution feedback can come from either explicit signals Liu et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib73)); Zhang et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib74)); Tian et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib39)); Si et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib40)); Jin et al. ([2025a](https://arxiv.org/html/2606.11435#bib.bib103)); Ju et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib67)) or implicit ones Yang et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib54)); Ma et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib65)), both of which are crucial for guiding skill evolution.
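The execute, observe, rewrite loop described above can be sketched in a few lines. This is a generic illustration under stated assumptions, not the procedure of any cited system: `evolve_skill`, the `run` and `rewrite` callables, and the toy checklist are hypothetical, and in practice the rewrite step would typically be an LLM call.

```python
from typing import Callable, List, Tuple

REQUIRED_STEPS = ["validate input", "retry on timeout"]  # toy specification


def evolve_skill(skill_text: str,
                 run: Callable[[str], List[str]],
                 rewrite: Callable[[str, List[str]], str],
                 max_rounds: int = 3) -> Tuple[str, List[List[str]]]:
    """Execute the skill, collect failures, and rewrite until a run is clean.

    run(skill_text) returns a list of failure messages (empty means success).
    rewrite(skill_text, failures) returns a revised skill text.
    """
    history = []
    for _ in range(max_rounds):
        failures = run(skill_text)
        history.append(list(failures))
        if not failures:
            break
        skill_text = rewrite(skill_text, failures)
    return skill_text, history


def toy_run(text: str) -> List[str]:
    """Report only the first missing step, as a single execution might."""
    return [f"missing: {s}" for s in REQUIRED_STEPS if s not in text][:1]


def toy_rewrite(text: str, failures: List[str]) -> str:
    """Append the step named in the first failure message."""
    return text + "\n- " + failures[0].split(": ", 1)[1]


final_skill, failure_history = evolve_skill("- call api", toy_run, toy_rewrite, max_rounds=5)
```

Recording the per-round failure history, rather than only the final text, keeps diagnosis available as a separate artifact from the rewrite itself.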

Traceable signals for skill evolution can come from real engineering activities, revealing user intent, agent tool calls, and concrete error patterns. These signals are critical for automatically detecting, diagnosing, and correcting flawed skills. For instance, SkillForge Liu et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib73)) creates new skills by detecting discrepancies between execution and reference behaviors. In particular, SkillForge produces structured failure records to identify systemic patterns, reducing the need for human rewriting and verification. To support multi-turn conversations, CoEvoSkills Zhang et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib74)) enables agents to reduce human–machine cognitive misalignment and produce self-evolved skills that outperform human-curated ones. To handle skills that fail during execution, it introduces a verifier that provides direct feedback in the form of root-cause analysis and revision suggestions. Access to rich environment feedback can further enhance skill reliability. EmbodiSkill Ju et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib67)) leverages execution feedback from the agent's interaction with the environment, which produces a trajectory of actions, observations, and a final reward. Rather than relying on real execution feedback, Skills-Coach Tian et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib39)) executes skills on synthetic cases to obtain evolution feedback. Skills-Coach produces several rewritten versions of the seed skill. The highest-scoring rewrite serves as a success signal for improving skill instructions, while failure traces drive changes to skill scripts that prevent those failures. Similarly, Ctx2Skill Si et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib40)) learns from feedback by producing synthetic diagnostic questions from the reference document.

Even without a clear execution signal, users' preferences across conversations, such as preferred tone, terminology, or writing conventions, remain valuable for skill evolution. We argue that interaction traces can contain rich signals with reusable knowledge. AutoSkill Yang et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib54)) treats user interactions as the main signal for skill evolution. Rather than relying solely on failure correction, it turns user preferences into explicit capabilities that personalize the agent's behavior. Similarly, SkillClaw Ma et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib65)) leverages heterogeneous user experiences from key trajectories that reflect how different users interact with tools and workflows.

Across the execution feedback studies above, we find that structured failure modeling is a meaningful design factor. Systems that cleanly separate failure diagnosis from rewrite generation (SkillForge Liu et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib73)), CoEvoSkills Zhang et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib74))) tend to report stronger cross-task results than systems operating on raw traces (AutoSkill Yang et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib54)), SkillClaw Ma et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib65))), although head-to-head comparisons remain absent. Feedback signals are inherently bounded by the diversity of the execution environment rather than that of deployment, a structural constraint that deserves explicit attention in the design and evaluation of skills.

### Trajectory Distillation

Skill evolution via trajectory distillation is gaining momentum as a way to improve skills through sequential memorization, capturing task-specific, reusable patterns. For instance, SPARK Zhou et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib116)) explores online trajectory verification to distill strong skills from executable evidence. It introduces a trajectory-level measure that assesses skill performance using task-environment evidence rather than unverified prior plans. Trace2Skill Ni et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib56)) updates skills from multiple execution trajectories by generating targeted patches from success and failure cases, then merging redundant fixes into a single conflict-free skill file. Memento-Skills Zhou et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib52)) introduces a read-write reflective loop for skill evolution: a router retrieves relevant skills for execution, and the agent updates them from execution feedback, enabling iterative refinement and long-term behavioral memory. To broaden the data modality, XSkill Jiang et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib66)) grounds skill learning in visual observations, capturing relations between visual states and decisions. It extracts task-level skills through visual summarization and action-level experiences through cross-rollout critiques of successes and failures, consolidating both into a unified skill bank through merging and refinement.
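As a rough illustration of distilling a reusable pattern from several runs, the sketch below keeps only the actions that recur across a chosen fraction of successful trajectories. It is a simplified frequency heuristic and not the algorithm of SPARK, Trace2Skill, or any other cited method. The `distill` function, the `min_support` parameter, and the toy trajectories are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Trajectory = Tuple[List[str], bool]  # (sequence of actions, task succeeded?)


def distill(trajectories: List[Trajectory], min_support: float = 0.5) -> List[str]:
    """Keep actions that occur in at least `min_support` of the successful runs.

    Failed runs are ignored here; actions are returned in first-seen order.
    """
    successes = [actions for actions, ok in trajectories if ok]
    if not successes:
        return []
    # Count each action at most once per trajectory.
    counts = Counter(a for actions in successes for a in set(actions))
    threshold = min_support * len(successes)
    ordered: List[str] = []
    for actions in successes:
        for action in actions:
            if counts[action] >= threshold and action not in ordered:
                ordered.append(action)
    return ordered


runs = [
    (["open", "search", "click", "save"], True),
    (["open", "scroll", "search", "save"], True),
    (["open", "click"], False),
    (["open", "search", "save"], True),
]
core_steps = distill(runs)
```

Filtering by support across runs is one way to separate recurring steps from incidental ones, which is the noise problem that motivates working at the multi-run level.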

### Compression and Augmentation

As skill libraries grow rapidly, skill overlap and conflict lead to redundant exploration and poor generalization Wang et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib57)); Liang et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib53)). Therefore, skill compression and augmentation are increasingly important for reducing duplication, gaining complementary knowledge, and exploring reliable skills. SkillNet Liang et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib53)) enables the community to create, evaluate, and connect agent skills sourced from GitHub repositories and office documents. The resulting skill-similarity relation graph indicates whether existing skills should be reused or merged. However, SkillNet does not explicitly define skill categories to support its evolution. To address this, SkillX Wang et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib57)) builds a multi-category skill evolution process from execution trajectories by merging similar skills, decomposing complex ones, and assessing their generalization. To broaden coverage, SkillX prioritizes underexplored or failure-prone tools and synthesizes novel tasks to acquire and validate new skills, improving skill quality, richness, and coverage. SkillFoundry Shen et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib41)) takes a knowledge-driven approach to skill augmentation, organizing a tree structure in which each node tracks references and existing skills so that underexplored branches can be prioritized. This structure mines heterogeneous scientific resources into executable skills. Alternatively, SkillReducer Gao et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib91)) reframes skill evolution as content cleanup, pruning overly long skill descriptions and reorganizing skills into actionable rules and references, preserving essential knowledge while reducing token overhead.
As these skills move into deployment, a promising frontier is grounding compression and augmentation decisions in live signals such as retrieval frequency and runtime failure rates, rather than in offline design alone.
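The idea of merging overlapping skills can be sketched with a greedy near-duplicate filter. This is only an illustration of similarity-based deduplication: the Jaccard measure over description words, the 0.6 threshold, and the `deduplicate` function are assumptions, and the surveyed systems use their own similarity graphs and merge rules.

```python
from typing import Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two descriptions."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def deduplicate(skills: Dict[str, str],
                threshold: float = 0.6) -> Tuple[List[str], Dict[str, str]]:
    """Greedily keep a skill unless it is too similar to one already kept.

    Returns the kept skill names and a map from each dropped skill
    to the kept skill it would be merged into.
    """
    kept: List[str] = []
    merged: Dict[str, str] = {}
    for name in sorted(skills):  # sorted for a deterministic result
        match = next(
            (k for k in kept if jaccard(skills[k], skills[name]) >= threshold),
            None,
        )
        if match is None:
            kept.append(name)
        else:
            merged[name] = match
    return kept, merged


library = {
    "a_merge_pdf": "merge several pdf files into one",
    "b_combine_pdf": "merge several pdf files into one document",
    "c_plot": "plot csv columns as a chart",
}
kept_skills, merge_map = deduplicate(library)
```

A live-signal variant, in the spirit of the frontier noted above, could choose which duplicate to keep based on retrieval frequency or failure rate instead of name order.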

### Reinforcement Learning

Reinforcement learning (RL) Li et al. ([2026c](https://arxiv.org/html/2606.11435#bib.bib94)); Zhou et al. ([2026c](https://arxiv.org/html/2606.11435#bib.bib118)) has emerged as a principled approach for aligning LLM agents with task execution rewards and driving reliable skill evolution Wang et al. ([2023](https://arxiv.org/html/2606.11435#bib.bib110), [2024](https://arxiv.org/html/2606.11435#bib.bib111)); Jin et al. ([2025b](https://arxiv.org/html/2606.11435#bib.bib104)). However, standard RL rewards only a single task per update, while the real value of a skill lies in its reusability across tasks Tu et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib72)). To achieve stable rewards, D2Skill Tu et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib72)) leverages the multiple-rollout strategy in GRPO, enabling the policy agent to generate highly reusable skills. For each task, the LLM agent retrieves the most relevant skills and runs the task twice, once with the retrieved skills and once without. The resulting success-rate gap between the two rollouts yields more stable rewards and improved skill reusability. SkillRL Xia et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib113)) also leverages GRPO and exploits both success and failure signals, collecting trajectories across multiple rollouts to update the policy for skill retrieval and refinement, with identified failure patterns driving targeted skill revision. However, these studies Tu et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib72)); Xia et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib113)) treat skill retrieval, utilization, and evolution as separate components, risking conflicts during concurrent updates. Skill1 Shi et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib42)) overcomes this fragmentation through unified co-evolution, training a single policy to jointly perform skill search, selection, task solving, and skill evolution within a single rollout.
Overall, RL-based skill evolution has progressed from per-task rewards, to reuse-aware rewards over multiple rollouts, to unified co-evolution within a single policy. These approaches still rely on task-level rewards that can conflate skill quality with agent capability, leaving open whether performance gains come from skill evolution or from model improvement.
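The with-skill versus without-skill comparison, and the group-relative normalization that GRPO applies to rewards within a group of rollouts, can be written down compactly. The sketch below is an assumption-laden illustration rather than the reward used by D2Skill or SkillRL: the function names, the 0/1 outcome encoding, the use of the population standard deviation, and the `eps` constant are all illustrative choices.

```python
from statistics import mean, pstdev
from typing import List


def skill_gap_reward(with_skill: List[int], without_skill: List[int]) -> float:
    """Success-rate gap between rollouts that used the skills and those that did not.

    Each list holds 0/1 task outcomes from rollouts on the same task.
    """
    return mean(with_skill) - mean(without_skill)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center and scale rewards within one group of rollouts (GRPO-style).

    `eps` avoids division by zero when every rollout receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


gap = skill_gap_reward([1, 1, 0, 1], [0, 1, 0, 0])
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the gap is measured with the same policy on both sides, it is one way to attribute gains to the skill rather than to the underlying model, which speaks to the conflation concern raised above.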

## 4 Skill-centric Evaluation and Benchmarks

Evaluation is important in the agent skill lifecycle because skills are continuously created, evolved, and shared among users. Without rigorous evaluation, it is difficult to fairly assess skill quality and safety. In principle, skill evaluation should serve as: (1) validation of comprehensive task performance; (2) comparison among skills in a fair environment; (3) safety auditing to detect harmful behaviors before skill deployment. We discuss skill-centric benchmarks that measure the realistic performance of agent skills. In [Table 2](https://arxiv.org/html/2606.11435#S4.T2), we group them into six major categories. We also include general-domain benchmarks that were not designed for skill evaluation yet remain applicable to assessing agent skills (details in [Appendix B](https://arxiv.org/html/2606.11435#A2)).

[Table 2](https://arxiv.org/html/2606.11435#S4.T2) reveals three structural gaps that warrant systematic investigation. First, coverage is uneven: utility and safety benchmarks collectively cover 11 professional domains and 581 auditable packages, whereas generation benchmarks span only 15 sub-domains with 20 core tasks. Second, no existing benchmark evaluates evolution longitudinally, i.e., tracking whether a skill improves across multiple rounds of feedback rather than measuring a single snapshot. Third, evaluation metrics are predominantly binary (pass/fail), which overlooks operational factors such as token cost, latency, and error type. These gaps should guide future research as much as the benchmarks themselves.
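The second and third gaps suggest what a richer evaluation record might contain. The sketch below is hypothetical: the `RunRecord` fields and the `summarize_by_round` helper are not taken from any benchmark in Table 2. They show one way to log token cost, latency, and error type alongside pass/fail, grouped by evolution round so that a skill can be tracked over time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RunRecord:
    round_index: int                   # feedback round that produced this skill version
    passed: bool
    tokens: int
    latency_s: float
    error_type: Optional[str] = None   # e.g. "timeout"; None when the run passed


def summarize_by_round(records: List[RunRecord]) -> Dict[int, dict]:
    """Aggregate pass rate, mean tokens, mean latency, and error types per round."""
    summary: Dict[int, dict] = {}
    for idx in sorted({r.round_index for r in records}):
        runs = [r for r in records if r.round_index == idx]
        n = len(runs)
        summary[idx] = {
            "pass_rate": sum(r.passed for r in runs) / n,
            "mean_tokens": sum(r.tokens for r in runs) / n,
            "mean_latency_s": sum(r.latency_s for r in runs) / n,
            "errors": sorted({r.error_type for r in runs if r.error_type}),
        }
    return summary


# Toy log: two runs of the seed skill (round 0) and two after one revision (round 1).
log = [
    RunRecord(0, False, 1000, 2.0, "timeout"),
    RunRecord(0, True, 800, 1.0),
    RunRecord(1, True, 600, 1.0),
    RunRecord(1, True, 400, 1.0),
]
report = summarize_by_round(log)
```

Comparing `report[0]` with `report[1]` gives a longitudinal view of the same skill, instead of a single snapshot.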

| Category | Benchmark | Scale (Total) | Task Composition |
| --- | --- | --- | --- |
| Utility | SkillsBench Li et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib106)) | 86 tasks (7,308 trajectories; 7 agent–model configs) | 11 professional domains: healthcare, manufacturing, cybersecurity, natural science, energy, office & white collar, finance, media & content production, robotics, mathematics, software engineering |
| Utility | SkillCraft Chen et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib49)) | 126 tasks | Long-horizon compositional tool-use tasks scaled by item count and tool-call chain depth; agents cache successful tool sequences as a persistent skill library |
| Generation | SkillLearnBench Zhong et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib36)) | 20 tasks, 100 instances | 6 categories, 15 sub-domains (software engineering, information retrieval, productivity tools, data & analytics, content & creative, utilities); 3-level evaluation (skill quality, trajectory alignment, task outcome) |
| Retrieval & Routing | SRA-Bench Su et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib44)) | 5,400 instances (636 gold skills in 26,262 skill corpus) | 6 source datasets (TheoremQA, LogicBench, ToolQA, MedCalc-Bench, CHAMP, BigCodeBench); decomposed evaluation of skill retrieval, incorporation, and application |
| Retrieval & Routing | SkillRouter Zheng et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib64)) | 75 core queries (∼80K candidate skills) | SkillsBench-derived routing benchmark; compares metadata-only vs. full-body retrieval and reranking |
| Retrieval & Routing | AgentSkillOS Li et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib71)) | 30 tasks (200 to 200,000 skills) | Data computation, document creation, motion video, visual design, web interaction |
| Safety & Security | SkillTester Wang et al. ([2026c](https://arxiv.org/html/2606.11435#bib.bib63)) | Per-skill (variable) | 2 utility task groups (common functional, edge functional) + 3 security probe groups (abnormal behavior control, permission boundary, sensitive data protection); outputs utility score, security score, and 3-level security status label |
| Safety & Security | SkillGuardBench Lv et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib38)) | 581 packages (327 core + 254 public-ecosystem); 5 evaluation views (254–404 packages each) | Package-level (SKILL.md + scripts + references + repo context) auditing; 3-way labels (benign / suspicious / malicious) covering hidden override, disguised transfer, remote bootstrap; semantics-preserving rewrites for attack-exact-consistency |
| Safety & Security | SKILL-INJECT Schmotz et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib35)) | 23 skills, 202 injection-task pairs | 8 categories: data exfiltration, data destruction, DoS, ransomware, phishing, backdoors, bias manipulation, poisoning |
| Software Engineering | SWE-Skills-Bench Han et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib48)) | ∼565 task instances (49 public SWE skills) | 6 SWE subdomains over authentic GitHub repos pinned at fixed commits; requirement docs with deterministic execution-based acceptance criteria; paired with/without-skill evaluation |
| Real-world Environment | WildClawBench Ding et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib37)) | 60 hand-built tasks | 6 categories in a live OpenClaw environment: productivity flow, code intelligence, social interaction, search & retrieval, creative synthesis, safety alignment; Docker-isolated grading injected post-execution |
| Real-world Environment | SkillForge benchmark Liu et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib73)) | 3,737 tasks (1,883 tickets) | Five real-world cloud technical-support scenarios |

Table 2: Skill-centric dynamic benchmarks. Categories: Utility, Generation, Retrieval & Routing, Safety & Security, Software Engineering, Real-world Environment.

### Skill Utility Benchmarks

Skill utility remains the primary criterion for assessing how skills improve an agent's task completion performance Li et al. ([2025a](https://arxiv.org/html/2606.11435#bib.bib62)); Zhang et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib74)); Gao et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib91)). SkillsBench Li et al. ([2026b](https://arxiv.org/html/2606.11435#bib.bib106)) covers 86 hand-built tasks across 11 professional domains with deterministic verifiers. Across 7,308 trajectories and 7 agent-model configurations, curated skills raise the pass rate by 16 percentage points on average, with effects ranging from +4.5 pp in software engineering to +51.9 pp in healthcare. SkillsBench has since become a source for subsequent benchmarks that extend the evaluation scope. For instance, SkillTester Wang et al. ([2026c](https://arxiv.org/html/2606.11435#bib.bib63)) jointly measures utility and safety, and SkillRouter Zheng et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib64)) evaluates skill retrieval and reranking. Alternatively, SkillCraft Chen et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib49)) focuses on whether agents can reuse their own tool compositions across tasks. In particular, SkillCraft introduces long-horizon tool-use tasks whose difficulty is scaled along two axes: a quantitative axis that increases the number of agent-process entries, and a structural axis that lengthens the tool-call chain by composing subtasks into deeper workflows. Under its protocol, an agent can package a successful tool sequence as a reusable skill for later tasks. To date, tool-calling cost, latency, and reasoning quality remain underexplored factors that future benchmarks should explicitly address.

### Skill Generation Benchmarks

Skill generation benchmarks provide a quantitative evaluation of skill quality, which is increasingly critical as manual quality assessment becomes impractical at scale. When hundreds of skills are produced, early identification of low-quality skills can reduce token consumption. SkillLearnBench Zhong et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib36)) contains 20 skill-dependent tasks across 15 sub-domains, each with multiple variants that share the same task structure but use different input values, enabling evaluation of skill reuse beyond one-shot success. SkillLearnBench indicates that external execution feedback is essential for skill improvement because it prevents error accumulation. To better evaluate skill generation, research efforts should assess skill document quality, adherence of execution to the prescribed steps, and task completion success.

### Skill Retrieval & Routing Benchmarks

Skill retrieval and routing benchmarks evaluate how effectively and accurately skills are selected and used. Effective retrieval methods must distinguish near-duplicate skills, while well-designed routing should coordinate the appropriate skills for each task. SkillRouter Zheng et al. ([2026](https://arxiv.org/html/2606.11435#bib.bib64)) contains roughly 80K skills and 75 expert-verified queries. It demonstrates that using only skill names and descriptions can reduce routing accuracy by 31–44% compared with using the full skill body. SRA-Bench Su et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib44)) decomposes skill augmentation into three separate stages: retrieval, incorporation, and application. It mixes 636 manually written gold skills into a 26,262-skill web-collected corpus and pairs them with 5,400 capability-intensive instances drawn from TheoremQA, LogicBench, ToolQA, MedCalc-Bench, CHAMP, and BigCodeBench. AgentSkillOS Benchmark Li et al. ([2026a](https://arxiv.org/html/2606.11435#bib.bib71)) shifts the focus of evaluation to orchestration, in which multiple skills work together on a task. It constructs 30 artifact-rich tasks across data computation, document creation, motion video, visual design, and web interaction. Evaluated at ecosystem scales from 200 to 200K skills, this orchestration substantially outperforms single-skill approaches when using the same skill set.
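When gold skills are planted in a large corpus, a natural retrieval-stage measure is recall at k: the share of gold skills that appear among the top k retrieved candidates. The sketch below shows that standard metric only. It is not claimed to be the metric reported by SRA-Bench or SkillRouter, and the function names and toy identifiers are hypothetical.

```python
from typing import List, Sequence, Tuple


def recall_at_k(ranked: Sequence[str], gold: Sequence[str], k: int) -> float:
    """Fraction of gold skills that appear in the top-k retrieved list."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(set(ranked[:k]) & gold_set) / len(gold_set)


def mean_recall_at_k(queries: List[Tuple[Sequence[str], Sequence[str]]], k: int) -> float:
    """Average recall@k over (ranked skill names, gold skill names) pairs."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in queries) / len(queries)


single = recall_at_k(["s3", "s1", "s9"], ["s1", "s2"], k=2)
averaged = mean_recall_at_k(
    [(["a", "b"], ["a"]), (["c", "d"], ["d", "e"])],
    k=1,
)
```

Scoring retrieval separately from task success keeps a retrieval miss distinguishable from a failure to apply a correctly retrieved skill, which is the motivation for stage-wise evaluation.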

### Skill Safety Auditing Benchmarks

Skill security threats become increasingly systemic as malicious skills can compromise user data, hijack execution flows, and silently degrade agent behavior\. SkillTester Wang et al\. \([2026c](https://arxiv.org/html/2606.11435#bib.bib63)\) jointly assesses utility and safety by executing candidate skills, reporting a composite security score alongside a three\-level status label \(e\.g\., safe, warning, or unsafe\) that allows users to weigh performance gains against known risks\. SkillGuardBench Lv et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib38)\) treats each skill as a multi\-file package labeled as benign, suspicious, or malicious\. The benchmark is built upon 581 packages across five evaluation views, with risk samples covering three recurring attack patterns: hidden override, disguised transfer, and remote bootstrap\. In addition to auditing skills, SKILL\-INJECT Schmotz et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib35)\) shifts the threat model towards runtime vulnerability, testing whether agents will execute malicious instructions embedded inside otherwise legitimate skills\. Although safety auditing benchmarks have advanced, they treat safety as a one\-time gate, leaving post\-installation skill behavior across evolving libraries largely unexamined\.

### Software Engineering Benchmarks

The growth of skills in software engineering demands more rigorous benchmarking\. Current efforts are strongly tied to public repositories, leaving real\-world engineering use unexplored\. SWE\-Skills\-Bench Han et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib48)\) takes a meaningful step toward closing this gap by pairing 49 publicly available software\-engineering skills with GitHub repositories and evaluating them across 565 automated task instances\. Its reliance on publicly curated skills and fixed\-commit repositories supports reproducibility; however, it inherently omits proprietary workflows, legacy codebases, and continuously evolving engineering practices\. Future benchmarks should address these omissions by incorporating industry\-partnered task suites, dynamic repository states, and evaluation metrics that capture maintenance overhead, error recovery rate, and token efficiency\. Meanwhile, stratifying skills by authorship expertise offers an opportunity to advance the field from implementation\-level evaluation toward architecture\-level understanding\.
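To make two of the proposed metrics concrete, the following sketch computes an error recovery rate and a token\-efficiency figure from per\-run records\. The `RunRecord` schema and both metric definitions are assumptions made for illustration; the survey does not prescribe formulas, and SWE\-Skills\-Bench may define its measurements differently\.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunRecord:
    """One benchmark run of a skill-augmented agent (hypothetical schema)."""
    succeeded: bool
    errors_hit: int          # errors encountered during the run
    errors_recovered: int    # errors the agent recovered from
    tokens_used: int


def error_recovery_rate(runs: List[RunRecord]) -> Optional[float]:
    """Recovered errors divided by encountered errors; None if no errors."""
    total = sum(r.errors_hit for r in runs)
    if total == 0:
        return None
    return sum(r.errors_recovered for r in runs) / total


def tokens_per_success(runs: List[RunRecord]) -> Optional[float]:
    """Total tokens spent divided by successful runs; None if no successes."""
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        return None
    return sum(r.tokens_used for r in runs) / successes


# Hypothetical toy records.
runs = [
    RunRecord(succeeded=True, errors_hit=2, errors_recovered=2, tokens_used=1000),
    RunRecord(succeeded=False, errors_hit=1, errors_recovered=0, tokens_used=3000),
    RunRecord(succeeded=True, errors_hit=0, errors_recovered=0, tokens_used=500),
]
```

Counting the tokens of failed runs in the numerator is a deliberate choice here, since failed attempts still consume budget\.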

### Real\-World Environment Benchmarks

Evaluating agent skills in real\-world environments is difficult to standardize, but it is essential because only dynamic, open\-ended settings can reveal the true deployment readiness of skills\. WildClawBench Ding et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib37)\) drops agents into a live personal\-assistant environment and runs 60 hand\-built original tasks across productivity flow, code intelligence, social interaction, and safety alignment\. Users can submit results from their own customized agents, turning the benchmark into a community\-driven testbed for skill ecosystems\. Similarly, SkillForge Liu et al\. \([2026b](https://arxiv.org/html/2606.11435#bib.bib73)\) introduces a benchmark of five real\-world cloud technical\-support scenarios spanning 1,883 tickets and 3,737 tasks\. While both benchmarks advance real\-world evaluation, their closed\-environment designs restrict external reproducibility and cross\-method comparison\. Moving forward, the field would benefit from open skill execution environments with standardized task interfaces, paired with the ability to audit skill library updates\.

## 5 Reflection and Future Directions

Evaluation and evolution are becoming cornerstones of trustworthy agent skills\. A central question is how to transform heterogeneous experiences, including human\-written instructions, execution traces, user feedback, tool calls, and multimodal observations, into reliable and verifiable knowledge\. Despite recent progress, open challenges persist in the handling of multimodal skills, effective use of trajectory data, and skill security, all of which demand systematic research efforts into robust evolution and evaluation frameworks\.

Major skill frameworks use text\-centered procedural packages that work for language, code, document, and API tasks, yet are ill\-suited for agents operating in multimodal environments\. Examples of such environments include desktop interfaces, web pages, embodied simulators, robotics, medical images, and visually grounded scientific workflows Zhou et al\. \([2025b](https://arxiv.org/html/2606.11435#bib.bib119)\)\. In these scenarios, the right agent action cannot be derived from a textual goal alone; it also depends heavily on the visual state, spatial layout, object configuration, interface affordances, and modality\-specific constraints Jiang et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib66)\)\. A multimodal skill should identify the target visual elements, interpret the current state, and map it back to the content described in the skill\. XSkill Jiang et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib66)\) offers insights into action\-level, context\-specific tool selection, and its skills capture task\-level procedural knowledge for planning\. Its skill retrieval is driven by visual observations, which are revealed in past trajectories\. In physical embodied settings, skill evolution could be driven by the execution feedback from environmental interactions, which reflects the agent’s actions, observations, and rewards Ju et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib67)\)\. Yet the design, evaluation, and sharing of cross\-modality skills that enable agents to act across diverse real\-world sensory inputs remain largely underexplored\.

Trajectory data records Zhou et al\. \([2026b](https://arxiv.org/html/2606.11435#bib.bib116)\); Ni et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib56)\) are crucial for broadening agent skill utility by revealing intermediate reasoning, tool choices, recovery attempts, and failure modes\. However, raw trajectories are often redundant and noisy\. A successful trace can contain irrelevant steps, while a failed one may still hold useful local decisions Ni et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib56)\)\. To build robust agent skills, two emerging designs could make knowledge distillation from trajectory data more effective\. First, distillation operates over batches of trajectories rather than single runs, since comparing many traces against each other is what isolates reusable patterns from task\-specific noise Ni et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib56)\)\. Second, distillation can be continuous rather than one\-shot, with skills serving as an evolving memory that is iteratively updated to reflect environmental changes Zhou et al\. \([2026b](https://arxiv.org/html/2606.11435#bib.bib116),[a](https://arxiv.org/html/2606.11435#bib.bib52)\)\. To unlock the full value of trajectory data, explicit curation along quality, diversity, and difficulty dimensions is essential to skill evolution\.
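The intuition behind batch\-level distillation can be sketched with a simple frequency filter: steps that recur across many trajectories are kept, while one\-off steps are treated as task\-specific noise\. This is a deliberately simplified stand\-in and not the method of Trace2Skill or any other cited system; the function name `recurring_steps`, the `min_support` parameter, and the step labels are hypothetical\.

```python
from collections import Counter
from typing import List, Sequence


def recurring_steps(
    trajectories: Sequence[Sequence[str]],
    min_support: float,
) -> List[str]:
    """Return steps that appear in at least `min_support` of the trajectories.

    Each step is counted once per trajectory, so a step repeated inside a
    single noisy trace does not inflate its support.
    """
    counts: Counter = Counter()
    for trajectory in trajectories:
        counts.update(set(trajectory))
    threshold = min_support * len(trajectories)
    return sorted(step for step, count in counts.items() if count >= threshold)


# Hypothetical batch of three traces for the same kind of task.
batch = [
    ["open_file", "parse_csv", "retry", "write_report"],
    ["open_file", "parse_csv", "write_report"],
    ["open_file", "list_dir", "parse_csv", "write_report"],
]

# Steps shared by every trace form the candidate reusable pattern.
core_pattern = recurring_steps(batch, min_support=1.0)
```

A single trace could not separate `retry` or `list_dir` from the shared steps; only the comparison across the batch does\.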

Agent skills introduce non\-trivial security risks that warrant systematic evaluation Wang et al\. \([2026c](https://arxiv.org/html/2606.11435#bib.bib63)\); Schmotz et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib35)\)\. Malicious skills could manipulate LLMs to leak sensitive data, execute unauthorized commands, or produce harmful decisions\. We identify three principal sources of skill poisoning that should be detected early and strictly avoided during skill evolution\. First, direct instruction poisoning embeds harmful instructions into the skill, causing unsafe behavior during execution\. Second, prompt injection occurs when a benign skill pulls content from untrusted external sources that carry malicious instructions\. Third, uncontrolled skill self\-evolution can silently strip existing safety constraints through unregulated updates\. Further, skills distilled from execution trajectories risk unintentional privacy leakage through the skill body or outputs\. Such risks are particularly acute in safety\-critical domains such as healthcare and finance, where human oversight should be mandatory before skill deployment\. To avoid such skill poisoning, we believe that a robust defense requires multi\-layered approaches: a\) establishing public reputation systems that track skill authorship, b\) enforcing fine\-grained permission boundaries on skill scripts, and c\) requiring explicit user confirmation before skills trigger sensitive actions\.
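Defense layers b\) and c\) above can be illustrated with a small authorization check\. This is a sketch under stated assumptions rather than a description of any cited system: the `SkillManifest` type, the permission names, the `SENSITIVE_ACTIONS` set, and the `confirm` callback are all hypothetical\.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

# Hypothetical set of actions that always require user confirmation.
SENSITIVE_ACTIONS: FrozenSet[str] = frozenset({"network", "file_write", "shell"})


@dataclass(frozen=True)
class SkillManifest:
    """Declares which actions a skill's scripts are permitted to perform."""
    name: str
    granted: FrozenSet[str]


def authorize(
    manifest: SkillManifest,
    action: str,
    confirm: Callable[[str], bool],
) -> bool:
    """Allow an action only if it is granted and, when sensitive, confirmed."""
    if action not in manifest.granted:
        # Outside the skill's permission boundary: deny without asking.
        return False
    if action in SENSITIVE_ACTIONS:
        # Inside the boundary but sensitive: defer to the user.
        return confirm(f"Skill '{manifest.name}' requests '{action}'. Allow?")
    return True


manifest = SkillManifest(
    name="report-writer",
    granted=frozenset({"file_read", "file_write"}),
)
```

Denying ungranted actions before consulting the user keeps a poisoned skill from using the confirmation prompt itself as a social\-engineering channel\.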

Current evolution and evaluation strategies are largely treated as sequential, decoupled stages: a skill is evaluated, judged, and then evolved, after which evaluation restarts from scratch\. This pipeline assumption is increasingly untenable at scale, since the cost of re\-evaluating an entire library after each evolution cycle becomes prohibitive\. Two emerging approaches begin to close this gap from a joint learning perspective\. SkillOS Ouyang et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib8)\) builds an experience\-driven RL recipe that pairs a frozen executor with a trainable skill curator using grouped task streams\. Earlier trajectories update the skill repository, while later related tasks immediately evaluate those updates, making evaluation a structural component of the training loop rather than an external judge\. Meanwhile, SkillsVote Liu et al\. \([2026a](https://arxiv.org/html/2606.11435#bib.bib9)\) approaches the same problem through lifecycle governance, profiling a million\-scale corpus for quality and verifiability\. It attributes post\-execution outcomes to skill use versus environment signals, and admits only successful discoveries through evidence\-gated updates, showing that governed skill libraries can improve frozen agents without any model updates\.
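One way to picture an evidence\-gated update is as an admission rule over outcomes observed on later, related tasks\. The sketch below is an assumption\-laden illustration and not the procedure used by SkillsVote or SkillOS: the function name `admit_update`, the `min_trials` and `min_gain` parameters, and the outcome lists are hypothetical\.

```python
from typing import Sequence


def admit_update(
    baseline_outcomes: Sequence[bool],
    candidate_outcomes: Sequence[bool],
    min_trials: int = 5,
    min_gain: float = 0.0,
) -> bool:
    """Admit a skill update only if enough evidence shows it helps.

    Outcomes are task successes observed with the current skill (baseline)
    and with the proposed update (candidate) on later, related tasks.
    """
    if not baseline_outcomes or len(candidate_outcomes) < min_trials:
        # Too little evidence: keep the existing skill.
        return False
    baseline_rate = sum(baseline_outcomes) / len(baseline_outcomes)
    candidate_rate = sum(candidate_outcomes) / len(candidate_outcomes)
    return candidate_rate - baseline_rate > min_gain


# Hypothetical outcomes on a stream of related tasks.
baseline = [True, False, False, True, False]
candidate = [True, True, False, True, True]
decision = admit_update(baseline, candidate)
```

Because the gate consumes outcomes that the task stream produces anyway, it avoids a separate full\-library re\-evaluation pass after each update\.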

Realizing a unified framework that operates over living skill libraries demands meaningful progress on three key fronts\. First, skill libraries must be engineered with explicit versioning and dependency graphs, such that a localized update to one skill can be automatically tested for downstream effects, as evidenced in SkillsVote Liu et al\. \([2026a](https://arxiv.org/html/2606.11435#bib.bib9)\) and XSkill Jiang et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib66)\)\. Second, evaluation signals should move beyond binary pass/fail toward composite rewards that capture latency, token cost, and generalizability\. SkillOS’s composite reward design and SkillReducer’s token\-efficiency setting begin to address this challenge Gao et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib91)\)\. Third, skill curators must generalize across executor backbones and task domains\. SkillOS Ouyang et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib8)\) shows this empirically across multi\-turn and single\-turn reasoning settings, yet no existing benchmark treats cross\-domain curator generalization as an evaluation target\. Together, these directions suggest redesigning skill libraries from static repositories into living, monitored infrastructure, where evolution and evaluation are two faces of a shared continuous learning process Gao et al\. \([2025b](https://arxiv.org/html/2606.11435#bib.bib7)\)\.
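The first front, testing a localized update for downstream effects, reduces to a reachability query on the dependency graph\. The following is a minimal sketch, assuming the library records for each versioned skill the skills it depends on; the function name `downstream_of`, the graph layout, and the skill identifiers are hypothetical\.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def downstream_of(depends_on: Dict[str, List[str]], updated: str) -> List[str]:
    """Return every skill that directly or transitively depends on `updated`."""
    # Invert the graph: map each skill to the skills that depend on it.
    dependents: Dict[str, Set[str]] = defaultdict(set)
    for skill, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents[dependency].add(skill)

    # Breadth-first search over dependents, starting from the updated skill.
    affected: Set[str] = set()
    queue = deque([updated])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent != updated and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return sorted(affected)


# Hypothetical versioned library: skill -> skills it depends on.
library = {
    "parse-csv@1.2": [],
    "clean-data@2.0": ["parse-csv@1.2"],
    "plot-report@1.0": ["clean-data@2.0"],
    "send-email@1.1": [],
}

to_retest = downstream_of(library, "parse-csv@1.2")
```

Only the returned skills need re\-evaluation after the update, which is what keeps the cost of an evolution cycle proportional to its blast radius rather than to the library size\.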

## 6 Conclusion

In the era of rapid skills growth, the ability to continuously evolve and rigorously evaluate skills becomes vital to trustworthy agent deployment\. We introduce a four\-class taxonomy of skill evolution strategies, revealing that each class operates on distinct signal sources with performance trade\-offs\. Meanwhile, we analyze six skill\-centric benchmark categories to thoroughly assess existing skills for public use\. We identify structural gaps in longitudinal evaluation, retrieval coverage, and metric richness that should guide the next generation of benchmarks\. Finally, future research should treat skill ecosystems as evolving infrastructure, where continual evaluation and evolution are central for reliable use, dependency control, and real\-world agent deployment\.

## Limitations

This survey provides an overview of agent skill evolution and evaluation with several limitations\. First, given the rapid development of agent skill research, some recent methods or benchmarks may not be fully covered\. Second, the reviewed systems are evaluated on different benchmarks and base models; we report findings as described rather than providing a unified empirical comparison, which is better addressed by a dedicated benchmarking study\. Finally, we draw primarily on published papers and public repositories, and may therefore understate industrial practices where implementation details and proprietary skills remain undisclosed\.

## Ethical Considerations

An AI assistant was used to refine the appendix table \(i\.e\., [Appendix 3](https://arxiv.org/html/2606.11435#A2.T3)\)\. All technical contents and final manuscript materials were reviewed and verified by the authors\.

## References

- Anthropic Skills Repository\.Note:[https://github\.com/anthropics/skills](https://github.com/anthropics/skills)GitHub repository, Accessed: 2026\-05\-25Cited by:[§2](https://arxiv.org/html/2606.11435#S2.p2.1)\.
- Art of Problem Solving \(n\.d\.\)AIME Problems and Solutions\.Note:[https://artofproblemsolving\.com/wiki/index\.php/AIME\_Problems\_and\_Solutions](https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions)Online resource, Accessed: 2026\-05\-25Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px3.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.2.2.2.1.1)\.
- J\. Austin, A\. Odena, M\. Nye, M\. Bosma, H\. Michalewski, D\. Dohan, E\. Jiang, C\. Cai, M\. Terry, Q\. Le,et al\.\(2021\)Program synthesis with large language models\.arXiv preprint arXiv:2108\.07732\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px2.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.16.1.1.1)\.
- V\. Barres, H\. Dong, S\. Ray, X\. Si, and K\. Narasimhan \(2025\)τ²\-Bench: evaluating conversational agents in a dual\-control environment\.arXiv preprint arXiv:2506\.07982\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px7.p1.2),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.3.3.1.1.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§1](https://arxiv.org/html/2606.11435#S1.p1.1)\.
- Y\. Cai, Y\. Hao, J\. Zhou, H\. Yan, Z\. Lei, R\. Zhen, Z\. Han, Y\. Yang, J\. Li, Q\. Pan,et al\.\(2025\)Building self\-evolving agents via experience\-driven lifelong learning: a framework and benchmark\.arXiv preprint arXiv:2508\.19005\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px6.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.40.1.1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. D\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman,et al\.\(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px2.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.15.1.1.1)\.
- S\. Chen, J\. Gai, R\. Zhou, J\. Zhang, T\. Zhu, J\. Li, K\. Wang, Z\. Wang, Z\. Chen, K\. Kaleb,et al\.\(2026\)Skillcraft: can llm agents learn to use tools skillfully?\.arXiv preprint arXiv:2603\.00718\.Cited by:[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px1.p1.2),[Table 2](https://arxiv.org/html/2606.11435#S4.T2.12.14.2.1.1.1)\.
- S\. Ding, X\. Dai, L\. Xing, S\. Ding, Z\. Liu, Y\. JingYi, P\. Yang, Z\. Zhang, X\. Wei, X\. Fang,et al\.\(2026\)WildClawBench: a benchmark for real\-world, long\-horizon agent evaluation\.arXiv preprint arXiv:2605\.10912\.Cited by:[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px6.p1.1),[Table 2](https://arxiv.org/html/2606.11435#S4.T2.12.22.10.1.1.1)\.
- L\. Fan, G\. Wang, Y\. Jiang, A\. Mandlekar, Y\. Yang, H\. Zhu, A\. Tang, D\. Huang, Y\. Zhu, and A\. Anandkumar \(2022\)Minedojo: building open\-ended embodied agents with internet\-scale knowledge\.Advances in Neural Information Processing Systems35,pp\. 18343–18362\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px8.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.50.1.1.1)\.
- B\. Gao, F\. Song, Z\. Yang, Z\. Cai, Y\. Miao, Q\. Dong, L\. Li, C\. Ma, L\. Chen, Z\. Tang,et al\.\(2025a\)Omni\-math: a universal olympiad level mathematic benchmark for large language models\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 100540–100569\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px3.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.19.1.1.1)\.
- H\. Gao, J\. Geng, W\. Hua, M\. Hu, X\. Juan, H\. Liu, S\. Liu, J\. Qiu, X\. Qi, Y\. Wu,et al\.\(2025b\)A survey of self\-evolving agents: what, when, how, and where to evolve on the path to artificial super intelligence\.arXiv preprint arXiv:2507\.21046\.Cited by:[§5](https://arxiv.org/html/2606.11435#S5.p6.1)\.
- Y\. Gao, Z\. Li, Z\. Ji, P\. Ma, S\. Wang,et al\.\(2026\)Skillreducer: optimizing llm agent skills for token efficiency\.arXiv preprint arXiv:2603\.29919\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.18.1.1.1),[§3](https://arxiv.org/html/2606.11435#S3.p1.1),[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px1.p1.2),[§5](https://arxiv.org/html/2606.11435#S5.p6.1)\.
- X\. Guo, U\. Tyagi, A\. Gosai, P\. Vergara, J\. Park, E\. G\. H\. Montoya, C\. B\. C\. Zhang, B\. Hu, Y\. He, B\. Liu,et al\.\(2025\)Beyond seeing: evaluating multimodal llms on tool\-enabled image perception, transformation, and reasoning\.arXiv preprint arXiv:2510\.12712\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px7.p1.2),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.42.1.1.1)\.
- T\. Han, Y\. Zhang, W\. Song, C\. Fang, Z\. Chen, Y\. Sun, and L\. Hu \(2026\)SWE\-skills\-bench: do agent skills actually help in real\-world software engineering?\.arXiv preprint arXiv:2603\.15401\.Cited by:[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px5.p1.1),[Table 2](https://arxiv.org/html/2606.11435#S4.T2.11.11.2.1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2020\)Measuring massive multitask language understanding\.arXiv preprint arXiv:2009\.03300\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px5.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.33.1.1.1)\.
- X\. Ho, A\. D\. Nguyen, S\. Sugawara, and A\. Aizawa \(2020\)Constructing a multi\-hop qa dataset for comprehensive evaluation of reasoning steps\.InProceedings of the 28th International Conference on Computational Linguistics,pp\. 6609–6625\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.26.1.1.1)\.
- G\. Jiang, Z\. Su, X\. Qu, and Y\. R\. Fung \(2026\)Xskill: continual learning from experience and skills in multimodal agents\.arXiv preprint arXiv:2603\.12056\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px7.p1.2),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.14.1.1.1),[§5](https://arxiv.org/html/2606.11435#S5.p2.1),[§5](https://arxiv.org/html/2606.11435#S5.p6.1)\.
- C\. Jin, H\. Peng, Q\. Zhang, Y\. Tang, D\. N\. Metaxas, and T\. Che \(2025a\)Two heads are better than one: test\-time scaling of multi\-agent collaborative reasoning\.arXiv preprint arXiv:2504\.09772\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p1.1)\.
- C\. Jin, H\. Peng, S\. Zhao, Z\. Wang, W\. Xu, L\. Han, J\. Zhao, K\. Zhong, S\. Rajasekaran, and D\. N\. Metaxas \(2025b\)Apeer: automatic prompt engineering enhances large language model reranking\.InCompanion Proceedings of the ACM on Web Conference 2025,pp\. 2494–2502\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px4.p1.1)\.
- M\. Joshi, E\. Choi, D\. S\. Weld, and L\. Zettlemoyer \(2017\)Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension\.InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 1601–1611\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.23.1.1.1)\.
- R\. Ju, X\. Wang, X\. Ding, Y\. Yang, H\. Wu, S\. Jiang, Q\. Zhang, H\. Wen, X\. Li, W\. Wang,et al\.\(2026\)EmbodiSkill: skill\-aware reflection for self\-evolving embodied agents\.arXiv preprint arXiv:2605\.10332\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p2.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.9.1.1.1),[§5](https://arxiv.org/html/2606.11435#S5.p2.1)\.
- T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee,et al\.\(2019\)Natural questions: a benchmark for question answering research\.Transactions of the Association for Computational Linguistics7,pp\. 453–466\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.22.1.1.1)\.
- F\. Li, P\. Tagkopoulos, and I\. Tagkopoulos \(2025a\)SkillFlow: scalable and efficient agent skill retrieval system\.arXiv e\-prints,pp\. arXiv–2504\.Cited by:[Appendix A](https://arxiv.org/html/2606.11435#A1.p1.1),[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px1.p1.2)\.
- H\. Li, C\. Mu, J\. Chen, S\. Ren, Z\. Cui, Y\. Zhang, L\. Bai, and S\. Hu \(2026a\)Organizing, orchestrating, and benchmarking agent skills at ecosystem scale\.arXiv preprint arXiv:2603\.02176\.Cited by:[Appendix A](https://arxiv.org/html/2606.11435#A1.p1.1),[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2606.11435#S4.T2.8.8.2.1.1)\.
- M\. Li, J\. Zhong, S\. Zhao, H\. Zhang, S\. Lin, Y\. Lai, C\. Wei, K\. Psounis, and K\. Zhang \(2025b\)TIR\-bench: a comprehensive benchmark for agentic thinking\-with\-images reasoning\.arXiv preprint arXiv:2511\.01833\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px7.p1.2),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.43.1.1.1)\.
- S\. Li, X\. Bu, W\. Wang, J\. Liu, J\. Dong, H\. He, H\. Lu, H\. Zhang, C\. Jing, Z\. Li,et al\.\(2025c\)Mm\-browsecomp: a comprehensive benchmark for multimodal browsing agents\.arXiv preprint arXiv:2508\.13186\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px7.p1.2),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.45.1.1.1)\.
- X\. Li, W\. Chen, Y\. Liu, S\. Zheng, X\. Chen, Y\. He, Y\. Li, B\. You, H\. Shen, J\. Sun,et al\.\(2026b\)SkillsBench: benchmarking how well agent skills work across diverse tasks\.arXiv preprint arXiv:2602\.12670\.Cited by:[§1](https://arxiv.org/html/2606.11435#S1.p2.1),[§2](https://arxiv.org/html/2606.11435#S2.p1.7),[§2](https://arxiv.org/html/2606.11435#S2.p2.1),[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px1.p1.2),[Table 2](https://arxiv.org/html/2606.11435#S4.T2.4.4.2.1.1)\.
- X\. Li, T\. Zhang, Y\. Dubois, R\. Taori, I\. Gulrajani, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto \(2023\)Alpacaeval: an automatic evaluator of instruction\-following models\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px5.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.34.1.1.1)\.
- Y\. Li, R\. Miao, Z\. Qi, and T\. Lan \(2026c\)Arise: agent reasoning with intrinsic skill evolution in hierarchical reinforcement learning\.arXiv preprint arXiv:2603\.16060\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.11435#S2.p2.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px4.p1.1)\.
- Q\. Liang, H\. Wang, Z\. Liang, and Y\. Liu \(2026a\)From skill text to skill structure: the scheduling\-structural\-logical representation for agent skills\.arXiv preprint arXiv:2604\.24026\.Cited by:[Appendix A](https://arxiv.org/html/2606.11435#A1.p1.1)\.
- Y\. Liang, R\. Zhong, H\. Xu, C\. Jiang, Y\. Zhong, R\. Fang, J\. Gu, S\. Deng, Y\. Yao, M\. Wang,et al\.\(2026b\)Skillnet: create, evaluate, and connect ai skills\.arXiv preprint arXiv:2603\.04448\.Cited by:[§1](https://arxiv.org/html/2606.11435#S1.p2.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.16.1.1.1)\.
- B\. Y\. Lin, Y\. Deng, K\. Chandu, A\. Ravichander, V\. Pyatkin, N\. Dziri, R\. Le Bras, and Y\. Choi \(2025\)Wildbench: benchmarking llms with challenging tasks from real users in the wild\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 47852–47870\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px5.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.36.1.1.1)\.
- H\. Liu, H\. Yang, T\. Jiang, B\. Tang, F\. Xiong, and Z\. Li \(2026a\)SkillsVote: lifecycle governance of agent skills from collection, recommendation to evolution\.arXiv preprint arXiv:2605\.18401\.Cited by:[§5](https://arxiv.org/html/2606.11435#S5.p5.1),[§5](https://arxiv.org/html/2606.11435#S5.p6.1)\.
- X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang,et al\.\(2024\)Agentbench: evaluating llms as agents\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 52989–53046\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px1.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.11.1.1.1)\.
- X\. Liu, X\. Luo, L\. Li, G\. Huang, J\. Liu, and H\. Qiao \(2026b\)Skillforge: forging domain\-specific, self\-evolving agent skills in cloud technical support\.arXiv preprint arXiv:2604\.08618\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p2.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p4.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.3.1.1.1),[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px6.p1.1),[Table 2](https://arxiv.org/html/2606.11435#S4.T2.12.12.2.1.1)\.
- M\. Luo, S\. Tan, J\. Wong, X\. Shi, W\. Y\. Tang, M\. Roongta, C\. Cai, J\. Luo, T\. Zhang, L\. E\. Li,et al\.\(2025\)Deepscaler: surpassing o1\-preview with a 1\.5 b model by scaling rl\.Notion Blog3\(5\)\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px3.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.20.1.1.1)\.
- L\. Lv, X\. Tang, J\. Wen, J\. Han, and S\. Hu \(2026\)Structured security auditing and robustness enhancement for untrusted agent skills\.arXiv preprint arXiv:2604\.25109\.Cited by:[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px4.p1.1),[Table 2](https://arxiv.org/html/2606.11435#S4.T2.9.9.2.1.1)\.
- Z\. Ma, B\. Zhang, J\. Zhang, J\. Yu, X\. Zhang, X\. Zhang, S\. Luo, X\. Wang, and J\. Tang \(2024\)Spreadsheetbench: towards challenging real world spreadsheet manipulation\.Advances in Neural Information Processing Systems37,pp\. 94871–94908\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px7.p1.2),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.47.1.1.1)\.
- Z\. Ma, S\. Yang, Y\. Ji, X\. Wang, Y\. Wang, Y\. Hu, T\. Huang, and X\. Chu \(2026\)Skillclaw: let skills evolve collectively with agentic evolver\.arXiv preprint arXiv:2604\.08377\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p3.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p4.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.8.1.1.1)\.
- A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\)Evaluating very long\-term conversational memory of llm agents\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 13851–13870\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px6.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.38.1.1.1)\.
- A\. Mallen, A\. Asai, V\. Zhong, R\. Das, D\. Khashabi, and H\. Hajishirzi \(2023\)When not to trust language models: investigating effectiveness of parametric and non\-parametric memories\.InProceedings of the 61st annual meeting of the association for computational linguistics \(volume 1: Long papers\),pp\. 9802–9822\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.24.1.1.1)\.
- M\. A\. Merrill, A\. G\. Shaw, N\. Carlini, B\. Li, H\. Raj, I\. Bercovich, L\. Shi, J\. Y\. Shin, T\. Walshe, E\. K\. Buchanan,et al\.\(2026\)Terminal\-bench: benchmarking agents on hard, realistic tasks in command line interfaces\.arXiv preprint arXiv:2601\.11868\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px2.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.14.1.1.1)\.
- G\. Mialon, C\. Fourrier, T\. Wolf, Y\. LeCun, and T\. Scialom \(2024\)Gaia: a benchmark for general ai assistants\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 9025–9049\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.30.1.1.1)\.
- J\. Ni, Y\. Liu, X\. Liu, Y\. Sun, M\. Zhou, P\. Cheng, D\. Wang, E\. Zhao, X\. Jiang, and G\. Jiang \(2026\)Trace2skill: distill trajectory\-local lessons into transferable agent skills\.arXiv preprint arXiv:2603\.25158\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px3.p1.1),[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px7.p1.2),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.12.1.1.1),[§3](https://arxiv.org/html/2606.11435#S3.p1.1),[§5](https://arxiv.org/html/2606.11435#S5.p3.1)\.
- S\. Ouyang, J\. Yan, Y\. Chen, R\. Han, Z\. Wang, B\. D\. Mishra, R\. Meng, C\. Li, Y\. Jiao, K\. Zha,et al\.\(2026\)SkillOS: learning skill curation for self\-evolving agents\.arXiv preprint arXiv:2605\.06614\.Cited by:[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.23.1.1.1),[§5](https://arxiv.org/html/2606.11435#S5.p5.1),[§5](https://arxiv.org/html/2606.11435#S5.p6.1)\.
- P\. Pasupat and P\. Liang \(2015\)Compositional semantic parsing on semi\-structured tables\.InProceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),pp\. 1470–1480\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.29.1.1.1)\.
- S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez \(2024\)Gorilla: large language model connected with massive apis\.Advances in Neural Information Processing Systems37,pp\. 126544–126565\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px7.p1.2),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.48.1.1.1)\.
- L\. Phan, A\. Gatti, Z\. Han, N\. Li, J\. Hu, H\. Zhang, C\. B\. C\. Zhang, M\. Shaaban, J\. Ling, S\. Shi,et al\.\(2025\)Humanity’s last exam\.arXiv preprint arXiv:2501\.14249\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.31.1.1.1)\.
- O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis \(2023\)Measuring and narrowing the compositionality gap in language models\.InFindings of the Association for Computational Linguistics: EMNLP 2023,pp\. 5687–5711\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.28.1.1.1)\.
- Y\. Qing, B\. Zhu, M\. Du, Z\. Guo, T\. Y\. Zhuo, Q\. Zhang, J\. Zhang, H\. Cui, S\. M\. Yiu, D\. Huang,et al\.\(2026\)Effibench\-x: a multi\-language benchmark for measuring efficiency of llm\-generated code\.Advances in Neural Information Processing Systems38\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px2.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.17.1.1.1)\.
- D\. Schmotz, L\. Beurer\-Kellner, S\. Abdelnabi, and M\. Andriushchenko \(2026\)Skill\-inject: measuring agent vulnerability to skill file attacks\.arXiv preprint arXiv:2602\.20156\.Cited by:[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px4.p1.1),[Table 2](https://arxiv.org/html/2606.11435#S4.T2.10.10.2.1.1),[§5](https://arxiv.org/html/2606.11435#S5.p4.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§2](https://arxiv.org/html/2606.11435#S2.p2.1)\.
- S\. Shen, W\. Cheng, M\. Ma, A\. Turcan, M\. J\. Zhang, and J\. Ma \(2026\)SKILLFOUNDRY: building self\-evolving agent skill libraries from heterogeneous scientific resources\.arXiv preprint arXiv:2604\.03964\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.19.1.1.1)\.
- Y\. Shi, Y\. Chen, Z\. Lu, Y\. Miao, S\. Liu, Q\. Gu, X\. Cai, X\. Wang, and A\. Zhang \(2026\)Skill1: unified evolution of skill\-augmented agents via reinforcement learning\.arXiv preprint arXiv:2605\.06130\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px4.p1.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.24.1.1.1)\.
- M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht \(2020\)Alfworld: aligning text and embodied environments for interactive learning\.arXiv preprint arXiv:2010\.03768\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px1.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.7.1.1.1)\.
- A\. Shypula, A\. Madaan, Y\. Zeng, U\. Alon, J\. Gardner, M\. Hashemi, G\. Neubig, P\. Ranganathan, O\. Bastani, and A\. Yazdanbakhsh \(2023\)Learning performance\-improving code edits\.arXiv preprint arXiv:2302\.07867\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px2.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.1.1.2.1.1)\.
- S\. Si, H\. Zhao, Y\. Lei, Q\. Wang, D\. Chen, Z\. Wang, Z\. Wang, K\. Luo, Z\. Wang, G\. Chen,et al\.\(2026\)From context to skills: can language models learn from context skillfully?\.arXiv preprint arXiv:2604\.27660\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p2.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.6.1.1.1)\.
- W\. Su, J\. Long, Q\. Ai, Y\. Tang, C\. Wang, Y\. Tu, and Y\. Liu \(2026a\)Skill retrieval augmentation for agentic ai\.arXiv preprint arXiv:2604\.24594\.Cited by:[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2606.11435#S4.T2.6.6.2.1.1)\.
- Z\. Su, J\. Gao, H\. Guo, Z\. Liu, L\. Zhang, X\. Geng, S\. Huang, P\. Xia, G\. Jiang, C\. Wang,et al\.\(2026b\)Agentvista: evaluating multimodal agents in ultra\-challenging realistic visual scenarios\.arXiv preprint arXiv:2602\.23166\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px7.p1.2),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.46.1.1.1)\.
- X\. Tao, Y\. Teng, X\. Su, X\. Fu, J\. Wu, C\. Tao, Z\. Liu, H\. Bai, R\. Liu, and L\. Kong \(2025\)Mmsearch\-plus: benchmarking provenance\-aware search for multimodal browsing agents\.arXiv preprint arXiv:2508\.21475\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px7.p1.2),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.44.1.1.1)\.
- Y\. Tian, J\. Chen, L\. Zheng, M\. Tao, X\. Zeng, Z\. Yin, H\. Su, and X\. Sun \(2026\)Skills\-coach: a self\-evolving skill optimizer via training\-free grpo\.arXiv preprint arXiv:2604\.27488\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p2.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.5.1.1.1)\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\)MuSiQue: multihop questions via single\-hop question composition\.Transactions of the Association for Computational Linguistics10,pp\. 539–554\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.27.1.1.1)\.
- H\. Trivedi, T\. Khot, M\. Hartmann, R\. Manku, V\. Dong, E\. Li, S\. Gupta, A\. Sabharwal, and N\. Balasubramanian \(2024\)Appworld: a controllable world of apps and people for benchmarking interactive coding agents\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 16022–16076\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px1.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.12.1.1.1)\.
- S\. Tu, C\. Xu, Q\. Zhang, Y\. Zhang, X\. Lan, L\. Li, and D\. Zhao \(2026\)Dynamic dual\-granularity skill bank for agentic rl\.arXiv preprint arXiv:2603\.28716\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px4.p1.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.21.1.1.1)\.
- C\. Wang, Z\. Yu, X\. Xie, W\. Yao, R\. Fang, S\. Qiao, K\. Cao, G\. Zheng, X\. Qi, P\. Zhang,et al\.\(2026a\)SkillX: automatically constructing skill knowledge bases for agents\.arXiv preprint arXiv:2604\.04804\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.17.1.1.1),[§3](https://arxiv.org/html/2606.11435#S3.p1.1)\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2023\)Voyager: an open\-ended embodied agent with large language models\.arXiv preprint arXiv:2305\.16291\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px8.p1.1),[§2](https://arxiv.org/html/2606.11435#S2.p2.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px4.p1.1)\.
- J\. Wang, Y\. Ming, Z\. Ke, S\. Joty, A\. Albarghouthi, and F\. Sala \(2026b\)Skillorchestra: learning to route agents via skill transfer\.arXiv preprint arXiv:2602\.19672\.Cited by:[Appendix A](https://arxiv.org/html/2606.11435#A1.p1.1),[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px5.p1.1)\.
- J\. Wang, Q\. Yan, Y\. Wang, Y\. Tian, S\. S\. Mishra, Z\. Xu, M\. Gandhi, P\. Xu, and L\. L\. Cheong \(2025\)Reinforcement learning for self\-improving agent with skill library\.arXiv preprint arXiv:2512\.17102\.Cited by:[§2](https://arxiv.org/html/2606.11435#S2.p2.1)\.
- L\. Wang, Z\. Wang, and A\. Xu \(2026c\)SkillTester: benchmarking utility and security of agent skills\.arXiv preprint arXiv:2603\.28815\.Cited by:[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px1.p1.2),[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px4.p1.1),[Table 2](https://arxiv.org/html/2606.11435#S4.T2.12.19.7.1.1.1),[§5](https://arxiv.org/html/2606.11435#S5.p4.1)\.
- R\. Wang, P\. Jansen, M\. Côté, and P\. Ammanabrolu \(2022\)Scienceworld: is your agent smarter than a 5th grader?\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,pp\. 11279–11298\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px1.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.9.1.1.1)\.
- Z\. Wang, Y\. Shi, M\. Li, Z\. Liu, J\. M\. Zhang, C\. Wan, and X\. Gu \(2026d\)Effiskill: agent skill based automated code efficiency optimization\.arXiv preprint arXiv:2603\.27850\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px2.p1.1)\.
- Z\. Z\. Wang, J\. Mao, D\. Fried, and G\. Neubig \(2024\)Agent workflow memory\.arXiv preprint arXiv:2409\.07429\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px4.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§1](https://arxiv.org/html/2606.11435#S1.p1.1)\.
- D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2024\)Longmemeval: benchmarking chat assistants on long\-term interactive memory\.arXiv preprint arXiv:2410\.10813\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px6.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.39.1.1.1)\.
- P\. Xia, J\. Chen, H\. Wang, J\. Liu, K\. Zeng, Y\. Wang, S\. Han, Y\. Zhou, X\. Zhao, H\. Chen,et al\.\(2026\)Skillrl: evolving agents via recursive skill\-augmented reinforcement learning\.arXiv preprint arXiv:2602\.08234\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px4.p1.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.22.1.1.1)\.
- Y\. Yang, J\. Li, Q\. Pan, B\. Zhan, Y\. Cai, L\. Du, J\. Zhou, K\. Chen, Q\. Chen, X\. Li,et al\.\(2026\)Autoskill: experience\-driven lifelong learning via skill self\-evolution\.arXiv preprint arXiv:2603\.01145\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px6.p1.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p3.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p4.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.7.1.1.1)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\)HotpotQA: a dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 2369–2380\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px6.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.25.1.1.1)\.
- S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2022\)Webshop: towards scalable real\-world web interaction with grounded language agents\.Advances in Neural Information Processing Systems35,pp\. 20744–20757\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px1.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.8.1.1.1)\.
- B\. Zhang, K\. Lazuka, and M\. Murag \(2025\)Equipping agents for the real world with agent skills\.Anthropic Engineering Blog\.Cited by:[§1](https://arxiv.org/html/2606.11435#S1.p1.1)\.
- H\. Zhang, S\. Fan, H\. P\. Zou, Y\. Chen, Z\. Wang, J\. Zhou, C\. Li, W\. Huang, Y\. Yao, K\. Zheng,et al\.\(2026a\)Coevoskills: self\-evolving agent skills via co\-evolutionary verification\.arXiv preprint arXiv:2604\.01687\.Cited by:[§1](https://arxiv.org/html/2606.11435#S1.p2.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p2.1),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px1.p4.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.4.1.1.1),[§3](https://arxiv.org/html/2606.11435#S3.p1.1),[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px1.p1.2)\.
- H\. Zhang, Q\. Long, J\. Bao, T\. Feng, W\. Zhang, H\. Yue, and W\. Wang \(2026b\)MemSkill: learning and evolving memory skills for self\-evolving agents\.arXiv preprint arXiv:2602\.02474\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px6.p1.1),[§3](https://arxiv.org/html/2606.11435#S3.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems36,pp\. 46595–46623\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px5.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.35.1.1.1)\.
- Y\. Zheng, Z\. Zhang, C\. Ma, Y\. Yu, J\. Zhu, Y\. Wu, T\. Xu, B\. Dong, H\. Zhu, R\. Huang,et al\.\(2026\)Skillrouter: skill routing for llm agents at scale\.arXiv preprint arXiv:2603\.22455\.Cited by:[Appendix A](https://arxiv.org/html/2606.11435#A1.p1.1),[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px1.p1.2),[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2606.11435#S4.T2.7.7.2.1.1)\.
- S\. Zhong, Y\. Lu, J\. Ning, Y\. Wan, L\. Feng, Y\. Ao, L\. F\. Ribeiro, M\. Dreyer, S\. Ammirati, and C\. Xiong \(2026\)SkillLearnBench: benchmarking continual learning methods for agent skill generation on real\-world tasks\.arXiv preprint arXiv:2604\.20087\.Cited by:[§4](https://arxiv.org/html/2606.11435#S4.SS0.SSS0.Px2.p1.1),[Table 2](https://arxiv.org/html/2606.11435#S4.T2.5.5.2.1.1)\.
- H\. Zhou, S\. Guo, A\. Liu, Z\. Yu, Z\. Gong, B\. Zhao, Z\. Chen, M\. Zhang, Y\. Chen, J\. Li,et al\.\(2026a\)Memento\-skills: let agents design agents\.arXiv preprint arXiv:2603\.18743\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px4.p2.3),[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.13.1.1.1),[§5](https://arxiv.org/html/2606.11435#S5.p3.1)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried,et al\.\(2024\)Webarena: a realistic web environment for building autonomous agents\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 15585–15606\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px1.p1.1),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.10.1.1.1)\.
- Y\. Zhou, Z\. Dong, Z\. Wang, C\. Jin, S\. Zhao, B\. Guo, D\. Gu, L\. Zhang, M\. Zhou, and D\. N\. Metaxas \(2026b\)Evidence over plans: online trajectory verification for skill distillation\.arXiv preprint arXiv:2605\.09192\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.11435#S3.T1.1.11.1.1.1),[§5](https://arxiv.org/html/2606.11435#S5.p3.1)\.
- Y\. Zhou, C\. Jin, Z\. Dong, Z\. Wang, Y\. Yang, S\. Zhao, L\. Li, R\. Bao, Y\. Xie, and D\. N\. Metaxas \(2026c\)DARE: difficulty\-adaptive reinforcement learning with co\-evolved difficulty estimation\.arXiv preprint arXiv:2605\.09188\.Cited by:[§3](https://arxiv.org/html/2606.11435#S3.SS0.SSS0.Px4.p1.1)\.
- Y\. Zhou, M\. Zhao, Z\. Wang, D\. Gu, B\. Guo, R\. Ye, L\. Han, C\. Jin, and D\. N\. Metaxas \(2025a\)M3\-bench: multi\-modal, multi\-hop, multi\-threaded tool\-using mllm agent benchmark\.arXiv preprint arXiv:2511\.17729\.Cited by:[Appendix B](https://arxiv.org/html/2606.11435#A2.SS0.SSS0.Px7.p1.2),[Table 3](https://arxiv.org/html/2606.11435#A2.T3.4.4.1.1.1)\.
- Y\. Zhou, S\. Zhao, Y\. Chen, Z\. Wang, C\. Jin, and D\. N\. Metaxas \(2025b\)Led: llm enhanced open\-vocabulary object detection without human curated data generation\.arXiv preprint arXiv:2503\.13794\.Cited by:[§5](https://arxiv.org/html/2606.11435#S5.p2.1)\.

## Appendix A Skill Usage

Skills are indexed by name and description for rapid retrieval, and the full content is loaded only upon selection\. However, SkillRouter Zheng et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib64)\) finds that skill names and descriptions alone are insufficient for accurate skill selection at scale\. Instead, SkillRouter uses a retriever and a reranker that select candidate skills based on the full skill content\. To reduce the cost of skill retrieval, SkillFlow Li et al\. \([2025a](https://arxiv.org/html/2606.11435#bib.bib62)\) avoids repeated retrieval by first identifying the missing skill required to solve the task, then querying an external agent for a successfully executed skill, and saving it locally for future use\.

Unlike retrieval, effective routing enables the LLM agent to assign the appropriate skill to a specific task\. SkillOrchestra Wang et al\. \([2026b](https://arxiv.org/html/2606.11435#bib.bib50)\) compares success and failure trajectories to detect missing capabilities, which are summarized as new skills in a skill handbook that can be consulted to identify required skills and route the task to the appropriate agent\. In contrast to implicit RL\-trained routing policies, this interpretable handbook avoids routing collapse, transfers across orchestrator backbones without retraining, and achieves up to 22\.5% accuracy gains at 700× lower learning cost than RL methods\.

Skill management organizes and updates a collection of skills: removing redundant skills, pruning low\-quality ones, refining skills to keep them up to date, and controlling the size of the skill library\. For example, AgentSkillOS Li et al\. \([2026a](https://arxiv.org/html/2606.11435#bib.bib71)\) organizes skills into a capability tree whose nodes are determined by skill categories and store the skill content\. To keep the tree manageable, only the top\-ranking skills are retained\. AgentSkillOS traverses the tree to retrieve skills and caches successful orchestration plans for reuse\. Similarly, SSL \(Scheduling\-Structural\-Logical\) Liang et al\. \([2026a](https://arxiv.org/html/2606.11435#bib.bib43)\) converts the original skill text into a graph to better organize its content, including skill interface signals, operational stages, and individual actions\.
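
The retrieve\-then\-rerank selection over full skill content, as attributed to SkillRouter in this appendix, can be illustrated with a minimal sketch\. Everything here is an assumption made for illustration: the `Skill` record, the token\-overlap retriever, and the term\-frequency cosine reranker are toy stand\-ins, not the models or interfaces used by SkillRouter\.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Skill:
    # Hypothetical skill record: short metadata plus the full body.
    name: str
    description: str
    content: str


def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def full_text(skill: Skill) -> str:
    # Score on the full content, not only on name and description.
    return f"{skill.name} {skill.description} {skill.content}"


def retrieve(query: str, skills: list, k: int) -> list:
    """Stage 1: cheap recall by counting distinct tokens shared with the query."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(full_text(s)))), s) for s in skills]
    scored = [(score, s) for score, s in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rerank(query: str, candidates: list) -> list:
    """Stage 2: a costlier score, applied only to the short candidate list."""
    q = Counter(tokenize(query))
    return sorted(candidates, key=lambda s: cosine(q, Counter(tokenize(full_text(s)))), reverse=True)


def select_skill(query: str, skills: list, k: int = 2) -> Optional[Skill]:
    ranked = rerank(query, retrieve(query, skills, k))
    return ranked[0] if ranked else None


# Made-up library in which names and descriptions are too vague to decide.
library = [
    Skill("doc-tools", "Utilities for documents", "merge pdf files and split pdf pages"),
    Skill("data-tools", "Utilities for data", "load csv files and compute column statistics"),
    Skill("web-tools", "Utilities for the web", "fetch a url and parse html"),
]
print(select_skill("compute statistics for a csv column", library).name)
```

In this toy library the names and descriptions barely differ, so the match is decided by the skill bodies\. In a real system both stages would typically be learned models rather than lexical scores\.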

## Appendix B General\-domain Benchmarks

Although general\-domain benchmarks were not originally designed for skill evaluation, they can be readily adapted to assess agents that learn and apply skills, as described below\.

### Interactive Agent Environments\.

Interactive agent environments evaluate an agent’s ability to perceive the environment, plan multi\-step actions, and execute long\-horizon tasks under current observation states\. They can be adapted to skill evaluation by running an agent with skill guidance and using its task performance as an indicator of skill quality\. ALFWorld Shridhar et al\. \([2020](https://arxiv.org/html/2606.11435#bib.bib75)\) aligns text\-based interactive household task completion with embodied ALFRED goals and TextWorld games, requiring agents to navigate rooms, manipulate objects, and follow natural\-language instructions\. WebShop Yao et al\. \([2022](https://arxiv.org/html/2606.11435#bib.bib76)\) simulates online shopping over 1\.18M real product listings with 12,087 crowd\-sourced instructions, evaluating product search, attribute comparison, and goal\-directed purchasing\. ScienceWorld Wang et al\. \([2022](https://arxiv.org/html/2606.11435#bib.bib77)\) provides an interactive text environment at the level of an elementary\-school science curriculum, with 30 benchmark tasks \(and 7,200 parametric variations\) spanning thermodynamics, electrical circuits, chemistry, and biological processes\. WebArena Zhou et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib78)\) offers 812 realistic long\-horizon web\-based tasks requiring multi\-step browser interaction across four real\-world web applications \(e\-commerce, social forums, collaborative development, content management\)\. AgentBench Liu et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib23)\) consolidates eight distinct interactive environments \(e\.g\., operating system, database, knowledge graph\) into a unified evaluation framework for assessing LLM\-as\-Agent reasoning and decision\-making\. AppWorld Trivedi et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib79)\) provides a controllable world of 9 day\-to\-day apps operable via 457 APIs, with 750 autonomous agent tasks for benchmarking interactive coding agents over stateful application use\.

### Code Generation and Software Engineering\.

Code\-generation and software\-engineering benchmarks evaluate functional correctness, command\-line proficiency, and the efficiency of synthesized programs\. They are relevant to skill evaluation because reusable engineering knowledge, including algorithmic recipes, debugging routines, and build\-system patterns, is naturally captured in skill packages, so performance on these benchmarks can indicate whether existing skills help with code generation or software engineering tasks\. Terminal\-Bench Merrill et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib90)\) provides 89 hard, realistic command\-line tasks for evaluating raw agent harness capability through shell\-based interactions and is used by SkillFlow Li et al\. \([2025a](https://arxiv.org/html/2606.11435#bib.bib62)\)\. HumanEval Chen et al\. \([2021](https://arxiv.org/html/2606.11435#bib.bib12)\) releases 164 hand\-written Python programming problems with unit tests to measure functional correctness of code synthesized from docstrings\. MBPP Austin et al\. \([2021](https://arxiv.org/html/2606.11435#bib.bib10)\) \(Mostly Basic Programming Problems\) contains 974 entry\-level Python tasks \(374 train / 90 val / 500 test, with the remaining 10 reserved for few\-shot prompting\) crowd\-sourced to cover programming fundamentals and standard\-library usage\. For code efficiency, EffiBench\-X Qing et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib29)\) is the first large\-scale multi\-language code efficiency benchmark covering Python, C\+\+, Java, JavaScript, Ruby, and Go, and serves as the primary evaluation for EffiSkill Wang et al\. \([2026d](https://arxiv.org/html/2606.11435#bib.bib68)\)\. The PIE dataset Shypula et al\. \([2023](https://arxiv.org/html/2606.11435#bib.bib30)\) contains over 77K paired slow/fast competitive C\+\+ programming submissions across 1,474 problems and is used by EffiSkill for offline mining of recurring slow\-to\-fast transformations\.
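
Functional correctness in the HumanEval style, where generated code is run against unit tests, can be sketched as follows\. This is a minimal illustration under stated assumptions: the problem records and the `passes_tests` and `fraction_solved` helpers are hypothetical, and the sketch uses plain `exec` with no sandbox or timeout, which a real harness would need because generated code is untrusted\.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and every test assertion holds."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the generated function(s)
        exec(test_src, namespace)       # run the unit-test assertions against them
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False
    return True


def fraction_solved(problems: list) -> float:
    """Share of problems whose candidate passes all of its tests."""
    if not problems:
        raise ValueError("need at least one problem")
    solved = sum(passes_tests(p["candidate"], p["tests"]) for p in problems)
    return solved / len(problems)


# Two made-up problems: the first candidate is correct, the second has a bug.
problems = [
    {
        "candidate": "def add(a, b):\n    return a + b\n",
        "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    },
    {
        "candidate": "def is_even(n):\n    return n % 2 == 1\n",
        "tests": "assert is_even(4)\n",
    },
]
print(fraction_solved(problems))  # 0.5
```

Comparing this fraction with and without a skill loaded into the agent is one simple way such a benchmark could be repurposed for skill evaluation\.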

### Mathematical Reasoning\.

Mathematical reasoning benchmarks evaluate multi\-step symbolic and quantitative reasoning with verifiable answers\. They are well\-suited to skill evaluation because mathematical solution strategies are reusable across problems\. AMC/AIME Art of Problem Solving \([n\.d\.](https://arxiv.org/html/2606.11435#bib.bib6)\) are annual competition\-level problem sets \(≈30 problems per year\) from the American Mathematics Competitions and the American Invitational Mathematics Examination, used as in\-distribution and out\-of\-distribution mathematical reasoning evaluations\. Omni\-MATH Gao et al\. \([2025a](https://arxiv.org/html/2606.11435#bib.bib84)\) provides 4,428 Olympiad\-level problems spanning 33 sub\-domains and, together with AMC/AIME, forms the out\-of\-distribution suite for ARISE Li et al\. \([2026c](https://arxiv.org/html/2606.11435#bib.bib94)\), which trains on the DeepScaleR Luo et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib24)\) dataset of approximately 40K math problem\-answer pairs compiled from AIME, AMC, Omni\-MATH, and Still\. AIME is also used by Trace2Skill Ni et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib56)\) as one of its math reasoning evaluation domains\.

| Benchmark | Scale \(train / val / test or total\) | Category |
| --- | --- | --- |
| **Interactive Agent Environments** | | |
| ALFWorld Shridhar et al\. \([2020](https://arxiv.org/html/2606.11435#bib.bib75)\) | 3,827 games across 6 task types \(pick & place, examine in light, clean/heat/cool & place, pick two & place\) | Text / embodied |
| WebShop Yao et al\. \([2022](https://arxiv.org/html/2606.11435#bib.bib76)\) | 1\.18M product listings; 12,087 instructions | Web / shopping |
| ScienceWorld Wang et al\. \([2022](https://arxiv.org/html/2606.11435#bib.bib77)\) | 30 task types; 7,200 parametric variations | Interactive science |
| WebArena Zhou et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib78)\) | 812 long\-horizon web tasks across 4 web applications | Realistic web GUI |
| AgentBench Liu et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib23)\) | 8 distinct environment types | Multi\-env agent suite |
| AppWorld Trivedi et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib79)\) | 9 apps; 457 APIs; 750 autonomous agent tasks | App / coding control |
| **Code Generation and Software Engineering Benchmarks** | | |
| Terminal\-Bench Merrill et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib90)\) | 89 terminal tasks | CLI agent |
| HumanEval Chen et al\. \([2021](https://arxiv.org/html/2606.11435#bib.bib12)\) | 164 hand\-written Python programming problems | Code correctness |
| MBPP Austin et al\. \([2021](https://arxiv.org/html/2606.11435#bib.bib10)\) | 974 programming tasks | Code correctness |
| EffiBench\-X Qing et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib29)\) | 623 problems across 6 programming languages | Code efficiency |
| PIE dataset Shypula et al\. \([2023](https://arxiv.org/html/2606.11435#bib.bib30)\) | ∼77K slow/fast C\+\+ pairs over 1,474 problems | Code efficiency |
| **Mathematical Reasoning Benchmarks** | | |
| AMC / AIME Art of Problem Solving \([n\.d\.](https://arxiv.org/html/2606.11435#bib.bib6)\) | Annual competition\-level problem sets \(≈30 / year\) | Math competition |
| Omni\-MATH Gao et al\. \([2025a](https://arxiv.org/html/2606.11435#bib.bib84)\) | 4,428 Olympiad\-level problems across 33 sub\-domains | Math olympiad |
| DeepScaleR Luo et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib24)\) | 40K problem\-answer pairs | Math RL training |
| **Question Answering and Knowledge\-Intensive Benchmarks** | | |
| Natural Questions \(NQ\) Kwiatkowski et al\. \([2019](https://arxiv.org/html/2606.11435#bib.bib18)\) | 307,373 train / 7,830 dev / 7,842 test queries | Single\-hop QA |
| TriviaQA Joshi et al\. \([2017](https://arxiv.org/html/2606.11435#bib.bib17)\) | 650K question\-answer\-evidence triples | Single\-hop QA |
| PopQA Mallen et al\. \([2023](https://arxiv.org/html/2606.11435#bib.bib26)\) | 14K QA pairs over long\-tail Wikipedia entities | Single\-hop QA |
| HotpotQA Yang et al\. \([2018](https://arxiv.org/html/2606.11435#bib.bib87)\) | 113K Wikipedia QA pairs | Multi\-hop QA |
| 2WikiMultiHopQA Ho et al\. \([2020](https://arxiv.org/html/2606.11435#bib.bib16)\) | 192,606 multi\-hop QA pairs over Wikipedia | Multi\-hop QA |
| MuSiQue Trivedi et al\. \([2022](https://arxiv.org/html/2606.11435#bib.bib33)\) | 25K 2\-4\-hop QA pairs | Multi\-hop QA |
| Bamboogle Press et al\. \([2023](https://arxiv.org/html/2606.11435#bib.bib28)\) | 125 compositional 2\-hop questions | Multi\-hop QA |
| WikiTableQuestions Pasupat and Liang \([2015](https://arxiv.org/html/2606.11435#bib.bib27)\) | 22,033 questions over 2,108 Wikipedia tables | Table QA |
| GAIA Mialon et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib82)\) | 466 questions | General assistant QA |
| HLE Phan et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib83)\) | 2,500 expert\-validated questions | Expert exam |
| **Knowledge, Language, and Instruction\-Following Benchmarks** | | |
| MMLU Hendrycks et al\. \([2020](https://arxiv.org/html/2606.11435#bib.bib15)\) | 15,908 multi\-choice questions across 57 subjects | Knowledge, multi\-task |
| AlpacaEval Li et al\. \([2023](https://arxiv.org/html/2606.11435#bib.bib19)\) | 805 evaluation prompts | Instruction following |
| MT\-Bench Zheng et al\. \([2023](https://arxiv.org/html/2606.11435#bib.bib34)\) | 80 multi\-turn questions across 8 categories | Multi\-turn dialog |
| WildBench Lin et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib22)\) | 1,024 challenging real\-user tasks from WildChat logs | Real\-user instruction |
| **Memory and Conversational Benchmarks** | | |
| LoCoMo Maharana et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib85)\) | 1,986 questions | Long\-horizon memory |
| LongMemEval Wu et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib86)\) | 500 QA items over long chat histories | Long\-horizon memory |
| StuLife Cai et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib11)\) | 1,284 interdependent tasks across 3 phases / 10 sub\-scenarios | Lifelong learning |
| **Multimodal and Tool\-Use Benchmarks** | | |
| VisualToolBench Guo et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib14)\) | 1,204 open\-ended vision tasks \(603 single\-turn / 601 multi\-turn\) across 5 domains | Multimodal tool use |
| TIR\-Bench Li et al\. \([2025b](https://arxiv.org/html/2606.11435#bib.bib21)\) | 13 thinking\-with\-images tool\-use tasks | Multimodal tool use |
| MMSearch\-Plus Tao et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib32)\) | 311 provenance\-aware multimodal search tasks | Multimodal search |
| MMBrowseComp Li et al\. \([2025c](https://arxiv.org/html/2606.11435#bib.bib20)\) | 224 hand\-crafted multimodal browsing questions | Multimodal web browsing |
| AgentVista Su et al\. \([2026b](https://arxiv.org/html/2606.11435#bib.bib31)\) | 209 tasks across 25 sub\-domains in 7 categories | Multimodal agent suite |
| SpreadsheetBench Ma et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib25)\) | 912 questions; 2,729 test cases \(avg\. 3 per instruction\) | Spreadsheet manipulation |
| BFCL\-v3 Patil et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib88)\) | 1,000 multi\-turn function calling data | Tool calling |
| $\tau^{2}$\-Bench Barres et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib89)\) | 2,285 tasks | Conversational agent |
| M3\-Bench Zhou et al\. \([2025a](https://arxiv.org/html/2606.11435#bib.bib117)\) | 28 servers with 231 tools | Multimodal tool use |
| **Embodied / Open\-Ended Environments** | | |
| MineDojo / Minecraft Fan et al\. \([2022](https://arxiv.org/html/2606.11435#bib.bib13)\) | 730K\+ YouTube videos with time\-aligned transcripts, 6K\+ free\-form Wiki pages, and 340K\+ Reddit posts with multimedia contents | Embodied / open\-ended |

Table 3: General\-domain evaluation benchmarks\. Scale reports the dataset size and the standard splits from the primary reference\. Category corresponds to the type of tasks\.

### Question Answering and Knowledge\-Intensive Tasks\.

Search\-augmented QA benchmarks evaluate retrieval, multi\-hop reasoning, and tool\-augmented information seeking with verifiable answers\. They are central to skill evaluation because skills that encapsulate query decomposition, evidence selection, and cross\-document synthesis can be measured by improvements on questions whose answers cannot be retrieved in a single hop\.

Single\-hop benchmarks include Natural Questions \(NQ\) Kwiatkowski et al\. \([2019](https://arxiv.org/html/2606.11435#bib.bib18)\), derived from real anonymized Google search queries \(307,373 train / 7,830 dev / 7,842 test\) with Wikipedia answer annotations; TriviaQA Joshi et al\. \([2017](https://arxiv.org/html/2606.11435#bib.bib17)\), consisting of over 650K question\-answer\-evidence triples \(95K author\-written QA pairs\); and PopQA Mallen et al\. \([2023](https://arxiv.org/html/2606.11435#bib.bib26)\), 14K QA pairs converted from Wikidata triples to probe long\-tail entity knowledge\. Multi\-hop benchmarks include HotpotQA Yang et al\. \([2018](https://arxiv.org/html/2606.11435#bib.bib87)\) with 113K Wikipedia\-based questions and supporting facts; 2WikiMultiHopQA Ho et al\. \([2020](https://arxiv.org/html/2606.11435#bib.bib16)\), 192,606 multi\-hop questions combining structured and unstructured Wikipedia/Wikidata evidence; MuSiQue Trivedi et al\. \([2022](https://arxiv.org/html/2606.11435#bib.bib33)\), 25K 2\-4\-hop questions systematically composed from connected single\-hop pairs to enforce genuine multi\-hop reasoning; and Bamboogle Press et al\. \([2023](https://arxiv.org/html/2606.11435#bib.bib28)\), 125 manually constructed 2\-hop questions designed to expose the compositionality gap\. Together these seven form the search\-augmented suite used by SkillRL Xia et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib113)\)\. WikiTableQuestions Pasupat and Liang \([2015](https://arxiv.org/html/2606.11435#bib.bib27)\) contains 22,033 complex questions over 2,108 semi\-structured Wikipedia tables requiring compositional semantic parsing, and is used by Trace2Skill Ni et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib56)\) as an out\-of\-distribution evaluation that demonstrates cross\-model skill transfer \(up to \+57\.65% absolute gain\)\.

For agent\-style general\-knowledge tasks, GAIA \(General AI Assistants\) Mialon et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib82)\) provides 466 questions \(166 validation / 300 sequestered test\) requiring web search, tools, and multi\-step reasoning, and Humanity’s Last Exam \(HLE\) Phan et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib83)\) contributes 2,500 expert\-validated questions spanning mathematics, sciences, and humanities; both are used by Memento\-Skills Zhou et al\. \([2026a](https://arxiv.org/html/2606.11435#bib.bib52)\) to evaluate continual skill\-based agent improvement, achieving 26\.2% and 116\.2% relative gains, respectively\.
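
The results quoted in this subsection mix absolute gains \(differences in percentage points\) with relative gains \(differences as a fraction of the baseline\), and the two are not directly comparable\. The sketch below shows the arithmetic\. The scores in it are made\-up illustrative values, not numbers from any of the cited papers, and the function names are hypothetical\.

```python
def absolute_gain(baseline: float, improved: float) -> float:
    """Gain in percentage points: the plain difference between two scores."""
    return improved - baseline


def relative_gain(baseline: float, improved: float) -> float:
    """Gain as a percentage of the baseline score."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (improved - baseline) / baseline * 100.0


# Illustrative scores only (in percent); not taken from the cited papers.
baseline, with_skills = 20.0, 25.0
print(absolute_gain(baseline, with_skills))  # 5.0 percentage points
print(relative_gain(baseline, with_skills))  # 25.0 percent of the baseline
```

A relative gain above 100% therefore means the score more than doubled, which can coincide with a small absolute gain when the baseline is low\.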

### Knowledge, Language, and Instruction\-Following\.

Knowledge and instruction\-following benchmarks evaluate the breadth of factual knowledge, open\-ended dialogue, and adherence to user instructions\. They could be extended to skill evaluation to assess how well skills improve task analysis, response structuring, and tone/format adaptation\. MMLU Hendrycks et al\. \([2020](https://arxiv.org/html/2606.11435#bib.bib15)\) provides 15,908 multiple\-choice questions across 57 subjects spanning STEM, humanities, social sciences, and professional domains\. AlpacaEval Li et al\. \([2023](https://arxiv.org/html/2606.11435#bib.bib19)\) comprises 805 instruction prompts with GPT\-4\-based pairwise win\-rate annotation against a reference model; MT\-Bench Zheng et al\. \([2023](https://arxiv.org/html/2606.11435#bib.bib34)\) contains 80 multi\-turn questions across 8 categories evaluated via LLM\-as\-judge; and WildBench Lin et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib22)\) consists of 1,024 challenging tasks curated from over one million real WildChat user\-chatbot conversation logs\. Collectively these form the instruction\-following suite used by SkillOrchestra Wang et al\. \([2026b](https://arxiv.org/html/2606.11435#bib.bib50)\)\.

### Memory and Conversational Benchmarks\.

Memory\-centric benchmarks evaluate whether agents can extract, consolidate, and recall information across long interaction histories or multi\-turn dialogues\. They are adapted to validate how well skills support summarization, indexing, and retrieval over past experience, which enables agents to operate beyond a single context window\. LoCoMo Maharana et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib85)\) provides very long dialogues with ∼300 turns spanning up to 35 sessions, accompanied by question\-answering, summarization, and multimodal probes; LongMemEval Wu et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib86)\) contributes 500 QA items over long chat histories under both synthetic and realistic settings\. Together with HotpotQA Yang et al\. \([2018](https://arxiv.org/html/2606.11435#bib.bib87)\) \(multi\-hop QA\) and ALFWorld, these are used by MemSkill Zhang et al\. \([2026b](https://arxiv.org/html/2606.11435#bib.bib58)\) to evaluate learnable memory skills for extracting and consolidating information across long interaction histories\. StuLife Cai et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib11)\) simulates a student’s holistic college journey across three core phases and ten sub\-scenarios in a persistent, stateful campus environment \(1,284 interdependent tasks spanning a full academic year\), and is referenced by AutoSkill Yang et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib54)\) as a representative experience\-driven lifelong learning benchmark for evaluating self\-evolving agents\.

| Evolution Paradigm | Primary Signal Source | Strengths | Critical Trade\-offs | Benchmark Coverage Gap |
| --- | --- | --- | --- | --- |
| Execution Feedback | Runtime errors, verifier signals | High fidelity to real failures; easily auditable | Reactive; struggles with sparse or ambiguous signals | Lack of longitudinal tracking across feedback rounds |
| Trajectory Distillation | Multi\-run success/failure traces | Captures reusable reasoning patterns & recovery paths | Noise accumulation; trajectory bloat inflates context windows | Few benchmarks measure distillation efficiency vs\. token cost |
| Compression & Augmentation | Inter\-skill similarity, knowledge graphs | Reduces redundancy; improves routing & generalization | Risk of stripping safety constraints or domain nuance | Limited evaluation of post\-composition fidelity & conflict resolution |
| Reinforcement Learning | Multi\-task reward gaps, rollout comparisons | Optimizes reusability & long\-horizon orchestration | Reward hacking; high compute; unstable without curated baselines | Binary pass/fail metrics ignore composite utility/safety trade\-offs |

Table 4: Cross\-paradigm trade\-offs and benchmark alignment for skill evolution strategies\.

### Multimodal and Tool\-Use Benchmarks\.

Multimodal and tool\-use benchmarks evaluate visual grounding, tool selection, and orchestration of external resources alongside language reasoning\. They are promising for skill evaluation because many real\-world skills are inherently multimodal or tool\-mediated \(e\.g\., reading a chart and querying an API\), and their value cannot be captured by purely textual benchmarks\. For multimodal continual learning, XSkill Jiang et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib66)\) evaluates on five benchmarks spanning three domains\. Visual agentic tool use is covered by VisualToolBench Guo et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib14)\), comprising 1,204 open\-ended vision tasks \(603 single\-turn and 601 multi\-turn\) across five domains paired with detailed rubrics, and by TIR\-Bench Li et al\. \([2025b](https://arxiv.org/html/2606.11435#bib.bib21)\), which evaluates agentic thinking\-with\-images reasoning across 13 diverse tasks requiring novel tool use for image processing and manipulation in chain\-of\-thought\. Multimodal search and web browsing are covered by MMSearch\-Plus Tao et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib32)\), a 311\-task provenance\-aware benchmark that requires extracting weak, localized visual cues and propagating them through iterative image\-text retrieval, and MMBrowseComp Li et al\. \([2025c](https://arxiv.org/html/2606.11435#bib.bib20)\), a hand\-crafted set of 224 questions specifically designed to assess multimodal retrieval and reasoning over image\- and video\-rich web content\.
A comprehensive multimodal\-agent setting is provided by AgentVista Su et al\. \([2026b](https://arxiv.org/html/2606.11435#bib.bib31)\), which contains 209 tasks across 25 sub\-domains in 7 categories requiring long\-horizon hybrid tool use \(web search, image search, page navigation, and code\-based image processing\)\. SpreadsheetBench Ma et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib25)\) provides 912 real\-world spreadsheet manipulation tasks from online Excel forums with 2,729 test cases, used by Trace2Skill Ni et al\. \([2026](https://arxiv.org/html/2606.11435#bib.bib56)\) for skill\-deepening evaluation\. For tool\-calling and conversational control, BFCL\-v3 Patil et al\. \([2024](https://arxiv.org/html/2606.11435#bib.bib88)\) extends the Berkeley Function Calling Leaderboard with around 1,000 multi\-turn function\-calling examples, and τ²\-Bench Barres et al\. \([2025](https://arxiv.org/html/2606.11435#bib.bib89)\) studies dual\-control telecom\-style dialogues with compositional simulated users and verifiable outcomes\. M3\-Bench Zhou et al\. \([2025a](https://arxiv.org/html/2606.11435#bib.bib117)\) further covers multi\-modal, multi\-hop, and multi\-threaded tool\-use agents over 28 MCP servers exposing 231 tools\.

### Embodied / Open\-Ended Environments\.

Embodied and open\-ended environments evaluate exploration, lifelong learning, and the construction of compositional skill libraries in worlds without a fixed task distribution\. Using such benchmarks for skill evaluation could assess whether skills support cumulative, transferable competence acquired through interaction with the environment\. MineDojo Fan et al\. \([2022](https://arxiv.org/html/2606.11435#bib.bib13)\) is a Minecraft\-based framework for open\-ended embodied lifelong learning, providing a simulation suite with thousands of programmatic tasks and an internet\-scale knowledge base of 730K\+ YouTube videos with time\-aligned transcripts, 6K\+ free\-form Wiki pages, and 340K\+ Reddit posts with multimedia content\. It is the evaluation environment for Voyager Wang et al\. \([2023](https://arxiv.org/html/2606.11435#bib.bib110)\), which measures unique items obtained, distance traveled, and tech\-tree milestone progression to evaluate compositional skill libraries built through automatic curricula\.

## Appendix C Practical Guidelines for Skill Evolution System Design

[Table 4](https://arxiv.org/html/2606.11435#A2.T4) maps each evolution paradigm to its primary signal source, empirical strengths, critical trade\-offs, and benchmark coverage gaps\. Because the paradigms draw on distinct signal sources, their performance varies; we therefore offer practical guidelines for each paradigm to advance this research frontier\.

Execution feedback: The feedback loop must more clearly separate identifying failures from generating the rewrite\. Execution\-feedback methods often excel at high\-fidelity failure correction, but their signals are sparse when execution environments are narrow or deterministic, and existing evaluations mostly measure single\-round skill quality rather than tracking improvement across repeated feedback cycles\.
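
The two points in this guideline can be illustrated with a minimal sketch, assuming a skill is a plain string and that evaluation, diagnosis, and rewriting are supplied as separate callables. All names here (`run_feedback_rounds`, `round_deltas`, and the toy stand-ins) are hypothetical and are not drawn from any surveyed system; the sketch only shows a loop that keeps diagnosis apart from rewriting and logs the pass rate of every round so that improvement can be tracked across cycles.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces, assumed for illustration only.
Evaluate = Callable[[str], Tuple[float, List[str]]]  # skill -> (pass rate, failures)
Diagnose = Callable[[List[str]], str]                # failures -> diagnosis
Rewrite = Callable[[str, str], str]                  # (skill, diagnosis) -> new skill


def run_feedback_rounds(skill: str, evaluate: Evaluate, diagnose: Diagnose,
                        rewrite: Rewrite, max_rounds: int) -> List[Dict]:
    """Run up to max_rounds feedback cycles and log every round."""
    history: List[Dict] = []
    for round_index in range(max_rounds):
        pass_rate, failures = evaluate(skill)
        # Step 1: identify the failure, independently of any rewrite.
        diagnosis = diagnose(failures) if failures else ""
        history.append({"round": round_index,
                        "pass_rate": pass_rate,
                        "diagnosis": diagnosis})
        if not failures:
            break
        # Step 2: generate the rewrite from the diagnosis.
        skill = rewrite(skill, diagnosis)
    return history


def round_deltas(history: List[Dict]) -> List[float]:
    """Change in pass rate between consecutive feedback rounds."""
    rates = [entry["pass_rate"] for entry in history]
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


# Toy stand-ins: a skill "passes" a step if the step name appears in it.
REQUIRED_STEPS = ("open", "parse", "save")


def toy_evaluate(skill: str) -> Tuple[float, List[str]]:
    present = skill.split()
    missing = [step for step in REQUIRED_STEPS if step not in present]
    return (len(REQUIRED_STEPS) - len(missing)) / len(REQUIRED_STEPS), missing


def toy_diagnose(failures: List[str]) -> str:
    return failures[0]  # report the first missing step


def toy_rewrite(skill: str, diagnosis: str) -> str:
    return f"{skill} {diagnosis}"  # append the missing step


history = run_feedback_rounds("open", toy_evaluate, toy_diagnose,
                              toy_rewrite, max_rounds=5)
deltas = round_deltas(history)
```

In a longitudinal evaluation, `deltas` would be reported alongside the final pass rate, so that a method that gains quickly and then stalls can be distinguished from one that keeps improving.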

Trajectory distillation: Distillation should compare patterns across multiple runs to discover reusable knowledge\. It presupposes high\-quality trajectories, which should be explicitly curated for quality and diversity before distillation rather than drawn indiscriminately from all available traces\.
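
One possible way to curate along both axes is sketched below, under the assumptions that each trajectory already carries a scalar quality score and that diversity can be approximated by Jaccard distance over action sets. The `Trajectory` fields, the thresholds, and the greedy selection rule are illustrative choices, not a method prescribed by the surveyed work.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    actions: Tuple[str, ...]
    quality: float  # hypothetical score in [0, 1], e.g. from a verifier


def jaccard_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the sets of actions used."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def curate(trajectories: Sequence[Trajectory],
           min_quality: float = 0.7,
           min_distance: float = 0.5,
           max_keep: int = 10) -> List[Trajectory]:
    """Filter by quality, then greedily keep mutually diverse traces."""
    candidates = sorted(
        (t for t in trajectories if t.quality >= min_quality),
        key=lambda t: t.quality,
        reverse=True,
    )
    kept: List[Trajectory] = []
    for candidate in candidates:
        if len(kept) >= max_keep:
            break
        # Keep a trace only if it differs enough from everything kept so far.
        if all(jaccard_distance(candidate.actions, k.actions) >= min_distance
               for k in kept):
            kept.append(candidate)
    return kept


traces = [
    Trajectory("a", ("search", "click", "read"), 0.90),
    Trajectory("b", ("search", "click", "read"), 0.80),   # near-duplicate of "a"
    Trajectory("c", ("open_sheet", "write_cell"), 0.75),  # different pattern
    Trajectory("d", ("search", "type"), 0.40),            # below quality bar
]
selected = curate(traces)
```

Here the near-duplicate and the low-quality trace are dropped before distillation; a real pipeline would likely replace the set-based distance with one that respects action order or semantics.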

Compression and augmentation: Compression operations that target token efficiency can degrade skill utility by removing task\-critical procedural knowledge\. Before compression, core executable steps should therefore be annotated as a protected reference\. After compression, the evolved skill should be executed against a held\-out task set to confirm that performance is preserved\.
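
A minimal sketch of this two\-stage check follows, assuming skills are plain text, protected steps are annotated as literal phrases, and a held\-out task can be run through a caller\-supplied function. The function names, the substring\-based matching, and the `max_drop` tolerance are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def missing_protected_steps(compressed_skill: str,
                            protected_steps: Sequence[str]) -> List[str]:
    """Return protected steps that no longer appear in the compressed skill."""
    text = compressed_skill.casefold()
    return [step for step in protected_steps if step.casefold() not in text]


def held_out_pass_rate(skill: str, tasks: Sequence[str],
                       run_task: Callable[[str, str], bool]) -> float:
    """Fraction of held-out tasks that pass when executed with the skill."""
    if not tasks:
        raise ValueError("held-out task set must not be empty")
    return sum(run_task(skill, task) for task in tasks) / len(tasks)


def accept_compression(original_rate: float, compressed_rate: float,
                       missing: Sequence[str], max_drop: float = 0.02) -> bool:
    """Accept only if no protected step is lost and performance is preserved."""
    return not missing and (original_rate - compressed_rate) <= max_drop


# Toy example: a task "passes" if its keyword still appears in the skill.
PROTECTED = ["validate the header row", "write results to a new sheet"]
ORIGINAL = ("Open the workbook. Validate the header row. "
            "Skip blank lines. Write results to a new sheet.")
COMPRESSED = "Open workbook. Write results to a new sheet."
HELD_OUT_TASKS = ["header", "new sheet"]


def toy_run_task(skill: str, task: str) -> bool:
    return task in skill.casefold()


missing = missing_protected_steps(COMPRESSED, PROTECTED)
original_rate = held_out_pass_rate(ORIGINAL, HELD_OUT_TASKS, toy_run_task)
compressed_rate = held_out_pass_rate(COMPRESSED, HELD_OUT_TASKS, toy_run_task)
accepted = accept_compression(original_rate, compressed_rate, missing)
```

In this toy case the compressed skill is rejected twice over: a protected step is missing, and the held\-out pass rate drops. Literal phrase matching is deliberately strict and would flag paraphrased steps as missing, so a practical check might compare step semantics instead.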

Reinforcement learning: To verify that an RL approach genuinely improves skill quality rather than training the agent to bypass skills, we recommend a dual\-rollout evaluation protocol: at regular training intervals, evaluate task performance both with and without skills, and treat the performance gap as the skill contribution signal\. A gap that shrinks over training is an early sign that the agent is learning to solve tasks while bypassing the skill library, and it should trigger a review of the reward design\.
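
The dual\-rollout protocol can be sketched as follows, assuming each checkpoint yields two success rates measured on the same task set, one with the skill library enabled and one without. The window size, the `min_drop` threshold, and the monotone\-decline rule used to define "shrinking" are illustrative assumptions, as are all function names.

```python
from typing import List, Sequence, Tuple

# (training step, success rate with skills, success rate without skills)
Checkpoint = Tuple[int, float, float]


def skill_contribution(with_skills: float, without_skills: float) -> float:
    """Performance gap attributed to the skill library at one checkpoint."""
    return with_skills - without_skills


def gap_history(checkpoints: Sequence[Checkpoint]) -> List[Tuple[int, float]]:
    """Skill contribution signal at each evaluation interval, ordered by step."""
    return [(step, skill_contribution(w, wo))
            for step, w, wo in sorted(checkpoints)]


def gap_is_shrinking(gaps: Sequence[float], window: int = 3,
                     min_drop: float = 0.05) -> bool:
    """Flag a steady decline of the gap over the last `window` checkpoints."""
    if len(gaps) < window:
        return False
    recent = list(gaps[-window:])
    monotone = all(later < earlier
                   for earlier, later in zip(recent, recent[1:]))
    return monotone and (recent[0] - recent[-1]) >= min_drop


# Toy logs: in the first run the no-skill rollout catches up with the
# skill-augmented one; in the second the skill contribution keeps growing.
bypassing_run = [(0, 0.50, 0.30), (100, 0.60, 0.45), (200, 0.65, 0.58)]
healthy_run = [(0, 0.50, 0.30), (100, 0.60, 0.35), (200, 0.70, 0.40)]

bypass_flag = gap_is_shrinking([g for _, g in gap_history(bypassing_run)])
healthy_flag = gap_is_shrinking([g for _, g in gap_history(healthy_run)])
```

In the first toy run overall performance still rises, which is why tracking only the with\-skill success rate would miss the problem; the flag fires because the gap narrows, which would prompt a look at the reward design.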