SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
Summary
This paper introduces SkillRet, a large-scale benchmark for evaluating skill retrieval in LLM agents, addressing the challenge of selecting relevant skills from large libraries. It provides a dataset of over 17,000 skills and demonstrates that task-specific fine-tuning significantly improves retrieval performance.
# SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
Source: [https://arxiv.org/html/2605.05726](https://arxiv.org/html/2605.05726)
Hongcheol Cho\*, Ryangkyung Kang\*, Youngeun Kim†
ThakiCloud
\*Equal contribution\. Equal contributors are listed in alphabetical order\. †Corresponding author: youngeun\.kim@thakicloud\.com
###### Abstract
As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge\. In small libraries, users may invoke skills explicitly by name, but this assumption breaks down as skill ecosystems grow under tight context and latency budgets\. Despite its practical importance, skill retrieval remains underexplored, with limited benchmarks and little understanding of retrieval behavior on realistic skill libraries\. To address this gap, we introduce SkillRet, a large\-scale benchmark for skill retrieval in LLM agents\. SkillRet contains 17,810 public agent skills, organized with structured semantic tags and a two\-level taxonomy spanning 6 major categories and 18 sub\-categories\. It provides 63,259 training samples and 4,997 evaluation queries with disjoint skill pools, enabling both benchmarking and retrieval\-oriented training\. Across a diverse set of retrievers, we find that skill retrieval remains far from solved: off\-the\-shelf models struggle on realistic large\-scale skill libraries, and prior skill\-retrieval models still leave substantial headroom\. Task\-specific fine\-tuning on SkillRet substantially improves performance, raising NDCG@10 by \+13\.1 points over the strongest prior retriever and by \+16\.9 points over the strongest off\-the\-shelf retriever\. Our analysis further suggests that these gains arise because fine\-tuned models better focus on the small skill\-relevant signals within long and noisy queries\. These results establish SkillRet as a strong benchmark and foundation for future research on retrieval in large\-scale agent systems\. We publicly release the [benchmark](https://huggingface.co/datasets/ThakiCloud/SKILLRET), [code](https://github.com/ThakiCloud/SKILLRET), and model checkpoints \([0\.6B](https://huggingface.co/ThakiCloud/SKILLRET-Embedding-0.6B), [8B](https://huggingface.co/ThakiCloud/SKILLRET-Embedding-8B)\)\.
## 1 Introduction
As LLM agents become more capable, they increasingly rely on reusable skills \(i\.e\., long\-form procedural modules such as prompts, scripts, workflows, and execution policies\) to solve complex tasks Xu and Yan \([2026](https://arxiv.org/html/2605.05726#bib.bib30)\); Jiang et al\. \([2026b](https://arxiv.org/html/2605.05726#bib.bib31)\); Zhou et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib32)\); Wang et al\. \([2023a](https://arxiv.org/html/2605.05726#bib.bib34)\)\. In small\-scale settings, users can often invoke such skills explicitly by name\. However, this assumption becomes brittle as agent ecosystems grow\. When a system maintains a large default pool of reusable skills, it is no longer practical to expose the entire library in context or expect users to know which skill should be activated for a given request\. Instead, future agent systems will increasingly require an explicit retrieval layer that selects a small, relevant subset of skills for the current task, both to reduce context cost and to enable robust automated skill use at scale Li et al\. \([2025](https://arxiv.org/html/2605.05726#bib.bib33)\)\. This shift is already visible in recent agent systems such as MetaClaw Xia et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib1)\), XSkill Jiang et al\. \([2026a](https://arxiv.org/html/2605.05726#bib.bib6)\), and WebXSkill Wang et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib7)\), which rely on inference\-time retrieval of task\-relevant skills or knowledge to guide downstream execution\.
This trend makes skill retrieval and selection a central systems problem\. The key challenge is whether agents can identify the right skills from a large library under realistic inference constraints\. However, despite the growing need for reliable skill selection, its evaluation remains underdeveloped\. As shown in Table [1](https://arxiv.org/html/2605.05726#S1.T1), prior skill benchmarks Li et al\. \([2026b](https://arxiv.org/html/2605.05726#bib.bib10)\); Han et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib11)\); Li et al\. \([2026a](https://arxiv.org/html/2605.05726#bib.bib9)\) mainly focus on end\-to\-end execution rather than retrieval itself, while existing retrieval benchmarks either target tools or provide only limited evaluation scale\. ToolRet studies tool retrieval and shows that even strong IR models struggle in that setting Shi et al\. \([2025](https://arxiv.org/html/2605.05726#bib.bib12)\)\. SkillRouter is the closest prior work on skill retrieval, but provides only 75 evaluation queries and does not publicly release its training data Zheng et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib13)\)\. These limitations point to the need for a larger, publicly available benchmark with substantial training and evaluation splits that isolates skill retrieval as a standalone problem\.
To address this gap, we introduce SkillRet, a large\-scale benchmark for skill retrieval in LLM agents\. SkillRet is built from 17,810 public agent skills, curated from a raw crawl of 22,795 listings through a filtering pipeline\. It provides 63,259 public training samples and 4,997 evaluation samples, enabling both controlled benchmarking and retrieval\-oriented model development\. We further annotate the corpus with semantic tags and a two\-level taxonomy spanning 6 major categories and 18 sub\-categories, supporting fine\-grained analysis across domains and difficulty factors\. Altogether, SkillRet captures a realistic retrieval environment characterized by long\-context skill documents and imbalanced skill distributions\.
We benchmark a broad range of retrieval and reranking models on SkillRet\. Our experiments reveal several key findings\. First, skill retrieval remains challenging: even the strongest off\-the\-shelf retriever achieves limited performance, indicating that existing models are not well suited for retrieving relevant skills from queries\. Second, task\-specific fine\-tuning on our training data yields substantial gains, allowing smaller fine\-tuned models to match or even surpass much larger off\-the\-shelf models\. Third, reranking is most effective when the first\-stage retriever has remaining headroom, but its marginal benefit diminishes once the base retriever becomes strong\. Finally, our analysis shows that fine\-tuned models improve retrieval by better focusing on the small skill\-relevant sentences embedded within long, noisy, and compositional queries\. These results establish skill retrieval as a distinct retrieval problem and position SkillRet as a strong foundation for future research in large\-scale agent systems\.
Table 1: Comparison of SkillRet with related benchmarks and work\. Unlike prior skill benchmarks that mainly evaluate end\-to\-end performance, SkillRet isolates skill retrieval as a standalone problem and provides large\-scale train/evaluation splits for retrieval\-model development\. Compared with the closest skill\-retrieval work, SkillRouter, SkillRet offers a substantially larger evaluation set and a larger public training set\. †SkillRouter reports 37,979 training samples, but the training data are not publicly released\.

| Benchmark | Task | Target | \# Eval Samples | Train | \# Train Samples |
| --- | --- | --- | --- | --- | --- |
| ToolRet Shi et al\. \([2025](https://arxiv.org/html/2605.05726#bib.bib12)\) | Retrieval | Tool | 7,615 | ✓ | \>200K |
| SkillsBench Li et al\. \([2026b](https://arxiv.org/html/2605.05726#bib.bib10)\) | End\-to\-End Performance | Skill | 86 | × | – |
| SWE\-Skills\-Bench Han et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib11)\) | End\-to\-End Performance | Skill | 565 | × | – |
| AgentSkillOS Li et al\. \([2026a](https://arxiv.org/html/2605.05726#bib.bib9)\) | End\-to\-End Performance | Skill | 30 | × | – |
| SkillRouter Zheng et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib13)\) | Retrieval | Skill | 75 | ✓ | 37,979† |
| SkillRet \(ours\) | Retrieval | Skill | 4,997 | ✓ | 63,259 |
## 2 Related Work
### 2\.1 Agent Skills
Recent work increasingly treats skills as a reusable abstraction layer and a core component of agent design\. MetaClaw proposes a continual meta\-learning framework that jointly evolves a base LLM policy and a reusable skill library, using failure trajectories to synthesize new skills and improve agents without downtime Xia et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib1)\)\. XSkill studies continual learning in multimodal agents through two forms of reusable knowledge retrieved and adapted to the current visual context at inference time Jiang et al\. \([2026a](https://arxiv.org/html/2605.05726#bib.bib6)\)\. WebXSkill focuses on autonomous web agents and introduces executable skills that combine parameterized action programs with step\-level natural language guidance, organized in a URL\-based graph for context\-aware retrieval Wang et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib7)\)\. These systems show that reusable skills are becoming a practical design pattern and that inference\-time access to skill libraries is increasingly important\. Another research direction studies broader skill ecosystems and the downstream usefulness of skills\. AgentSkillOS studies ecosystem\-scale organization, selection, and orchestration through capability trees and DAG\-based multi\-skill pipelines, evaluating 30 artifact\-rich tasks across five categories Li et al\. \([2026a](https://arxiv.org/html/2605.05726#bib.bib9)\)\. SkillsBench measures whether skills improve performance across 86 tasks in 11 domains, showing gains from curated skills but no average benefit from self\-generated skills Li et al\. \([2026b](https://arxiv.org/html/2605.05726#bib.bib10)\)\. SWE\-Skills\-Bench similarly evaluates public SWE skills on requirement\-driven software engineering tasks and finds that most skills provide little or no pass\-rate improvement Han et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib11)\)\.
These works are complementary to ours: they show that skill ecosystems are already emerging and that downstream skill usefulness is highly variable\. However, these benchmarks do not isolate skill retrieval quality as a standalone problem\. In end\-to\-end skill\-use settings, failures can arise from the intrinsic usefulness of the selected skill, orchestration errors, execution failures, or contextual mismatch, making it difficult to attribute performance specifically to retrieval\.
### 2\.2 Skill Retrieval Benchmarks and Skill Routing
A smaller but growing line of work studies retrieval more directly\. In the tool setting, ToolRet introduces a benchmark with 7\.6K retrieval tasks and 43K tools, showing that models strong on conventional IR benchmarks still struggle on tool retrieval Shi et al\. \([2025](https://arxiv.org/html/2605.05726#bib.bib12)\)\. This is an important precedent for our setting: retrieval should be treated as a first\-class agent bottleneck rather than a solved preprocessing step\. However, ToolRet focuses on tools rather than skills, and therefore does not capture the long\-form procedural content, reusable prompting logic, and compositional structure of real skill libraries\. SkillFlow Li et al\. \([2025](https://arxiv.org/html/2605.05726#bib.bib33)\) is complementary to our work because it proposes an agent\-facing multi\-stage pipeline for retrieving and selecting skills from a large community skill library, whereas SkillRet isolates skill retrieval as a standalone benchmark with public train/evaluation splits and controlled ranking\-based evaluation\. The closest prior work is SkillRouter, which studies skill selection over roughly 80K candidate skills using a two\-stage retrieve\-and\-rerank pipeline and a benchmark of 75 expert\-verified queries Zheng et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib13)\)\. A key finding is that the full skill body carries decisive routing signal, and removing it causes large performance drops across retrieval methods Zheng et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib13)\)\. At the same time, SkillRouter is primarily a routing\-model paper rather than a benchmark paper\. Its core contribution is how to design and train a scalable router, whereas our goal is to provide a broader benchmark for comparing retrieval quality across models and settings\.
## 3 SkillRet Benchmark
SkillRet is a large\-scale benchmark for retrieving relevant agent skills from a curated library of publicly available skills\. Starting from 22,795 community\-contributed skills, we apply quality filtering and deduplication to obtain 17,810 skills \(Section [3\.1](https://arxiv.org/html/2605.05726#S3.SS1)\)\. We then generate natural\-language queries that mirror realistic agent invocation patterns, where each query requires one or more skills from the library \(Section [3\.2](https://arxiv.org/html/2605.05726#S3.SS2)\)\. Finally, we filter the generated query–skill pairs through automatic checks, LLM\-based review, and human expert validation, yielding disjoint training and evaluation splits with no skill overlap\. Fig\. [1](https://arxiv.org/html/2605.05726#S3.F1) illustrates the full data construction pipeline\.
### 3\.1 Data Collection and Quality Filtering
##### Raw corpus\.
We start from a snapshot of 22,795 agent skills crawled from claude\-plugins\.dev \([https://claude\-plugins\.dev](https://claude-plugins.dev/)\), a community\-maintained, open\-source marketplace that auto\-indexes all public agent skills on GitHub\. Each record contains a skill identifier, name, natural\-language description, the full skill body \(SKILL\.md\), and marketplace metadata including GitHub stars, platform\-specific install counts, author, namespace, and license\.
##### Five\-stage filtering\.
We apply a pipeline to remove noise and redundancy, organized into two phases: *content eligibility* \(Steps 1–3\) ensures each skill meets basic quality and legal requirements, and *deduplication* \(Steps 4–5\) removes redundant entries\. \(1\) Description recovery and pruning: listings with missing or stub descriptions \(fewer than 10 characters\) are recovered via YAML frontmatter parsing or first\-paragraph extraction; unrecoverable entries are removed \(3 skills\)\. \(2\) Language filtering: skills whose body contains more than 3% non\-Latin characters are removed, retaining only English\-language skills \(1,319 skills\)\. \(3\) License filtering: skills declaring a license other than MIT or Apache\-2\.0 are excluded \(255 skills\); license\-undeclared near\-duplicates of these entries are identified by normalized content hashes and also removed \(1,249 total\)\. \(4\) Content deduplication: each skill body is normalized \(strip YAML, lowercase, remove non\-alphanumeric\) and hashed with SHA\-256; among duplicates we retain the entry with the highest star and install counts \(1,547 skills removed\)\. \(5\) Search\-target deduplication: skills sharing an identical normalized name–description pair are deduplicated on the concatenated hash, again keeping the most popular entry \(867 skills removed\)\.
After filtering, 17,810 skills remain \(78\.1% of the raw corpus\), forming the document corpus for the benchmark\. The per\-step attrition is tabulated in Appendix [B\.1](https://arxiv.org/html/2605.05726#A2.SS1) \(Table [7](https://arxiv.org/html/2605.05726#A2.T7)\)\. These 17,810 skills are split into a training pool of 10,123 skills and a held\-out evaluation pool of 6,660 skills, with no overlap between the two splits\.
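The content-deduplication step (Step 4) and the reported attrition can be sketched as follows. This is a minimal illustration, not the authors' code: the record fields (`body`, `stars`, `installs`), the frontmatter regex, and the tie-breaking order (stars first, then installs) are assumptions made for the example.

```python
import hashlib
import re


def normalize_body(body: str) -> str:
    """Strip a leading YAML frontmatter block, lowercase, drop non-alphanumerics."""
    body = re.sub(r"\A---\n.*?\n---\n", "", body, flags=re.DOTALL)
    return re.sub(r"[^a-z0-9]", "", body.lower())


def content_hash(body: str) -> str:
    """SHA-256 digest of the normalized skill body."""
    return hashlib.sha256(normalize_body(body).encode("utf-8")).hexdigest()


def deduplicate(skills: list) -> list:
    """Keep one skill per content hash, preferring the most popular entry.

    Popularity is compared as (stars, installs); this ordering is an assumption.
    """
    best = {}
    for skill in skills:
        key = content_hash(skill["body"])
        popularity = (skill["stars"], skill["installs"])
        if key not in best or popularity > (best[key]["stars"], best[key]["installs"]):
            best[key] = skill
    return list(best.values())


# Hypothetical records: "a" and "b" collapse to the same normalized body.
skills = [
    {"id": "a", "body": "---\nname: x\n---\nRun the Tests!", "stars": 5, "installs": 10},
    {"id": "b", "body": "run the tests", "stars": 9, "installs": 2},
    {"id": "c", "body": "Deploy to staging.", "stars": 1, "installs": 0},
]
kept_ids = sorted(s["id"] for s in deduplicate(skills))

# Per-step removals as reported for Steps 1-5 in the text.
removed = [3, 1319, 1249, 1547, 867]
remaining = 22795 - sum(removed)
retained_pct = round(100 * remaining / 22795, 1)
```

The five reported removals sum to 4,985, which is consistent with the 17,810 skills (78.1%) retained from the 22,795 raw listings.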
### 3\.2 Skill–Query Pair Generation
To construct a realistic evaluation set, we generate natural\-language user queries via a self\-instruct\-style Wang et al\. \([2023b](https://arxiv.org/html/2605.05726#bib.bib2)\) pipeline in which a large language model is prompted to produce queries that require one or more skills from the library\.
##### Seed examples\.
To encourage lexical and structural diversity, we supply each generation call with a random subset of the GAIA benchmark validation set Mialon et al\. \([2024](https://arxiv.org/html/2605.05726#bib.bib5)\) \(165 questions\) as style seeds\. These seeds illustrate the range of tones, lengths, and request types found in realistic user messages, and the model is instructed to match this diversity rather than converge on a fixed template\.
##### Skill sampling\.
For each generation call we sample $k$ skills, where $k \in \{1,2,3\}$ is drawn with equal probability across the three values\. Skills are selected via inverse\-frequency weighted sampling Kang et al\. \([2019](https://arxiv.org/html/2605.05726#bib.bib37)\); Cui et al\. \([2019](https://arxiv.org/html/2605.05726#bib.bib38)\): each skill’s probability is proportional to $1/(\text{freq}+1)$, where freq is the number of queries already generated for that skill\. A first pass exhausts all skills with zero coverage before any skill is repeated, ensuring that every skill in the library is represented at least once in the final set\.
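A minimal sketch of this sampling scheme is shown below. The function name `sample_skills`, the way zero-coverage skills are mixed with weighted draws when fewer than k uncovered skills remain, and the toy library are all assumptions for illustration; the text does not specify these details.

```python
import random
from collections import Counter


def sample_skills(skill_ids, freq, rng):
    """Draw k in {1, 2, 3} with equal probability, then pick k distinct skills.

    Skills with zero coverage are used first; any remaining slots are filled
    by weighted sampling with weight 1 / (freq + 1).
    """
    k = rng.choice([1, 2, 3])
    uncovered = [s for s in skill_ids if freq[s] == 0]
    rng.shuffle(uncovered)
    chosen = uncovered[:k]
    pool = [s for s in skill_ids if s not in chosen]
    while len(chosen) < k and pool:
        weights = [1.0 / (freq[s] + 1) for s in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    for s in chosen:
        freq[s] += 1  # update coverage so later draws favor rarer skills
    return chosen


rng = random.Random(0)
skill_ids = [f"skill_{i}" for i in range(10)]  # hypothetical library
freq = Counter()
batches = [sample_skills(skill_ids, freq, rng) for _ in range(30)]
```

Because every call consumes at least one uncovered skill while any remain, all ten toy skills are covered within the first ten calls, mirroring the guarantee that each skill appears at least once.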
##### Query generation\.
Each generation call receives the name and description of the sampled skills \(without the full skill body\) and is instructed to produce a single user message that naturally requires all selected skills\. The prompt explicitly forbids mentioning any skill name in the generated query, forcing the task need to emerge from the scenario description rather than from lexical overlap with the skill identifier\. Previously generated queries for the same skill are shown to the model to suppress near\-duplicate outputs\. Evaluation queries are generated with Claude Opus 4\.6 Anthropic \([2026](https://arxiv.org/html/2605.05726#bib.bib3)\), while training queries are generated with Qwen3\.5\-122B\-A10B Qwen Team \([2026](https://arxiv.org/html/2605.05726#bib.bib4)\)\. If the model judges the skill combination to be unrealistic, it may output a designated null token, and that combination is discarded\. Full prompt details are provided in Appendix [C\.1](https://arxiv.org/html/2605.05726#A3.SS1)\.
Figure 1: Overview of the SkillRet data generation pipeline\. Starting from 165 seed queries and 17,810 curated agent skills, we sample skills using inverse\-frequency weighting and prompt an LLM to synthesize realistic user messages that naturally require the selected capabilities\. Training queries are generated with Qwen3\.5\-122B\-A10B, while evaluation queries are generated with Claude Opus 4\.6\. Generated queries are then passed through automated filtering, LLM\-based review, and human expert validation, yielding a training pool of 63,259 queries and 4,997 evaluation queries\.
##### Quality filtering and human validation\.
Generated queries pass through a two\-stage automatic filter followed by human expert review\. \(1\) Leakage detection\. We compute the 3\-gram overlap between each query and its associated skill documentation\. Queries whose overlap ratio exceeds a threshold of 10% are flagged as leaking skill content and discarded\. \(2\) Multi\-perspective LLM review\. A second LLM call evaluates each query from three independent reviewer perspectives: skill coherence \(does the query genuinely require the skill?\), query quality \(is the request specific and realistic?\), and benchmark discriminability \(would a model without the skill fail to answer it?\)\. A query is rejected if two or more of the three perspectives return an invalid verdict; a single invalid verdict routes the query to human review rather than discarding it outright\. Full prompts are provided in Appendix [C\.2](https://arxiv.org/html/2605.05726#A3.SS2)\. \(3\) Human expert validation\. Queries that pass automatic filtering are reviewed by three expert annotators using a custom web\-based review tool\. The tool presents each query alongside the associated skill name, description, and the LLM pre\-judgment rationale, allowing annotators to assess skill–query alignment, realism, and discriminability\. Annotators cast a binary valid/invalid mark\. This stage serves as the final quality gate, catching subtle failures that automated filters miss, such as queries that are plausible in isolation but do not genuinely depend on the paired skill\.
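The two automatic stages can be sketched as below. Several details are assumptions, since the text does not specify them: n-grams are taken over lowercased whitespace tokens, the overlap ratio is normalized by the number of distinct query 3-grams, and the labels `pass`, `human_review`, and `reject` are illustrative names.

```python
def word_ngrams(text, n=3):
    """Set of word-level n-grams over lowercased whitespace tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(query, skill_doc, n=3):
    """Fraction of the query's n-grams that also occur in the skill document."""
    query_grams = word_ngrams(query, n)
    if not query_grams:
        return 0.0
    return len(query_grams & word_ngrams(skill_doc, n)) / len(query_grams)


def route(query, skill_doc, verdicts, threshold=0.10):
    """Apply leakage detection, then the three-perspective vote.

    `verdicts` holds three booleans (True = valid), one per reviewer perspective.
    """
    if overlap_ratio(query, skill_doc) > threshold:
        return "reject"  # leaks skill content
    invalid = sum(1 for v in verdicts if not v)
    if invalid >= 2:
        return "reject"
    if invalid == 1:
        return "human_review"
    return "pass"


doc = "Use this skill to convert PDF files into clean markdown text with tables preserved"
leaky = "Please convert PDF files into clean markdown text for me"
clean = "Our quarterly report only exists as a scanned document and I need its contents in an editable form"
```

In this toy case the leaky query shares 5 of its 8 trigrams with the skill text (ratio 0.625), so it is rejected regardless of the reviewer verdicts, while the paraphrased query shares none and is routed purely by the vote.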
##### Training and evaluation splits\.
To construct query sets for both splits, we generate training queries using Qwen3\.5\-122B\-A10B Qwen Team \([2026](https://arxiv.org/html/2605.05726#bib.bib4)\) and evaluation queries using Claude Opus 4\.6 Anthropic \([2026](https://arxiv.org/html/2605.05726#bib.bib3)\)\. We deliberately use different model families for the two splits so that retrieval models trained on the training set cannot exploit stylistic artifacts of a single generator to inflate evaluation scores; the larger scale of training generation \(63,259 queries\) also makes the open\-weight model the practical choice, allocating the higher\-capacity model to the evaluation set, where query quality directly affects benchmark reliability\. The resulting split comprises a training pool of 10,123 skills and 63,259 queries, and an evaluation pool of 6,660 skills and 4,997 queries, with zero skill overlap between the two sets\. As Fig\. [5](https://arxiv.org/html/2605.05726#A2.F5) in the Appendix shows, the major\-category distribution of each split deviates by less than 1 pp from the full library, confirming that the split preserves the natural category distribution without explicit stratification\.
## 4 Benchmark Analysis
### 4\.1 Taxonomy Overview
Table 2: Taxonomy overview: 6 Major and 18 Sub\-categories covering 17,810 skills\.

| Major Category | Sub\-Category | Skills | % Total |
| --- | --- | --- | --- |
| Software Eng\. | Development | 4,423 | 24\.8 |
| Software Eng\. | Analysis & Testing | 2,320 | 13\.0 |
| Software Eng\. | Infra\. & DevOps | 1,970 | 11\.1 |
| Software Eng\. | Documentation | 889 | 5\.0 |
| Software Eng\. | Version Control | 756 | 4\.2 |
| Software Eng\. | Security | 727 | 4\.1 |
| AI Agents | Agent Development | 1,194 | 6\.7 |
| AI Agents | Agent Orchestration | 607 | 3\.4 |
| AI Agents | Agent Evaluation | 273 | 1\.5 |
| Business & Planning | Business Analysis | 821 | 4\.6 |
| Business & Planning | Project Mgmt\. | 788 | 4\.4 |
| Data & ML | ML Development | 477 | 2\.7 |
| Data & ML | Data Engineering | 418 | 2\.3 |
| Data & ML | Data Analysis | 416 | 2\.3 |
| Content Creation | Writing & Text | 687 | 3\.9 |
| Content Creation | Visual & Media | 489 | 2\.7 |
| Info\. Retrieval | General Search | 357 | 2\.0 |
| Info\. Retrieval | Technical Search | 198 | 1\.1 |

The taxonomy is constructed through a five\-stage pipeline\. \(1\) Tag Discovery: an LLM annotates each skill with three structured tags \(*primary\_action*, *primary\_object*, *domain*\), similar to Gilardi et al\. \([2023](https://arxiv.org/html/2605.05726#bib.bib39)\); Ziems et al\. \([2024](https://arxiv.org/html/2605.05726#bib.bib40)\)\. \(2\) Clustering: $k$\-means over tag vectors at multiple resolutions reveals *stable clusters*, i\.e\., groups that persist across different values of $k$\. \(3\) Taxonomy Construction: stable clusters seed an initial draft, which experts iteratively refine into 6 Major categories and 18 Sub\-categories\. \(4\) LLM\-based Assignment: because three\-axis tags capture only surface\-level attributes, we employ Claude Sonnet 4\.6 to classify all 17,810 skills using their full name and description \(Appendix [B\.5](https://arxiv.org/html/2605.05726#A2.SS5) reports representative tag\-rule failures that motivate this design choice\)\. \(5\) Human Validation: a stratified sample of 200 skills is independently verified by experts, yielding an average accuracy of 95\.5% for major categories and 92\.2% for sub\-categories, with full three\-way agreement on 91\.0% and 84\.5% of items respectively\. Appendix [B\.4](https://arxiv.org/html/2605.05726#A2.SS4) provides full details of each stage\.
Software Engineering accounts for 62\.2% of the corpus while Information Retrieval comprises only 3\.1%, mirroring the natural composition of public agent skill ecosystems \(Fig\. [4](https://arxiv.org/html/2605.05726#A2.F4) in the Appendix\)\.
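The major-category shares quoted above can be re-derived from the sub-category counts in Table 2. The short check below uses only those counts; the dictionary layout and variable names are illustrative.

```python
# Sub-category skill counts copied from Table 2, grouped by major category.
taxonomy = {
    "Software Eng.": [4423, 2320, 1970, 889, 756, 727],
    "AI Agents": [1194, 607, 273],
    "Business & Planning": [821, 788],
    "Data & ML": [477, 418, 416],
    "Content Creation": [687, 489],
    "Info. Retrieval": [357, 198],
}

total = sum(sum(counts) for counts in taxonomy.values())

# Percentage share of each major category, rounded to one decimal place.
shares = {
    major: round(100 * sum(counts) / total, 1)
    for major, counts in taxonomy.items()
}

for major, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{major:<20} {pct:5.1f}%")
```

The counts sum to 17,810, and the Software Engineering and Information Retrieval shares come out at 62.2% and 3.1%, matching the figures in the text.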
### 4\.2 Skill & Taxonomy Statistics
Each skill is represented as the composite text `name | description | skill_md`, which is the actual retrieval target used by all models in our evaluation\. The `skill_md` component contains the full Markdown body including instructions, decision logic, usage constraints, and implementation details\. Measured in cl100k\_base tokens, this composite text has a median length of 1,583 tokens \(mean 2,083; 95th percentile 5,531; max 47,412\), resulting in approximately 37\.1M tokens across the corpus \(Fig\. [2](https://arxiv.org/html/2605.05726#S4.F2)\(a\)\)\. This is an order of magnitude longer than typical tool descriptions in existing benchmarks Shi et al\. \([2025](https://arxiv.org/html/2605.05726#bib.bib12)\), making skill retrieval a fundamentally long\-document matching problem\. Fig\. [2](https://arxiv.org/html/2605.05726#S4.F2)\(b\) shows the per\-Major length distributions\. Data & ML skills are the longest \(median 1,795 tokens\)\. Information Retrieval skills are the shortest, reflecting their comparatively concise search\-oriented instructions\.
Figure 2: Skill and query length statistics\. \(a\) Distribution of document length across all 17,810 skills\. \(b\) Box plots of document length by major category\. \(c\) Query length distributions for the evaluation set and training set\. \(d\) Distribution of $k$ \(number of skills per query\) in each split; training queries are sampled uniformly across $k$, whereas evaluation queries are concentrated on $k = 1$ and $k = 2$\.
### 4\.3 Query Statistics
We further summarize the key distributional properties of the generated queries across the training and evaluation splits\. In Fig\. [2](https://arxiv.org/html/2605.05726#S4.F2)\(c\), evaluation queries, generated by Claude Opus 4\.6, are substantially longer than training queries generated by Qwen3\.5\-122B\-A10B Qwen Team \([2026](https://arxiv.org/html/2605.05726#bib.bib4)\), with a median length of 170 words versus 72 words and a 95th percentile of 270 versus 108 words\. This difference likely reflects generation style across the two model families, where Opus 4\.6 tends to produce more detailed, scenario\-rich requests, making evaluation queries inherently more challenging for lexical matching methods\. In terms of the number of required skills per query, training queries are distributed uniformly across $k \in \{1,2,3\}$, whereas evaluation queries are concentrated on lower values of $k$, with 46% single\-skill queries, 40% two\-skill queries, and 13% three\-skill queries \(Fig\. [2](https://arxiv.org/html/2605.05726#S4.F2)\(d\)\)\. Notably, multi\-skill queries \($k \geq 2$\) still account for the majority of the evaluation set \(54%\), requiring retrievers to jointly identify multiple relevant skills rather than simply retrieving a single best match\.
Table 3: Embedding retrieval results on SkillRet\. Models are grouped by architecture type\. BM25 is included as a sparse baseline\. Best result per metric is bolded\.
## 5 Evaluation
### 5\.1 Experimental Setup
##### Setup\.
We adopt a two\-stage retrieve\-then\-rerank pipeline where an embedding model retrieves the top\-$k$ candidates via cosine similarity and a reranker re\-scores each query–candidate pair\. Larger $k$ yields better coverage but increases reranking cost\. We evaluate $k \in \{10,20,50\}$ and set $k = 20$ to balance retrieval quality against computational cost\. Ablation results are in Appendix [D](https://arxiv.org/html/2605.05726#A4)\. Encoding the full document text, including the name, description, and Markdown body, consistently outperforms encoding name and description only, as shown in Appendix [E](https://arxiv.org/html/2605.05726#A5)\. We therefore encode each document up to the model’s maximum sequence length for all experiments, with per\-model limits listed in Appendix [F](https://arxiv.org/html/2605.05726#A6)\. We use each model’s officially recommended prompts\. For the Harrier, Qwen3\-Embedding, and Qwen3\-Reranker families only, we replace the default web\-search instruction with a skill\-retrieval instruction that we authored\. Full specifications are in Appendix [G](https://arxiv.org/html/2605.05726#A7)\.
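The control flow of this pipeline can be sketched with toy vectors, as below. The embeddings, skill names, and the word-overlap `keyword_score` function are hypothetical stand-ins for a real embedding model and reranker; only the retrieve-by-cosine, then re-score structure follows the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_vec, doc_vecs, k=20):
    """First stage: return the ids of the top-k documents by cosine similarity."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


def rerank(query, candidates, docs, score_fn):
    """Second stage: re-score each query-candidate pair and re-sort."""
    return sorted(candidates, key=lambda d: score_fn(query, docs[d]), reverse=True)


def keyword_score(query, doc_text):
    """Toy reranker: number of words shared by the query and the document."""
    return len(set(query.lower().split()) & set(doc_text.lower().split()))


# Hypothetical 3-dimensional embeddings and skill texts.
doc_vecs = {
    "git-rebase": [1.0, 0.0, 0.1],
    "pdf-export": [0.0, 1.0, 0.0],
    "git-bisect": [0.9, 0.1, 0.3],
    "ml-train": [0.1, 0.2, 1.0],
}
docs = {
    "git-rebase": "rewrite commit history by rebasing a branch",
    "pdf-export": "export documents to pdf",
    "git-bisect": "find the commit that introduced a bug",
    "ml-train": "train a machine learning model",
}
query = "clean up my branch history before merging"
query_vec = [1.0, 0.0, 0.3]

candidates = retrieve(query_vec, doc_vecs, k=2)
reranked = rerank(query, candidates, docs, keyword_score)
```

Here the first stage ranks `git-bisect` slightly above `git-rebase`, and the second stage swaps them. The reranker only reorders the k retrieved candidates, which is why larger k improves coverage at the price of more pairwise scoring.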
##### Models\.
For embedding, we evaluate 18 models across three categories, including a sparse baseline BM25, encoder\-only models, and decoder\-only models, covering sub\-100M to 12B parameters with 16 off\-the\-shelf and 2 fine\-tuned models\. The full list is in Table [3](https://arxiv.org/html/2605.05726#S4.T3)\. For reranking, we evaluate jina\-reranker\-v2\-base\-multilingual Jina AI \([2024](https://arxiv.org/html/2605.05726#bib.bib29)\) and the Qwen3\-Reranker Zhang et al\. \([2025](https://arxiv.org/html/2605.05726#bib.bib23)\) family at scales 0\.6B, 4B, and 8B\. We also include SkillRouter\-Embedding\-0\.6B Zheng et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib13)\) and SkillRouter\-Reranker\-0\.6B Zheng et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib13)\) as external fine\-tuned baselines, evaluated using the publicly released checkpoints on HuggingFace\. We refer to all models fine\-tuned on SkillRet training data collectively as the SkillRet model family, comprising SkillRet\-Embedding\-0\.6B, SkillRet\-Embedding\-8B, and SkillRet\-Reranker\-0\.6B\. Although Harrier\-OSS outperforms Qwen3\-Embedding off\-the\-shelf, we use Qwen3\-Embedding as our fine\-tuning base because Harrier is itself a fine\-tuned derivative of Qwen3\-Embedding\. We verify that fine\-tuning from either base yields comparable results in Appendix [H](https://arxiv.org/html/2605.05726#A8)\.
##### Training details\.
We fine\-tune Qwen3\-Embedding\-0\.6B and Qwen3\-Embedding\-8B using MultipleNegativesRankingLoss with in\-batch negatives on 127,190 positive query–skill pairs derived from the training split\. SkillRet\-Reranker\-0\.6B is fine\-tuned from Qwen3\-Reranker\-0\.6B using binary cross\-entropy on the yes/no token probability at the final decoding position\. Hard negatives are mined with the fine\-tuned SkillRet\-Embedding\-0\.6B retriever\. Full hyperparameters are in Appendix [I](https://arxiv.org/html/2605.05726#A9)\.
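For intuition, a multiple-negatives ranking objective with in-batch negatives can be written in a few lines of plain Python. This is a conceptual sketch, not the training code: the cosine similarity, the scale of 20, and the function name `mnr_loss` are assumptions, and the paper's actual hyperparameters are those in its Appendix I.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mnr_loss(query_vecs, skill_vecs, scale=20.0):
    """Mean cross-entropy over a batch of (query, positive skill) pairs.

    skill_vecs[i] is the positive for query_vecs[i]; every other skill in
    the batch acts as a negative for that query.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, s) for s in skill_vecs]
        peak = max(logits)
        # Numerically stable log-sum-exp over the batch.
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(queries, [[1.0, 0.0], [0.0, 1.0]])   # positives match
swapped = mnr_loss(queries, [[0.0, 1.0], [1.0, 0.0]])   # positives mismatched
```

The loss is near zero when each query is closest to its own skill and large when the positives are swapped, which is the signal that pushes queries toward their paired skills and away from the rest of the batch.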
##### Evaluation metrics\.
We report three metrics at $k \in \{5,10,15\}$: NDCG@$k$ Järvelin and Kekäläinen \([2002](https://arxiv.org/html/2605.05726#bib.bib35)\) measures ranking quality, Recall@$k$ measures the fraction of ground\-truth skills retrieved, and Completeness@$k$ Qu et al\. \([2024](https://arxiv.org/html/2605.05726#bib.bib36)\) measures the fraction of queries where all ground\-truth skills are retrieved, i\.e\., queries for which Recall@$k$ equals 1\. All evaluations are run on a single NVIDIA B200 GPU \(180 GB VRAM\) per model to ensure reproducibility\.
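The three metrics can be computed per query as in the sketch below, assuming binary relevance (a skill is either ground truth for the query or not) and the standard log2 discount for NDCG. The function names are illustrative, and corpus-level scores would be the mean over all queries.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of ground-truth skills that appear in the top-k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def completeness_at_k(ranked, relevant, k):
    """1.0 if every ground-truth skill is in the top-k, else 0.0."""
    return 1.0 if relevant <= set(ranked[:k]) else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal


# Hypothetical three-skill query: two of three skills are retrieved.
ranked = ["a", "x", "b", "y", "z"]
relevant = {"a", "b", "c"}

recall = recall_at_k(ranked, relevant, 5)
complete = completeness_at_k(ranked, relevant, 5)
ndcg = ndcg_at_k(ranked, relevant, 5)
```

For multi-skill queries the metrics diverge: this example has Recall@5 of 2/3 but Completeness@5 of 0, because one required skill is missing from the top 5.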
### 5\.2 Experimental Results
##### Embedding Retrieval\.
Table [3](https://arxiv.org/html/2605.05726#S4.T3) reports the retrieval performance of all evaluated models on the SkillRet benchmark\. The best encoder\-only model, bge\-large\-en\-v1\.5 Xiao et al\. \([2024](https://arxiv.org/html/2605.05726#bib.bib18)\), reaches 55\.82 NDCG@10, setting a ceiling that the strongest decoder\-only models surpass\. Decoder\-only models support maximum sequence lengths of 8K–32K tokens, far exceeding the 512\-token limit of encoder\-only models, and can thus encode full skill documents without truncation\. harrier\-oss\-v1\-0\.6b Huang et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib28)\) reaches 66\.55 NDCG@10, a gap of 10\.7 points over the encoder\-only ceiling\. Within decoder\-only models, however, larger parameter counts do not guarantee better performance: NV\-Embed\-v1 Lee et al\. \([2024](https://arxiv.org/html/2605.05726#bib.bib25)\) at 7B scores only 53\.12, well below harrier\-oss\-v1\-270m Huang et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib28)\) at 61\.17, and KaLM\-Gemma3\-12B Zhao et al\. \([2025](https://arxiv.org/html/2605.05726#bib.bib27)\) at 12B achieves only 55\.38, lower than several 0\.6B and 8B models\. These inversions suggest that model scale alone is insufficient\. What matters more is whether a model has been trained on domain\-relevant data\. Fine\-tuning directly validates this\. SkillRouter\-Embedding\-0\.6B Zheng et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib13)\), a publicly released fine\-tuned model, already surpasses all off\-the\-shelf models at 70\.38 NDCG@10\. Our SkillRet models push further still\. SkillRet\-Embedding\-0\.6B reaches 78\.03, outperforming SkillRouter\-Embedding\-0\.6B by 7\.7 points, and SkillRet\-Embedding\-8B reaches 83\.45, a gain of 16\.9 points over the strongest off\-the\-shelf model, confirming that domain\-specific fine\-tuning is the dominant factor for skill retrieval performance\.
##### Reranking\.
Table [4](https://arxiv.org/html/2605.05726#S5.T4) reports results before and after reranking the top\-20 candidates returned by each first\-stage retriever\. Off\-the\-shelf rerankers consistently *decrease* NDCG@10 for the SkillRet embedding models, suggesting a *domain mismatch* in which a general\-purpose reranker may override correct results from an already task\-specialized retriever\. Qwen3\-Reranker variants at 0\.6B, 4B, and 8B converge to a similar performance level, suggesting they are bounded by domain coverage rather than scale\. SkillRet\-Reranker\-0\.6B breaks through this limit via domain\-specific fine\-tuning, with gains proportional to first\-stage headroom\. It improves SkillRet\-Embedding\-0\.6B by 4\.15 NDCG@10 points, from 78\.03 to 82\.18, where headroom remains, but yields a smaller gain of 0\.77 points for SkillRet\-Embedding\-8B near the performance ceiling, from 83\.45 to 84\.22\. SkillRet\-Reranker\-0\.6B performs on par with SkillRouter\-Reranker\-0\.6B Zheng et al\. \([2026](https://arxiv.org/html/2605.05726#bib.bib13)\) across all first\-stage models, despite being independently fine\-tuned\. This convergence suggests both models have reached a performance ceiling imposed by the current benchmark and training data\.
Table 4: Reranking results on SkillRet with top\-20 candidates\. Best result per first\-stage model is bolded\.
### 5\.3 Analysis
##### Training effect: skill\-relevant sentence focus\.
Fine\-tuned models substantially outperform their base counterparts, as shown in Table [3](https://arxiv.org/html/2605.05726#S4.T3), but why does training help? We hypothesize that fine\-tuning does not simply improve overall query encoding, but instead sharpens the model’s focus on the small subset of sentences within a query that directly signals skill intent\. Skill queries are typically long, scenario\-rich requests in which only a few sentences carry the actionable capability signal\. The remainder consists of background context, output requirements, and constraints largely orthogonal to skill selection\. A base model may distribute attention broadly across all sentences, while a fine\-tuned model learns via retrieval supervision to prioritize the sentences that most directly determine which skill is needed\.
Table 5: Effect of masking important query snippets on retrieval performance \(NDCG@10\)\.

To test this, we conduct a sentence erasure analysis Barkan et al\. \([2024](https://arxiv.org/html/2605.05726#bib.bib15)\); Li et al\. \([2016](https://arxiv.org/html/2605.05726#bib.bib16)\) on the 2,319 single\-skill evaluation queries\. For each sentence $s_i$ in query $q$, we replace it with \[MASK\], re\-encode the masked query $q_{\setminus s_i}$, and compute $\mathrm{importance}(s_i) = \mathrm{sim}(q, d^{+}) - \mathrm{sim}(q_{\setminus s_i}, d^{+})$, where $d^{+}$ denotes the query’s ground\-truth skill\. We then mask the top\-$k$ most important sentences and re\-run retrieval, with results shown in Table [5](https://arxiv.org/html/2605.05726#S5.T5)\. On the full query, the trained model outperforms the base model by 7\.5 NDCG@10 points, yet removing the single most important sentence causes a larger performance drop for the trained model than for the base model\. This suggests that the trained model concentrates its retrieval signal on a small set of skill\-relevant sentences, whereas the base model relies more diffusely on information spread across the entire query\. A qualitative visualization of this pattern is shown in Fig\. [6](https://arxiv.org/html/2605.05726#A12.F6) in Appendix [L](https://arxiv.org/html/2605.05726#A12)\.
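The erasure score above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for the real query/skill encoder, cosine similarity is assumed for `sim`, and the example sentences are invented.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in encoder (assumption): bag-of-words counts over lowercase tokens."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_importance(sentences, skill_doc, embed=toy_embed):
    """importance(s_i) = sim(q, d+) - sim(q with s_i replaced by [MASK], d+)."""
    d_pos = embed(skill_doc)
    full_sim = cosine(embed(" ".join(sentences)), d_pos)
    scores = []
    for i in range(len(sentences)):
        masked = sentences[:i] + ["[MASK]"] + sentences[i + 1:]
        scores.append(full_sim - cosine(embed(" ".join(masked)), d_pos))
    return scores

# Invented query: only the second sentence carries the capability signal.
query_sentences = [
    "our team ships a weekly report",
    "convert the pdf invoices to csv",
    "keep the tone formal",
]
skill = "convert pdf files to csv tables"
scores = sentence_importance(query_sentences, skill)
top_sentence = scores.index(max(scores))
print(top_sentence, scores)
```

A negative score means that masking the sentence raised the similarity to the ground-truth skill, i.e. the sentence acted as noise for that encoder.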

Figure 3: MTEB Retrieval score vs\. SkillRet\. Circle size is proportional to parameter count\.

Table 6: Per\-Major category NDCG@10 for Qwen3\-Embedding \(Base\) and SkillRet\-Embedding \(Ours\)\. Categories ordered by difficulty \(hardest first\)\.
##### MTEB Retrieval ranking does not predict skill retrieval performance\.
Figure [3](https://arxiv.org/html/2605.05726#S5.F3) plots MTEB Retrieval score Hugging Face \([2026](https://arxiv.org/html/2605.05726#bib.bib14)\) against SkillRet NDCG@10\. MTEB Retrieval score shows a moderate positive correlation at Spearman $\rho = 0.71$, yet ranking inversions are common, with models that score highly on MTEB often underperforming on SkillRet, and vice versa\. These inversions suggest that skill retrieval demands a form of query understanding distinct from general semantic matching, requiring models to identify specific capability signals within long, multi\-sentence queries\. Task\-specific fine\-tuning, as demonstrated by the SkillRet model family, is the most effective way to bridge this gap\. Full scores and detailed examples are in Appendix [J](https://arxiv.org/html/2605.05726#A10)\.
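To make the distinction between a positive rank correlation and pairwise ranking inversions concrete, here is a small standard-library sketch. The two score lists are made-up placeholders, not values from the paper; Spearman's $\rho$ is computed as the Pearson correlation of average ranks, which also handles ties.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def count_inversions(x, y):
    """Number of model pairs ordered differently by the two benchmarks."""
    n = len(x)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (x[i] - x[j]) * (y[i] - y[j]) < 0)

# Hypothetical scores for five models on two benchmarks (placeholders).
general_scores = [1, 2, 3, 4, 5]
skill_scores = [2, 1, 4, 3, 5]
rho = spearman(general_scores, skill_scores)
inversions = count_inversions(general_scores, skill_scores)
print(rho, inversions)
```

Even in this toy case the correlation is clearly positive while two of the ten model pairs swap order, which is the kind of pattern the paragraph above describes.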
##### Per\-category performance\.
Table [6](https://arxiv.org/html/2605.05726#S5.T6) breaks down NDCG@10 by the six Major categories\. The SkillRet models improve substantially over the Qwen3\-Embedding baselines across every category, with gains ranging from \+10\.4 pp to \+40\.0 pp for the 8B variant\. Despite these gains, the difficulty ordering is stable across all four configurations: Information Retrieval and AI Agents consistently score lowest, and a 16 pp gap between the easiest and hardest categories persists even for SkillRet\-Embedding\-8B\. This category\-level disparity is invisible to the aggregate NDCG@10 of 83\.5 and can only be surfaced through the taxonomy\-based stratification\. Finer\-grained Sub\-category results \(Appendix [K](https://arxiv.org/html/2605.05726#A11)\) expose within\-Major variance of up to 17\.9 pp, further confirming the taxonomy’s value as a diagnostic tool for pinpointing retrieval bottlenecks\.
## 6 Limitations
SkillRet has two main limitations\. First, SkillRet queries are designed to resemble realistic user requests but are synthetically generated rather than collected from live agent interactions\. Thus, the evaluation set may under\-represent terse, underspecified, conversational, or user\-context\-dependent requests common in real deployments\. We mitigate this with GAIA\-style seed examples, skill\-name leakage filtering, and query–skill validation, but bridging synthetic benchmarks with real agent traffic remains important future work\. Second, SkillRet evaluates retrieval quality in isolation and does not measure downstream task success or end\-to\-end agent performance\. Higher NDCG@10 does not necessarily imply better skill use, since retrieved skills must still be selected, composed, interpreted, and executed under practical context and latency constraints\. We leave the joint study of skill retrieval and downstream execution to future work\.
## 7 Conclusion
We introduced SkillRet, a large\-scale benchmark for skill retrieval in LLM agents, built from 17,810 curated public skills with a two\-level taxonomy of 6 Major and 18 Sub\-categories, 4,997 evaluation queries, and a matched training pool of 63,259 queries\. Unlike prior tool retrieval benchmarks, SkillRet targets long\-form, compositional skill documents, where the relevant signal must be matched against a small actionable portion of the user query\. Across the embedding models we evaluate, the strongest off\-the\-shelf retriever reaches 66\.55 NDCG@10, while the strongest prior skill\-retrieval model reaches 70\.38\. Domain\-specific fine\-tuning on SkillRet lifts NDCG@10 to 83\.45, corresponding to a \+13\.1\-point gain over the strongest prior retriever and a \+16\.9\-point gain over the strongest off\-the\-shelf retriever\. These results position skill retrieval as a distinct long\-document matching problem and establish SkillRet as a foundation for retrieval\-oriented training and benchmarking in future agent systems\.
## Acknowledgments
We sincerely thank Hyojung Han and Seunghun Jeon for their helpful discussions during the early stages of this project\.
## References
- \[1\] \(2026\) Jina\-embeddings\-v5\-text: task\-targeted embedding distillation\. arXiv preprint arXiv:2602\.15547\.
- \[2\] Anthropic \(2026\) Claude Opus 4\.6\. Note: [https://www\.anthropic\.com/claude](https://www.anthropic.com/claude)
- \[3\] O\. Barkan, Y\. Toib, Y\. Elisha, J\. Weill, and N\. Koenigstein \(2024\) LLM explainability via attributive masking learning\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 9522–9537\.
- \[4\] Y\. Cui, M\. Jia, T\. Lin, Y\. Song, and S\. Belongie \(2019\) Class\-balanced loss based on effective number of samples\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 9268–9277\.
- \[5\] S\. Eslami, M\. Gaiduk, M\. Krimmel, L\. Milliken, B\. Wang, and D\. Bykov \(2026\) Diffusion\-pretrained dense and contextual embeddings\. arXiv preprint arXiv:2602\.11151\.
- \[6\] F\. Gilardi, M\. Alizadeh, and M\. Kubli \(2023\) ChatGPT outperforms crowd workers for text\-annotation tasks\. Proceedings of the National Academy of Sciences 120\(30\), pp\. e2305016120\.
- \[7\] T\. Han, Y\. Zhang, W\. Song, C\. Fang, Z\. Chen, Y\. Sun, and L\. Hu \(2026\) SWE\-skills\-bench: do agent skills actually help in real\-world software engineering?\. arXiv preprint arXiv:2603\.15401\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2603.15401)
- \[8\] A\. Huang, L\. Wang, F\. Wei, et al\. \(2026\) Harrier\-oss\-v1\. Note: [https://huggingface\.co/microsoft/harrier\-oss\-v1\-0\.6b](https://huggingface.co/microsoft/harrier-oss-v1-0.6b)
- \[9\] Hugging Face \(2026\) MTEB leaderboard\. Note: [https://huggingface\.co/spaces/mteb/leaderboard](https://huggingface.co/spaces/mteb/leaderboard)
- \[10\] K\. Järvelin and J\. Kekäläinen \(2002\) Cumulated gain\-based evaluation of IR techniques\. ACM Transactions on Information Systems \(TOIS\) 20\(4\), pp\. 422–446\.
- \[11\] G\. Jiang, Z\. Su, X\. Qu, and Y\. R\. Fung \(2026\) XSkill: continual learning from experience and skills in multimodal agents\. arXiv preprint arXiv:2603\.12056\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2603.12056)
- \[12\] Y\. Jiang, D\. Li, H\. Deng, B\. Ma, X\. Wang, Q\. Wang, and G\. Yu \(2026\) SoK: agentic skills – beyond tool use in LLM agents\. arXiv preprint arXiv:2602\.20867\.
- \[13\] Jina AI \(2024\) Jina\-reranker\-v2\-base\-multilingual\. Note: [https://huggingface\.co/jinaai/jina\-reranker\-v2\-base\-multilingual](https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual)
- \[14\] B\. Kang, S\. Xie, M\. Rohrbach, Z\. Yan, A\. Gordo, J\. Feng, and Y\. Kalantidis \(2019\) Decoupling representation and classifier for long\-tailed recognition\. arXiv preprint arXiv:1910\.09217\.
- \[15\] C\. Lee, R\. Roy, M\. Xu, J\. Raiman, M\. Shoeybi, B\. Catanzaro, and W\. Ping \(2024\) NV\-Embed: improved techniques for training LLMs as generalist embedding models\. arXiv preprint arXiv:2405\.17428\.
- \[16\] F\. Li, P\. Tagkopoulos, and I\. Tagkopoulos \(2025\) SkillFlow: scalable and efficient agent skill retrieval system\. arXiv e\-prints, pp\. arXiv–2504\.
- \[17\] H\. Li, C\. Mu, J\. Chen, S\. Ren, Z\. Cui, Y\. Zhang, L\. Bai, and S\. Hu \(2026\) Organizing, orchestrating, and benchmarking agent skills at ecosystem scale\. arXiv preprint arXiv:2603\.02176\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2603.02176)
- \[18\] J\. Li, W\. Monroe, and D\. Jurafsky \(2016\) Understanding neural networks through representation erasure\. arXiv preprint arXiv:1612\.08220\.
- \[19\] X\. Li, W\. Chen, Y\. Liu, S\. Zheng, X\. Chen, Y\. He, Y\. Li, B\. You, H\. Shen, J\. Sun, S\. Wang, B\. Li, Q\. Zeng, D\. Wang, X\. Zhao, Y\. Wang, R\. Ben Chaim, Z\. Di, Y\. Gao, J\. He, Y\. He, L\. Jing, L\. Kong, X\. Lan, J\. Li, S\. Li, Y\. Li, Y\. Lin, X\. Liu, X\. Liu, H\. Lyu, Z\. Ma, B\. Wang, R\. Wang, T\. Wang, W\. Ye, Y\. Zhang, H\. Xing, Y\. Xue, S\. Dillmann, and H\. Lee \(2026\) SkillsBench: benchmarking how well agent skills work across diverse tasks\. arXiv preprint arXiv:2602\.12670\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2602.12670)
- \[20\] L\. Merrick, D\. Xu, G\. Nuti, and D\. Campos \(2024\) Arctic\-embed: scalable, efficient, and accurate text embedding models\. arXiv preprint arXiv:2405\.05374\.
- \[21\] G\. Mialon, C\. Fourrier, T\. Wolf, Y\. LeCun, and T\. Scialom \(2024\) GAIA: a benchmark for general AI assistants\. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=fibxvahvs3)
- \[22\] C\. Qu, S\. Dai, X\. Wei, H\. Cai, S\. Wang, D\. Yin, J\. Xu, and J\. Wen \(2024\) Towards completeness\-oriented tool retrieval for large language models\. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp\. 1930–1940\.
- \[23\] Qwen Team \(2026\) Qwen3\.5\. Note: [https://qwen\.ai/blog?id=qwen3\.5](https://qwen.ai/blog?id=qwen3.5) Accessed: 2026\-04\-22
- \[24\] S\. E\. Robertson and S\. Walker \(1994\) Some simple effective approximations to the 2\-Poisson model for probabilistic weighted retrieval\. In SIGIR’94: Proceedings of the Seventeenth Annual International ACM\-SIGIR Conference on Research and Development in Information Retrieval, organised by Dublin City University, pp\. 232–241\.
- \[25\] Z\. Shi, Y\. Wang, L\. Yan, P\. Ren, S\. Wang, D\. Yin, and Z\. Ren \(2025\) Retrieval models aren’t tool\-savvy: benchmarking tool retrieval for large language models\. arXiv preprint arXiv:2503\.01763\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2503.01763)
- \[26\] O\. Team \(2025\) Octen series: optimizing embedding models to \#1 on RTEB leaderboard\. External Links: [Link](https://octen-team.github.io/octen_blog/posts/octen-rteb-first-place/)
- \[27\] G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2023\) Voyager: an open\-ended embodied agent with large language models\. arXiv preprint arXiv:2305\.16291\.
- \[28\] L\. Wang, N\. Yang, X\. Huang, B\. Jiao, L\. Yang, D\. Jiang, R\. Majumder, and F\. Wei \(2022\) Text embeddings by weakly\-supervised contrastive pre\-training\. arXiv preprint arXiv:2212\.03533\.
- \[29\] Y\. Wang, Y\. Kordi, S\. Mishra, A\. Liu, N\. A\. Smith, D\. Khashabi, and H\. Hajishirzi \(2023\) Self\-instruct: aligning language models with self\-generated instructions\. In Proceedings of the 61st annual meeting of the association for computational linguistics \(volume 1: long papers\), pp\. 13484–13508\.
- \[30\] Z\. Wang, Q\. Wu, X\. Zhang, C\. Zhang, W\. Yao, F\. E\. Faisal, B\. Peng, S\. Qin, S\. Nath, Q\. Lin, C\. Bansal, D\. Zhang, S\. Rajmohan, J\. Gao, and H\. Yao \(2026\) WebXSkill: skill learning for autonomous web agents\. arXiv preprint arXiv:2604\.13318\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2604.13318)
- \[31\] P\. Xia, J\. Chen, X\. Yang, H\. Tu, J\. Liu, K\. Xiong, S\. Han, S\. Qiu, H\. Ji, Y\. Zhou, Z\. Zheng, C\. Xie, and H\. Yao \(2026\) MetaClaw: just talk – an agent that meta\-learns and evolves in the wild\. arXiv preprint arXiv:2603\.17187\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2603.17187)
- \[32\] S\. Xiao, Z\. Liu, P\. Zhang, N\. Muennighoff, D\. Lian, and J\. Nie \(2024\) C\-pack: packed resources for general Chinese embeddings\. In Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, pp\. 641–649\.
- \[33\] R\. Xu and Y\. Yan \(2026\) Agent skills for large language models: architecture, acquisition, security, and the path forward\. arXiv preprint arXiv:2602\.12430\.
- \[34\] Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, et al\. \(2025\) Qwen3 embedding: advancing text embedding and reranking through foundation models\. arXiv preprint arXiv:2506\.05176\.
- \[35\] Z\. Zhang, Z\. Liao, H\. Yu, P\. Di, and R\. Wang \(2026\) F2LLM\-v2: inclusive, performant, and efficient embeddings for a multilingual world\. arXiv preprint arXiv:2603\.19223\.
- \[36\] X\. Zhao, X\. Hu, Z\. Shan, S\. Huang, Y\. Zhou, X\. Zhang, Z\. Sun, Z\. Liu, D\. Li, X\. Wei, et al\. \(2025\) KaLM\-embedding\-v2: superior training techniques and data inspire a versatile embedding model\. arXiv preprint arXiv:2506\.20923\.
- \[37\] Y\. Zheng, Z\. Zhang, C\. Ma, Y\. Yu, J\. Zhu, B\. Dong, and H\. Zhu \(2026\) SkillRouter: skill routing for LLM agents at scale\. arXiv preprint arXiv:2603\.22455\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2603.22455)
- \[38\] H\. Zhou, S\. Guo, A\. Liu, Z\. Yu, Z\. Gong, B\. Zhao, Z\. Chen, M\. Zhang, Y\. Chen, J\. Li, et al\. \(2026\) Memento\-skills: let agents design agents\. arXiv preprint arXiv:2603\.18743\.
- \[39\] C\. Ziems, W\. Held, O\. Shaikh, J\. Chen, Z\. Zhang, and D\. Yang \(2024\) Can large language models transform computational social science?\. Computational Linguistics 50\(1\), pp\. 237–291\.
## Appendix
## Appendix A Data, code, and model\.
## Appendix B Dataset Construction Details
This appendix provides supporting details for the SkillRet skill library summarized in Section [3](https://arxiv.org/html/2605.05726#S3) and the taxonomy presented in Section [4](https://arxiv.org/html/2605.05726#S4): the per\-step filtering attrition \(§[B\.1](https://arxiv.org/html/2605.05726#A2.SS1)\), the two\-pass LLM tagging procedure \(§[B\.2](https://arxiv.org/html/2605.05726#A2.SS2)\), consensus clustering over action–object combinations \(§[B\.3](https://arxiv.org/html/2605.05726#A2.SS3)\), the iterative taxonomy design process \(§[B\.4](https://arxiv.org/html/2605.05726#A2.SS4)\), LLM\-based skill assignment \(§[B\.5](https://arxiv.org/html/2605.05726#A2.SS5)\), and human validation \(§[B\.6](https://arxiv.org/html/2605.05726#A2.SS6)\)\.
### B\.1 Per\-Step Filtering Attrition
Table [7](https://arxiv.org/html/2605.05726#A2.T7) reports the per\-step attrition of the five\-stage filtering pipeline described in §[3\.1](https://arxiv.org/html/2605.05726#S3.SS1)\. The two largest reductions come from content deduplication \(Step 4\) and language filtering \(Step 2\)\.
Table 7: Quality filtering pipeline\. The largest reductions come from content deduplication \(Step 4\) and language filtering \(Step 2\)\.
### B\.2 Structured Tagging
To characterize each skill along interpretable dimensions, we assign three structured tags per skill: primary\_action \(what the skill *does*\), primary\_object \(what the skill *acts on*\), and domain \(the technical field it belongs to\)\. We use a two\-pass procedure with Claude Sonnet 4\.6\.
#### B\.2\.1 Pass 1: Category Discovery
All 17,810 skill names with truncated descriptions \(100 characters\) are submitted in a single prompt\. The model is instructed to discover natural, non\-overlapping categories for each dimension at an appropriate granularity \(roughly 8–15 categories\)\. This yields 13 actions, 14 objects, and 13 domains\. The categories are discovered by the LLM from the corpus rather than being predefined by the authors, though the target granularity \(8–15 per dimension\) is specified in the prompt\. The resulting label sets were manually reviewed by the authors to verify semantic coherence and adjust ambiguous or overlapping categories\.
##### System prompt\.
> ```
> You are a skill taxonomy analyst. You will receive a list of ~17,000 AI coding skill names with short descriptions.
>
> Your task: analyze ALL skills and discover the natural categories that exist across three dimensions.
>
> For each dimension, identify **distinct, non-overlapping categories** at an appropriate granularity level (roughly 8-15 categories per dimension). Each category should have a short lowercase label (1-2 words, snake_case) and a brief description.
>
> Dimensions:
> 1. **primary_action**: What the skill DOES (the core verb/activity)
> 2. **primary_object**: What the skill acts ON (the target/subject)
> 3. **domain**: What technical field the skill belongs to
>
> Output strict JSON with this structure:
> {
>   "primary_action": [ {"label": "...", "description": "..."} ],
>   "primary_object": [ {"label": "...", "description": "..."} ],
>   "domain": [ {"label": "...", "description": "..."} ]
> }
>
> No markdown fences, no explanations outside the JSON.
> ```
##### User message format\.
> ```
> Here are all the skills:
> {skill_name_1}: {description_first_100_chars}
> {skill_name_2}: {description_first_100_chars}
> ...
> {skill_name_17810}: {description_first_100_chars}
> ```
#### B\.2\.2 Pass 2: Batch Classification
The discovered categories are injected into the system prompt as a closed label set\. Skills are then classified in batches of 100, with the model selecting exactly one label per dimension for each skill\. The output is a structured JSON record per skill\. After deduplication of any double\-tagged entries, we obtain a clean set of 17,810 \(id, action, object, domain\) tuples\.
##### System prompt\.
> ```
> You are a skill taxonomy classifier. For each AI coding skill, assign exactly 3 labels.
>
> **primary_action** -- choose ONE from:
> - implement: Writing, building, or creating new code, features, components, or systems
> - debug: Finding, diagnosing, and fixing bugs, errors, or unexpected behavior
> - review: Evaluating, auditing, or assessing code, documentation, or designs for quality
> - test: Writing, running, or managing automated tests and test strategies
> - design: Architecting systems, designing APIs, planning schemas, or defining specifications
> - document: Creating, updating, or generating documentation, comments, or explanations
> - refactor: Restructuring or improving existing code without changing behavior
> - configure: Setting up, installing, or configuring tools, environments, or services
> - deploy: Building, packaging, releasing, or deploying software to environments
> - analyze: Investigating, researching, profiling, or extracting insights from code or data
> - generate: Producing artifacts like images, content, reports, or boilerplate automatically
> - orchestrate: Coordinating, routing, or managing multiple agents, tasks, or workflows
> - search: Finding, discovering, or retrieving information from code, docs, or the web
>
> **primary_object** -- choose ONE from:
> - code: Source code files, functions, classes, modules
> - api: REST, GraphQL, gRPC interfaces and endpoints
> - database: Database schemas, queries, migrations
> - ui_component: Frontend components, pages, layouts
> - test_suite: Unit tests, integration tests, E2E tests
> - documentation: READMEs, API docs, guides, changelogs
> - pipeline: CI/CD pipelines, data pipelines, build workflows
> - infrastructure: Cloud resources, containers, K8s, and infrastructure-as-code
> - agent_skill: AI agent skills, prompts, system prompts, and LLM configurations
> - data: Datasets, data files, spreadsheets, reports
> - project: Entire projects, repositories, codebases
> - dependency: Packages, libraries, version management
> - security: Vulnerabilities, authentication, secrets
> - content: Text content, blog posts, marketing copy
>
> **domain** -- choose ONE from:
> - web_frontend: Browser-based UI development (React, Vue, Angular, HTML/CSS)
> - backend_api: Server-side development, REST/GraphQL APIs, microservices
> - devops_infra: CI/CD, cloud infrastructure, containers, Kubernetes
> - data_ml: Data engineering, machine learning, AI model training, analytics
> - mobile: iOS, Android, cross-platform mobile apps
> - security: Application security, penetration testing, vulnerability management
> - database: Relational and NoSQL databases, query optimization
> - ai_agents: LLM applications, agent frameworks, RAG systems, prompt engineering
> - developer_tools: CLI tools, IDE extensions, code generation, developer productivity
> - testing_qa: Test automation, quality assurance
> - product_design: UI/UX design, product management, user research
> - systems: Operating systems, embedded systems, compilers
> - business_ops: Project management, marketing, sales, finance, legal
>
> Respond ONLY with a JSON array. Each element: {"id": "...", "primary_action": "...", "primary_object": "...", "domain": "..."}. No explanations, no markdown fences.
> ```
##### User message format \(per batch of 100 skills\)\.
> ```
> Tag these skills:
> {id_1}|{name_1}: {description_first_200_chars}
> {id_2}|{name_2}: {description_first_200_chars}
> ...
> {id_100}|{name_100}: {description_first_200_chars}
> ```
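The batching and post-processing around Pass 2 might look roughly like the sketch below. It is an illustration under assumptions rather than the released pipeline: the helper names are hypothetical, entries in the user message are assumed to be newline-separated, the model call itself is omitted, and "keep the first valid record per id" is one possible reading of the deduplication step.

```python
import json

def make_batches(skills, batch_size=100):
    """Split the skill list into consecutive batches of at most batch_size."""
    return [skills[i:i + batch_size] for i in range(0, len(skills), batch_size)]

def build_user_message(batch, max_chars=200):
    """Render one batch as 'id|name: description' lines (newlines assumed)."""
    lines = [f"{s['id']}|{s['name']}: {s['description'][:max_chars]}" for s in batch]
    return "Tag these skills:\n" + "\n".join(lines)

def parse_tags(raw_json, batch, label_sets):
    """Keep the first record per known id whose labels are all in the closed sets."""
    expected_ids = {s["id"] for s in batch}
    tagged = {}
    for record in json.loads(raw_json):
        skill_id = record.get("id")
        if skill_id not in expected_ids or skill_id in tagged:
            continue  # unknown id or double-tagged entry
        if all(record.get(dim) in labels for dim, labels in label_sets.items()):
            tagged[skill_id] = tuple(record[dim] for dim in label_sets)
    return tagged

# Hypothetical two-skill batch and a hand-written model response.
batch = [
    {"id": "s1", "name": "pdf-to-csv", "description": "x" * 300},
    {"id": "s2", "name": "lint", "description": "Run linters"},
]
label_sets = {
    "primary_action": {"implement", "review"},
    "primary_object": {"code", "data"},
    "domain": {"developer_tools"},
}
response = json.dumps([
    {"id": "s1", "primary_action": "implement", "primary_object": "data", "domain": "developer_tools"},
    {"id": "s1", "primary_action": "review", "primary_object": "code", "domain": "developer_tools"},
    {"id": "s2", "primary_action": "lint", "primary_object": "code", "domain": "developer_tools"},
    {"id": "s3", "primary_action": "review", "primary_object": "code", "domain": "developer_tools"},
])
tags = parse_tags(response, batch, label_sets)
print(build_user_message(batch))
print(tags)
```

With 17,810 skills and batches of 100, this splitting produces 179 requests, the last containing 10 skills. Skills whose records are rejected (such as `s2` above, which carries a label outside the closed set) would need to be re-submitted.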
### B\.3 Consensus Clustering over Action–Object Combinations
The action $\times$ object product space contains 182 possible combinations \(13 actions $\times$ 14 objects\), but the distribution is highly concentrated: the top 44 combinations account for 80% of all skills \(14,324 of 17,810\)\. We focus on these 44 combinations to discover stable groupings that seed the initial taxonomy draft\.
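The selection of the most frequent combinations can be sketched as a cumulative-coverage cut over (action, object) counts. This is an assumed procedure with a hypothetical helper name and toy tags; the source states only the resulting numbers (44 combinations covering 14,324 of 17,810 skills).

```python
from collections import Counter

def top_combinations(tags, coverage=0.80):
    """Return the most frequent (action, object) pairs until coverage is reached."""
    counts = Counter(tags)
    total = sum(counts.values())
    selected, covered = [], 0
    for combo, n in counts.most_common():
        if covered >= coverage * total:
            break
        selected.append(combo)
        covered += n
    return selected, covered / total

# Toy tag list: ten skills spread over four combinations.
toy_tags = ([("implement", "code")] * 6 + [("debug", "code")] * 2
            + [("review", "api")] + [("test", "api")])
selected, achieved = top_combinations(toy_tags)
print(selected, achieved)

# Figures quoted in the text: size of the product space and reported coverage.
product_space = 13 * 14
reported_coverage = 14324 / 17810
print(product_space, round(reported_coverage, 3))
```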
##### Embedding\.
Each combination is represented as the text "\{action\} \{object\}" and encoded with Qwen3\-Embedding\-8B, yielding a 4,096\-dimensional vector\.
##### Multi\-resolution clustering\.
We run $k$\-means at five resolutions \($k \in \{5, 7, 10, 15, 20\}$\) with 20 random initializations each, and build a co\-association matrix: entry $(i, j)$ records the fraction of runs in which combinations $i$ and $j$ are assigned to the same cluster\.
##### Strict consensus groups\.
Two combinations are linked if and only if they co\-occur in *all five* resolutions \(threshold = 5/5\); that is, regardless of whether $k$ is 5 or 20, the pair is always assigned to the same cluster\. Connected components of this graph yield 10 stable groups \(25 combinations\) and 19 singletons \(Table [8](https://arxiv.org/html/2605.05726#A2.T8)\)\.
The groups fall into two types: *object\-bound* groups, in which diverse actions share a common object \(e\.g\., G2: document $\times$ doc, generate $\times$ doc, review $\times$ doc\); and *action\-bound* groups, in which a single action spans multiple objects \(e\.g\., G1: implement $\times$ code, implement $\times$ api, implement $\times$ data\)\. Object\-bound groups outnumber action\-bound groups 6 to 4; these stable groups seed the initial taxonomy draft \(§[B\.4](https://arxiv.org/html/2605.05726#A2.SS4)\)\.
Table 8: Consensus clustering: 10 stable groups \(threshold = 5/5\)\. *Binding* indicates whether members share a common object or action\. 19 singletons \(5,776 skills\) are omitted for brevity\.
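The co-association and strict-consensus steps can be illustrated with a small standard-library sketch. Several things here are assumptions: the cluster labels are hard-coded toy values instead of real $k$-means output, one labeling per resolution is used (the text does not say how the 20 initializations are aggregated), and the function names are hypothetical.

```python
def co_association(labelings):
    """Entry (i, j): fraction of labelings that put items i and j together."""
    n = len(labelings[0])
    runs = len(labelings)
    return [[sum(lab[i] == lab[j] for lab in labelings) / runs
             for j in range(n)] for i in range(n)]

def consensus_groups(co, threshold=1.0):
    """Connected components of the graph linking pairs with co >= threshold."""
    n = len(co)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy labels for five combinations at three resolutions (coarse to fine).
labelings = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 2, 2],
    [0, 0, 1, 2, 3],
]
co = co_association(labelings)
groups = consensus_groups(co)  # threshold 1.0 mirrors the strict 5/5 rule
print(groups)
```

In this toy example, items 3 and 4 are clustered together in two of three labelings, so the strict threshold leaves them as singletons; only items 0 and 1 form a stable group.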
### B\.4 Iterative Taxonomy Design
The final two\-level taxonomy \(Table [2](https://arxiv.org/html/2605.05726#S4.T2)\) is the product of an iterative, human\-in\-the\-loop process\. The 10 stable groups identified by consensus clustering \(§[B\.3](https://arxiv.org/html/2605.05726#A2.SS3)\) were used to seed an initial draft taxonomy with 7 Major categories and 21 Sub\-categories\. Experts then iteratively reviewed stratified samples of 200 skills, identifying structural ambiguities such as an over\-broad *Documentation & Knowledge* category, mixed classification axes within Software Engineering, and scattered ML\-related skills across Data and SE\. Through successive rounds of review and revision, the taxonomy was refined into the final 6 Major / 18 Sub\-category structure\.
Figure 4: Major\-category distribution of the 17,810 skills\. Software Engineering dominates \(62\.2%\), creating a $20\times$ imbalance with the smallest category \(Information Retrieval, 3\.1%\)\.

Figure 5: Major\-category distribution across data splits\. The three bars per category \(full library, train, eval\) show near\-identical proportions \($<1$ pp deviation\), confirming that the disjoint split preserves the natural category distribution\.
### B\.5 LLM\-based Skill Assignment
While tag\-based heuristic rules were effective for *discovering* the taxonomy structure, we found them insufficient for precise *assignment* of individual skills\. Three\-axis tags \(action, object, domain\) capture only surface\-level attributes and cannot distinguish skills whose true purpose is apparent only from the name and description\. For example, a skill tagged `implement / content / business_ops` is routed to Software Engineering by tag\-based rules, although its description reveals a marketing\-campaign planner that belongs in Business & Planning\.
To address these limitations, we classify all 17,810 skills using Claude Sonnet 4\.6 via the Anthropic API\.
##### System prompt\.
> ```
> You are a taxonomy classifier for AI agent skills. Each skill is a reusable instruction file that extends an LLM’s capabilities.
> Given a skill’s name and description, assign it to exactly one (Major, Sub-category) pair from the taxonomy below.
>
> TAXONOMY:
>
> ## Software Engineering
> - Development / Analysis & Testing
> - Infrastructure & DevOps / Security
> - Version Control / Documentation
>
> ## AI Agents
> - Agent Development / Orchestration / Evaluation
>
> ## Data & ML
> - Data Engineering / Data Analysis / ML Development
>
> ## Content Creation
> - Writing & Text / Visual & Media
>
> ## Business & Planning
> - Business Analysis / Project Management
>
> ## Information Retrieval
> - Technical Search / General Search
>
> CLASSIFICATION PRINCIPLE:
> - Classify by the DOMAIN in which the skill’s capability is used.
> - Every skill extends an agent’s capabilities, but classify by WHAT the extended capability is about, not the fact that an agent uses it.
> - Technical docs (README, API docs) -> SE / Docs.
> - Product planning (PRD, sprints, Jira) -> Business & Planning / Project Management.
> - Pure business analysis (market research) -> Business & Planning / Business Analysis.
> - Text/media as final product -> Content Creation.
> - AI Agents is ONLY for the agent system itself (prompts, routing, MCP servers, evaluation).
> - Information Retrieval is ONLY when the PRIMARY output is found/retrieved content.
>
> OUTPUT: a JSON array, one object per skill.
> {"id": "...", "major": "...", "sub": "..."}
> No markdown fences. No explanations.
> ```
##### User message format \(per batch of 50 skills\)\.
> ```
> Classify these skills:
> {id_1}|{name_1}: {description_first_300_chars}
> {id_2}|{name_2}: {description_first_300_chars}
> ...
> {id_50}|{name_50}: {description_first_300_chars}
> ```
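The batching described by the user message format can be sketched as follows\. This is a minimal Python sketch under stated assumptions: the skill records are dicts with `id`, `name`, and `description` keys, the helper names `build_user_messages` and `parse_assignments` are hypothetical, and the actual API call is left out entirely\.

```python
import json


def build_user_messages(skills, batch_size=50, max_desc_chars=300):
    """Render one user message per batch of skills.

    skills: list of dicts with "id", "name", "description" (assumed layout).
    Each description is cut to its first 300 characters, as in the format above.
    """
    messages = []
    for start in range(0, len(skills), batch_size):
        batch = skills[start:start + batch_size]
        lines = [
            f"{s['id']}|{s['name']}: {s['description'][:max_desc_chars]}"
            for s in batch
        ]
        messages.append("Classify these skills:\n" + "\n".join(lines))
    return messages


def parse_assignments(raw_reply):
    """Parse the JSON array requested by the system prompt into {id: (major, sub)}."""
    return {row["id"]: (row["major"], row["sub"]) for row in json.loads(raw_reply)}


# Toy usage with made-up skills.
toy_skills = [
    {"id": f"s{i}", "name": f"skill-{i}", "description": "Does something useful. " * 30}
    for i in range(120)
]
toy_messages = build_user_messages(toy_skills)
print(len(toy_messages), "messages")
print(toy_messages[0].splitlines()[1][:60])
```

In practice a reply that fails to parse, or that names a pair outside the taxonomy, would need to be retried or flagged; how the paper handles such cases is not stated\.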
### B\.6 Human Validation of Assignment
To verify the quality of LLM\-based assignment, a stratified random sample of 200 skills is drawn from the classified corpus, preserving the corpus\-level distribution across all six Major categories\. Experts independently judge whether each skill’s assigned Major and Sub\-category are appropriate\.
The average accuracy across the reviewers is 95\.5% for major categories and 92\.2% for sub\-categories\. Full three\-way agreement is reached on 91\.0% \(major\) and 84\.5% \(sub\) of the 200 items\. Table [9](https://arxiv.org/html/2605.05726#A2.T9) reports the per\-category breakdown\.
Table 9: Per\-category accuracy of LLM\-based taxonomy assignment, averaged over independent reviewers on a stratified sample of 200 skills\.
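The two summary statistics reported above \(reviewer\-averaged accuracy and full agreement\) can be computed as in the Python sketch below\. It assumes each reviewer's judgments are stored as a list of booleans \(True when the assigned category is judged appropriate\) and that "agreement" means all reviewers gave the same judgment for an item; the paper does not spell out either detail, and `summarize_reviews` is a hypothetical helper\.

```python
def summarize_reviews(judgments):
    """Return (mean accuracy across reviewers, fraction of unanimous items).

    judgments: list of per-reviewer lists of booleans, all of equal length.
    """
    n_items = len(judgments[0])
    # Accuracy of the LLM assignment according to each reviewer.
    per_reviewer = [sum(reviewer) / n_items for reviewer in judgments]
    mean_accuracy = sum(per_reviewer) / len(per_reviewer)
    # An item counts as full agreement when every reviewer gave the same verdict.
    unanimous = sum(1 for votes in zip(*judgments) if len(set(votes)) == 1)
    return mean_accuracy, unanimous / n_items


# Toy example: three reviewers, four items.
toy = [
    [True, True, False, True],
    [True, True, True, True],
    [True, False, False, True],
]
accuracy, agreement = summarize_reviews(toy)
print(f"mean accuracy={accuracy:.3f}, full agreement={agreement:.3f}")
```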
## Appendix C Query Generation
### C\.1 Query Generation Prompt
Each generation call receives the name and full body of the sampled skill\(s\) as `{skills_text}`, a random subset of the 165 GAIA \[[21](https://arxiv.org/html/2605.05726#bib.bib5)\] validation questions as `{seeds_text}`, and up to 30 previously generated queries for the same skill as `{prev_section}` to suppress near\-duplicate outputs\. There is no system prompt; the entire instruction is issued as a single user turn\. If the model judges the skill combination to be unrealistic, it outputs `None` and the combination is discarded\.
##### User prompt\.
> ```
> Write one realistic message that a user might send to an AI coding assistant. The message must naturally require ALL of the following skills to fulfill:
>
> {skills_text}
>
> Here are {N} examples of how real users talk to AI assistants. Notice the variety -- questions, commands, multi-step requests, short and long. Match this diversity of tone and structure:
>
> {seeds_text}
>
> ## Previously generated queries (DO NOT repeat or
> ## paraphrase these)
> {prev_queries}
>
> RULES:
> - Do NOT always start with "I’m" or "I need". Vary the opening: use questions ("How do I..."), commands ("Set up..."), descriptions ("Our team has..."), etc.
> - Do NOT mention skill names. The need must arise from the task description itself.
> - Do NOT explain, evaluate, or comment on the skills. Just write the user message.
> - The message must be standalone (no prior conversation context needed).
> - Your query must be DIFFERENT from any previously generated query listed above. Use a different scenario, domain, or framing.
> - If this skill combination makes no sense together in any realistic scenario, output exactly: None
>
> YOUR OUTPUT (one line only -- either a user message or None):
> ```
### C\.2 Multi\-Perspective LLM Review
Each query–skill pair that passes the leakage filter is evaluated by Claude Sonnet 4\.6 using three independent reviewer prompts issued in separate API calls\. Each prompt adopts a distinct evaluation persona \(Skill Coherence, Query Quality, and Benchmark Discriminability\) so that each dimension is assessed without anchoring bias from the others\. A query is marked *invalid* if two or more reviewers return an invalid verdict; a single invalid verdict routes the query to human expert review rather than discarding it outright\.
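The routing rule in the paragraph above can be written down as a small Python sketch\. The function name `route_query` and the outcome labels are hypothetical, and treating zero invalid verdicts as acceptance is an assumption implied by, but not stated in, the text\.

```python
import json


def route_query(reviewer_replies):
    """Combine three reviewer replies into a single decision.

    reviewer_replies: list of raw JSON strings shaped like the reviewer
    prompts request, e.g. {"verdict": "valid", "reasoning": "..."}.
    """
    verdicts = [json.loads(reply)["verdict"].lower() for reply in reviewer_replies]
    n_invalid = verdicts.count("invalid")
    if n_invalid >= 2:
        return "invalid"        # two or more reviewers reject: discard
    if n_invalid == 1:
        return "human_review"   # a single rejection: escalate to an expert
    return "valid"              # assumed: no rejection means the query is kept


replies = [
    '{"verdict": "valid", "reasoning": "Clearly needs the skill."}',
    '{"verdict": "invalid", "reasoning": "Too vague."}',
    '{"verdict": "valid", "reasoning": "Skill-specific."}',
]
print(route_query(replies))
```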
##### Reviewer 1 — Skill Coherence\.
> ```
> You are a benchmark quality reviewer evaluating skill-query alignment.
>
> SKILL(S):
> {skills_block}
>
> USER QUERY:
> {query}
>
> Does this query genuinely require the skill(s) listed above?
> - Is there a meaningful semantic connection between the skill description and the query?
> - If multiple skills are provided, does the query naturally require all of them?
> - Mark INVALID if the skill and query are unrelated or the combination is forced.
>
> Be conservative: only mark INVALID when clearly problematic. When in doubt, mark VALID.
>
> Respond in JSON only (no markdown):
> {"verdict": "valid" or "invalid", "reasoning": "1-2 sentences"}
> ```
##### Reviewer 2 — Query Quality\.
> ```
> You are a benchmark quality reviewer evaluating query realism and specificity.
>
> SKILL(S):
> {skills_block}
>
> USER QUERY:
> {query}
>
> Is this a well-formed, realistic user query?
> - Is the request specific and answerable?
> - Could this plausibly come from a real user in a professional setting?
> - Is the content technically coherent?
> - Mark INVALID if the query is too vague, unrealistic, or technically incoherent.
>
> Be conservative: only mark INVALID when clearly problematic. When in doubt, mark VALID.
>
> Respond in JSON only (no markdown):
> {"verdict": "valid" or "invalid", "reasoning": "1-2 sentences"}
> ```
##### Reviewer 3 — Benchmark Discriminability\.
> ```
> You are a benchmark quality reviewer evaluating whether a query can distinguish models that have access to the skill from those that do not.
>
> SKILL(S):
> {skills_block}
>
> USER QUERY:
> {query}
>
> Can this query discriminate between models with and without the skill?
> - Would a model lacking this specific skill fail to answer it well?
> - Is the query too generic -- answerable by any capable model without specialized skill knowledge?
> - Mark INVALID if the query can be answered adequately without the specific skill.
>
> Be conservative: only mark INVALID when clearly problematic. When in doubt, mark VALID.
>
> Respond in JSON only (no markdown):
> {"verdict": "valid" or "invalid", "reasoning": "1-2 sentences"}
> ```
## Appendix D Top\-$k$ Reranking Depth Ablation
Table [10](https://arxiv.org/html/2605.05726#A4.T10) reports NDCG@10 for three first\-stage retrievers across reranking depths $k \in \{10, 20, 50\}$ using Qwen3\-Reranker\-0\.6B and Qwen3\-Reranker\-8B\. Larger $k$ consistently improves NDCG@10 across all models and both rerankers\. We adopt $k = 20$ in the main experiments as a practical trade\-off between performance and computational cost\.
Table 10: NDCG@10 at varying reranking depths $k \in \{10, 20, 50\}$ for two rerankers across three first\-stage retrievers\. *Emb\. Only* denotes the embedding\-only baseline without reranking\.
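To make the role of the reranking depth concrete, the Python sketch below reranks only the top\-$k$ first\-stage candidates and scores the result with NDCG@10\. It assumes binary relevance and the standard $\log_2$ discount; the function names, the toy scores, and the choice to leave candidates below rank $k$ in their first\-stage order are illustrative assumptions rather than the paper's evaluation code\.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary gains and a log2 rank discount."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def rerank_top_k(first_stage_ids, rerank_score, k=20):
    """Reorder the top-k first-stage candidates by a reranker score.

    rerank_score: callable mapping a skill id to a relevance score
    (a stand-in for a cross-encoder reranker).
    """
    head = sorted(first_stage_ids[:k], key=rerank_score, reverse=True)
    return head + first_stage_ids[k:]


# Toy example: the gold skill "c" sits at first-stage rank 3.
first_stage = ["a", "b", "c", "d"]
toy_scores = {"a": 0.1, "b": 0.2, "c": 0.9, "d": 0.0}
gold = {"c"}

for depth in (2, 3):
    reranked = rerank_top_k(first_stage, toy_scores.get, k=depth)
    print(depth, reranked, round(ndcg_at_k(reranked, gold), 3))
```

In the toy run, depth 2 cannot promote the gold skill because it lies outside the reranked window, while depth 3 moves it to rank 1\. This is one mechanism by which a larger $k$ can raise NDCG@10, at the cost of scoring more candidates per query\.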
## Appendix E Document Representation Ablation
Table [11](https://arxiv.org/html/2605.05726#A5.T11) compares two document representation strategies across three embedding models\. *Name\+Desc* encodes only the skill name and description, while *Full* encodes the complete document text including the name, description, and Markdown body up to the model’s maximum sequence length\. Full\-text encoding consistently outperforms name\-and\-description only across all models, with gains of 1\.5–11\.4 NDCG@10 points\.
Table 11: Effect of document representation on NDCG@10\.
## Appendix F Model Maximum Sequence Lengths
Table [12](https://arxiv.org/html/2605.05726#A6.T12) reports the maximum input sequence length for each model evaluated in this work\. For each model, we use the maximum sequence length specified in the official model card or documentation\. When no explicit limit is stated, we use the model’s default context window\. Encoder\-only embedding models are limited to 512 tokens, which truncates the majority of skill documents in the corpus\. Decoder\-only models support substantially longer contexts of 8K–32K tokens, covering nearly all documents\. Detailed document length statistics are in Section [4\.2](https://arxiv.org/html/2605.05726#S4.SS2)\.
Table 12: Maximum supported sequence length per embedding and reranking model\.

| Method | Model | Max Tokens |
|---|---|---|
| *Embedding* | bge\-small\-en\-v1\.5 | 512 |
| | e5\-small\-v2 | 512 |
| | snowflake\-arctic\-embed\-s | 512 |
| | bge\-large\-en\-v1\.5 | 512 |
| | e5\-large\-v2 | 512 |
| | F2LLM\-v2\-80M | 40,960 |
| | harrier\-oss\-v1\-270m | 32,768 |
| | pplx\-embed\-v1\-0\.6b | 32,768 |
| | Qwen3\-Embedding\-0\.6B | 32,768 |
| | jina\-embeddings\-v5\-text\-small | 8,192 |
| | harrier\-oss\-v1\-0\.6b | 32,768 |
| | NV\-Embed\-v1 | 32,768 |
| | Octen\-Embedding\-8B | 32,768 |
| | Qwen3\-Embedding\-8B | 32,768 |
| | KaLM\-Gemma3\-12B | 8,192 |
| *Reranking* | jina\-reranker\-v2\-base\-multilingual | 1,024 |
| | Qwen3\-Reranker\-0\.6B | 32,768 |
| | Qwen3\-Reranker\-4B | 32,768 |
| | Qwen3\-Reranker\-8B | 32,768 |
## Appendix G Retrieval Prompts for Each Evaluated Model
For each model, we follow the query/document prompts recommended in the official model documentation, including model cards, READMEs, and reference implementations\. Three model families deviate from their default prompts\.
- Harrier\-OSS and Qwen3\-Embedding\. Most models use task\-neutral prompts such as `query:` or `passage:`, but both of these families default to a web\-search\-specific instruction\. We replace it with a skill\-retrieval instruction authored for this work, shown in Table [13](https://arxiv.org/html/2605.05726#A7.T13)\.
- Qwen3\-Reranker\. The default web\-search instruction is replaced with a skill\-search instruction we authored, shown in Table [13](https://arxiv.org/html/2605.05726#A7.T13)\.
Table [13](https://arxiv.org/html/2605.05726#A7.T13) lists the final query and document prompts used for each model\.
Table 13: Query and document prompts for each model\. *none* denotes no prompt applied\. † Prompt authored for this work\.

| Method | Model | Query Prompt | Doc Prompt |
|---|---|---|---|
| *Embedding* | bge\-small/large\-en\-v1\.5 | `Represent this sentence for searching relevant passages:` | none |
| | snowflake\-arctic\-embed\-s | `Represent this sentence for searching relevant passages:` | none |
| | e5\-small/large\-v2 | `query:` | `passage:` |
| | pplx\-embed\-v1\-0\.6b | `Query:` | `Document:` |
| | jina\-embeddings\-v5\-text\-small | `Query:` | `Document:` |
| | harrier\-oss\-v1\-270m/0\.6b | `Instruct: Given a skill search query, retrieve relevant skills that match the query\nQuery:` † | none |
| | Qwen3\-Embedding\-0\.6B/8B | `Instruct: Given a skill search query, retrieve relevant skills that match the query\nQuery:` † | none |
| | F2LLM\-v2\-80M | `Instruct: Given a question, retrieve passages that can help answer the question.\nQuery:` | none |
| | KaLM\-Gemma3\-12B | `Instruct: Given a query, retrieve documents that answer the query\nQuery:` | none |
| | NV\-Embed\-v1 | none | none |
| | Octen\-Embedding\-8B | none | \- |
| *Reranking* | jina\-reranker\-v2\-base\-multilingual | none | none |
| | Qwen3\-Reranker\-0\.6B/4B/8B | `Given a skill search query, judge whether the skill document is relevant and useful for the query` † | none |
## Appendix H Fine\-tuning Base Model Selection
To select the base model for fine\-tuning, we compared fine\-tuning Qwen3\-Embedding\-0\.6B against fine\-tuning harrier\-oss\-v1\-0\.6b, which is itself a derivative of Qwen3\-Embedding\. Table [14](https://arxiv.org/html/2605.05726#A8.T14) shows that both fine\-tuned variants achieve nearly identical performance across all metrics, with differences well within noise\. We therefore choose Qwen3\-Embedding as the fine\-tuning base to avoid double fine\-tuning and to maintain a cleaner experimental provenance\. The same rationale applies at the 8B scale, where we fine\-tune Qwen3\-Embedding\-8B in preference to Octen\-Embedding\-8B, which is also Qwen3\-Embedding\-based\.
Table 14: Fine\-tuning base model comparison at 0\.6B scale\. \(ft\) denotes fine\-tuned on SkillRet training data\.
## Appendix I SkillRet Fine\-tuning Details
We fine\-tune all SkillRet models on the released training split, comprising 10,123 skills and 63,259 synthetic queries yielding 127,190 positive query–skill pairs\. Training and evaluation skills are disjoint\.
##### Embedding models\.
We fine\-tune Qwen3\-Embedding\-0\.6B and Qwen3\-Embedding\-8B using MultipleNegativesRankingLoss\. Each query is paired with one positive skill document per training instance, so a query with multiple ground\-truth skills contributes multiple pairs, with remaining in\-batch examples serving as negatives\. Skill documents are encoded as `name | description | skill_md`, matching the evaluation document representation\. We apply the same skill\-retrieval query instruction used in evaluation \(Table [13](https://arxiv.org/html/2605.05726#A7.T13)\) to anchor queries during training\. Both models are trained for one epoch with maximum sequence length 8192, learning rate $2 \times 10^{-5}$, warmup ratio 0\.1, bf16 precision, and gradient checkpointing on 4 GPUs\. The 0\.6B model uses per\-device batch size 96 \(effective batch 384\), while the 8B model uses per\-device batch size 20 \(effective batch 80\)\.
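The pair construction and in\-batch\-negative objective described above can be sketched in plain Python\. This is a didactic re\-implementation of the general technique, not the library code used for training: `expand_pairs` and `in_batch_ranking_loss` are hypothetical names, and the cosine similarity with a scale of 20 is an assumed setting \(it matches the commonly documented sentence\-transformers defaults, but the paper does not report these values\)\.

```python
import math


def expand_pairs(query_to_skills):
    """One (query, positive skill) pair per ground-truth skill."""
    return [(q, s) for q, skills in query_to_skills.items() for s in skills]


def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def in_batch_ranking_loss(query_vecs, doc_vecs, scale=20.0):
    """Softmax cross-entropy where doc i is the positive for query i
    and every other document in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * _cosine(q, d) for d in doc_vecs]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


pairs = expand_pairs({"q1": ["skill_a", "skill_b"], "q2": ["skill_c"]})
print(pairs)

# Toy embeddings: aligned positives give a much lower loss than swapped ones.
queries = [[1.0, 0.0], [0.0, 1.0]]
print(in_batch_ranking_loss(queries, [[1.0, 0.0], [0.0, 1.0]]))
print(in_batch_ranking_loss(queries, [[0.0, 1.0], [1.0, 0.0]]))
```

One consequence of this setup is that two pairs from the same multi\-skill query can land in the same batch, in which case one query's other positive is treated as a negative; whether the authors deduplicate batches to avoid this is not stated\.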
##### Reranker model\.
We fine\-tune Qwen3\-Reranker\-0\.6B using the same yes/no token scoring interface used at inference time\. For each query–document pair, the model receives a chat\-formatted prompt containing the skill\-search instruction, query, and candidate skill document, and is trained with binary cross\-entropy on the probability of the “yes” token versus the “no” token\. Positive pairs come from the ground\-truth query–skill labels\. For negatives, we mine hard negatives using the fine\-tuned SkillRet\-Embedding\-0\.6B retriever\. For each query, we retrieve the top 60 candidates, skip the top 20 near\-neighbor candidates, and use up to 7 remaining non\-relevant candidates, filling any missing slots with random negatives\. The reranker is trained for one epoch with maximum sequence length 8192, learning rate $2 \times 10^{-5}$, warmup ratio 0\.1, bf16 precision, and gradient checkpointing on 8 GPUs\. Per\-device batch size is 96 \(effective batch 768\)\.
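The negative\-mining recipe above \(retrieve 60, skip the top 20, keep up to 7 non\-relevant candidates, pad with random negatives\) might look like the following Python sketch\. Several details are assumptions: candidates are taken in rank order from positions 21 to 60, random fill\-ins are drawn from the whole library minus positives and already chosen negatives, and the name `mine_negatives` and its parameters are hypothetical\.

```python
import random


def mine_negatives(ranked_ids, positive_ids, all_ids,
                   skip=20, depth=60, num_neg=7, rng=None):
    """Pick hard negatives for one query from a retriever's ranking.

    ranked_ids: skill ids sorted by retriever score, best first.
    positive_ids: set of ground-truth skill ids for the query.
    all_ids: every skill id in the training library (for random fill-ins).
    """
    rng = rng or random.Random(0)
    # Skip the nearest neighbours, which are more likely to be unlabeled
    # positives, and look only at ranks skip+1 .. depth.
    window = ranked_ids[skip:depth]
    negatives = [d for d in window if d not in positive_ids][:num_neg]

    missing = num_neg - len(negatives)
    if missing > 0:
        excluded = set(positive_ids) | set(negatives)
        pool = [d for d in all_ids if d not in excluded]
        negatives += rng.sample(pool, min(missing, len(pool)))
    return negatives


library = [f"s{i}" for i in range(100)]
ranking = library[:60]  # toy ranking: s0 is the top hit
print(mine_negatives(ranking, {"s0", "s21"}, library))
```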
## Appendix J MTEB Retrieval vs\. SkillRet Performance
Table [15](https://arxiv.org/html/2605.05726#A10.T15) lists MTEB Retrieval scores alongside SkillRet NDCG@10 for all evaluated models, sorted by MTEB Retrieval score in descending order\. A moderate positive correlation is visible in the overall trend, yet notable exceptions appear in both directions\. KaLM\-Gemma3\-12B, for instance, leads on MTEB at 75\.66 but achieves only 55\.38 on SkillRet, the largest drop among all models\. Conversely, some models with low MTEB scores remain highly competitive on SkillRet\. harrier\-oss\-v1\-0\.6b ranks 4th on MTEB at 70\.75 yet achieves the best off\-the\-shelf score on SkillRet at 66\.55, and encoder\-only models with as few as 33M parameters reach SkillRet scores in the range of 51–53, comparable to NV\-Embed\-v1 at 7B which scores 53\.12 despite a substantially higher MTEB score of 53\.98\. Together, these patterns suggest that skill retrieval is a distinct task from general information retrieval, requiring models to identify specific capability signals within long, multi\-sentence queries\.
Table 15: MTEB Retrieval score vs\. SkillRet NDCG@10\. Models sorted by MTEB Retrieval score in descending order\.
## Appendix K Per\-Sub\-category Retrieval Performance
Table [16](https://arxiv.org/html/2605.05726#A11.T16) provides a complete breakdown of NDCG@10 and Recall@10 across all 18 Sub\-categories for both base and fine\-tuned Qwen3\-Embedding models\. Sub\-categories are grouped by Major category and sorted by fine\-tuned 8B NDCG@10 within each group\. This table supports the intra\-Major hard\-negative analysis in §[5\.3](https://arxiv.org/html/2605.05726#S5.SS3.SSS0.Px3)\.
Table 16: Per\-Sub\-category NDCG@10 and Recall@10 for base and fine\-tuned Qwen3\-Embedding models\. Sub\-categories grouped by Major category\. $n$ = number of evaluation queries per Sub\-category\.
## Appendix L Qualitative Visualization of Sentence Erasure Importance
To complement the aggregate masking results in Table [5](https://arxiv.org/html/2605.05726#S5.T5), we provide a qualitative example of the sentence\-level erasure analysis in Fig\. [6](https://arxiv.org/html/2605.05726#A12.F6)\. For each sentence in the query, we measure the similarity drop after replacing that sentence with \[MASK\]\. A larger drop indicates that the sentence contributes more strongly to retrieving the gold skill\.
Figure 6: Sentence\-level erasure importance for an example query\. Each bar shows the similarity drop after replacing a sentence with \[MASK\]\. The trained model concentrates more importance on the skill\-relevant sentence, whereas the base model assigns importance more diffusely across the query\.
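The erasure measurement described above can be sketched in Python as follows\. To keep the example self\-contained it uses a toy bag\-of\-words embedding \(`toy_embed`\) in place of a real embedding model; that stand\-in, the helper names, and the example query are all assumptions for illustration only\.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    dot = sum(count * v[token] for token, count in u.items() if token in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def erasure_importance(sentences, skill_text, embed=toy_embed, mask="[MASK]"):
    """Similarity drop to the gold skill when each sentence is masked in turn."""
    skill_vec = embed(skill_text)
    base = cosine(embed(" ".join(sentences)), skill_vec)
    drops = []
    for i in range(len(sentences)):
        masked = sentences[:i] + [mask] + sentences[i + 1:]
        drops.append(base - cosine(embed(" ".join(masked)), skill_vec))
    return drops


query = [
    "Our team ships a web app every week.",
    "Please write unit tests for the parser.",
    "Thanks a lot!",
]
for sentence, drop in zip(query, erasure_importance(query, "unit tests parser coverage")):
    print(f"{drop:+.3f}  {sentence}")
```

With a real model, `embed` would return dense vectors and `cosine` would operate on them, but the loop over masked variants stays the same\. A negative drop means that masking the sentence made the query more similar to the gold skill, which is the signature of a distractor sentence\.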
## Appendix M Broader Impacts
SkillRet is intended to support research on reliable skill retrieval for LLM agents\. By isolating retrieval quality from downstream execution, it provides a controlled benchmark for studying how well models select relevant procedural knowledge from large skill libraries\. This may help reduce context cost, improve reproducibility, and diagnose retrieval failures across domains\. At the same time, strong retrieval performance does not imply safe or correct end\-to\-end agent behavior\. Retrieved skills may be outdated, unsafe, misapplied, or incorrectly composed with other skills\. Therefore, SkillRet should not be used as evidence that a deployed agent system is safe or reliable\. The dataset is derived from public GitHub\-hosted skills and synthetic queries\. It is intended for retrieval evaluation and model development, not for profiling individual authors, inferring personal attributes, or certifying downstream agent safety\. Practical deployments should include additional safeguards such as provenance checks, permission controls, sandboxing, and human oversight for high\-impact actions\.