SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants

arXiv cs.CL Papers

Summary

SkillChain automates the lifecycle of per-intent skill specifications for image-based e-commerce AI assistants, improving response quality and user engagement through iterative refinement and routing alignment.

arXiv:2606.12984v1 Announce Type: new

# SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants
Source: [https://arxiv.org/html/2606.12984](https://arxiv.org/html/2606.12984)
Yimin Hu¹, Mengtao Xu¹, Hao Guo¹, Yuheng Song¹, Xiaoyong Zhu¹,†, Bo Zheng¹,†

¹Alibaba Group

{hym408321, mengtao.xmt, gh225907, songyuheng.syh}@taobao.com, {xiaoyong.z, bozheng}@alibaba-inc.com

###### Abstract

Image-based AI assistants are now deployed at production scale on e-commerce platforms, where a single uploaded image can trigger fundamentally different user intents: product search, style recommendation, visual encyclopedia, or utility tool calls, each demanding its own response format, tool invocation, and domain knowledge. Without per-intent behavioral constraints, LLM-based systems conflate these heterogeneous modes and fall short of domain quality standards, while the breadth and dynamism of the intent space render manual engineering infeasible. To address this, we present SkillChain, which closes the production feedback loop on Skill evolution, automating the lifecycle of *Skills* through three stages: *Skill Creator* for bootstrapping from task specs and trajectories, *Route Optimizer* for routing alignment, and *Body Refiner* for iterative Skill Body refinement via dual-path LLM-Judge evaluation. Deployed on a production-scale e-commerce image assistant, SkillChain substantially improves aggregate response quality, with the strongest gains on structural compliance and content quality; a one-week online A/B experiment further confirms significant gains in user engagement, content consumption, and long-term retention.

† Corresponding author

## 1 Introduction

E-commerce platforms increasingly deploy image-based AI assistants powered by large language models (Brown et al., [2020](https://arxiv.org/html/2606.12984#bib.bib43); OpenAI et al., [2024](https://arxiv.org/html/2606.12984#bib.bib40); Touvron et al., [2023](https://arxiv.org/html/2606.12984#bib.bib42)) that allow users to upload a photo and receive personalized responses. Visual inputs carry inherently ambiguous intent: the same image may prompt a product search, a style comparison, an encyclopedia lookup, or a utility tool call, each demanding a distinct response format, tool set, and domain vocabulary.

This diversity poses three interconnected production challenges.

- (C1) No per-intent behavioral specifications. Without explicit constraints per intent, LLMs generate free-form responses that mix incompatible formats (such as product cards in encyclopedia replies), invoke tools incorrectly, and fail domain quality standards.
- (C2) Routing drift under distribution shift. Visual intent patterns continuously evolve; well-designed intent-to-specification mappings degrade over time, yet continuous re-engineering by hand is infeasible at scale.
- (C3) Specification decay without production feedback. Specifications fixed at creation time silently accumulate deficiencies with no mechanism for automated diagnosis and repair.

We propose SkillChain, which addresses C1–C3 via *Skills*, declarative per-intent specifications covering tool calls, rich-media composition, writing constraints, and domain knowledge, across three coupled stages: Stage 1 (Skill Creator) bootstraps Skills from task specifications and user trajectories, gating quality through human reflection (C1); Stage 2 (Route Optimizer) mines routing failures and applies update/merge/discard operations to realign Descriptions with evolving traffic (C2); Stage 3 (Body Refiner) runs dual-path evaluation and cross-sample attribution to identify and repair Body deficiencies (C3).

Deployed on a production\-scale e\-commerce image assistant, SkillChain achieves the highest aggregate LLM Judge score across all evaluated configurations, with substantial gains in structural compliance and content quality; online A/B results confirm significant engagement and retention improvements over the production Stage 2 baseline\. Critically, the pipeline is*unidirectional*: each stage targets a disjoint component, so corrections never propagate backward, a property no prior skill system achieves\.

Our main contributions are:

1. We identify routing and behavioral drift as stage-specific production Skill failure modes, each addressed by a dedicated chain link.
2. We present SkillChain, the first deployed image-based framework that closes all three Skill feedback loops in a single self-evolving e-commerce lifecycle with a stage-wise monotone quality guarantee.
3. Production validation across five visual intent categories confirms strictly additive stage gains in the aggregate Judge score, with online A/B evidence over the deployed baseline.

## 2 Related Work

LLM-based autonomous agents have seen rapid progress across diverse domains (Wang et al., [2024b](https://arxiv.org/html/2606.12984#bib.bib20); Xi et al., [2023](https://arxiv.org/html/2606.12984#bib.bib21)); we organize related work into the four threads most pertinent to SkillChain.

#### Skill lifecycle and self\-evolving agent systems\.

Voyager (Wang et al., [2024a](https://arxiv.org/html/2606.12984#bib.bib35)) and Ghost in the Minecraft (Zhu et al., [2023](https://arxiv.org/html/2606.12984#bib.bib46)) pioneered reusable code-block and sub-goal skill libraries for open-world exploration. Building on this, AutoSkill (Yang et al., [2026](https://arxiv.org/html/2606.12984#bib.bib8)), SkillForge (Liu et al., [2026b](https://arxiv.org/html/2606.12984#bib.bib11)), SkillClaw (Ma et al., [2026](https://arxiv.org/html/2606.12984#bib.bib6)), CoEvoSkills (Zhang et al., [2026](https://arxiv.org/html/2606.12984#bib.bib4)), and EvoSkill (Alzubi et al., [2026](https://arxiv.org/html/2606.12984#bib.bib15)) evolve skills from interaction traces via failure-driven refinement, trace aggregation, or co-evolved surrogate verifiers; Trace2Skill (Ni et al., [2026](https://arxiv.org/html/2606.12984#bib.bib10)), XSkill (Jiang et al., [2026a](https://arxiv.org/html/2606.12984#bib.bib13)), and WebXSkill (Wang et al., [2026b](https://arxiv.org/html/2606.12984#bib.bib19)) distill trajectory pools; SkillRL (Xia et al., [2026](https://arxiv.org/html/2606.12984#bib.bib9)), ARISE (Li et al., [2026](https://arxiv.org/html/2606.12984#bib.bib16)), and SkillOS (Ouyang et al., [2026](https://arxiv.org/html/2606.12984#bib.bib5)) apply RL-based curation; SkillX (Wang et al., [2026a](https://arxiv.org/html/2606.12984#bib.bib7)), Graph of Skills (Liu et al., [2026a](https://arxiv.org/html/2606.12984#bib.bib17)), and SkillNet (Liang et al., [2026](https://arxiv.org/html/2606.12984#bib.bib14)) construct structured skill repositories; the Dual-Granularity Skill Bank (Tu et al., [2026](https://arxiv.org/html/2606.12984#bib.bib18)) maintains coarse-to-fine skill abstractions for agentic RL. Jiang et al. ([2026b](https://arxiv.org/html/2606.12984#bib.bib12)) and Sumers et al. ([2024](https://arxiv.org/html/2606.12984#bib.bib22)) argue that agent skills constitute a capability class distinct from tool use, generalizing across tasks and modalities. SkillChain differs in three key respects: (1) Skills are *declarative* behavioral specifications rather than executable code or memory; (2) it adds an explicit *routing optimization* stage absent from all prior work; (3) it is validated at industrial e-commerce scale.

#### Tool\-augmented LLMs\.

ReAct (Yao et al., [2023](https://arxiv.org/html/2606.12984#bib.bib23)) and Reflexion (Shinn et al., [2023](https://arxiv.org/html/2606.12984#bib.bib24)) established reasoning-action interleaving and verbal self-improvement for tool use; ToolFormer (Schick et al., [2023](https://arxiv.org/html/2606.12984#bib.bib33)) and Gorilla (Patil et al., [2024](https://arxiv.org/html/2606.12984#bib.bib32)) further extended LLM tool use to large API libraries. SkillChain instead manages *specifications* governing tool invocation and continuously refines them from production feedback.

#### Automatic prompt optimization\.

OPRO (Yang et al., [2024](https://arxiv.org/html/2606.12984#bib.bib56)), APE (Zhou et al., [2023](https://arxiv.org/html/2606.12984#bib.bib57)), DSPy (Khattab et al., [2024](https://arxiv.org/html/2606.12984#bib.bib58)), and TextGrad (Yuksekgonul et al., [2024](https://arxiv.org/html/2606.12984#bib.bib59)) optimize prompts via LLM feedback or text-based differentiation; ExpeL (Zhao et al., [2024](https://arxiv.org/html/2606.12984#bib.bib26)) distills execution traces into reusable templates. SkillChain shares the data-driven improvement motivation but operates over structured, intent-specific Skill Bodies with production routing signals.

#### LLM\-as\-Judge evaluation\.

G-Eval (Liu et al., [2023](https://arxiv.org/html/2606.12984#bib.bib60)), MT-Bench (Zheng et al., [2023](https://arxiv.org/html/2606.12984#bib.bib61)), Constitutional AI (Bai et al., [2022](https://arxiv.org/html/2606.12984#bib.bib63)), and HELM (Liang et al., [2023](https://arxiv.org/html/2606.12984#bib.bib62)) establish LLM-based quality assessment, AI-driven constraint enforcement, and holistic multi-metric evaluation, all informing our four-dimensional scoring design. We adopt the LLM-as-Judge paradigm but ground evaluation in Skill Body constraints, enabling feedback directly actionable for Body refinement.

![Refer to caption](https://arxiv.org/html/2606.12984v1/x1.png)

Figure 1: An overview of the SkillChain three-stage framework. (a) Skill Creator bootstraps Skills from task specifications and user trajectories via an engineer loop, then gates quality through human reflection before deployment. (b) Route Optimizer continuously mines routing failures in production and applies update/merge/discard operations to keep Skill Description boundaries aligned with real traffic. (c) Body Refiner evaluates responses through a dual-path pipeline and aggregates cross-sample signals to drive iterative Skill Body refinement.

## 3 Methodology

Production Skills degrade along three independent dimensions: initialization quality, routing accuracy, and behavioral compliance. Because the Description ($d$) governing routing and the Body ($b$) governing response quality are architecturally decoupled, their feedback loops close sequentially without interference, embodying the *unidirectional chain* design of SkillChain.

Following Jiang et al. ([2026b](https://arxiv.org/html/2606.12984#bib.bib12)), a Skill is a tuple $s = (d, b, C_s, O_d)$: $d$ governs routing, $b$ specifies format, tool, and constraint rules, $C_s$ provides static knowledge and examples, and $O_d$ lists dynamic operators invoked at inference time. All Skills are versioned in a *Skill Bank* $\mathcal{B}_k$; SkillChain automates the full lifecycle across three coupled stages, as illustrated in Figure [1](https://arxiv.org/html/2606.12984#S2.F1).
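The Skill tuple and the versioned Skill Bank can be pictured as two small immutable records. The sketch below is illustrative only: the field names, the `name` key, and the rule that every change bumps the version by one are assumptions made for the example, not details given in the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Skill:
    name: str                                 # hypothetical lookup key, not part of the tuple
    description: str                          # d: governs routing
    body: str                                 # b: format, tool, and constraint rules
    static_components: Tuple[str, ...] = ()   # C_s: static knowledge and examples
    dynamic_operators: Tuple[str, ...] = ()   # O_d: operators invoked at inference time


@dataclass(frozen=True)
class SkillBank:
    version: int = 0                          # k in B_k
    skills: Tuple[Skill, ...] = ()

    def get(self, name: str) -> Optional[Skill]:
        """Return the Skill with this name, or None if it is absent."""
        return next((s for s in self.skills if s.name == name), None)

    def with_skill(self, skill: Skill) -> "SkillBank":
        """Return bank k+1 with `skill` added, replacing any same-named Skill."""
        kept = tuple(s for s in self.skills if s.name != skill.name)
        return SkillBank(version=self.version + 1, skills=kept + (skill,))
```

Keeping both records immutable means an earlier bank version stays available for comparison after an update, which is what the acceptance gates in later stages rely on.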

### 3.1 Stage 1: Skill Creation

#### LLM\-Driven Bootstrap\.

Stage 1 bootstraps a Skill from three inputs: a Task Specification, User Trajectories (real interaction sequences characterizing the target intent), and the current Skill Bank $\mathcal{B}_k$ for knowledge reuse. An LLM acting as Skill Creator generates an initial draft conditioned on these inputs, using reference Skills retrieved from $\mathcal{B}_k$ (Lewis et al., [2020](https://arxiv.org/html/2606.12984#bib.bib47)) as in-context exemplars. The draft is then refined through an *Engineer Loop* (Wu et al., [2023](https://arxiv.org/html/2606.12984#bib.bib27)) that validates Static Components and Dynamic Operators against sampled queries, terminating when all declarations pass correctness checks.
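The control flow of the Engineer Loop can be sketched as a validate-and-revise cycle. This is a minimal sketch under stated assumptions: `validate` and `revise` are hypothetical callables standing in for the correctness checks and the LLM revision step, and the `max_rounds` cap is added for safety rather than taken from the paper.

```python
from typing import Callable, List


def engineer_loop(
    draft: str,
    validate: Callable[[str], List[str]],      # hypothetical: returns a list of error messages
    revise: Callable[[str, List[str]], str],   # hypothetical: returns a revised draft
    max_rounds: int = 5,                       # assumed safety cap, not from the paper
) -> str:
    """Revise a Skill draft until validation reports no errors."""
    for _ in range(max_rounds):
        errors = validate(draft)
        if not errors:
            return draft
        draft = revise(draft, errors)
    # Check the last revision before giving up.
    if validate(draft):
        raise RuntimeError("draft still fails correctness checks after max_rounds")
    return draft
```

In a toy setting, a validator that requires a tool declaration and a reviser that appends one would terminate after a single revision.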

#### Human Reflection Gate\.

Before deployment, the Optimized Skill is tested against a curated query sample and reviewed by domain experts for Body content quality (Ouyang et al., [2022](https://arxiv.org/html/2606.12984#bib.bib41)); only approved Skills are versioned into $\mathcal{B}_{k+1}$. This gate catches Body-level deficiencies that programmatic checks cannot capture, but it cannot anticipate routing drift under the full production traffic distribution, the coverage problem that Stage 2 specifically targets. Different intent types impose structurally distinct Body constraints; Table [1](https://arxiv.org/html/2606.12984#S3.T1) summarizes these per-intent requirements.

Table 1: Scene-specific Skill design principles applied during Stage 1.

### 3.2 Stage 2: Route Optimization

As traffic evolves, Skill Descriptions that were precise at creation time drift out of alignment with user intents matched by the backbone MLLM (Qwen Team, [2025](https://arxiv.org/html/2606.12984#bib.bib1)).

#### Routing Failure Analysis\.

To detect and repair this drift, a Judge LLM compares sampled routing decisions against human-annotated ground truth, collecting failures into a routing failure pool. An LLM classifier then assigns each failure one of three root-cause labels (Appendix [E.1](https://arxiv.org/html/2606.12984#A5.SS1)): *Boundary ambiguity*, where the Description does not clearly include or exclude the query; *Missing Skill*, where no existing Skill covers the intent; or *Visual parsing error*, which is escalated to the upstream parser. In our deployment, boundary ambiguity accounts for the large majority of failures; once the initial Skill Bank covers core intent categories, missing Skill cases are rare, confirming that Description boundary maintenance is the dominant routing cost.
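The failure-mining step can be sketched as a filter followed by a tally. The sketch simplifies in two labeled ways: the Judge LLM comparison is replaced by a plain equality check against the gold label, and the root-cause classifier is a hypothetical callable passed in by the caller. The label strings are illustrative names for the three causes listed above.

```python
from collections import Counter
from typing import Callable, Iterable, List, NamedTuple


class RoutingSample(NamedTuple):
    query_id: str
    routed_skill: str   # Skill chosen by the router
    gold_skill: str     # human-annotated ground truth


# Illustrative label names for the three root causes.
ROOT_CAUSES = ("boundary_ambiguity", "missing_skill", "visual_parsing_error")


def mine_failures(samples: Iterable[RoutingSample]) -> List[RoutingSample]:
    """Collect routing decisions that disagree with the gold label."""
    return [s for s in samples if s.routed_skill != s.gold_skill]


def tally_root_causes(
    failures: Iterable[RoutingSample],
    classify: Callable[[RoutingSample], str],   # hypothetical stand-in for the LLM classifier
) -> Counter:
    """Count failures per root cause, rejecting labels outside the taxonomy."""
    counts: Counter = Counter()
    for failure in failures:
        label = classify(failure)
        if label not in ROOT_CAUSES:
            raise ValueError(f"unknown root cause: {label}")
        counts[label] += 1
    return counts
```

A tally dominated by one label points to where repair effort should go, mirroring the observation that boundary ambiguity dominates in deployment.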

#### Iterative Description Update\.

Stage 2 runs iteratively over a held-out validation set, accepting an update only when $\mathrm{F1}(\mathcal{B}_{k+1}) \geq \mathrm{F1}(\mathcal{B}_k)$, which provides a monotone quality guarantee. Based on mined failure patterns, each round applies one of three operations to $\mathcal{B}_k$ (Yang et al., [2026](https://arxiv.org/html/2606.12984#bib.bib8)):

$$
\begin{aligned}
\text{Update:}\quad & \mathcal{B}_{k+1} = \mathcal{B}_k \text{ with } d_s \leftarrow d_s' \\
\text{Merge:}\quad & \mathcal{B}_{k+1} = (\mathcal{B}_k \setminus \{s_i, s_j\}) \cup \{s'\} \\
\text{Discard:}\quad & \mathcal{B}_{k+1} = \mathcal{B}_k \setminus \{s\}
\end{aligned}
\tag{1}
$$
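The three operations and the acceptance rule can be sketched over a simplified bank that maps Skill names to Description text. This is a sketch under assumptions: Bodies are omitted, the bank is a plain dictionary, and `score` is a hypothetical callable standing in for Routing F1 on the held-out validation set.

```python
from typing import Callable, Dict

Bank = Dict[str, str]   # Skill name -> Description text (Bodies omitted for brevity)


def update(bank: Bank, name: str, new_description: str) -> Bank:
    """Update: replace one Skill's Description, leaving the rest untouched."""
    if name not in bank:
        raise KeyError(name)
    candidate = dict(bank)
    candidate[name] = new_description
    return candidate


def merge(bank: Bank, name_i: str, name_j: str, merged_name: str, merged_description: str) -> Bank:
    """Merge: remove two Skills and add a single Skill covering both."""
    candidate = {k: v for k, v in bank.items() if k not in (name_i, name_j)}
    candidate[merged_name] = merged_description
    return candidate


def discard(bank: Bank, name: str) -> Bank:
    """Discard: remove one Skill from the bank."""
    return {k: v for k, v in bank.items() if k != name}


def accept_if_not_worse(current: Bank, candidate: Bank, score: Callable[[Bank], float]) -> Bank:
    """Keep the candidate only when its validation score does not drop."""
    return candidate if score(candidate) >= score(current) else current
```

Each operation returns a new dictionary instead of mutating its input, so a rejected candidate leaves the current bank exactly as it was.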

### 3.3 Stage 3: Skill Body Refinement

Even after routing is repaired, Skill Bodies accumulate deficiencies that only surface at production scale: format violations, content gaps, tool misuse\. Individual\-sample feedback is too noisy to act on: per\-query variation in content and phrasing causes high variance in any single Judge evaluation, and acting on individual scores risks overcorrecting to idiosyncratic cases rather than systematic weaknesses\.

#### Dual\-Path Evaluation\.

A *dual-path evaluation* pipeline covers complementary failure modes: a rule-based path for deterministic structural checks and an LLM Judge path (Liu et al., [2023](https://arxiv.org/html/2606.12984#bib.bib60); Zheng et al., [2023](https://arxiv.org/html/2606.12984#bib.bib61)) that scores each response on four dimensions: Tool Call Rationality (TCR), Card Composition Compliance (CCC), Content Quality (CQ), and Constraint Adherence (CA), with natural-language rationales grounding feedback in Skill Body constraints (Appendix [E.2](https://arxiv.org/html/2606.12984#A5.SS2)). Neither path alone captures the full failure space.
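A dual-path evaluator can be sketched as deterministic rules run alongside a scoring callable. Everything concrete here is an assumption for illustration: the `Response` fields, the single example rule (no product cards in an encyclopedia reply, echoing the format-mixing failure described in the introduction), and the `judge` stub that stands in for the LLM Judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

DIMENSIONS = ("TCR", "CCC", "CQ", "CA")


@dataclass
class Response:
    intent: str       # hypothetical intent label
    text: str
    card_count: int   # number of product cards in the reply


Rule = Callable[[Response], Optional[str]]      # returns a violation message or None
Judge = Callable[[Response], Dict[str, float]]  # hypothetical stand-in for the LLM Judge


def no_cards_in_encyclopedia(resp: Response) -> Optional[str]:
    """Example structural rule: encyclopedia replies carry no product cards."""
    if resp.intent == "encyclopedia" and resp.card_count > 0:
        return "product card in encyclopedia reply"
    return None


def evaluate(resp: Response, rules: List[Rule], judge: Judge) -> Dict[str, object]:
    """Run both paths and return rule violations alongside Judge scores."""
    violations = [msg for msg in (rule(resp) for rule in rules) if msg is not None]
    scores = judge(resp)
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    return {"violations": violations, "scores": scores}
```

Reporting the two paths side by side keeps a deterministic violation visible even when the Judge scores for the same response look acceptable.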

#### Cross\-Sample Attribution\.

Rather than acting on individual scores, Stage 3 discretizes each Judge score into three tiers (Good / Average / Poor) and computes tier distributions per Skill across all attributed responses; a Skill is flagged for dimension $d$ (here $d$ indexes a Judge dimension, not the Description) when

$$
\Pr_i\left[\tau_d(J_i^s) = \text{Poor}\right] > \theta_d \tag{2}
$$

where $J_i^s$ is the Judge score of the $i$-th response attributed to Skill $s$, $\tau_d(\cdot)$ maps a score to its tier, and $\theta_d$ is a dimension-specific threshold calibrated empirically to reflect each dimension's natural score variance and the minimum signal-to-noise ratio needed to distinguish genuine Skill-level weaknesses from per-query fluctuations (see Appendix [D](https://arxiv.org/html/2606.12984#A4)). An LLM then synthesizes these distributions and per-sample rationales into structured directives: Skill Suggestions, Rule Violations, and Ideal Gaps (Yuksekgonul et al., [2024](https://arxiv.org/html/2606.12984#bib.bib59); Zhao et al., [2024](https://arxiv.org/html/2606.12984#bib.bib26)), targeting recurring deficiencies rather than one-off failures.
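The flagging rule can be sketched as a tier mapping followed by a Poor-rate comparison per dimension. The tier cut-offs (40 and 70) and the threshold values used in the usage note are placeholders chosen for illustration; the paper calibrates thresholds empirically and does not state the cut-offs here.

```python
from typing import Dict, List, Sequence


def tier(score: float, poor_below: float = 40.0, good_from: float = 70.0) -> str:
    """Map a Judge score in [0, 100] to a tier (cut-offs are assumed values)."""
    if score < poor_below:
        return "Poor"
    if score >= good_from:
        return "Good"
    return "Average"


def poor_rate(scores: Sequence[float]) -> float:
    """Fraction of responses whose score falls in the Poor tier."""
    if not scores:
        raise ValueError("no scores attributed to this Skill")
    return sum(tier(s) == "Poor" for s in scores) / len(scores)


def flag_dimensions(
    scores_by_dim: Dict[str, Sequence[float]],   # per-dimension scores for one Skill
    thresholds: Dict[str, float],                # theta_d per dimension (assumed values)
) -> List[str]:
    """Return the dimensions whose Poor rate exceeds its threshold."""
    return [d for d, scores in scores_by_dim.items() if poor_rate(scores) > thresholds[d]]
```

For example, with scores `{"CCC": [30, 35, 80, 90], "CQ": [75, 80, 85, 30]}` and a threshold of 0.3 on both dimensions, only CCC is flagged: its Poor rate is 0.5, while CQ's single low score gives 0.25 and is treated as per-query noise.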

#### Update Gate\.

Refined Skills re-enter the Human Reflection gate before deployment, and Body edits are accepted only when $\bar{J}(\mathcal{B}_{k+1}) \geq \bar{J}(\mathcal{B}_k)$, completing the monotone quality guarantee. Because each chain link modifies a disjoint component, where Stage 2 only updates $d$ and Stage 3 only updates $b$, the two stages can run over the same live Skill Bank in either sequential or alternating fashion without invalidating each other's updates.
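Read literally, the two acceptance rules chain into the following statement. This is a reading of the gates as written, assuming each gate is evaluated on a fixed validation set; it is not a derivation given in the paper.

$$
\begin{aligned}
\text{Stage 2, } n \text{ accepted rounds:}\quad & \mathrm{F1}(\mathcal{B}_{k+n}) \geq \mathrm{F1}(\mathcal{B}_{k+n-1}) \geq \dots \geq \mathrm{F1}(\mathcal{B}_k) \\
\text{Stage 3, } m \text{ accepted rounds:}\quad & \bar{J}(\mathcal{B}_{k+m}) \geq \bar{J}(\mathcal{B}_{k+m-1}) \geq \dots \geq \bar{J}(\mathcal{B}_k)
\end{aligned}
$$

Each gate bounds only its own metric: the Stage 3 rule on $\bar{J}$ does not by itself constrain $\mathrm{F1}$, and the Stage 2 rule on $\mathrm{F1}$ does not by itself constrain $\bar{J}$.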

## 4 Experiments

### 4.1 Experimental Setup

#### Systems\.

We compare five configurations: NoSkill (production baseline without Skill constraints), ManualSkill (human-crafted Skills, no SkillChain pipeline), and three cumulative SkillChain variants, each adding one stage to the previous: Stage 1 (Skill creation), Stage 2 (+ routing optimization), and Stage 3 (+ Body refinement, full SkillChain). Full system descriptions are in Appendix [D](https://arxiv.org/html/2606.12984#A4).

#### Evaluation set\.

The offline set comprises 1,000 production queries sampled with intent-stratified sampling across all five intent categories listed in Table [1](https://arxiv.org/html/2606.12984#S3.T1). The online A/B experiment runs for one week comparing Stage 3 against the deployed Stage 2 baseline; the deployment rationale is discussed in Appendix [D](https://arxiv.org/html/2606.12984#A4).
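Intent-stratified sampling can be sketched as drawing a fixed quota from each intent's query pool. The equal split across intents is an assumption made for the example (the paper does not state the allocation), and the intent names and query identifiers in the usage note are placeholders.

```python
import random
from typing import Dict, List, Tuple


def stratified_sample(
    pool: Dict[str, List[str]],   # intent -> candidate query ids
    total: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Draw `total` queries, split as evenly as possible across intents."""
    rng = random.Random(seed)
    intents = sorted(pool)
    base, extra = divmod(total, len(intents))
    sample: List[Tuple[str, str]] = []
    for i, intent in enumerate(intents):
        quota = base + (1 if i < extra else 0)   # spread any remainder over the first intents
        if quota > len(pool[intent]):
            raise ValueError(f"not enough queries for intent {intent!r}")
        sample.extend((intent, q) for q in rng.sample(pool[intent], quota))
    return sample
```

With five placeholder intents of 500 queries each, `stratified_sample(pool, 1000)` draws 200 per intent, and a fixed seed makes the draw repeatable.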

#### Metrics\.

Offline: four LLM-Judge dimensions (Liu et al., [2023](https://arxiv.org/html/2606.12984#bib.bib60); Zheng et al., [2023](https://arxiv.org/html/2606.12984#bib.bib61)) (Tool Call Rationality (TCR), Card Composition Compliance (CCC), Content Quality (CQ), and Constraint Adherence (CA), each in $[0, 100]$ after score normalization) and Routing Accuracy F1 against human-annotated ground truth (Liang et al., [2023](https://arxiv.org/html/2606.12984#bib.bib62)). Online: Interactive UV, Full-read Rate, Avg. Dwell Time, and 7-day Return Rate. Formal definitions and scoring rubrics are in Appendix [C](https://arxiv.org/html/2606.12984#A3).
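A routing F1 against gold labels can be computed per Skill and then averaged. The sketch below uses macro averaging, which is an assumption: the formal definition is deferred to the appendix, and a micro-averaged or weighted variant may be what the paper reports.

```python
from typing import Dict, Sequence


def per_skill_f1(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """F1 for each Skill label, treating routing as multi-class classification."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores: Dict[str, float] = {}
    for skill in sorted(set(gold) | set(pred)):
        tp = sum(g == skill and p == skill for g, p in zip(gold, pred))
        fp = sum(g != skill and p == skill for g, p in zip(gold, pred))
        fn = sum(g == skill and p != skill for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[skill] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-Skill F1 (averaging scheme is an assumption)."""
    scores = per_skill_f1(gold, pred)
    return sum(scores.values()) / len(scores) if scores else 0.0
```

Per-Skill scores are kept separate so that a drop in the average can be traced to the specific Description whose boundary has drifted.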

### 4.2 Main Results

Table [2](https://arxiv.org/html/2606.12984#S4.T2) reports offline LLM Judge scores and Routing F1 for all five system configurations, with each SkillChain row adding one stage cumulatively. Each stage addresses one challenge and contributes in the expected direction.

| Method | Added Component | TCR (↑) | CCC† (↑) | CQ (↑) | CA (↑) | Avg (↑) | Routing F1 (↑) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NoSkill | Production baseline (no Skill constraint) | 54.4 | 49.1 | 69.6 | 52.8 | 59.1 | — |
| ManualSkill | Human-designed skill baseline | **61.4** | 56.1 | 70.9 | **65.3** | 64.9 | 61.5 |
| SkillChain S1 | Auto Skill creation + human reflection | 46.7 | 58.9 | 76.6 | 53.7 | 62.5 | 65.5 |
| SkillChain S1+S2 | + Routing failure mining & Description update | 55.3 | 63.4 | 76.5 | 54.1 | 67.2 | **78.0** |
| SkillChain (Full) | + Dual-path evaluation & Body refinement | 61.3 | **72.2** | **82.3** | 62.7 | **72.2** | 73.5 |

Table 2: Main offline evaluation results. Each SkillChain row adds one pipeline stage cumulatively over the previous. LLM Judge dimensions are each scored in $[0, 100]$: TCR, Tool Call Rationality; CCC, Card Composition Compliance; CQ, Content Quality; CA, Constraint Adherence; Avg, normalized aggregate score across all four dimensions. †CCC computed on the subset of queries with product card output. Bold: best per column.

#### Stage 1.

Stage 1 establishes a strong baseline by introducing per\-intent behavioral constraints \(C1\), improving three of four Judge dimensions over NoSkill \(CCC, CQ, and CA\)\. The TCR decline is mechanistically expected: without Skill constraints, NoSkill’s tool calls are governed by the model’s general priors; Stage 1 imposes per\-intent tool\-call specifications that improve TCR only when routing is correct, but misrouted queries receive tool mandates misaligned with their actual intent\. Unlike unconstrained generation, which at worst omits the optimal tool, an incorrect Skill Body*actively forces*erroneous calls, pushing aggregate TCR below the NoSkill baseline\.

#### Stage 2\.

Stage 2 directly targets routing drift \(C2\), with gains concentrated in Routing F1 and TCR\. Routing repair reduces wrong\-constraint exposure, restoring TCR once the correct skill is consistently selected; accurate intent\-to\-Skill mapping is a prerequisite for correct tool invocation\.

#### Stage 3\.

Stage 3 closes the quality feedback loop \(C3\), with the largest incremental gains in CCC and CA\. The rule\-based path directly targets structural violations captured by CCC, while the LLM Judge’s CA dimension checks Skill Body compliance, making both the primary beneficiaries of the Stage 3 loop\.

#### vs\. ManualSkill\.

Stage 3 achieves the highest aggregate score and outperforms ManualSkill on CCC, CQ, and Routing F1; the CA gap reflects the denser constraint specification of self-evolved Skills, which raises the bar for full adherence (see Appendix [B](https://arxiv.org/html/2606.12984#A2)).

### 4.3 Analysis

We design three research questions to probe SkillChain's core design claims: (RQ1) whether routing and response quality gains follow the unidirectional chain order without reverse propagation; (RQ2) whether the self-evolution converges and at what rate; (RQ3) what drives Stage 3 quality gains, with full ablations in Appendix [A](https://arxiv.org/html/2606.12984#A1) due to space.

#### RQ1: Chain Unidirectionality\.

Table [2](https://arxiv.org/html/2606.12984#S4.T2) supports the unidirectionality hypothesis: Stage 2 gains concentrate on Routing F1 and TCR with negligible changes to CQ and CA; Stage 3 reverses the pattern, with the largest gains in CCC and CA, while Routing F1 declines from 78.0 to 73.5. This dissociation reflects the unidirectional chain design: gains accumulate downstream, and Routing F1 stays well above its Stage 1 level of 65.5.
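The per-stage increments behind this comparison can be recomputed directly from the three SkillChain rows of Table 2. The snippet only subtracts reported numbers; the dictionary keys are shorthand introduced for the example.

```python
from typing import Dict

# Scores copied from the SkillChain rows of Table 2.
TABLE2: Dict[str, Dict[str, float]] = {
    "S1":    {"TCR": 46.7, "CCC": 58.9, "CQ": 76.6, "CA": 53.7, "Avg": 62.5, "F1": 65.5},
    "S1+S2": {"TCR": 55.3, "CCC": 63.4, "CQ": 76.5, "CA": 54.1, "Avg": 67.2, "F1": 78.0},
    "Full":  {"TCR": 61.3, "CCC": 72.2, "CQ": 82.3, "CA": 62.7, "Avg": 72.2, "F1": 73.5},
}


def stage_delta(later: str, earlier: str) -> Dict[str, float]:
    """Per-metric difference between two rows, rounded to one decimal."""
    return {m: round(TABLE2[later][m] - TABLE2[earlier][m], 1) for m in TABLE2[later]}


stage2_gain = stage_delta("S1+S2", "S1")   # effect of adding routing optimization
stage3_gain = stage_delta("Full", "S1+S2")  # effect of adding Body refinement
```

Stage 2 moves Routing F1 by +12.5 and TCR by +8.6 while CQ and CA change by −0.1 and +0.4. Stage 3 adds +8.8 on CCC, +8.6 on CA, +6.0 on TCR, and +5.8 on CQ, with Routing F1 changing by −4.5.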

Table [3](https://arxiv.org/html/2606.12984#S4.T3) breaks down per-intent gains within each stage and reinforces this downstream dependency: Stage 3's largest CQ/CA gains occur in Exact Match, where routing was already stable and Body quality headroom remained; intents with the most routing drift (Encyclopedia, Divergent Rec.) see the strongest Stage 2 gains but more constrained Stage 3 improvements, consistent with the chain's sequential dependency; Utility Assistance shows near-zero Stage 3 gains, reflecting the difficulty of exhaustively constraining open-ended intents.

Table 3: Per-intent incremental gains from Stage 2 (routing, $\Delta$ over Stage 1) and Stage 3 (body quality, $\Delta$ over Stage 2) across five intent categories. Bold: best per column.
#### RQ2: Evolution Convergence\.

Figure [2](https://arxiv.org/html/2606.12984#S4.F2) tracks how routing quality and Body quality evolve across optimization rounds. Stage 2 Routing F1 rises steadily over four rounds before plateauing, with boundary Update operations contributing the most. Stage 3 exhibits dimension-level heterogeneity: structural dimensions (CCC, TCR) plateau earlier while content-oriented dimensions (CQ, CA) continue improving across all three rounds. Structural violations are finite and discrete: a small set of rule-based signals can enumerate and repair them within a few rounds, whereas content-quality failures span a long tail of query phrasings and domain sub-topics, requiring more production exposure to converge. The two trajectories evolve independently, empirically confirming that the unidirectional chain design holds at runtime.

![Refer to caption](https://arxiv.org/html/2606.12984v1/x2.png)

Figure 2: Evolution of routing quality (Stage 2 Routing P, R, F1) and body quality metrics (CCC, TCR, CQ, CA) across optimization rounds. Stage 2 converges within four rounds; structural dimensions (CCC, TCR) plateau earlier than content-oriented ones (CQ, CA). The two stage trajectories evolve independently, confirming that downstream refinement does not reverse upstream routing gains.

#### RQ3: Stage 3 Design Choices\.

Full ablation results are in Table [5](https://arxiv.org/html/2606.12984#A1.T5) in Appendix [A](https://arxiv.org/html/2606.12984#A1). In brief: the two evaluation paths address complementary failure modes and are both necessary; Statistical Aggregation is the more critical attribution component, as removing it causes the largest structural degradation across all ablations.

### 4.4 Online A/B Experiment

Table [4](https://arxiv.org/html/2606.12984#S4.T4) reports online metric deltas for Stage 3 vs. the already-deployed Stage 2 baseline over a one-week production A/B experiment. Full-read Rate shows the largest gain (+4.98 pp) and 7-day Return Rate improves by +1.15 pp, signalling sustained attention and long-term retention beyond single-session effects. Interactive UV (+1.92 pp) and Dwell Time (+2.85 s) confirm gains across both commercial and knowledge-seeking queries, validating that offline LLM-as-Judge scores are a reliable proxy for real user experience.

Table 4: Online A/B results (Stage 3 vs. Stage 2, one-week production experiment). $\Delta$ values are absolute improvements (pp = percentage points). Interactive UV counts unique users who either ask follow-up questions (knowledge queries) or click a product card (commercial queries).

### 4.5 Human Evaluation (SBS)

We conduct a side-by-side (SBS) blind evaluation with human annotators comparing Stage 3 vs. ManualSkill responses on a balanced set of 300 queries (60 per intent category). Figure [3](https://arxiv.org/html/2606.12984#S4.F3) reports win/tie/lose rates overall and per scene. Stage 3 achieves a consistent net win across all five intent categories. Encyclopedia and Utility Assistance show the largest human preference margins, while Divergent Rec. is the weakest, reflecting the subjective nature of style and diversity preferences in recommendation responses. Notably, Utility Assistance yields a strong human preference despite its near-zero offline Stage 3 gains, suggesting that the LLM Judge underestimates quality improvements for open-ended intents, consistent with the attribution noise identified in §[4.3](https://arxiv.org/html/2606.12984#S4.SS3).

![Refer to caption](https://arxiv.org/html/2606.12984v1/x3.png)

Figure 3: SBS human evaluation win/tie/lose rates for Stage 3 vs. ManualSkill across five intent categories (300 queries, 60 per intent). SkillChain achieves a net win across all intents, with the largest margin on content-demanding scenes.
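Win/tie/lose rates of this kind can be derived from per-query annotator verdicts as follows. Defining net win as win rate minus lose rate is an assumption, since the text does not define it, and the counts in the usage note are invented for illustration and are not results from the paper.

```python
from collections import Counter
from typing import Dict, Iterable

VERDICTS = ("win", "tie", "lose")


def sbs_rates(judgments: Iterable[str]) -> Dict[str, float]:
    """Turn per-query verdicts into win/tie/lose rates plus a net-win margin."""
    counts = Counter(judgments)
    unknown = set(counts) - set(VERDICTS)
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments provided")
    rates = {v: counts[v] / total for v in VERDICTS}
    rates["net_win"] = rates["win"] - rates["lose"]   # assumed definition of net win
    return rates
```

For a hypothetical intent with 30 wins, 18 ties, and 12 losses out of 60 queries, the rates are 0.5, 0.3, and 0.2, for a net win of 0.3.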

## 5 Conclusion

We presented SkillChain, a framework that closes the loop on Skill evolution for e\-commerce AI assistants\. By decomposing the problem into Skill creation \(Stage 1\), routing optimization \(Stage 2\), and Body refinement \(Stage 3\), SkillChain continuously adapts to real traffic distribution and quality signals without manual re\-engineering\. Experiments on the production system demonstrate significant offline and online improvements across diverse visual intent types\. The Skill abstraction is modality\-agnostic; future work may extend it to text\-based search, live\-stream commentary, and product Q&A settings with similar intent diversity\.

## 6 Limitations

SkillChain has several limitations worth noting. First, cold-start and Skill Bank scalability: intents absent from the initial Skill Bank fall back to NoSkill, making coverage dependent on upfront seeding breadth; routing disambiguation complexity grows with Skill count, and routing boundary conditions between overlapping intents remain inherently difficult to delineate. Second, evaluation reliability: LLM Judge quality degrades for open-ended intents where no single ground-truth response exists to anchor scoring, which explains the near-zero Stage 3 gains for Utility Assistance, and Stage 2 routing failure diagnosis relies on human-annotated ground-truth routing labels, limiting applicability where such supervision is scarce. Third, declarative representation ceiling and generalizability: Skills are flat text specifications and cannot encode executable procedures or hierarchical sub-skill compositions, capping expressiveness for complex multi-step tasks; all experiments are conducted on a single e-commerce platform with five intent categories, and generalizability to other domains and languages remains to be validated. We plan to address these through automatic Skill discovery, multi-judge consensus scoring, executable Skill representations, and cross-platform evaluation in future work.

## References

- EvoSkill: automated skill discovery for multi-agent systems. arXiv:2603.02766. [Link](https://arxiv.org/abs/2603.02766)
- Anthropic (2026) Introducing Claude Sonnet 4.6. [Link](https://www.anthropic.com/news/claude-sonnet-4-6)
- Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. Lasenby, R. Larson, S. Ringer, S. Johnston, S. Kravec, S. E. Showk, S. Fort, T. Lanham, T. Telleen-Lawton, T. Conerly, T. Henighan, T. Hume, S. R. Bowman, Z. Hatfield-Dodds, B. Mann, D. Amodei, N. Joseph, S. McCandlish, T. Brown, and J. Kaplan (2022) Constitutional AI: harmlessness from AI feedback. arXiv:2212.08073. [Link](https://arxiv.org/abs/2212.08073)
- T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.). [Link](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html)
- G. Jiang, Z. Su, X. Qu, and Y. R. Fung (2026a) XSkill: continual learning from experience and skills in multimodal agents. arXiv:2603.12056. [Link](https://arxiv.org/abs/2603.12056)
- Y. Jiang, D. Li, H. Deng, B. Ma, X. Wang, Q. Wang, and G. Yu (2026b) SoK: agentic skills – beyond tool use in LLM agents. arXiv:2602.20867. [Link](https://arxiv.org/abs/2602.20867)
- O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts (2024) DSPy: compiling declarative language model calls into self-improving pipelines.
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.). [Link](https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html)
- Y\. Li, R\. Miao, Z\. Qi, and T\. Lan \(2026\)ARISE: agent reasoning with intrinsic skill evolution in hierarchical reinforcement learning\.External Links:2603\.16060,[Link](https://arxiv.org/abs/2603.16060)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar, B\. Newman, B\. Yuan, B\. Yan, C\. Zhang, C\. Cosgrove, C\. D\. Manning, C\. Ré, D\. Acosta\-Navas, D\. A\. Hudson, E\. Zelikman, E\. Durmus, F\. Ladhak, F\. Rong, H\. Ren, H\. Yao, J\. Wang, K\. Santhanam, L\. Orr, L\. Zheng, M\. Yuksekgonul, M\. Suzgun, N\. Kim, N\. Guha, N\. Chatterji, O\. Khattab, P\. Henderson, Q\. Huang, R\. Chi, S\. M\. Xie, S\. Santurkar, S\. Ganguli, T\. Hashimoto, T\. Icard, T\. Zhang, V\. Chaudhary, W\. Wang, X\. Li, Y\. Mai, Y\. Zhang, and Y\. Koreeda \(2023\)Holistic evaluation of language models\.External Links:2211\.09110,[Link](https://arxiv.org/abs/2211.09110)Cited by:[§C\.1](https://arxiv.org/html/2606.12984#A3.SS1.p4.1),[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px4.p1.1),[§4\.1](https://arxiv.org/html/2606.12984#S4.SS1.SSS0.Px3.p1.1)\.
- Y\. Liang, R\. Zhong, H\. Xu, C\. Jiang, Y\. Zhong, R\. Fang, J\. Gu, S\. Deng, Y\. Yao, M\. Wang, S\. Qiao, X\. Xu, T\. Wu, K\. Wang, Y\. Liu, Z\. Bi, J\. Lou, Y\. E\. Jiang, H\. Zhu, G\. Yu, H\. Hong, L\. Huang, H\. Xue, C\. Wang, Y\. Wang, Z\. Shan, X\. Chen, Z\. Tu, F\. Xiong, X\. Xie, P\. Zhang, Z\. Gui, L\. Liang, J\. Zhou, C\. Wu, J\. Shang, Y\. Gong, J\. Lin, C\. Xu, H\. Deng, W\. Zhang, K\. Ding, Q\. Zhang, F\. Huang, N\. Zhang, J\. Z\. Pan, G\. Qi, H\. Wang, and H\. Chen \(2026\)SkillNet: create, evaluate, and connect ai skills\.External Links:2603\.04448,[Link](https://arxiv.org/abs/2603.04448)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Liu, Z\. Li, H\. Du, X\. Wu, S\. Gui, Y\. Kuang, and L\. Sun \(2026a\)Graph\-of\-skills: dependency\-aware structural retrieval for massive agent skills\.External Links:2604\.05333,[Link](https://arxiv.org/abs/2604.05333)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Liu, X\. Luo, L\. Li, G\. Huang, J\. Liu, and H\. Qiao \(2026b\)SkillForge: forging domain\-specific, self\-evolving agent skills in cloud technical support\.External Links:[Link](https://api.semanticscholar.org/CorpusID:287351631)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023\)G\-eval: NLG evaluation using gpt\-4 with better human alignment\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6\-10, 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),pp\. 2511–2522\.External Links:[Link](https://doi.org/10.18653/v1/2023.emnlp-main.153),[Document](https://dx.doi.org/10.18653/V1/2023.EMNLP-MAIN.153)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px4.p1.1),[§3\.3](https://arxiv.org/html/2606.12984#S3.SS3.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.12984#S4.SS1.SSS0.Px3.p1.1)\.
- Z\. Ma, S\. Yang, Y\. Ji, X\. Wang, Y\. Wang, Y\. Hu, T\. Huang, and X\. Chu \(2026\)SkillClaw: let skills evolve collectively with agentic evolver\.External Links:2604\.08377,[Link](https://arxiv.org/abs/2604.08377)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Ni, Y\. Liu, X\. Liu, Y\. Sun, M\. Zhou, P\. Cheng, D\. Wang, E\. Zhao, X\. Jiang, and G\. Jiang \(2026\)Trace2Skill: distill trajectory\-local lessons into transferable agent skills\.External Links:2603\.25158,[Link](https://arxiv.org/abs/2603.25158)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- OpenAI, J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat, R\. Avila, I\. Babuschkin, S\. Balaji, V\. Balcom, P\. Baltescu, H\. Bao, M\. Bavarian, J\. Belgum, I\. Bello, J\. Berdine, G\. Bernadett\-Shapiro, C\. Berner, L\. Bogdonoff, O\. Boiko, M\. Boyd, A\. Brakman, G\. Brockman, T\. Brooks, M\. Brundage, K\. Button, T\. Cai, R\. Campbell, A\. Cann, B\. Carey, C\. Carlson, R\. Carmichael, B\. Chan, C\. Chang, F\. Chantzis, D\. Chen, S\. Chen, R\. Chen, J\. Chen, M\. Chen, B\. Chess, C\. Cho, C\. Chu, H\. W\. Chung, D\. Cummings, J\. Currier, Y\. Dai, C\. Decareaux, T\. Degry, N\. Deutsch, D\. Deville, A\. Dhar, D\. Dohan, S\. Dowling, S\. Dunning, A\. Ecoffet, A\. Eleti, T\. Eloundou, D\. Farhi, L\. Fedus, N\. Felix, S\. P\. Fishman, J\. Forte, I\. Fulford, L\. Gao, E\. Georges, C\. Gibson, V\. Goel, T\. Gogineni, G\. Goh, R\. Gontijo\-Lopes, J\. Gordon, M\. Grafstein, S\. Gray, R\. Greene, J\. Gross, S\. S\. Gu, Y\. Guo, C\. Hallacy, J\. Han, J\. Harris, Y\. He, M\. Heaton, J\. Heidecke, C\. Hesse, A\. Hickey, W\. Hickey, P\. Hoeschele, B\. Houghton, K\. Hsu, S\. Hu, X\. Hu, J\. Huizinga, S\. Jain, S\. Jain, J\. Jang, A\. Jiang, R\. Jiang, H\. Jin, D\. Jin, S\. Jomoto, B\. Jonn, H\. Jun, T\. Kaftan, Ł\. Kaiser, A\. Kamali, I\. Kanitscheider, N\. S\. Keskar, T\. Khan, L\. Kilpatrick, J\. W\. Kim, C\. Kim, Y\. Kim, J\. H\. Kirchner, J\. Kiros, M\. Knight, D\. Kokotajlo, Ł\. Kondraciuk, A\. Kondrich, A\. Konstantinidis, K\. Kosic, G\. Krueger, V\. Kuo, M\. Lampe, I\. Lan, T\. Lee, J\. Leike, J\. Leung, D\. Levy, C\. M\. Li, R\. Lim, M\. Lin, S\. Lin, M\. Litwin, T\. Lopez, R\. Lowe, P\. Lue, A\. Makanju, K\. Malfacini, S\. Manning, T\. Markov, Y\. Markovski, B\. Martin, K\. Mayer, A\. Mayne, B\. McGrew, S\. M\. McKinney, C\. McLeavey, P\. McMillan, J\. McNeil, D\. Medina, A\. Mehta, J\. Menick, L\. Metz, A\. Mishchenko, P\. Mishkin, V\. Monaco, E\. Morikawa, D\. Mossing, T\. Mu, M\. 
Murati, O\. Murk, D\. Mély, A\. Nair, R\. Nakano, R\. Nayak, A\. Neelakantan, R\. Ngo, H\. Noh, L\. Ouyang, C\. O’Keefe, J\. Pachocki, A\. Paino, J\. Palermo, A\. Pantuliano, G\. Parascandolo, J\. Parish, E\. Parparita, A\. Passos, M\. Pavlov, A\. Peng, A\. Perelman, F\. de Avila Belbute Peres, M\. Petrov, H\. P\. de Oliveira Pinto, Michael, Pokorny, M\. Pokrass, V\. H\. Pong, T\. Powell, A\. Power, B\. Power, E\. Proehl, R\. Puri, A\. Radford, J\. Rae, A\. Ramesh, C\. Raymond, F\. Real, K\. Rimbach, C\. Ross, B\. Rotsted, H\. Roussez, N\. Ryder, M\. Saltarelli, T\. Sanders, S\. Santurkar, G\. Sastry, H\. Schmidt, D\. Schnurr, J\. Schulman, D\. Selsam, K\. Sheppard, T\. Sherbakov, J\. Shieh, S\. Shoker, P\. Shyam, S\. Sidor, E\. Sigler, M\. Simens, J\. Sitkin, K\. Slama, I\. Sohl, B\. Sokolowsky, Y\. Song, N\. Staudacher, F\. P\. Such, N\. Summers, I\. Sutskever, J\. Tang, N\. Tezak, M\. B\. Thompson, P\. Tillet, A\. Tootoonchian, E\. Tseng, P\. Tuggle, N\. Turley, J\. Tworek, J\. F\. C\. Uribe, A\. Vallone, A\. Vijayvergiya, C\. Voss, C\. Wainwright, J\. J\. Wang, A\. Wang, B\. Wang, J\. Ward, J\. Wei, C\. Weinmann, A\. Welihinda, P\. Welinder, J\. Weng, L\. Weng, M\. Wiethoff, D\. Willner, C\. Winter, S\. Wolrich, H\. Wong, L\. Workman, S\. Wu, J\. Wu, M\. Wu, K\. Xiao, T\. Xu, S\. Yoo, K\. Yu, Q\. Yuan, W\. Zaremba, R\. Zellers, C\. Zhang, M\. Zhang, S\. Zhao, T\. Zheng, J\. Zhuang, W\. Zhuk, and B\. Zoph \(2024\)GPT\-4 technical report\.External Links:2303\.08774,[Link](https://arxiv.org/abs/2303.08774)Cited by:[§1](https://arxiv.org/html/2606.12984#S1.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Gray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe \(2022\)Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems,A\. H\. Oh, A\. Agarwal, D\. Belgrave, and K\. Cho \(Eds\.\),External Links:[Link](https://openreview.net/forum?id=TG8KACxEON)Cited by:[§3\.1](https://arxiv.org/html/2606.12984#S3.SS1.SSS0.Px2.p1.1)\.
- S\. Ouyang, J\. Yan, Y\. Chen, R\. Han, Z\. Wang, B\. D\. Mishra, R\. Meng, C\. Li, Y\. Jiao, K\. Zha, M\. Shen, V\. Tirumalashetty, G\. Lee, J\. Han, T\. Pfister, and C\. Lee \(2026\)SkillOS: learning skill curation for self\-evolving agents\.External Links:2605\.06614,[Link](https://arxiv.org/abs/2605.06614)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez \(2024\)Gorilla: large language model connected with massive APIs\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=tBRNC6YemY)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px2.p1.1)\.
- Qwen Team \(2025\)Qwen3\-vl technical report\.CoRRabs/2511\.21631\.External Links:[Link](https://doi.org/10.48550/arXiv.2511.21631),[Document](https://dx.doi.org/10.48550/ARXIV.2511.21631),2511\.21631Cited by:[Appendix D](https://arxiv.org/html/2606.12984#A4.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.12984#S3.SS2.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessi, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=Yacmpz84TH)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. R\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=vAElhFcKW6)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Sumers, S\. Yao, K\. R\. Narasimhan, and T\. L\. Griffiths \(2024\)Cognitive architectures for language agents\.Transactions on Machine Learning Research\.Note:Survey Certification, Featured CertificationExternal Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=1i6ZCvflQJ)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- The Gemini Team \(2026\)Gemini 3\.1 Pro: a smarter model for your most complex tasks\.External Links:[Link](https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/)Cited by:[Appendix D](https://arxiv.org/html/2606.12984#A4.SS0.SSS0.Px1.p1.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale, D\. Bikel, L\. Blecher, C\. C\. Ferrer, M\. Chen, G\. Cucurull, D\. Esiobu, J\. Fernandes, J\. Fu, W\. Fu, B\. Fuller, C\. Gao, V\. Goswami, N\. Goyal, A\. Hartshorn, S\. Hosseini, R\. Hou, H\. Inan, M\. Kardas, V\. Kerkez, M\. Khabsa, I\. Kloumann, A\. Korenev, P\. S\. Koura, M\. Lachaux, T\. Lavril, J\. Lee, D\. Liskovich, Y\. Lu, Y\. Mao, X\. Martinet, T\. Mihaylov, P\. Mishra, I\. Molybog, Y\. Nie, A\. Poulton, J\. Reizenstein, R\. Rungta, K\. Saladi, A\. Schelten, R\. Silva, E\. M\. Smith, R\. Subramanian, X\. E\. Tan, B\. Tang, R\. Taylor, A\. Williams, J\. X\. Kuan, P\. Xu, Z\. Yan, I\. Zarov, Y\. Zhang, A\. Fan, M\. Kambadur, S\. Narang, A\. Rodriguez, R\. Stojnic, S\. Edunov, and T\. Scialom \(2023\)Llama 2: open foundation and fine\-tuned chat models\.External Links:2307\.09288,[Link](https://arxiv.org/abs/2307.09288)Cited by:[§1](https://arxiv.org/html/2606.12984#S1.p1.1)\.
- S\. Tu, C\. Xu, Q\. Zhang, Y\. Zhang, X\. Lan, L\. Li, D\. Li, and D\. Zhao \(2026\)Dynamic dual\-granularity skill bank for agentic rl\.External Links:2603\.28716,[Link](https://arxiv.org/abs/2603.28716)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Wang, Z\. Yu, X\. Xie, W\. Yao, R\. Fang, S\. Qiao, K\. Cao, G\. Zheng, X\. Qi, P\. Zhang, and S\. Deng \(2026a\)SkillX: automatically constructing skill knowledge bases for agents\.External Links:2604\.04804,[Link](https://arxiv.org/abs/2604.04804)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2024a\)Voyager: an open\-ended embodied agent with large language models\.Transactions on Machine Learning Research\.Note:External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=ehfRiF0R3a)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Wang, C\. Ma, X\. Feng, Z\. Zhang, H\. Yang, J\. Zhang, Z\. Chen, J\. Tang, X\. Chen, Y\. Lin, W\. X\. Zhao, Z\. Wei, and J\. Wen \(2024b\)A survey on large language model based autonomous agents\.Frontiers of Computer Science18\(6\)\.External Links:ISSN 2095\-2236,[Link](http://dx.doi.org/10.1007/s11704-024-40231-1),[Document](https://dx.doi.org/10.1007/s11704-024-40231-1)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.p1.1)\.
- Z\. Wang, Q\. Wu, X\. Zhang, C\. Zhang, W\. Yao, F\. E\. Faisal, B\. Peng, S\. Qin, S\. Nath, Q\. Lin, C\. Bansal, D\. Zhang, S\. Rajmohan, J\. Gao, and H\. Yao \(2026b\)WebXSkill: skill learning for autonomous web agents\.External Links:2604\.13318,[Link](https://arxiv.org/abs/2604.13318)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu, A\. H\. Awadallah, R\. W\. White, D\. Burger, and C\. Wang \(2023\)AutoGen: enabling next\-gen llm applications via multi\-agent conversation\.External Links:2308\.08155,[Link](https://arxiv.org/abs/2308.08155)Cited by:[§3\.1](https://arxiv.org/html/2606.12984#S3.SS1.SSS0.Px1.p1.2)\.
- Z\. Xi, W\. Chen, X\. Guo, W\. He, Y\. Ding, B\. Hong, M\. Zhang, J\. Wang, S\. Jin, E\. Zhou, R\. Zheng, X\. Fan, X\. Wang, L\. Xiong, Y\. Zhou, W\. Wang, C\. Jiang, Y\. Zou, X\. Liu, Z\. Yin, S\. Dou, R\. Weng, W\. Cheng, Q\. Zhang, W\. Qin, Y\. Zheng, X\. Qiu, X\. Huang, and T\. Gui \(2023\)The rise and potential of large language model based agents: a survey\.External Links:2309\.07864,[Link](https://arxiv.org/abs/2309.07864)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.p1.1)\.
- P\. Xia, J\. Chen, H\. Wang, J\. Liu, K\. Zeng, Y\. Wang, S\. Han, Y\. Zhou, X\. Zhao, H\. Chen, Z\. Zheng, C\. Xie, and H\. Yao \(2026\)SkillRL: evolving agents via recursive skill\-augmented reinforcement learning\.External Links:2602\.08234,[Link](https://arxiv.org/abs/2602.08234)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Yang, X\. Wang, Y\. Lu, H\. Liu, Q\. V\. Le, D\. Zhou, and X\. Chen \(2024\)Large language models as optimizers\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,External Links:[Link](https://openreview.net/forum?id=Bb4VGOWELI)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Yang, J\. Li, Q\. Pan, B\. Zhan, Y\. Cai, L\. Du, J\. Zhou, K\. Chen, Q\. Chen, X\. Li, B\. Zhang, and L\. He \(2026\)AutoSkill: experience\-driven lifelong learning via skill self\-evolution\.External Links:2603\.01145,[Link](https://arxiv.org/abs/2603.01145)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.12984#S3.SS2.SSS0.Px2.p1.2)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Yuksekgonul, F\. Bianchi, J\. Boen, S\. Liu, Z\. Huang, C\. Guestrin, and J\. Zou \(2024\)TextGrad: automatic "differentiation" via text\.External Links:2406\.07496,[Link](https://arxiv.org/abs/2406.07496)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px3.p1.1),[§3\.3](https://arxiv.org/html/2606.12984#S3.SS3.SSS0.Px2.p1.3)\.
- H\. Zhang, S\. Fan, H\. P\. Zou, Y\. Chen, Z\. Wang, J\. Zhou, C\. Li, W\. Huang, Y\. Yao, K\. Zheng, X\. Liu, X\. Li, and P\. S\. Yu \(2026\)CoEvoSkills: self\-evolving agent skills via co\-evolutionary verification\.External Links:2604\.01687,[Link](https://arxiv.org/abs/2604.01687)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024\)ExpeL: llm agents are experiential learners\.InThirty\-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty\-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20\-27, 2024, Vancouver, Canada,M\. J\. Wooldridge, J\. G\. Dy, and S\. Natarajan \(Eds\.\),pp\. 19632–19642\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/29936),[Document](https://dx.doi.org/10.1609/aaai.v38i17.29936)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px3.p1.1),[§3\.3](https://arxiv.org/html/2606.12984#S3.SS3.SSS0.Px2.p1.3)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets%5C_and%5C_Benchmarks.html)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px4.p1.1),[§3\.3](https://arxiv.org/html/2606.12984#S3.SS3.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.12984#S4.SS1.SSS0.Px3.p1.1)\.
- Y\. Zhou, A\. I\. Muresanu, Z\. Han, K\. Paster, S\. Pitis, H\. Chan, and J\. Ba \(2023\)Large language models are human\-level prompt engineers\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,External Links:[Link](https://openreview.net/forum?id=92gvk82DE-)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Zhu, Y\. Chen, H\. Tian, C\. Tao, W\. Su, C\. Yang, G\. Huang, B\. Li, L\. Lu, X\. Wang, Y\. Qiao, Z\. Zhang, and J\. Dai \(2023\)Ghost in the minecraft: generally capable agents for open\-world environments via large language models with text\-based knowledge and memory\.External Links:2305\.17144,[Link](https://arxiv.org/abs/2305.17144)Cited by:[§2](https://arxiv.org/html/2606.12984#S2.SS0.SSS0.Px1.p1.1)\.

## Appendix A RQ3: Stage 3 Design Choices

Table [5](https://arxiv.org/html/2606.12984#A1.T5) ablates the two key design choices in Stage 3\.

#### \(a\) Evaluation signal source\.

Both paths are necessary, but they address complementary failure modes\. Removing the LLM Judge path incurs the larger overall penalty, with TCR and CA dropping the most, as both metrics require holistic understanding of task intent and are difficult to evaluate with fixed rules\. Removing the rule\-based path causes a distinct pattern: CCC suffers the most while CQ is almost unaffected, confirming that rule\-based signals capture structural violations an LLM Judge tends to miss\.

#### \(b\) Cross\-sample attribution level\.

Statistical Aggregation is the more critical component\. Removing it collapses CCC \($\Delta = -12.2$\), the largest single\-dimension drop across all ablations in this study, and lowers Avg \($\Delta = -3.8$\), indicating that without cross\-sample aggregation the system cannot distinguish per\-query noise from genuine Skill\-level weaknesses\. Removing Qualitative Induction has a much milder effect on Avg, and CA even improves marginally, suggesting that qualitative inductive summaries occasionally introduce spurious rewrites for content\-oriented metrics\.

Table 5: Ablation on Stage 3 design choices\. $\Delta$ reports the change relative to the full variant within each group \(– denotes the reference row\)\. Metrics are defined in §[4\.1](https://arxiv.org/html/2606.12984#S4.SS1.SSS0.Px3)\.

## Appendix B Human\-Crafted vs\. Self\-Evolved Skill Comparison

Table [6](https://arxiv.org/html/2606.12984#A2.T6) illustrates how SkillChain’s self\-evolution produces qualitatively different Skills compared to manually authored ones, using the Utility Assistance intent as a representative case\. This intent exemplifies a broader pattern: each intent type may encompass multiple specialized Skills targeting distinct sub\-scenes, and Utility Assistance alone spans diverse sub\-scenes, e\.g\., health consultation, recipe guidance, and document assistance, each with its own Body constraints\. Human\-crafted Skills capture high\-level intent but leave routing boundaries vague, tool invocation underspecified, and domain constraints generic\. SkillChain iteratively tightens each dimension through production signals: routing failures refine Description boundaries \(Stage 2\), while LLM Judge attribution identifies and repairs specific Body weaknesses \(Stage 3\)\. Beyond these three dimensions, self\-evolved Skills also enumerate edge cases and follow\-up patterns that manual authoring leaves unaddressed, resulting in roughly twice the constraint density \(see Table [6](https://arxiv.org/html/2606.12984#A2.T6)\)\.

The increased constraint density also provides a lens for interpreting the metric trade\-offs observed in Table [2](https://arxiv.org/html/2606.12984#S4.T2)\. Self\-evolved Skills carry roughly twice the specification volume of ManualSkills, including mandatory field\-level rules, banned\-word blacklists, strict invocation ordering, and enumerated edge\-case handlers, making full Constraint Adherence \(CA\) harder to achieve by construction: even a well\-formed response can now fail on a clause that ManualSkill never expressed\. This explains why CA in SkillChain Full \(62\.7\) falls below the ManualSkill baseline \(65\.3\) despite a higher overall Avg\. From a user\-facing perspective, however, CCC and CQ are the dimensions that matter most: CCC reflects whether product cards are correctly composed \(directly visible to the user\), and CQ captures overall content coherence\. Both improve substantially: CCC from 56\.1 to 72\.2 and CQ from 70\.9 to 82\.3, confirming that the richer constraint set drives meaningful quality gains on the dimensions users actually experience\.

Table 6: Structural comparison of a ManualSkill vs\. a SkillChain self\-evolved Skill for the Utility Assistance intent across six dimensions\.

## Appendix C Metric Definitions

### C\.1 Offline LLM\-Judge Dimensions

The four dimensions form two complementary layers of response evaluation\. TCR and CCC assess *structural execution fidelity*: whether the response correctly follows the Dynamic Operators and rich\-media formatting prescribed by the Skill Body\. CQ and CA assess *content compliance*: whether the actual text meets domain quality standards and honors behavioral constraints\.

Raw scores are summed \(max 50\) and linearly normalized to $[0, 100]$:

$$\bar{J}(r,s) = 2\bigl(J_{\mathrm{TCR}} + J_{\mathrm{CCC}} + J_{\mathrm{CQ}} + J_{\mathrm{CA}}\bigr) \tag{3}$$

- Tool Call Rationality \(TCR\) \[0–10\] \(↑\): targets the Dynamic Operators layer of the Skill Body\. Measures whether tool selection and invocation order are appropriate for the query; penalizes redundant, missing, or sequentially incorrect calls\.
- Card Composition Compliance \(CCC\) \[0–10\] \(↑\): targets the rich\-media composition rules and product quality in the Skill Body\. Measures how well the card output structure and the search terms conform to the Skill specification\. Evaluated only on the subset of queries that produce product card output\.
- Content Quality \(CQ\) \[0–20\] \(↑\): targets the writing quality and domain knowledge layer\. Composite of factual accuracy \[8 pts\], mobile\-friendly readability \[6 pts\], and intent understanding \[6 pts\]\. In routing\-deviation scenarios, intent understanding carries the most weight: rigid rule\-following that misses the user’s true need scores 0–2 on this sub\-dimension\.
- Constraint Adherence \(CA\) \[0–10\] \(↑\): targets the behavioral constraint layer, covering all mandatory rules and prohibitions explicitly enumerated in the Skill Body \(e\.g\., no fabrication, format length limits, domain\-specific disclaimers\)\.
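
As a sanity check on Eq\. \(3\), the following Python sketch sums the four raw scores and doubles the total\. The `JudgeScores` container and its field names are hypothetical and not part of SkillChain\. The sketch also assumes all four dimensions are present, whereas CCC is only scored on card\-producing queries and the handling of that case is not specified here\.

```python
from dataclasses import dataclass

# Per-dimension maxima taken from the rubric above (they sum to 50).
MAX_SCORES = {"TCR": 10, "CCC": 10, "CQ": 20, "CA": 10}


@dataclass(frozen=True)
class JudgeScores:
    """Raw LLM-Judge scores for one response (hypothetical container)."""
    tcr: float
    ccc: float
    cq: float
    ca: float

    def normalized(self) -> float:
        """Eq. (3): sum the raw scores (max 50) and scale by 2 to reach [0, 100]."""
        raw = {"TCR": self.tcr, "CCC": self.ccc, "CQ": self.cq, "CA": self.ca}
        for name, value in raw.items():
            if not 0 <= value <= MAX_SCORES[name]:
                raise ValueError(f"{name} score {value} outside [0, {MAX_SCORES[name]}]")
        return 2 * sum(raw.values())


example = JudgeScores(tcr=8, ccc=7, cq=16, ca=6)
print(example.normalized())  # 2 * (8 + 7 + 16 + 6) = 74
```

Because CQ carries a maximum of 20 against 10 for each other dimension, it contributes up to 40 of the 100 normalized points\.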

We additionally report Routing Accuracy F1 \(↑\), the harmonic mean of routing precision and recall against human\-annotated ground\-truth intent labels, following HELM \(Liang et al\., [2023](https://arxiv.org/html/2606.12984#bib.bib62)\):

$$F1_{\text{routing}} = \frac{2\,P_{\text{routing}} \cdot R_{\text{routing}}}{P_{\text{routing}} + R_{\text{routing}}} \tag{4}$$

This metric is independent of the four Judge dimensions: it evaluates the Description \($d$\) layer independently of Body \($b$\) compliance\.
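
A minimal Python sketch of Eq\. \(4\), assuming single\-label routing and macro\-averaged precision and recall over Skills\. The averaging scheme and the Skill names in the example are assumptions; the text only fixes F1 as the harmonic mean of routing precision and recall\.

```python
from collections import Counter


def routing_f1(predicted: list[str], gold: list[str]) -> float:
    """Harmonic mean of macro-averaged routing precision and recall.

    Macro-averaging over Skills is an assumption made for this sketch.
    """
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equally long")

    true_pos = Counter(p for p, g in zip(predicted, gold) if p == g)
    pred_count = Counter(predicted)
    gold_count = Counter(gold)
    labels = set(pred_count) | set(gold_count)

    def ratio(hits: int, total: int) -> float:
        # A Skill that is never predicted (or never in gold) contributes 0.
        return hits / total if total else 0.0

    precision = sum(ratio(true_pos[l], pred_count[l]) for l in labels) / len(labels)
    recall = sum(ratio(true_pos[l], gold_count[l]) for l in labels) / len(labels)

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical Skill names; one "search" query is misrouted to "style".
gold = ["search", "search", "style", "encyclopedia"]
pred = ["search", "style", "style", "encyclopedia"]
print(round(routing_f1(pred, gold), 3))  # 0.833
```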

### C\.2 Online A/B Metrics

The four metrics cover three complementary levels of user behavior: *immediate engagement* \(Interactive UV, Full\-read Rate\), *depth of attention* \(Avg\. Dwell Time\), and *long\-term retention* \(7\-day Return Rate\)\.

- Interactive UV \(↑\): unique users who perform at least one active interaction within the session, such as clicking a product card or asking a follow\-up question\.
- Full\-read Rate \(↑\): proportion of sessions in which the user scrolls to the end of the response\. Measures content consumption completeness\.
- Avg\. Dwell Time \(↑\): mean time \(seconds\) spent on the response page per session\. Complements Full\-read Rate by distinguishing careful reading from fast scrolling\.
- 7\-day Return Rate \(↑\): proportion of users who return to the assistant within 7 days\. As a cross\-session metric it is unaffected by single\-session prompt or presentation effects, making it the most reliable signal of sustained user value from Body refinement\.
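
The four definitions above can be sketched in Python over a simple session log\. The `Session` record, its field names, and the choice to anchor the 7\-day window on each user's first visit are all assumptions made for illustration; the production logging schema is not described here\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """One assistant session. All field names are hypothetical."""
    user_id: str
    interacted: bool      # clicked a product card or asked a follow-up
    read_to_end: bool     # scrolled to the end of the response
    dwell_seconds: float  # time spent on the response page


def session_metrics(sessions: list[Session]) -> dict[str, float]:
    """Interactive UV, Full-read Rate, and Avg. Dwell Time over a set of sessions."""
    if not sessions:
        raise ValueError("need at least one session")
    n = len(sessions)
    return {
        "interactive_uv": float(len({s.user_id for s in sessions if s.interacted})),
        "full_read_rate": sum(s.read_to_end for s in sessions) / n,
        "avg_dwell_time": sum(s.dwell_seconds for s in sessions) / n,
    }


def seven_day_return_rate(visit_days: dict[str, list[int]]) -> float:
    """Share of users with a later visit at most 7 days after their first one.

    visit_days maps a user to the day indices on which they used the assistant;
    anchoring the window on the first visit is an assumption.
    """
    if not visit_days:
        raise ValueError("need at least one user")
    returned = 0
    for days in visit_days.values():
        first = min(days)
        if any(first < d <= first + 7 for d in days):
            returned += 1
    return returned / len(visit_days)
```

Interactive UV deduplicates by user, while Full\-read Rate and Avg\. Dwell Time are per\-session averages, so a single heavy user can move the latter two but not the former\.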

## Appendix D Implementation Details

#### System configurations\.

Table [7](https://arxiv.org/html/2606.12984#A4.T7) summarizes key implementation settings\. All system variants share the same backbone multimodal LLM, Qwen3\-VL\-235B\-A22B\-Instruct \(Qwen Team, [2025](https://arxiv.org/html/2606.12984#bib.bib1)\), which is also the production\-deployed model\. The offline LLM Judge is Gemini\-3\.1\-Pro\-Preview \(The Gemini Team, [2026](https://arxiv.org/html/2606.12984#bib.bib2)\), prompted with the four\-dimensional scoring rubric in Appendix [E\.2](https://arxiv.org/html/2606.12984#A5.SS2); all Judge calls use temperature 0 for reproducibility\. The Skill Creator in Stage 1 and the Skill Refiners in Stage 2 and Stage 3 are all powered by Claude\-Sonnet\-4\.6 \(Anthropic, [2026](https://arxiv.org/html/2606.12984#bib.bib3)\)\. The five experimental configurations are:

- NoSkill: current production system; LLM generates responses without Skill constraints\.
- ManualSkill: Skills hand\-crafted by domain engineers without the SkillChain pipeline; serves as a human\-design baseline\.
- Stage 1 \($\text{Skill}_{v0}$\): automatically generated Skills after human reflection; routing fixed, Body not refined\.
- Stage 2 \(\+Routing\): Stage 1 Skills with Stage 2 routing optimization applied\.
- Stage 3 \(\+Refine\): full SkillChain; Stage 2 with Stage 3 Body refinement\.

#### Online A/B deployment rationale\.

The online A/B experiment compares Stage 3 against the already\-deployed Stage 2 baseline rather than NoSkill, because Stage 2 had been launched into large\-scale production once its routing accuracy and Skill coverage met deployment thresholds: correct intent routing is a prerequisite for acceptable user experience at scale, so serving users without a stable routing layer was not an option\. The online experiment therefore measures the incremental contribution of Stage 3 Body refinement on top of a live, stable routing system\.

| Component | Setting | Value |
| --- | --- | --- |
| Backbone MLLM | Model | Qwen3\-VL\-235B\-A22B\-Instruct |
| | Usage | Inference only \(production\-deployed; no fine\-tuning\) |
| LLM Judge | Model | Gemini\-3\.1\-Pro\-Preview |
| | Temperature | 0 \(deterministic\) |
| | Usage | Offline evaluation only |
| Creator & Refiner | Model | Claude\-Sonnet\-4\.6 |
| | Usage | Stage 1 Skill creation; Stage 2 & 3 Skill refinement |
| Routing Judge | Source | Human\-annotated ground\-truth labels |
| | Usage | Stage 2 failure classification |
| Stage 2 | Max iterations | 4 |
| | Convergence criterion | Routing F1 on held\-out validation set |
| | Failure pool per round | ≥ 30 samples per Skill |
| Stage 3 | Max rounds | 3 |
| | Poor\-tier threshold | TCR 20 %, CCC 10 %, CQ 5 %, CA 10 % |
| | Min samples per Skill | 50 |
| Offline eval set | Min per intent | 150 |
| | Sampling | Intent\-stratified, proportional to traffic |
| Online A/B | Duration | 1 week |

Table 7: Implementation settings for all SkillChain experiments\.
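
Table 7 lists per\-dimension poor\-tier thresholds and a minimum of 50 samples per Skill without spelling out how they are applied\. One possible reading, sketched below in Python purely as an assumption, is that the lowest\-scoring percentage of a Skill's samples on each dimension forms the poor tier passed on for attribution, and that Skills with fewer than 50 samples are skipped\. The function and constant names are hypothetical\.

```python
# Values copied from Table 7. Reading them as "lowest-scoring percentage of
# samples" is an assumption, not something the table states.
POOR_TIER_PERCENT = {"TCR": 20, "CCC": 10, "CQ": 5, "CA": 10}
MIN_SAMPLES_PER_SKILL = 50


def poor_tier(samples: list[dict[str, float]], dimension: str) -> list[dict[str, float]]:
    """Return the lowest-scoring share of one Skill's samples on a dimension."""
    if len(samples) < MIN_SAMPLES_PER_SKILL:
        return []  # too few samples for cross-sample aggregation
    # Integer ceiling division: any pool above the minimum yields at least one sample.
    k = (POOR_TIER_PERCENT[dimension] * len(samples) + 99) // 100
    return sorted(samples, key=lambda s: s[dimension])[:k]
```

Under this reading, a Skill with 100 scored samples would contribute 20 samples to the TCR poor tier and 5 to the CQ poor tier\.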

## Appendix E Prompt Templates

### E\.1 Routing Boundary Analysis Prompt

The routing boundary analysis prompt drives Stage 2’s Description refinement loop by processing both misrouted cases \(Case A\) and correctly routed cases \(Case B\)\.

In Case A \(mismatch\), the model receives the image together with the predicted routing, ground\-truth routing, and Skill trigger boundary descriptions\. It identifies the specific visual features that caused the routing ambiguity, explains the basis for the correct routing, and proposes a concrete one\-sentence update to the relevant Skill Description\. These suggestions are aggregated across failure cases to generate Description refinement patches\.

In Case B \(correct match\), the model analyzes a correctly routed case to reinforce existing boundary rules\. It identifies the visual features that positively support the routing, notes any ambiguous elements and explains why the routing is still correct, and summarizes what boundary rule the case reinforces\. This positive\-evidence pass prevents over\-tightening of Description boundaries during refinement\. The full prompts for both cases are shown in Table [8](https://arxiv.org/html/2606.12984#A5.T8)\.

**Routing Boundary Analysis Prompt Template**

\#\#\# Role

You are a Skill routing boundary analysis expert\. Examine the input image carefully and analyze the routing case below\.

**Case A: Mismatch \(Routing Error\)**

- Predicted: \{Predicted Skill\}
- Ground Truth: \{Label Skill\}
- Skill Trigger Boundaries: \{Skill Descriptions\}

Analysis Tasks:

1. Describe the image: subject, background, quantity, text presence, product image vs\. lifestyle shot, etc\.
2. Identify which visual features caused the routing ambiguity\.
3. Explain the basis for the correct routing\.
4. Propose a one\-sentence Description update: name the Skill and which rule to add or tighten\.

Output fields: `image_content`, `error_reason`, `correct_routing_basis`, `suggestion`

**Case B: Match \(Correct Routing, Boundary Reinforcement\)**

- Routing Result \(Correct\): \{Label Skill\}
- Skill Trigger Boundaries: \{Skill Descriptions\}

Analysis Tasks:

1. Describe the image \(same criteria as Case A\)\.
2. Identify visual features that positively support this routing\.
3. Note any ambiguous elements and explain why routing here is still correct \(`none` if not applicable\)\.
4. Summarize what boundary rule this case reinforces\.

Output fields: `image_content`, `routing_evidence`, `ambiguity_risk`, `boundary_insight`

Table 8: Routing boundary analysis prompt templates used in Stage 2\. Case A \(mismatch\) generates Description refinement suggestions; Case B \(match\) reinforces correct boundary rules\.

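
A minimal sketch of how the two cases might be told apart and their outputs collected before patch generation\. The output field names come from Table 8; everything else \(the `RoutingCase` class, `missing_fields`, `group_evidence`, and the choice to group evidence under the ground\-truth Skill\) is a hypothetical illustration rather than the paper's implementation\.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Output fields listed in Table 8 for each case.
CASE_A_FIELDS = ("image_content", "error_reason", "correct_routing_basis", "suggestion")
CASE_B_FIELDS = ("image_content", "routing_evidence", "ambiguity_risk", "boundary_insight")


@dataclass
class RoutingCase:
    predicted: str  # Skill chosen by the router
    label: str      # human-annotated ground-truth Skill

    @property
    def kind(self) -> str:
        """'A' for a mismatch (routing error), 'B' for a correct match."""
        return "A" if self.predicted != self.label else "B"


def missing_fields(case: RoutingCase, analysis: Dict[str, str]) -> List[str]:
    """Return the expected output fields absent from a model analysis."""
    expected = CASE_A_FIELDS if case.kind == "A" else CASE_B_FIELDS
    return [name for name in expected if name not in analysis]


def group_evidence(
    cases: List[RoutingCase], analyses: List[Dict[str, str]]
) -> Dict[str, Dict[str, List[str]]]:
    """Collect Case A suggestions and Case B insights per Skill.

    Assumption: evidence is grouped under the ground-truth Skill.
    """
    grouped = defaultdict(lambda: {"suggestions": [], "insights": []})
    for case, analysis in zip(cases, analyses):
        if missing_fields(case, analysis):
            continue  # skip malformed analyses
        if case.kind == "A":
            grouped[case.label]["suggestions"].append(analysis["suggestion"])
        else:
            grouped[case.label]["insights"].append(analysis["boundary_insight"])
    return dict(grouped)
```

Keeping Case B insights next to Case A suggestions lets a downstream patch step see which existing boundary rules are already working before it tightens anything\.
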
### E\.2 LLM Judge Prompt Design

The LLM Judge evaluates each response against the Skill Body constraints across the four dimensions defined in Appendix [C](https://arxiv.org/html/2606.12984#A3)\. A central design choice is the Routing\-Deviation Fallback Principle, which overrides all other criteria: when the input does not perfectly match the Skill’s trigger scope, rigid rule\-following, even if technically compliant, is penalized as heavily as returning an error or silence\. The ideal response instead detects the deviation, infers the user’s true intent, and adaptively delivers a satisfying answer\. This prevents the Judge from rewarding safe\-but\-useless responses at Skill boundaries\.

The Judge produces structured JSON containing per\-dimension scores and tiers \(Good / Average / Poor\), `rule_violations` and `ideal_gaps` \(each requiring citation of concrete response excerpts\), and `skill_md_suggestions`\. The violations and gaps feed the cross\-sample attribution step; the suggestions are aggregated into Skill Body refinement patches\. The full prompt is shown in Table [9](https://arxiv.org/html/2606.12984#A5.T9)\.
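
The evidence requirement lends itself to a mechanical check before findings reach the attribution step\. The sketch below assumes a shape the paper does not specify: each entry in `rule_violations` and `ideal_gaps` is taken to be an object with a hypothetical `evidence` key holding the cited excerpt, and a finding is kept only if that excerpt occurs verbatim in the response\. The function name `grounded_findings` is likewise illustrative\.

```python
import json
from typing import Dict, List

# Judge output lists that must cite concrete response excerpts.
EVIDENCE_LISTS = ("rule_violations", "ideal_gaps")


def grounded_findings(judge_json: str, response: str) -> Dict[str, List[dict]]:
    """Keep only findings whose cited excerpt occurs verbatim in the response."""
    report = json.loads(judge_json)
    kept: Dict[str, List[dict]] = {}
    for key in EVIDENCE_LISTS:
        kept[key] = [
            entry
            for entry in report.get(key, [])
            # Assumed entry shape: {"issue": ..., "evidence": ...}
            if entry.get("evidence") and entry["evidence"] in response
        ]
    return kept
```

A verbatim substring test is deliberately strict; a production filter would likely need to tolerate whitespace or formatting differences in the quoted excerpt\.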

**LLM Judge Prompt Template**

\#\#\# Role

You are a senior AI product evaluator specializing in assessing assistant response quality under specific Skill scenarios\. Given the target Skill definition, evaluate the response against the ideal behavior a fully competent reply should exhibit: not merely what the Skill Body prescribes, but what would genuinely satisfy the user\.

\#\#\# Input

- Skill Body: \{Skill Body\}
- Response: \{Response\}

\#\#\# Routing\-Deviation Fallback Principle \(highest priority\)

When the input does not perfectly match the Skill’s trigger scope \(*routing deviation*\), an ideal response must: \(1\) detect the deviation, \(2\) infer the user’s true intent, and \(3\) adaptively deliver a satisfying answer\.

Two behaviors are penalized equally: \(a\) mechanically applying the Skill template while ignoring the user’s real need; and \(b\) returning an error or silence without any useful fallback\. A reasonable rule deviation made to better serve user intent is *not* penalized; it should be credited\.

\#\#\# Evaluation Dimensions

- **TCR: Tool Call Compliance \[0–10\]**: Are all necessary tools invoked in the correct order with accurate parameters?
  - Good \(8–10\): complete, ordered, results used in answer\.
  - Avg \(4–7\): minor param gaps or underutilized results\.
  - Poor \(0–3\): missing first\-turn tools or calls disconnected from output\.
- **CCC: Card Orchestration Compliance \[0–10\]**: Is the card type correct? Are cards placed inline at point\-of\-mention? Are titles \(≤ 14 chars\) and search terms precise? Are follow\-up prompts context\-relevant and diverse?
  - Good \(8–10\): correct type, timely placement, precise title and search terms, valuable follow\-ups\.
  - Avg \(4–7\): type correct but placement late, or generic titles/search terms, or templated follow\-ups\.
  - Poor \(0–3\): wrong card type, missing required cards, banned title words, or no follow\-up\.
- **CQ: Content Quality \[0–20\]**:
  - Factual accuracy \[8\]: conclusions correct, no hallucination\.
  - Readability \[6\]: mobile\-friendly, concise, well\-structured\.
  - Intent understanding \[6\]: hits the user’s core need; in routing\-deviation scenarios this sub\-dimension carries the most weight, and rigid rule\-following that misses user need scores 0–2\.
  - Good \(16–20\): accurate, clean layout, deeply addresses user intent\.
  - Avg \(8–15\): mostly correct but lacks depth or has formatting issues\.
  - Poor \(0–7\): core error, cluttered layout, or intent completely missed\.
- **CA: Constraint Adherence \[0–10\]**: Are mandatory rules followed and prohibited actions avoided? Reasonable deviations to better serve user intent in routing\-deviation scenarios are *not* treated as violations\.
  - Good \(8–10\): all mandatory rules and prohibitions respected; flexible fallback in routing\-deviation cases\.
  - Avg \(4–7\): main rules followed but 1–2 gaps, or execution slightly mechanical\.
  - Poor \(0–3\): core prohibition violated, or rigid template application in routing\-deviation scenario leaves user unserved\.

\#\#\# Evidence Requirement

Every entry in `rule_violations` and `ideal_gaps` must cite a concrete excerpt from the response; no imagined details\.

\#\#\# Output Format \(strict JSON, no other text\)

- `scores`: TCR \[0–10\], CCC \[0–10\], CQ \[0–20\], CA \[0–10\]
- `tiers`: Good / Average / Poor per dimension
- `total`: sum of four scores \(max 50\)
- `rule_violations`: violations with evidence citations
- `ideal_gaps`: ideal behaviors absent from the response
- `skill_md_suggestions`: Skill Body refinement patches

Table 9: LLM Judge prompt template used in Stage 3\. Raw scores \(max 50\) are linearly normalized to $[0,1]$ for reporting\.
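
The tier bands and the normalization in Table 9 can be expressed compactly\. The following sketch uses the score ranges from the rubric and assumes integer scores; the names `RUBRIC`, `tier`, and `summarize` are hypothetical and are not taken from the paper\.

```python
from typing import Dict

# Per dimension: (max score, lowest Good score, lowest Average score), from Table 9.
RUBRIC = {
    "TCR": (10, 8, 4),
    "CCC": (10, 8, 4),
    "CQ": (20, 16, 8),
    "CA": (10, 8, 4),
}
MAX_TOTAL = sum(max_score for max_score, _, _ in RUBRIC.values())  # 50


def tier(dimension: str, score: int) -> str:
    """Map a raw score to its Good / Average / Poor band."""
    max_score, good_min, avg_min = RUBRIC[dimension]
    if not 0 <= score <= max_score:
        raise ValueError(f"{dimension} score {score} is outside [0, {max_score}]")
    if score >= good_min:
        return "Good"
    if score >= avg_min:
        return "Average"
    return "Poor"


def summarize(scores: Dict[str, int]) -> dict:
    """Compute per-dimension tiers, the raw total, and the [0, 1] normalized score."""
    total = sum(scores[name] for name in RUBRIC)
    return {
        "tiers": {name: tier(name, scores[name]) for name in RUBRIC},
        "total": total,
        "normalized": total / MAX_TOTAL,  # linear normalization of the raw total
    }
```

Recomputing tiers and totals from the raw scores, rather than trusting the values the Judge emits, is one way to catch arithmetic slips in model\-generated JSON\.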
