@dair_ai: If you build web agents, this one is worth your time. It's on how to make agent skills reusable. (bookmark it) LLM web …
Summary
This paper introduces SkillMigrator, an LLM web agent that learns reusable skills and transfers them across websites by matching layout structure rather than domain-specific metadata, reducing LLM action count by 8-10% on WebArena and Mind2Web benchmarks.
If you build web agents, this one is worth your time.
It’s on how to make agent skills reusable.
(bookmark it)
LLM web agents usually run as tool callers. Each turn, the model reads a fresh page and emits one low-level action, so horizons and policy-facing LLM completions both blow up on benchmarks like Mind2Web and WebArena.
Skill libraries are meant to fix this by wrapping repeated fragments as callable tools, but they trigger reuse on instruction similarity or site metadata, which barely fires on held-out sites.
This work routes skill reuse by transferable interaction patterns instead, so a skill learned on one site fires on new sites that share the same interaction shape. That lifts reuse where domain-keyed retrieval falls flat.
Why does it matter?
The same search, filter, and paginate dance shows up across sites. Abstracting it into a pattern-keyed skill makes web-agent skills generalize beyond the site on which they were learned.
Paper: https://arxiv.org/abs/2606.17645
Learn to build effective AI agents in our academy: https://academy.dair.ai
Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns
Source: https://arxiv.org/html/2606.17645
Shiqi He¹, Yue Cui², Feijie Wu³, Xinyu Ma⁴, Jiaheng Lu⁵, Yaliang Li², Bolin Ding², Mosharaf Chowdhury¹
¹University of Michigan, ²Alibaba Group, ³Purdue University, ⁴McMaster University, ⁵University of Pennsylvania
Abstract
Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost on benchmarks such as Mind2Web and WebArena. Recent systems therefore wrap repeated interaction fragments as web skills: callable tools built from successful trajectories or induced programs, so one call can replace several primitives. However, prior skill libraries are still triggered mainly by instruction similarity or coarse site metadata, which yields low skill reuse on held-out sites and leaves much of the potential step and token reduction on the table.
We present SkillMigrator, an agent that learns reusable web skills and transfers them across sites by matching layout structure rather than specific element references. Each induced skill is stored as a transferable interaction pattern (TIP): the skill paired with a structural sketch of the snapshot at induction time. At test time, SkillMigrator retrieves TIPs by layout similarity and grounds their references on the live page. The rest of the stack is standard: accessibility-snapshot observations with stable references, and fixed tool calling over primitives plus skill invocations. Compared with the state-of-the-art approaches, SkillMigrator reduces the average LLM-action count on successful trajectories by 8–10% across both WebArena and Mind2Web at matched success rate.
1 Introduction
Web agents translate a user’s natural-language goal into a sequence of browser actions such as searching, clicking, typing, and submitting forms, providing a general interface for automating tasks that are difficult to script manually. Recent benchmarks such as WebShop [27], Mind2Web [2], and WebArena [31] cover e-commerce interaction, open-domain websites, and realistic self-hosted environments, highlighting both the practical value and the difficulty of this setting. However, most existing web agents rely on an LLM-centered decision loop that repeatedly queries the LLM to predict the next action from the current webpage state, often following reasoning-and-acting paradigms such as ReAct [28]. This design is flexible but expensive at deployment, since each task may require many sequential LLM calls, with cost and latency growing in the length and number of interaction trajectories [26]. There is therefore a need for a cost-effective web agent that reduces reliance on the LLM.


[Figure 1 panels. Shopify (e-commerce), task “Add a new product”: fill(‘Title’, …), fill(‘Description’, …), fill(‘Price’, …), click(‘Save’). GitLab (developer tools), task “Open a new issue”: fill(‘Title’, …), fill(‘Description’, …), select_option(‘Type’, …), click(‘Create issue’). Postmill (online forum), task “Create a new forum”: fill(‘Name’, …), fill(‘Title’, …), fill(‘Description’, …), click(‘Create forum’). One TIP reusable across all three domains: $\iota$: “fill a labelled form and click submit”; $\sigma$: fill-and-submit template; $\Phi$: {title-like, body-like, …}; plan ⇒ fill* then click(submit), which replaces $n$ policy LLM calls with 1 skill call.]
Figure 1: Cross-domain skill reuse motivates SkillMigrator. Three websites drawn from very different domains, Shopify (e-commerce), GitLab (developer tools), and Postmill (online forum), use different page layouts, field vocabularies, and submit-button labels. Yet the three subtasks reduce to the same programmatic pattern: fill a few labelled inputs, then click a single submit button. Same-colour fields (title-like, body-like, submit) are paraphrases of the same abstract slot across all three sites. SkillMigrator stores one TIP (intent $\iota$, operation template $\sigma$, slot schema $\Phi$, and the induction-time tree skeleton $\tau$) and reuses it on all three pages, replacing many policy LLM steps with a single skill call.
Recent work pursues this goal through reusable web skills, which store procedural knowledge from prior web interactions and reapply it in future tasks [24,23,30,16,22]. A successful interaction trajectory is abstracted into a skill that the agent can retrieve and execute when it encounters a similar goal or webpage state. This replaces many primitive LLM decisions with a single higher-level operation, reducing LLM calls and shortening the interaction trajectory. Because the skill encodes a validated action pattern, it also mitigates compounding errors in long-horizon navigation [23,30,16,13].
Existing skill-reuse methods fall into two types: reusing web skills under the same website, where the skill is specialized to a specific site and its interface [30,16,22], and reusing web skills under the same domain, where the skill is transferred across websites that share similar task structures, such as shopping, maps, forums, or code repositories [24,23,29]. Both directions have important limitations. Same-website methods suffer from low reuse rates, because each skill is tied to the specific interface, DOM structure, and interaction pattern of the site where it was learned [30,16,22]. The agent benefits only when future tasks revisit that site, which is restrictive in open-web settings where user requests span diverse websites. Same-domain methods such as PolySkill [29] address this through polymorphic abstractions that separate shared skill interfaces from site-specific implementations, enabling reuse across sites within a domain. However, they still confine reuse to a single domain and miss the fact that sites across different domains often share strong interaction patterns, as illustrated in Figure 1. As a result, current skill-reuse methods remain narrower than necessary, motivating a more general form of reusable web skills that can transfer beyond both the same website and the same domain.
Obtaining reusable web skills beyond both the same website and the same domain is challenging because of unreliable skill retrieval. Existing methods retrieve candidates using instruction similarity, intent labels, or website metadata[24,30,29]. However, these signals are insufficient for cross-domain transfer: two tasks with different wording may require the same interaction program, while two textually similar tasks may require different DOM-level control flow. As a result, the agent may fail to retrieve useful skills or mistakenly execute unsuitable ones, forcing it to fall back to primitive action generation with frequent LLM calls.
We propose SkillMigrator, a cost-effective web agent that enables skill reuse beyond both the same website and the same domain. SkillMigrator follows the standard programmatic-skill setting, where observations are accessibility snapshots with stable element references and actions are issued through a fixed tool-calling API over primitive actions and skill invocations [30,16,29]. Its memory unit is a transferable interaction pattern (TIP), which pairs each induced skill with a structural sketch of the webpage snapshot where the skill was validated. At inference time, SkillMigrator retrieves skills from a single global library by combining layout similarity with text signals, then grounds the matched abstract constraints to live element references before replaying the skill. This design allows the agent to identify reusable interaction patterns across websites with different wording, interfaces, and domains, while avoiding the execution of weakly matched skills.
Contributions.
- To our knowledge, this is the first work to study reusable web skills across websites beyond domains. This setting is non-trivial because similar interaction patterns may appear with different layouts, labels, and DOM structures.
- We propose SkillMigrator for cross-domain skill matching. It stores induced skills as TIPs, each pairing a validated skill with the structural sketch of its source webpage. For a new task, SkillMigrator retrieves relevant TIPs using layout and text signals, grounds them to live webpage elements, and falls back to primitive control when no reliable match is found.
- We empirically compare SkillMigrator against existing web-agent baselines on Mind2Web and WebArena, reducing average LLM-action count on successful trajectories by 8–10% relative to the state-of-the-art baselines at matched task success rate.
2 Background and Motivation
2.1 Preliminaries and Problem Formulation
Table 1: Representative primitive tools. $e$ is an element ref in the text snapshot. Skill calls expand into grounded primitive calls.
Web Agent Environment.
We follow the BrowserGym and WebArena convention [3,31]: at each timestep $t$, the agent receives a Playwright-style accessibility snapshot $o_t$ with stable refs, roles, names, and state attributes, and emits one tool call $a_t \sim \pi_\theta(\cdot \mid q, o_{0:t}, a_{0:t-1})$ from the action space $\mathcal{A}$ in Table 1. For benchmarks, we consider Mind2Web [2] (cross-task, cross-website, cross-domain splits) and WebArena [31] (812 executable tasks across shopping, admin, reddit, gitlab, map, multisite). The default policy input is text-only, and screenshot-grounded or pixel-space agents are outside our scope [6,10]. A full formulation is in Appendix A.
Skill library.
Beyond primitive browser actions, recent web agents [23,30,29] are equipped with a skill library $\mathcal{K}$. Each skill $k \in \mathcal{K}$ is a temporally extended routine [19] that maps a subtask $s$ and observation $o$ to a short action sequence $k(s,o) = \langle \tilde{a}_1, \dots, \tilde{a}_n \rangle$ over the primitives of Table 1. It is exposed to the policy as a callable high-level macro that reuses recurring interaction patterns such as opening a menu, filling a form, searching, or filtering.
Problem Formulation.
Given an instruction $q$, a planner decomposes it into a sequence of subtasks $\mathbf{s}(q) = \{s_1, \ldots, s_{T_q}\}$. Let $\tilde{o}_0$ be the initial observation and $\tilde{o}_i$ the observation after finishing $s_i$. We define $\mathcal{N}(s, o \mid \mathcal{K}, \pi_\theta)$ as the number of new primitive actions emitted by $\pi_\theta$ to complete $s$ from $o$: if the subtask is fully covered by a retrieved skill, $\mathcal{N} = 0$; otherwise the agent falls back to $\pi_\theta$ and $\mathcal{N} = n$, where $n$ is the length of the fallback trajectory. Our goal is to construct a compact skill library that minimises the expected number of LLM-generated primitive actions across tasks:
$$\min_{\mathcal{K}} \; \mathbb{E}_{q \sim Q}\left[\sum_{i=1}^{T_q} \mathcal{N}(s_i, \tilde{o}_{i-1} \mid \mathcal{K}, \pi_\theta)\right] + \lambda\, C(\mathcal{K}), \tag{1}$$
where $Q$ is the task distribution, $C(\mathcal{K})$ is the library cost, and $\lambda$ trades off action savings against library size.
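As a quick check on how the two terms of Eq. (1) interact, the sketch below evaluates a sample-mean version of the objective on toy numbers. The function name, the per-task lists, and the values chosen for $\lambda$ and $C(\mathcal{K})$ are illustrative assumptions, not quantities from the paper.

```python
from typing import Sequence


def empirical_objective(
    new_actions_per_task: Sequence[Sequence[int]],
    library_cost: float,
    lam: float,
) -> float:
    """Sample-mean version of Eq. (1).

    new_actions_per_task[q][i] plays the role of N(s_i, o_{i-1} | K, pi):
    0 when a retrieved skill covers the subtask, otherwise the length of
    the fallback trajectory.
    """
    if not new_actions_per_task:
        raise ValueError("need at least one task")
    total = sum(sum(task) for task in new_actions_per_task)
    return total / len(new_actions_per_task) + lam * library_cost


# Two hypothetical tasks: a skill covers the first subtask of task 0
# and the second subtask of task 1.
tasks = [[0, 3], [2, 0, 1]]
value = empirical_objective(tasks, library_cost=10.0, lam=0.1)
```

Here each task contributes 3 new primitive actions, so the mean is 3, and the library penalty adds 0.1 × 10 = 1.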
2.2 Skill Induction Methods
[Figure 2 panels: (a) Skill reuse rate. (b) Mean effective steps.]
Figure 2: Motivating comparison of ASI, SkillWeaver, and PolySkill on a cumulative Mind2Web-style setup: cross-task (same website), cross-website (same domain, new sites), and cross-domain (new domains). Subfigure 2(a) reports per-phase skill reuse and 2(b) shows successful-step cost.
Existing Work.
Recent work on reusable procedural knowledge for web and tool-use agents includes textual workflow memory (AWM[24]), verified programmatic skills callable as high-level actions (ASI[23]), self-induced skill APIs from exploration (SkillWeaver[30]), polymorphic cross-site abstractions (PolySkill[29]), and discovered website tools (WALT[16]). Together, these methods compress repeated primitive sequences into reusable abstractions, shortening interaction horizons and reducing policy LLM calls.
Limitations.
Despite these advances, skill retrieval remains a bottleneck on unseen websites because existing methods rely on semantic keys such as task descriptions, workflow summaries, skill names, and API descriptions. Web tasks often preserve interaction structure while changing surface wording, so purely semantic retrieval under-retrieves reusable skills (raising $\mathcal{N}(s, o \mid \mathcal{K}, \pi_\theta)$) and over-retrieves skills whose execution context is incompatible with the current page. Figure 2 illustrates this on a 60-task Mind2Web subset evaluated with GPT-4.1: skill reuse drops sharply for ASI, SkillWeaver, and PolySkill as the test trace moves from cross-task to cross-website to cross-domain (Figure 2(a)), and average successful-step cost rises in lock-step (Figure 2(b)). This motivates augmenting retrieval with a layout-conditioned signal that uses the current observation as additional evidence for reuse beyond semantic similarity.
3 Design
SkillMigrator reduces $\mathcal{N}(s_i, \tilde{o}_{i-1} \mid \mathcal{K}, \pi_\theta)$ in the objective of §2 by reusing skills in $\mathcal{K}$ across websites and domains whose surface wording and labels differ. At each pair $(s_i, \tilde{o}_{i-1})$, the agent (i) summarises the live snapshot and retrieves the most similar skill from $\mathcal{K}$ (§3.1); (ii) parses one value for each slot of that skill from the user’s instruction; and (iii) binds each slot to a concrete control on the live page so that the skill expands into grounded primitive actions without an LLM call (§3.2). Figure 4 summarises this runtime control flow.
3.1 Skill Record and Retrieval
Snapshot.
At time $t$, the observation $o_t$ is a Playwright accessibility snapshot, a YAML-like serialization of the agent-visible accessibility tree [3]. Each node exposes a semantic role (`textbox`, `button`, `link`, `heading`, …), an accessible name, and a stable agent-addressable reference `ref=eN` that primitive actions use as their target. A deterministic rule scans $o_t$ once and produces (i) a one-line page summary $\rho(o_t)$ that lists the page heading, the labelled controls inside the principal form, and the primary button labels, and (ii) the list $V(o_t)$ of all interactive nodes with their roles and names.
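A minimal sketch of such a deterministic scan is given below. It assumes each snapshot line has the shape `- role "name" [ref=eN]`, treats a fixed set of roles as interactive, and lists every textbox and button instead of detecting the principal form; the regular expression, the role set, the output format of the summary, and the heading text in the example are all assumptions made for illustration.

```python
import re
from typing import NamedTuple


class Node(NamedTuple):
    role: str
    name: str
    ref: str


# Assumed line shape: - role "accessible name" [ref=eN]
_LINE = re.compile(r'-\s+(\w+)\s+"([^"]*)"\s+\[ref=(e\d+)\]')
# Assumed set of roles that count as interactive.
INTERACTIVE = {"textbox", "button", "link", "combobox", "checkbox"}


def parse_nodes(snapshot: str) -> list[Node]:
    """Extract (role, name, ref) triples in document order."""
    return [Node(*m.groups()) for m in _LINE.finditer(snapshot)]


def interactive_nodes(snapshot: str) -> list[Node]:
    """V(o): interactive nodes with their roles and names."""
    return [n for n in parse_nodes(snapshot) if n.role in INTERACTIVE]


def page_summary(snapshot: str) -> str:
    """rho(o): headings, textbox labels, and button labels on one line."""
    nodes = parse_nodes(snapshot)
    headings = [n.name for n in nodes if n.role == "heading"]
    fields = [n.name for n in nodes if n.role == "textbox"]
    buttons = [n.name for n in nodes if n.role == "button"]
    return (f"heading: {', '.join(headings)} | fields: {', '.join(fields)}"
            f" | buttons: {', '.join(buttons)}")


# Controls follow the shopping-admin page of Figure 3; the heading is made up.
snapshot = '''
- heading "Write review" [ref=e3]
- textbox "Summary" [ref=e16]
- textbox "Review" [ref=e21]
- button "Save Review" [ref=e25]
- button "Reset" [ref=e26]
'''
summary = page_summary(snapshot)
nodes = interactive_nodes(snapshot)
```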
Skill record.
Rather than storing a literal action recipe, which would not survive relabelling on a new site, each induced skill is stored as a transferable interaction pattern (TIP), a tuple
$$k = \bigl(\iota_k,\; \sigma_k,\; \Phi_k,\; \tau_k\bigr), \tag{2}$$
where $\iota_k$ is a one-sentence natural-language intent, $\sigma_k$ is the skill’s operation template, drawn from a set $\Sigma$ of templates mined offline by clustering training trajectories on action shape (e.g. a single `fill` on a `searchbox` followed by a submit, or several `fill`s followed by a labelled `button`), $\Phi_k = \{\xi_1, \dots, \xi_m\}$ is the slot schema, and $\tau_k$ is the cleaned accessibility-tree skeleton of the induction-time snapshot, kept as a small labelled tree carrying role and name per node and used by the layout signal in Eq. (3). Each slot $\xi$ stores a key (e.g. `post_title`), a one-line descriptor $d_\xi$ such as “the headline of the new entry”, and a small synonym set $T_\xi$ mined from co-clustered training trajectories (e.g. title → {title, headline, subject, name, summary, topic}). The record contains no `ref=eN` identifiers, which are properties of the live page at test time. Each operation template $\sigma$ comes with a fixed deterministic plan that calls `fill`/`click`/`press` primitives once $\Phi_k$ is bound (Algorithm 1). Reusing the same template across sites is what makes a skill cross-domain rather than a per-trajectory replay.
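One way to picture the record in Eq. (2) is as a small immutable data structure. The sketch below is an assumption about a reasonable in-memory layout, not the authors' implementation; the class and field names are hypothetical, and the example values are taken from the Postmill skill described in Figure 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    key: str                  # training-time key, e.g. "post_title"
    descriptor: str           # one-line natural-language descriptor d_xi
    synonyms: frozenset[str]  # synonym set T_xi


@dataclass(frozen=True)
class TreeNode:
    role: str
    name: str = ""
    children: tuple["TreeNode", ...] = ()

    def size(self) -> int:
        """Number of nodes in the subtree rooted here."""
        return 1 + sum(child.size() for child in self.children)


@dataclass(frozen=True)
class TIP:
    intent: str               # iota_k
    template: str             # sigma_k, one of the mined operation templates
    slots: tuple[Slot, ...]   # Phi_k
    skeleton: TreeNode        # tau_k; deliberately holds no ref=eN values


post_tip = TIP(
    intent="create a new entry by filling a labelled form and clicking submit",
    template="fill-and-submit",
    slots=(
        Slot("post_title", "the headline of the entry",
             frozenset({"title", "headline", "subject", "summary"})),
        Slot("post_body", "the main text content",
             frozenset({"body", "description", "reply", "comment"})),
    ),
    skeleton=TreeNode("form", children=(
        TreeNode("textbox", "Title"),
        TreeNode("textbox", "Body"),
        TreeNode("button", "Create sub."),  # label as abbreviated in Figure 3
    )),
)
```

Keeping the record frozen and free of element references mirrors the point made above: everything site-specific is resolved later against the live page.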
[Figure 3 content. Source trajectory (site: postmill, task: create submission). Recorded primitives: click textbox ‘Title’, press textbox ‘Title’, click textbox ‘Body’, press textbox ‘Body’, click button ‘Create sub.’
Stored skill $k = (\iota_k, \sigma_k, \Phi_k, \tau_k)$. $\iota_k$: create a new entry by filling a labelled form and clicking submit. $\sigma_k$: fill-and-submit template. $\Phi_k = \{\xi_1, \xi_2\}$ slot schema, with $\xi_1$ = (post_title, “the headline of the entry”, $T_{\xi_1}$ = {title, headline, subject, summary, …}) and $\xi_2$ = (post_body, “the main text content”, $T_{\xi_2}$ = {body, description, reply, comment, …}). $\tau_k$: form → textbox + button.
Target page (site: shopping admin, task: write review). Live controls: textbox ‘Summary’ [e16], textbox ‘Review’ [e21], button ‘Save Review’ [e25], button ‘Reset’ [e26], … Status, Nickname, …
Subtask $s_i$: “Write a customer review with summary ‘Great fit!’ and detailed comment ‘Comfortable for long training runs; true to size and stays cool.’” Instantiation dict $D$: {summary: ‘Great fit!’, comment: ‘Comfortable for long …’}.
Stage A (value parse): $\xi_1$ post_title ↦ ‘Great fit!’; $\xi_2$ post_body ↦ ‘Comfortable for …’. Stage B (slot → ref): $\xi_1$ → ref=e16 (Summary); $\xi_2$ → ref=e21 (Review); submit → ref=e25 (Save Review).]
Figure 3: End-to-end cross-domain example. A skill induced from a Postmill create-submission trajectory (left) is retrieved on a shopping-admin write-review page (right): post_title and post_body rebind to Summary and Review through the synonym pools in $\Phi_k$, and the bottom panel walks through Stage A and Stage B for one subtask. Same-colour fields (title-like, body-like, submit) denote the same abstract slot across both pages.
Retrieval.
Given the current pair $(s_i, \tilde{o}_{i-1})$, we score each skill $k \in \mathcal{K}$ by combining a text signal that captures what kind of subtask is being attempted with a layout signal that captures what kind of page the agent is on:
$$\mathrm{score}(k, s_i, \tilde{o}_{i-1}) = \alpha\, \underbrace{\mathbf{e}\bigl(s_i \,\|\, \rho(\tilde{o}_{i-1})\bigr)^{\top} \mathbf{e}(\delta_k)}_{\text{text signal}} + (1-\alpha)\, \underbrace{\mathcal{L}\bigl(k, \tilde{o}_{i-1}\bigr)}_{\text{layout signal}}, \tag{3}$$
with $\alpha \in [0,1]$. Here $\mathbf{e}(\cdot)$ is a frozen sentence encoder [17], and $\delta_k = \iota_k \,\|\, \bigcup_{\xi \in \Phi_k} (d_\xi \,\|\, T_\xi)$ is a rich descriptor concatenating the skill’s intent with every slot descriptor and synonym, so paraphrased subtasks across sites stay close in embedding space [12,21]. The layout signal $\mathcal{L}(k, \tilde{o}_{i-1})$ uses the live accessibility-tree structure of $\tilde{o}_{i-1}$ to ground retrieval in what the page actually exposes. The tree edit distance (TED) term $1 - \mathrm{TED}(\tau_k, \tau(\tilde{o}_{i-1})) / \max(|\tau_k|, |\tau(\tilde{o}_{i-1})|)$ is computed by APTED on small trees [20,15]. This is the dominant signal when source and target sites use different labels for structurally analogous forms.
We open skill mode only if $\mathrm{score}(k^\star, s_i, \tilde{o}_{i-1}) \geq \beta$ for the top-1 skill $k^\star$; otherwise the agent stays in react mode [28]. This gate matters because not every test subtask has a transferable analog in $\mathcal{K}$, and forcing a weakly matched skill on an unrelated page would silently corrupt the trajectory. $\alpha$ and $\beta$ are tuned on held-out training trajectories.
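The scoring and gating arithmetic can be sketched in a few lines. This is a simplified illustration under stated assumptions: the text similarity and the tree edit distance are passed in as precomputed numbers (no encoder or APTED call is made), the layout signal is taken to be the TED term alone, and the function names are hypothetical. The defaults $\alpha = 0.6$ and $\beta = 0.20$ are the values reported in the configuration of §4.1.

```python
from typing import Optional


def layout_similarity(ted: float, size_skill: int, size_page: int) -> float:
    """Normalised TED term: 1 - TED / max(|tau_k|, |tau(o)|)."""
    return 1.0 - ted / max(size_skill, size_page)


def combined_score(text_sim: float, layout_sim: float, alpha: float = 0.6) -> float:
    """Eq. (3): convex mix of the text signal and the layout signal."""
    return alpha * text_sim + (1.0 - alpha) * layout_sim


def gate(scores: dict[str, float], beta: float = 0.20) -> Optional[str]:
    """Return the top-1 skill if it clears beta, else None (stay in react mode)."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= beta else None


# Toy numbers: a 4-node skill skeleton, a 6-node page tree, TED of 3.
layout = layout_similarity(ted=3, size_skill=4, size_page=6)
total = combined_score(text_sim=0.3, layout_sim=layout)
chosen = gate({"fill-and-submit": total, "search": 0.12})
```

With these toy inputs the layout term is 0.5 and the combined score is 0.6 × 0.3 + 0.4 × 0.5 = 0.38, which clears the gate.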
3.2 Slot Binding and Execution
Once $k^\star$ is chosen, the agent must associate each slot $\xi \in \Phi_{k^\star}$ with a concrete value string before binding it to a control on the page. We follow the cross-domain slot-filling view of Liu et al. [12] and Wang et al. [21]: a slot is identified by its NL descriptor $d_\xi$ and synonym pool $T_\xi$, not by its training-time key, so unseen vocabulary on the test side does not break the match.
Algorithm 1: SkillMigrator inference for one subtask $(s_i, \tilde{o}_{i-1})$.
1: Input: subtask $s_i$, snapshot $\tilde{o}_{i-1}$, library $\mathcal{K}$, encoder $\mathbf{e}$, instantiation dict $D$, gate threshold $\beta$.
2: compute summary $\rho(\tilde{o}_{i-1})$ and node list $V(\tilde{o}_{i-1})$
3: $k^\star \leftarrow \arg\max_{k \in \mathcal{K}} \mathrm{score}(k, s_i, \tilde{o}_{i-1})$ via Eq. (3)
4: if $\mathrm{score}(k^\star, s_i, \tilde{o}_{i-1}) < \beta$ then
5:     return $\pi_\theta(s_i, \tilde{o}_{i-1})$ ▷ fall back to react mode
6: end if
7: Stage A: Hungarian-solve $\Phi_{k^\star} \times D$ for slot values; for any unbound slot, extract a span from $s_i$
8: Stage B: Hungarian-solve $\Phi_{k^\star} \times V(\tilde{o}_{i-1})$ for slot → ref bindings
9: emit primitives per the plan of $\sigma_{k^\star}$; escalate any unbound required slot to $\pi_\theta$
[Figure 4 diagram nodes: Instr. $q$, Planner, Summary $\rho, V$, Retrieval (Eq. (3)), Library $\mathcal{K}$, Gate $\beta$ (pass / fail), Stage A (value parse), Stage B (slot → ref), Primitive $\pi_\theta$, Next snapshot, next subtask / stop.]
Figure 4: Runtime control flow. Passing the gate $\beta$ routes the subtask through skill mode (Stage A then Stage B). Failing falls back to a single primitive step from $\pi_\theta$.
Stage A: parsing slot values from the instruction.
The planner forces each subtask $s_i$ to expose an instantiation dict $D = \{(\eta_j, v_j)\}_j$ in the task spec [31], where $\eta_j$ is a key (`subject`, `message`, `from_location`, …) and $v_j$ is the corresponding value. We score every candidate pair $(\xi, \eta_j)$ by
$$\mathbf{e}(d_\xi \,\|\, T_\xi)^{\top} \mathbf{e}(\eta_j) + \lambda_{\mathrm{lit}} \cdot \mathbb{1}[\eta_j \in T_\xi],$$
and run a Hungarian solve [11,14] on the resulting $|\Phi_{k^\star}| \times |D|$ matrix to obtain a one-to-one assignment. The literal bonus $\lambda_{\mathrm{lit}}$ is the same disambiguation mechanism long used in slot filling [1]: when $\eta_j$ already appears verbatim inside $T_\xi$ the match is essentially certain. In the example of Figure 3, `summary` appears in $T_{\mathit{post\_title}}$ and `comment` in $T_{\mathit{post\_body}}$, so the source-trajectory keys (`post_title`, `post_body`) are correctly bound to the target keys (`summary`, `comment`).
When $D$ is empty or covers fewer slots than $\Phi_{k^\star}$ needs, we extract candidate spans $\mathcal{C}(s_i)$ directly from the instruction text using a fixed cue grammar (quoted strings, list literals, dates, URLs, “from X to Y”, “titled X”, “named X”, capitalised proper-noun phrases, …). Each span $c$ comes with a small prior reflecting how distinctive its cue is (e.g. a quoted string is a stronger signal than a bare capitalisation). We then score $\mathbf{e}(d_\xi \,\|\, T_\xi)^{\top} \mathbf{e}(\mathrm{ctx}(c))$ on the surrounding context window with the candidate replaced by a placeholder, and run a Hungarian assignment. The two paths are run in order: dict-match first, then instruction-extract for any still-unbound slot.
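The dict-match path of Stage A can be illustrated with the Figure 3 example. The sketch below makes several substitutions that should be read as assumptions: a token-overlap (Jaccard) score stands in for the frozen sentence encoder, an exhaustive search over permutations stands in for the Hungarian solver (it reaches the same optimum on matrices this small), and $\lambda_{\mathrm{lit}} = 1.0$ is an arbitrary choice. The span-extraction path is not covered.

```python
from itertools import permutations


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-overlap similarity; a stand-in for the encoder dot product."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_score(descriptor: str, synonyms: set[str], key: str,
               lam_lit: float = 1.0) -> float:
    """Similarity of (slot, dict key) plus a literal bonus if key is in T_xi."""
    slot_tokens = set(descriptor.lower().split()) | synonyms
    key_tokens = set(key.lower().split("_"))
    return jaccard(slot_tokens, key_tokens) + (lam_lit if key in synonyms else 0.0)


def assign(score: list[list[float]]) -> dict[int, int]:
    """Max-weight one-to-one assignment of rows to columns (brute force)."""
    rows = len(score)
    cols = len(score[0]) if rows else 0
    if rows == 0 or cols == 0:
        return {}
    if rows <= cols:
        best = max(permutations(range(cols), rows),
                   key=lambda p: sum(score[r][c] for r, c in enumerate(p)))
        return dict(enumerate(best))
    best = max(permutations(range(rows), cols),
               key=lambda p: sum(score[r][c] for c, r in enumerate(p)))
    return {r: c for c, r in enumerate(best)}


slots = [
    ("post_title", "the headline of the entry",
     {"title", "headline", "subject", "summary"}),
    ("post_body", "the main text content",
     {"body", "description", "reply", "comment"}),
]
D = {
    "summary": "Great fit!",
    "comment": "Comfortable for long training runs; true to size and stays cool.",
}
keys = list(D)
matrix = [[pair_score(desc, syn, k) for k in keys] for _, desc, syn in slots]
bound = {slots[r][0]: D[keys[c]] for r, c in assign(matrix).items()}
```

Because `summary` and `comment` each appear verbatim in exactly one synonym pool, the literal bonus dominates and the assignment pairs `post_title` with the summary value and `post_body` with the comment value.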
Stage B: binding each slot to a control on the page.
The bound values must finally be typed into actual page controls. For every interactive node $v \in V(\tilde{o}_{i-1})$ we build a deterministic control descriptor $d(v) = \mathrm{role}(v) \,\|\, \mathrm{name}(v)$, score every (slot, control) pair by $\mathbf{e}(d_\xi \,\|\, T_\xi)^{\top} \mathbf{e}(d(v))$, restrict to controls of compatible role (textbox-like for fillable slots), and run a second Hungarian solve. An assignment is accepted only above a similarity threshold. Below it we declare the slot unbound and fall back to $\pi_\theta$ for that slot.
The critical effect of this stage is that the number of fills emitted on the live page is determined by the live page and the matched slots, not by the action count of the source trajectory. On a five-field form like Create new forum (Name, Title, Description, Sidebar, Tags), an instantiation dict that supplies only three values fills only the three matching textboxes and leaves the other two empty; on a two-field form like New issue (Title, Description) the same skill emits two fills. This handles the variable-arity issue without touching the stored skill record.
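The variable-arity behaviour can be reproduced with a toy version of Stage B. In this sketch the similarity is a binary synonym hit instead of an encoder score, the one-to-one assignment is found by exhaustive search instead of a Hungarian solver, and the slot keys, synonym sets, threshold, and `eN` refs are hypothetical values chosen to mirror the two forms named above.

```python
from itertools import permutations
from typing import Optional

Control = tuple[str, str, str]  # (role, accessible name, ref)


def bind_slots(slot_synonyms: dict[str, set[str]],
               controls: list[Control],
               threshold: float = 0.5) -> dict[str, str]:
    """Assign each slot to at most one textbox; drop weak matches."""
    fillable = [c for c in controls if c[0] == "textbox"]  # role filter
    keys = list(slot_synonyms)
    # Pad with None so a slot can stay unmatched when controls run out.
    options: list[Optional[Control]] = fillable + [None] * len(keys)

    def sim(key: str, ctrl: Optional[Control]) -> float:
        if ctrl is None:
            return 0.0
        hit = set(ctrl[1].lower().split()) & slot_synonyms[key]
        return 1.0 if hit else 0.0

    best = max(
        permutations(range(len(options)), len(keys)),
        key=lambda p: sum(sim(k, options[i]) for k, i in zip(keys, p)),
    )
    # Accept a pairing only at or above the threshold; others stay unbound.
    return {k: options[i][2] for k, i in zip(keys, best)
            if sim(k, options[i]) >= threshold}


slot_synonyms = {
    "entry_name": {"name"},
    "post_title": {"title", "headline", "subject"},
    "post_body": {"body", "description"},
}
forum_form = [("textbox", "Name", "e1"), ("textbox", "Title", "e2"),
              ("textbox", "Description", "e3"), ("textbox", "Sidebar", "e4"),
              ("textbox", "Tags", "e5"), ("button", "Create forum", "e6")]
issue_form = [("textbox", "Title", "e1"), ("textbox", "Description", "e2"),
              ("button", "Create issue", "e3")]

forum_binding = bind_slots(slot_synonyms, forum_form)
issue_binding = bind_slots(slot_synonyms, issue_form)
```

The same three-slot schema yields three bindings on the five-field form and two on the two-field form, with `entry_name` left unbound on the latter.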
Plan execution.
Each operation template $\sigma \in \Sigma$ has a fixed plan keyed on the bound slots. A search template emits a `fill` on the search input and a `click` on the submit button, or a `press(Enter)` when no submit button binds. A fill-and-submit template emits one `fill` per bound slot in declared order, then a `click` on the button whose label matches the global submit-keyword pool (create, submit, post, send, save, update, publish) mined from training. A click-by-text template emits a single `click` on the link or cell whose name best matches the target value. Any unbound required slot escalates to $\pi_\theta$.
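Two of these plans are simple enough to sketch directly. The action tuples, function names, and the rule for picking the submit button (first button with a label token in the keyword pool) are assumptions for illustration; only the keyword pool itself and the order of emitted primitives come from the description above.

```python
from typing import Optional

SUBMIT_KEYWORDS = ("create", "submit", "post", "send", "save", "update", "publish")


def fill_and_submit_plan(bound: list[tuple[str, str]],
                         buttons: list[tuple[str, str]]) -> list[tuple]:
    """bound: (ref, value) in declared slot order; buttons: (label, ref)."""
    actions: list[tuple] = [("fill", ref, value) for ref, value in bound]
    for label, ref in buttons:
        if any(word in SUBMIT_KEYWORDS for word in label.lower().split()):
            actions.append(("click", ref))
            return actions
    # No submit-like button: the caller would escalate to the policy.
    raise LookupError("no button matches the submit-keyword pool")


def search_plan(input_ref: str, query: str,
                submit_ref: Optional[str] = None) -> list[tuple]:
    """Fill the search input, then click submit or press Enter."""
    actions: list[tuple] = [("fill", input_ref, query)]
    if submit_ref is not None:
        actions.append(("click", submit_ref))
    else:
        actions.append(("press", input_ref, "Enter"))
    return actions


# Bindings from the Figure 3 example.
review_plan = fill_and_submit_plan(
    bound=[("e16", "Great fit!"),
           ("e21", "Comfortable for long training runs; true to size and stays cool.")],
    buttons=[("Save Review", "e25"), ("Reset", "e26")],
)
# Hypothetical search box with no submit button.
query_plan = search_plan("e4", "running shoes")
```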
4 Experiments
We evaluate SkillMigrator with a focus on the LLM-action count $\mathcal{N}(s_i, \tilde{o}_{i-1} \mid \mathcal{K}, \pi_\theta)$ from Eq. (1) of §2, corresponding to the average successful tool steps metric used in prior work [23,29]. Three research questions organise the evaluation:
- RQ1. Does SkillMigrator reduce $\mathcal{N}$ while preserving task success rate on different splits where previous baselines are restricted to within-domain reuse?
- RQ2. Does SkillMigrator compose with existing skill libraries: when its gate falls back, does plugging an ASI or PolySkill library underneath add to the gain?
- RQ3. Which component is responsible for the gain, and how sensitive are the results to the weight $\alpha$ and gate threshold $\beta$?
4.1 Setup
Benchmarks and baselines.
We evaluate on Mind2Web [2] (137 websites, 31 domains, cross-task / cross-website / cross-domain splits) and WebArena [31] (812 executable tasks across 6 columns). Baselines are ReAct [28] (no skill library), SkillWeaver [30], ASI [23], and PolySkill [29]. ASI, PolySkill, and SkillMigrator are each reported in both a static regime (library fixed before evaluation) and an +Update regime (skills induced online during evaluation), following PolySkill.
Metrics.
The main metric is the average LLM-action count on successful trajectories, $\bar{\mathcal{N}} = \mathbb{E}_{q \in \mathcal{D}_{\mathrm{succ}}} \sum_i \mathcal{N}(s_i, \tilde{o}_{i-1} \mid \mathcal{K}, \pi_\theta)$, counting one LLM call per primitive tool action and zero per retrieved skill. We additionally report task success rate (SR), skill reuse rate, and library size $|\mathcal{K}|$.
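A minimal sketch of how this metric could be computed from logs follows. The trajectory encoding (a success flag plus a list of step kinds) and the function names are assumptions; the counting rule, one per primitive step and zero per skill call on successful trajectories only, is the one stated above.

```python
def average_llm_actions(trajectories: list[tuple[bool, list[str]]]) -> float:
    """Mean number of primitive (LLM-issued) steps over successful trajectories."""
    counts = [
        sum(1 for kind in steps if kind == "primitive")
        for success, steps in trajectories
        if success
    ]
    if not counts:
        raise ValueError("no successful trajectories to average over")
    return sum(counts) / len(counts)


def success_rate(trajectories: list[tuple[bool, list[str]]]) -> float:
    """Fraction of trajectories flagged as successful."""
    return sum(1 for success, _ in trajectories if success) / len(trajectories)


# Hypothetical log: two successes (one uses a skill call) and one failure.
log = [
    (True, ["skill", "primitive", "primitive"]),
    (True, ["primitive"] * 4),
    (False, ["primitive"] * 9),
]
n_bar = average_llm_actions(log)
sr = success_rate(log)
```

The failed nine-step trajectory affects SR but not $\bar{\mathcal{N}}$, which is why the two are reported side by side.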
Configuration.
Unless stated otherwise we use $\alpha = 0.6$ for the text/layout mixing weight in Eq. (3) and $\beta = 0.20$ for the gate threshold of §3.1. Both were selected on a 10% subset of training trajectories without overlapping with any test set. More details are in Appendix B.
4.2 Main Results
Table 2: WebArena per-domain task success rate (SR, %) and average LLM-action count $\bar{\mathcal{N}}$ on successful trajectories. $\bar{\mathcal{N}}$ rows are extracted from trajectory logs under one unified counter.
Table 3: Mind2Web success rate, average LLM-action count $\bar{\mathcal{N}}$, and library size $|\mathcal{K}|$ on the three generalisation splits with GPT-4.1. Methods are reported in two regimes: a static library (rows without the (+Update) suffix) fixed before evaluation, and an +Update regime (rows tagged (+Update)) in which new skills are induced online during evaluation.
Tables 2 and 3, together with the scatter in Figure 5, answer RQ1. The pattern is consistent across both benchmarks: SkillMigrator sits at or just below the strongest baseline on SR, but spends noticeably fewer LLM calls per successful trajectory and reuses more skills across the test pool. On WebArena, aggregate SR trails PolySkill by 3.6 points, but $\bar{\mathcal{N}}$ drops from 5.9 to 5.4, an 8.5% reduction in policy LLM calls, and 16.9% against the ReAct baseline of 6.5. On Mind2Web the same trade-off shows up: SkillMigrator (+Update) achieves a 63.0% SR on cross-domain at $\bar{\mathcal{N}} = 6.2$, against PolySkill (+Update)’s 63.4% SR and $\bar{\mathcal{N}} = 6.9$. The reuse rate further supports this result: cross-domain reuse rises to 35.4% versus PolySkill’s 31%, while the library size is consistently smaller. Overall, SkillMigrator reduces LLM calls per successful trajectory and improves skill reuse on previously unseen domains without a significant decrease in accuracy.
[Figure 5 panels: (a) WebArena average. (b) Mind2Web cross-domain.]
Figure 5: Success rate against average LLM-action count $\bar{\mathcal{N}}$ (lower-left is better on both axes). SkillMigrator sits left of every baseline at comparable SR on both benchmarks, using fewer LLM calls per successful trajectory.
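The relative reductions quoted in this section follow directly from the reported $\bar{\mathcal{N}}$ values. The short check below recomputes them; the helper name is hypothetical and the inputs are the figures stated in the text.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline * 100.0


# WebArena average: PolySkill 5.9, ReAct 6.5, SkillMigrator 5.4.
webarena_vs_polyskill = relative_reduction(5.9, 5.4)
webarena_vs_react = relative_reduction(6.5, 5.4)
# Mind2Web cross-domain (+Update): PolySkill 6.9, SkillMigrator 6.2.
mind2web_vs_polyskill = relative_reduction(6.9, 6.2)
```

Rounded to one decimal place these come to 8.5%, 16.9%, and 10.1%, consistent with the 8–10% range stated against the strongest baseline.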
4.3 Orthogonality with Existing Skill Libraries
Table 4: Orthogonality study (a) and component ablations (b) on WebArena.
(a) WebArena average when the gate of SkillMigrator falls back to a secondary library.

| Configuration | SR % | $\bar{\mathcal{N}}$ |
|---|---|---|
| ASI alone [23] | 46.5 | 6.2 |
| PolySkill alone [29] | 49.3 | 5.9 |
| SkillMigrator alone (gate → ReAct) | 45.7 | 5.4 |
| SkillMigrator + ASI | 47.2 | 5.3 |
| SkillMigrator + PolySkill | 47.7 | 5.3 |

(b) Each row removes one element of SkillMigrator. Lower $\bar{\mathcal{N}}$ is better.
RQ2 asks whether SkillMigrator competes with baselines or composes with them. When the gate falls back, the agent does not need to revert to raw ReAct: control can pass to a secondary library trained by another method. Table 4(a) shows the result. The hybrid SkillMigrator + PolySkill row reaches the lowest $\bar{\mathcal{N}}$ in the table (5.3, vs. 5.4 alone and 5.9 for PolySkill alone), and its 47.7% SR sits between the two single-library numbers (45.7 and 49.3). The two libraries cover different slices of the test set: when a layout match exists, the layout signal triggers a SkillMigrator skill. When no such match exists, control routes to PolySkill underneath. In practice, SkillMigrator can behave as a retrieval layer that sits on top of existing skill-induction systems rather than competing with them.
4.4 Ablation Study
For RQ3, Table 4(b) isolates three ingredients: the layout signal $\mathcal{L}$ (text only pushes $\alpha$ to 1), the slot-synonym pool $T_\xi$ (no synonyms removes paraphrase coverage), and the gate (no gate sets $\beta = 0$). The biggest move is on the WebArena average SR: removing $\mathcal{L}$ drops 6.5 points (45.7 → 39.2) and lands almost on top of the no-skill baseline (38.5%), because instruction text alone cannot tell apart same-wording, different-structure pages. On Mind2Web cross-domain the same ablation costs 3.6 points (59.4 → 55.8), and no synonyms costs 2.4 (59.4 → 57.0): paraphrase coverage matters most when the target site renames its labels. Disabling the gate is the only ablation that lowers $\bar{\mathcal{N}}$ (5.4 → 4.7 on WebArena; 6.6 → 6.3 on Mind2Web) at almost no SR cost (−0.4 on WebArena), so the default $\beta = 0.20$ errs on the conservative side: a deployment that tolerates more skill triggers can push for further LLM-call savings.
4.5 Sensitivity Analysis
(a) Sweep $\alpha$. (b) Sweep $\beta$. (c) Backbone.

Figure 6: Sensitivity to the mixing weight $\alpha$, the gate threshold $\beta$, and the LLM backbone on WebArena. Blue (left axis) is success rate. Orange (right axis) is the average LLM-action count $\bar{\mathcal{N}}$.

We sweep $\alpha \in \{0.0, 0.2, \dots, 1.0\}$ and $\beta \in \{0.0, 0.1, \dots, 0.5\}$ on WebArena with GPT-4.1, and re-run the headline configuration on Claude-4. Hyperparameters were chosen on a held-out 10% of training trajectories, never on the test set. Figure 6(a) shows that SR is essentially flat for $\alpha \in [0.0, 0.6]$ and collapses to 39.3% at $\alpha = 1.0$, the same drop seen in the "text only" ablation: the layout signal is doing the heavy lifting. Figure 6(b) shows the $\beta$ sweep behaving as a precision/recall trade-off: $\bar{\mathcal{N}}$ slides from 5.6 down to 4.7 as $\beta \to 0$ while SR moves only between 45.3% and 45.7%, so $\beta$ is a deployment knob rather than a critical hyperparameter. Figure 6(c) swaps the LLM backbone: Claude-4 gives 54.7% SR and $\bar{\mathcal{N}} = 5.1$ versus 45.7% SR and $\bar{\mathcal{N}} = 5.4$ on GPT-4.1, a +9-point SR shift. The gains come from the retrieval design and ride on top of whatever absolute performance the backbone supplies.
5 Conclusion
We presented SkillMigrator, a web agent that keeps the standard text snapshot + tool calling interface used by recent skill-induction systems [23,29,30] but indexes induced skills as transferable interaction patterns (TIP): each skill is paired with a structural sketch of the snapshot at induction time and is retrieved globally by layout similarity before being grounded to live element references. On Mind2Web and WebArena, this reduces the average LLM-action count on successful trajectories by 8–10% over the state-of-the-art methods, at matched task success rate.
References
- [1] A. Bapna, G. Tür, D. Hakkani-Tür, and L. Heck (2017) Towards zero-shot frame semantic parsing for domain scaling. In Interspeech 2017, pp. 2476–2480.
- [2] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su (2023) Mind2web: towards a generalist agent for the web. Advances in Neural Information Processing Systems 36, pp. 28091–28114.
- [3] A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. Del Verme, T. Marty, L. Boisvert, M. Thakkar, Q. Cappart, D. Vazquez, et al. (2024) Workarena: how capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718.
- [4] D. Gao, Z. Li, Y. Xie, W. Kuang, L. Yao, B. Qian, Z. Ma, Y. Cui, H. Luo, S. Li, L. Yi, Y. Yu, S. He, Z. Luo, W. Zhou, Z. Zhang, X. He, Z. Chen, W. Liao, F. I. Kushnazarov, Y. Li, B. Ding, and J. Zhou (2025) AgentScope 1.0: a developer-centric framework for building agentic applications. CoRR abs/2508.16279.
- [5] T. Gupta, P. Wolters, Z. Ma, P. Sushko, R. Y. Pang, D. Llanes, Y. Yang, T. Anderson, B. Zheng, Z. Ren, et al. (2026) MolmoWeb: open visual web agent and open data for the open web. arXiv preprint arXiv:2604.08516.
- [6] H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, and D. Yu (2024) Webvoyager: building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6864–6890.
- [7] S. He, Y. Cui, X. Ma, Y. Li, B. Ding, and M. Chowdhury (2025) Branch-and-browse: efficient and controllable web exploration with tree-structured reasoning and action memory. arXiv preprint arXiv:2510.19838.
- [8] L. K. Jang, J. Y. Koh, D. Fried, and R. Salakhutdinov (2026) Odysseys: benchmarking web agents on realistic long horizon tasks. arXiv preprint arXiv:2604.24964.
- [9] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1–2), pp. 99–134.
- [10] J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. Lim, P. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried (2024) Visualwebarena: evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [11] H. W. Kuhn (1955) The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1–2), pp. 83–97.
- [12] Z. Liu, G. I. Winata, P. Xu, and P. Fung (2020) Coach: a coarse-to-fine approach for cross-domain slot filling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 19–25.
- [13] Z. Lu, Y. Zuo, Y. Nie, X. He, W. Fan, L. Qi, and S. Jin (2026) ContractSkill: repairable contract-based skills for multimodal web agents. arXiv preprint arXiv:2603.20340.
- [14] J. Munkres (1957) Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics 5(1), pp. 32–38.
- [15] M. Pawlik and N. Augsten (2016) Tree edit distance: robust and memory-efficient. Information Systems 56, pp. 157–173.
- [16] V. Prabhu, Y. Dai, M. Fernandez, J. Gu, K. Ramakrishnan, Y. Luo, S. Savarese, C. Xiong, J. Li, Z. Chen, and R. Xu (2025) WALT: web agents that learn tools. arXiv preprint arXiv:2510.01524.
- [17] N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 3982–3992.
- [18] T. Shi, A. Karpathy, L. Fan, J. Hernandez, and P. Liang (2017) World of bits: an open-domain platform for web-based agents. In International Conference on Machine Learning, pp. 3135–3144.
- [19] R. S. Sutton, D. Precup, and S. Singh (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2), pp. 181–211.
- [20] K. Tai (1979) The tree-to-tree correction problem. Journal of the ACM 26(3), pp. 422–433.
- [21] L. Wang, X. Li, J. Liu, K. He, Y. Yan, and W. Xu (2021) Bridge to target domain by prototypical contrastive learning and label confusion: re-explore zero-shot learning for slot filling. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9474–9480.
- [22] Z. Wang, Q. Wu, X. Zhang, C. Zhang, W. Yao, F. E. Faisal, B. Peng, S. Qin, S. Nath, Q. Lin, et al. (2026) WebXSkill: skill learning for autonomous web agents. arXiv preprint arXiv:2604.13318.
- [23] Z. Z. Wang, A. Gandhi, G. Neubig, and D. Fried (2025) Inducing programmatic skills for agentic tasks. arXiv preprint arXiv:2504.06821.
- [24] Z. Z. Wang, J. Mao, D. Fried, and G. Neubig (2024) Agent workflow memory. arXiv preprint arXiv:2409.07429.
- [25] T. Xue, W. Qi, T. Shi, C. H. Song, B. Gou, D. Song, H. Sun, and Y. Su (2025) An illusion of progress? assessing the current state of web agents. arXiv preprint arXiv:2504.01382.
- [26] K. Yang, Y. Liu, S. Chaudhary, R. Fakoor, P. Chaudhari, G. Karypis, and H. Rangwala (2024) Agentoccam: a simple yet strong baseline for llm-based web agents. arXiv preprint arXiv:2410.13825.
- [27] S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022) Webshop: towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems 35, pp. 20744–20757.
- [28] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023) React: synergizing reasoning and acting in language models. In International Conference on Learning Representations.
- [29] S. Yu, G. Li, W. Shi, and P. Qi (2026) PolySkill: learning generalizable skills through polymorphic abstraction for continual learning. In International Conference on Learning Representations.
- [30] B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srinivasa, G. Liu, G. Neubig, and Y. Su (2025) SkillWeaver: web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079.
- [31] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig (2024) WebArena: a realistic web environment for building autonomous agents. In International Conference on Learning Representations.
Appendix
Appendix A Detailed Problem Formulation
This appendix expands the description of the web-agent setting that we summarised in §2.
Web agent environment.
We model web navigation as a partially observable sequential decision process, following the standard view of interactive agents in uncertain environments [9]. A task is specified by a natural-language instruction $q$. Since a real web application may contain hidden browser state, server-side state, and asynchronous page updates, the agent only observes an agent-visible representation of the environment. We use an observation-level abstraction and denote the observation space by $\mathcal{S}$. At time step $t$, the agent receives an observation $o_t \in \mathcal{S}$, which may include the rendered viewport, the DOM, the accessibility tree, and other browser metadata. This representation is consistent with prior web-agent benchmarks, where agents interact with web pages through visual and structural observations [18,2,31,10].
Given the current observation, the agent selects an executable action using an LLM $\pi_{\theta}$ via $a_t \sim \pi_{\theta}(\cdot \mid q, o_{0:t}, a_{0:t-1})$, where $o_{0:t}$ and $a_{0:t-1}$ are the observation and action histories, respectively. All actions are chosen from the valid action space $\mathcal{A}$ defined in Table 1, such as click($e$) and fill($e$, text). After execution, the web environment returns a new observation, $o_{t+1} = \mathcal{T}(o_t, a_t)$, where $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is the transition function induced by the browser and the web application. For simplicity, we write $\mathcal{T}$ as deterministic, although real web transitions may be stochastic due to network latency, dynamic content, or nondeterministic server responses.
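The interaction loop defined above, in which the policy sees the instruction plus both histories and the environment returns the next observation, can be sketched as follows. This is a toy illustration only: observations and actions are plain strings, the `"stop"` sentinel and the `max_steps` cap are assumptions added so the loop terminates, and `rollout`, `toy_policy`, and `toy_transition` are hypothetical names.

```python
from typing import Callable, List, Tuple

Observation = str
Action = str
# The policy sees the instruction q, the observation history o_{0:t},
# and the action history a_{0:t-1}, and returns the next action a_t.
Policy = Callable[[str, List[Observation], List[Action]], Action]
# Deterministic transition T: (o_t, a_t) -> o_{t+1}.
Transition = Callable[[Observation, Action], Observation]


def rollout(q: str, o0: Observation, policy: Policy, transition: Transition,
            max_steps: int) -> Tuple[List[Observation], List[Action]]:
    """Run the step-wise loop until the policy stops or the step cap is hit."""
    observations: List[Observation] = [o0]
    actions: List[Action] = []
    for _ in range(max_steps):
        action = policy(q, observations, actions)
        if action == "stop":  # hypothetical termination sentinel
            break
        actions.append(action)
        observations.append(transition(observations[-1], action))
    return observations, actions


def toy_policy(q: str, obs: List[Observation], acts: List[Action]) -> Action:
    script = ["click(e1)", "fill(e2, shoes)", "stop"]
    return script[len(acts)]


def toy_transition(o: Observation, a: Action) -> Observation:
    return f"{o}|{a}"
```

Each pass through the loop corresponds to one policy-facing LLM call, which is why long horizons translate directly into cost when every action is a low-level primitive.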
Subtasks and the LLM-action count.
Given an instruction $q$, the planner decomposes it into subtasks $\mathbf{s}(q) = \{s_1, \ldots, s_{T_q}\}$. Let $\tilde{o}_i$ denote the observation after finishing subtask $s_i$, so $\tilde{o}_{i-1}$ is the observation before executing $s_i$. For each pair $(s_i, \tilde{o}_{i-1})$, the agent retrieves a relevant skill from $\mathcal{K}$ when available. If the subtask is fully covered by a retrieved skill, then $\mathcal{N}(s_i, \tilde{o}_{i-1} \mid \mathcal{K}, \pi_{\theta}) = 0$; otherwise the agent falls back to $\pi_{\theta}$ and generates a trajectory $\tau = (o^{(0)}, a_1, o^{(1)}, \ldots, a_n, o^{(n)})$, where $o^{(0)} = \tilde{o}_{i-1}$, $o^{(j)} = \mathcal{T}(o^{(j-1)}, a_j)$, and $\mathcal{N}(s_i, \tilde{o}_{i-1} \mid \mathcal{K}, \pi_{\theta}) = n$. Equation (1) of the main paper directly minimises the expectation of $\sum_i \mathcal{N}(s_i, \tilde{o}_{i-1} \mid \mathcal{K}, \pi_{\theta})$ plus a library-cost regulariser $\lambda\, C(\mathcal{K})$.
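The per-subtask count can be illustrated with a small helper that sums $\mathcal{N}$ over a task's subtasks. It is a sketch of the definition only: the function name and its two callables (one reporting whether a retrieved skill fully covers a subtask, one returning the fallback trajectory length $n$) are hypothetical stand-ins for the retrieval step and the LLM policy.

```python
from typing import Callable, Sequence


def llm_action_count(subtasks: Sequence[str],
                     covered_by_skill: Callable[[str], bool],
                     fallback_length: Callable[[str], int]) -> int:
    """Sum N over subtasks: 0 if a skill fully covers it, else the fallback length n."""
    total = 0
    for subtask in subtasks:
        if covered_by_skill(subtask):
            continue  # a fully covered subtask costs no LLM actions
        total += fallback_length(subtask)
    return total
```

For example, with subtasks search, filter, and checkout, where only search is covered and the fallback trajectories have lengths 3 and 4, the task's count is 7; covering all three subtasks drives it to 0.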
Appendix B Experimental Setup Details
This appendix expands the condensed Setup of §4.1 with full benchmark protocol, baseline regimes, and reproducibility details.
Benchmarks.
Mind2Web [2] spans 137 websites across 31 domains and is partitioned into three generalisation splits: cross-task (same website, new task), cross-website (same domain, new website), and cross-domain (new domain). Task success is adjudicated by the GPT-4.1 judge protocol of prior work [29,25], which has $\sim$85% agreement with human judgments on this benchmark. WebArena [31] provides 812 executable tasks across shopping, shopping-admin, reddit, gitlab, map, and multi-site columns, scored with the official programmatic validators. We follow the same accessibility-snapshot and tool-calling protocol used by PolySkill, which keeps the action schema fixed across compared systems and isolates the contribution of skill retrieval and grounding.
Implementation.
Our system is implemented on top of the Playwright MCP framework (https://github.com/microsoft/playwright-mcp) and AgentScope [4] for unified agent orchestration and reasoning. In addition, we use sentence-transformers/all-MiniLM-L6-v2 [17] as the frozen sentence encoder $\mathbf{e}(\cdot)$, with no fine-tuning on SkillMigrator data. All embedding similarities reported in Eq. (3) are cosine similarities of L2-normalised vectors. Tree-edit distance for the layout signal is computed by APTED [15] on the cleaned accessibility-tree skeletons.
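For readers unfamiliar with the similarity convention, the snippet below shows why L2-normalising first reduces cosine similarity to a plain dot product. It uses only the standard library and toy vectors; it does not call the sentence encoder, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def l2_normalise(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise the zero vector")
    return [x / norm for x in v]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity computed as the dot product of L2-normalised vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    u_hat, v_hat = l2_normalise(u), l2_normalise(v)
    return sum(a * b for a, b in zip(u_hat, v_hat))
```

Because both inputs are rescaled to unit length, the result depends only on direction: parallel vectors such as (3, 4) and (6, 8) score 1, and orthogonal vectors score 0.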
Skill induction parity.
Skills are induced from the same training trajectories used by ASI [23] and PolySkill [29], with their respective induction pipelines, so that the only variable across systems is the retrieval and grounding stack of SkillMigrator. All reported numbers are means over three random seeds.
Appendix C Related Work
C.1 LLM-based Web Agents and Evaluation Environments
Web agents aim to translate natural-language instructions into executable browser actions over interactive webpages. This problem has been studied through increasingly realistic benchmarks and environments, including WebShop [27], Mind2Web [2], WebArena [31], and Branch-and-Browse [7]. In these settings, agents typically operate in a step-wise decision loop, observing the current webpage state and predicting one low-level action at a time, often following reasoning-and-acting paradigms such as ReAct [28]. This design provides broad flexibility, but it also makes deployment costly on long-horizon tasks, since each primitive action may require a separate policy-facing LLM call. Subsequent work further expanded the evaluation landscape to multimodal and realistic settings, such as WebVoyager [6], VisualWebArena [10], WorkArena [3], Odysseys [8], and MolmoWeb [5]. Recent analysis has also highlighted that benchmark improvements do not always imply robust general web competence, motivating more careful measurement of efficiency and generalization in addition to task success [25].
C.2 Reusable Web Skills
To reduce repeated low-level reasoning, a growing line of work studies reusable web skills, which compress frequently occurring primitive action sequences into higher-level abstractions. One line of work stores reusable procedural knowledge in textual form. Agent Workflow Memory (AWM) [24] induces workflow memories from successful episodes and reuses them to guide future problem solving, improving long-horizon behavior through experience reuse. Another line of work emphasizes executable skill representations. ASI [23] induces programmatic skills from successful trajectories and exposes them directly as callable actions. SkillWeaver [30] studies self-improvement through exploration, discovery, and iterative honing of reusable web skills. WALT [16] further argues that agents should exploit higher-level website tools rather than rely entirely on brittle primitive UI interactions. PolySkill [29] improves transfer by introducing polymorphic abstractions across websites, while more recent work explores richer skill formulations, such as combining executability with interpretability or improving repairability and verification [22,13]. Across these methods, the common goal is to shorten interaction horizons, reduce policy LLM calls, and improve robustness by reusing validated behavioral structure rather than regenerating every action from scratch.
Our work is most closely related to recent programmatic-skill web agents [23,30,16,29], but differs in what it treats as the main bottleneck. Rather than focusing primarily on how to induce or refine the internal implementation of a skill, we focus on how to retrieve and ground previously induced skills under broader transfer conditions. In particular, we target reuse beyond both the same website and the same domain. To do so, we index each skill not only by text but also by a structural sketch of the webpage snapshot at induction time, and retrieve from a global library using layout-aware matching before grounding abstract constraints to live element references. In this sense, our method complements prior work on workflow memory, executable skill induction, and polymorphic abstraction by emphasizing transferable interaction patterns as the retrieval key for cross-website and cross-domain reuse.
Appendix D Limitations
Layout-conditioned retrieval assumes that pages sharing accessibility-tree structure also share interaction semantics. This holds for some workflows we evaluate but weakens on visually isomorphic pages whose control flow diverges, where lightweight multimodal cues [6,10] would be a natural complement to our text-only setting. In addition, the operation templates are mined from observed action shapes, so modalities outside this distribution (e.g. drag-and-drop, modal dialogs) fall back to ReAct mode until the inventory is extended.
SkillMigrator reduces the LLM-call cost of web automation, which can improve efficiency and latency for repeated tasks. At the same time, lower-cost automation may increase the scale of automated web actions. Deployments should respect host-site usage policies, rate limits, authentication boundaries, and user consent requirements, and should not use skill reuse to bypass access controls or generate abusive traffic.