Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval
Summary
This paper proposes SGDR (State-Grounded Dynamic Retrieval), an online skill learning method for web agents that enables stepwise, state-aware skill reuse rather than static task-level retrieval. Experiments on WebArena show SGDR achieves 37.5% success rate with GPT-4.1, a ~10.6% relative gain over strong baselines.
# Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval
Source: [https://arxiv.org/html/2606.04391](https://arxiv.org/html/2606.04391)
Jiaxi Li1, Ke Deng1, Yun Wang1, Jingyuan Huang1, Yucheng Shi2, Qiaoyu Tan3, Jin Lu1†, Ninghao Liu4†
1 University of Georgia, 2 Tencent America, 3 New York University, 4 The Hong Kong Polytechnic University
###### Abstract
Language agents increasingly rely on reusable skills to improve multi\-step web automation across related tasks\. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly\. However, existing methods mainly reuse skills at the task level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution\. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover\. To address this gap, we propose State\-Grounded Dynamic Retrieval \(SGDR\), an online skill learning method that enables stepwise skill reuse for web agents\. SGDR consists of three components: a sliding\-window extraction process that turns completed trajectories into reusable sub\-procedures invokable at intermediate execution states, a dual text–code representation that connects skill retrieval with executable action, and a state\-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state\. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37\.5% with GPT\-4\.1 and 24\.3% with Qwen3\-4B, corresponding to relative gains of 10\.6% and 10\.0% over the strongest baseline, respectively\. The code is available at [https://github\.com/plusnli/skill\-dynamic\-retrieval](https://github.com/plusnli/skill-dynamic-retrieval)\.
† Co\-corresponding authors\.
## 1 Introduction
Figure 1: Comparison between traditional skill methods and our method \(SGDR\) within the setting of online skill learning\.
Language agents Yao et al\. \([2023](https://arxiv.org/html/2606.04391#bib.bib6)\); Sumers et al\. \([2024](https://arxiv.org/html/2606.04391#bib.bib2)\); Zhou et al\. \([2025b](https://arxiv.org/html/2606.04391#bib.bib31)\) are increasingly used to solve multi\-step web tasks such as information seeking, form filling, and forum interaction on realistic websites \(Chae et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib9); Gu et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib11); Ning et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib7)\)\. Although these tasks vary in goals, they often share recurring procedural patterns, such as navigating menus, filling forms, applying filters, and submitting changes\. This observation has motivated a growing line of work on skill learning for language agents, where reusable procedural knowledge is summarized and reused in related tasks \(Liu et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib25); Zheng et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib18)\)\. By accumulating such skills, agents can amortize repeated procedural discovery and improve across related tasks without relying solely on zero\-shot planning Tack et al\. \([2024](https://arxiv.org/html/2606.04391#bib.bib56)\)\.
Within this direction, online skill learning provides a particularly realistic setting for web agents\. Instead of assuming a fixed skill library constructed offline, online methods allow agents to continually induce skills from completed executions and update their skill library as tasks arrive sequentially \(Wang et al\., [2025b](https://arxiv.org/html/2606.04391#bib.bib16), [a](https://arxiv.org/html/2606.04391#bib.bib17); Liu et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib25)\)\. This paradigm more closely matches real\-world deployment, where agents must improve as they go\.
Despite this progress, existing online skill learning methods largely treat skill reuse as a task\-level one\-shot operation \(Wang et al\., [2025b](https://arxiv.org/html/2606.04391#bib.bib16), [a](https://arxiv.org/html/2606.04391#bib.bib17); Liu et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib25)\)\. Skills are retrieved or injected once based on the initial task instruction and then kept fixed throughout execution\. This design is natural if a web task is viewed as a static instruction, but it is insufficient for interactive web automation\. In web execution, the usefulness of a skill depends not only on the task goal but also on the current webpage state\. Consequently, a skill that is useful at the beginning of the task may become irrelevant later, while another skill that was not initially selected may become useful after the agent reaches a new page, form, or interaction context\. The core limitation is therefore that skill retrieval operates at the task level rather than at the level of intermediate execution states, where skills actually need to be invoked\. This raises a central question: how can an online agent retrieve the right reusable skill dynamically according to both the task goal and the current execution state?
However, dynamically retrieving skills at intermediate states is non\-trivial, because retrieval quality depends not only on the matching mechanism but also on the granularity of the skill library\. If the library contains only full\-trajectory skills, retrieved procedures may preserve the complete context of their original tasks but fail to apply to arbitrary intermediate webpage states\. If the library contains only single\-action skills, retrieved procedures may be broadly applicable but too primitive to provide meaningful procedural abstraction\. This creates a granularity challenge: state\-grounded reuse requires skills that are compact enough to match diverse webpage states, yet structured enough to execute useful browser operations\. Without skills at this granularity, dynamic retrieval would either return overly broad workflows that mismatch the current state or low\-level actions that offer little benefit over primitive browser actions\.
To address these limitations, we propose State\-Grounded Dynamic Retrieval \(SGDR\), an online skill learning method for web agents, as illustrated in Figure [1](https://arxiv.org/html/2606.04391#S1.F1)\. SGDR replaces task\-level one\-shot skill reuse with step\-level, state\-conditioned skill retrieval\. After completing a task, SGDR extracts reusable sub\-procedures from the trajectory through sliding\-window extraction, producing skills at an intermediate granularity\. Each skill is represented as a text–code pair: a natural\-language description supports retrieval, while executable code provides support for action\. When solving a new task, SGDR retrieves step\-specific skills conditioned on both the task instruction and the current webpage state, enabling skill support to adapt as execution unfolds\. Together, these designs turn online skill learning from static task\-level reuse into adaptive state\-grounded reuse\. Our major contributions are summarized as follows\.
- We study online skill learning for language agents under a sequential task\-stream setting, where agents can only reuse skills induced from past task trajectories and update the skill library on the fly\.
- We identify the limitations of task\-level one\-shot skill reuse and propose state\-grounded dynamic retrieval, which retrieves skills at each decision step according to both the task instruction and the evolving webpage state\.
- We enable intermediate\-state skills through sliding\-window extraction and dual text–code representation, producing reusable sub\-procedures that are both retrievable in natural language and executable as browser actions\.
- We evaluate SGDR on WebArena across five website domains with two backbone models, showing consistent overall improvements over strong online skill learning baselines in both success rates and step efficiency\.
## 2 Related Work
### 2\.1 Web Agents and Benchmarks
Early web agent research \(Liu et al\., [2018](https://arxiv.org/html/2606.04391#bib.bib36); Nakano et al\., [2021](https://arxiv.org/html/2606.04391#bib.bib34); Yao et al\., [2022](https://arxiv.org/html/2606.04391#bib.bib35)\) studied how language models interact with browsers to retrieve information and complete tasks in simulated environments\. Recent work has scaled web agents toward more realistic settings along several axes: generalist navigation on real\-world websites Deng et al\. \([2023](https://arxiv.org/html/2606.04391#bib.bib37)\); He et al\. \([2024](https://arxiv.org/html/2606.04391#bib.bib38)\); Zheng et al\. \([2024a](https://arxiv.org/html/2606.04391#bib.bib39)\); Lai et al\. \([2024](https://arxiv.org/html/2606.04391#bib.bib41)\); Hu et al\. \([2025](https://arxiv.org/html/2606.04391#bib.bib40)\); Yu et al\. \([2026](https://arxiv.org/html/2606.04391#bib.bib5)\), robustness through memory, workflow induction, and reusable skills \(Zheng et al\., [2024b](https://arxiv.org/html/2606.04391#bib.bib42); Wang et al\., [2024](https://arxiv.org/html/2606.04391#bib.bib59), [2025b](https://arxiv.org/html/2606.04391#bib.bib16), [2025a](https://arxiv.org/html/2606.04391#bib.bib17); Zheng et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib18); Zhu et al\., [2026](https://arxiv.org/html/2606.04391#bib.bib30); Sun et al\., [2026](https://arxiv.org/html/2606.04391#bib.bib33)\), and benchmarks that evaluate agents under increasingly realistic conditions including visually grounded and conversational navigation \(Zhou et al\., [2024](https://arxiv.org/html/2606.04391#bib.bib13); Koh et al\., [2024](https://arxiv.org/html/2606.04391#bib.bib43); Lu et al\., [2024](https://arxiv.org/html/2606.04391#bib.bib44); Drouin et al\., [2024](https://arxiv.org/html/2606.04391#bib.bib45); Yang et al\., [2025b](https://arxiv.org/html/2606.04391#bib.bib8); Xue et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib46); Liu et al\., [2026](https://arxiv.org/html/2606.04391#bib.bib29); Tian et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib47); Yang et al\., [2026](https://arxiv.org/html/2606.04391#bib.bib53); Sun et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib12); Gou et al\., [2026](https://arxiv.org/html/2606.04391#bib.bib48)\)\. Together, these efforts move web agent research from controlled browser interaction toward dynamic, long\-horizon web automation\.
### 2\.2 Skill Discovery and Learning
Recent work explores how language agents can self\-improve by discovering and accumulating reusable skills from past executions \(Qian et al\., [2024](https://arxiv.org/html/2606.04391#bib.bib55); Yu et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib26); Ouyang et al\., [2026a](https://arxiv.org/html/2606.04391#bib.bib4), [b](https://arxiv.org/html/2606.04391#bib.bib3); Wang et al\., [2026b](https://arxiv.org/html/2606.04391#bib.bib54); Tan et al\., [2026b](https://arxiv.org/html/2606.04391#bib.bib50); Yang et al\., [2025c](https://arxiv.org/html/2606.04391#bib.bib20); Lu et al\., [2026](https://arxiv.org/html/2606.04391#bib.bib23); Fang et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib49)\)\. Early approaches store procedural knowledge in natural language and adapt it non\-parametrically, such as verbal reflections \(Shinn et al\., [2023](https://arxiv.org/html/2606.04391#bib.bib14)\) or distilled experiential insights \(Zhao et al\., [2024](https://arxiv.org/html/2606.04391#bib.bib15)\)\. More recent work formulates reusable skills as workflows \(Wang et al\., [2025b](https://arxiv.org/html/2606.04391#bib.bib16)\), executable programs \(Wang et al\., [2025a](https://arxiv.org/html/2606.04391#bib.bib17)\), or retrievable past experiences \(Liu et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib25)\), with further studies exploring diverse forms of skill organization \(Zhou et al\., [2025a](https://arxiv.org/html/2606.04391#bib.bib24); Zheng et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib18); Li et al\., [2025](https://arxiv.org/html/2606.04391#bib.bib28); Tan et al\., [2026a](https://arxiv.org/html/2606.04391#bib.bib60)\) and reuse \(Wang et al\., [2026c](https://arxiv.org/html/2606.04391#bib.bib19); Jiang et al\., [2026](https://arxiv.org/html/2606.04391#bib.bib21); Wang et al\., [2026a](https://arxiv.org/html/2606.04391#bib.bib22)\)\.
Our work is complementary: rather than treating learned skills as pre\-fixed memories or tools, we focus on when and how accumulated skills are retrieved and invoked, so that agents can better exploit them at the right intermediate states\.
## 3 Preliminaries
### 3\.1 Task and Skill Formalization
We consider a sequence of web agent tasks $\mathcal{G}=\{g_i\}_{i=1}^{N}$, where each $g_i$ denotes the natural language instruction specifying the task goal, with a total of $N$ tasks\. When solving the $i$\-th task $g_i$, the agent interacts with a web environment over multiple steps, receiving the current webpage observation and executing an action, producing a trajectory $\mathcal{T}_i$, which is an observation\-action interleaving sequence of length $H_i$\.
The agent maintains a skill library throughout the task sequence\. We denote the skill library by $\mathcal{S}_i$ after processing the first $i$ tasks, with $\mathcal{S}_0$ being the initial empty library\. Each skill $s\in\mathcal{S}_i$ represents reusable procedural memory induced from previous task executions\. After executing task $g_i$, the agent may induce a set of new skills $\Delta\mathcal{S}_i$ from its trajectory and update the library as
$$\mathcal{S}_i=\mathcal{S}_{i-1}\cup\Delta\mathcal{S}_i.$$
For evaluation, we use $y_i\in\{0,1\}$ to denote the ground\-truth task success signal used for external benchmarking, where $y_i=1$ indicates that the task is correctly solved and $y_i=0$ indicates that the task is not correctly solved\.
Figure 2:The online skill learning setting\. The agent solves tasks sequentially, updates the skill library from evaluator\-assessed trajectories, and reuses accumulated skills for future tasks\.
### 3\.2 Online Skill Learning
Online learning is a sequential learning paradigm in which a learner makes decisions over a stream of examples and uses information revealed from previous rounds to improve future decisions \(Cesa\-Bianchi and Lugosi, [2006](https://arxiv.org/html/2606.04391#bib.bib52); Shalev\-Shwartz, [2025](https://arxiv.org/html/2606.04391#bib.bib51)\)\. In this work, we formulate online skill learning for language agents as a task\-stream setting in which an agent solves tasks sequentially, updates its skill library from completed trajectories, and reuses only skills induced from past tasks when solving future tasks\. This contrasts with offline skill learning, where a fixed skill library is pre\-constructed from a separate set of tasks before being used to assist the agent on held\-out evaluation tasks\.
Figure 3: Overview of our method SGDR\. Completed trajectories are segmented with sliding windows to induce reusable text\-code skills\. During future task execution, SGDR retrieves state\-grounded skills, reranks them with Maximal Marginal Relevance \(MMR\), and injects the selected skills for the next action step\.
Figure [2](https://arxiv.org/html/2606.04391#S3.F2) depicts the overall setup\. In line with prior work Wang et al\. \([2025a](https://arxiv.org/html/2606.04391#bib.bib17), [b](https://arxiv.org/html/2606.04391#bib.bib16)\); Liu et al\. \([2025](https://arxiv.org/html/2606.04391#bib.bib25)\), tasks arrive sequentially: when solving task $g_i$, the agent can only access the skill library accumulated from previous tasks, namely $\mathcal{S}_{i-1}$\. The ground\-truth environment signal $y_i$ for the current task is unavailable during both execution and library update, and therefore cannot be used for skill induction or action selection\. The agent must complete the task using only the current instruction, the evolving webpage observations, and skills induced from past tasks\. To support skill induction without access to $y_i$, an evaluator model $E$ is used to assess the completed trajectory after execution:
$$\hat{y}_i=E(g_i,\mathcal{T}_i),$$
where $\hat{y}_i\in\{0,1\}$ denotes the evaluator’s binary correctness judgment for task $g_i$, with $\hat{y}_i=1$ indicating that $E$ judges the completed trajectory to have correctly solved the task, and $\hat{y}_i=0$ indicating that $E$ judges it to have failed\.
After executing $g_i$, the agent updates the skill library without observing the ground\-truth signal $y_i$\. The update can only rely on the task instruction $g_i$, the collected trajectory $\mathcal{T}_i$, and the evaluator\-produced proxy judgment $\hat{y}_i$\. We formalize skill induction as an update function $U$:
$$\Delta\mathcal{S}_i=U(g_i,\mathcal{T}_i,\hat{y}_i),\quad\mathcal{S}_i=\mathcal{S}_{i-1}\cup\Delta\mathcal{S}_i.$$
The newly induced skills become available only for subsequent tasks $g_{i+1},\ldots,g_N$\.
The goal of online skill learning is to design an online agent that maximizes cumulative task success, i\.e\., the sum of the ground\-truth success signals $y_i$, over the task stream:
$$\max_{\pi}\sum_{i=1}^{N}y_i,$$
where $\pi$ denotes the overall online skill learning agent, including its action policy, skill induction, and skill reuse rules\.
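
The protocol above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `solve`, `evaluate`, and `induce` are hypothetical callables standing in for the agent policy, the evaluator $E$, and the update function $U$, and a trajectory is represented as a plain list of strings.

```python
from typing import Callable, List, Sequence, Tuple

Skill = Tuple[str, str]   # (description, code); representation assumed for illustration
Trajectory = List[str]    # placeholder for an observation-action sequence


def run_task_stream(
    tasks: Sequence[str],
    solve: Callable[[str, Tuple[Skill, ...]], Trajectory],   # hypothetical agent policy
    evaluate: Callable[[str, Trajectory], int],              # hypothetical evaluator E
    induce: Callable[[str, Trajectory, int], List[Skill]],   # hypothetical update U
) -> Tuple[List[Skill], List[int]]:
    """Process tasks sequentially; task i only sees skills from tasks 1..i-1."""
    library: List[Skill] = []      # S_0 is empty
    judgments: List[int] = []      # evaluator judgments y_hat_i, not ground truth y_i
    for goal in tasks:
        available = tuple(library)                 # frozen snapshot of S_{i-1}
        trajectory = solve(goal, available)
        y_hat = evaluate(goal, trajectory)         # proxy signal used for induction
        judgments.append(y_hat)
        library.extend(induce(goal, trajectory, y_hat))  # S_i = S_{i-1} ∪ ΔS_i
    return library, judgments
```

Note that the loop never touches the ground-truth signal $y_i$; it would only be used by an external benchmark to score the stream afterwards.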
## 4 Proposed Method
Building on the online setting in [Section 3\.2](https://arxiv.org/html/2606.04391#S3.SS2), SGDR is motivated by two challenges in deploying a reusable skill library for web automation: how to extract skills at a suitable granularity, and how to retrieve them adaptively as the webpage state evolves\. To address these, SGDR combines sliding\-window skill extraction with a text\-code skill representation, and state\-grounded dynamic retrieval with reranking\. [Figure 3](https://arxiv.org/html/2606.04391#S3.F3) illustrates the overall pipeline\.
Unless otherwise specified, we describe SGDR for the current task $g_i$ in the online task stream, and omit the task index $i$ for readability\. Thus, we write the current task as $g$, its trajectory as $\mathcal{T}$, and the currently available skill library as $\mathcal{S}=\mathcal{S}_{i-1}$\.
### 4\.1 Skill Extraction and Representation
We first describe the unit of reuse maintained by SGDR\. Before solving the current task $g$, the agent has access only to the skill library accumulated from previous tasks, denoted as $\mathcal{S}=\{s_k\}_{k=1}^{n}$\. Each skill $s_k$ stores a reusable web procedure and is represented as a text–code pair $s_k=(d_k,c_k)$, where $d_k$ is a natural\-language description used for retrieval and $c_k$ is an executable code function used for action execution\. This text–code representation ties retrieval and execution together: the description abstracts the skill’s intent and applicable state, while the code implements the corresponding web operations once the skill is selected\. For example, a description such as “navigate to the account address settings page” can be paired with code that opens the account menu, clicks the address settings entry, and waits for the target form to load\.
After task $g$ is finished, the evaluator produces a binary judgment $\hat{y}$ for its completed trajectory\. We perform skill extraction only when $\hat{y}=1$, i\.e\., when the evaluator $E$ judges the trajectory to have successfully solved the task\. For such successful trajectories, we revisit the full trajectory $\mathcal{T}$:
$$\mathcal{T}=(o_1,a_1,o_2,a_2,\ldots,a_H,o_{H+1}),$$
where $H$ is the interaction horizon\. At any step $t\in\{1,\ldots,H\}$, $o_t$ represents the current webpage observation that the agent receives, and $a_t$ denotes the executed action, forming an observation\-action interleaving trajectory\. In web environments, $o_t$ can be represented by the textual form of the webpage accessibility tree, which contains structured information about visible elements, their attributes, and possible interaction targets\. The set of primitive actions is provided in Appendix [A\.1](https://arxiv.org/html/2606.04391#A1.SS1)\.
Rather than storing the entire trajectory as a single task\-level skill, we decompose it into local segments that can be reused from intermediate states in future tasks\. Specifically, we apply a set of sliding windows over the trajectory to obtain candidate segments\. For each window length $l\in\mathcal{L}$, we enumerate candidate segments
$$w_{t,l}=(o_t,a_t,\ldots,a_{t+l-1},o_{t+l}),$$
where $t\in\{1,\ldots,H-l+1\}$ denotes the window’s starting timestep\.
Sliding windows let us extract reusable skills at an intermediate granularity\. Full trajectories often encode an entire task and are too specific to be reused at a later intermediate state, while individual actions are too fine\-grained to capture meaningful procedures\. Windowed segments instead correspond to local but reusable subroutines, such as opening a settings page, filling a short form, or applying a filter\.
Each candidate segment $w_{t,l}$ is passed to an LLM, which judges whether it captures a reusable state\-contingent procedure and, if so, converts it into a skill $s_k=(d_k,c_k)$\. Following ASI Wang et al\. \([2025a](https://arxiv.org/html/2606.04391#bib.bib17)\), we verify each induced skill by replacing its corresponding primitive action segment in the original trajectory with a skill call and executing the rewritten trajectory in the environment\. Only skills whose substituted trajectories are still judged successful by the evaluator are added to the library\. Together, this sliding\-window extraction and verification process yields skills that are compact enough to be invoked from intermediate execution states, while remaining executable and semantically meaningful\. Once added to the library, these verified text–code skills become candidates for step\-level retrieval in subsequent tasks\.
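
The window enumeration step can be sketched as follows. This is an illustrative sketch only: the function name `enumerate_windows` and the list-based trajectory layout are assumptions, and the LLM judgment and environment-based verification described above are deliberately left out. The default lengths follow the set $\mathcal{L}=\{2,3,4,5\}$ reported in the implementation details.

```python
from typing import List, Sequence, Tuple


def enumerate_windows(
    observations: Sequence[str],   # o_1 .. o_{H+1}
    actions: Sequence[str],        # a_1 .. a_H
    lengths: Sequence[int] = (2, 3, 4, 5),
) -> List[Tuple[int, int, Tuple[str, ...]]]:
    """Return (t, l, segment) for every window w_{t,l} = (o_t, a_t, ..., a_{t+l-1}, o_{t+l})."""
    horizon = len(actions)
    if len(observations) != horizon + 1:
        raise ValueError("expected one more observation than actions")
    segments = []
    for l in lengths:
        # 1-indexed start t ranges over 1 .. H - l + 1; empty when l > H.
        for t in range(1, horizon - l + 2):
            items: List[str] = []
            for j in range(t, t + l):
                items.append(observations[j - 1])   # o_j
                items.append(actions[j - 1])        # a_j
            items.append(observations[t + l - 1])   # closing observation o_{t+l}
            segments.append((t, l, tuple(items)))
    return segments
```

For a trajectory with $H=5$ actions, this yields $4+3+2+1=10$ candidate segments, each of which would then go through the LLM filter and verification.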
### 4\.2 State\-Grounded Dynamic Retrieval
Given the verified skill library, SGDR retrieves skills dynamically as the agent moves through a task, rather than selecting a fixed set of skills only once at the beginning\. At execution step $t$ of task $g$, the agent observes the current web state $o_t$\. As raw web states such as accessibility trees can be verbose, we first obtain a compact state summary $r_t=\mathrm{Summarize}(o_t)$ using an LLM\. The resulting summary serves as the state\-side retrieval query, while the original task instruction $g$ provides the goal\-side query\.
##### Relevance Retrieval\.
To retrieve appropriate skills at step $t$, we perform relevance retrieval over the skill library $\mathcal{S}$\. For each skill $s_k=(d_k,c_k)$, we compute a combined task\-state relevance score:
$$\operatorname{score}_t(s_k)=\alpha\,\cos\big(\phi(g),\phi(d_k)\big)+(1-\alpha)\,\cos\big(\phi(r_t),\phi(d_k)\big).\tag{1}$$
Here $\phi(\cdot)$ maps text into the embedding space, and $\cos(\mathbf{u},\mathbf{v})=\mathbf{u}^{\top}\mathbf{v}/(\|\mathbf{u}\|\|\mathbf{v}\|)$ denotes cosine similarity between two embeddings $\mathbf{u}$ and $\mathbf{v}$\. The coefficient $\alpha$ is a hyperparameter that balances the overall task instruction and the current state\. The first term measures alignment with the task goal, while the second term measures applicability to the current page state\. We first keep the top\-$M$ skills according to their relevance score $\operatorname{score}_t(s_k)$, where $M$ is the coarse candidate budget, and then pass them to the reranking stage described below\. This stage filters the library to skills that are broadly relevant to the current task and state\.
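
As a rough illustration of Equation (1), the sketch below scores precomputed embedding vectors and keeps the top-$M$ candidates. It assumes the embeddings $\phi(g)$, $\phi(r_t)$, and $\phi(d_k)$ are already available as plain lists of floats; the embedding model itself, the helper names, and the value of `m` are not specified here and are placeholders. The default `alpha=0.5` follows the value reported in the implementation details.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity u.v / (|u||v|); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevance_scores(
    goal_vec: Sequence[float],                 # phi(g)
    state_vec: Sequence[float],                # phi(r_t)
    desc_vecs: Sequence[Sequence[float]],      # phi(d_k) for each skill
    alpha: float = 0.5,
) -> List[float]:
    """Combined task-state relevance from Equation (1), one score per skill."""
    return [
        alpha * cosine(goal_vec, d) + (1.0 - alpha) * cosine(state_vec, d)
        for d in desc_vecs
    ]


def top_m(scores: Sequence[float], m: int) -> List[int]:
    """Indices of the m highest-scoring skills, best first."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return order[:m]
```

With `alpha=1.0` the score reduces to goal-only matching, which corresponds to ignoring the current page state; smaller values shift weight toward the state summary.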
Table 1: Main success rates \(%\) on WebArena\. We use SGDR \(State\-Grounded Dynamic Retrieval\) to denote our method\. SR denotes the average success rate overall, and we also list average success rates for five separate domains\. \# Steps denotes the average number of steps to complete each task\. The best result is shown in bold, and the second\-best result is underlined\.
##### MMR Reranking\.
The relevance retrieval stage produces a top\-$M$ candidate set whose members are individually relevant to the current task and state\. However, because skills are extracted from overlapping sliding windows, many high\-scoring candidates may correspond to near\-duplicate local procedures with slightly different boundaries or contexts\. Directly passing the top\-ranked skills to the agent can therefore allocate multiple skill slots to the same procedural pattern, leaving fewer distinct options for the next decision\. To avoid this redundancy while preserving relevance, we apply Maximal Marginal Relevance \(MMR\) Carbonell and Goldstein \([1998](https://arxiv.org/html/2606.04391#bib.bib10)\) within the relevance\-filtered candidate set\. This reranking step is not a replacement for relevance retrieval: the relevance score keeps each selected skill grounded in the current task and state, while the diversity penalty discourages selecting skills that overlap with those already chosen\. Starting from an empty selected set $\mathcal{A}_t$, we greedily add skills until $|\mathcal{A}_t|=5$, where each next skill is selected according to
$$\operatorname{MMR}_t(s_k)=\lambda\,\operatorname{score}_t(s_k)-(1-\lambda)\,\max_{s_{k'}\in\mathcal{A}_t}\operatorname{sim}(d_k,d_{k'}).\tag{2}$$
Here $\operatorname{sim}(d_k,d_{k'})=\cos(\phi(d_k),\phi(d_{k'}))$ denotes the cosine similarity between the two skill descriptions in embedding space and serves as a proxy for procedural overlap\. The second term is taken as $0$ when $\mathcal{A}_t$ is empty\. $\lambda$ is a hyperparameter that balances relevance and coverage among selected skills\. The resulting set $\mathcal{A}_t$ is the step\-specific skill set activated for the agent’s next decision\.
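
The greedy selection in Equation (2) can be sketched as below. This is a minimal sketch, assuming the relevance scores and the pairwise description similarities have already been computed for the top-$M$ candidates; the function name `mmr_select` and the matrix input format are illustrative choices, and ties are broken by lowest index, which is not specified in the text. The defaults `k=5` and `lam=0.7` follow the values stated in the paper.

```python
from typing import List, Sequence


def mmr_select(
    scores: Sequence[float],            # score_t(s_k) for each candidate
    sim: Sequence[Sequence[float]],     # sim[i][j] = cos(phi(d_i), phi(d_j))
    k: int = 5,
    lam: float = 0.7,
) -> List[int]:
    """Greedily pick up to k candidate indices by Maximal Marginal Relevance."""
    remaining = list(range(len(scores)))
    selected: List[int] = []
    while remaining and len(selected) < k:
        best_idx = None
        best_val = float("-inf")
        for i in remaining:
            # Diversity penalty is 0 while nothing has been selected yet.
            penalty = max((sim[i][j] for j in selected), default=0.0)
            value = lam * scores[i] - (1.0 - lam) * penalty
            if value > best_val:        # strict '>' keeps the lowest index on ties
                best_idx, best_val = i, value
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected
```

In a small example with two near-duplicate high-scoring candidates and one distinct but lower-scoring one, the penalty term causes the distinct candidate to be chosen second, whereas `lam=1.0` falls back to pure relevance ranking.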
### 4\.3 Skill Injection and Execution
After retrieval and reranking, the selected set $\mathcal{A}_t$ is exposed to the agent only for the current decision step $t$\. For each retrieved skill, we provide its description $d_k$ and callable code $c_k$ as additional action support\. This step\-level injection lets the available skill support adapt to the evolving webpage without exposing the full skill library at every decision step\. After the task is completed, the collected trajectory is evaluated and processed by the extraction procedure described in [Section 4\.1](https://arxiv.org/html/2606.04391#S4.SS1)\. The resulting verified skills are added to the corresponding domain\-specific library and become available for subsequent tasks, starting from $g_{i+1}$\.
## 5 Experiments
### 5\.1 Experiment Setup
##### Benchmark\.
We evaluate on WebArena \(Zhou et al\., [2024](https://arxiv.org/html/2606.04391#bib.bib13)\), a representative and realistic web agent benchmark whose structure is well suited to our online skill learning setting\. WebArena spans five website domains, Shopping, Admin, Reddit, Gitlab, and Map, where tasks within each domain typically share similar website interfaces and interaction conventions\. This domain structure naturally supports our domain\-wise continual skill acquisition: for a given website domain, after completing a task, the agent extracts skills from the resulting trajectory and reuses them for subsequent tasks in the same domain\. Since a small number of WebArena tasks require interactions across multiple websites, we exclude such tasks and focus on single\-domain tasks\. Accordingly, we maintain a separate skill library for each domain to avoid cross\-domain interference\. We list the detailed task indices within each website domain in Appendix [A\.2](https://arxiv.org/html/2606.04391#A1.SS2)\. Evaluation in the WebArena environment is based on a binary success reward: the reward is $1$ if the task is correctly solved, and $0$ otherwise\.
Figure 4: Cumulative success rates over the online task stream with backbone model GPT\-4\.1 on four WebArena domains\. The x\-axis denotes the within\-domain task index, obtained by sorting tasks by their original WebArena task IDs and reindexing\. SGDR generally maintains stronger cumulative performance as more tasks are processed, showing the benefit of dynamically retrieving state\-grounded skills during execution\.
##### Baseline Methods\.
We compare SGDR with four baselines\. Vanilla is a skill\-free baseline that solves each task independently without maintaining or reusing skills across the task stream\. We further compare with three baseline methods within the paradigm of online skill learning: Agent Workflow Memory \(AWM\) Wang et al\. \([2025b](https://arxiv.org/html/2606.04391#bib.bib16)\), Agent Skill Induction \(ASI\) Wang et al\. \([2025a](https://arxiv.org/html/2606.04391#bib.bib17)\), and Contextual Experience Replay \(CER\) Liu et al\. \([2025](https://arxiv.org/html/2606.04391#bib.bib25)\)\. These methods can accumulate reusable memory from past trajectories and apply it to future tasks\. In our comparison, they primarily instantiate task\-level static reuse: relevant workflows, programmatic skills, or past experiences are selected based on the task context and then used as fixed support during execution\. Specifically, AWM stores natural\-language workflows, ASI induces executable programmatic skills, and CER retrieves relevant past experiences for decision support\. For AWM and CER, we adopt their online variants, ensuring that all skill\-based methods accumulate experience without access to ground\-truth signals over the same task stream\.
##### Implementation details\.
We report results using GPT\-4\.1 Achiam et al\. \([2023](https://arxiv.org/html/2606.04391#bib.bib57)\) and Qwen3\-4B Yang et al\. \([2025a](https://arxiv.org/html/2606.04391#bib.bib58)\) as the backbone models\. For both SGDR and the baselines, the chosen backbone \(GPT\-4\.1 or Qwen3\-4B\) is used for all LLM\-based components within that method, including skill induction, trajectory summarization, action planning, and evaluation\. For CER, we implement the experience buffer, experience synthesis, and retrieval modules following the original paper Liu et al\. \([2025](https://arxiv.org/html/2606.04391#bib.bib25)\)\. For skill extraction, we segment each completed trajectory with sliding windows of lengths $\mathcal{L}=\{2,3,4,5\}$\. During task execution, skill retrieval is performed using the state\-grounded retrieval score defined in [Equation 1](https://arxiv.org/html/2606.04391#S4.E1), with $\alpha$ set to $0.5$, followed by reranking with the MMR objective in [Equation 2](https://arxiv.org/html/2606.04391#S4.E2), where $\lambda=0.7$\. Detailed prompts and parameter configuration are given in Appendix [A\.3](https://arxiv.org/html/2606.04391#A1.SS3) and [A\.4](https://arxiv.org/html/2606.04391#A1.SS4), respectively\.
### 5\.2 Main Results
[Table 1](https://arxiv.org/html/2606.04391#S4.T1) reports the success rates on WebArena, with step\-count efficiency discussed in [Section 5\.3](https://arxiv.org/html/2606.04391#S5.SS3)\. Overall, SGDR achieves the best average success rate under both backbones, reaching 37\.5% with GPT\-4\.1 and 24\.3% with Qwen3\-4B\. Compared with the strongest baseline CER, SGDR improves the overall success rate \(SR\) by 3\.6 points with GPT\-4\.1 and 2\.2 points with Qwen3\-4B, showing that state\-grounded dynamic retrieval provides benefits beyond static task\-level skill reuse\.
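The point gains above and the relative gains quoted in the abstract are two views of the same numbers. The short sketch below makes the conversion explicit; the helper name `relative_gain` is ours, and the baseline success rate is inferred from the reported gap rather than read from Table 1.

```python
def relative_gain(sgdr_sr: float, point_gain: float) -> float:
    """Relative gain over a baseline, given the SGDR success rate (%)
    and its absolute improvement in percentage points."""
    baseline_sr = sgdr_sr - point_gain  # baseline rate implied by the gap
    return point_gain / baseline_sr


for backbone, sr, gain in [("GPT-4.1", 37.5, 3.6), ("Qwen3-4B", 24.3, 2.2)]:
    print(f"{backbone}: {100 * relative_gain(sr, gain):.1f}% relative gain")
```

Under this reading, the 3.6-point and 2.2-point gaps correspond to the 10.6% and 10.0% relative gains stated in the abstract.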
The gains are broadly distributed across domains\. With GPT\-4\.1, SGDR achieves the best performance on four of the five domains, including a notable improvement on Admin from 41\.4% to 47\.7%\. A similar trend holds for Qwen3\-4B, with GitLab remaining the main exception\. We hypothesize that GitLab tasks often involve version\-control operations with persistent repository preconditions, such as forking and merge\-request operations\. Since SGDR learns local rather than whole\-task skills, it may be less effective for such tasks than methods that preserve complete task\-level procedures\.
### 5\.3 Execution Efficiency Analysis
We further examine execution efficiency through average step count\. Across both backbone models, SGDR completes tasks with fewer steps than the baselines\. With GPT\-4\.1, it uses 4\.8 steps on average, compared with 6\.0 for Vanilla, 5\.2 for ASI, and 6\.4 for CER\. With Qwen3\-4B, it reduces the average step count by 11\.1% relative to Vanilla and 13\.8% relative to CER\. This efficiency gain arises because one skill can execute a short procedure composed of multiple primitive browser actions, such as a sequence of clicks and fills, thereby replacing repeated low\-level interactions with a higher\-level reusable action\.
### 5\.4 Online Performance Analysis
A central motivation of SGDR is to improve skill reuse throughout the online task stream\. [Figure 4](https://arxiv.org/html/2606.04391#S5.F4) shows the cumulative success rate with GPT\-4\.1 on four WebArena domains, where tasks are ordered by their original WebArena IDs and reindexed within each domain\. Overall, SGDR generally stays above the baselines, with especially clear advantages on Admin and Reddit\. Although the curves are not monotonic because later tasks may be harder or less aligned with previously accumulated skills, SGDR often remains on the upper envelope, suggesting that state\-grounded dynamic retrieval helps the agent better exploit the growing skill library during execution\. The smaller margin on GitLab is consistent with its reliance on persistent repository\-specific preconditions, which can limit the transferability of local procedural skills\.
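As a reading aid for the cumulative curves, the sketch below shows one common way to compute a cumulative success rate over an ordered task stream. The function name and the outcome list are illustrative assumptions, not the plotting code or data behind Figure 4.

```python
from itertools import accumulate


def cumulative_success_rate(outcomes):
    """Running success rate after each task, in stream order.

    `outcomes` holds 1 for a successful task and 0 for a failed one.
    """
    running_successes = accumulate(outcomes)
    return [s / n for n, s in enumerate(running_successes, start=1)]


# Made-up outcomes for five consecutive tasks (not data from the paper).
curve = cumulative_success_rate([1, 0, 1, 1, 0])
print([round(v, 2) for v in curve])
```

Each failure pulls the running average down, which is why such curves can dip even while the skill library keeps growing.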
### 5\.5 Ablation Study
Table 2: Ablation study on retrieval signals with model GPT\-4\.1 on Shopping, Reddit, and Map websites\.
We conduct ablation studies with GPT\-4\.1 on three representative WebArena website domains: Shopping, Reddit, and Map\. These studies examine three components of SGDR: relevance retrieval, MMR reranking, and skill extraction\.
##### Ablation Study on Retrieval\.
We first ablate the retrieval signal to study whether the task goal, the current webpage state, or their combination is most useful for selecting skills\. As shown in [Table 2](https://arxiv.org/html/2606.04391#S5.T2), task\-only retrieval consistently outperforms state\-only retrieval, suggesting that the initial task instruction remains an important anchor for skill selection\. However, combining task and state information yields the best results across all three domains, with $\alpha=0.5$ achieving 34\.6%, 35\.9%, and 32\.3% on Shopping, Reddit, and Map, respectively\. The lower performance at $\alpha=0.3$ and $\alpha=0.7$ further indicates that overemphasizing either the current state or the task goal is suboptimal\.
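To make the role of $\alpha$ concrete, here is a minimal sketch of a task-state mixed retrieval score. Equation 1 is not reproduced in this section, so the convex combination below, the use of cosine similarity, the convention that $\alpha$ weights the task goal, and all names and toy vectors are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def state_grounded_score(task_vec, state_vec, skill_vec, alpha=0.5):
    # ASSUMPTION: alpha weights the task goal and (1 - alpha) the current state.
    return alpha * cosine(task_vec, skill_vec) + (1 - alpha) * cosine(state_vec, skill_vec)


def top_m(task_vec, state_vec, skills, m=3, alpha=0.5):
    """Names of the m highest-scoring skills; `skills` maps name -> embedding."""
    ranked = sorted(
        skills,
        key=lambda name: state_grounded_score(task_vec, state_vec, skills[name], alpha),
        reverse=True,
    )
    return ranked[:m]


# Toy 2-d embeddings (hypothetical): axis 0 ~ task goal, axis 1 ~ page state.
skills = {"search_product": [1, 0], "fill_checkout_form": [0, 1], "apply_filter": [1, 1]}
print(top_m([1, 0], [0, 1], skills, m=1, alpha=0.5))
```

In this toy setup, task-only scoring (`alpha=1.0`) and state-only scoring (`alpha=0.0`) each favor a skill aligned with a single signal, while the balanced setting prefers the skill that matches both.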
##### Ablation Study on MMR Reranking\.
We next ablate the MMR reranking module to examine whether relevance alone is sufficient for selecting useful skills\. [Table 3](https://arxiv.org/html/2606.04391#S5.T3) shows that retrieving skills only by the top\-$M$ relevance score performs worse than all MMR variants, indicating that relevance\-only retrieval can introduce redundant or overly similar skills\. Adding MMR consistently improves performance by encouraging a more diverse set of retrieved procedures\. Among the MMR settings, $\lambda=0.7$ performs best on all three domains, while the other settings are slightly weaker, suggesting that SGDR benefits most from a relevance\-focused ranking that still preserves procedural diversity\.
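The trade-off controlled by $\lambda$ can be illustrated with the standard greedy form of Maximal Marginal Relevance. This is a sketch of the textbook procedure, assuming precomputed relevance scores and a pairwise similarity function; Equation 2 is not shown in this section, so its exact form may differ, and the skill names and numbers are made up.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR: repeatedly pick the candidate maximizing
    lam * relevance - (1 - lam) * (max similarity to already selected items)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical skills: two near-duplicate search skills and one distinct skill.
relevance = {"search_v1": 0.9, "search_v2": 0.85, "post_comment": 0.6}
pair_sim = {
    frozenset(("search_v1", "search_v2")): 0.95,
    frozenset(("search_v1", "post_comment")): 0.1,
    frozenset(("search_v2", "post_comment")): 0.1,
}
sim = lambda a, b: pair_sim[frozenset((a, b))]

print(mmr_rerank(relevance, relevance, sim, k=2, lam=0.7))
print(mmr_rerank(relevance, relevance, sim, k=2, lam=1.0))
```

With `lam=1.0` the procedure reduces to plain top-$M$ relevance and returns both near-duplicates, whereas `lam=0.7` swaps the second one for the less similar skill.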
Table 3: Ablation study on MMR reranking with model GPT\-4\.1 on Shopping, Reddit, and Map websites\.
##### Ablation Study on Sliding\-Window Extraction\.
We compare different granularities for skill extraction\. As shown in [Table 4](https://arxiv.org/html/2606.04391#S5.T4), sliding\-window skills outperform both full\-trajectory and single\-action alternatives on all domains\. Full\-trajectory skills preserve more task\-level context but are less reusable for intermediate webpage states, leading to lower performance\. Single\-action skills perform worst because they provide too little abstraction over primitive browser actions to capture meaningful procedures\. In contrast, sliding\-window extraction offers a better balance: it captures reusable multi\-action sub\-procedures while remaining flexible enough to be invoked at different execution states\.
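The sliding-window granularity can be sketched as follows, using the window lengths $\{2,3,4,5\}$ from the implementation details. The function name and the toy action strings are illustrative assumptions; the actual pipeline additionally asks an LLM to judge which windows are reusable.

```python
def sliding_windows(trajectory, lengths=(2, 3, 4, 5)):
    """Yield (start_index, window) for every contiguous window of each length."""
    for length in lengths:
        # range() is empty when the trajectory is shorter than the window.
        for start in range(len(trajectory) - length + 1):
            yield start, trajectory[start:start + length]


# Hypothetical five-step trajectory of primitive browser actions.
trajectory = [
    "click(search_box)",
    "fill(search_box, query)",
    "click(search_button)",
    "click(first_result)",
    "click(add_to_cart)",
]
windows = list(sliding_windows(trajectory))
print(len(windows))  # 4 + 3 + 2 + 1 candidate windows
```

A full-trajectory skill would correspond to the single length-5 window here, and single-action skills to length-1 windows, which this extraction deliberately excludes.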
Table 4: Ablation study on skill extraction granularity with backbone model GPT\-4\.1 on Shopping, Reddit, and Map websites\.
## 6 Case Study
We present representative case studies in Appendix [B](https://arxiv.org/html/2606.04391#A2)\. SGDR induces reusable skills from judged\-as\-successful trajectories in several different domains\. For example, one skill listed in Appendix [B\.1](https://arxiv.org/html/2606.04391#A2.SS1) fills the start and destination fields to submit a driving\-directions query in the Map domain, while another skill listed in Appendix [B\.2](https://arxiv.org/html/2606.04391#A2.SS2) fills and submits a merge\-request comment in the GitLab domain\. Although the two skills come from distinct websites, both separate webpage\-specific element identifiers from task\-specific content values, suggesting that SGDR learns practical sub\-procedural patterns\.
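The separation between element identifiers and content values can be pictured with a hypothetical skill in the spirit of the Map example. This is not the skill listed in Appendix B: the function name, argument names, element IDs, and the stub `fill` and `click` primitives (which only record actions instead of driving a browser) are all assumptions for illustration.

```python
actions = []  # recorded primitive actions, standing in for a real browser


def fill(element_id: str, text: str) -> None:
    """Stub primitive: record typing `text` into the element."""
    actions.append(("fill", element_id, text))


def click(element_id: str) -> None:
    """Stub primitive: record a click on the element."""
    actions.append(("click", element_id))


def get_driving_directions(start_field_id, destination_field_id,
                           submit_button_id, start, destination):
    """Fill the start and destination fields, then submit the query.

    The *_id arguments are webpage-specific identifiers read from the current
    page; `start` and `destination` are task-specific content values.
    """
    fill(start_field_id, start)
    fill(destination_field_id, destination)
    click(submit_button_id)


# Made-up element IDs and locations.
get_driving_directions("145", "158", "166", "Pittsburgh", "Philadelphia")
print(actions)
```

Because the identifiers are arguments rather than constants, the same function could be invoked on a later page state where the fields carry different IDs.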
## 7 Conclusion
We present SGDR, a method for language agents that addresses core limitations of task\-level skill reuse in the setting of online skill learning\. By extracting skills from sliding windows of evaluator\-assessed trajectories and retrieving them dynamically with both task and state information, the agent receives adaptive support throughout execution rather than only at the beginning of each task\. Results on WebArena show strong performance of SGDR across five domains with two backbone models, suggesting that state\-grounded retrieval is a practical approach to improving web agents based on both proprietary and open\-source models\.
## Limitations
This work has two main limitations\. First, our experiments are conducted on WebArena, which provides realistic multi\-step web tasks but still covers a limited set of website domains, interaction patterns, and agent actions\. Evaluating SGDR on broader web environments would further validate its generality\. Second, our study focuses on non\-parametric skill accumulation and reuse, without exploring how the learned skills could be integrated with model fine\-tuning or long\-term agent personalization\. We leave these directions for future work\.
## Ethical Considerations
This work studies online skill learning for language agents in web environments\. Our experiments are conducted on WebArena and do not involve human subjects, private user data, or interactions with live third\-party websites\. Nevertheless, more capable web agents may raise potential concerns if deployed without appropriate safeguards, since automated agents could perform unintended actions, access sensitive information, or violate website usage policies\. We therefore view SGDR as a research framework for controlled environments, and practical deployment should include permission checks, action constraints, and monitoring\. The learned skills should also be validated before reuse in safety\-critical settings\.
## References
- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2023\)Gpt\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§5\.1](https://arxiv.org/html/2606.04391#S5.SS1.SSS0.Px3.p1.4)\.
- J\. Carbonell and J\. Goldstein \(1998\) The use of MMR, diversity\-based reranking for reordering documents and producing summaries\. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp\. 335–336\. Cited by: [§4\.2](https://arxiv.org/html/2606.04391#S4.SS2.SSS0.Px2.p1.3)\.
- N\. Cesa\-Bianchi and G\. Lugosi \(2006\)Prediction, learning, and games\.Cambridge university press\.Cited by:[§3\.2](https://arxiv.org/html/2606.04391#S3.SS2.p1.1)\.
- H\. Chae, N\. Kim, K\. Ong, M\. Gwak, G\. Song, J\. Kim, S\. Kim, D\. Lee, and J\. Yeo \(2025\)Web agents with world models: learning and leveraging environment dynamics in web navigation\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 63707–63738\.Cited by:[§1](https://arxiv.org/html/2606.04391#S1.p1.1)\.
- X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su \(2023\)Mind2web: towards a generalist agent for the web\.Advances in Neural Information Processing Systems36,pp\. 28091–28114\.Cited by:[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- A\. Drouin, M\. Gasse, M\. Caccia, I\. H\. Laradji, M\. Del Verme, T\. Marty, D\. Vazquez, N\. Chapados, and A\. Lacoste \(2024\) WorkArena: how capable are web agents at solving common knowledge work tasks? In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 235, pp\. 11642–11662\. External Links: [Link](https://proceedings.mlr.press/v235/drouin24a.html) Cited by: [§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- R\. Fang, Y\. Liang, X\. Wang, J\. Wu, S\. Qiao, P\. Xie, F\. Huang, H\. Chen, and N\. Zhang \(2025\)Memp: exploring agent procedural memory\.arXiv preprint arXiv:2508\.06433\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- B\. Gou, Z\. Huang, Y\. Ning, Y\. Gu, M\. Lin, W\. Qi, A\. Kopanev, B\. Yu, B\. J\. Gutierrez, Y\. Shu, C\. H\. Song, J\. Wu, S\. Chen, H\. N\. Moussa, T\. Zhang, J\. Xie, Y\. Li, T\. Xue, Z\. Liao, K\. Zhang, B\. Zheng, Z\. Cai, V\. Rozgic, M\. Ziyadi, H\. Sun, and Y\. Su \(2026\) Mind2Web 2: evaluating agentic search with agent\-as\-a\-judge\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track\. External Links: [Link](https://openreview.net/forum?id=AUaW6DS9si) Cited by: [§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- Y\. Gu, K\. Zhang, Y\. Ning, B\. Zheng, B\. Gou, T\. Xue, C\. Chang, S\. Srivastava, Y\. Xie, P\. Qi, H\. Sun, and Y\. Su \(2025\)Is your LLM secretly a world model of the internet? model\-based planning for web agents\.Transactions on Machine Learning Research\.Note:External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=c6l7yA0HSq)Cited by:[§1](https://arxiv.org/html/2606.04391#S1.p1.1)\.
- H\. He, W\. Yao, K\. Ma, W\. Yu, Y\. Dai, H\. Zhang, Z\. Lan, and D\. Yu \(2024\)Webvoyager: building an end\-to\-end web agent with large multimodal models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 6864–6890\.Cited by:[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- S\. Hu, C\. Lu, and J\. Clune \(2025\) Automated design of agentic systems\. In The Thirteenth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=t9U3LW7JVX) Cited by: [§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- G\. Jiang, Z\. Su, X\. Qu, and Y\. R\. Fung \(2026\)Xskill: continual learning from experience and skills in multimodal agents\.arXiv preprint arXiv:2603\.12056\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- J\. Y\. Koh, R\. Lo, L\. Jang, V\. Duvvur, M\. Lim, P\. Huang, G\. Neubig, S\. Zhou, R\. Salakhutdinov, and D\. Fried \(2024\) Visualwebarena: evaluating multimodal agents on realistic visual web tasks\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 881–905\. Cited by: [§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- H\. Lai, X\. Liu, I\. L\. Iong, S\. Yao, Y\. Chen, P\. Shen, H\. Yu, H\. Zhang, X\. Zhang, Y\. Dong, et al\. \(2024\) Autowebglm: a large language model\-based web navigating agent\. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 5295–5306\. Cited by: [§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- J\. Li, Y\. Shi, X\. Huang, J\. Lu, and N\. Liu \(2025\)MITS: enhanced tree search reasoning for llms via pointwise mutual information\.arXiv preprint arXiv:2510\.03632\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- E\. Z\. Liu, K\. Guu, P\. Pasupat, and P\. Liang \(2018\)Reinforcement learning on web interfaces using workflow\-guided exploration\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=ryTp3f-0-)Cited by:[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- W\. Liu, X\. Song, J\. Li, Y\. Wei, N\. Zheng, J\. Yin, and L\. Nie \(2026\)Mitigating hallucination through theory\-consistent symmetric multimodal preference optimization\.Advances in Neural Information Processing Systems38,pp\. 111259–111284\.Cited by:[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- Y\. Liu, C\. Si, K\. R\. Narasimhan, and S\. Yao \(2025\)Contextual experience replay for self\-improvement of language agents\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 14179–14198\.Cited by:[§1](https://arxiv.org/html/2606.04391#S1.p1.1),[§1](https://arxiv.org/html/2606.04391#S1.p2.1),[§1](https://arxiv.org/html/2606.04391#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.04391#S3.SS2.p2.5),[§5\.1](https://arxiv.org/html/2606.04391#S5.SS1.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2606.04391#S5.SS1.SSS0.Px3.p1.4)\.
- X\. H\. Lu, Z\. Kasner, and S\. Reddy \(2024\) WebLINX: real\-world website navigation with multi\-turn dialogue\. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 235, pp\. 33007–33056\. External Links: [Link](https://proceedings.mlr.press/v235/lu24e.html) Cited by: [§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- Z\. Lu, Y\. Zuo, Y\. Nie, X\. He, W\. Fan, and C\. Dai \(2026\)ContractSkill: repairable contract\-based skills for multimodal web agents\.arXiv preprint arXiv:2603\.20340\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- R\. Nakano, J\. Hilton, S\. Balaji, J\. Wu, L\. Ouyang, C\. Kim, C\. Hesse, S\. Jain, V\. Kosaraju, W\. Saunders,et al\.\(2021\)Webgpt: browser\-assisted question\-answering with human feedback\.arXiv preprint arXiv:2112\.09332\.Cited by:[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- L\. Ning, Z\. Liang, Z\. Jiang, H\. Qu, Y\. Ding, W\. Fan, X\. Wei, S\. Lin, H\. Liu, P\. S\. Yu,et al\.\(2025\)A survey of webagents: towards next\-generation ai agents for web automation with large foundation models\.InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 2,pp\. 6140–6150\.Cited by:[§1](https://arxiv.org/html/2606.04391#S1.p1.1)\.
- S\. Ouyang, J\. Yan, Y\. Chen, R\. Han, Z\. Wang, B\. D\. Mishra, R\. Meng, C\. Li, Y\. Jiao, K\. Zha,et al\.\(2026a\)SkillOS: learning skill curation for self\-evolving agents\.arXiv preprint arXiv:2605\.06614\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- S\. Ouyang, J\. Yan, I\. Hsu, Y\. Chen, K\. Jiang, Z\. Wang, R\. Han, L\. Le, S\. Daruki, X\. Tang, V\. Tirumalashetty, G\. Lee, M\. Rofouei, H\. Lin, J\. Han, C\. Lee, and T\. Pfister \(2026b\)ReasoningBank: scaling agent self\-evolving with reasoning memory\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=jL7fwchScm)Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- C\. Qian, S\. Liang, Y\. Qin, Y\. Ye, X\. Cong, Y\. Lin, Y\. Wu, Z\. Liu, and M\. Sun \(2024\)Investigate\-consolidate\-exploit: a general strategy for inter\-task agent self\-evolution\.arXiv preprint arXiv:2401\.13996\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- S\. Shalev\-Shwartz \(2025\)Online learning and online convex optimization\.Foundations and Trends® in Machine Learning4\(2\),pp\. 107–194\.Cited by:[§3\.2](https://arxiv.org/html/2606.04391#S3.SS2.p1.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in neural information processing systems36,pp\. 8634–8652\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- T\. Sumers, S\. Yao, K\. R\. Narasimhan, and T\. L\. Griffiths \(2024\)Cognitive architectures for language agents\.Transactions on Machine Learning Research\.Note:Survey Certification, Featured CertificationExternal Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=1i6ZCvflQJ)Cited by:[§1](https://arxiv.org/html/2606.04391#S1.p1.1)\.
- J\. Sun, J\. Zhu, Y\. Li, T\. Liu, X\. Hu, and B\. Han \(2026\)AgentHijack: benchmarking computer use agent robustness to common environment corruptions\.arXiv preprint arXiv:2605\.25707\.Cited by:[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- R\. Sun, T\. Yang, W\. Niu, and J\. Sun \(2025\)OUSAC: optimized guidance scheduling with adaptive caching for dit acceleration\.arXiv preprint arXiv:2512\.14096\.Cited by:[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- J\. Tack, J\. Kim, E\. Mitchell, J\. Shin, Y\. W\. Teh, and J\. R\. Schwarz \(2024\)Online adaptation of language models with a memory of amortized contexts\.Advances in Neural Information Processing Systems37,pp\. 130109–130135\.Cited by:[§1](https://arxiv.org/html/2606.04391#S1.p1.1)\.
- Q\. Tan, X\. Song, A\. Akbari, A\. Akbari, Y\. Wang, X\. Zhai, L\. Hong, Z\. Xiang, J\. Lu, and G\. Yuan \(2026a\)Palette: a modular, controllable, and efficient framework for on\-demand authorized safety alignment relaxation in llms\.arXiv preprint arXiv:2605\.24154\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- Q\. Tan, X\. Song, N\. Cheng, N\. Liu, X\. Zhai, L\. Hong, Y\. Wang, Z\. Xiang, and G\. Yuan \(2026b\)Q\-realign: piggybacking realignment on quantization for safe and efficient llm deployment\.arXiv preprint arXiv:2601\.08089\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- S\. Tian, Z\. Zhang, L\. Chen, and Z\. Liu \(2025\) Mmina: benchmarking multihop multimodal internet agents\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 13682–13697\. Cited by: [§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- C\. Wang, Z\. Yu, X\. Xie, W\. Yao, R\. Fang, S\. Qiao, K\. Cao, G\. Zheng, X\. Qi, P\. Zhang,et al\.\(2026a\)SkillX: automatically constructing skill knowledge bases for agents\.arXiv preprint arXiv:2604\.04804\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- J\. Wang, Y\. Ming, Z\. Ke, S\. Joty, A\. Albarghouthi, and F\. Sala \(2026b\)Skillorchestra: learning to route agents via skill transfer\.arXiv preprint arXiv:2602\.19672\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- Z\. Wang, Q\. Wu, X\. Zhang, C\. Zhang, W\. Yao, F\. E\. Faisal, B\. Peng, S\. Qin, S\. Nath, Q\. Lin,et al\.\(2026c\)WebXSkill: skill learning for autonomous web agents\.arXiv preprint arXiv:2604\.13318\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- Z\. Wang, G\. Neubig, and D\. Fried \(2024\) TroVE: inducing verifiable and efficient toolboxes for solving programmatic tasks\. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 235, pp\. 51177–51191\. External Links: [Link](https://proceedings.mlr.press/v235/wang24az.html) Cited by: [§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- Z\. Z\. Wang, A\. Gandhi, G\. Neubig, and D\. Fried \(2025a\)Inducing programmatic skills for agentic tasks\.arXiv preprint arXiv:2504\.06821\.Cited by:[§1](https://arxiv.org/html/2606.04391#S1.p2.1),[§1](https://arxiv.org/html/2606.04391#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.04391#S3.SS2.p2.5),[§4\.1](https://arxiv.org/html/2606.04391#S4.SS1.p5.2),[§5\.1](https://arxiv.org/html/2606.04391#S5.SS1.SSS0.Px2.p1.1)\.
- Z\. Z\. Wang, J\. Mao, D\. Fried, and G\. Neubig \(2025b\)Agent workflow memory\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=NTAhi2JEEE)Cited by:[§1](https://arxiv.org/html/2606.04391#S1.p2.1),[§1](https://arxiv.org/html/2606.04391#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.04391#S3.SS2.p2.5),[§5\.1](https://arxiv.org/html/2606.04391#S5.SS1.SSS0.Px2.p1.1)\.
- T\. Xue, W\. Qi, T\. Shi, C\. H\. Song, B\. Gou, D\. Song, H\. Sun, and Y\. Su \(2025\) An illusion of progress? assessing the current state of web agents\. In Second Conference on Language Modeling\. External Links: [Link](https://openreview.net/forum?id=6jZi4HSs6o) Cited by: [§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025a\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§5\.1](https://arxiv.org/html/2606.04391#S5.SS1.SSS0.Px3.p1.4)\.
- T\. Yang, T\. Jordan, R\. Sun, N\. Liu, and J\. Sun \(2026\) Common inpainted objects in\-n\-out of context\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pp\. 13069–13079\. Cited by: [§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- T\. Yang, Y\. Shi, M\. Du, X\. Wu, Q\. Tan, J\. Sun, and N\. Liu \(2025b\)Concept\-centric token interpretation for vector\-quantized generative models\.arXiv preprint arXiv:2506\.00698\.Cited by:[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- Y\. Yang, S\. Kang, J\. Lee, D\. Lee, S\. Yun, and K\. Lee \(2025c\)Automated skill discovery for language agents through exploration and iterative feedback\.arXiv preprint arXiv:2506\.04287\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2022\)Webshop: towards scalable real\-world web interaction with grounded language agents\.Advances in Neural Information Processing Systems35,pp\. 20744–20757\.Cited by:[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=WE_vluYUL-X)Cited by:[§1](https://arxiv.org/html/2606.04391#S1.p1.1)\.
- S\. Yu, G\. Li, W\. Shi, and P\. Qi \(2025\)Polyskill: learning generalizable skills through polymorphic abstraction\.arXiv preprint arXiv:2510\.15863\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- T\. Yu, Z\. Zhang, Z\. Lyu, J\. Gong, H\. Yi, X\. Wang, Y\. Zhou, J\. Yang, P\. Nie, Y\. Huang, and W\. Chen \(2026\)BrowserAgent: building web agents with human\-inspired web browsing actions\.Transactions on Machine Learning Research\.Note:Reproducibility Certification, J2C CertificationExternal Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=X4CfZPSEHE)Cited by:[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024\)Expel: llm agents are experiential learners\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 19632–19642\.Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- B\. Zheng, M\. Y\. Fatemi, X\. Jin, Z\. Z\. Wang, A\. Gandhi, Y\. Song, Y\. Gu, J\. Srinivasa, G\. Liu, G\. Neubig,et al\.\(2025\)Skillweaver: web agents can self\-improve by discovering and honing skills\.arXiv preprint arXiv:2504\.07079\.Cited by:[§1](https://arxiv.org/html/2606.04391#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- B\. Zheng, B\. Gou, J\. Kil, H\. Sun, and Y\. Su \(2024a\) GPT\-4V\(ision\) is a generalist web agent, if grounded\. In Proceedings of the 41st International Conference on Machine Learning, R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 235, pp\. 61349–61385\. External Links: [Link](https://proceedings.mlr.press/v235/zheng24e.html) Cited by: [§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- L\. Zheng, R\. Wang, X\. Wang, and B\. An \(2024b\) Synapse: trajectory\-as\-exemplar prompting with memory for computer control\. In The Twelfth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=Pc8AU1aF5e) Cited by: [§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig \(2024\)WebArena: a realistic web environment for building autonomous agents\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=oKn9c6ytLx)Cited by:[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1),[§5\.1](https://arxiv.org/html/2606.04391#S5.SS1.SSS0.Px1.p1.2)\.
- Y\. Zhou, Q\. Yang, K\. Lin, M\. Bai, X\. Zhou, Y\. Wang, S\. Levine, and L\. E\. Li \(2025a\)Proposer\-agent\-evaluator \(PAE\): autonomous skill discovery for foundation model internet agents\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=uCkwqAG0uS)Cited by:[§2\.2](https://arxiv.org/html/2606.04391#S2.SS2.p1.1)\.
- Z\. Zhou, C\. Cao, X\. Feng, X\. Li, Z\. Li, X\. Lu, J\. Yao, W\. Huang, L\. Xu, T\. Cheng,et al\.\(2025b\)AlphaApollo: orchestrating foundation models and professional tools into a self\-evolving system for deep agentic reasoning\.arXiv preprint arXiv:2510\.06261\.Cited by:[§1](https://arxiv.org/html/2606.04391#S1.p1.1)\.
- J\. Zhu, Y\. Ro, J\. Robertson, K\. Wang, J\. Li, H\. Vikalo, A\. Akella, and Z\. Wang \(2026\)Your agents are aging too: agent lifespan engineering for deployed systems\.arXiv preprint arXiv:2605\.26302\.Cited by:[§2\.1](https://arxiv.org/html/2606.04391#S2.SS1.p1.1)\.
## Appendix A Experiment Details
### A\.1 Agent Action Space
[Table 5](https://arxiv.org/html/2606.04391#A1.T5) shows the default base action space of the web navigation agents used in all experiments within the WebArena environment\. This action space is the same for all methods: Vanilla, AWM, ASI, CER, and SGDR\.
Table 5:Base primitive action space for web agents throughout our experiments in WebArena\.
### A\.2 Task Indices for Website Domains
For reproducibility, we provide the task indices used for each WebArena website domain\. We remove all cross\-site tasks to ensure that skills are extracted and reused within the same website domain, thereby preventing cross\-domain skill transfer from confounding the evaluation\. After this filtering, we use 764 single\-domain tasks in total: 187 Shopping, 182 Admin, 106 Reddit, 180 GitLab, and 109 Map tasks\. The detailed task indices for each domain are listed below\.
- Shopping: 21–26, 47–51, 96, 117–118, 124–126, 141–150, 158–167, 188–192, 225–235, 238–242, 260–264, 269–286, 298–302, 313, 319–338, 351–355, 358–362, 368, 376, 384–388, 431–440, 465–469, 506–521, 528–532, 571–575, 585–589, 653–657, 689–693, 792–798\.
- Admin: 0–6, 11–15, 41–43, 62–65, 77–79, 94–95, 107–116, 119–123, 127–131, 157, 183–187, 193–204, 208–217, 243–247, 288–292, 344–348, 374–375, 423, 453–464, 470–474, 486–505, 538–551, 676–680, 694–713, 768–782, 790\.
- Reddit: 27–31, 66–69, 399–410, 580–584, 595–652, 714–735\.
- GitLab: 44–46, 102–106, 132–136, 156, 168–182, 205–207, 258–259, 293–297, 303–312, 314–318, 339–343, 349–350, 357, 389–398, 411–422, 441–452, 475–485, 522–527, 533–537, 567–570, 576–579, 590–594, 658–670, 736, 742–756, 783–789, 799–811\.
- Map: 7–10, 16–20, 32–40, 52–61, 70–76, 80–93, 98–101, 137–140, 151–155, 218–224, 236–237, 248–257, 287, 356, 363–367, 369–373, 377–383, 757–758, 761–767\.
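For readers who want to reuse these splits, the range notation can be expanded into explicit task IDs with a few lines of Python. This is a convenience sketch with a helper name of our own choosing; it is shown on the Reddit and Map lists, and the other domains can be pasted in the same way.

```python
import re


def parse_task_indices(spec: str) -> list[int]:
    """Expand a spec such as '27–31, 66–69, 287' into a sorted list of task IDs.

    Ranges are inclusive and may use an en dash or a plain hyphen.
    """
    ids = []
    for part in spec.split(","):
        part = part.strip().rstrip(".")
        if not part:
            continue
        bounds = [int(x) for x in re.split(r"[–-]", part)]
        ids.extend(range(bounds[0], bounds[-1] + 1))
    return sorted(ids)


REDDIT = "27–31, 66–69, 399–410, 580–584, 595–652, 714–735"
MAP = ("7–10, 16–20, 32–40, 52–61, 70–76, 80–93, 98–101, 137–140, 151–155, "
       "218–224, 236–237, 248–257, 287, 356, 363–367, 369–373, 377–383, "
       "757–758, 761–767")

print(len(parse_task_indices(REDDIT)), len(parse_task_indices(MAP)))
```

Expanding the two lists gives 106 Reddit tasks and 109 Map tasks, matching the per-domain counts stated above.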
### A\.3 Prompts for LLM\-Based Components
In this subsection, we list the prompts given to the LLM\-based components involved in [Section 4](https://arxiv.org/html/2606.04391#S4)\.
#### A\.3\.1 Prompts for Trajectory Assessment\.
Here are the prompts we give to the trajectory evaluator model $E$ to assess whether the current trajectory successfully completes the task, as described in both [Section 3\.2](https://arxiv.org/html/2606.04391#S3.SS2) and [Section 4\.1](https://arxiv.org/html/2606.04391#S4.SS1)\. They are used not only for our method but also for the baseline methods AWM, ASI, and CER introduced in [Section 5\.1](https://arxiv.org/html/2606.04391#S5.SS1), as they all require the evaluator model $E$ to judge their trajectories\.
System Prompt\. The system prompt requires the evaluator model $E$ to give a judgement of "success" or "failure" based on the user prompt input\.
You are an expert in evaluating the performance of a web navigation agent\. The agent is designed to help a human user navigate a website to complete a task\. Given the user’s intent, the agent’s action history, the final state of the webpage, and the agent’s response to the user, your goal is to decide whether the agent’s execution is successful or not\.
There are three types of tasks:
1\. Information seeking: The user wants to obtain certain information from the webpage, such as the information of a product, reviews, map info, comparison of map routes, etc\. The bot’s response must contain the information the user wants, or explicitly state that the information is not available\. Otherwise, e\.g\. the bot encounters an exception and respond with the error content, the task is considered a failure\. Besides, be careful about the sufficiency of the agent’s actions\. For example, when asked to list the top\-searched items in a shop, the agent should order the items by the number of searches, and then return the top items\. If the ordering action is missing, the task is likely to fail\.
2\. Site navigation: The user wants to navigate to a specific page\. Carefully examine the bot’s action history and the final state of the webpage to determine whether the bot successfully completes the task\. No need to consider the bot’s response\.
3\. Content modification: The user wants to modify the content of a webpage or configuration\. Carefully examine the bot’s action history and the final state of the webpage to determine whether the bot successfully completes the task\. No need to consider the bot’s response\.
\*IMPORTANT\*
Format your response into two lines as shown below:
Thoughts: <your thoughts and reasoning process\>"
Status: "success" or "failure"
User Prompt\. Here is the user prompt given to the evaluator model $E$\. For the placeholders in this prompt, `intent` is the task goal, `last-actions` is the action history of the agent, `cap` is the final state of the webpage, and `response` is the response extracted from the last action that the agent gives to the user\.
User Intent: \{intent\}
Action History:
\{last\-actions\}
The detailed final state of the webpage:
\`\`\`md
\{cap\}
\`\`\`
Bot response to the user: \{response if response else "N/A"\}\.
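The two prompts above imply a simple round trip: fill the placeholders, send the result to the evaluator, and read the verdict from the `Status:` line. The following is a minimal sketch of that round trip, not the authors' implementation. The helper names `build_user_prompt` and `parse_verdict` are hypothetical, `last_actions` stands in for the `last-actions` placeholder (hyphens are not valid in Python identifiers), and the call to the evaluator model itself is left out.

```python
import re

FENCE = "`" * 3  # built programmatically so this listing contains no literal fence


def build_user_prompt(intent, last_actions, cap, response=None):
    """Fill the user-prompt placeholders; a missing response becomes "N/A"."""
    return (
        f"User Intent: {intent}\n"
        "Action History:\n"
        f"{last_actions}\n"
        "The detailed final state of the webpage:\n"
        f"{FENCE}md\n{cap}\n{FENCE}\n"
        f"Bot response to the user: {response if response else 'N/A'}."
    )


def parse_verdict(reply):
    """Return True for "success", False for "failure", None if no Status line is found."""
    match = re.search(
        r'^Status:\s*"?(success|failure)"?', reply, re.IGNORECASE | re.MULTILINE
    )
    if match is None:
        return None
    return match.group(1).lower() == "success"
```

Returning `None` for an unparseable reply lets the caller decide how to treat it, for example by conservatively counting it as a failure so that no skill is induced from an unverified trajectory.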
#### A\.3\.2 Prompts for Skill Induction\.
Here we list the prompts used for skill extraction in [Section 4\.1](https://arxiv.org/html/2606.04391#S4.SS1)\. Given the trajectory windows segmented by sliding windows, this skill\-induction prompt extracts reusable, single\-page\-callable sub\-routines from successful trajectories and emits each as an executable Python function with a retrieval\-friendly description\.
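The window segmentation mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions: the stride is one step, window lengths run from 2 to 5 steps (the lengths named in the user prompt), and the trajectory is a plain list of step strings. The function name `sliding_windows` is hypothetical.

```python
def sliding_windows(steps, min_len=2, max_len=5):
    """Return (start_index, window) pairs for every contiguous slice of min_len..max_len steps."""
    windows = []
    for length in range(min_len, max_len + 1):
        # range() is empty when the trajectory is shorter than the window length
        for start in range(len(steps) - length + 1):
            windows.append((start, steps[start:start + length]))
    return windows
```

Keeping the start index alongside each window records the intermediate step at which the candidate sub-routine begins.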
System Prompt\.
You are a proficient web\-automation engineer\. You judge whether short slices of a successful web trajectory are reusable sub\-routines, and when they are, you emit a small Python function that implements the routine\. Follow the user instruction’s rules and output format exactly\.
User Prompt\.
You will be shown several action windows extracted from a successful web task trajectory by a sliding window of length 2, 3, 4, or 5 steps\. Each step is a short thought followed by one or more action calls \(e\.g\. click, fill, select\_option\)\.
For each window you must decide:
1\. Is the window a \*reusable\* sub\-routine?
A reusable window:
\- Performs a recognizable web operation that could occur on other tasks \(e\.g\. searching a product, applying a price filter, posting a comment, opening a user profile\)\.
\- Is general enough to apply with different inputs: variable parts \(search queries, usernames, element ids that obviously vary across tasks\) become function arguments with descriptive names\. Windows that depend on one\-off element ids or task\-specific text that cannot be parameterized are NOT reusable\.
\- Contains 2 to 5 action steps\.
Single\-page\-state callability \(IMPORTANT\): the agent that will invoke this skill observes only the CURRENT webpage at call time\. EVERY element ID the skill takes as an argument must be readable from the single accessibility tree visible to the agent at the moment of call\.
\- Strongly prefer skills whose argument IDs \(button IDs, field IDs, option IDs\) are all simultaneously visible on one page state\.
\- REJECT skills that require an ID which appears only AFTER a page transition the skill itself triggers\. The skill may navigate internally, but the caller must still supply that future ID upfront \- and the caller cannot observe pages it has not yet reached\. There is NO valid exception\.
\* Callable example: "fill title \+ fill body \+ click submit" on a single submission form \- all three IDs are visible simultaneously on that one page\.
\* NOT callable: "click combobox, click option, fill title, fill body, click submit" \- the option ID only appears after the combobox is opened, so it is not readable at the moment the routine is called\.
2\. If reusable, produce:
\- description: a single sentence that MUST contain both
\(a\) a precise action verb \+ object \(e\.g\. "submit a forum post", "apply a price filter", "open a forum\-selection combobox", "fill in the title and body"\); and
\(b\) the typical page context where this routine runs \(e\.g\. "on a forum submission form", "on a product listing page", "in an opened combobox"\)\.
The description embedding is cosine\-matched to a page\-state summary written in the same operational vocabulary, so generic phrasing like "Performs several clicks" will hurt retrieval\.
\- code: a Python function that implements the routine\.
Code constraints:
\- Use ONLY the following actions: click, fill, hover, keyboard\_press, scroll, tab\_focus, new\_tab, tab\_close, go\_back, go\_forward, goto, send\_msg\_to\_user, report\_infeasible, select\_option\.
\- Function arguments must be primitive types \(str, int, list of str\)\. No callbacks\.
\- No try/except\.
\- Do NOT hardcode user\-facing messages inside \`send\_msg\_to\_user\`; if the routine ends with a message, take it as a \`message\` argument\.
Output format \- return a single JSON array, one object per window in the same order they were given\. Schema:
\[
\{"window\_idx": 0, "reusable": true, "func\_name": "search\_product", "description": "\.\.\.", "code": "def search\_product\(query\):\\n    click\('search'\)\\n    fill\('search', query\)\\n    keyboard\_press\('Enter'\)\\n"\},
\{"window\_idx": 1, "reusable": false\}
\]
Only output the JSON array, no surrounding prose, no markdown fences\.
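The code constraints in this prompt are mechanical enough to be checked after the model replies. The sketch below is one possible post-check, not something the paper describes: it parses the JSON array, keeps windows marked reusable, and rejects code that uses `try` or calls anything outside the allowed action list. The names `violates_constraints` and `accepted_skills` are hypothetical, and treating every non-listed call (including built-ins) as a violation is a strict reading assumed here. It needs Python 3.9 or later for `ast.unparse`.

```python
import ast
import json

ALLOWED_ACTIONS = {
    "click", "fill", "hover", "keyboard_press", "scroll", "tab_focus",
    "new_tab", "tab_close", "go_back", "go_forward", "goto",
    "send_msg_to_user", "report_infeasible", "select_option",
}


def violates_constraints(code):
    """Return a list of constraint violations found in one emitted skill (empty if none)."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Try):
            problems.append("try/except is not allowed")
        elif isinstance(node, ast.Call):
            # Only bare-name calls such as click(...) can match the allowed list
            name = node.func.id if isinstance(node.func, ast.Name) else None
            if name not in ALLOWED_ACTIONS:
                problems.append(f"disallowed call: {name or ast.unparse(node.func)}")
    return problems


def accepted_skills(raw_json):
    """Keep windows marked reusable whose non-empty code passes the checks above."""
    return [
        item for item in json.loads(raw_json)
        if item.get("reusable")
        and item.get("code")
        and not violates_constraints(item["code"])
    ]
```

A static check of this kind cannot verify single-page-state callability, since that depends on which element IDs are visible at call time rather than on the code alone.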
#### A\.3\.3 Prompt for Web Summarization\.
Here is the prompt used for summarizing the webpage state $r_{i,t}=\mathrm{Summarize}(o_{i,t})$ for the $i$\-th task at execution step $t$, as indicated in [Section 4\.2](https://arxiv.org/html/2606.04391#S4.SS2)\. Note that it is a system prompt given to an LLM, and the user prompt is the accessibility tree \(in text format\) of the webpage\.
You are a state summarizer for a web agent whose action library is indexed by descriptions like ’submit a forum post on a submission form’ or ’apply a price filter on a product listing page’\. Your summary will be cosine\-matched against such skill descriptions, so use the SAME operational vocabulary they do\.
Given the current page’s accessibility tree \(axtree\) plus the URL and title, produce ONE short paragraph \(1\-2 sentences\) that:
1\. Names the kind of page in operational terms \(e\.g\. ’forum submission form’, ’product listing page’, ’opened forum\-selection combobox’, ’post\-detail page with comment section’\)\.
2\. Lists the action verbs this page ENABLES right now \- i\.e\. what sub\-routines could plausibly run on this exact state\. Use verb \+ object phrasing \(e\.g\. ’submit a post’, ’select a forum’, ’fill in the title and body’, ’open the sort menu’, ’apply a filter’\)\.
Do NOT enumerate every visible element, do NOT describe pure visuals \(colors, layout\), and do NOT mention task instructions or speculate about future steps\. Output only the summary text\.
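The prompt states that the summary is cosine-matched against skill descriptions. A toy version of that matching step is sketched below to show why shared operational vocabulary matters. Everything beyond the cosine comparison is an assumption: `embed` is a bag-of-words stand-in for whatever embedding model is actually used, and the weight `alpha` that mixes goal similarity with page-state similarity is a hypothetical choice, not the paper's scoring rule.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence-embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(skills, goal, state_summary, top_n=3, alpha=0.5):
    """Rank skill descriptions by a weighted mix of goal and page-state similarity."""
    goal_vec, state_vec = embed(goal), embed(state_summary)
    scored = [
        (alpha * cosine(embed(desc), goal_vec) + (1 - alpha) * cosine(embed(desc), state_vec), name)
        for name, desc in skills.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)[:top_n]]
```

With a summary such as "forum submission form where the agent can submit a post", a skill described as "submit a forum post on a forum submission form" shares most of its terms with the state and ranks above an unrelated price-filter skill.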
#### A\.3\.4 Prompt for Skill Activation and Execution\.
Here is the user prompt that asks the web agent to make its next\-step decision, as illustrated in [Section 4\.3](https://arxiv.org/html/2606.04391#S4.SS3)\.
\#\# Retrieved Skills
The following \{N\} high\-level skills were retrieved as candidates for your next sub\-routine\. If one’s intent matches what you need \(e\.g\., walking vs\. driving\) and the required arguments are visible in the accessibility tree, prefer calling it in a single action\. Otherwise proceed with primitive actions \- either way, keep making progress toward the goal\.
\[signature and document description of every retrieved skill\.\]
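The bracketed placeholder above is filled with one entry per retrieved skill. A minimal sketch of how such entries could be rendered from Python functions is given below. The layout (signature, colon, description) and the helper name `render_skill_entries` are assumptions, and the sample skill carries its description as a docstring purely for illustration.

```python
import inspect


def submit_comment_reply(reply_box_id, post_button_id, message):
    """Fill in a reply message and submit it on a comment thread page."""


def render_skill_entries(skills):
    """Format each skill as '<name><signature>: <description>', one per line."""
    return "\n".join(
        f"{fn.__name__}{inspect.signature(fn)}: {inspect.getdoc(fn)}" for fn in skills
    )
```

Showing the signature lets the agent check that every required argument is visible on the current page before it commits to calling the skill.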
### A\.4 Parameter Configuration
Table [6](https://arxiv.org/html/2606.04391#A1.T6) summarizes the main parameter configuration used in SGDR and the experimental setup\. Blank entries indicate parameters that are mentioned in the paper but not explicitly specified\.
Table 6: Parameter configuration of SGDR and the experimental setup\. Blank entries indicate parameters that are mentioned in the method but not explicitly specified in the current paper\.
## Appendix B Case Study
We present representative skills induced by SGDR from five WebArena domains: Map, GitLab, Shopping, Reddit, and Admin\. These examples illustrate the form and reusability of the learned procedural knowledge across different websites and interaction patterns\. In each case, the skill is extracted from a judged\-as\-successful trajectory and represented as a parameterized code function paired with a natural\-language description\.
### B\.1 Driving Directions Form Submission
The first skill is extracted from a Map task whose instruction is "Check if the social security administration in pittsburgh can be reached in one hour by car from CMU"\. After the task is successfully completed, SGDR induces the following skill from the trajectory:
def submit\_driving\_directions\_form\(start\_field\_id, dest\_field\_id, go\_button\_id, start\_location, destination\):
    fill\(start\_field\_id, start\_location\)
    fill\(dest\_field\_id, destination\)
    click\(go\_button\_id\)
The corresponding description is given as follows\.
Fill in the starting point and destination fields and click the Go button to generate driving directions on a directions input form\.
This skill is reusable because it separates structural webpage arguments, including start\_field\_id, dest\_field\_id, and go\_button\_id, from task\-specific content arguments, namely start\_location and destination\. As a result, the same procedure can be invoked for future map\-navigation tasks whenever the current page exposes the required input fields and submit button\.
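To make the split between structural and content arguments concrete, the sketch below runs the skill against stub primitives that only record what they are asked to do. The stubs `fill` and `click`, the `actions` log, and the element IDs "145", "158", and "171" are hypothetical; in a real run the IDs would be read from the current accessibility tree and the primitives would act on the browser.

```python
actions = []  # recorded primitive calls, in order


def fill(element_id, text):
    """Stub primitive: record the call instead of typing into a page."""
    actions.append(("fill", element_id, text))


def click(element_id):
    """Stub primitive: record the call instead of clicking."""
    actions.append(("click", element_id))


def submit_driving_directions_form(start_field_id, dest_field_id, go_button_id,
                                   start_location, destination):
    fill(start_field_id, start_location)
    fill(dest_field_id, destination)
    click(go_button_id)


# Structural arguments come from the page; content arguments come from the task.
submit_driving_directions_form(
    "145", "158", "171",
    "Carnegie Mellon University",
    "Social Security Administration, Pittsburgh",
)
```

One call expands into three primitive actions, which is what lets the agent invoke the skill in a single step once all three element IDs are visible.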
### B\.2 Merge Request Comment Submission
The second skill is extracted from a GitLab task whose instruction is to post “lgtm” for a merge request related to a specific project\. From this successful trajectory, SGDR induces the following skill:
def submit\_merge\_request\_comment\(comment\_box\_id, submit\_button\_id, comment\):
    fill\(comment\_box\_id, comment\)
    click\(submit\_button\_id\)
Its description is:
Submit a comment on a merge request page by filling the comment textbox and clicking the submit button on a merge request detail view\.
Although this skill comes from a different domain, it exhibits the same reusable abstraction pattern as the Map skill: element identifiers specify the current webpage structure, while the text argument specifies the task\-dependent content\.
Together, these examples show that SGDR can induce compact, parameterized skills that are grounded in the current webpage state but remain reusable across tasks\. They also illustrate why state\-grounded retrieval is important: such skills are useful only when the agent reaches a page state where the required fields and buttons are visible\.
### B\.3 Product Search and Wishlist Addition
The third skill is extracted from a Shopping task whose instruction is "Add Tide PODS Spring Meadow Scent HE Turbo Laundry Detergent Pacs, 81 Count to my wish list"\. After the task is successfully completed, SGDR induces the following skill from the trajectory:
def search\_and\_add\_first\_product\_to\_wishlist\(search\_box\_id, search\_button\_id, add\_to\_wishlist\_button\_id, product\_query\):
    fill\(search\_box\_id, product\_query\)
    click\(search\_button\_id\)
    click\(add\_to\_wishlist\_button\_id\)
The corresponding description is given as follows\.
Search for a product and add the first search result to the wishlist on a product search results page\.
This skill captures a longer e\-commerce subroutine that combines product search, query submission, and wishlist addition\. It separates the task\-specific content argument product\_query from structural webpage arguments, including search\_box\_id, search\_button\_id, and add\_to\_wishlist\_button\_id\. Compared with simpler two\-step fill\-and\-submit skills, this example shows that SGDR can induce multi\-step reusable procedures that abstract over repeated shopping interactions\.
### B\.4 Comment Reply Submission
The fourth skill is extracted from a Reddit task whose instruction is "Reply to the manager of the website in this post with ’thanks\! I am a big fan of your website\.’"\. After the task is successfully completed, SGDR induces the following skill from the trajectory:
def submit\_comment\_reply\(reply\_box\_id, post\_button\_id, message\):
    fill\(reply\_box\_id, message\)
    click\(post\_button\_id\)
The corresponding description is given as follows\.
Fill in a reply message and submit it using the reply textbox and post button on a comment thread page\.
This skill represents a common social\-forum interaction, where the agent fills a reply textbox and submits the response\. It separates the task\-specific reply content message from structural webpage arguments, including reply\_box\_id and post\_button\_id\. Together with the GitLab merge\-request comment skill, this example shows that similar fill\-and\-submit procedural patterns can emerge across different domains, such as forum discussion and code collaboration\.
### B\.5 Shipping Carrier Selection
The fifth skill is extracted from an Admin task whose instruction is "Update order \#306 with the UPS tracking number 55591023930"\. After the task is successfully completed, SGDR induces the following skill from the trajectory:
def add\_tracking\_carrier\(add\_tracking\_btn\_id, carrier\_dropdown\_id, carrier\_name\):
    click\(add\_tracking\_btn\_id\)
    select\_option\(carrier\_dropdown\_id, carrier\_name\)
The corresponding description is given as follows\.
Select a shipping carrier from a dropdown after clicking the ’Add Tracking Number’ button in the Shipping Information section on an order details page\.
This skill captures an order\-management operation in the Admin domain\. Unlike the previous examples that mainly rely on fill and click, this skill uses select\_option to choose a shipping carrier from a dropdown menu after expanding the tracking\-number interface\. It separates the task\-specific carrier argument carrier\_name from structural webpage arguments, including add\_tracking\_btn\_id and carrier\_dropdown\_id, showing that SGDR can induce reusable skills over different primitive action types\.
Overall, these case studies show that SGDR learns compact procedural skills across all five WebArena domains\. The induced skills consistently separate webpage\-specific structural arguments from task\-specific content arguments, making them both grounded in the current page state and reusable for future tasks\. They also cover diverse interaction patterns, including form submission, comment posting, product search, wishlist addition, and dropdown selection\.