WebChallenger: A Reliable and Efficient Generalist Web Agent

arXiv cs.CL 06/10/26, 04:00 AM Papers
web-agent autonomous-navigation llm-agent dom-representation memory open-source web-navigation
Summary
WebChallenger is a new web agent framework that achieves strong performance across multiple benchmarks using open-weight models without fine-tuning, by replicating human cognitive advantages through architecture design rather than model scale.
arXiv:2606.10423v1 Announce Type: new Abstract: Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue this gap stems not from insufficient model capability but from agent architectures that fail to replicate three human cognitive advantages: selective attention to relevant page regions, persistent memory of website structure, and procedural fluency with common interaction patterns. We introduce WebChallenger, a web agent framework that addresses each gap through architecture design rather than model scale, built around PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. On this shared substrate we build three mechanisms that mirror the three cognitive advantages: a divide-and-conquer observation pipeline that lets the agent skim section summaries and extract details only from task-relevant regions; a lightweight exploration and memory system that traverses each website once to build a reusable map of pages and element behaviors; and compound action workflows that collapse common multi-step interactions into single agent actions, handling partial state changes automatically. Because all three operate over PageMem, the framework generalizes across websites without site-specific adapters. Using off-the-shelf open-weight models without fine-tuning, our system achieves 56.3% on WebArena, 48.7% on VisualWebArena, 51.0% on Online-Mind2Web, and 70.9% on WorkArena, approaching frontier proprietary systems at a fraction of the cost. Our code is released at https://github.com/jayoohwang1/webchallenger
Original Article
View Cached Full Text
Cached at: 06/10/26, 06:11 AM
# WebChallenger: A Reliable and Efficient Generalist Web Agent
Source: [https://arxiv.org/html/2606.10423](https://arxiv.org/html/2606.10423)
Jayoo Hwang ML Collective jayoohm350@gmail\.com &Xiaowen Zhang longsurf\.ai sean@longsurf\.ai &Vedant Padwal Independent vedantpadwalinfi@gmail\.com

###### Abstract

Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful\. We argue this gap stems not from insufficient model capability but from agent architectures that fail to replicate three human cognitive advantages: selective attention to relevant page regions, persistent memory of website structure, and procedural fluency with common interaction patterns\. We introduce WebChallenger, a web agent framework that addresses each gap through architecture design rather than model scale, built around PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries\. On this shared substrate we build three mechanisms that mirror the three cognitive advantages: a divide\-and\-conquer observation pipeline that lets the agent skim section summaries and extract details only from task\-relevant regions; a lightweight exploration and memory system that traverses each website once to build a reusable map of pages and element behaviors; and compound action workflows that collapse common multi\-step interactions into single agent actions, handling partial state changes automatically\. Because all three operate over PageMem, the framework generalizes across websites without site\-specific adapters\. Using off\-the\-shelf open\-weight models without fine\-tuning, our system achieves 56\.3% on WebArena, 48\.7% on VisualWebArena, 51\.0% on Online\-Mind2Web, and 70\.9% on WorkArena, approaching frontier proprietary systems at a fraction of the cost\. Our code is released at this[URL](https://github.com/jayoohwang1/webchallenger)\.

## 1Introduction

> “I touch the future\. I teach” — Christa McAuliffe

![Refer to caption](https://arxiv.org/html/2606.10423v1/x1.png)Figure 1:Benchmark results\. WebChallenger sets new state\-of\-the\-art performance among agents using open models across four web navigation benchmarks\. Our results were obtained with far less compute than the baselines which either used finetuning or larger models, demonstrating thatscaffolding alone can drastically improve web agent performance\.Autonomous web navigation has long been a goal of AI research\(Doorenboset al\.,[1997](https://arxiv.org/html/2606.10423#bib.bib75)\): the web is one of the most complex interactive environments available, and navigating it autonomously has broad practical implications, from automating repetitive knowledge work to serving as a testbed for general\-purpose agent capabilities\. Recent advances in large language models and vision\-language models have driven rapid progress on computer\-using agents\(Marino and Marasović,[2025](https://arxiv.org/html/2606.10423#bib.bib78)\), yet even the strongest LLM agents remain below human performance on realistic, long\-horizon web tasks\(Janget al\.,[2026](https://arxiv.org/html/2606.10423#bib.bib79); Miyaiet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib80)\)\. Additionally, the best generalist agents rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive work where agents would be desirable\.

This gap echoes Moravec’s paradox\(Moravec,[1988](https://arxiv.org/html/2606.10423#bib.bib76); Su,[2025](https://arxiv.org/html/2606.10423#bib.bib74)\): browsing the web is effortless for humans yet remarkably difficult for AI models that excel at mathematics and code generation\. We argue that this difficulty stems not from a lack of web knowledge in current models, but from a mismatch between how agent frameworks present the web environment and how it needs to be processed\. Specifically, humans bring three cognitive advantages to web navigation that current agent architectures fail to replicate\. First,selective attention: humans focus on relevant regions of a page while ignoring the rest\(Putkonenet al\.,[2023](https://arxiv.org/html/2606.10423#bib.bib73)\), whereas LLM agents ingest entire pages as flat token sequences, diluting relevant information in irrelevant context\. Second,persistent memory: humans memorize the layout and functionality of websites they have used before, while LLM agents approach each session with no prior environmental knowledge\. Third,procedural fluency: humans internalize reusable routines for common interaction patterns \(e\.g\., searching, selecting from a dropdown, filling a form\) that execute as cohesive sequences without deliberate reasoning at each step, while LLM agents must re\-observe and re\-reason over the full page state for every atomic action\.

In this work, we show that these three human advantages can be realized through agent architecture design rather than model scale or training\. Implementing them in a way that generalizes across websites without site\-specific adapters requires a shared abstraction the agent can reason over uniformly\. We introducePageMem, a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries: a representation the agent can skim like a table of contents, expand selectively for detail, and dispatch to specialized workflows by section type\. On this substrate we build three mechanisms that mirror the three cognitive advantages above\.

Adivide\-and\-conquer observationpipeline lets the agent skim PageMem’s section summaries, select task\-relevant regions, and extract details only from those regions, producing information\-dense observations without processing entire pages\.

A lightweightexploration and memorysystem traverses new websites before task execution, assembling a persistent collection of PageMems that records pages, navigation paths, and interactive element behaviors\.

Compound action workflowsimplement site\-agnostic routines for common interaction patterns such as searching, menu selection, and form submission\. Dispatched by section type, these workflows collapse multi\-step processes into single agent actions and automatically surface partial state changes \(such as a dropdown expanding\) without requiring the agent to reprocess the full page\.

Decomposing observation and decision\-making into focused sub\-prompts in this way allows our framework to extract strong performance from small, locally\-run models that would struggle with the monolithic prompts used by most existing agent frameworks\. Using an off\-the\-shelf 32B LLM and a 7B VLM without any fine\-tuning, our system achieves 56\.3% on WebArena\(Zhouet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib2)\), 48\.7% on VisualWebArena\(Kohet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib1)\), 51\.0% on Online\-Mind2Web\(Xueet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib3)\), and 70\.9% on WorkArena\(Drouinet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib4)\)— state\-of\-the\-art results among open\-weight models of comparable scale, and approaching frontier proprietary systems at a fraction of the inference cost\. These results indicate that current LLMs already possess sufficient reasoning ability for many web tasks; what they lack is the right scaffolding around observation, memory, and action to use it effectively\.

## 2Method

### 2\.1Problem Formulation

We frame web navigation as a sequential decision process in which an agent interacts with a web browser to complete a natural\-language task\. A task is a tupleτ=\(I,u0\)\\tau=\(I,u\_\{0\}\)consisting of an instructionIIand a starting URLu0u\_\{0\}, which determines the initial websitew0w\_\{0\}from a set𝒲\\mathcal\{W\}of target websites\. At each timesteptt, the agent receives an observationoto\_\{t\}, maintains a compact historyhth\_\{t\}of prior interactions, and selects an actionata\_\{t\}from a candidate set𝒜t\\mathcal\{A\}\_\{t\}\.

A standard LLM web agent implements this loop asat=π\(ot,ht\)a\_\{t\}=\\pi\(o\_\{t\},h\_\{t\}\): a single model call that maps a raw observation — typically a full accessibility tree or screenshot — and an interaction history to the next atomic browser action\. Our system departs from this template with four novel components\.

##### A structured page representation\.

Rather than exposing the raw DOM or accessibility tree, we introducePageMem, a structured representationppdeterministically constructed from the DOM\. Each PageMem contains an ordered list ofPageSections\{s1,…,sn\}\\\{s\_\{1\},\\ldots,s\_\{n\}\\\}corresponding to semantic regions of the page, and each PageSection contains a set of interactiveElements\. PageSections carry model\-generated summaries alongside DOM\-derived attributes, and serve as the shared substrate on which the observation pipeline, memory, and action workflows all operate\. This abstract substrate is what allows the rest of the system to remain site\-agnostic\. PageMem is defined in detail in §[2\.2](https://arxiv.org/html/2606.10423#S2.SS2)\.

##### Persistent memory from offline exploration\.

Before any task is attempted, an offline exploration phase traverses each websitew∈𝒲w\\in\\mathcal\{W\}and builds aWebsiteMemℳw\\mathcal\{M\}\_\{w\}: a persistent collection of PageMems indexed by URL, together with information about page templates and element behaviors discovered during exploration\. At task start the agent may select a set of bookmarksBτ⊆ℳw0B\_\{\\tau\}\\subseteq\\mathcal\{M\}\_\{w\_\{0\}\}that remain available as navigation targets throughout the task\. WebsiteMem is constructed once per site and reused across all subsequent tasks\. Exploration and memory are detailed in §[2\.3](https://arxiv.org/html/2606.10423#S2.SS3)\.

##### A multi\-stage observation pipeline\.

Rather than producingoto\_\{t\}by serializing the full page, we decompose observation into three stages over the current PageMemptp\_\{t\}: the agent first selects a subset of sections whose summaries appear relevant to the task, then extracts task\-relevant details from the full content of each selected section, and finally synthesizes the extractions into a task\-focused page summaryo^t\\hat\{o\}\_\{t\}\. The pipeline is defined in §[2\.4](https://arxiv.org/html/2606.10423#S2.SS4)\.

##### Compound actions with workflows\.

A timestep in our system corresponds to onehigh\-levelagent action, which may execute multiple browser operations\. Single\-step actions \(clicking a link, navigating to a URL\) cause a page transition and advance the loop directly\. Compound actions \(dropdown selection, form submission, search\) invoke aworkflowω\(at\)\\omega\(a\_\{t\}\)— a fixed sequence of additional LLM sub\-calls and browser operations that handles intermediate partial state changes, such as a dropdown expanding or form fields being filled one at a time, before returning control to the top\-level loop\. The action system is detailed in §[2\.5](https://arxiv.org/html/2606.10423#S2.SS5)\.

#### 2\.1\.1System overview\.

Given a taskτ=\(I,u0\)\\tau=\(I,u\_\{0\}\), the agent retrieves the WebsiteMemℳw0\\mathcal\{M\}\_\{w\_\{0\}\}built during offline exploration and optionally selects bookmarksBτB\_\{\\tau\}\. At each timesteptt, it \(i\) retrieves or constructs the PageMemptp\_\{t\}for the current page; \(ii\) applies the observation pipeline to produceo^t\\hat\{o\}\_\{t\}; and \(iii\) selects an actionat∈𝒜ta\_\{t\}\\in\\mathcal\{A\}\_\{t\}, which executes either as a direct browser operation or through a workflowω\(at\)\\omega\(a\_\{t\}\)\. The loop terminates when the agent selects an end\-task action and verifies completion, or when a step budget is exhausted\. The agent inference algorithm is provided in Appendix[A\.4](https://arxiv.org/html/2606.10423#A1.SS4)\.

![Refer to caption](https://arxiv.org/html/2606.10423v1/figure_overview.png)

Figure 2:Overview of WebChallenger\. \(left\) Each webpage is decomposed along the DOM into sections that correspond to semantic regions of the page\. \(middle\) These sections are indexed by short summaries to form a PageMem, a structured page representation cached in per\-website memory\. The agent skims these summaries and expands only the task\-relevant sections for detailed processing\. \(right\) Specialized multi\-step workflows are executed based on section type\.

### 2\.2PageMem

PageMem is an abstract page representation deterministically constructed from the DOM that serves as the common interface shared by the exploration \(§[2\.3](https://arxiv.org/html/2606.10423#S2.SS3)\), observation \(§[2\.4](https://arxiv.org/html/2606.10423#S2.SS4)\), and action \(§[2\.5](https://arxiv.org/html/2606.10423#S2.SS5)\) components\. It exposes a semantic, chunked view of a page while preserving the selectors needed for direct browser control, allowing higher\-level components to operate on abstract objects without site\-specific adapters\.

##### Hierarchy\.

PageMem is organized in four levels\. A*WebsiteMem*ℳw\\mathcal\{M\}\_\{w\}contains all PageMems and elements encountered on a websiteww\. A*PageMem*ppcorresponds to a single page and holds a title, URL, ordered list of sections\(s1,…,sn\)\(s\_\{1\},\\ldots,s\_\{n\}\), and a page\-level summary\. A*PageSection*sis\_\{i\}represents a subregion of the page \(e\.g\., navigation bar, product listing, review form\) and maps to a sub\-tree of the DOM\. Each section carries DOM\-derived state attributes \(e\.g\., tag, class, bounding box, contained elements\) and variable metadata \(e\.g\., summary, extracted details\)\. An*Element*eerepresents a single interactive widget, and carries DOM attributes to enable selector construction as well as metadata such as the element’s current value, clicked status, and dropdown elements\. The PageMem data structure acts as the central hub where all agent\-related information about a page is stored, flexibly facilitating the implementation of precise context\-engineering for web agents\.

##### Construction\.

PageSections are produced by recursively splitting the DOM tree, terminating at nodes that either fall below a size threshold or match a grouping tag \(form,ul,li,table,section, etc\.\); sibling nodes sharing tag and class are grouped into a single*list section*\. Clickable elements are identified using heuristics adapted from the BrowserUse library\(Müller and Žunic\.,[2024](https://arxiv.org/html/2606.10423#bib.bib53)\)and assigned to their ancestor section\. Finally, we prompt an LLM or VLM to provide a general one\-sentence summary for each section and the overall page\. Normal sections are size\-bounded so their full content fits in a single LLM call; list sections are unbounded and represented at a higher level of abstraction as a sequence of uniform sub\-sections, one per item\. Full details and construction algorithm are given in Appendix[A\.1](https://arxiv.org/html/2606.10423#A1.SS1)\.

### 2\.3Exploration and Memory

Before any task is attempted, an offline exploration phase traverses each target websitew∈𝒲w\\in\\mathcal\{W\}and produces the WebsiteMemℳw\\mathcal\{M\}\_\{w\}used at inference\. Exploration is fully deterministic: it requires no LLM guidance, task demonstrations, or external resources\. Compared to tree\-search methods that expand during execution or skill\-learning approaches that improve only after accumulating task experience, our approach amortizes environmental knowledge upfront and makes it available from the first task at a fixed, one\-time cost\. We describe exploration here and provide details in Appendix[A\.2](https://arxiv.org/html/2606.10423#A1.SS2)\.

##### Traversal\.

Starting from the homepage of a website, we explore all unique clickable elements on the page in order\. If a page contains many repeated elements with the same structure \(such as a list or table of results\), then we only explore elements contained within one item/row of the list/table for efficiency\. We skip exploring elements that have already been explored on the current website\. An element is explored by clicking it and recording the state\-transition it induces\. If clicking results in navigation to anunexplored page, the URL of the new page is added to the exploration frontier\. If clicking an element modifies the state of thecurrent pageby expanding an interface, then we add the newly revealed elements as the clicked element’s dropdown items\. After exploring the homepage, we repeat the above process for the pages that were added to the exploration frontier\. For each page visited, we extract its title and summarize it\. Exploration continues depth\-first until a set maximum search depth is reached\. We also limit the maximum number of elements explored per page, the total number of pages explored, and also set a timeout for each website\.

##### Use at inference\.

ℳw\\mathcal\{M\}\_\{w\}is saved per\-site as JSON and reused across tasks\. This memory is consumed by the agent in a highly token\-efficient manner: rather than loading the full memory into the context window or retrieving large passages of text, only a handful of extra tokens are added per prompt\. At task start the agent may select a small bookmark setBτ⊆ℳwB\_\{\\tau\}\\subseteq\\mathcal\{M\}\_\{w\}that remains available as navigation shortcuts throughout the task, and the agent’s observation space is augmented to provide context about hidden dropdown menu elements\.

We adopt a deliberately minimal memory instantiation in this work in order to efficiently demonstrate how our framework can be used to create structured site\-specific memories for LLM agents\. Our website representation could also be used as a building block to implement more advanced memory approaches such as those explored in Agent Workflow Memory\(Wanget al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib9)\)or SkillWeaver\(Zhenget al\.,[2025a](https://arxiv.org/html/2606.10423#bib.bib20)\)\. We leave this to future work\.

### 2\.4Divide\-and\-Conquer Observation

Large pages can easily exceed the reliable context window of small LLMs, and even within that window, flattening a full accessibility tree into a single prompt dilutes task\-relevant signal among boilerplate\. Our system addresses these issues by decomposing page analysis across multiple focused sub\-prompts, extracting and condensing task\-relevant information into a summarized observationo^t\\hat\{o\}\_\{t\}\.

##### PageMem retrieval and update\.

For the current pageptp\_\{t\}, the agent first checks whether a PageMem already exists inℳw\\mathcal\{M\}\_\{w\}\(either built during exploration or cached from a previous visit in the same task\)\. If so, it is reused; otherwise a fresh PageMem is constructed from the live DOM as described in §[2\.2](https://arxiv.org/html/2606.10423#S2.SS2)\. When a cached PageMem is reused, sections whose elements have changed since the last visit are re\-summarized while unchanged sections retain their cached summaries and extractions, amortizing summarization cost across timesteps and repeat visits\.

##### Section selection\.

The LLM is shown the list of section summaries forptp\_\{t\}along with the task instructionIIand the interaction historyhth\_\{t\}, and returns a subset of sectionsSt⊆\{s1,…,sn\}S\_\{t\}\\subseteq\\\{s\_\{1\},\\ldots,s\_\{n\}\\\}judged relevant to the task\.

##### Detail extraction\.

For each PageSections∈Sts\\in S\_\{t\}, the LLM is prompted to extract task\-relevant information from the section’s full content \(accessibility sub\-tree, page metadata\)\. If a section contains visible images above a minimum size, the URLs and VLM descriptions for the images are included in the extraction prompt\. Whenssis a table or list section, extraction is preceded by an item\-selection step: items are grouped into chunks of maximum sizecc, the LLM selects relevant items from each chunk, and only the selected items are passed to the extraction prompt \(Figure[2](https://arxiv.org/html/2606.10423#S2.F2), middle\-right\)\. This chunked filtering keeps even very long lists or tables within context\. Extractions are cached on the section and reused while the section is unchanged\. Full process is in Appendix[A\.3](https://arxiv.org/html/2606.10423#A1.SS3)\.

##### Summary synthesis\.

Finally, we provide the LLM with the extracted outputs from all selected sections and prompt it to generate a compact page summaryo^t\\hat\{o\}\_\{t\}, which becomes the observation passed to the action module and appended to the historyhth\_\{t\}\. We instruct the LLM to generate a one paragraph long summary, as we find this is sufficient in most cases to capture the task\-relevant page information while also allowing the history representation to remain compact\.

### 2\.5Compound Actions and Workflows

At each timestepttthe action module selects an actionata\_\{t\}from a candidate set𝒜t\\mathcal\{A\}\_\{t\}assembled from the current PageMem, the selected sectionsStS\_\{t\}, and the agent’s memory\. Rather than exposing actions through an LLM tool\-use interface, we present𝒜t\\mathcal\{A\}\_\{t\}as a numbered list and prompt the model to return the index of its chosen action, as we find this to be more reliable than tool use for small open\-weight models\. Our system then automatically executes the appropriate action function based on the action selected\. Many actions are*compound*: their execution invokes a workflowω\(at\)\\omega\(a\_\{t\}\)that combines multiple LLM sub\-calls with browser operations \(implemented using Playwright111[https://playwright\.dev/python/](https://playwright.dev/python/)\) to complete a multi\-step interaction as a single action\.

##### Action selection\.

𝒜t\\mathcal\{A\}\_\{t\}comprises three groups\.*Navigation*actions include previously visited URLs, bookmarks inBτB\_\{\\tau\}, a type\-URL action, and \(when applicable\) switch\-tab and switch\-website\.*Element*actions are gathered from the selected sectionsStS\_\{t\}and a rule\-based pre\-filter removes no\-ops and rarely\-useful actions \(e\.g\., navigating to the current page, clicking a selected radio\)\. The LLM filters the full list of navigation and element actions for those it deems most promising for the task\.*End task*is always available\. The LLM then selects the next actionata\_\{t\}from the filtered candidate set\.

##### Workflows\.

Our design principle for workflows is that action sequences whose intermediate steps produce only*partial*state changes to the current page \(dropdown expansion, search suggestions, field\-by\-field form entry\) are collapsed into a single compound action handled by a workflow, while actions that navigate to a different page are kept as individual agent actions\. This keeps the agent’s decision loop anchored at semantically meaningful transitions rather than at every micro\-interaction\. We describe two representative workflows; the full set is summarized in Appendix[A\.5](https://arxiv.org/html/2606.10423#A1.SS5)\.

*Dropdown selection\.*The workflow clicks the dropdown element, extracts the list of revealed options via a section\-level diff, prompts the LLM to choose one option by index, and clicks the chosen option\.

*Form submission\.*The LLM first selects which fields of the form to fill, is then prompted for each field’s value \(which is entered on the page\), and finally reviews the completed form to either edit further or submit \(Figure[2](https://arxiv.org/html/2606.10423#S2.F2), bottom\-right\)\. The workflow handles field\-specific details internally so that the agent\-level action is a singleSubmitFormstep\.

##### End\-task\.

The end\-task action invokes a one\-time verification workflow: the LLM is prompted with the task and history and asked to either produce a final answer or report that the task is not yet complete, in which case the LLM is re\-prompted to select a different non\-terminating action\.

Table 1:Benchmark success rates \(%\)\. WebChallenger sets new open\-model SOTA on four web navigation benchmarks and performs comparably to agents built on proprietary models, despite using no training\. Best proprietary and open\-model results are bolded\. VWA: VisualWebArena\(Kohet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib1)\), O\-M2W: Online\-Mind2Web\(Xueet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib3)\), WoA: WorkArena\(Drouinet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib4)\)\.- †\\daggerOur VisualWebArena experiments use Qwen3\-VL\-4B\-Instruct in place of Qwen2\.5\-VL\-7B\-Instruct\.

## 3Experiments

We evaluate WebChallenger on four open\-ended web navigation benchmarks to test its performance on a diverse range of capabilities\.WebArena\(Zhouet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib2)\)consists of 812 tasks in 6 simulated environments that are designed to mimic common website types \(e\.g\., forum, wiki\) and uses a combination of both programmatic and LLM evaluation\.VisualWebArena\(Kohet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib1)\)builds on the infrastructure of WebArena, but consists of 910 tasks that require visual reasoning\.Online\-Mind2Web\(Xueet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib3)\)consists of 300 tasks across 136 real\-world websites\. We score our agent using human evaluations for Online\-Mind2Web\.WorkArena\(Drouinet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib4)\)contains 330 enterprise\-related tasks that require agents to navigate complex user interfaces\.

### 3\.1Experimental Setup

We useGLM\-4\-32B\-0414\(Z\.ai,[2025](https://arxiv.org/html/2606.10423#bib.bib56)\)as the LLM controller, andQwen2\.5\-VL\-7B\-Instruct\(Baiet al\.,[2025b](https://arxiv.org/html/2606.10423#bib.bib54)\)as our supplementary vision model for image captioning\. For VisualWebArena, we useQwen3\-VL\-4B\-Instruct\(Baiet al\.,[2025a](https://arxiv.org/html/2606.10423#bib.bib55)\)as the vision model\. For all experiments we use the same agent prompts and sample with temperature 0\. For each benchmark, we first explore the full set of benchmark websites before running inference\. During inference, the agent’s memory is reset between tasks to the post\-exploration state to preserve independence between evaluation samples\. Additional experiment details are provided in Appendix[B](https://arxiv.org/html/2606.10423#A2)\.

### 3\.2Main Results

Baselines\. We compare WebChallenger against strong open model and proprietary baselines for each of our selected benchmarks\.

For proprietary model baselines, we use WALT\(Prabhuet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib22)\), IBM CUGA\(Marreedet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib48)\), OpenAI CUA\(OpenAI,[2025](https://arxiv.org/html/2606.10423#bib.bib52)\), ScribeAgent\(Shenet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib47)\), AgentSymbiotic,\(Zhanget al\.,[2025b](https://arxiv.org/html/2606.10423#bib.bib49)\), AgentOccam\-Judge\(Yanget al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib65)\), WebPilot\(Zhanget al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib50)\), SkillWeaver\(Zhenget al\.,[2025a](https://arxiv.org/html/2606.10423#bib.bib20)\), and Agent Workflow Memory\(Wanget al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib9)\)\.

For open\-model baselines, we use Agent\-as\-Annotators\(Lù and Reddy,[2026](https://arxiv.org/html/2606.10423#bib.bib67)\), Mobile\-Agent\-v3\.5\(Xuet al\.,[2026](https://arxiv.org/html/2606.10423#bib.bib68)\), WebDreamer\(Guet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib63)\), Fara\-7B\(Awadallahet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib58)\), Learn\-by\-Interact\(Suet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib59)\), AgentTrek\(Xuet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib60)\), Go\-Browse\(Gandhi and Neubig,[2025](https://arxiv.org/html/2606.10423#bib.bib64)\), AutoWebGLM\(Laiet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib61)\), TTI\(Shenet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib62)\), and Tree Search\(Kohet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib66)\)\.

GenericAgent results are taken from the official BrowserGym leaderboard\(ServiceNow,[2025](https://arxiv.org/html/2606.10423#bib.bib51)\)\. All other baseline results in Table[1](https://arxiv.org/html/2606.10423#S2.T1)are taken from their original reports\.

Results\. As shown in Table[1](https://arxiv.org/html/2606.10423#S2.T1), WebChallenger sets new state\-of\-the\-art results among open\-model agents on all four benchmarks despite using no fine\-tuning\. On WebArena, our 56\.3% exceeds the strongest fine\-tuned open\-model baseline \(Mobile\-Agent\-v3\.5, 48\.4%\) by 7\.9 points and surpasses ScribeAgent \(53\.0%, GPT\-4o planner\)\. On VisualWebArena, 48\.7% outperforms all open\-model baselines and trails only WALT \(52\.9%, GPT\-5\)\. On WorkArena, 70\.9% lands 20 points above the next\-best zero\-shot open model and exceeds both Claude 3\.5 Sonnet \(56\.4%\) and GPT\-4o \(45\.5%\) backbones\. 51\.0% on Online\-Mind2Web shows that our framework generalizes by exploiting structural patterns shared across the web rather than site\-specific adaptations\. These results demonstrate that careful architectural scaffolding can close most of the gap between small open\-weight models and frontier proprietary systems on long\-horizon web tasks, and that a single configuration generalizes consistently across a wide range of tasks and environments\.

Table 2:Component ablations on WebArena\-lite \(165 tasks\)\. Per\-site task counts: Shopping \(n=46n=46\), Reddit \(n=21n=21\), GitLab \(n=32n=32\), Maps \(n=31n=31\), CMS \(n=35n=35\)\.Δ\\Deltais the change in average success rate relative to the full system\.
### 3\.3Analysis

We run additional experiments on the 165\-task WebArena\-lite subset\(Liuet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib5)\)to examine component contributions, compute usage, and backbone sensitivity; our system’s lite score \(58\.858\.8\) tracks the full WebArena score \(56\.356\.3\) closely\.

##### Component ablations\.

We separately remove each of the three architectural components \(Table[2](https://arxiv.org/html/2606.10423#S3.T2)\)\.Remove memorydisables bookmarks, dropdown information, and pre\-cached section summaries; PageMem is still used at inference time but constructed from scratch for each task\.Remove compound actionsrestricts the agent to single basic actions \(ClickElement,EnterInput,SelectOption,UploadFile, plus navigation\), eliminating the search, dropdown, and form\-filling workflows\.Remove observation pipelinereplaces section selection and detail extraction with a single prompt containing the full ax\-tree and all available actions, with history reduced to a list of prior actions\.

Among the three components, removing the observation pipeline causes the largest accuracy drop \(−17\.6\-17\.6points\), followed by compound actions \(−9\.7\-9\.7\) and memory \(−7\.6\-7\.6\)\. Compound action removal has its largest effect on CMS \(−20\.0\-20\.0\), as CMS involves interactions with complex interfaces such as forms and filtering menus\. On Reddit, removing memory has no effect \(71\.471\.4in both conditions\), suggesting GLM\-4\-32B navigates Reddit reliably without pre\-cached information\. Maps performance is largely unaffected by memory and compound actions, as the Maps environment is focused on a single interface that doesn’t benefit from those components\.

Table 3:Token and step usage for the GLM\-4\-32B component ablations\. Tokens/Prompt is the average input token count per LLM call\. We count compound actions as one step\.Table 4:Backbone model comparison on WebArena\-lite\(Liuet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib5)\)\. The bottom row uses the minimal GenericAgent harness from BrowserGym\(Chezelleset al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib46)\)with the same GLM\-4\-32B model used in our system to isolate the contribution of our harness\.
##### Token and step efficiency\.

Removing the observation pipeline reduces total tokens \(47\.047\.0M→\\rightarrow36\.036\.0M\) but raises average prompt size4\.75×4\.75\\times\(1850→87931850\\rightarrow 8793tokens\) and step count from7\.27\.2to11\.2611\.26\(Table[3](https://arxiv.org/html/2606.10423#S3.T3)\)\. Our multi\-stage observation processing decomposes large difficult prompts into several smaller but easier prompts, trading inference compute for performance\. Compound actions significantly improve agent efficiency: removal causes total tokens to rise to64\.964\.9M and steps to9\.859\.85, since interactions that previously executed within a single workflow now require a separate observation and decision cycle per atomic action\.

##### Backbone model comparison\.

We swap the GLM\-4\-32B backbone for GPT\-5 and GPT\-4o\-mini, and additionally evaluate GLM\-4\-32B alone in the minimal GenericAgent harness to isolate the architecture’s contribution \(Table[4](https://arxiv.org/html/2606.10423#S3.T4)\)\. GPT\-5 reaches68\.7%68\.7\\%,9\.99\.9points above GLM\-4\-32B in the same framework\. GPT\-4o\-mini reaches46\.7%46\.7\\%, indicating our framework retains strong performance even with weaker backbones\. GLM\-4\-32B in the GenericAgent harness scores19\.4%19\.4\\%, against58\.8%58\.8\\%in our framework, a39\.439\.4\-point improvement from system architecture alone\.

## 4Related Work

Agent Memory\. A growing body of work equips LLM web agents with external memory by accumulating insights from task trajectories\(Wanget al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib9); Ouyanget al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib10); Sarchet al\.,[2025a](https://arxiv.org/html/2606.10423#bib.bib11); Panget al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib12); Liuet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib13); Nekoeiet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib8); Fuet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib6); Chenet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib7); Chenget al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib14); Suet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib59)\)\. Our memory takes a complementary route: a deterministic exploration procedure efficiently produces a structured site map with no task experience, demonstrations, or documentation required, making it applicable to any website out of the box\.

Web Action Space\. Several works extend web agent action spaces beyond click and type by introducing higher\-level programmatic skills\(Songet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib18); Wanget al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib19); Zhenget al\.,[2025a](https://arxiv.org/html/2606.10423#bib.bib20); Heet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib21); Prabhuet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib22); Yuet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib23); Wanget al\.,[2026](https://arxiv.org/html/2606.10423#bib.bib15); Zhonget al\.,[2026](https://arxiv.org/html/2606.10423#bib.bib16)\)\. These approaches typically learn site\-specific code, whereas our compound workflows operate over PageMem’s abstract elements and sections and generalize across sites with no per\-site adaptation\. We also depart from the standard tool\-calling interface in favor of a numbered\-list action format\.

Observation Refinement\. Web agents observe their environment through text\(Guret al\.,[2018](https://arxiv.org/html/2606.10423#bib.bib24); Liet al\.,[2023](https://arxiv.org/html/2606.10423#bib.bib25); Kimet al\.,[2023](https://arxiv.org/html/2606.10423#bib.bib26)\), screenshots\(Shawet al\.,[2023](https://arxiv.org/html/2606.10423#bib.bib27); Honget al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib28); Gouet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib29); Pahujaet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib30); Heet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib31); Zhenget al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib32); Vermaet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib33)\), or both\(Furutaet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib34)\)\. All such modalities are token\-heavy and information\-sparse, motivating refinement strategies\. Text\-based agents prune irrelevant HTML elements\(Guret al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib35); Denget al\.,[2023](https://arxiv.org/html/2606.10423#bib.bib36); Kilet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib37);lù2024weblinxrealworldwebsitenavigation; Leeet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib39); Abuelsaadet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib40); Kerbouaet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib41)\), while vision\-based agents focus attention on specific screen regions\(Sarchet al\.,[2025b](https://arxiv.org/html/2606.10423#bib.bib42); Singhet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib43); Luoet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib44); Parket al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib45)\)\. We apply region\-based focus to a hybrid text\-vision agent by splitting pages along DOM structure, which preserves the semantic grouping authored into the page better than pixel\-space cropping\.Feuillade–Montixi \([2026](https://arxiv.org/html/2606.10423#bib.bib17)\)explores a similar DOM\-based approach\. More broadly, our pipeline echoes a line of work on decomposing long\-context tasks into focused sub\-prompts\(Zhanget al\.,[2026](https://arxiv.org/html/2606.10423#bib.bib69); Chenet al\.,[2023](https://arxiv.org/html/2606.10423#bib.bib70); Jayalathet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib71); Leeet al\.,[2024](https://arxiv.org/html/2606.10423#bib.bib72)\)\.

## 5Conclusion

WebChallenger closes much of the gap between small open\-weight models and frontier proprietary systems on long\-horizon web navigation\. We argue current LLMs already possess sufficient intelligence for many common web tasks, but standard frameworks fail to scaffold that intelligence with the selective attention, persistent memory, and procedural fluency humans rely on\. We supply each through a divide\-and\-conquer observation pipeline, an offline exploration and memory system, and compound action workflows\. These components are implemented on top of PageMem, a shared page representation that generalizes across websites without site\-specific adapters\. Using small, general\-purpose models without fine\-tuning, our system sets new state\-of\-the\-art results among open\-weight agents on four diverse web agent benchmarks\.

## Acknowledgements

We thank the ML Collective community for their support, discussions, and feedback\.

## References

- T\. Abuelsaad, D\. Akkil, P\. Dey, A\. Jagmohan, A\. Vempaty, and R\. Kokku \(2024\)Agent\-e: from autonomous web navigation to foundational design principles in agentic systems\.External Links:2407\.13032,[Link](https://arxiv.org/abs/2407.13032)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- Anthropic \(2025\)Mitigating the risk of prompt injections in browser use\.Note:[https://www\.anthropic\.com/news/prompt\-injection\-defenses](https://www.anthropic.com/news/prompt-injection-defenses)Accessed: 2026\-05\-11Cited by:[Appendix D](https://arxiv.org/html/2606.10423#A4.p1.1)\.
- A\. Awadallah, Y\. Lara, R\. Magazine, H\. Mozannar, A\. Nambi, Y\. Pandya, A\. Rajeswaran, C\. Rosset, A\. Taymanov, V\. Vineet, S\. Whitehead, and A\. Zhao \(2025\)Fara\-7b: an efficient agentic model for computer use\.External Links:2511\.19663,[Link](https://arxiv.org/abs/2511.19663)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p3.1)\.
- S\. Bai, Y\. Cai, R\. Chen, K\. Chen, X\. Chen, Z\. Cheng, L\. Deng, W\. Ding, C\. Gao, C\. Ge, W\. Ge, Z\. Guo, Q\. Huang, J\. Huang, F\. Huang, B\. Hui, S\. Jiang, Z\. Li, M\. Li, M\. Li, K\. Li, Z\. Lin, J\. Lin, X\. Liu, J\. Liu, C\. Liu, Y\. Liu, D\. Liu, S\. Liu, D\. Lu, R\. Luo, C\. Lv, R\. Men, L\. Meng, X\. Ren, X\. Ren, S\. Song, Y\. Sun, J\. Tang, J\. Tu, J\. Wan, P\. Wang, P\. Wang, Q\. Wang, Y\. Wang, T\. Xie, Y\. Xu, H\. Xu, J\. Xu, Z\. Yang, M\. Yang, J\. Yang, A\. Yang, B\. Yu, F\. Zhang, H\. Zhang, X\. Zhang, B\. Zheng, H\. Zhong, J\. Zhou, F\. Zhou, J\. Zhou, Y\. Zhu, and K\. Zhu \(2025a\)Qwen3\-vl technical report\.External Links:2511\.21631,[Link](https://arxiv.org/abs/2511.21631)Cited by:[§3\.1](https://arxiv.org/html/2606.10423#S3.SS1.p1.1)\.
- S\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge, S\. Song, K\. Dang, P\. Wang, S\. Wang, J\. Tang, H\. Zhong, Y\. Zhu, M\. Yang, Z\. Li, J\. Wan, P\. Wang, W\. Ding, Z\. Fu, Y\. Xu, J\. Ye, X\. Zhang, T\. Xie, Z\. Cheng, H\. Zhang, Z\. Yang, H\. Xu, and J\. Lin \(2025b\)Qwen2\.5\-vl technical report\.External Links:2502\.13923,[Link](https://arxiv.org/abs/2502.13923)Cited by:[§3\.1](https://arxiv.org/html/2606.10423#S3.SS1.p1.1)\.
- H\. Chen, R\. Pasunuru, J\. Weston, and A\. Celikyilmaz \(2023\)Walking down the memory maze: beyond context limit through interactive reading\.External Links:2310\.05029,[Link](https://arxiv.org/abs/2310.05029)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- M\. Chen, Y\. Li, Y\. Yang, S\. Yu, B\. Lin, and X\. He \(2024\)AutoManual: constructing instruction manuals by llm agents via interactive environmental learning\.External Links:2405\.16247,[Link](https://arxiv.org/abs/2405.16247)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p1.1)\.
- J\. Cheng, A\. Kumar, R\. Lal, R\. Rajasekaran, H\. Ramezani, O\. Z\. Khan, O\. Rokhlenko, S\. Chiu\-Webster, G\. Hua, and H\. Amiri \(2025\)WebATLAS: an llm agent with experience\-driven memory and action simulation\.External Links:2510\.22732,[Link](https://arxiv.org/abs/2510.22732)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p1.1)\.
- T\. L\. S\. D\. Chezelles, M\. Gasse, A\. Drouin, M\. Caccia, L\. Boisvert, M\. Thakkar, T\. Marty, R\. Assouel, S\. O\. Shayegan, L\. K\. Jang, X\. H\. Lù, O\. Yoran, D\. Kong, F\. F\. Xu, S\. Reddy, Q\. Cappart, G\. Neubig, R\. Salakhutdinov, N\. Chapados, and A\. Lacoste \(2025\)The browsergym ecosystem for web agent research\.External Links:2412\.05467,[Link](https://arxiv.org/abs/2412.05467)Cited by:[Table 4](https://arxiv.org/html/2606.10423#S3.T4)\.
- X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su \(2023\)Mind2Web: towards a generalist agent for the web\.External Links:2306\.06070,[Link](https://arxiv.org/abs/2306.06070)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- R\. B\. Doorenbos, O\. Etzioni, and D\. S\. Weld \(1997\)A scalable comparison\-shopping agent for the world\-wide web\.InProceedings of the First International Conference on Autonomous Agents,AGENTS ’97,New York, NY, USA,pp\. 39–48\.External Links:ISBN 0897918770,[Link](https://doi.org/10.1145/267658.267666),[Document](https://dx.doi.org/10.1145/267658.267666)Cited by:[§1](https://arxiv.org/html/2606.10423#S1.p2.1)\.
- A\. Drouin, M\. Gasse, M\. Caccia, I\. H\. Laradji, M\. D\. Verme, T\. Marty, L\. Boisvert, M\. Thakkar, Q\. Cappart, D\. Vazquez, N\. Chapados, and A\. Lacoste \(2024\)WorkArena: how capable are web agents at solving common knowledge work tasks?\.External Links:2403\.07718,[Link](https://arxiv.org/abs/2403.07718)Cited by:[§1](https://arxiv.org/html/2606.10423#S1.p8.1),[Table 1](https://arxiv.org/html/2606.10423#S2.T1),[§3](https://arxiv.org/html/2606.10423#S3.p1.1)\.
- Q\. Feuillade–Montixi \(2026\)WebFurl: a browser\-use AI agent with compressed unfoldable HTML representation for high token efficiency\.GitHub\.Note:[https://github\.com/WeaveMindAI/Webfurl](https://github.com/WeaveMindAI/Webfurl)GitHub repositoryCited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- Y\. Fu, D\. Kim, J\. Kim, S\. Sohn, L\. Logeswaran, K\. Bae, and H\. Lee \(2024\)AutoGuide: automated generation and selection of context\-aware guidelines for large language model agents\.External Links:2403\.08978,[Link](https://arxiv.org/abs/2403.08978)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p1.1)\.
- H\. Furuta, K\. Lee, O\. Nachum, Y\. Matsuo, A\. Faust, S\. S\. Gu, and I\. Gur \(2024\)Multimodal web navigation with instruction\-finetuned foundation models\.External Links:2305\.11854,[Link](https://arxiv.org/abs/2305.11854)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- A\. Gandhi and G\. Neubig \(2025\)Go\-browse: training web agents with structured exploration\.External Links:2506\.03533,[Link](https://arxiv.org/abs/2506.03533)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p3.1)\.
- B\. Gou, R\. Wang, B\. Zheng, Y\. Xie, C\. Chang, Y\. Shu, H\. Sun, and Y\. Su \(2025\)Navigating the digital world as humans do: universal visual grounding for gui agents\.External Links:2410\.05243,[Link](https://arxiv.org/abs/2410.05243)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- Y\. Gu, K\. Zhang, Y\. Ning, B\. Zheng, B\. Gou, T\. Xue, C\. Chang, S\. Srivastava, Y\. Xie, P\. Qi, H\. Sun, and Y\. Su \(2025\)Is your llm secretly a world model of the internet? model\-based planning for web agents\.External Links:2411\.06559,[Link](https://arxiv.org/abs/2411.06559)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p3.1)\.
- I\. Gur, H\. Furuta, A\. Huang, M\. Safdari, Y\. Matsuo, D\. Eck, and A\. Faust \(2024\)A real\-world webagent with planning, long context understanding, and program synthesis\.External Links:2307\.12856,[Link](https://arxiv.org/abs/2307.12856)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- I\. Gur, U\. Rueckert, A\. Faust, and D\. Hakkani\-Tur \(2018\)Learning to navigate the web\.External Links:1812\.09195,[Link](https://arxiv.org/abs/1812.09195)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- H\. He, W\. Yao, K\. Ma, W\. Yu, Y\. Dai, H\. Zhang, Z\. Lan, and D\. Yu \(2024\)WebVoyager: building an end\-to\-end web agent with large multimodal models\.External Links:2401\.13919,[Link](https://arxiv.org/abs/2401.13919)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- K\. He, Z\. Wang, C\. Zhuang, and J\. Gu \(2025\)Recon\-act: a self\-evolving multi\-agent browser\-use system via web reconnaissance, tool generation, and task execution\.External Links:2509\.21072,[Link](https://arxiv.org/abs/2509.21072)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p2.1)\.
- W\. Hong, W\. Wang, Q\. Lv, J\. Xu, W\. Yu, J\. Ji, Y\. Wang, Z\. Wang, Y\. Zhang, J\. Li, B\. Xu, Y\. Dong, M\. Ding, and J\. Tang \(2024\)CogAgent: a visual language model for gui agents\.External Links:2312\.08914,[Link](https://arxiv.org/abs/2312.08914)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- L\. K\. Jang, J\. Y\. Koh, D\. Fried, and R\. Salakhutdinov \(2026\)Odysseys: benchmarking web agents on realistic long horizon tasks\.External Links:2604\.24964,[Link](https://arxiv.org/abs/2604.24964)Cited by:[§1](https://arxiv.org/html/2606.10423#S1.p2.1)\.
- D\. Jayalath, J\. B\. Wendt, N\. Monath, S\. Tata, and B\. Gunel \(2025\)PRISM: efficient long\-range reasoning with short\-context llms\.External Links:2412\.18914,[Link](https://arxiv.org/abs/2412.18914)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- I\. Kerboua, S\. O\. Shayegan, M\. Thakkar, X\. H\. Lù, L\. Boisvert, M\. Caccia, J\. Espinas, A\. Aussem, V\. Eglin, and A\. Lacoste \(2025\)FocusAgent: simple yet effective ways of trimming the large context of web agents\.External Links:2510\.03204,[Link](https://arxiv.org/abs/2510.03204)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- J\. Kil, C\. H\. Song, B\. Zheng, X\. Deng, Y\. Su, and W\. Chao \(2024\)Dual\-view visual contextualization for web navigation\.External Links:2402\.04476,[Link](https://arxiv.org/abs/2402.04476)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- G\. Kim, P\. Baldi, and S\. McAleer \(2023\)Language models can solve computer tasks\.External Links:2303\.17491,[Link](https://arxiv.org/abs/2303.17491)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- J\. Y\. Koh, R\. Lo, L\. Jang, V\. Duvvur, M\. C\. Lim, P\. Huang, G\. Neubig, S\. Zhou, R\. Salakhutdinov, and D\. Fried \(2024\)VisualWebArena: evaluating multimodal agents on realistic visual web tasks\.External Links:2401\.13649,[Link](https://arxiv.org/abs/2401.13649)Cited by:[§1](https://arxiv.org/html/2606.10423#S1.p8.1),[Table 1](https://arxiv.org/html/2606.10423#S2.T1),[§3](https://arxiv.org/html/2606.10423#S3.p1.1)\.
- J\. Y\. Koh, S\. McAleer, D\. Fried, and R\. Salakhutdinov \(2025\)Tree search for language model agents\.External Links:2407\.01476,[Link](https://arxiv.org/abs/2407.01476)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p3.1)\.
- T\. Kuntz, A\. Duzan, H\. Zhao, F\. Croce, Z\. Kolter, N\. Flammarion, and M\. Andriushchenko \(2025\)OS\-harm: a benchmark for measuring safety of computer use agents\.External Links:2506\.14866,[Link](https://arxiv.org/abs/2506.14866)Cited by:[Appendix D](https://arxiv.org/html/2606.10423#A4.p1.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)Efficient memory management for large language model serving with pagedattention\.External Links:2309\.06180,[Link](https://arxiv.org/abs/2309.06180)Cited by:[§B\.1](https://arxiv.org/html/2606.10423#A2.SS1.p1.4)\.
- H\. Lai, X\. Liu, I\. L\. Iong, S\. Yao, Y\. Chen, P\. Shen, H\. Yu, H\. Zhang, X\. Zhang, Y\. Dong, and J\. Tang \(2024\)AutoWebGLM: a large language model\-based web navigating agent\.External Links:2404\.03648,[Link](https://arxiv.org/abs/2404.03648)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p3.1)\.
- D\. Lee, J\. Lee, K\. Kim, J\. Tack, J\. Shin, Y\. W\. Teh, and K\. Lee \(2025\)Learning to contextualize web pages for enhanced decision making by llm agents\.External Links:2503\.10689,[Link](https://arxiv.org/abs/2503.10689)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- K\. Lee, X\. Chen, H\. Furuta, J\. Canny, and I\. Fischer \(2024\)A human\-inspired reading agent with gist memory of very long contexts\.External Links:2402\.09727,[Link](https://arxiv.org/abs/2402.09727)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- T\. Li, G\. Li, Z\. Deng, B\. Wang, and Y\. Li \(2023\)A zero\-shot language agent for computer control with structured reflection\.External Links:2310\.08740,[Link](https://arxiv.org/abs/2310.08740)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- Z\. Liao, J\. Jones, L\. Jiang, Y\. Ning, E\. Fosler\-Lussier, Y\. Su, Z\. Lin, and H\. Sun \(2026\)RedTeamCUA: realistic adversarial testing of computer\-use agents in hybrid web\-os environments\.External Links:2505\.21936,[Link](https://arxiv.org/abs/2505.21936)Cited by:[Appendix D](https://arxiv.org/html/2606.10423#A4.p1.1)\.
- G\. Liu, S\. Geng, S\. Li, H\. Cui, S\. Zhang, X\. Liu, and T\. Liu \(2025\)WebCoach: self\-evolving web agents with cross\-session memory guidance\.External Links:2511\.12997,[Link](https://arxiv.org/abs/2511.12997)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p1.1)\.
- X\. Liu, T\. Zhang, Y\. Gu, I\. L\. Iong, Y\. Xu, X\. Song, S\. Zhang, H\. Lai, X\. Liu, H\. Zhao, J\. Sun, X\. Yang, Y\. Yang, Z\. Qi, S\. Yao, X\. Sun, S\. Cheng, Q\. Zheng, H\. Yu, H\. Zhang, W\. Hong, M\. Ding, L\. Pan, X\. Gu, A\. Zeng, Z\. Du, C\. H\. Song, Y\. Su, Y\. Dong, and J\. Tang \(2024\)VisualAgentBench: towards large multimodal models as visual foundation agents\.External Links:2408\.06327,[Link](https://arxiv.org/abs/2408.06327)Cited by:[§3\.3](https://arxiv.org/html/2606.10423#S3.SS3.p1.2),[Table 4](https://arxiv.org/html/2606.10423#S3.T4)\.
- X\. H\. Lù and S\. Reddy \(2026\)Structured distillation of web agent capabilities enables generalization\.External Links:2604\.07776,[Link](https://arxiv.org/abs/2604.07776)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p3.1)\.
- T\. Luo, L\. Logeswaran, J\. Johnson, and H\. Lee \(2025\)Visual test\-time scaling for gui agent grounding\.External Links:2505\.00684,[Link](https://arxiv.org/abs/2505.00684)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- K\. Marino and A\. Marasović \(2025\)Computer use survey: a visual survey of computer use agents\.External Links:[Link](https://kennethmarino.com/computeruse/computeruse.html)Cited by:[§1](https://arxiv.org/html/2606.10423#S1.p2.1)\.
- S\. Marreed, A\. Oved, A\. Yaeli, S\. Shlomov, I\. Levy, O\. Akrabi, A\. Sela, A\. Adi, and N\. Mashkif \(2025\)Towards enterprise\-ready computer using generalist agent\.External Links:2503\.01861,[Link](https://arxiv.org/abs/2503.01861)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p2.1)\.
- A\. Miyai, Z\. Zhao, K\. Egashira, A\. Sato, T\. Sunada, S\. Onohara, H\. Yamanishi, M\. Toyooka, K\. Nishina, R\. Maeda, K\. Aizawa, and T\. Yamasaki \(2025\)WebChoreArena: evaluating web browsing agents on realistic tedious web tasks\.External Links:2506\.01952,[Link](https://arxiv.org/abs/2506.01952)Cited by:[§1](https://arxiv.org/html/2606.10423#S1.p2.1)\.
- H\. Moravec \(1988\)Mind children: the future of robot and human intelligence\.Harvard University Press,Cambridge, MA\.Cited by:[§1](https://arxiv.org/html/2606.10423#S1.p3.1)\.
- M\. Müller and G\. Žunic\. \(2024\)Browser use: enable ai to control your browser\.External Links:[Link](https://github.com/browser-use/browser-use)Cited by:[§A\.1\.3](https://arxiv.org/html/2606.10423#A1.SS1.SSS3.Px2.p1.3),[§2\.2](https://arxiv.org/html/2606.10423#S2.SS2.SSS0.Px2.p1.1)\.
- H\. Nekoei, A\. Jaiswal, P\. Bechard, O\. Shliazhko, O\. M\. Ayala, M\. Reymond, M\. Caccia, A\. Drouin, S\. Chandar, and A\. Lacoste \(2025\)Just\-in\-time episodic feedback hinter: leveraging offline knowledge to improve llm agents adaptation\.External Links:2510\.04373,[Link](https://arxiv.org/abs/2510.04373)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p1.1)\.
- OpenAI \(2025\)Introducing operator\.External Links:[Link](https://openai.com/index/introducing-operator/)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p2.1)\.
- S\. Ouyang, J\. Yan, I\. Hsu, Y\. Chen, K\. Jiang, Z\. Wang, R\. Han, L\. T\. Le, S\. Daruki, X\. Tang, V\. Tirumalashetty, G\. Lee, M\. Rofouei, H\. Lin, J\. Han, C\. Lee, and T\. Pfister \(2025\)ReasoningBank: scaling agent self\-evolving with reasoning memory\.External Links:2509\.25140,[Link](https://arxiv.org/abs/2509.25140)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p1.1)\.
- V\. Pahuja, Y\. Lu, C\. Rosset, B\. Gou, A\. Mitra, S\. Whitehead, Y\. Su, and A\. Awadallah \(2025\)Explorer: scaling exploration\-driven web trajectory synthesis for multimodal web agents\.External Links:2502\.11357,[Link](https://arxiv.org/abs/2502.11357)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- X\. Pang, R\. Hong, H\. Zhang, and C\. Zhang \(2025\)Assimilation and accommodation: task\-adaptive hierarchical abstraction for solving web tasks\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 14000–14014\.External Links:[Link](https://aclanthology.org/2025.findings-acl.720/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.720),ISBN 979\-8\-89176\-256\-5Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p1.1)\.
- J\. Park, P\. Tang, S\. Das, S\. Appalaraju, K\. Y\. Singh, R\. Manmatha, and S\. Ghadar \(2025\)R\-vlm: region\-aware vision language model for precise gui grounding\.External Links:2507\.05673,[Link](https://arxiv.org/abs/2507.05673)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- V\. Prabhu, Y\. Dai, M\. Fernandez, J\. Gu, K\. Ramakrishnan, Y\. Luo, S\. Savarese, C\. Xiong, J\. Li, Z\. Chen, and R\. Xu \(2025\)WALT: web agents that learn tools\.External Links:2510\.01524,[Link](https://arxiv.org/abs/2510.01524)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p2.1),[§4](https://arxiv.org/html/2606.10423#S4.p2.1)\.
- A\. Putkonen, A\. Nioche, M\. Laine, C\. Kuuramo, and A\. Oulasvirta \(2023\)Fragmented visual attention in web browsing: weibull analysis of item visit times\.InAdvances in Information Retrieval,J\. Kamps, L\. Goeuriot, F\. Crestani, M\. Maistro, H\. Joho, B\. Davis, C\. Gurrin, U\. Kruschwitz, and A\. Caputo \(Eds\.\),Cham,pp\. 62–78\.External Links:ISBN 978\-3\-031\-28238\-6Cited by:[§1](https://arxiv.org/html/2606.10423#S1.p3.1)\.
- G\. Sarch, L\. Jang, M\. J\. Tarr, W\. W\. Cohen, K\. Marino, and K\. Fragkiadaki \(2025a\)VLM agents generate their own memories: distilling experience into embodied programs of thought\.External Links:2406\.14596,[Link](https://arxiv.org/abs/2406.14596)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p1.1)\.
- G\. Sarch, S\. Saha, N\. Khandelwal, A\. Jain, M\. J\. Tarr, A\. Kumar, and K\. Fragkiadaki \(2025b\)Grounded reinforcement learning for visual reasoning\.External Links:2505\.23678,[Link](https://arxiv.org/abs/2505.23678)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- ServiceNow \(2025\)BrowserGym leaderboard\.External Links:[Link](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p4.1)\.
- P\. Shaw, M\. Joshi, J\. Cohan, J\. Berant, P\. Pasupat, H\. Hu, U\. Khandelwal, K\. Lee, and K\. Toutanova \(2023\)From pixels to ui actions: learning to follow instructions via graphical user interfaces\.External Links:2306\.00245,[Link](https://arxiv.org/abs/2306.00245)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- J\. Shen, H\. Bai, L\. Zhang, Y\. Zhou, A\. Setlur, S\. Tong, D\. Caples, N\. Jiang, T\. Zhang, A\. Talwalkar, and A\. Kumar \(2025\)Thinking vs\. doing: agents that reason by scaling test\-time interaction\.External Links:2506\.07976,[Link](https://arxiv.org/abs/2506.07976)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p3.1)\.
- J\. Shen, A\. Jain, Z\. Xiao, I\. Amlekar, M\. Hadji, A\. Podolny, and A\. Talwalkar \(2024\)ScribeAgent: towards specialized web agents using production\-scale workflow data\.External Links:2411\.15004,[Link](https://arxiv.org/abs/2411.15004)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p2.1)\.
- K\. Singh, S\. Singh, and M\. Khanna \(2025\)TRISHUL: towards region identification and screen hierarchy understanding for large vlm based gui agents\.External Links:2502\.08226,[Link](https://arxiv.org/abs/2502.08226)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- Y\. Song, F\. Xu, S\. Zhou, and G\. Neubig \(2025\)Beyond browsing: api\-based web agents\.External Links:2410\.16464,[Link](https://arxiv.org/abs/2410.16464)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p2.1)\.
- H\. Su, R\. Sun, J\. Yoon, P\. Yin, T\. Yu, and S\. Ö\. Arık \(2025\)Learn\-by\-interact: a data\-centric framework for self\-adaptive agents in realistic environments\.External Links:2501\.10893,[Link](https://arxiv.org/abs/2501.10893)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p3.1),[§4](https://arxiv.org/html/2606.10423#S4.p1.1)\.
- Y\. Su \(2025\)Computer use: modern moravec’s paradox\.Note:Yu’s SubstackBlog post, accessed May 7, 2026External Links:[Link](https://yusu.substack.com/p/computer-use-modern-moravecs-paradox)Cited by:[§1](https://arxiv.org/html/2606.10423#S1.p3.1)\.
- A\. D\. Tur, N\. Meade, X\. H\. Lù, A\. Zambrano, A\. Patel, E\. Durmus, S\. Gella, K\. Stańczak, and S\. Reddy \(2025\)SafeArena: evaluating the safety of autonomous web agents\.External Links:2503\.04957,[Link](https://arxiv.org/abs/2503.04957)Cited by:[Appendix D](https://arxiv.org/html/2606.10423#A4.p1.1)\.
- G\. Verma, R\. Kaur, N\. Srishankar, Z\. Zeng, T\. Balch, and M\. Veloso \(2024\)AdaptAgent: adapting multimodal web agents with few\-shot learning from human demonstrations\.External Links:2411\.13451,[Link](https://arxiv.org/abs/2411.13451)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- Z\. Wang, Q\. Wu, X\. Zhang, C\. Zhang, W\. Yao, F\. E\. Faisal, B\. Peng, S\. Qin, S\. Nath, Q\. Lin, C\. Bansal, D\. Zhang, S\. Rajmohan, J\. Gao, and H\. Yao \(2026\)WebXSkill: skill learning for autonomous web agents\.External Links:2604\.13318,[Link](https://arxiv.org/abs/2604.13318)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p2.1)\.
- Z\. Z\. Wang, A\. Gandhi, G\. Neubig, and D\. Fried \(2025\)Inducing programmatic skills for agentic tasks\.External Links:2504\.06821,[Link](https://arxiv.org/abs/2504.06821)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p2.1)\.
- Z\. Z\. Wang, J\. Mao, D\. Fried, and G\. Neubig \(2024\)Agent workflow memory\.External Links:2409\.07429,[Link](https://arxiv.org/abs/2409.07429)Cited by:[§2\.3](https://arxiv.org/html/2606.10423#S2.SS3.SSS0.Px2.p2.1),[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p2.1),[§4](https://arxiv.org/html/2606.10423#S4.p1.1)\.
- C\. H\. Wu, R\. Shah, J\. Y\. Koh, R\. Salakhutdinov, D\. Fried, and A\. Raghunathan \(2025\)Dissecting adversarial robustness of multimodal lm agents\.External Links:2406\.12814,[Link](https://arxiv.org/abs/2406.12814)Cited by:[Appendix D](https://arxiv.org/html/2606.10423#A4.p1.1)\.
- X\. Wu, G\. Hong, Y\. Chen, M\. Liu, F\. Jin, X\. Pan, J\. Dai, and B\. Liu \(2026\)When bots take the bait: exposing and mitigating the emerging social engineering attack in web automation agent\.External Links:2601\.07263,[Link](https://arxiv.org/abs/2601.07263)Cited by:[Appendix D](https://arxiv.org/html/2606.10423#A4.p1.1)\.
- Z\. Xiang, L\. Zheng, Y\. Li, J\. Hong, Q\. Li, H\. Xie, J\. Zhang, Z\. Xiong, C\. Xie, C\. Yang, D\. Song, and B\. Li \(2025\)GuardAgent: safeguard llm agents by a guard agent via knowledge\-enabled reasoning\.External Links:2406\.09187,[Link](https://arxiv.org/abs/2406.09187)Cited by:[Appendix D](https://arxiv.org/html/2606.10423#A4.p1.1)\.
- H\. Xu, X\. Zhang, H\. Liu, J\. Wang, Z\. Zhu, S\. Zhou, X\. Hu, F\. Gao, J\. Cao, Z\. Wang, Z\. Chen, J\. Liao, Q\. Zheng, J\. Zeng, Z\. Xu, S\. Bai, J\. Lin, J\. Zhou, and M\. Yan \(2026\)Mobile\-agent\-v3\.5: multi\-platform fundamental gui agents\.External Links:2602\.16855,[Link](https://arxiv.org/abs/2602.16855)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p3.1)\.
- Y\. Xu, D\. Lu, Z\. Shen, J\. Wang, Z\. Wang, Y\. Mao, C\. Xiong, and T\. Yu \(2025\)AgentTrek: agent trajectory synthesis via guiding replay with web tutorials\.External Links:2412\.09605,[Link](https://arxiv.org/abs/2412.09605)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p3.1)\.
- T\. Xue, W\. Qi, T\. Shi, C\. H\. Song, B\. Gou, D\. Song, H\. Sun, and Y\. Su \(2025\)An illusion of progress? assessing the current state of web agents\.External Links:2504\.01382,[Link](https://arxiv.org/abs/2504.01382)Cited by:[§1](https://arxiv.org/html/2606.10423#S1.p8.1),[Table 1](https://arxiv.org/html/2606.10423#S2.T1),[§3](https://arxiv.org/html/2606.10423#S3.p1.1)\.
- K\. Yang, Y\. Liu, S\. Chaudhary, R\. Fakoor, P\. Chaudhari, G\. Karypis, and H\. Rangwala \(2025\)AgentOccam: a simple yet strong baseline for llm\-based web agents\.External Links:2410\.13825,[Link](https://arxiv.org/abs/2410.13825)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p2.1)\.
- Z\. Ying, Y\. Shao, J\. Gan, G\. Xu, W\. Zhang, Q\. Zou, J\. Shi, Z\. Yin, M\. Zhang, A\. Liu, and X\. Liu \(2026\)SecureWebArena: a holistic security evaluation benchmark for lvlm\-based web agents\.External Links:2510\.10073,[Link](https://arxiv.org/abs/2510.10073)Cited by:[Appendix D](https://arxiv.org/html/2606.10423#A4.p1.1)\.
- S\. Yu, G\. Li, W\. Shi, and P\. Qi \(2025\)PolySkill: learning generalizable skills through polymorphic abstraction\.External Links:2510\.15863,[Link](https://arxiv.org/abs/2510.15863)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p2.1)\.
- Z\.ai \(2025\)GLM\-4\-32b\-0414\.External Links:[Link](https://huggingface.co/zai-org/GLM-4-32B-0414)Cited by:[§3\.1](https://arxiv.org/html/2606.10423#S3.SS1.p1.1)\.
- A\. L\. Zhang, T\. Kraska, and O\. Khattab \(2026\)Recursive language models\.External Links:2512\.24601,[Link](https://arxiv.org/abs/2512.24601)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- K\. Zhang, M\. Tenenholtz, K\. Polley, J\. Ma, D\. Yarats, and N\. Li \(2025a\)BrowseSafe: understanding and preventing prompt injection within ai browser agents\.External Links:2511\.20597,[Link](https://arxiv.org/abs/2511.20597)Cited by:[Appendix D](https://arxiv.org/html/2606.10423#A4.p1.1)\.
- R\. Zhang, M\. Qiu, Z\. Tan, M\. Zhang, V\. Lu, J\. Peng, K\. Xu, L\. Z\. Agudelo, P\. Qian, and T\. Chen \(2025b\)Symbiotic cooperation for web agents: harnessing complementary strengths of large and small llms\.External Links:2502\.07942,[Link](https://arxiv.org/abs/2502.07942)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p2.1)\.
- Y\. Zhang, T\. Yu, and D\. Yang \(2025c\)Attacking vision\-language computer agents via pop\-ups\.External Links:2411\.02391,[Link](https://arxiv.org/abs/2411.02391)Cited by:[Appendix D](https://arxiv.org/html/2606.10423#A4.p1.1)\.
- Y\. Zhang, Z\. Ma, Y\. Ma, Z\. Han, Y\. Wu, and V\. Tresp \(2024\)WebPilot: a versatile and autonomous multi\-agent system for web task execution with strategic exploration\.External Links:2408\.15978,[Link](https://arxiv.org/abs/2408.15978)Cited by:[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p2.1)\.
- B\. Zheng, M\. Y\. Fatemi, X\. Jin, Z\. Z\. Wang, A\. Gandhi, Y\. Song, Y\. Gu, J\. Srinivasa, G\. Liu, G\. Neubig, and Y\. Su \(2025a\)SkillWeaver: web agents can self\-improve by discovering and honing skills\.External Links:2504\.07079,[Link](https://arxiv.org/abs/2504.07079)Cited by:[§2\.3](https://arxiv.org/html/2606.10423#S2.SS3.SSS0.Px2.p2.1),[§3\.2](https://arxiv.org/html/2606.10423#S3.SS2.p2.1),[§4](https://arxiv.org/html/2606.10423#S4.p2.1)\.
- B\. Zheng, B\. Gou, J\. Kil, H\. Sun, and Y\. Su \(2024\)GPT\-4v\(ision\) is a generalist web agent, if grounded\.External Links:2401\.01614,[Link](https://arxiv.org/abs/2401.01614)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p3.1)\.
- B\. Zheng, Z\. Liao, S\. Salisbury, Z\. Liu, M\. Lin, Q\. Zheng, Z\. Wang, X\. Deng, D\. Song, H\. Sun, and Y\. Su \(2025b\)WebGuard: building a generalizable guardrail for web agents\.External Links:2507\.14293,[Link](https://arxiv.org/abs/2507.14293)Cited by:[Appendix D](https://arxiv.org/html/2606.10423#A4.p1.1)\.
- H\. Zhong, F\. Faisal, L\. França, T\. Leesatapornwongsa, A\. Szekeres, K\. Rong, and S\. Nath \(2026\)ActionEngine: from reactive to programmatic gui agents via state machine memory\.External Links:2602\.20502,[Link](https://arxiv.org/abs/2602.20502)Cited by:[§4](https://arxiv.org/html/2606.10423#S4.p2.1)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig \(2024\)WebArena: a realistic web environment for building autonomous agents\.External Links:2307\.13854,[Link](https://arxiv.org/abs/2307.13854)Cited by:[§1](https://arxiv.org/html/2606.10423#S1.p8.1),[§3](https://arxiv.org/html/2606.10423#S3.p1.1)\.

## Appendix AImplementation Details

### A\.1PageMem

We provide the details of our memory structure and describe the PageMem construction process\. Each PageMem is built in two stages:DividePage\(Algorithm[1](https://arxiv.org/html/2606.10423#alg1)\) recursively partitions the live DOM tree into an ordered list of empty PageSections forming the structural skeleton of the page, andUpdatePageMem\(Algorithm[2](https://arxiv.org/html/2606.10423#alg2)\) then populates each section with its interactable elements and an LLM\-generated summary, and finally generates the page\-level summary\.

#### A\.1\.1Memory Structure

We provide a detailed formalization of the memory hierarchy from §[2\.2](https://arxiv.org/html/2606.10423#S2.SS2)\. Throughout, we useσ\\sigmafor model\-generated summaries,α\\alphafor immutable DOM\-derived attributes, andμ\\mufor mutable agent\-side state accumulated during exploration and task execution\. Descriptions ofα\\alphaandμ\\mubelow give representative fields rather than exhaustive listings; full field schemas are documented in our released code\.

##### WebsiteMem\.

A WebsiteMem for websitewwis a tuple

ℳw=\(Pw,Tw,Ew\),\\mathcal\{M\}\_\{w\}=\(P\_\{w\},\\,T\_\{w\},\\,E\_\{w\}\),wherePwP\_\{w\}is a mapping from URL to PageMem, collecting all concrete pages encountered onww;TwT\_\{w\}is a list of list\-page templates, each itself a PageMem, against which newly visited pages are matched by structural comparison \(see[A\.2](https://arxiv.org/html/2606.10423#A1.SS2.SSS0.Px2)\); andEwE\_\{w\}is the set of all elements encountered onww, used to deduplicate elements during exploration\.

##### PageMem\.

A PageMem is a tuple

p=\(up,np,σp,Sp,μp\),p=\(u\_\{p\},\\,n\_\{p\},\\,\\sigma\_\{p\},\\,S\_\{p\},\\,\\mu\_\{p\}\),whereupu\_\{p\}is the page URL;npn\_\{p\}is the page title;σp\\sigma\_\{p\}is a VLM\-generated page\-level summary;Sp=\(s1,…,s\|Sp\|\)S\_\{p\}=\(s\_\{1\},\\ldots,s\_\{\|S\_\{p\}\|\}\)is an ordered list of PageSections; andμp\\mu\_\{p\}holds page\-level agent state \(e\.g\., information extracted by agent, the agent’s past interaction history on the page\)\.

##### PageSection\.

A PageSection is a tuple

s=\(σs,Es,Ss′,αs,μs\),s=\(\\sigma\_\{s\},\\,E\_\{s\},\\,S^\{\\prime\}\_\{s\},\\,\\alpha\_\{s\},\\,\\mu\_\{s\}\),whereσs\\sigma\_\{s\}is a VLM\-generated section summary;Es=\(e1,…,eks\)E\_\{s\}=\(e\_\{1\},\\ldots,e\_\{k\_\{s\}\}\)is an ordered list of Elements contained in the section;Ss′=\(s1′,…,sms′\)S^\{\\prime\}\_\{s\}=\(s^\{\\prime\}\_\{1\},\\ldots,s^\{\\prime\}\_\{m\_\{s\}\}\)is an ordered list of sub\-sections, empty for normal sections and containing one sub\-section per item for list sections;αs\\alpha\_\{s\}holds DOM\-derived attributes used for creating selectors \(e\.g\., id, tag, class, DOM\-subtree handle, bounding box\) andμs\\mu\_\{s\}holds mutable agent state \(e\.g\., task\-relevant extractions, VLM\-generated image descriptions, a staleness flag indicating whether the DOM subtree has changed sinceσs\\sigma\_\{s\}was last computed\)\.

##### Element\.

An Element is a tuple

e=\(αe,Ee′,μe\),e=\(\\alpha\_\{e\},\\,E^\{\\prime\}\_\{e\},\\,\\mu\_\{e\}\),whereαe\\alpha\_\{e\}holds DOM\-derived attributes \(e\.g\., id, tag, class, role, label, type\);Ee′=\(e1′,…,ele′\)E^\{\\prime\}\_\{e\}=\(e^\{\\prime\}\_\{1\},\\ldots,e^\{\\prime\}\_\{l\_\{e\}\}\)is an ordered list of dropdown items \(themselves Elements\), which is empty for non\-dropdown elements; andμe\\mu\_\{e\}holds mutable agent state \(e\.g\., the element’s current input value, a flag for whether the agent has clicked the element during the current task\)\.

#### A\.1\.2Page division\.

We provide the pseudocode for page splitting in Algorithm[1](https://arxiv.org/html/2606.10423#alg1)\.DividePagetakes the root of the DOM tree and returns a PageMem whose ordered section list forms the structural skeleton of the page\. The procedure recursively descends the DOM, terminating at nodes that form a meaningful grouping, either semantically \(by tag\), visually \(by size\), or structurally \(by repetition of siblings\)\. Groups of≥4\\geq 4consecutive siblings sharing tag and class are merged into a single list section node before recursion; list sections are always terminal and are never further subdivided\.

Algorithm 1DividePage1:DOM root node

rr
2:PageMem

pp
3:

L←\[\]L\\leftarrow\[\\,\]
4:Split\(

rr,

LL\)

5:

p←NewPageMem\(\)p\\leftarrow\\textsc\{NewPageMem\}\(\\,\)
6:

p\.u←CurrentURL\(\)p\.u\\leftarrow\\textsc\{CurrentURL\}\(\\,\)
7:

p\.n←ExtractTitle\(\)p\.n\\leftarrow\\textsc\{ExtractTitle\}\(\\,\)
8:

p\.S←Lp\.S\\leftarrow L
9:return

pp
10:

11:procedureSplit\(node

vv, list

LL\)

12:ifIsTerminal\(

vv\)then

13:appendMakeSection\(

vv\) to

LL
14:else

15:

C←GroupSiblings\(v\.children\)C\\leftarrow\\textsc\{GroupSiblings\}\(v\.\\text\{children\}\)
16:for

c∈Cc\\in Cdo

17:Split\(

cc,

LL\)

18:

19:functionIsTerminal\(node

vv\)

20:return

v\.isListSection∨v\.tag∈𝒯group∨¬Oversized\(v\)v\.\\text\{isListSection\}\\ \\lor\\ v\.\\text\{tag\}\\in\\mathcal\{T\}\_\{\\text\{group\}\}\\ \\lor\\ \\lnot\\,\\textsc\{Oversized\}\(v\)
21:

22:functionOversized\(node

vv\)

23:return

\(v\.h\>900∧v\.w\>320\)∨\(v\.h\>500∧v\.w\>800\)\(v\.h\>900\\land v\.w\>320\)\\ \\lor\\ \(v\.h\>500\\land v\.w\>800\)
24:

25:functionGroupSiblings\(children

\(c1,…,ck\)\(c\_\{1\},\\ldots,c\_\{k\}\)\)

26:scan

\(c1,…,ck\)\(c\_\{1\},\\ldots,c\_\{k\}\)for groups of consecutive siblings that share tag and class

27:replace each group of length

≥4\\geq 4with a single list\-section node containing the group

28:returnthe resulting \(shortened\) sequence

##### Parameters\.

The grouping tag set is

𝒯group=\{ol,ul,table,form,fieldset,aside,article,details,p,img,embed,code,group,nav,header,footer\}\.\\begin\{split\}\\mathcal\{T\}\_\{\\text\{group\}\}=\\\{&\\texttt\{ol\},\\texttt\{ul\},\\texttt\{table\},\\texttt\{form\},\\texttt\{fieldset\},\\texttt\{aside\},\\texttt\{article\},\\\\ &\\texttt\{details\},\\texttt\{p\},\\texttt\{img\},\\texttt\{embed\},\\texttt\{code\},\\texttt\{group\},\\texttt\{nav\},\\texttt\{header\},\\texttt\{footer\}\\\}\.\\end\{split\}Dimensionsv\.hv\.handv\.wv\.winOversizedare the node’s rendered bounding\-box height and width in CSS pixels, obtained from the browser’s layout engine\.

#### A\.1\.3PageMem update\.

UpdatePageMemrefreshes a PageMem to reflect the current live page state and is invoked at the start of every observation step\. It is also the routine that populates a freshly divided PageMem with its initial elements and summaries\.

UpdateSection\(i\) queries the browser for the section’s current set of interactable elements, \(ii\) computes the added / removed / modified diffΔ=\(Δ\+,Δ−,Δ∼\)\\Delta=\(\\Delta^\{\+\},\\Delta^\{\-\},\\Delta^\{\\sim\}\)against the section’s previous element list, \(iii\) re\-summarizes if the section has no summary yet or if the structural change is large enough, and \(iv\) returnsΔ\\Deltaso that the caller \(an observation step or a workflow\) can respond to partial state changes\. This is the machinery behind, e\.g\., the dropdown workflow of §[2\.5](https://arxiv.org/html/2606.10423#S2.SS5), which reads the revealed options directly fromΔ\+\\Delta^\{\+\}\. List sections are handled specially: their content is unbounded and repetitive, so instead of enumerating elements or tracking a diff at the list level, we always re\-summarize from a screenshot and return an empty diff\. Element\-level tracking happens only on the per\-item sub\-sections, and only after list\-item selection in the observation pipeline \(§[2\.4](https://arxiv.org/html/2606.10423#S2.SS4)\)\.

Algorithm 2UpdatePageMemandUpdateSection1:procedureUpdatePageMem\(PageMem

pp\)

2:for

s∈p\.Ss\\in p\.Sdo

3:UpdateSection\(

ss\)

4:if

p\.σpp\.\\sigma\_\{p\}is undefinedthen

5:

p\.σp←VLMSummarizePage\(p\)p\.\\sigma\_\{p\}\\leftarrow\\textsc\{VLMSummarizePage\}\(p\)
6:

7:procedureUpdateSection\(PageSection

ss\)

8:if

ssis a list sectionthen⊳\\trianglerightlist sections only get summarized

9:

s\.σs←VLMSummarizeSection\(s\)s\.\\sigma\_\{s\}\\leftarrow\\textsc\{VLMSummarizeSection\}\(s\)
10:return

\(∅,∅,∅\)\(\\emptyset,\\emptyset,\\emptyset\)
11:

Enew←GetElements\(s\)E\_\{\\text\{new\}\}\\leftarrow\\textsc\{GetElements\}\(s\)
12:

Δ←Diff\(s\.Es,Enew\)\\Delta\\leftarrow\\textsc\{Diff\}\(s\.E\_\{s\},\\,E\_\{\\text\{new\}\}\)⊳\\trianglerightΔ\+\\Delta^\{\+\}: added,Δ−\\Delta^\{\-\}: removed,Δ∼\\Delta^\{\\sim\}: input\-value changes

13:

s\.Es←Enews\.E\_\{s\}\\leftarrow E\_\{\\text\{new\}\}
14:if

s\.σss\.\\sigma\_\{s\}is undefinedthen

15:

s\.σs←VLMSummarizeSection\(s\)s\.\\sigma\_\{s\}\\leftarrow\\textsc\{VLMSummarizeSection\}\(s\)
16:elseif

\|Δ\+\|\+\|Δ−\|≥3\|\\Delta^\{\+\}\|\+\|\\Delta^\{\-\}\|\\geq 3then⊳\\trianglerightre\-summarize if≥3\\geq 3elements added/removed

17:

s\.σs←VLMSummarizeSection\(s\)s\.\\sigma\_\{s\}\\leftarrow\\textsc\{VLMSummarizeSection\}\(s\)
18:return

Δ\\Delta

##### Element population\.

GetElementsproduces the element list for a section via a three\-step pipeline\. A helper first resolves the section to a Playwright locator\. The locator’s descendants are then filtered by the clickable predicateIsClickabledefined below\. Finally, each surviving DOM node is passed to an Element constructor that reads its DOM attributes intoαe\\alpha\_\{e\}\.

##### Clickable predicate\.

A DOM nodevvis considered interactable iff it passes a visibility\-and\-accessibility gate*and*satisfies at least one positive signal\. The gate excludes nodes that are not rendered, carry thedisabledattribute, or havearia\-hidden="true"\. The positive signals are any of: a tag in an interactable tag set, a DOM event\-listener attribute in a listener set, an ARIA role in an interactable role set, or a computedcursorstyle ofpointer\. Formally,

IsClickable\(v\)≡\(v\.tag∈𝒯clk∨v\.attrs∩ℒclk≠∅∨v\.role∈ℛclk∨v\.cursor=pointer\)∧Accessible\(v\),\\begin\{split\}\\textsc\{IsClickable\}\(v\)\\ \\equiv\\ \\big\(v\.\\text\{tag\}\\in\\mathcal\{T\}\_\{\\text\{clk\}\}\\ \\lor\\ v\.\\text\{attrs\}\\cap\\mathcal\{L\}\_\{\\text\{clk\}\}\\neq\\emptyset\\ \\lor\\ v\.\\text\{role\}\\in\\mathcal\{R\}\_\{\\text\{clk\}\}\\ \\lor\\ v\.\\text\{cursor\}=\\texttt\{pointer\}\\big\)\\\\ \\land\\ \\textsc\{Accessible\}\(v\),\\end\{split\}with the sets

𝒯clk\\displaystyle\\mathcal\{T\}\_\{\\text\{clk\}\}=\{button,a,input,select,textarea,details,summary,option\},\\displaystyle=\\\{\\texttt\{button\},\\texttt\{a\},\\texttt\{input\},\\texttt\{select\},\\texttt\{textarea\},\\texttt\{details\},\\texttt\{summary\},\\texttt\{option\}\\\},ℒclk\\displaystyle\\mathcal\{L\}\_\{\\text\{clk\}\}=\{onclick,onmousedown,onmouseup,onkeydown,onkeyup\},\\displaystyle=\\\{\\texttt\{onclick\},\\texttt\{onmousedown\},\\texttt\{onmouseup\},\\texttt\{onkeydown\},\\texttt\{onkeyup\}\\\},ℛclk\\displaystyle\\mathcal\{R\}\_\{\\text\{clk\}\}=\{button,link,menuitem,option,radio,checkbox,tab,\\displaystyle=\\\{\\texttt\{button\},\\texttt\{link\},\\texttt\{menuitem\},\\texttt\{option\},\\texttt\{radio\},\\texttt\{checkbox\},\\texttt\{tab\},textbox,combobox,slider,spinbutton,search,searchbox\}\.\\displaystyle\\qquad\\texttt\{textbox\},\\texttt\{combobox\},\\texttt\{slider\},\\texttt\{spinbutton\},\\texttt\{search\},\\texttt\{searchbox\}\\\}\.These heuristics are adapted from BrowserUse\(Müller and Žunic\.,[2024](https://arxiv.org/html/2606.10423#bib.bib53)\)\.

### A\.2Exploration

We provide the details of the offline exploration procedure that builds the WebsiteMemℳw\\mathcal\{M\}\_\{w\}used at inference\. Exploration is a deterministic depth\-first traversal of a website’s pages and clickable elements, deduplicated against the running setEwE\_\{w\}of all elements seen on the site, with state restored between element clicks by reloading the pre\-click URL\.ExplorePage\(Algorithm[3](https://arxiv.org/html/2606.10423#alg3)\) is the recursive driver that visits one page at a time; it delegates toIteratePageandExploreElement\(Algorithm[4](https://arxiv.org/html/2606.10423#alg4)\) for the element\-level work\. Exploration is launched per website by initializing an emptyℳw\\mathcal\{M\}\_\{w\}and a URL\-only PageMem stub at a chosen starting URL — the homepage in our experiments — and invokingExplorePageat the configured maximum depth\. We abstract over the per\-page element budget, total page budget, and per\-website timeout in the pseudocode for clarity; these limits act as additional early\-return checks throughout, and their values are reported per benchmark in Appendix[B](https://arxiv.org/html/2606.10423#A2)\.

Algorithm 3ExplorePage1:procedureExplorePage\(PageMem stub

pp, depth

dd\)

2:if

p\.u∈ℳw\.Pwp\.u\\in\\mathcal\{M\}\_\{w\}\.P\_\{w\}then⊳\\trianglerightURL already explored on this site

3:return

4:Navigate\(

p\.up\.u\)

5:Populate

ppwith sections, elements and summaries

6:add

ppto

ℳw\.Pw\\mathcal\{M\}\_\{w\}\.P\_\{w\}keyed by

p\.up\.u
7:ifMatchesTemplate\(

p,ℳw\.Twp,\\,\\mathcal\{M\}\_\{w\}\.T\_\{w\}\)then⊳\\trianglerightalready explored page with same structure

8:return

9:elseifHasListSection\(

pp\)

∨p\.is\_list\_item\\lor\\ p\.\\text\{is\\\_list\\\_item\}then

10:add

ppto

ℳw\.Tw\\mathcal\{M\}\_\{w\}\.T\_\{w\}
11:

N←N\\leftarrowIteratePage\(

pp\)⊳\\trianglerightexplore elements on page

12:if

d=0d=0orbudget exhaustedthenreturn

13:for

p′∈Np^\{\\prime\}\\in Ndo

14:ExplorePage\(

p′,d−1p^\{\\prime\},\\,d\-1\)

##### Page\-level traversal\.

ExplorePagetakes a stub PageMem carrying a target URL and the remaining recursion depth\. It deduplicates against URLs already inℳw\\mathcal\{M\}\_\{w\}, navigates the browser to the page, runsDividePageon the freshly\-loaded DOM to construct the full PageMem, andUpdatePageMemto populate elements and summaries\. The newly\-built PageMem is then registered inℳw\\mathcal\{M\}\_\{w\}\. If its section structure matches an existing template inTwT\_\{w\}, the page is treated as a known\-shape duplicate and not iterated, since further iteration would re\-cover element behaviors already learned from the matching template; otherwise, if the page contains a list section or itsis\_list\_itemflag is set, the PageMem is added toTwT\_\{w\}as a new template\.IteratePageis then called on the page, returning a list of stub PageMems for newly\-discovered URLs, which the procedure recursively explores at depthd−1d\-1\.

##### Template matching\.

MatchesTemplatecompares the candidate PageMem against each template inTwT\_\{w\}\. Two PageMems match when they have the same number of sections and each pair of corresponding sections is structurally equivalent under the DOM\-derived attributes inαs\\alpha\_\{s\}\(tag, class, and other selector\-defining attributes\)\. BecauseDividePageis deterministic on the DOM, structurally equivalent pages reliably yield identical section sequences in practice, so exact structural equality is sufficient as a match criterion without needing a similarity threshold\. Matching is checked only againstTwT\_\{w\}rather than all ofPwP\_\{w\}, both for efficiency and because non\-template pages are by definition idiosyncratic and not expected to recur\.

##### Element\-level traversal\.

IteratePagewalks the page’s elements in document order\. For elements not contained in any list section, it skips those already in the global element setEwE\_\{w\}and registers each new element inEwE\_\{w\}\. List sections are handled separately: rather than iterating every list item \(which would redundantly re\-cover elements with structurally identical neighbors\), the procedure invokesIterateListItem, which iterates the elements contained in a single list\-item container using the same per\-element logic\. Stubs returned from list\-item exploration are tagged withis\_list\_itemso that the recursiveExplorePagecall can promote the resulting pages to templates\.

ExploreElementreturns the newly\-discovered URL stub\(s\) reached by clicking the element\. It first applies a static skip filter \(described below\) that rules out elements unsafe or unhelpful to click\. It then records the pre\-click URL, clicks the element, and inspects the result\. If the URL has not changed, the post\-click diff in page state is computed\. Any newly\-revealed elements \(Δ\+\\Delta^\{\+\}\) are recorded as the clicked element’sdropdown\_elementsand recursively explored using the sameExploreElementroutine\. If the URL has changed and points to a same\-site page not yet inℳw\\mathcal\{M\}\_\{w\}, a URL\-only stub is created and returned\. Finally, the browser is reloaded to the pre\-click URL to restore page state for the next iteration\.

Algorithm 4IteratePageandExploreElement1:procedureIteratePage\(PageMem

pp\)

2:

N←\[\]N\\leftarrow\[\\,\]
3:forelement

eein

ppnot contained in a list sectiondo

4:ifper\-page element budget exhaustedthenbreak

5:if

e∈ℳw\.Ewe\\in\\mathcal\{M\}\_\{w\}\.E\_\{w\}thencontinue

6:add

eeto

ℳw\.Ew\\mathcal\{M\}\_\{w\}\.E\_\{w\}
7:

N←N\+N\\leftarrow N\\,\+ExploreElement\(

ee\)

8:forlist section

s∈p\.Ss\\in p\.Sdo

9:

Nℓ←N\_\{\\ell\}\\leftarrowIterateListItem\(

ss\)⊳\\trianglerightexplore elements in one list\-item container

10:for

p′∈Nℓp^\{\\prime\}\\in N\_\{\\ell\}do

p′\.is\_list\_item←truep^\{\\prime\}\.\\text\{is\\\_list\\\_item\}\\leftarrow\\text\{true\}
11:

N←N\+NℓN\\leftarrow N\\,\+\\,N\_\{\\ell\}
12:return

NN
13:

14:procedureExploreElement\(Element

ee\)

15:

N←\[\]N\\leftarrow\[\\,\]
16:ifShouldSkip\(

ee\)thenreturn

NN
17:

upre←u\_\{\\text\{pre\}\}\\leftarrowCurrentURL\( \)

18:Click\(

ee\)

19:

upost←u\_\{\\text\{post\}\}\\leftarrowCurrentURL\( \)

20:if

upost=upreu\_\{\\text\{post\}\}=u\_\{\\text\{pre\}\}then

21:Identify newly revealed elements

Δ\+\\Delta^\{\+\}
22:if

Δ\+≠∅\\Delta^\{\+\}\\neq\\emptysetthen⊳\\trianglerightclick revealed new elements \(dropdown opened\)

23:

e\.dropdown\_elements←Δ\+e\.\\text\{dropdown\\\_elements\}\\leftarrow\\Delta^\{\+\}
24:for

e′∈Δ\+e^\{\\prime\}\\in\\Delta^\{\+\}do

25:

N←N\+N\\leftarrow N\\,\+ExploreElement\(

e′e^\{\\prime\}\)⊳\\trianglerightexplore dropdown elements

26:elseif

upostu\_\{\\text\{post\}\}is on the same site

wwand

upost∉ℳw\.Pwu\_\{\\text\{post\}\}\\notin\\mathcal\{M\}\_\{w\}\.P\_\{w\}then

27:create stub

pnewp\_\{\\text\{new\}\}with

pnew\.u←upostp\_\{\\text\{new\}\}\.u\\leftarrow u\_\{\\text\{post\}\}
28:

N←N\+\[pnew\]N\\leftarrow N\\,\+\\,\[p\_\{\\text\{new\}\}\]
29:Navigate\(

upreu\_\{\\text\{pre\}\}\)⊳\\trianglerightrestore pre\-click state for the next iteration

30:return

NN

##### Skip filter\.

ShouldSkipexcludes four categories of elements before any click is issued: \(i\) off\-site links, identified by anhrefpointing to a domain outsideww; \(ii\) authentication links such as login and sign\-up, identified by keyword matching against the link text and URL path; \(iii\)tel:,mailto:, andjavascript:print\(…\)links, identified by thehrefscheme; and \(iv\)*modifier*buttons that could mutate persistent site state, identified either by the form attributetype="submit"or by keyword matching of the element’s accessible text against destructive terms \(delete,remove,submit,save, etc\.\)\.

### A\.3Observation Pipeline

We provide the details of the detail\-extraction and summarization stages of the observation pipeline \(§[2\.4](https://arxiv.org/html/2606.10423#S2.SS4)\)\.AnalyzePage\(Algorithm[5](https://arxiv.org/html/2606.10423#alg5)\) acts as the main driver: given the set of sectionsStS\_\{t\}selected as relevant, it extracts task\-relevant information from each and synthesizes a page summary\. List sections are routed throughSelectListItems\(Algorithm[6](https://arxiv.org/html/2606.10423#alg6)\), which uses chunked LLM selection with explicit early termination to keep arbitrarily long lists within context\.

##### Per\-section detail extraction\.

For each selected section, a helperFormatproduces the*details string*consumed by the extraction LLM\. For a normal section, the details string contains the section’s accessibility subtree together with the URLs and VLM\-generated descriptions of any images in the section above the minimum size threshold \(50 x 50 pixels\)\. For a list section,SelectListItemsis invoked first to choose a subset of items, and the details string is the per\-item content formatted as a numbered list, with each entry containing the same accessibility\-subtree\-plus\-image content as a normal section\. The extraction callLLMExtractDetails\(Prompt[E\.1](https://arxiv.org/html/2606.10423#A5.SS1)\) caches its output on the PageSection together with the details string used to produce it; on a subsequent call with an identical details string, the cached extraction is returned without an LLM call\.

Algorithm 5AnalyzePage1:procedureAnalyzePage\(PageMem

pp, selected sections

StS\_\{t\}\)

2:

X←\[\]X\\leftarrow\[\\,\]⊳\\trianglerightper\-section extraction strings, in selection order

3:for

s∈Sts\\in S\_\{t\}do

4:if

ssis a list sectionthen

5:SelectListItems\(

ss\)⊳\\trianglerightpopulatesss’s sub\-sections ands\.Ess\.E\_\{s\}

6:

D←D\\leftarrowFormat\(

ss\)

7:

x←x\\leftarrowLLMExtractDetails\(

DD\)⊳\\trianglerightreturns cached value ifDDunchanged

8:append"<idx\> <tag\> <class\>: "

\+x\+\\ xto

XX
9:

p\.task\_summary←p\.\\text\{task\\\_summary\}\\leftarrowLLMSummarizePage\(

XX\)⊳\\trianglerightregenerated every call

10:return

p\.task\_summaryp\.\\text\{task\\\_summary\}

##### List item selection\.

A list section can contain hundreds or thousands of items, far exceeding what fits in a single LLM context\.SelectListItemsaddresses this by chunking the items sequentially into fixed\-size groups and prompting the LLM to select relevant items chunk by chunk \(Prompt[E\.1\.1](https://arxiv.org/html/2606.10423#A5.SS1.SSS1)\)\. After each chunk, a separate LLM call is issued \(Prompt[E\.1\.1](https://arxiv.org/html/2606.10423#A5.SS1.SSS1)\) that sees the indices already searched, the items already selected, and the remaining entries, and decides whether to terminate early — this avoids paying the cost of scanning the full list when the relevant items have already been found \(e\.g\., the top few results of a sorted list\)\. After selection, the procedure rebuilds the list section’s sub\-sections from the selected items, populating each viaUpdateSection, and overwrites the list section’s element listEsE\_\{s\}with only the elements from selected items\. The original full element set is not retained: re\-selection on a later observation step rebuilds the sub\-sections from scratch from the live page state\.

Algorithm 6SelectListItems1:procedureSelectListItems\(list section

ss\)

2:

I⋆←\[\]I^\{\\star\}\\leftarrow\[\\,\]⊳\\trianglerightindices of items selected so far

3:partition the items of

ssinto sequential chunks

\(C1,C2,…\)\(C\_\{1\},C\_\{2\},\\ldots\)of fixed size

cc
4:for

k=1,2,…k=1,2,\\ldotsdo

5:

I⋆←I⋆\+I^\{\\star\}\\leftarrow I^\{\\star\}\\,\+LLMSelectItems\(

Ck,I⋆C\_\{k\},\\,I^\{\\star\}\)

6:ifLLMCheckDone\(

I⋆I^\{\\star\}, indices searched so far, remaining items\)then

7:break

8:rebuild

s\.Ss′s\.S^\{\\prime\}\_\{s\}as one PageSection per item index in

I⋆I^\{\\star\}
9:for

s′∈s\.Ss′s^\{\\prime\}\\in s\.S^\{\\prime\}\_\{s\}doUpdateSection\(

s′s^\{\\prime\}\)

10:

s\.Es←s\.E\_\{s\}\\leftarrowconcatenation of

s′\.Es′s^\{\\prime\}\.E\_\{s^\{\\prime\}\}over

s′∈s\.Ss′s^\{\\prime\}\\in s\.S^\{\\prime\}\_\{s\}⊳\\trianglerightonly selected items contribute actions

##### Summary caching\.

Two caches operate at different lifetimes within the observation pipeline\. Section summariesσs\\sigma\_\{s\}are populated byUpdateSection\(Algorithm[2](https://arxiv.org/html/2606.10423#alg2)\) and persist across tasks within a WebsiteMem\. Per\-section task extractionsxxproduced byLLMExtractDetailsare cached on the PageSection alongside the details stringDDthat produced them, and are reused for the lifetime of a task whenever the section’s content is unchanged\. The page\-level task summaryp\.task\_summaryp\.\\text\{task\\\_summary\}is always regenerated on each call toAnalyzePage, since the relevant framing of a page can shift as the task progresses through its historyhth\_\{t\}\.

### A\.4Agent Loop

We provide the details of the top\-level inference loop that integrates the observation pipeline \(§[2\.4](https://arxiv.org/html/2606.10423#S2.SS4), App\.[A\.3](https://arxiv.org/html/2606.10423#A1.SS3)\) with the action system \(§[2\.5](https://arxiv.org/html/2606.10423#S2.SS5)\)\.AgentLoop\(Algorithm[7](https://arxiv.org/html/2606.10423#alg7)\) executes one timestep at a time until either the task is verified complete or the step budget is exhausted\. Each timestep produces*one*agent action, which may itself be a compound workflow that internally issues multiple LLM sub\-calls and browser operations\. After a non\-navigating action, an intra\-step continuation loop allows the agent to chain follow\-up actions on the same page without re\-running the full observation pipeline, up to a small budget\. Bookmark and \(where applicable\) website pre\-selection happen once at task start \(Prompt[E\.2\.1](https://arxiv.org/html/2606.10423#A5.SS2.SSS1)\)\.

Algorithm 7AgentLoop1:procedureAgentLoop\(task

τ=\(I,u0\)\\tau=\(I,u\_\{0\}\), WebsiteMem

ℳw\\mathcal\{M\}\_\{w\}\)

2:

Bτ←B\_\{\\tau\}\\leftarrowLLMSelectBookmarks\(

ℳw,I\\mathcal\{M\}\_\{w\},\\,I\)⊳\\trianglerightoptional, at task start only

3:

h←\[\]h\\leftarrow\[\\,\]
4:for

t=1,…,Tmaxt=1,\\ldots,T\_\{\\max\}do

5:

p←p\\leftarrowGetPageMem\(current URL,

ℳw\\mathcal\{M\}\_\{w\}\)

6:if

m←m\\leftarrowCheckModal\(

pp\)then

7:

S←\[m\]S\\leftarrow\[m\]⊳\\trianglerightmodal: focus on dialog, skip section selection

8:else

9:

S←S\\leftarrowLLMSelectSections\(

pp\)

10:

o^←\\hat\{o\}\\leftarrowAnalyzePage\(

p,Sp,\\,S\)

11:

upre←u\_\{\\text\{pre\}\}\\leftarrowcurrent URL

12:for

j=1,…,Jmaxj=1,\\ldots,J\_\{\\max\}do⊳\\trianglerightintra\-step continuation,Jmax=5J\_\{\\max\}=5

13:

𝒜←\\mathcal\{A\}\\leftarrowGatherCandidates\(

p,S,p\.S∖S,Bτp,\\,S,\\,p\.S\\setminus S,\\,B\_\{\\tau\}\)

14:

\(a,r\)←\(a,r\)\\leftarrowLLMSelectAction\(

𝒜\\mathcal\{A\}\)⊳\\trianglerightup to 3 retries on action error

15:if

aais end\-taskthen

16:ifLLMVerifyEndTask\(

I,hI,\\,h\)then⊳\\triangleright1 verification check per task

17:returnLLMFinalAnswer\(

I,hI,\\,h\)

18:else

19:remove end\-task from

𝒜\\mathcal\{A\}and re\-prompt for

aa
20:ExecuteAction\(

aa\)⊳\\trianglerightsingle op or compound workflow

21:ifcurrent URL

≠upre\\neq u\_\{\\text\{pre\}\}thenbreak

22:ifCheckModal\(

pp\)

≠𝐧𝐢𝐥\\neq\\mathbf\{nil\}thenbreak

23:UpdatePageMem\(

pp\)

24:

o^′←o^\+\\hat\{o\}^\{\\prime\}\\leftarrow\\hat\{o\}\\,\+VLMScreenDiff\(

pp\)⊳\\trianglerightcheap update for follow\-up action

25:append step observation, reason, and action to

hh
26:returnLLMFinalAnswer\(

I,hI,\\,h\)⊳\\trianglerightstep budget exhausted

##### Observation phase\.

At the start of each timestep, the agent retrieves or constructs the PageMemptp\_\{t\}for the current page and refreshes it viaUpdatePageMem\. A modal\-detection helperCheckModalthen tests for the presence of a modal dialog using DOM heuristics includingrole="dialog"andaria\-modal="true"; when a modal is detected, section selection is bypassed entirely and the modal’s PageSection is used as the sole relevant section, focusing the agent’s attention on the dialog and preventing the surrounding \(now\-inert\) page from polluting the candidate space\. Otherwise, the LLM selects relevant sectionsStS\_\{t\}from the section summaries as in §[2\.4](https://arxiv.org/html/2606.10423#S2.SS4)\(Prompt[E\.1](https://arxiv.org/html/2606.10423#A5.SS1)\)\. The remaining sectionsSt∁=pt\.S∖StS\_\{t\}^\{\\complement\}=p\_\{t\}\.S\\setminus S\_\{t\}are kept aside for use during candidate assembly\.AnalyzePagethen produces the task summaryo^t\\hat\{o\}\_\{t\}\.

##### Action phase\.

GatherCandidatesassembles the candidate set𝒜t\\mathcal\{A\}\_\{t\}in two passes\. First, for eachs∈Sts\\in S\_\{t\}, an LLM call selects elements fromsslikely to be useful for the task, conditioned onss’s extracted details \(Prompt[E\.2\.1](https://arxiv.org/html/2606.10423#A5.SS2.SSS1)\)\. Second, a single LLM call covers the elements of the first five entries ofSt∁S\_\{t\}^\{\\complement\}in document order — a heuristic that ensures upper\-page UI \(navigation bars, search boxes, primary buttons\) remains reachable even when the LLM did not flag those sections as relevant during section selection\. Navigation actions \(visited URLs, bookmarksBτB\_\{\\tau\}, type\-URL, switch\-tab, switch\-website\) are filtered by an LLM pass against the task \(Prompt[E\.2\.1](https://arxiv.org/html/2606.10423#A5.SS2.SSS1)\)\. The end\-task action is always appended\. A rule\-based pre\-filter removes irrelevant actions \(e\.g\., switching to an already active tab, clicking an already\-selected radio button, print, tel links, links leading outside allowed domains\) before𝒜t\\mathcal\{A\}\_\{t\}is presented to the LLM\.

The action\-selection LLM call \(Prompt[E\.2\.1](https://arxiv.org/html/2606.10423#A5.SS2.SSS1)\) returns a chosen actionaatogether with a natural\-language*reason*rrexplaining the choice\.ExecuteActiondispatches to the workflow appropriate to the selected action’s type\. Navigation actions invoke a singleNavigatecall; element actions inside a form section invokeSubmitForm, while other element actions invokeElementAction, which routes to the appropriate workflow \(App\.[A\.5](https://arxiv.org/html/2606.10423#A1.SS5)\); the end\-task action invokesLLMVerifyEndTask\. If the chosen action raises a runtime error \(an invalid URL, a stale selector, an interaction failure on a non\-interactable element\), the action is removed from𝒜t\\mathcal\{A\}\_\{t\}and the LLM is re\-prompted; this retry budget resets each timestep and is bounded at three attempts\.

##### End\-task verification\.

When the LLM selects end\-task,LLMVerifyEndTaskissues an LLM call conditioned on the task instructionIIand the interaction historyhth\_\{t\}that judges whether the task has been completed \(Prompt[E\.2\.1](https://arxiv.org/html/2606.10423#A5.SS2.SSS1)\)\. If completion is verified, a separateLLMFinalAnswercall \(Prompt[E\.2\.1](https://arxiv.org/html/2606.10423#A5.SS2.SSS1)\) produces the answer string and the loop terminates\. If not, the end\-task action is removed from𝒜t\\mathcal\{A\}\_\{t\}for the current timestep only — it remains available on subsequent timesteps — and the LLM is re\-prompted to choose a different action\. We allow at most one verification check per task episode; subsequent end\-task actions end the task immediately\.

##### Intra\-step continuation\.

Some actions \(entering input into a field, copying to clipboard\) leave the page on the same URL and complete a partial rather than full intent\. To avoid the overhead of restarting the observation pipeline for follow\-up actions on the same page, after such an action the loop enters a short continuation phase:UpdatePageMemrefreshes the page state, a VLM is prompted to describe the visual difference and the description is concatenated witho^t\\hat\{o\}\_\{t\}to produce the updated observationo^t′\\hat\{o\}^\{\\prime\}\_\{t\}, and the LLM selects another action from a freshly\-gathered candidate set\. The continuation phase ends when either the page URL changes, a modal dialog appears \(handled at the next timestep with focused attention\), the agent selects end\-task, or a budget of five within\-step actions is reached\.

### A\.5Action Workflows

We provide the details of the workflows invoked byExecuteActionwhen the agent selects an element action \(App\.[A\.4](https://arxiv.org/html/2606.10423#A1.SS4)\)\. Element actions are dispatched in two ways: if the selected element lies inside a form section, control passes toSubmitForm; otherwise it passes toElementAction\(Algorithm[8](https://arxiv.org/html/2606.10423#alg8)\), which routes to the appropriate per\-element\-type workflow based on tag, role, and DOM attributes\. Navigation actions and end\-task are handled directly in the agent loop and are not covered here\.

##### Element\-type dispatch\.

ElementActionis structured as a flat decision tree over element properties: it checks first for behaviors known from exploration \(recordeddropdown\_elements\), then for input\-type\-specific handlers \(file upload, select/combobox, the variousinput/textareasubtypes\), then for “probably opens something” signals \(aria\-haspopup, or any element not yet explored\), and finally falls through to a plain click\.

Algorithm 8ElementAction1:procedureElementAction\(Element

ee\)

2:if

e\.dropdown\_elements≠∅e\.\\text\{dropdown\\\_elements\}\\neq\\emptysetthen⊳\\trianglerightrecorded from exploration

3:DropdownAction\(

ee\)

4:elseif

e\.input\_type=filee\.\\text\{input\\\_type\}=\\texttt\{file\}then

5:UploadFile\(

ee\)

6:elseif

e\.tag=select∨e\.role=comboboxe\.\\text\{tag\}=\\texttt\{select\}\\,\\lor\\,e\.\\text\{role\}=\\texttt\{combobox\}then

7:SelectOption\(

ee\)

8:elseif

e\.tag∈\{input,textarea\}∨e\.role=spinbuttone\.\\text\{tag\}\\in\\\{\\texttt\{input\},\\texttt\{textarea\}\\\}\\,\\lor\\,e\.\\text\{role\}=\\texttt\{spinbutton\}then

9:if

e\.input\_type∈\{submit,reset,button\}e\.\\text\{input\\\_type\}\\in\\\{\\texttt\{submit\},\\texttt\{reset\},\\texttt\{button\}\\\}then

10:ClickElement\(

ee\)

11:elseif

e\.input\_type=searche\.\\text\{input\\\_type\}=\\texttt\{search\}then

12:Search\(

ee\)

13:elseif

eeis a radio or checkboxthen

14:ClickElement\(

ee\)

15:else

16:EnterInput\(

ee\)

17:elseif

e\.aria\-haspopup≠false∨¬e\.explorede\.\\text\{aria\-haspopup\}\\neq\\texttt\{false\}\\,\\lor\\,\\lnot\\,e\.\\text\{explored\}then

18:DropdownAction\(

ee\)⊳\\trianglerightprobe for dropdown semantics

19:elseif

eeis a copy buttonthen

20:CopyToClipboard\(

ee\)

21:else

22:ClickElement\(

ee\)

##### Form submission\.

SubmitForm\(Algorithm[9](https://arxiv.org/html/2606.10423#alg9)\) handles forms in three phases\. An LLM call first selects which fields to fill \(Prompt[E\.2\.2](https://arxiv.org/html/2606.10423#A5.SS2.SSS2)\), andElementActionis invoked on each chosen field, dispatching toEnterInput,SelectOption, orUploadFileas appropriate\. A validation pass then re\-fills any field that is empty\-but\-required or carriesaria\-invalid="true"after the initial entry — the LLM is re\-prompted for new values for these fields\. Finally, a review loop allows the LLM to inspect the populated form and either edit additional fields, submit, or exit \(Prompt[E\.2\.2](https://arxiv.org/html/2606.10423#A5.SS2.SSS2)\); exiting leaves the form in its current state and returns control to the main agent loop without submitting\. Within the review loop the candidate set is restricted to elements of the form section\.

Algorithm 9SubmitForm1:procedureSubmitForm\(form section

ss\)

2:

F←F\\leftarrowLLMSelectFields\(

ss\)

3:for

f∈Ff\\in Fdo

4:ElementAction\(

ff\)

5:UpdateSection\(

ss\)

6:for

f∈s\.Esf\\in s\.E\_\{s\}where

ffis empty\-and\-requiredor

f\.aria\-invalid=truef\.\\text\{aria\-invalid\}=\\texttt\{true\}do

7:ElementAction\(

ff\)

8:UpdateSection\(

ss\)

9:for

k=1,…,Kmaxk=1,\\ldots,K\_\{\\max\}do⊳\\trianglerightreview phase,Kmax=15K\_\{\\max\}=15

10:

a←a\\leftarrowLLMSelectFormAction\(

ss\)⊳\\trianglerightelement inss, submit, or exit

11:if

aais exitor

aais a submit buttonthen

12:if

aais a submit buttonthenClickElement\(

aa\)

13:return

14:ElementAction\(

aa\)

15:UpdateSection\(

ss\)

16:ifcurrent URL has changedthenreturn

##### Dropdown action\.

DropdownAction\(Algorithm[10](https://arxiv.org/html/2606.10423#alg10)\) clicks the dropdown trigger and consults the section diff returned byUpdatePageMem\. Three outcomes are possible\. If the URL changed or no new elements were revealed, the click was an ordinary navigation or null action and the workflow returns\. If the revealed elements form a coherent form\-like cluster \(multiple inputs together with a submit\-like button\), control is routed toSubmitFormon the synthesized form section\. Otherwise — the typical case of a menu, autocomplete list, or option dropdown — the LLM selects one of the revealed elements \(Prompt[E\.2\.1](https://arxiv.org/html/2606.10423#A5.SS2.SSS1)\) and that element is clicked\. The form\-detection heuristicIsFormreturns true whenΔ\+\\Delta^\{\+\}contains at least two input\-like elements and at least one element matching submit\-button heuristics \(a button with typesubmit, or accessible text matching submit\-like keywords\)\.

Algorithm 10DropdownAction1:procedureDropdownAction\(element

ee\)

2:

upre←u\_\{\\text\{pre\}\}\\leftarrowcurrent URL

3:ClickElement\(

ee\)

4:

Δ←\\Delta\\leftarrowUpdatePageMem\(current page\)

5:ifcurrent URL

≠upre\\neq u\_\{\\text\{pre\}\}or

Δ\+=∅\\Delta^\{\+\}=\\emptysetthenreturn

6:ifIsForm\(

Δ\+\\Delta^\{\+\}\)then

7:SubmitForm\(synthesize form section from

Δ\+\\Delta^\{\+\}\)

8:else

9:

e′←e^\{\\prime\}\\leftarrowLLMSelectAction\(

Δ\+\\Delta^\{\+\}\)

10:ClickElement\(

e′e^\{\\prime\}\)

##### Other workflows\.

- •Search: invokesEnterInputon the search field, then computes a section diff to detect whether suggestions have appeared\. If they have, the LLM is offered the option to select a suggestion \(which is then clicked\) or to ignore them \(Prompt[E\.2\.2](https://arxiv.org/html/2606.10423#A5.SS2.SSS2)\)\. The workflow concludes by pressing the Enter key to issue the search\.
- •EnterInput: prompts the LLM for the value to enter \(Prompt[E\.2\.2](https://arxiv.org/html/2606.10423#A5.SS2.SSS2)\) and fills it into the field\.
- •UploadFile: presents the LLM with the choice of either an existing file in the agent’s local filesystem \(input files staged for the task — e\.g\., the input images supplied with VisualWebArena tasks — and any text files created earlier in the same task or VLM\-captioned images saved during the task\) or a*create\-new\-file*option \(Prompt[E\.2\.2](https://arxiv.org/html/2606.10423#A5.SS2.SSS2)\)\. In the latter case the agent is prompted for a filename and text content \(Prompt[E\.2\.2](https://arxiv.org/html/2606.10423#A5.SS2.SSS2)\), the file is written to the local filesystem, and the new file is then uploaded\.
- •SelectOption: prompts the LLM to choose one of the available options \(Prompt[E\.2\.2](https://arxiv.org/html/2606.10423#A5.SS2.SSS2)\) and sets the field’s value to that option via Playwright\.
- •CopyToClipboard: reads the copied text from the clipboard and logs it in the action history\.
- •ClickElement: issues a Playwright click on the element\.

#### A\.5\.1Action Logging and History Format

Every successful basic action contributes a string to the agent’s interaction history\. Failed actions \(those that raise the runtime errors handled by the retry mechanism in App\.[A\.4](https://arxiv.org/html/2606.10423#A1.SS4)\) are not logged — only the eventually\-successful action appears\. Action strings are constructed*after*execution, since several formats reference values that are known only post\-execution \(e\.g\., the actual text entered into a field, the option that was selected\)\. Table[5](https://arxiv.org/html/2606.10423#A1.T5)lists the format for each basic action\.

Table 5:Log string format for each basic action\.Compound actions and intra\-step continuation chains produce multiple basic\-action strings within a single timestep\. These are emitted as a list under anActions:heading in the history; a timestep that produced exactly one basic action uses the singularAction:heading instead\. Each timestep contributes a block of the form below to the history, with the page name and URL drawn from the PageMem, the task summary fromAnalyzePage, and the reason produced by the LLM in the same call as the action selection itself\.

```
* Step 1:
  * Observation: {page_name} ({url})
    - Summary: {task_summary}
  * Reason for Action: {reason}
  * Action: {action_str}
* Step 2:
  * Observation: {page_name} ({url})
    - Summary: {task_summary}
  * Reason for Action: {reason}
  * Actions:
    - {action_str}
    - {action_str}
```

## Appendix BAdditional Experiment Details

##### Exploration parameters\.

Offline exploration is bounded by four limits per website: a maximum of 75 clickable elements explored per page, 500 pages per website, and a search depth of 2 \(where the homepage is at depth 0, so a depth\-2 traversal covers three layers\)\. Each website is also subject to a 12\-hour wall\-clock timeout, after which exploration terminates and the partial WebsiteMem is used as\-is\. For Online\-Mind2Web, which spans 136 distinct websites, depth is reduced to 1 and the per\-website timeout to 1 hour\. WorkArena uses the same parameters as WebArena and VisualWebArena\.

##### URL replacement on WebArena and VisualWebArena\.

WebArena and VisualWebArena evaluate against locally\-hosted simulated copies of real websites \(Reddit, GitLab, OpenStreetMap, etc\.\), but task instructions refer to these sites by their real names\. We observed that LLMs frequently misinterpret this as a directive to navigate to the real site — e\.g\., reading the localhost URL, concluding “I am not on Reddit,” and attempting to navigate to[https://www\.reddit\.com](https://www.reddit.com/), which breaks evaluation\. We address this confusion with a bidirectional URL substitution applied at the prompt boundary: simulated\-site URLs are rewritten to their real\-site counterparts in every string passed to the LLM, and the inverse rewrite is applied to URLs in LLM outputs before they reach the browser\. The LLM thus reasons consistently as if it were operating on the real site, while the browser remains pointed at the simulation\. Table[6](https://arxiv.org/html/2606.10423#A2.T6)lists the seven substitutions used\.

Table 6:URL substitutions applied at the prompt boundary on WebArena and VisualWebArena\. Environment variables hold the localhost URLs of the simulated sites\. The substitution is applied bidirectionally: simulated→\\toreal on input to the LLM, real→\\tosimulated on URLs in the LLM’s outputs\.
##### Multi\-website selection \(WebArena\)\.

At the start of each task, the agent is presented with the full list of benchmark websites and prompted to select any sites beyond the starting URL that are relevant \(Prompt[E\.3\.1](https://arxiv.org/html/2606.10423#A5.SS3.SSS1)\)\. The homepages of selected sites are added to the bookmark setBτB\_\{\\tau\}, making them available as one\-click navigation actions throughout the task\.

##### Input image grounding \(VisualWebArena\)\.

A subset of VisualWebArena tasks include input images that the agent must reason over alongside the task instruction\. At task start, the VLM is prompted with the task instruction, the input image\(s\), and the current page screenshot, and asked to produce a textual description of the image\(s\) in relation to the task \(Prompt[E\.3\.2](https://arxiv.org/html/2606.10423#A5.SS3.SSS2)\)\. This description is appended to the task instruction for the duration of the task\.

##### Hyperparameters\.

Table[7](https://arxiv.org/html/2606.10423#A2.T7)consolidates the configuration values used across all components of the system\. These hyperparameters were largely chosen heuristically as reasonable defaults and were not extensively swept, as we did not observe strong sensitivity in pilot runs\.

ComponentParameterValueExplorationmax elements per page75max pages per website500max depth \(homepage at depth 0\)2 \(1 for OM2W\)max time per website12h \(1h for OM2W\)Page divisionoversize thresholds\(h,w\)\(h,w\)\(\>900,\>320\)\(\\\!\>\\\!900,\\\!\>\\\!320\)or\(\>500,\>800\)\(\\\!\>\\\!500,\\\!\>\\\!800\)list\-grouping run length≥4\\geq 4Section updateresummarization threshold\|Δ\+\|\+\|Δ−\|\|\\Delta^\{\+\}\|\+\|\\Delta^\{\-\}\|≥3\\geq 3Observationlist\-item chunk sizecc25minimum image size for VLM description50×5050\\times 50pxAgent loopmax stepsTmaxT\_\{\\max\}30intra\-step continuation budgetJmaxJ\_\{\\max\}5action\-error retries per step3end\-task verification attempts per task1Form submissionreview loop boundKmaxK\_\{\\max\}15Table 7:Consolidated hyperparameter values\. Per\-benchmark exploration overrides are noted parenthetically; all other values are identical across the four benchmarks\.
### B\.1Compute Cost Estimates

Experiments were performed on a desktop machine with Ryzen 5 3600 CPU, NVIDIA RTX 3090 GPU, and 64GB RAM\. Inference was run locally using vLLM\(Kwonet al\.,[2023](https://arxiv.org/html/2606.10423#bib.bib57)\)\. Total execution time for each benchmark was∼7\{\\sim\}7days for WebArena,∼8\{\\sim\}8days for VisualWebArena,∼3\{\\sim\}3days for Online\-Mind2Web, and∼2\{\\sim\}2days for WorkArena\. Based on estimated system power draw and regional electricity prices, we estimate that experiments cost roughly $1\.15 in electricity per day, leading to a total estimate of $23 for the four benchmarks\. On average, each task used 270k tokens total across the LLM and VLM, which would translate to roughly $0\.03 per task if using OpenRouter API endpoints at the time of writing\. Exploration used approximately 50M total tokens for summarization across all benchmark websites\.

## Appendix CBroader Impacts

Our work shows that capable web agents can be built on small, locally\-runnable open\-weight models, which has positive implications for cost, privacy, and research accessibility: automation of tedious web tasks becomes economical at scales where frontier\-model APIs would not, sensitive browsing sessions need not leave the user’s device, and reproducible agent research becomes more tractable for groups without large compute budgets\. However, this also lowers the barrier for misuse such as spam posting, fake account creation, and review manipulation\. Agents acting autonomously over long horizons also raise deployment concerns: even the strongest current agents make mistakes, and the compounding effect of errors across multi\-step tasks means we recommend human oversight for any consequential domain\.

## Appendix DLimitations

Our framework relies on hand\-designed components that encode structural priors about how web pages are typically organized, such as DOM\-based section decomposition, heuristics for identifying clickable elements, deterministic exploration rules, and a fixed set of compound\-action workflows\. While our implementation is generally robust across a wide range of websites, performance may degrade on sites that diverge significantly from common patterns\. Our method also utilizes a larger number of sequential LLM calls, which increases wall\-clock time per task and makes the framework expensive to run with frontier models\. We further investigate only a minimal instantiation of the memory component; richer mechanisms such as online workflow learning or synthetic\-data generation are left to future work\. Finally, all of our evaluation is conducted on benign tasks, and the system’s robustness to adversarial page content is uncharacterized\(Turet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib81); Zhenget al\.,[2025b](https://arxiv.org/html/2606.10423#bib.bib82); Xianget al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib83); Wuet al\.,[2026](https://arxiv.org/html/2606.10423#bib.bib84); Yinget al\.,[2026](https://arxiv.org/html/2606.10423#bib.bib86); Zhanget al\.,[2025a](https://arxiv.org/html/2606.10423#bib.bib85),[c](https://arxiv.org/html/2606.10423#bib.bib87); Wuet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib88); Liaoet al\.,[2026](https://arxiv.org/html/2606.10423#bib.bib89); Kuntzet al\.,[2025](https://arxiv.org/html/2606.10423#bib.bib90); Anthropic,[2025](https://arxiv.org/html/2606.10423#bib.bib91)\)\.

## Appendix EPrompts

### E\.1Observation Prompts

SelectSectionsYou are an intelligent virtual assistant who completes tasks for users on various websites\. You will be given information about your task, the previous steps taken, and a list of summaries for the different sections available on the current page\. You must now identify all page sections that seem relevant to the task based on their descriptions and select those sections for further analysis\.The information will be given as: TASK: task request from the user HISTORY: summary of previous task steps CURRENT PAGE: name and url of current page PAGE SECTIONS: list of sections available on the current page Think step\-by\-step and identify all page sections that might contain relevant details or potentially aid in progressing the task\. Then, select the relevant page sections by providing a list of integers corresponding to the indices of your selected sections\. Give your full thought process behind your reasoning as: \*THOUGHTS\*\* your reasoning steps Give your answer for the relevant sections as a comma\-separated sequence of integers corresponding to your choices: \*RELEVANT SECTIONS\*\*: indices of selected page sections \(e\.g\., "2, 5, 3"\) Additional Guidelines: \- Select all sections that seem like they could potentially be useful at any point in the task\. \- If you think a section might be relevant but are not sure, select it to see its full details\.

ExtractDetailsYou are an intelligent virtual assistant who completes tasks for users on various websites\. To complete your task, you must first analyze each section of the page one\-by\-one and identify any relevant details\. You will be given information about your overall task and the next section of the page to analyze\. You must then summarize any information or elements in the page section that might potentially be relevant for the task\.The following information will be provided: TASK: your task instructions HISTORY: summary of previous task steps CURRENT PAGE: title, url, and summary of current page SECTION: page section to analyze CONTENT: section content First, think step\-by\-step about the current subsection of the page you are analyzing and consider whether more information needs to be gathered\. Then, provide a summary of the information and save the relevant details in this format: ‘\- \*\*\{descriptive\_label\}\*\*: "\{value\}"‘ Give your full thought process followed by your answer in this format: \*THOUGHT\*\*: your reasoning \*SUMMARY\*\*: summary of why the information may or may not be relevant to the task \*RELEVANT DETAILS\*\*: noteworthy information from this section Additional guidelines: \- Do not include instructions for the next steps to take just yet, only highlight information or potentially useful actions from the current page\. \- Never make assertive statements or assumptions about the given page section as it only provides a partial view of the website\. Always mention any potential uncertainties and/or alternative possibilities about the information given\. \- Make sure to mention any page details that could help us find the correct path to take even if the information is not directly relevant itself\. \- Your response should end after you provide the relevant details\.

SummarizePageYou are a virtual AI assistant who performs tasks on behalf of users through a web browser\. You will be provided with your assigned task instructions, the previous interaction history, and the current web page observation\. You must now write a concise summary of the current page that highlights the relevant information for your task\.The following information will be provided: TASK: your task instructions HISTORY: previous task steps CURRENT PAGE: title and url of current page PAGE OBSERVATION: overview of page First, think step\-by\-step about the progress you have made previously in order to understand the context of the current situation\. Then, identify the information on the page that is the most relevant to your task and provide a concise report \(max one paragraph in length\) that summarizes the current observation\. Provide your full thought process followed by your summary of the page observation in this format: \*THOUGHT\*\*: your reasoning steps \*OBSERVATION SUMMARY\*\*: summary of relevant details Additional Guidelines: \- You should not decide what action to take yet, only highlight relevant information\. \- Do not make assumptions about the functionality of the elements on the page\. You may describe what you think will be a likely result of clicking an action, but you should mention any potential uncertainties and not make strong statements\. \- Make sure to include in your summary any details from the current page that would be worth remembering in the future\. \- If the page is not relevant to your task then simply provide a brief description of the page and why it is not relevant\. \- Your answer should end after you provide the observation summary\.

#### E\.1\.1List item selection prompts\.

SelectItemsYou are an intelligent virtual assistant who completes tasks for users on various websites\. You will be requested by the user to complete a task that involves finding one or more items in a list of content \(e\.g\. a list of links, products, articles, comments, etc\.\) on the website\. The list of items on the current page will be shown and you will have the option to select any items that are relevant to the task based on their positional indices\.The following information will be provided: TASK: the user’s task instructions HISTORY: summary of previous task steps PAGE: title and url of the current page LIST INFO: summary describing the content list LIST ITEMS: the list items on the current page of results Think step\-by\-step about the given information and determine if there are any relevant list items on the current page of results\. Then, select all items that are relevant to the task in any way by providing a comma\-separated list of integers corresponding to the positional indices of those items \(e\.g\. "2, 9, 13"\)\. If there are no exact matches in the list, then still select the item that is the closest match\. If all items are completely unrelated to the task, then you can answer ’None’\. Provide the full thought process behind your answer followed by your final answer for the item selection in this format: \*THOUGHTS\*\*: your reasoning steps \*SELECT ITEMS\*\*: list items to select \(list of integers or ’None’\) Additional Guidelines: \- If some of the list items seem like potential matches but there is not enough information to confirm, select them in order to view more details\. \- Select all items that are partially related to the task requirements even if they are not exact matches\. \- Always include all items that are mentioned in the task instructions\. \- End your response after selecting the list items\.

CheckDoneYou are an intelligent virtual assistant who completes tasks for users on various websites\. Your current objective requires you to find one or more items in a list of content \(e\.g\. a list of pages, products, articles, comments, etc\.\)\. In the previous steps you have iterated over the items in order and recorded any matching items found\. You will be given the history of the items you have checked so far and you must determine if the objective is complete or if you still need to find more items in the list\.You will be provided with the following information: PAGE: title, url, and description of the current page OBJECTIVE: the task you must complete ITEMS CHECKED: list of items that have been checked so far, and whether they match Think step\-by\-step about the requirements of the objective and the details about the items that have been checked so far\. Then, determine if the task is complete or if you still need to look for more items in the rest of the list\. Give your full thought process and your answer in this format: \*THOUGHTS\*\*: steps behind your reasoning process \*COMPLETE\*\*: whether the objective has been completed \(Yes/No\) Reminder: \- Pay attention to the sort order of the list\. It may allow you to determine if the remaining list items are worth checking \(e\.g\. if you need to find the cheapest item and the list is sorted by price, then the remaining items will be more expensive\)\. \- If it is still possible that there might be more matching items in the list, then you should keep searching to be sure\.

### E\.2Action Prompts

#### E\.2\.1Agent Loop Prompts

SelectBookmarksYou are a virtual AI assistant who performs tasks on behalf of users through a web browser\. You will be given a user task request and a list of pages on the current website that you can visit in order to carry out the task\. You must now identify any pages that are likely to be relevant for completing the assigned task\.The list of available pages will be given in this format: ALL PAGES: 1\) \{page\_1\} 2\) \{page\_2\} \.\.\. n\) \{page\_n\} You should carefully think step\-by\-step to determine whether any of the pages would likely be useful for completing the task\. Include all pages that have the potential to be helpful for completing the task in any way\. Provide your reasoning as: \*THOUGHTS\*\*: your reasoning steps Finally, select the relevant pages by providing a comma\-separated sequence of integers corresponding to the indices of your choices: \*RELEVANT PAGES\*\*: indices of selected pages\(s\) \(e\.g\. "1, 7"\) Note: \- Select any pages that seem like they might be partially relevant or help find the target page quicker\. \- If there are no pages that seem relevant for the task then provide ’None’ as your answer\.

Element candidatesYou are a virtual AI assistant who performs tasks on behalf of users through a web browser\. You will be provided with a list of clickable elements available on the current page and you must identify all candidate elements that are potentially useful for progressing the task\.Your assigned task and the previous interaction history will be provided as: TASK: task instructions HISTORY: summary of previous task steps CURRENT PAGE: title, url, and summary of current page observation The elements present on the page will be listed in this format: PAGE ELEMENTS: 1\) \{element\_1\} 2\) \{element\_2\} \.\.\. You should carefully think step\-by\-step to determine which elements might help progress the task\. Provide your full reasoning process as: \*THOUGHTS\*\*: your reasoning steps Finally, select one or more elements by providing the indices of your choices separated by commas: \*SELECT ELEMENTS\*\*: indices of selected element\(s\) \(e\.g\. "1, 7"\) Note: \- For search/filtering interfaces, make sure to clear all pre\-existing filters before applying a new one\. \- If there are multiple elements that seem relevant and you are not sure which one is best, then include all relevant elements in your answer\. \- If none of the listed elements are relevant then answer ’None’\.

Navigation candidatesYou are a virtual AI assistant who performs tasks on behalf of users through a web browser\. You will be provided with your assigned task instructions, the previous interaction history, current page observation, and a list of browser navigation actions to choose from\. You must now determine if any of the available navigation actions would be useful for progressing the task\.The following information will be provided: TASK: your task instructions HISTORY: summary of previous task steps CURRENT PAGE: title, url, and summary of current page BROWSER ACTIONS: list of available navigation options and their details Think step\-by\-step about the navigation options and whether they would be helpful or harmful for progressing the task\. Then, select one or more candidates for the next action to take\. If none of the navigation options are useful for the task then select ’None’\. Give your full thought process behind your reasoning as: \*THOUGHTS\*\* your reasoning steps Select one or more of the navigation options by providing a comma\-separated sequence of integers corresponding to the indices of your choices: \*ANSWER\*\*: indices of selected option\(s\) \(e\.g\. "1, 7"\)

SelectActionYou are an autonomous AI assistant who performs tasks on behalf of users through a web browser\. You will be provided with your assigned task instructions, the previous interaction history, current page observation, and a list of valid actions to choose from\. You must now determine the correct next action to take in order to progress with the task, or end the task if it has been completed\.The following information will be provided: TASK: your task instructions HISTORY: summary of previous task steps CURRENT PAGE: title, url, and summary of current page observation ACTIONS: list of available action options You should first carefully think step\-by\-step about the potential benefits and risks of each option in order to determine the logical next action for making progress on your task\. Provide your full thinking process as: \*THOUGHTS\*\*: your reasoning steps Once you have decided on the next action you wish to take, provide the reason behind your decision followed by the index of your selected action in the following format: \*REASON\*\*: reason for decision \*SELECT ACTION\*\*: index of selected action \(e\.g\. "1"\) Note: \- The reason in your final answer should be a one sentence summary of the thinking process behind your decision\. \- You must select exactly ONE action out of the options provided\. \- Do not mark the task as complete until you have fully completed all steps specified by the instructions \(e\.g\. if the task is to buy a product, you need to complete the full checkout and payment process\)\. \- Only provide the index of the selected action with no additional text afterwards\.

VerifyEndTaskYou are an intelligent virtual assistant who completes tasks for users on various websites\. Based on your given task instructions, your previous action history and the current page observation, you must determine whether you have fully completed the task\.The following information will be provided: TASK: task instructions from the user HISTORY: summary of previous task steps CURRENT PAGE: name, url and summary of the current browser page ACTIONS: available actions \(not executed yet\) Think step\-by\-step about user’s task instructions and identify which parts of the task you have accomplished, then determine whether the task is fully complete or additional actions need to be taken\. \*Guidelines for evaluation\*\*: 1\. You are only required to complete the steps explicitly defined by the instructions\. 2\. As long as there are any next steps that need to be taken, the task completion status should be false\. 3\. Tasks that involve buying an item should only be considered complete after adding the item to the cart and completing the full checkout process\. 4\. If the user asks you to show them an item \(e\.g\. "show me the most expensive \.\.\.", "show me the most recent \.\.\.", "find me a \.\.\." etc\.\), then the task is not complete until the full details of the item are displayed by clicking its link\. 5\. If the task instructions asks you to open a page \(e\.g\. "open my latest \.\.\."\), then the task is not fully complete until you have clicked the link for the page\. Provide your thought process followed by your answer in this format: \*THOUGHTS\*\*: your reasoning process \*TASK COMPLETE\*\*: task completion status \(True/False\)

FinalAnswerYou are an intelligent virtual assistant who performs tasks for users on various websites\. Previously, you have completed the user’s task on the website and now you must send a final message to the user\.You will be provided with the following information: USER: the task request from the user HISTORY: summary of previous task steps CURRENT PAGE: title, url, and summary of current page Think step\-by\-step the user’s task instructions and the information you have found on the webstite\. Then, provide a message to the user that summarizes the steps you have taken to complete their task followed by the final answer to the user’s query\. Outline your full thought process as: \*THOUGHTS\*\*: your reasoning Provide your message to the user followed by your answer value in this format: \*MESSAGE\*\*: completion message \*ANSWER\*\*: final answer value Important guidelines for the final answer: \- If you were unable to successfully complete the task or no answer is necessary, provide "N/A" as the answer\. \- When asked to return a count, return the count as a number with units instead of "N/A" if it’s 0\. \- If the user asks you to check if something is true or not, answer "Yes" or "No"\.

#### E\.2\.2Action Workflow Prompts

SelectFieldsYou are an intelligent virtual assistant who completes tasks for users on various websites\. You will be requested by the user to complete a task that involves filling in a web form\. The list of available form fields will be provided and you will be able to select which ones to edit based on the task\.The task instructions from the user and your current subgoal will be provided as: TASK: the task assigned by the user HISTORY: the previous interaction history CURRENT PAGE: name, url and summary the of current page The form will be provided as a numbered list of available input fields: FORM: 1\) \{field\} 2\) \{field\} \.\.\. n\) \{field\} Think carefully about the task and identify the input fields that you need to select, then provide your reasoning as: \*THOUGHT\*\*: reasoning Finally, provide the list the of all form fields that should be updated for the task by providing a comma\-separated sequence of integers corresponding to the indices of your choices as: \*EDIT FIELDS\*\*: list of indices \(e\.g\. "1, 7"\)

SelectFormActionYou are an intelligent virtual assistant who completes online form filling tasks on various websites\. In the previous steps, you filled out the available fields of the form and you will now choose the next action to take in the form\.The task information will be given in the following format: TASK: overall task to complete PROGRESS: the current task progress FORM MENU: numbered list of actions available in the form section Think step\-by\-step about the given information and identitify the appropriate next action to take among those listed in the form menu\. You will then be able to select one of the actions by providing its index\. Provide your full thought process followed by your action choice as: \*THOUGHTS\*\*: your reasoning steps \*CHOICE\*\*: index of selected action \(e\.g\. "\*\*CHOICE\*\*: 14"\) Additional Guidelines: \- Don’t repeat the action performed in the previous step\. \- End your response after providing your action choice\.

Enter Input ValueYou are an intelligent virtual assistant who completes tasks for users on various websites\. You will be provided information about your assigned task, the web page, and the current state of the task progress\. You have just selected an input field on the page and you can now provide a value to enter into the input field\.The task information will be provided in this format: TASK: <user’s task instructions\> PAGE: <summary of page\> PROGRESS: <current task progress\> INPUT FIELD: <selected input field\> Think step\-by\-step about the actions needed to complete the task and then answer with the value to enter into the selected input field\. If the task specifies an exact input value to enter then your answer should match\. Otherwise, come up with a reasonable input value to enter given the task context\. Provide your full thought process followed by your final answer for the input value as: \*THOUGHTS\*\*: <reasoning\> \*INPUT VALUE\*\*: <value\> Additional Guidelines: \- If the task does not directly specify an input value to use then think carefully about what would be the best value to enter in the field\. \- Only provide an input value for the current selected input field with no additional text afterwards\.

Select Search SuggestionsYou are an intelligent virtual assistant who completes tasks on various websites\. You will be provided information about the current page and your assigned task\. A search bar displaying a list of search suggestions is currently selected and you will now be have the option to click one of them\.The information about your task, the current browser page and the selected search bar will be provided as: TASK: <your assigned task\> PAGE: <summary of page\> SEARCH FIELD: <details about the search bar\> SUGGESTIONS: <list of search suggestions\> Think step\-by\-step about whether any of the search suggestions are relevant to the task\. If you want to click one of the suggestions then provide the number and value of the option\. If there are no relevant search suggestions then you can choose the ’None’ option\. If the correct input value is already entered into the field and the same value also appears in the search suggestions, you should still select the same value again\. Provide your full thought process followed by your final answer in this format: \*THOUGHTS\*\*: <reasoning process\> \*SELECT\*\*: <number and value of option\> \(e\.g\. "1\) Meta"\)

Select OptionYou are an intelligent virtual assistant who completes tasks on various websites\. You will be provided information about the current page and your assigned task\. You have selected an input field on the page with several options available and you must select the appropriate numbered option from the list by providing its index\.The information about the page, your task, and the options for the input field will be provided in this format: TASK: task instructions PROGRESS: current task progress PAGE: summary of page INPUT FIELD: selected input element INPUT OPTIONS: 1\) \{option\_1\} 2\) \{option\_2\} \.\.\. n\) \{option\_n\} Think step\-by\-step about the actions needed to complete the task and then answer with the value to enter into the selected input field\. If the task specifies an exact input option to select then your answer should match\. Otherwise, choose the option that most closely matches with the task instructions given the task context\. Provide your full thought process followed by your final answer as: \*THOUGHTS\*\*: your reasoning \*OPTION\*\*: index of selected option \(e\.g\. "1"\) Additional Guidelines: \- Only select ONE of the input options\. \- Don’t provide any additional text after the selected option\.

Upload FileYou are an intelligent virtual assistant who completes tasks for users on various websites\. Your current task involves uploading a file and you have already navigated to the part of the website where the file upload should take place\. The list of files available in your filesystem will be given and you must choose now choose the appropriate file to upload\.The following information will be provided: TASK: your task instructions HISTORY: summary of the previous actions you have taken CURRENT PAGE: title, url, and summary of current page INPUT FIELD: the field that you are uploading the file to FILES: list of files to choose from Based on this information, you should carefully analyze the context of your current task and determine which file from your filesystem should be uploaded\. If the task requires creating a new file to upload, then you can select the ’Create new file’ option to write and upload a file\. Think step\-by\-step and provide your full thought process as: \*THOUGHTS\*\*: your reasoning steps Then, choose one of the options by providing the index of your choice as: \*ANSWER\*\*: chosen index \(e\.g\. "1"\)

Create FileYou are an intelligent virtual assistant who completes tasks for users on various websites\. Your current task involves creating a new file and uploading it to the current website\. Based on the given task context, you must create the file that will be uploaded by providing the name and contents of the file\.The following information will be provided: TASK: your task instructions HISTORY: summary of the previous actions you have taken CURRENT PAGE: title, url, and summary of current page FIELD: the field where the file will be uploaded Reason carefully about your task and plan what you will name the file and what the file contents should be\. Once you are ready, you can provide the file name followed by the full file content and this file will be uploaded to the website\. Provide your full thinking process as: \*THOUGHTS\*\*: your reasoning steps Provide the name of the file followed the file content in this format: \*FILE\*\*: file name \(e\.g\. urls\.txt\) \*CONTENT\*\*: ‘‘‘ \# full file content here \.\.\. ‘‘‘

Enter URLYou are an autonomous virtual assistant who completes tasks for users on various websites by controlling a web browser\. You will be provided information about your assigned task, the previous actions you have taken, and the current page information\. The browser address bar has been selected and you must now provide a URL to enter and navigate to\.You will be given the following information: TASK: your task instructions HISTORY: summary of previous task steps CURRENT PAGE: title, url, and summary of current page observation Think carefully about the given task information and determine the correct URL for the page that you wish to navigate to on the website\. Important guidelines: 1\. The URL must be for the same website as the current page\. 2\. Pay close attention to the previous URLs you have visited in order to help understand the URL structure used by the website\. Provide your full reasoning process followed by your answer for the URL in this format: \*THOUGHTS\*\*: your reasoning steps \*ANSWER\*\*: URL

### E\.3Other Prompts

#### E\.3\.1WebArena

Multi\-siteYou are an autonomous AI assistant who performs tasks on behalf of users through a web browser\. You will be given a task from the user and start on a website where the task needs to be completed on\. First you must determine if the task only involves the current website, or if part of the task needs to be completed on another website\.You will be given the following information: USER: task instructions from the user CURRENT PAGE: description of current browser page WEBSITES: list of available websites Carefully read the user’s instructions and reason about whether the task involves only the current website, or if a task requirements involves another website\. In most cases the task will only focus on the current site but occasionally it will require information or functionality from one of the other websites\. Your final answer should include one or more websites that are necessary for the task and must always include the current website\. Provide your reasoning process followed by your final answer as: \*THOUGHTS\*\*: your reasoning steps \*ANSWER\*\*: website\(s\) needed for the task \(e\.g "OneStopMarket", "Reddit \+ GitLab"\)

#### E\.3\.2VisualWebArena

Input ImageDescribe what the following user is asking for based on their request message that references both the first image and the web page screenshot\.\*User request message\*\*: "\{task\_instruction\}"

#### E\.3\.3VLM Prompts

VLM Summarize SectionDescribe this section of the \{website\_name\} website in one sentence\.

VLM Summarize PageDescribe this page of the \{website\_name\} website in one sentence\.

VLM Screen DiffThese two screenshots show the \{website\_name\} website immediately before and after I \{action\_str\}\. What was the result of the action?

Describe Image \- ShortDescribe the image with a short sentence\.

Describe Image \- LongFully extract all information contained in this image in an organized format\.Make sure to include: \- All textual and/or numerical information\. \- Visual features such as objects, colors, and settings\. \- Names of any recognizable people, celebrities, characters, media, or locations\.
WebChallenger: A Reliable and Efficient Generalist Web Agent

Similar Articles

VisualClaw: A Real-Time, Personalized Agent for the Physical World

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

Submit Feedback

Similar Articles

VisualClaw: A Real-Time, Personalized Agent for the Physical World
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration