Region4Web: Rethinking Observation Space Granularity for Web Agents

arXiv cs.CL Papers

Summary

This paper introduces Region4Web, a framework that improves web agent performance by organizing observation spaces into functional regions rather than individual elements. It demonstrates that this approach reduces observation length and increases task success rates on the WebArena benchmark.

arXiv:2605.07134v1 Announce Type: new Abstract: Web agents perceive web pages through an observation space, yet its granularity has remained an underexamined design choice. Existing work treats observation at the same element-level granularity as the action space, leaving the page's functional organization implicit and forcing the agent to infer it from element-level signals at every step. We argue observation should instead operate at the granularity of functional regions, parts of the page that each serve a distinct purpose. We propose Region4Web, a framework that reorganizes the AXTree into functional regions through hierarchical decomposition and semantic abstraction, exposing the page's functional organization as the basis for page state understanding. Moreover, we propose PageDigest, a web-specific inference pipeline that delivers this region-level observation to the actor agent as a compact per-page digest that persists across steps. On the WebArena benchmark, PageDigest substantially reduces observation length while improving overall task success rate across diverse backbone large language models (LLMs) and established agent methods, regardless of backbone capacity. These results show that operating at the granularity of functional regions delivers a more compact and informative basis for the actor agent than element-level processing alone.

Cached at: 05/11/26, 06:49 AM

# Region4Web: Rethinking Observation Space Granularity for Web Agents
Source: [https://arxiv.org/html/2605.07134](https://arxiv.org/html/2605.07134)
Donguk Kwon, Yonsei University, donguk.kwon@yonsei.ac.kr · Dongha Lee, Yonsei University, donalee@yonsei.ac.kr

###### Abstract

Web agents perceive web pages through an observation space, yet its granularity has remained an underexamined design choice. Existing work treats observation at the same element-level granularity as the action space, leaving the page’s functional organization implicit and forcing the agent to infer it from element-level signals at every step. We argue observation should instead operate at the granularity of functional regions, parts of the page that each serve a distinct purpose. We propose Region4Web, a framework that reorganizes the AXTree into functional regions through hierarchical decomposition and semantic abstraction, exposing the page’s functional organization as the basis for page state understanding. Moreover, we propose PageDigest, a web-specific inference pipeline that delivers this region-level observation to the actor agent as a compact per-page digest that persists across steps. On the WebArena benchmark, PageDigest substantially reduces observation length while improving overall task success rate across diverse backbone large language models (LLMs) and established agent methods, regardless of backbone capacity. These results show that operating at the granularity of functional regions delivers a more compact and informative basis for the actor agent than element-level processing alone. Code is available at [https://github.com/kwondu/region4web](https://github.com/kwondu/region4web).

## 1 Introduction

Large language models (LLMs) have enabled autonomous agents capable of handling diverse real-world tasks in web environments (He et al., [2024](https://arxiv.org/html/2605.07134#bib.bib19); Logeswaran et al., [2025](https://arxiv.org/html/2605.07134#bib.bib34); Wu et al., [2025](https://arxiv.org/html/2605.07134#bib.bib26)). At each step, a web agent perceives the current page state through an observation space and selects an action from an action space. Prior work has concentrated on improving action selection, with task planning (Guo et al., [2026](https://arxiv.org/html/2605.07134#bib.bib32); Huang et al., [2025](https://arxiv.org/html/2605.07134#bib.bib28); Shinn et al., [2023](https://arxiv.org/html/2605.07134#bib.bib13)), element grounding (Zheng et al., [2024](https://arxiv.org/html/2605.07134#bib.bib18)), and model capability (Qi et al., [2025](https://arxiv.org/html/2605.07134#bib.bib25); Wei et al., [2025](https://arxiv.org/html/2605.07134#bib.bib31)) all directed toward this goal. Page state understanding, in contrast, has been addressed through filtering or truncating elements from the observation (Kang et al., [2025](https://arxiv.org/html/2605.07134#bib.bib46); Lee et al., [2025](https://arxiv.org/html/2605.07134#bib.bib29); Zhang et al., [2026a](https://arxiv.org/html/2605.07134#bib.bib33)), all of which operate at element-level granularity, leaving this design choice itself underexamined.

Existing work often represents the observation space at the same element-level granularity as the action space (Schiepanski and Piël, [2025](https://arxiv.org/html/2605.07134#bib.bib44); Yang et al., [2025](https://arxiv.org/html/2605.07134#bib.bib24)), yet this granularity is not equally suited to both. Element-level granularity is natural for the action space, where each action targets a specific element with a designated operation. The observation space, however, serves a fundamentally different role of providing context for understanding the current page state, where context extends from individual elements to their relations. We capture these relations through functional regions, defined as groups of elements whose relations support a shared purpose, such as site traversal or result narrowing.

Decomposing pages into regions has been studied both in work on human attention to spatially coherent areas (Buscher et al., [2009](https://arxiv.org/html/2605.07134#bib.bib35)) and in recent GUI web agents that segment screenshots into region partitions (Fan et al., [2024](https://arxiv.org/html/2605.07134#bib.bib21); Singh et al., [2025](https://arxiv.org/html/2605.07134#bib.bib27)). These approaches show that visual layout provides useful cues for grouping elements, often through spatial proximity such as bounding box overlap or layout adjacency. However, spatial proximity does not entail shared functional purpose. Such proximity cues may induce visual groupings, but they do not specify whether those groupings constitute functional observation units or what purpose they serve in the page state. A similar implicitness appears in element-level observation (Schiepanski and Piël, [2025](https://arxiv.org/html/2605.07134#bib.bib44); Yang et al., [2025](https://arxiv.org/html/2605.07134#bib.bib24); Zhang et al., [2026a](https://arxiv.org/html/2605.07134#bib.bib33)), where regions and their purposes are present only implicitly through individual elements and must be inferred by the agent. Screenshot-based agents (He et al., [2024](https://arxiv.org/html/2605.07134#bib.bib19); Zheng et al., [2024](https://arxiv.org/html/2605.07134#bib.bib18)) provide layout cues that may make functional organization visually inferable, but they still require the agent to infer whether visually suggested groupings correspond to functional regions and what purpose they serve. These limitations motivate region-level observation defined by shared functional purpose. By identifying functional regions and abstracting each by its purpose, Region4Web makes page organization explicit before action selection, as shown in Figure [1](https://arxiv.org/html/2605.07134#S1.F1).

![Refer to caption](https://arxiv.org/html/2605.07134v1/x1.png)
Figure 1: Element-level and region-level observation of structurally similar card grids. Region-level observation distinguishes a grid of product preview cards from a single destination showcase.

Constructing region-level observation is not straightforward. Boundaries and purposes of functional regions are implicit in tree representations such as the AXTree, where the hierarchy reflects markup nesting rather than how elements are organized. Deriving them through rule-based decomposition is insufficient, as what each region is for varies with the page even for structurally repeated patterns. A grid of structurally repeated cards, for example, forms independent regions when the cards are separate product previews, yet a single region when they collectively form a review showcase, as Figure [1](https://arxiv.org/html/2605.07134#S1.F1) demonstrates. Nor does existing research on web page structure resolve this, as web page segmentation (Cai et al., [2003](https://arxiv.org/html/2605.07134#bib.bib39); Gerber et al., [2025](https://arxiv.org/html/2605.07134#bib.bib5); Kiesel et al., [2020](https://arxiv.org/html/2605.07134#bib.bib2)) and content extraction (Barbaresi, [2021](https://arxiv.org/html/2605.07134#bib.bib3); Liu et al., [2025a](https://arxiv.org/html/2605.07134#bib.bib48)) methods target information retrieval or content analysis, not the functional organization that agent observation requires. Its construction therefore demands learning how web pages are functionally organized across diverse page layouts.

We address this challenge with Region4Web, a framework that constructs region-level observation from the AXTree through two stages. Hierarchical decomposition classifies each parent-child edge as merge or cut in a single bottom-up traversal, and the subtrees formed by merged edges constitute the functional regions of the page. Semantic abstraction then interprets each region along two orthogonal dimensions: a purpose that identifies what the region is for, and a state summary that captures its current actionable context. Since both stages run at every page during agent execution, they are realized as small dedicated models. The knowledge of how pages are functionally organized is implicit in the AXTree and cannot be derived by rule, so these models are trained on annotations from a proprietary LLM covering diverse real-world websites.

Moreover, deploying Region4Web in web environments requires keeping its region-level observation compact while preserving the page state understanding it supports, which motivates PageDigest, a web-specific inference pipeline that maintains a compact digest of the agent’s observation across steps within each page. Upon entering a new page, PageDigest selects task-relevant regions and exposes them as AXTree subtrees alongside the non-selected regions’ abstractions, preserving element-level granularity for the action space within the page’s structural information. Within the same page, PageDigest tracks observation transitions across steps, rather than reconstructing the full observation at every step. PageDigest shares the actor agent’s backbone LLM and operates solely on the observation space, making it directly applicable to diverse web agents.

On the WebArena (Zhou et al., [2024](https://arxiv.org/html/2605.07134#bib.bib16)) benchmark, PageDigest substantially reduces observation length across four backbone LLMs and two established agent methods, with the reduction holding consistently regardless of backbone capacity. PageDigest also improves overall task success rate across backbones, demonstrating that region-level observation strengthens page state understanding regardless of backbone capacity. Ablations confirm that Region4Web and PageDigest make distinct contributions, with Region4Web alone supporting page state understanding while PageDigest delivers it compactly across steps.

Our contributions are summarized as follows.

- We propose Region4Web, a framework that reorganizes the AXTree into functional regions through hierarchical decomposition and semantic abstraction, exposing the page’s functional organization as the basis for web agents’ page state understanding.
- We propose PageDigest, a web-specific inference pipeline that delivers each page’s region-level observation to the actor agent as a compact digest that persists across steps, reducing observation length while preserving task success.
- We evaluate Region4Web and PageDigest on the WebArena benchmark, where PageDigest substantially reduces observation length while improving overall task success rate, regardless of backbone capacity.

## 2 Preliminary Analysis

![Refer to caption](https://arxiv.org/html/2605.07134v1/x2.png)
Figure 2(a): Distribution of LCA depth ratio for consecutive action pairs against the random baseline.
![Refer to caption](https://arxiv.org/html/2605.07134v1/x3.png)
Figure 2(b): Distribution of DOM change ratio across within-page steps, where 52.9% exhibit zero change.

We analyze action traces and observation transitions to inform two design questions about observation in web environments. Section [2.1](https://arxiv.org/html/2605.07134#S2.SS1) examines whether the agent’s actions are localized within the page structure during a task, motivating the unit at which observation should be constructed within a single step. Section [2.2](https://arxiv.org/html/2605.07134#S2.SS2) examines how much the observation changes as the agent acts within a page, motivating the question of whether observation should be reconstructed at every step.

To answer these questions, we use the Mind2Web dataset (Deng et al., [2023](https://arxiv.org/html/2605.07134#bib.bib15)), which provides 2,350 tasks with per-action ground-truth annotations across 137 real-world websites, with dataset selection criteria detailed in Appendix [C](https://arxiv.org/html/2605.07134#A3). Each page is represented as a DOM tree with an average of 2,473 nodes. The dataset contains 15,394 consecutive action pairs, of which 12,009 (78.0%) occur within the same page and the remaining 22.0% involve page navigation that entirely replaces the observation. Our analysis focuses on same-page pairs, where observation construction and update are at issue.

### 2.1 Consecutive Actions Are Localized within Page Structure

#### Only a negligible fraction of elements on a page are targeted during a task.

While each page contains thousands of DOM nodes, the number of actions performed on it during a task has a median of 6 and a 90th percentile of 13. Since each action targets exactly one element, the elements ever acted upon constitute a negligible fraction of the page. The full page is thus dominated by elements irrelevant to the task, motivating selection of task-relevant content.

#### Consecutive actions are structurally co-located within the page.

We measure the lowest common ancestor (LCA) depth ratio for consecutive action pairs, computed as the depth of the LCA of the two target elements divided by the maximum depth of the DOM tree. A higher value indicates that the two elements are situated within a tighter subtree. As Figure [2(a)](https://arxiv.org/html/2605.07134#S2.F2.sf1) shows, consecutive action pairs yield a median LCA depth ratio of 0.48, with 81.7% exceeding the random baseline median of 0.22. Consecutive actions thus concentrate within localized subtrees rather than spanning the page, indicating that the region serves as a natural unit for observation construction.
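The LCA depth ratio can be sketched in a few lines. The parent-pointer encoding of the DOM, the node names, and the tiny example tree below are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the LCA depth ratio metric (Section 2.1). The tree is encoded
# as a dict mapping each node id to its parent (root maps to None); this
# representation and the node names are assumptions for illustration.

def depth(node, parent):
    """Number of edges from the root down to `node`."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d

def lca_depth_ratio(a, b, parent, max_depth):
    """Depth of LCA(a, b) divided by the maximum depth of the tree."""
    ancestors = set()
    n = a
    while n is not None:          # collect all ancestors of a (incl. a)
        ancestors.add(n)
        n = parent[n]
    n = b
    while n not in ancestors:     # walk up from b until the paths meet
        n = parent[n]
    return depth(n, parent) / max_depth

# Tiny example: root -> {nav, main}, main -> {card1, card2}, max depth 2.
parent = {"root": None, "nav": "root", "main": "root",
          "card1": "main", "card2": "main"}
# Consecutive actions on card1 and card2 share the LCA `main` (depth 1),
# so the ratio is 1/2 = 0.5 -- a tight subtree.
print(lca_depth_ratio("card1", "card2", parent, 2))  # 0.5
```

A pair spanning unrelated parts of the page (e.g., `nav` and `card1`) meets only at the root, giving a ratio of 0.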

### 2.2 Within-Page Observation Undergoes Marginal Change across Consecutive Steps

For each step within a page, we measure the change ratio, the proportion of DOM elements added or removed by the action. As Figure [2(b)](https://arxiv.org/html/2605.07134#S2.F2.sf2) shows, 52.9% of steps exhibit zero change, and 74.4% remain below 5%. Where changes occur, they reflect minor DOM modifications such as dropdown expansion or tooltip appearance. Steps exceeding 90% change account for only 2.5%, attributed to client-side routing within single-page applications. Reconstructing the full observation at every step is therefore unnecessary, and tracking only the incremental changes within each page can avoid this redundancy.
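One plausible reading of the change ratio, sketched over node-id sets; the set encoding and the choice of the pre-action DOM size as denominator are assumptions for illustration.

```python
# Sketch of the per-step change ratio (Section 2.2): the fraction of DOM
# elements added or removed by an action. Snapshots are modeled as sets of
# node ids, and the ratio is normalized by the pre-action DOM size -- both
# are assumptions, not the paper's exact definition.

def change_ratio(before: set, after: set) -> float:
    """(|added| + |removed|) / |before|."""
    added = after - before
    removed = before - after
    return (len(added) + len(removed)) / len(before)

page = {f"node{i}" for i in range(1000)}
# A dropdown expansion adds 5 option nodes and removes none:
expanded = page | {"opt1", "opt2", "opt3", "opt4", "opt5"}
print(change_ratio(page, expanded))  # 0.005 -- a marginal change
```

Under this definition, a full client-side route change that replaces nearly every node would yield a ratio near (or above) 1, matching the small tail of high-change steps reported above.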

## 3 Region4Web

![Refer to caption](https://arxiv.org/html/2605.07134v1/x4.png)
Figure 3: Overview of the Region4Web inference process.

Section [2.1](https://arxiv.org/html/2605.07134#S2.SS1) shows that regions are natural units for observation. We propose Region4Web, a two-stage framework for constructing region-level observation from the AXTree of a web page.

### 3.1 Problem Formulation

At each step, a web agent perceives the current page state through an observation space and selects an action from an action space. The observation can be represented as a tree $\mathcal{T}=(V,E)$, where each node $v\in V$ corresponds to an element on the page with attributes such as role, name, and value. In the prevailing element-level approach, the agent operates over $V$ directly, leaving the page’s functional organization implicit in $\mathcal{T}$. Region-level observation makes this organization explicit through a partition $\mathcal{R}=\{R_1,\ldots,R_m\}$ of $V$ into functional regions, where each $R_i$ forms a subtree of $\mathcal{T}$. Each region is associated with a purpose $p_i$ that identifies what the region is for and a state summary $s_i$ that captures its current actionable context. Region4Web learns to produce both $\mathcal{R}$ and the associated $\{(p_i,s_i)\}$ from $\mathcal{T}$.

### 3.2 Hierarchical Decomposition

To construct region-level observation, $\mathcal{T}$ must be decomposed into a region partition $\mathcal{R}$. We instantiate $\mathcal{T}$ as the page’s AXTree, a browser-generated representation that encodes each element’s accessibility semantics in a hierarchical structure. Since each $R_i\in\mathcal{R}$ forms a subtree of $\mathcal{T}$, the partition is fully determined by classifying each edge in $E$ as merge or cut. Removing the cut edges from $\mathcal{T}$ splits the tree into subtrees, each of which constitutes a region in $\mathcal{R}$. Since the root has no parent edge to classify, its subtree constitutes the final region in $\mathcal{R}$ after the bottom-up traversal completes.
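Recovering the partition from a set of cut edges is mechanical; a minimal sketch, assuming a child-list encoding of the tree (the `children` dict, node names, and example cuts are illustrative):

```python
# Sketch: derive the region partition R from a cut-edge set. Every node
# reachable from a region root without crossing a cut edge joins that
# region; each cut edge starts a new region at its child.

def regions_from_cuts(children, root, cuts):
    regions = []
    pending = [root]                 # roots of regions yet to be collected
    while pending:
        r = pending.pop()
        members, frontier = [], [r]
        while frontier:
            n = frontier.pop()
            members.append(n)
            for c in children.get(n, []):
                if (n, c) in cuts:
                    pending.append(c)    # c roots a new region
                else:
                    frontier.append(c)   # c stays in the current region
        regions.append(sorted(members))
    return regions

children = {"root": ["nav", "main"], "main": ["list", "aside"],
            "list": ["card1", "card2"]}
cuts = {("root", "nav"), ("main", "aside")}
print(regions_from_cuts(children, "root", cuts))
```

With these two cut edges, `nav` and `aside` each form their own region, while `root`, `main`, `list`, and the two cards remain one region.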

Decomposition determines region boundaries from structural cues alone, whereas semantic abstraction interprets each region’s purpose and actionable state. Each node $v$ is represented by a feature vector $\mathbf{x}_v$ that combines a learned role embedding with numeric features encoding the node’s structural information in $\mathcal{T}$. At each internal node $v$ with children $c_1,\ldots,c_k$ and their respective representations $\mathbf{r}_{c_1},\ldots,\mathbf{r}_{c_k}$, an EdgeClassifier determines whether each child should be separated, using the sibling mean $\bar{\mathbf{r}}=\frac{1}{k}\sum_j\mathbf{r}_{c_j}$ as context:

$$\hat{y}_{v,c_i}=\textsc{EdgeClassifier}(\mathbf{x}_v,\;\mathbf{r}_{c_i},\;\bar{\mathbf{r}}).\qquad(1)$$

Edges with $\hat{y}_{v,c_i}\geq\tau$ are cut, while the remaining children $\mathcal{M}_v$ are merged into the parent’s region. RegionEncoder then computes the parent’s representation from $\mathbf{x}_v$ and the merged children $\mathcal{M}_v$:

$$\mathbf{r}_v=\textsc{RegionEncoder}\!\left(\mathbf{x}_v,\;\frac{1}{|\mathcal{M}_v|}\sum_{c_j\in\mathcal{M}_v}\mathbf{r}_{c_j}\right),\qquad(2)$$

ensuring that the parent’s representation reflects only the children that belong to its region. For leaf nodes, since no children exist, $\mathcal{M}_v$ is empty and the aggregation term reduces to $\mathbf{0}$.

The entire procedure is carried out in a single bottom-up traversal, where each node’s representation is computed only after all its children’s boundary decisions are resolved, so that boundary decisions propagate upward through the hierarchy without requiring an additional pass. The full procedure is detailed in Algorithm [1](https://arxiv.org/html/2605.07134#alg1).
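The control flow of this single pass can be sketched as follows. EdgeClassifier and RegionEncoder are learned modules in the paper; here they are stubbed with toy scalar functions purely so the merge/cut/encode flow of Eqs. (1)-(2) runs end to end, and all names, features, and the threshold value are illustrative assumptions.

```python
# Sketch of the single bottom-up decomposition pass. The two neural modules
# are replaced by toy stand-ins: the "classifier" cuts a child whose scalar
# representation deviates strongly from the sibling mean, and the "encoder"
# just adds the parent feature to the merged-children mean.

TAU = 0.5  # cut threshold (illustrative)

def edge_classifier(x_parent, r_child, r_sibling_mean):
    return abs(r_child - r_sibling_mean)        # toy cut score

def region_encoder(x_parent, merged_mean):
    return x_parent + merged_mean               # toy aggregation

def decompose(tree, feats, node="root"):
    """Return (representation, cut_edges) for the subtree at `node`.
    Children are fully resolved before the parent is encoded, matching
    the single bottom-up traversal described above."""
    children = tree.get(node, [])
    if not children:                            # leaf: M_v empty, agg = 0
        return feats[node], []
    reps, cuts = {}, []
    for c in children:
        reps[c], sub = decompose(tree, feats, c)
        cuts += sub
    mean = sum(reps.values()) / len(reps)       # sibling mean, Eq. (1)
    merged = []
    for c in children:
        if edge_classifier(feats[node], reps[c], mean) >= TAU:
            cuts.append((node, c))              # child starts a new region
        else:
            merged.append(reps[c])
    agg = sum(merged) / len(merged) if merged else 0.0
    return region_encoder(feats[node], agg), cuts   # Eq. (2)

tree = {"root": ["nav", "main"], "main": ["a", "b"]}
feats = {"root": 0.0, "nav": 2.0, "main": 0.0, "a": 0.1, "b": 0.1}
rep, cuts = decompose(tree, feats)
print(cuts)  # both root edges are cut under the toy scores
```

The recursion guarantees the property stated above: a parent's representation is computed only from children whose own boundary decisions are already fixed.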

### 3.3 Semantic Abstraction

The region partition $\mathcal{R}$ determines which elements belong together, but the semantic meaning of each region remains implicit in its subtree. A fine-tuned language model receives the preprocessed AXTree subtree of each region and produces a purpose $p_i$ and a state summary $s_i$, which address two orthogonal dimensions. Purpose captures what the region is for, serving as the basis for identifying each region. State summary interprets the region’s current actionable context, conveying what information and actionable elements are available within it.
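To make the two dimensions concrete, a hypothetical search region might be abstracted as below; the subtree text and both outputs are invented for illustration and are not outputs of the paper's model.

```python
# Hypothetical input/output pair for one region. Purpose identifies what
# the region is for; the state summary describes its current actionable
# context. All strings here are invented examples.
region_subtree = (
    "search\n"
    "  textbox 'Search products' value=''\n"
    "  button 'Search'"
)
abstraction = {
    "purpose": "product search for narrowing the catalog by keyword",
    "state_summary": "query box is empty; the 'Search' button submits the query",
}
print(abstraction["purpose"])
```

Note how the purpose would stay fixed while the agent types a query, whereas the state summary would change -- the distinction PageDigest later exploits.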

### 3.4 Training

Since both stages run on every page during agent execution, they are realized as small dedicated models: a decomposition model for structural boundary decisions and a small language model for per-region abstraction. The knowledge of how web pages are functionally organized is implicit in $\mathcal{T}$ and cannot be derived by rule, so these models are trained on annotations from a proprietary LLM. We employ `gpt-5-mini-2025-08-27` (OpenAI, [2025](https://arxiv.org/html/2605.07134#bib.bib51)) as the annotator to construct the training dataset. The raw AXTree is preprocessed into a textual form that retains the elements an agent can perceive and act on, along with the structural grouping among them. Since Region4Web operates sequentially, with decomposition producing $\mathcal{R}$ that abstraction then interprets, the training dataset should be constructed to follow this same dependency.

#### Dataset Construction.

Source pages are collected from 500 real-world websites sampled from the Tranco top-1M ranking list (snapshot from April 1, 2026; [https://tranco-list.eu/list/QWQ94/1000000](https://tranco-list.eu/list/QWQ94/1000000)), a research-oriented ranking of the most popular websites. These websites span 10 domain categories (e.g., Technology & Computing, Shopping) derived from the IAB Content Taxonomy 3.1 ([https://iabtechlab.com/standards/content-taxonomy](https://iabtechlab.com/standards/content-taxonomy)), a standard classification of web content, for their relevance to web agent tasks. For each website, up to 100 page URLs are sampled using a score computed from sitemap metadata, yielding 21,974 pages from 253 websites whose AXTrees are successfully extracted. The annotator then processes each page in three steps: it first decomposes the AXTree into a region partition, then verifies the partition to identify incorrectly formed regions, and finally produces a purpose and a state summary for each verified region. Only pages whose partitions contain no invalid region are retained, yielding 2,052 pages and 45,147 regions. Pages excluded by this filter are dominated by real-world website noise that prevents coherent region organization rather than by annotator capacity, so the retained pages carry reliable annotations.

#### Decomposition training.

The verified region partitions are converted into binary edge labels over $E$, where each edge is labeled as cut if its parent and child belong to different regions and as merge otherwise. The model is trained with teacher forcing, where ground-truth labels determine the cut and merge decisions during the bottom-up traversal so that each node’s representation is computed from correctly partitioned children. Since merge edges vastly outnumber cut edges, focal loss with $\alpha=0.75$ and $\gamma=2.0$ is applied to address the class imbalance (Lin et al., [2017](https://arxiv.org/html/2605.07134#bib.bib8); Ma et al., [2025](https://arxiv.org/html/2605.07134#bib.bib38)).
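A minimal scalar sketch of binary focal loss with these hyperparameters, written with plain floats for clarity (actual training would use a tensor library):

```python
import math

def focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Binary focal loss for a predicted cut probability p and label y
    (1 = cut, 0 = merge). Rare positives (cuts) are up-weighted by alpha,
    and easy, well-classified examples are down-weighted by (1 - p_t)^gamma,
    countering the merge/cut imbalance."""
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    a_t = alpha if y == 1 else 1.0 - alpha  # class weight
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, abundant merge example contributes far less than a hard,
# rare cut example:
easy_merge = focal_loss(0.05, 0)  # confident merge prediction
hard_cut = focal_loss(0.30, 1)    # under-confident cut prediction
print(easy_merge < hard_cut)      # True
```

With $\gamma=0$ and $\alpha=0.5$ this reduces (up to a constant factor) to standard cross-entropy; the chosen $\alpha=0.75$ tilts the loss toward the rare cut class.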

#### Abstraction training.

Qwen3-0.6B (Team, [2025](https://arxiv.org/html/2605.07134#bib.bib45)) is fine-tuned on the 45,147 region annotations from the verified pages, with each example pairing a region’s preprocessed subtree as input with the corresponding purpose and state summary as output. A small model is chosen so that abstraction can be invoked once per region without dominating inference latency.

Further details on AXTree preprocessing, dataset construction, and Region4Web implementation are provided in Appendices [D](https://arxiv.org/html/2605.07134#A4), [E](https://arxiv.org/html/2605.07134#A5), and [F](https://arxiv.org/html/2605.07134#A6), respectively.

## 4 PageDigest

![Refer to caption](https://arxiv.org/html/2605.07134v1/x5.png)
Figure 4: Overview of PageDigest.

Region4Web produces region-level observation for a given page, but deploying it in web environments requires focusing the observation on what is task-relevant and tracking how pages change as the agent acts. We propose PageDigest, a web-specific inference pipeline that constructs a page digest upon entering a new page through region selection, retains it within the page, and updates it through observation transition tracking across steps.

### 4.1 Task-Relevant Region Selection

Upon entering a new page, Region4Web produces the region partition $\mathcal{R}$ and the associated $\{(p_i,s_i)\}$ for the page. The actor agent’s backbone LLM takes the abstractions $\{(p_i,s_i)\}$ together with the task instruction and the action history taken so far, and selects the task-relevant regions. The abstractions specify each region individually and collectively convey the page’s overall functional structure and current state, from which the model infers where the task currently stands and what is required next. Selected regions are exposed to the actor agent as their AXTree subtrees with their purposes, preserving element-level granularity for the action space, while non-selected regions are represented by their purposes alone, retaining the page’s overall structural information.
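The resulting digest layout, selected regions as full subtrees and the rest as purposes only, might be assembled along these lines. The `Region` structure, the bracketed text format, and the example content are assumptions, not the paper's exact serialization.

```python
# Sketch of per-page digest assembly after region selection (Section 4.1):
# selected regions keep their AXTree subtrees (element-level granularity
# for actions), non-selected regions appear by purpose alone.
from dataclasses import dataclass

@dataclass
class Region:
    rid: int
    purpose: str
    state_summary: str
    subtree_text: str  # preprocessed AXTree subtree

def build_digest(regions, selected_ids):
    lines = []
    for r in regions:
        if r.rid in selected_ids:
            lines.append(f"[Region {r.rid}] {r.purpose}\n{r.subtree_text}")
        else:
            lines.append(f"[Region {r.rid}] {r.purpose}")
    return "\n".join(lines)

regions = [
    Region(1, "site navigation", "links to 5 sections",
           "navigation\n  link 'Home'\n  link 'Orders'"),
    Region(2, "product search", "query box is empty",
           "search\n  textbox 'Search'\n  button 'Go'"),
]
print(build_digest(regions, selected_ids={2}))
```

Only region 2's elements are actionable in the output, while region 1 still signals its presence and purpose, preserving the page's overall structure at low token cost.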

### 4.2 Page-Aware Observation Transition Management

Section [2.2](https://arxiv.org/html/2605.07134#S2.SS2) shows that within-page observation changes only marginally between consecutive steps. PageDigest therefore tracks observation transitions across steps, rather than reinvoking Region4Web at each step. During transition management, only the region purposes are referenced, since each purpose describes what its region is for and remains stable across steps within the page. State summaries, in contrast, describe each region’s current actionable context, making them useful for region selection at page entry but less suitable for page-aware observation transition management.

At each step, observation transitions are identified by comparing the current AXTree against its state at page entry, yielding added, removed, and modified nodes. Removed and modified nodes update the AXTree constructed upon entering the page by deleting nodes or changing their values under the existing region purposes. Added nodes, in contrast, are listed as a separate group that retains the structural grouping among them, since merging them into existing regions could shift those regions’ purposes. The actor agent thus receives the current observation across steps within the page, preserving continuity of the page state. When the agent navigates to a new page, signaled by a URL change, Region4Web is invoked to produce the new page’s $\mathcal{R}$ and $\{(p_i,s_i)\}$, and region selection proceeds.
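The three-way comparison can be sketched over simple snapshots; modeling each AXTree snapshot as a `{node_id: value}` dict is an assumption for illustration.

```python
# Sketch of within-page transition tracking (Section 4.2): compare the
# current AXTree snapshot against the page-entry snapshot to yield added,
# removed, and modified nodes.

def diff_snapshots(entry, current):
    added = {k: v for k, v in current.items() if k not in entry}
    removed = [k for k in entry if k not in current]
    modified = {k: current[k] for k in entry
                if k in current and current[k] != entry[k]}
    return added, removed, modified

entry = {"n1": "button 'Add to cart'", "n2": "text 'In stock'"}
current = {"n1": "button 'Add to cart'", "n2": "text 'Out of stock'",
           "n3": "tooltip 'Ships in 2 days'"}
added, removed, modified = diff_snapshots(entry, current)
print(added)     # {'n3': "tooltip 'Ships in 2 days'"}
print(removed)   # []
print(modified)  # {'n2': "text 'Out of stock'"}
```

Removed and modified nodes would be applied in place under their existing region purposes, while `n3` would be listed in the separate added-nodes group, as described above.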

PageDigest shares the actor agent’s backbone LLM and operates solely on the observation space, requiring no additional model and leaving the actor agent’s policy unmodified, making it directly applicable to diverse web agents. Moreover, since region selection is performed by the actor agent’s backbone and depends on its capability, the actor agent is given an additional `view_all` action that reveals all regions in their full AXTree subtree form for the remainder of the page, providing a fallback when the selected regions are insufficient.

## 5 Experiments

### 5.1 Experimental Setup

#### Evaluation benchmark.

We evaluate on WebArena (Zhou et al., [2024](https://arxiv.org/html/2605.07134#bib.bib16)), a comprehensive web agent benchmark that spans five distinct domains: e-commerce, social forum, collaborative development, content management, and map services. Its 812 long-horizon tasks, each allowing up to 30 steps, cover diverse interaction patterns. Since the original evaluator relies on `gpt-4-1106-preview` for fuzzy answer matching, which has since been deprecated, we replace it with GPT-4o.

#### Actor agents.

We evaluate across diverse backbone LLMs and actor agent methods to verify that PageDigest consistently reduces observation length while preserving performance, regardless of the model’s capability or the agent’s design. The backbone LLMs span two proprietary and two open-source models: GPT-5.1 (OpenAI, [2025](https://arxiv.org/html/2605.07134#bib.bib51)), Gemini 3.1 Flash-Lite (AI, [2026](https://arxiv.org/html/2605.07134#bib.bib54)), Deepseek-V3.2 (DeepSeek-AI, [2025](https://arxiv.org/html/2605.07134#bib.bib49)), and Qwen3.5-27B (Team, [2025](https://arxiv.org/html/2605.07134#bib.bib45)). Each backbone selects the next action given the interaction history and current observation at each step (Yao et al., [2023](https://arxiv.org/html/2605.07134#bib.bib12)). We further evaluate on two established agent methods widely adopted in WebArena evaluation. SteP (Sodhi et al., [2024](https://arxiv.org/html/2605.07134#bib.bib17)) dynamically composes human-designed LLM policies tailored to WebArena tasks through a stack-based Markov decision process. AgentOccam (Yang et al., [2025](https://arxiv.org/html/2605.07134#bib.bib24)) refines the observation and action spaces to align them with the underlying LLM’s pretrained capabilities. Since AgentOccam runs with its own space alignment, in the PageDigest configuration we replace its observation space alignment with PageDigest while retaining the action space alignment, isolating the effect of region-level observation. We evaluate SteP and AgentOccam with GPT-4o as the backbone, matching the GPT-4 family under which both methods were originally developed.

#### Implementation details.

All experiments are conducted in the BrowserGym environment, with Map domain tasks routed to the live OpenStreetMap service ([https://www.openstreetmap.org](https://www.openstreetmap.org/)), following (Chae et al., [2025](https://arxiv.org/html/2605.07134#bib.bib23); Zhang et al., [2026b](https://arxiv.org/html/2605.07134#bib.bib52)). For reproducibility, open-source models are run at temperature 0 with thinking mode disabled where applicable, while proprietary models retain their default configuration. We define observation length as the token count of the observation provided to the agent at each step. All token counts reported in the experiments are measured under the OpenAI `o200k_base` tokenizer. All prompts are provided in Appendix [H](https://arxiv.org/html/2605.07134#A8).

Table 1: WebArena success rate (%) across domains, with the average observation token length per step reported in the Obs. length column. Each actor agent is reported with and without PageDigest.

### 5.2 Main Results

#### PageDigest improves overall task success rate while reducing observation length, regardless of backbone capacity.

Across the four backbones in Table [1](https://arxiv.org/html/2605.07134#S5.T1), PageDigest reduces observation length by 43% on average, from 6,437 to 3,671 tokens, and improves task success rate by 2.3%p on average. The improvement holds across backbones of varying capacity, suggesting that region-level observation provides a complementary signal for page state understanding that benefits backbones independent of their strength.

#### PageDigest extends to established agent methods through the observation space\.

Applying PageDigest to SteP and AgentOccam reduces observation length by 50% and 16%, respectively, with comparable task success rates\. For AgentOccam, replacing its observation space alignment with PageDigest yields comparable performance, showing that region\-level observation can replace element\-level alignment for action selection\. Since PageDigest operates solely on the observation space and shares the actor's backbone, it applies to diverse web agents, with task success scaling with backbone capacity\.

### 5\.3 Further Analysis

We further analyze the contributions of Region4Web and PageDigest using GPT\-5\.1, with case studies of Region4Web's decomposition and abstraction in Appendix [G](https://arxiv.org/html/2605.07134#A7)\.

Table 2: Ablation on WebArena\-Lite\. The Obs\. length column matches Table [1](https://arxiv.org/html/2605.07134#S5.T1)\.

#### Region4Web improves page state understanding, while PageDigest keeps it compact across steps\.

For the ablation study, we use WebArena\-Lite \(Lee et al\., [2025](https://arxiv.org/html/2605.07134#bib.bib29); Liu et al\., [2025b](https://arxiv.org/html/2605.07134#bib.bib22)\), a 165\-task subset of WebArena\. Table [2](https://arxiv.org/html/2605.07134#S5.T2) compares four configurations: the backbone alone, Region4Web alone, an element\-level variant of PageDigest that omits Region4Web and replaces the region selection stage with self\-contextualization as in LCoW \(Lee et al\., [2025](https://arxiv.org/html/2605.07134#bib.bib29)\), and full PageDigest\. Region4Web alone improves task success rate from 48\.5% to 50\.3% with comparable observation length, while the element\-level variant lowers it to 46\.1%, showing that region\-level observation supports the actor agent where element\-level processing instead hinders it\. PageDigest reduces observation length by 30%, comparable to the element\-level variant's 26% reduction, while still achieving the highest task success rate among the configurations, improving over the backbone by 5\.4%p\. Together, these results show that page state understanding and compact persistence are complementary: Region4Web preserves functional regions for page state understanding, and PageDigest keeps them compact across steps\.

![Refer to caption](https://arxiv.org/html/2605.07134v1/x6.png)
\(a\) Distribution of step\-scale observation length\.
![Refer to caption](https://arxiv.org/html/2605.07134v1/x7.png)
\(b\) Distribution of task\-scale total observation tokens\.

#### PageDigest preserves its step\-scale reduction at the task\-scale despite the auxiliary inference it adds\.

As Figure [5\(a\)](https://arxiv.org/html/2605.07134#S5.F5.sf1) shows, PageDigest reduces the median observation length by 33%, from 3,077 to 2,066 tokens, and as Figure [5\(b\)](https://arxiv.org/html/2605.07134#S5.F5.sf2) shows, the median cumulative observation across a task drops by 25%, from 26,707 to 19,944 tokens\. The task total comprises more than the actor observations alone, since region selection inputs are added at each entry into a new page, and view\_all expansions are added whenever the fallback is triggered\. Decomposing the task total, the actor observation accounts for 73\.9%, region selection for 19\.5%, and view\_all for 6\.6%\. The region selection stage stays cheap because it operates over the region\-level abstractions \{\(p_i, s_i\)\} rather than the element\-level AXTree, even when invoked 4\.8 times on average per task, and view\_all is invoked sparingly, in only 38\.1% of tasks with an average of 0\.64 calls per task\. The auxiliary overhead therefore stays bounded by page entries, and the step\-scale compactness carries through to the task\-scale\.
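The task\-scale decomposition above can be verified with simple arithmetic, with all numbers copied from the text and `pct_drop` an illustrative helper:

```python
# The three components of the task total should account for ~100%.
shares = {"actor observation": 73.9, "region selection": 19.5, "view_all": 6.6}

def pct_drop(before: float, after: float) -> float:
    return 100.0 * (before - after) / before

step_drop = pct_drop(3077, 2066)     # median per-step reduction, ≈ 33%
task_drop = pct_drop(26707, 19944)   # median per-task reduction, ≈ 25%
```

The gap between the two reductions \(33% vs\. 25%\) is exactly the auxiliary inference that region selection and view\_all add at the task scale\.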

#### PageDigest’s failures lie largely outside its own design\.

We randomly sample 50 failed task trajectories under PageDigest on WebArena, 10 from each domain, trace each failure to its triggering step, and label every PageDigest stage, as shown in Figure [6](https://arxiv.org/html/2605.07134#S5.F6)\.

![Refer to caption](https://arxiv.org/html/2605.07134v1/x8.png)
Figure 6: Failure mode distribution under PageDigest on WebArena\.

Decomposition and abstraction errors \(2\.0% each\) together account for only a small fraction, indicating that Region4Web reliably decomposes and abstracts regions on most pages\. Selection errors \(10\.0%\) reflect the backbone LLM missing task\-relevant regions despite Region4Web's informative abstractions\. Transition management introduces no errors, as it deterministically compares the current AXTree against its state at page entry\. Actor\-side failures in the backbone's action selection account for 90\.0%, with environment errors outside the pipeline adding 16\.0%\. PageDigest thus operates as designed, with backbone\-attributable failures \(82\.0%\) dominating PageDigest\-attributable regressions \(8\.0%\) under multi\-cause attribution\.
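Because a single failed trajectory can carry multiple stage labels under this attribution scheme, the per\-stage percentages can sum past 100%\. A minimal sketch of such a multi\-cause tally, using hypothetical labels rather than the paper's annotations:

```python
from collections import Counter

# Each trajectory is annotated with the set of stages implicated in its
# failure; percentages are taken over trajectories, not labels, so they
# need not sum to 100%.
def stage_percentages(labels_per_traj):
    n = len(labels_per_traj)
    counts = Counter(l for labels in labels_per_traj for l in set(labels))
    return {stage: 100.0 * c / n for stage, c in counts.items()}
```

For example, four trajectories labeled `{"actor"}`, `{"actor", "environment"}`, `{"selection"}`, `{"actor"}` yield actor 75%, environment 25%, selection 25%, summing to 125%\.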

## 6 Related Work

#### Web Page Structure Understanding

has been studied for information retrieval and content analysis, treating web pages as content to be processed rather than as observations for agents\. Web page segmentation partitions pages into visually or structurally coherent blocks, exemplified by VIPS \(Cai et al\., [2003](https://arxiv.org/html/2605.07134#bib.bib39)\)\. Subsequent work has focused on evaluation methodology \(Kiesel et al\., [2020](https://arxiv.org/html/2605.07134#bib.bib2), [2021](https://arxiv.org/html/2605.07134#bib.bib4)\) and macro\-structural labels such as header, main content, and footer \(Gerber et al\., [2025](https://arxiv.org/html/2605.07134#bib.bib5)\)\. Content extraction separates main content from surrounding noise through rule\-based heuristics \(Barbaresi, [2021](https://arxiv.org/html/2605.07134#bib.bib3)\) or language models \(Chen et al\., [2025](https://arxiv.org/html/2605.07134#bib.bib50); Liu et al\., [2025a](https://arxiv.org/html/2605.07134#bib.bib48); Wang et al\., [2025](https://arxiv.org/html/2605.07134#bib.bib43)\)\. In contrast, our Region4Web constructs region\-level observation for web agents by decomposing the page into functional regions and making each region's purpose explicit for action selection\.

#### Observation Processing in Web Agents

has explored strategies to reduce observation length while preserving task\-relevant information\. A dominant line focuses on element selection \(Moskaleva et al\., [2025](https://arxiv.org/html/2605.07134#bib.bib47)\), where Prune4Web \(Zhang et al\., [2026a](https://arxiv.org/html/2605.07134#bib.bib33)\) filters elements via LLM\-generated keyword matching programs, and LCoW \(Lee et al\., [2025](https://arxiv.org/html/2605.07134#bib.bib29)\) trains a contextualization module that extracts task\-relevant elements and annotates them contextually\. Orthogonal to selection, Beyond Pixels \(Schiepanski and Piël, [2025](https://arxiv.org/html/2605.07134#bib.bib44)\) downsamples the DOM tree while preserving its hierarchical structure\. AgentOccam \(Yang et al\., [2025](https://arxiv.org/html/2605.07134#bib.bib24)\) reformulates elements into markdown and identifies pivotal nodes to retain across steps\. Multimodal web agents use screenshots as additional input \(Guo et al\., [2026](https://arxiv.org/html/2605.07134#bib.bib32); He et al\., [2024](https://arxiv.org/html/2605.07134#bib.bib19); Zheng et al\., [2024](https://arxiv.org/html/2605.07134#bib.bib18)\), while recent GUI agents introduce visually decomposed region structures \(Fan et al\., [2024](https://arxiv.org/html/2605.07134#bib.bib21); Singh et al\., [2025](https://arxiv.org/html/2605.07134#bib.bib27)\)\. These approaches provide visual or layout cues for understanding the page state, but they do not define observation units by shared functional purpose, leaving implicit which elements form functional regions and what purposes those regions serve\. Our work treats observation granularity as a design choice, shifting from element\-level to region\-level observation and deploying it through a web\-specific inference pipeline\.

#### Tree\-Structured Representation Learning

has been studied across domains, from syntactic parse trees in natural language processing \(Tai et al\., [2015](https://arxiv.org/html/2605.07134#bib.bib6)\), to abstract syntax trees in source code analysis \(Mou et al\., [2016](https://arxiv.org/html/2605.07134#bib.bib7); Wang et al\., [2021](https://arxiv.org/html/2605.07134#bib.bib37); Zhang et al\., [2019](https://arxiv.org/html/2605.07134#bib.bib1)\), to DOM trees in web page understanding \(Wang et al\., [2022](https://arxiv.org/html/2605.07134#bib.bib10); Yeoh and Wang, [2022](https://arxiv.org/html/2605.07134#bib.bib11)\)\. These methods typically compute representations over a fixed tree structure and use the resulting node or tree representations for downstream prediction\. In this design, the tree structure is given in advance, and representation learning does not change which children belong to each parent\. Region partitioning breaks this independence, as boundary decisions directly alter the set of children a parent must represent\. This boundary\-representation dependency motivates the joint computation in a single bottom\-up traversal that Region4Web adopts\.

## 7 Conclusion

We presented Region4Web and PageDigest, addressing observation granularity as an underexamined design choice for web agents\. Region4Web reorganizes the AXTree into functional regions to support the actor agent’s page state understanding, and PageDigest delivers this region\-level observation as a compact digest that persists across steps\. On the WebArena benchmark, PageDigest substantially reduces observation length while improving overall task success rate across diverse backbone LLMs and established agent methods, demonstrating that region\-level observation can provide a more compact and informative basis for web agent decision making than element\-level processing\. These results show that observation granularity directly affects web agent efficiency\. By separating observation design from model capability and action policy, our work opens a path toward more efficient web agents by rethinking the granularity at which pages are observed\.

## References

- G\. AI \(2026\)Gemini 3\.1 flash\-lite: built for intelligence at scale\.External Links:[Link](https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite)Cited by:[§5\.1](https://arxiv.org/html/2605.07134#S5.SS1.SSS0.Px2.p1.1)\.
- A\. Barbaresi \(2021\)Trafilatura: a web scraping library and command\-line tool for text discovery and extraction\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics: System Demonstrations,External Links:[Link](https://aclanthology.org/2021.acl-demo.15)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p4.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px1.p1.1)\.
- G\. Buscher, E\. Cutrell, and M\. R\. Morris \(2009\)What do you see when you’re surfing?: using eye tracking to predict salient regions of web pages\.Proceedings of the SIGCHI Conference on Human Factors in Computing Systems\.External Links:[Link](https://dl.acm.org/doi/10.1145/1518701.1518705)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p3.1)\.
- D\. Cai, S\. Yu, J\. Wen, and W\. Ma \(2003\)VIPS: a vision\-based page segmentation algorithm\.External Links:[Link](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2003-79.pdf)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p4.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px1.p1.1)\.
- H\. Chae, N\. Kim, K\. T\. Ong, M\. Gwak, G\. Song, J\. Kim, S\. Kim, D\. Lee, and J\. Yeo \(2025\)Web agents with world models: learning and leveraging environment dynamics in web navigation\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2410.13232)Cited by:[§5\.1](https://arxiv.org/html/2605.07134#S5.SS1.SSS0.Px3.p1.1)\.
- Y\. Chen, B\. Xu, X\. Wang, and Z\. Mao \(2025\)An index\-based approach for efficient and effective web content extraction\.External Links:[Link](https://arxiv.org/abs/2512.06641)Cited by:[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px1.p1.1)\.
- DeepSeek\-AI \(2025\)DeepSeek\-v3\.2: pushing the frontier of open large language models\.External Links:[Link](https://arxiv.org/abs/2512.02556)Cited by:[§5\.1](https://arxiv.org/html/2605.07134#S5.SS1.SSS0.Px2.p1.1)\.
- X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su \(2023\)Mind2Web: towards a generalist agent for the web\.InAdvances in Neural Information Processing Systems,External Links:[Link](https://arxiv.org/abs/2306.06070)Cited by:[Appendix C](https://arxiv.org/html/2605.07134#A3.p1.1),[§2](https://arxiv.org/html/2605.07134#S2.p2.1)\.
- A\. Drouin, M\. Gasse, M\. Caccia, I\. H\. Laradji, M\. D\. Verme, T\. Marty, L\. Boisvert, M\. Thakkar, Q\. Cappart, D\. Vazquez, N\. Chapados, and A\. Lacoste \(2024\)WorkArena: how capable are web agents at solving common knowledge work tasks?\.InThe Forty\-First International Conference on Machine Learning,External Links:[Link](https://arxiv.org/abs/2403.07718)Cited by:[Appendix D](https://arxiv.org/html/2605.07134#A4.p1.1)\.
- Y\. Fan, L\. Ding, C\. Kuo, S\. Jiang, Y\. Zhao, X\. Guan, J\. Yang, Y\. Zhang, and X\. E\. Wang \(2024\)Read anywhere pointed: layout\-aware gui screen reading with tree\-of\-lens grounding\.InThe 2024 Conference on Empirical Methods in Natural Language Processing,External Links:[Link](https://arxiv.org/abs/2406.19263)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p3.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px2.p1.1)\.
- J\. Gerber, J\. Saxer, K\. Rabishokr, B\. Kreiner, and A\. Weiler \(2025\)WebClasSeg\-25: a dual\-classified webpage segmentation dataset\.InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval,External Links:[Link](https://dl.acm.org/doi/10.1145/3726302.3730309)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p4.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px1.p1.1)\.
- Y\. Guo, C\. Guo, A\. Sun, H\. He, X\. Yang, Y\. Lu, Y\. Zhang, X\. Guo, D\. Zhang, J\. Liu, J\. Duan, Y\. Xiao, L\. Wen, H\. Xu, and Y\. Dai \(2026\)Web\-cogreasoner: towards knowledge\-induced cognitive reasoning for web agents\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2508.01858)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p1.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px2.p1.1)\.
- H\. He, W\. Yao, K\. Ma, W\. Yu, Y\. Dai, H\. Zhang, Z\. Lan, and D\. Yu \(2024\)WebVoyager: building an end\-to\-end web agent with large multimodal models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics,External Links:[Link](https://arxiv.org/abs/2401.13919)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p1.1),[§1](https://arxiv.org/html/2605.07134#S1.p3.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px2.p1.1)\.
- P\. Huang, X\. Zheng, J\. Lin, Y\. Zhang, J\. Zhou, Z\. Yang, R\. Yuan, Z\. Liu, Y\. Yan, G\. Zhang, and W\. Huang \(2025\)R2D2: remembering, reflecting and dynamic decision making for web agents\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics,External Links:[Link](https://arxiv.org/abs/2503.07675)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p1.1)\.
- M\. Kang, W\. Chen, D\. Han, H\. A\. Inan, L\. Wutschitz, Y\. Chen, R\. Sim, and S\. Rajmohan \(2025\)ACON: optimizing context compression for long\-horizon llm agents\.External Links:[Link](https://arxiv.org/abs/2510.00615)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p1.1)\.
- J\. Kiesel, L\. Meyer, F\. Kneist, B\. Stein, and M\. Potthast \(2020\)Web page segmentation revisited: evaluation framework and dataset\.InProceedings of the 29th ACM International Conference on Information and Knowledge Management,External Links:[Link](https://dl.acm.org/doi/10.1145/3340531.3412782)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p4.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px1.p1.1)\.
- J\. Kiesel, L\. Meyer, F\. Kneist, B\. Stein, and M\. Potthast \(2021\)An empirical comparison of web page segmentation algorithms\.InProceedings of the 43rd European Conference on IR Research,External Links:[Link](https://downloads.webis.de/publications/papers/kiesel_2021a.pdf)Cited by:[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px1.p1.1)\.
- D\. Lee, J\. Lee, K\. Kim, J\. Tack, J\. Shin, Y\. W\. Teh, and K\. Lee \(2025\)Learning to contextualize web pages for enhanced decision making by llm agents\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2503.10689)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p1.1),[§5\.3](https://arxiv.org/html/2605.07134#S5.SS3.SSS0.Px1.p1.1),[Table 2](https://arxiv.org/html/2605.07134#S5.T2.4.4.3.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px2.p1.1)\.
- S\. Li, Y\. Zhao, R\. Varma, O\. Salpekar, P\. Noordhuis, T\. Li, A\. Paszke, J\. Smith, B\. Vaughan, P\. Damania, and S\. Chintala \(2020\)PyTorch distributed: experiences on accelerating data parallel training\.Proc\. VLDB Endow\.\.External Links:[Link](https://arxiv.org/abs/2006.15704)Cited by:[§F\.2](https://arxiv.org/html/2605.07134#A6.SS2.SSS0.Px1.p1.4)\.
- T\. Lin, P\. Goyal, R\. Girshick, K\. He, and P\. Dollár \(2017\)Focal loss for dense object detection\.InProceedings of the IEEE international conference on computer vision,External Links:[Link](https://arxiv.org/abs/1708.02002)Cited by:[§3\.4](https://arxiv.org/html/2605.07134#S3.SS4.SSS0.Px2.p1.3)\.
- E\. Z\. Liu, K\. Guu, P\. Pasupat, T\. Shi, and P\. Liang \(2018\)Reinforcement learning on web interfaces using workflow\-guided exploration\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/1802.08802)Cited by:[Appendix C](https://arxiv.org/html/2605.07134#A3.p2.1)\.
- M\. Liu, J\. Peng, W\. Ning, P\. Chu, J\. Qiu, R\. Ma, H\. Zhu, R\. Min, L\. Lu, L\. Hou, K\. Liu, Y\. Qu, Z\. Li, C\. Xu, Z\. Tu, W\. Zhang, and C\. He \(2025a\)Dripper: token\-efficient main html extraction with a lightweight lm\.External Links:[Link](https://arxiv.org/abs/2511.23119)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p4.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px1.p1.1)\.
- X\. Liu, T\. Zhang, Y\. Gu, I\. L\. Iong, Y\. Xu, X\. Song, S\. Zhang, H\. Lai, X\. Liu, H\. Zhao, J\. Sun, X\. Yang, Y\. Yang, Z\. Qi, S\. Yao, X\. Sun, S\. Cheng, Q\. Zheng, H\. Yu, H\. Zhang, W\. Hong, M\. Ding, L\. Pan, X\. Gu, A\. Zeng, Z\. Du, C\. H\. Song, Y\. Su, Y\. Dong, and J\. Tang \(2025b\)VisualAgentBench: towards large multimodal models as visual foundation agents\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2408.06327)Cited by:[§5\.3](https://arxiv.org/html/2605.07134#S5.SS3.SSS0.Px1.p1.1)\.
- L\. Logeswaran, J\. Kim, S\. Sohn, C\. Glasscock, and H\. Lee \(2025\)Scaling web agent training through automatic data generation and fine\-grained evaluation\.InSecond Conference on Language Modeling,External Links:[Link](https://arxiv.org/abs/2602.12544)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p1.1)\.
- Y\. Ma, Y\. Tian, N\. Moniz, and N\. V\. Chawla \(2025\)Class\-imbalanced learning on graphs: a survey\.ACM Computing Survey\.External Links:[Link](https://doi.org/10.1145/3718734)Cited by:[§3\.4](https://arxiv.org/html/2605.07134#S3.SS4.SSS0.Px2.p1.3)\.
- A\. Moskaleva, M\. Abdelhady, A\. Katharopoulos, D\. Toyama, and S\. Schug \(2025\)FocusAgent: simple yet effective ways of trimming the large context of web agents\.External Links:[Link](https://arxiv.org/abs/2510.03204)Cited by:[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px2.p1.1)\.
- L\. Mou, G\. Li, L\. Zhang, T\. Wang, and Z\. Jin \(2016\)Convolutional neural networks over tree structures for programming language processing\.InProceedings of the 30th AAAI Conference on Artificial Intelligence,External Links:[Link](https://arxiv.org/abs/1409.5718)Cited by:[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px3.p1.1)\.
- OpenAI \(2025\)OpenAI gpt\-5 system card\.External Links:[Link](https://arxiv.org/abs/2601.03267)Cited by:[§3\.4](https://arxiv.org/html/2605.07134#S3.SS4.p1.2),[§5\.1](https://arxiv.org/html/2605.07134#S5.SS1.SSS0.Px2.p1.1)\.
- Y\. Pan, D\. Kong, S\. Zhou, C\. Cui, Y\. Leng, B\. Jiang, H\. Liu, Y\. Shang, S\. Zhou, T\. Wu, and Z\. Wu \(2024\)WebCanvas: benchmarking web agents in online environments\.External Links:[Link](https://arxiv.org/abs/2406.12373)Cited by:[Appendix C](https://arxiv.org/html/2605.07134#A3.p2.1)\.
- Z\. Qi, X\. Liu, I\. L\. Iong, H\. Lai, X\. Sun, W\. Zhao, Y\. Yang, X\. Yang, J\. Sun, S\. Yao, T\. Zhang, W\. Xu, J\. Tang, and Y\. Dong \(2025\)WebRL: training llm web agents via self\-evolving online curriculum reinforcement learning\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2411.02337)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p1.1)\.
- T\. M\. Schiepanski and N\. Piël \(2025\)Beyond pixels: exploring dom downsampling for llm\-based web agents\.External Links:[Link](https://arxiv.org/abs/2508.04412)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p2.1),[§1](https://arxiv.org/html/2605.07134#S1.p3.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px2.p1.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems,External Links:[Link](https://arxiv.org/abs/2303.11366)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p1.1)\.
- K\. Singh, S\. Singh, and M\. Khanna \(2025\)TRISHUL: towards region identification and screen hierarchy understanding for large vlm based gui agents\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\) Workshops,External Links:[Link](https://arxiv.org/abs/2502.08226)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p3.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px2.p1.1)\.
- P\. Sodhi, S\. R\. K\. Branavan, Y\. Artzi, and R\. McDonald \(2024\)SteP: stacked llm policies for web actions\.InFirst Conference on Language Modeling,External Links:[Link](https://arxiv.org/abs/2310.03720)Cited by:[§5\.1](https://arxiv.org/html/2605.07134#S5.SS1.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.07134#S5.T1.1.13.12.1)\.
- K\. S\. Tai, R\. Socher, and C\. D\. Manning \(2015\)Improved semantic representations from tree\-structured long short\-term memory networks\.InProceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing,External Links:[Link](https://arxiv.org/abs/1503.00075)Cited by:[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px3.p1.1)\.
- Q\. Team \(2025\)Qwen3 technical report\.External Links:[Link](https://arxiv.org/abs/2505.09388)Cited by:[§3\.4](https://arxiv.org/html/2605.07134#S3.SS4.SSS0.Px3.p1.1),[§5\.1](https://arxiv.org/html/2605.07134#S5.SS1.SSS0.Px2.p1.1)\.
- F\. Wang, Z\. Shi, B\. Wang, N\. Wang, and H\. Xiao \(2025\)ReaderLM\-v2: small language model for html to markdown and json\.External Links:[Link](https://arxiv.org/abs/2503.01151)Cited by:[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px1.p1.1)\.
- Q\. Wang, Y\. Fang, A\. Ravula, F\. Feng, X\. Quan, and D\. Liu \(2022\)WebFormer: the web\-page transformer for structure information extraction\.InProceedings of the ACM Web Conference 2022,External Links:[Link](https://arxiv.org/abs/2202.00217)Cited by:[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px3.p1.1)\.
- W\. Wang, G\. Li, S\. Shen, X\. Xia, and Z\. Jin \(2021\)Modular tree network for source code representation learning\.ACM Transactions on Software Engineering and Methodology\.External Links:[Link](https://dl.acm.org/doi/10.1145/3441472)Cited by:[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px3.p1.1)\.
- Z\. Wei, W\. Yao, Y\. Liu, W\. Zhang, Q\. Lu, L\. Qiu, C\. Yu, P\. Xu, C\. Zhang, B\. Yin, H\. Yun, and L\. Li \(2025\)WebAgent\-r1: training web agents via end\-to\-end multi\-turn reinforcement learning\.InThe 2025 Conference on Empirical Methods in Natural Language Processing,External Links:[Link](https://arxiv.org/abs/2505.16421)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p1.1)\.
- J\. Wu, W\. Yin, Y\. Jiang, Z\. Wang, Z\. Xi, R\. Fang, L\. Zhang, Y\. He, D\. Zhou, P\. Xie, and F\. Huang \(2025\)WebWalker: benchmarking llms in web traversal\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics,External Links:[Link](https://arxiv.org/abs/2501.07572)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p1.1)\.
- T\. Xue, W\. Qi, T\. Shi, C\. H\. Song, B\. Gou, D\. Song, H\. Sun, and Y\. Su \(2025\)An illusion of progress? assessing the current state of web agents\.InSecond Conference on Language Modeling,External Links:[Link](https://arxiv.org/abs/2504.01382)Cited by:[Appendix A](https://arxiv.org/html/2605.07134#A1.p1.1)\.
- K\. Yang, Y\. Liu, S\. Chaudhary, R\. Fakoor, P\. Chaudhari, G\. Karypis, and H\. Rangwala \(2025\)AgentOccam: a simple yet strong baseline for llm\-based web agents\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2410.13825)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p2.1),[§1](https://arxiv.org/html/2605.07134#S1.p3.1),[§5\.1](https://arxiv.org/html/2605.07134#S5.SS1.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.07134#S5.T1.1.15.14.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px2.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2210.03629)Cited by:[§5\.1](https://arxiv.org/html/2605.07134#S5.SS1.SSS0.Px2.p1.1)\.
- B\. Yeoh and H\. Wang \(2022\)GROWN\+up: a graph representation of a webpage network utilizing pre\-training\.InProceedings of the 31st ACM International Conference on Information & Knowledge Management,External Links:[Link](https://arxiv.org/abs/2208.02252)Cited by:[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px3.p1.1)\.
- J\. Zhang, X\. Wang, H\. Zhang, H\. Sun, K\. Wang, and X\. Liu \(2019\)A novel neural source code representation based on abstract syntax tree\.InProceedings of the 41st International Conference on Software Engineering,External Links:[Link](https://dl.acm.org/doi/10.1109/ICSE.2019.00086)Cited by:[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px3.p1.1)\.
- J\. Zhang, K\. Chen, Z\. Lu, E\. Zhou, Q\. Yu, and J\. Zhang \(2026a\)Prune4Web: dom tree pruning programming for web agent\.InProceedings of the 40th AAAI Conference on Artificial Intelligence,External Links:[Link](https://arxiv.org/abs/2511.21398)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p1.1),[§1](https://arxiv.org/html/2605.07134#S1.p3.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px2.p1.1)\.
- W\. Zhang, J\. Wang, J\. Zhou, Q\. Li, X\. Ma, C\. Zheng, X\. Lou, W\. Liu, Z\. Zhang, J\. Wang, Y\. Yu, and W\. Zhang \(2026b\)Plan\-mcts: plan exploration for action exploitation in web navigation\.External Links:[Link](https://arxiv.org/abs/2602.14083)Cited by:[§5\.1](https://arxiv.org/html/2605.07134#S5.SS1.SSS0.Px3.p1.1)\.
- B\. Zheng, B\. Gou, J\. Kil, H\. Sun, and Y\. Su \(2024\)GPT\-4v\(ision\) is a generalist web agent, if grounded\.InThe Forty\-First International Conference on Machine Learning,External Links:[Link](https://arxiv.org/abs/2401.01614)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p1.1),[§1](https://arxiv.org/html/2605.07134#S1.p3.1),[§6](https://arxiv.org/html/2605.07134#S6.SS0.SSS0.Px2.p1.1)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig \(2024\)WebArena: a realistic web environment for building autonomous agents\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2307.13854)Cited by:[§1](https://arxiv.org/html/2605.07134#S1.p7.1),[§5\.1](https://arxiv.org/html/2605.07134#S5.SS1.SSS0.Px1.p1.1)\.

## Appendix A Limitations and Future Work

Region4Web operates over the AXTree, and its decomposition and abstraction quality therefore depend on how completely each page exposes its accessibility semantics; pages that render through canvas or rely on non\-semantic markup provide weaker structural cues for boundary classification\. Our evaluation focuses on WebArena, whose consistent AXTree fidelity supports the controlled comparisons our experiments require\. Broader validation on real\-world web environments, where AXTree fidelity and page complexity vary across sites, would complement these results\. Future work includes broadening evaluation to live web environments such as Online\-Mind2Web \[[42](https://arxiv.org/html/2605.07134#bib.bib30)\], and applying region\-level granularity to screenshot\-based agents, since organizing observation by shared functional purpose generalizes across modalities\.

## Appendix B Broader Impacts

Region4Web and PageDigest reduce the observation length required for web agent operation, which lowers inference cost and broadens access to web agent technology in resource\-constrained settings\. The same efficiency gains can also lower the barrier to misuse such as large\-scale scraping or automated abuse of online services, where mitigation lies at the deployment level through controls such as rate limiting and access policies\. Our training data is constructed from publicly accessible pages on Tranco\-listed websites and contains no personal or sensitive information, limiting privacy concerns from the released artifacts\.

## Appendix C Dataset Selection for Preliminary Analysis

Our preliminary analysis requires a web agent benchmark that provides per\-action ground\-truth annotations across diverse real\-world web pages, so that consecutive action targets can be identified and localized within the page structure\. Mind2Web \[[8](https://arxiv.org/html/2605.07134#bib.bib15)\] is well suited for this purpose\. It provides 2,350 tasks across 137 websites spanning 31 domains, where each action step is grounded in the DOM snapshot of the page at that step\. Since our analysis targets structural properties within individual snapshots, the static nature of these representations does not affect the validity of the measurements\.

Other web agent benchmarks do not meet these requirements\. MiniWoB\+\+ \[[21](https://arxiv.org/html/2605.07134#bib.bib9)\] consists of atomic\-level tasks in synthetic web environments that do not reflect the structural complexity of real\-world pages\. Mind2Web\-Live \[[29](https://arxiv.org/html/2605.07134#bib.bib40)\] provides tasks on live websites, but its annotations adopt a key\-node evaluation scheme that assesses task completion at designated milestones rather than providing per\-action ground\-truth annotations with element\-level targets\. Although the raw data provides per\-action ground\-truth annotations \([https://github.com/imeanai/webcanvas?tab=readme-ov-file#download](https://github.com/imeanai/webcanvas?tab=readme-ov-file#download)\), page sources are identified by URL without stored snapshots, and the referenced pages have since undergone content updates and layout modifications, making the original page structures unrecoverable\.

## Appendix D AXTree Preprocessing

Our AXTree preprocessing follows BrowserGym \[[9](https://arxiv.org/html/2605.07134#bib.bib20)\], which extracts the accessibility tree via the Chrome DevTools Protocol, filters out nodes with no accessible content, and serializes each remaining node in an indentation\-based text format \(\[id\] role name value\); see the BrowserGym implementation \([https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/core/src/browsergym/core/observation.py](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/core/src/browsergym/core/observation.py)\)\. We adopt this technique with three modifications\.

First, each node is identified by a persistent identifier, the browser-assigned `backendDOMNodeId` or BrowserGym's `bid`, that remains stable across same-page DOM mutations. This enables stable cross-step node matching and serves as the basis for the observation transition history in Section [2.2](https://arxiv.org/html/2605.07134#S2.SS2). Second, BrowserGym unconditionally removes all property-less `generic` and `none` nodes, which causes wrapper elements that group related content in the DOM to collapse into flat sibling lists. We retain such a node when it has two or more child branches, each containing a visible descendant, preserving the structural grouping that hierarchical decomposition relies on. Finally, for `image` and `link` nodes whose accessible name is empty, the node is enriched with the corresponding `src` or `href` attribute retrieved from the DOM.
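The wrapper-retention rule from the second modification can be sketched as a small predicate. This is an illustrative sketch, not BrowserGym code: the dict-based node shape and the `visible` flag are assumptions standing in for the preprocessed AXTree.

```python
def has_visible_descendant(node):
    """True if the node or any descendant carries visible content
    (the 'visible' flag is a hypothetical stand-in)."""
    if node.get("visible"):
        return True
    return any(has_visible_descendant(c) for c in node.get("children", []))

def retain_wrapper(node):
    """Keep a property-less generic/none wrapper only when it groups two or
    more child branches, each containing a visible descendant."""
    branches = [c for c in node.get("children", [])
                if has_visible_descendant(c)]
    return len(branches) >= 2
```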

## Appendix E Training Dataset Construction

### E.1 Source Page Collection

#### Domain categories\.

We select 10 domain categories from the 37 Tier 1 categories in the IAB Content Taxonomy 3\.1 for their relevance to web agent tasks, covering Shopping, Travel, Technology & Computing, Business and Finance, Education, Food & Drink, Real Estate, Careers, Entertainment, and Sports\.

#### Website selection\.

We use the Tranco top-1M ranking list snapshot from April 1, 2026 as the source. To assign each website to a domain category, we embed the 37 Tier 1 categories as reference embeddings and embed each website's concatenated title and description metadata using [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2). Each website is assigned to the nearest category by cosine similarity. From the resulting clusters, we retain the 10 categories defined above and select the 500 highest-ranked websites by Tranco position across these categories.
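The nearest-category assignment can be illustrated with plain cosine similarity; the toy 3-dimensional vectors below stand in for the actual MiniLM sentence embeddings, and the selection logic is unchanged.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def assign_category(site_vec, category_vecs):
    """Assign a website embedding to the reference category embedding
    with the highest cosine similarity."""
    return max(category_vecs,
               key=lambda name: cosine(site_vec, category_vecs[name]))
```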

#### Page URL sampling\.

For each website, page URLs are sampled from its `sitemap.xml` file. Each URL is scored by the sum of three signals from the sitemap metadata, namely priority (0.0–1.0, default 0.5), change frequency (0.15 for daily or hourly, 0.1 for weekly), and URL depth (0.03 per path segment, up to 5 levels). Up to 100 URLs with the highest scores are retained per website. Websites without an accessible `sitemap.xml` or unreachable via Playwright headless Chromium are excluded, removing 247 of the original 500. The AXTree of each remaining page is extracted via Playwright, yielding 21,974 pages from 253 websites.
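One reading of the URL scoring rule, with the three signals summed as described (the handling of missing change-frequency metadata is an assumption):

```python
def score_url(priority=0.5, changefreq=None, depth=0):
    """Score a sitemap URL as the sum of the three metadata signals."""
    score = priority                      # 0.0-1.0, sitemap default 0.5
    if changefreq in ("daily", "hourly"):
        score += 0.15
    elif changefreq == "weekly":
        score += 0.10
    score += 0.03 * min(depth, 5)         # 0.03 per path segment, up to 5
    return score
```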

### E.2 Data Annotation

Since the knowledge of how web pages are functionally organized is implicit, we construct training data for both decomposition and abstraction using `gpt-5-mini-2025-08-27` as the annotator. Because decomposition produces the region partition that abstraction then interprets, any partition error contaminates downstream abstraction labels. We therefore add a verification stage between the two, retaining only pages where every region passes validation. The annotation accordingly proceeds through three stages.

#### Decomposition annotation\.

The annotator receives each page’s preprocessed AXTree together with the page URL and produces a list of region root node IDs\. Since the annotator occasionally assigns an entire page to a single region, a fallback mechanism re\-partitions any region whose node count exceeds 50% of the page total and is more than 10 times the median region size\. Of the 21,974 pages, 7 fail due to context length limits and 2,690 \(12\.2%\) trigger the fallback\. The remaining 21,967 pages yield 547,075 regions, averaging 24\.9 per page\.
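The fallback trigger combines the two size conditions above; a minimal sketch over per-region node counts:

```python
from statistics import median

def fallback_regions(region_sizes, page_total):
    """Indices of regions to re-partition: node count above 50% of the
    page total AND more than 10x the median region size."""
    med = median(region_sizes)
    return [i for i, n in enumerate(region_sizes)
            if n > 0.5 * page_total and n > 10 * med]
```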

#### Partition verification\.

The annotator receives each page’s region partition and identifies regions that were incorrectly decomposed\. Since even a single invalid region is sufficient to corrupt the abstraction labels derived from it, we retain only pages that yield a valid region partition, one in which every region is correctly decomposed\. This reduces 21,967 pages to 2,052 \(9\.3%\) with 46,487 regions, averaging 22\.7 per page\.

#### Abstraction annotation\.

The annotator receives each verified region's AXTree subtree and produces a purpose and a state summary. Of the 46,487 regions, 1,340 consist solely of `none` or `generic` nodes with no visible content and are excluded, yielding 45,147 annotated regions from 2,052 pages. The annotations reduce the average region representation from 176.6 tokens to 56.2 tokens under the OpenAI `o200k_base` tokenizer, a 68.2% reduction.

Table [3](https://arxiv.org/html/2605.07134#A5.T3) summarizes the dataset at each stage of the construction process. The annotation prompts used at each stage are provided in Figures [9](https://arxiv.org/html/2605.07134#A8.F9), [10](https://arxiv.org/html/2605.07134#A8.F10), and [11](https://arxiv.org/html/2605.07134#A8.F11), respectively.

Table 3: Statistics at each stage of training dataset construction.

## Appendix F Region4Web Implementation Details

### F.1 Hierarchical Decomposition

#### Node features\.

The feature vector $\mathbf{x}_v$ is 16-dimensional, concatenating a learned role embedding (11 dimensions) with five numeric features. The role vocabulary contains 204 entries: 203 from the Chromium accessibility role enumeration (Chromium 125.0.6422.26, `chromium/src/ui/accessibility/ax_enums.mojom`) and one for unknown roles. The five numeric features are the node's depth in the tree, subtree size, number of children, accessible name presence, and child role diversity, providing structural cues beyond the role embedding for boundary classification. Accessible name presence is 1 when the node has a non-empty accessible name and 0 otherwise. Child role diversity is the ratio of unique child roles to the number of children.
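The five numeric features can be sketched as follows; the dict-based node shape is an assumption, and the learned 11-dimensional role embedding is omitted here.

```python
def numeric_features(node):
    """Depth, subtree size, child count, name presence, child role diversity."""
    children = node.get("children", [])

    def subtree_size(n):
        return 1 + sum(subtree_size(c) for c in n.get("children", []))

    name_presence = 1.0 if node.get("name") else 0.0
    role_diversity = (len({c["role"] for c in children}) / len(children)
                      if children else 0.0)
    return [float(node.get("depth", 0)), float(subtree_size(node)),
            float(len(children)), name_presence, role_diversity]
```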

#### Model architecture\.

`RegionEncoder` and `EdgeClassifier` are both three-layer MLPs with ReLU activations and a hidden dimension of 256. `RegionEncoder` maps a 272-dimensional input ($\mathbf{x}_v$ and the merged-children aggregation) to the 256-dimensional representation $\mathbf{r}_v$. `EdgeClassifier` maps a 528-dimensional input ($\mathbf{x}_v$, $\mathbf{r}_{c_i}$, and $\bar{\mathbf{r}}$) to a scalar logit $\hat{y}_{v,c_i}$. The model totals approximately 536K parameters including the role embedding table. The full inference procedure is given in Algorithm [1](https://arxiv.org/html/2605.07134#alg1).

Algorithm 1: Hierarchical Decomposition

Input: page AXTree $\mathcal{T}=(\mathcal{V},\mathcal{E})$, threshold $\tau$. Output: region partition $\mathcal{R}$.

1: $\mathcal{R}\leftarrow\emptyset$
2: for each node $v\in\mathcal{V}$ in bottom-up order do
3:&nbsp;&nbsp; $S_v\leftarrow\{v\}$
4:&nbsp;&nbsp; $\mathbf{x}_v\leftarrow\text{Concat}(\mathbf{E}_{\text{role}}(v),\,\mathbf{n}_v)$
5:&nbsp;&nbsp; if $v$ is a leaf then
6:&nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{r}_v\leftarrow\text{RegionEncoder}(\mathbf{x}_v,\,\mathbf{0})$
7:&nbsp;&nbsp; else
8:&nbsp;&nbsp;&nbsp;&nbsp; let $c_1,\ldots,c_k$ be the children of $v$
9:&nbsp;&nbsp;&nbsp;&nbsp; $\bar{\mathbf{r}}_v\leftarrow\frac{1}{k}\sum_{j=1}^{k}\mathbf{r}_{c_j}$ ▷ sibling mean
10:&nbsp;&nbsp;&nbsp;&nbsp; $\mathcal{M}_v\leftarrow\emptyset$
11:&nbsp;&nbsp;&nbsp;&nbsp; for $i=1,\ldots,k$ do
12:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if $\text{EdgeClassifier}(\mathbf{x}_v,\,\mathbf{r}_{c_i},\,\bar{\mathbf{r}}_v)\geq\tau$ then
13:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\mathcal{R}\leftarrow\mathcal{R}\cup\{S_{c_i}\}$ ▷ cut: $c_i$'s subtree constitutes a region
14:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else
15:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\mathcal{M}_v\leftarrow\mathcal{M}_v\cup\{c_i\}$; $S_v\leftarrow S_v\cup S_{c_i}$ ▷ merge: $c_i$'s subtree merges into $v$'s region
16:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end if
17:&nbsp;&nbsp;&nbsp;&nbsp; end for
18:&nbsp;&nbsp;&nbsp;&nbsp; if $\mathcal{M}_v\neq\emptyset$ then
19:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{r}_v\leftarrow\text{RegionEncoder}\bigl(\mathbf{x}_v,\,\tfrac{1}{|\mathcal{M}_v|}\sum_{c_j\in\mathcal{M}_v}\mathbf{r}_{c_j}\bigr)$
20:&nbsp;&nbsp;&nbsp;&nbsp; else
21:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{r}_v\leftarrow\text{RegionEncoder}(\mathbf{x}_v,\,\mathbf{0})$
22:&nbsp;&nbsp;&nbsp;&nbsp; end if
23:&nbsp;&nbsp; end if
24: end for
25: $\mathcal{R}\leftarrow\mathcal{R}\cup\{S_{v_{\text{root}}}\}$ ▷ root subtree constitutes the final region
26: return $\mathcal{R}$
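The cut/merge control flow of the decomposition can be sketched as a recursive traversal. This is an illustrative skeleton, not the paper's implementation: the learned `EdgeClassifier` is replaced by a caller-supplied `edge_score` stub, and the encoder bookkeeping is omitted since only the cut/merge decisions shape the partition.

```python
def decompose(tree, edge_score, tau=0.55):
    """Bottom-up region partition of an AXTree (cut/merge skeleton of
    Algorithm 1). Nodes are dicts with 'id' and 'children'."""
    regions = []

    def visit(v):
        s_v = [v["id"]]                      # S_v starts as {v}
        for c in v.get("children", []):
            s_c = visit(c)
            if edge_score(v, c) >= tau:
                regions.append(s_c)          # cut: c's subtree is a region
            else:
                s_v.extend(s_c)              # merge: c's subtree joins S_v
        return s_v

    regions.append(visit(tree))              # root subtree is the final region
    return regions
```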

#### Training configuration\.

The model is trained with teacher forcing, where ground-truth edge labels determine cut and merge decisions during the bottom-up traversal rather than the model's own predictions. Training runs for 140 epochs on an NVIDIA RTX A6000 GPU with the Adam optimizer at a learning rate of $1\times 10^{-4}$, gradient clipping at 1.0, and focal loss with $\alpha=0.75$ and $\gamma=2.0$ to address the class imbalance between merge and cut edges. The data is split into 90% training and 10% validation sets at the page level with seed 42.
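For reference, the binary focal loss with these settings, for a single edge prediction (assuming the standard Lin et al. formulation):

```python
import math

def focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Binary focal loss for predicted probability p and label y in {0, 1}.
    The (1 - p_t)^gamma factor down-weights well-classified edges."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
```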

#### Checkpoint selection and threshold tuning\.

The training epoch and the inference threshold $\tau$ are determined in two steps, each using the metric that matches its objective.

The training epoch is selected based on edge-level F1 on the validation set, as the model directly optimizes edge-level binary classification during training. Among the epochs with the highest validation F1, we choose the one with the smallest training-validation F1 gap to avoid overfitting, yielding epoch 125. Figure [7](https://arxiv.org/html/2605.07134#A6.F7) shows the edge-level F1 curves over training.

The inference threshold $\tau$ converts edge-level logits into a region partition, whose quality is not captured by edge-level metrics. We therefore tune $\tau$ at the region level. Each ground-truth region is matched to the predicted region with the highest Intersection-over-Union (IoU) and counted as matched if this IoU meets or exceeds 0.5. Region-level precision, recall, and F1 are then computed over the matched counts relative to the total predicted and ground-truth regions. This yields $\tau=0.55$. Table [4](https://arxiv.org/html/2605.07134#A6.T4) reports the region-level metrics across threshold values.
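The region-level metric can be sketched as follows, treating each region as a set of node IDs. The exact tie-breaking and any one-to-one matching constraint are not specified, so this is one plausible reading rather than the paper's exact evaluation code.

```python
def region_level_f1(gt_regions, pred_regions, iou_thresh=0.5):
    """Match each ground-truth region to the predicted region with the
    highest IoU; count it matched when that IoU >= iou_thresh."""
    def iou(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    matched = sum(
        1 for g in gt_regions
        if max((iou(g, p) for p in pred_regions), default=0.0) >= iou_thresh)
    precision = matched / len(pred_regions) if pred_regions else 0.0
    recall = matched / len(gt_regions) if gt_regions else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```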

![[Uncaptioned image]](https://arxiv.org/html/2605.07134v1/x9.png)

Figure 7: Edge-level F1 on training and validation sets over 140 epochs. Epoch 125 is selected for deployment.

Table 4: Region-level precision, recall, and F1 across inference thresholds at epoch 125. $\tau=0.55$ achieves the highest F1.

### F.2 Semantic Abstraction

#### Training configuration\.

`Qwen3-0.6B` is fine-tuned with full supervised fine-tuning in `bfloat16` precision with gradient checkpointing. Each training example pairs a region's preprocessed AXTree subtree as input with a JSON object containing the corresponding purpose $p_i$ and state summary $s_i$ as output, using the annotation prompt shown in Figure [11](https://arxiv.org/html/2605.07134#A8.F11) as the instruction prefix. The loss is computed only on the output tokens, with all input and padding tokens masked. Training runs for 90 epochs (76,200 steps) on 3 NVIDIA RTX A6000 GPUs using distributed data parallel (DDP) [[19](https://arxiv.org/html/2605.07134#bib.bib36)] with a per-device batch size of 1 and gradient accumulation of 16, yielding an effective batch size of 48. The optimizer is AdamW with a learning rate of $5\times 10^{-6}$, 200 linear warmup steps, and cosine decay. The maximum sequence length is 8,192 tokens, and 37 samples (0.08%) exceeding this limit are skipped during training. The data is split into 90% training and 10% validation sets with the same seed 42 used for decomposition model training.
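The output-only loss masking can be sketched with the conventional `-100` ignore index used by cross-entropy implementations; the token IDs and pad ID here are illustrative, not tied to the Qwen3 tokenizer.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_labels(input_ids, prompt_len, pad_id=0):
    """Copy input_ids into labels, masking the instruction prefix and all
    padding so the loss covers only the output (annotation) tokens."""
    labels = list(input_ids)
    for i, tok in enumerate(labels):
        if i < prompt_len or tok == pad_id:
            labels[i] = IGNORE_INDEX
    return labels
```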

#### Checkpoint selection\.

The checkpoint at step 65,350 is selected by jointly considering validation loss and manual quality assessment of sampled outputs. Figure [8](https://arxiv.org/html/2605.07134#A6.F8) shows the training and validation loss curves.

#### Inference\.

The fine-tuned model processes regions with greedy decoding. The annotation prompt (Figure [11](https://arxiv.org/html/2605.07134#A8.F11)) is reused at inference to maintain distributional consistency between training and deployment.

![Refer to caption](https://arxiv.org/html/2605.07134v1/x10.png)

Figure 8: Training and validation loss over 90 epochs. Step 65,350 is selected for deployment.

## Appendix G Case Studies

We provide qualitative case studies of Region4Web's decomposition and abstraction stages on representative pages from each WebArena domain in Tables [5](https://arxiv.org/html/2605.07134#A8.T5) through [9](https://arxiv.org/html/2605.07134#A8.T9).

## Appendix H Prompts

For reproducibility, we provide all prompts used in this work in Figures [9](https://arxiv.org/html/2605.07134#A8.F9) through [14](https://arxiv.org/html/2605.07134#A8.F14).

Table 5: Region4Web output on WebArena Shopping domain.

Page URL: http://localhost:7770 (343 nodes)

Hierarchical Decomposition: total 23 regions

![[Uncaptioned image]](https://arxiv.org/html/2605.07134v1/x11.png)

Semantic Abstraction: R0, R1, R5, R7, R8

- R0 (purpose: Account navigation menu). State summary: Provides navigation links to account-related pages (My Account, My Wish List, Sign In, Create an Account) and a Welcome message. The Create an Account link is actionable to initiate a new account.
- R1 (purpose: Search form). State summary: Search is currently enabled with a combobox labeled "Search" and an "Advanced Search" link. The combobox is not expanded and the button is disabled.
- R5 (purpose: product listing card). State summary: Product: Pre-baked Gingerbread House Kit Value Pack, 17 oz., Pack of 2, Total 34 oz., with a 20% rating and $19.99. Available actions: Add to Cart, Add to Wish List, and Add to Compare.
- R6 (purpose: product card). State summary: Healthy energy drink with a 57% rating and $14.47 price. Available actions: Add to Cart, Add to Wish List, and Add to Compare.
- R7 (purpose: product card). State summary: Product: Elmwood Inn Fine Teas, Orange Vanilla Caffeine-free Fruit Infusion, 16-Ounce Pouch (95% rating), priced at $19.36. Available actions: Add to Cart, Add to Wish List, and Add to Compare.

Table 6: Region4Web output on WebArena CMS (shopping admin) domain.

Page URL: http://localhost:7780/admin/admin/dashboard (217 nodes)

Hierarchical Decomposition: total 17 regions

![[Uncaptioned image]](https://arxiv.org/html/2605.07134v1/x12.png)

Semantic Abstraction: R2, R5, R6, R10

- R2 (purpose: User control links). State summary: Contains two clickable links labeled "admin". Clicking either link navigates to the corresponding admin page.
- R5 (purpose: Average order value display). State summary: Shows an average order value of $0.00. No interactive controls are present in this region.
- R6 (purpose: Order history table). State summary: Shows order details for five orders (IDs 299, 65, 125, 136, 230), with each row showing customer name, item count, and total. Each order link is actionable (clickable URL) to view the order.
- R10 (purpose: Scope and data management controls). State summary: Shows a 'Scope:' heading and provides an 'All Store Views' button with a menu popup and a 'Reload Data' button. The 'What is this?' link is actionable for clarification.

Table 7: Region4Web output on WebArena Reddit domain.

Page URL: http://localhost:9999/friedly-reminder-bookshop-org-exists (4,151 nodes)

Hierarchical Decomposition: total 345 regions

![[Uncaptioned image]](https://arxiv.org/html/2605.07134v1/x13.png)

Semantic Abstraction: R5, R6, R11, R342

- R5 (purpose: Promotional call-to-action). State summary: Promotes a local bookstore program in which 30% of book purchases go to the store and encourages supporting local bookstores. The region contains static text and a closing statement that appears to be a call to action.
- R6 (purpose: comment count display). State summary: Shows a count of 129 comments. The item is a link that can be activated to open the comment list or view more details.
- R11 (purpose: upvote/downvote controls). State summary: Contains two buttons labeled "Upvote" and "Downvote" with a numeric value of 367 displayed. Clicking the buttons toggles the up/down vote state; the 367 is static text showing the current count.
- R342 (purpose: Book listing). State summary: Contains a single book entry with a 'books' link and a 'Subscribe via RSS' image. The book's timestamp shows it was created 4 years ago.

Table 8: Region4Web output on WebArena Gitlab domain.

Page URL: http://localhost:8023/byteblaze/a11y-syntax-highlighting (546 nodes)

Hierarchical Decomposition: total 14 regions

![[Uncaptioned image]](https://arxiv.org/html/2605.07134v1/x14.png)

Semantic Abstraction: R2, R3, R6, R10

- R2 (purpose: Help and account navigation links). State summary: Contains two links: a 'Help' link with an image and a 'Sign in / Register' link. Clicking either navigates to the help documentation or the account sign-in/register page.
- R3 (purpose: Project statistics card). State summary: Shows project statistics: 49 commits, 1 branch, 0 tags, and 2.1 MB project storage. All items are clickable links that navigate to the corresponding metrics.
- R6 (purpose: List of files and their commits/updates). State summary: Contains a list of files (dist, images, test, LICENSE, README.md, package.json) with their last commit and update dates. Each file is a link that opens the file's page or shows the file's name and time.
- R10 (purpose: license and documentation links). State summary: Contains two clickable links: a README image and a GNU GPLv3 license link. Click either link to open the corresponding documentation or license page.

Table 9: Region4Web output on WebArena Map domain.

Page URL: https://www.openstreetmap.org/directions?engine=fossgis_osrm_car… (176 nodes)

Hierarchical Decomposition: total 13 regions

![[Uncaptioned image]](https://arxiv.org/html/2605.07134v1/x15.png)

Semantic Abstraction: R0, R1, R5, R7, R11

- R0 (purpose: site navigation menu). State summary: Provides navigational links to site sections: History, Export, GPS Traces, User Diaries, Communities, Copyright, Help, Donate, and About. Each item is a clickable link to navigate to the corresponding page.
- R1 (purpose: Authentication and sign-up navigation). State summary: Provides two navigation links: 'Log In' to initiate account access and 'Sign Up' to create a new account. Both links are actionable and can be activated to proceed with the respective authentication or sign-up process.
- R5 (purpose: Directions map route). State summary: Provides a route map with 11 steps (1–11) and a destination. Includes a downloadable GeoJSON file and a link to the OSRM (FOSSGIS) source. The 'Directions' heading labels the panel, and the table shows distance and time for each step.
- R7 (purpose: Page header controls). State summary: Provides navigation links to Layers, Legend, Share, Add a note to the map, and Query features. A 'Show My Location' button is available to open the location view.
- R11 (purpose: Directions routing selection panel). State summary: Selects the directions service (OSRM) and provides a 'Reverse Directions' button to reverse the route. The 'From' and 'To' fields are populated with the specified addresses, and the 'Close' button cancels the panel.

Decomposition prompt

You are an observation space analyst for web agents. You partition a web page's accessibility tree (AXTree) into non-overlapping functional regions so that an autonomous agent can understand and interact with the page.
<definition\>
A functional region is a subtree whose elements are collectively organized to serve a distinct purpose\. Functional purpose is not a property of individual elements\. It arises from how elements are collectively organized\. A single link has an element\-level action, but a region\-level purpose emerges only when multiple elements are organized together to fulfill a coherent function\. Every node belongs to exactly one region\.
</definition\>
<constraints\>
1\. Structural containers \(the tree root, ARIA landmarks like banner/main/contentinfo\) group content by page position, not by purpose\. Always evaluate their children\. A container becomes a region only for children that do not form their own regions\.
2\. A region must be meaningful to an agent, something it would need to independently recognize or interact with to carry out a task\. Purely decorative elements and isolated utility shortcuts belong to their parent’s region\.
</constraints\>
<algorithm\>
To evaluate a node N, apply these steps in order\.
Step 1\. Container passthrough\.
If N is the tree root or an ARIA landmark, evaluate each direct child of N by applying this algorithm recursively\. N itself becomes a region only for its remaining children, those that did not form their own regions\. If all children form regions, N is not recorded\.
Step 2\. Evaluate N\.
For each candidate N that is not a container:
\(a\) If N’s children are structurally repetitive \(a collection of entries\), determine whether each entry’s purpose arises from its own internal organization: its own elements collectively organized into different roles \(information, metadata, and actions\) around a single entity, interpretable without shared context from siblings or parent structure\. If yes, each entry is its own region\. Apply this algorithm recursively to each\. If no, N is one region\. Record N\.
\(b\) If N has multiple children that serve distinct purposes and each child’s purpose arises from its own internal organization, each such child is its own region\. Apply this algorithm recursively to each\. Children whose purpose does not arise from their own internal organization remain in N’s region\.
\(c\) If neither \(a\) nor \(b\) applies, N serves one coherent purpose\. Record N\.
Step 3\. Meaningfulness check\.
Before recording N, verify it is meaningful to an agent \(see constraint 2\)\. If not, N belongs to its parent’s region\.
</algorithm\>
<output\_format\>
Output the recorded region root node IDs as a comma\-separated list\.
After the list, output nothing further\.
</output\_format\>
URL: \{url\}
AXTree:
{axtree}

Figure 9: Prompt for the decomposition annotation stage.

Verification prompt

You are an observation space analyst for web agents. You verify whether a page's accessibility tree (AXTree) has been correctly partitioned into functional regions.
<definition\>
A functional region is a subtree whose elements are collectively organized to serve a distinct purpose\. Functional purpose is not a property of individual elements\. It arises from how elements are collectively organized\. A single link has an element\-level action, but a region\-level purpose emerges only when multiple elements are organized together to fulfill a coherent function\.
</definition\>
<criteria\>
For each region, verify that it is correctly formed by checking:
1\. The region corresponds to one recognizable functional unit on the page, an area that an agent would identify as serving a single role\.
2\. If you can identify multiple sub\-components within the region that are each independently recognizable as their own functional area on the page, the region is incorrectly formed\. Those areas should be separate regions\.
</criteria\>
<output\_format\>
Output the IDs of incorrectly formed regions as a comma\-separated list \(e\.g\., R3, R7\)\.
If no region is incorrectly formed, output: none
After the output, output nothing further\.
</output\_format\>
URL: \{url\}
{region_partition}

Figure 10: Prompt for the partition verification stage.

Abstraction prompt

You are an observation space analyst for web agents. You produce semantic descriptions of functional regions extracted from web page accessibility trees (AXTree).
<task\>
Given a region’s AXTree subtree, produce two descriptions:
1\. purpose: Identify the collective function that the region’s elements are organized to serve\. This should name the type of region, not describe its current contents or enumerate its features\. Write a short noun phrase\.
2\. state\_summary: Interpret the region’s current content and available actions to inform task\-based decision making\. Lead with the key information an agent would match against a task, not with descriptions of what the region shows\. Write one to two concise sentences\.
</task\>
<guidelines\>
\- Derive both fields solely from the elements present in the subtree\.
\- For purpose, identify what the elements collectively accomplish, not what individual elements are\.
\- For state\_summary, interpret and select what matters for decision making, not exhaustively describe elements\.
\- Always output in English, translating non\-English content as needed\.
</guidelines\>
{region_axtree}

Figure 11: Prompt for abstraction, used across annotation, training, and inference.

Selection prompt

You are a page region selector for a web agent. You receive the agent's task, the actions taken so far, and a list of functional regions on the current page, each described by its purpose and state summary. Select the regions the agent needs on this page to make progress on the task.
<principles\>
1\. First understand what the current page offers from the full set of region abstractions\. Then, given the task and the action history, select every region whose content could be relevant to the task, whether the agent needs to interact with it or read information from it\. Do not exclude regions based on an assumed course of action\.
2\. Exclude a region only when its purpose is clearly unrelated to the task\. If relevance cannot be determined from the description, include the region\.
3\. When multiple regions share a similar purpose and their state summaries do not indicate which ones the task requires, include all of them\.
4\. A state summary that appears to match the task does not by itself justify excluding other potentially relevant regions\. The rendered content may not satisfy the task’s exact requirements\.
</principles\>
<output\_format\>
Output the selected region IDs as a comma\-separated list \(e\.g\., R3, R7\)\.
After the list, output nothing further\.
</output\_format\>
Task: \{task\_instruction\}
Action history:
\{action\_history\}
{region_abstractions}

Figure 12: Prompt for task-relevant region selection.

Action selection prompt

You are a web agent. You receive a task, the current page state, and your previous actions. You select the next action to make progress on the task.
<action\_space\>
\{action\_space\}
</action_space>
Elements are identified by unique bid. Use bid to refer to elements in your actions. Interacting with comboboxes, dropdowns, and auto-complete fields may require different actions depending on the element. Try select_option first. If it does not work, try fill or click and observe the result. Your final message to the user must contain only the answer value in the format implied by the task, with no prefixes, explanations, or restatements.
<output\_format\>
Reason about the current state and decide the next action inside <think\> tags\.
Then output exactly one action inside <action\> tags\.
After the tags, output nothing further\.
<think\>
Your step\-by\-step reasoning\.
</think\>
<action\>
click('a324')
</action\>
</output\_format\>
Task: \{task\_instruction\}
Action history:
\{action\_history\}
{axtree}

Figure 13: Prompt template for action selection. {action_space} is replaced with the set of 15 available actions and their descriptions provided by BrowserGym (e.g., `click`, `fill`).

Action selection prompt (w/ PageDigest)

# action space
view\_all\(\)
Description: Reorganize the current page state into functional regions and reveal all of them in their full AXTree form\. Use this when the currently exposed regions appear insufficient for the task, for example when the target element seems missing from the observation or when no available action makes progress\. This action remains in effect until the agent navigates to a new page\.
Examples:
view\_all\(\)
\# observation space
<observation\_space\>
The page is split into functional regions, each rendered as a <Rx purpose="…"\> … </Rx\> block\. Regions deemed task\-relevant render their full AXTree subtree inside the block; others appear as the opening tag with no inner content\. Newly appeared nodes since the page was first entered are listed at the end inside an <added\_elements\> block\.
If the elements you need to interact with do not appear inside any rendered subtree, or no available action makes progress, call view\_all\(\) to re\-expose every region’s full subtree for the rest of this page\.
</observation_space>

Figure 14: Prompt for action selection with PageDigest. Only the additions to Figure [13](https://arxiv.org/html/2605.07134#A8.F13) are shown.
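The digest rendering described in the observation-space block can be sketched as follows; the region dict keys (`id`, `purpose`, `axtree`) are illustrative stand-ins, not the paper's exact data model.

```python
def render_digest(regions, selected, added_nodes=()):
    """Render the region-level observation: task-relevant regions expose
    their full AXTree subtree; others appear as an opening tag only."""
    lines = []
    for r in regions:
        tag = f'<{r["id"]} purpose="{r["purpose"]}">'
        if r["id"] in selected:
            lines += [tag, r["axtree"], f'</{r["id"]}>']
        else:
            lines.append(tag)
    if added_nodes:  # nodes that appeared since the page was first entered
        lines += ["<added_elements>", *added_nodes, "</added_elements>"]
    return "\n".join(lines)
```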
