Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
**Summary:** Apple Research introduces Weblica, a framework for creating scalable and reproducible training environments for visual web agents using HTTP caching and LLM-based synthesis.
# Scalable and Reproducible Training Environments for Visual Web Agents
Source: [https://arxiv.org/html/2605.06761](https://arxiv.org/html/2605.06761)
Roman Bachmann, Yuanzheng Gong, Anders Boesen Lindbo Larsen, Afshin Dehghan (Apple)
May 7, 2026
###### Abstract
The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. We propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments. Our framework leverages 1) HTTP-level caching to capture and replay stable visual states while preserving interactive behavior, and 2) LLM-based environment synthesis grounded in real-world websites and core web navigation skills. Using this framework, we scale RL training to thousands of diverse environments and tasks. Our best model, Weblica-8B, outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and is competitive with API models.
## 1 Introduction
Large language models are increasingly capable as autonomous agents in domains like coding [anthropic2025claudecode; openaicodex; wang2024openhands; hui2024qwen2], mathematics [hubert2025olympiad; novikov2025alphaevolve], and computer use [anthropic2024claudecomputeruse; openaicomputeruse; geminicomputeruse; qin2025ui]. This progress is driven by the availability of large-scale, high-quality training data. Web navigation has emerged as a recent focus toward building agents that autonomously navigate the web to solve tasks ranging from information retrieval to form filling to online shopping. These agents aim to complete multi-step workflows that currently require manual effort, representing a significant step toward personalized digital assistants.


Figure 1: Scaling training data for visual web agents is challenging due to the complex, dynamic, and open-ended nature of the web. Left: We synthetically generate web environments in a fully automated and scalable manner, spanning a broad set of capabilities like navigation, form filling, filtering, date picking, and more. Alongside caching real websites, these create a fully offline alternative to live-web training. Right: Training on these environments improves performance across multiple web navigation benchmarks (Online-Mind2Web [xue2025illusion] shown here), scaling with both test-time compute (top) and model size (bottom).

However, progress on building web agents has been slower, primarily due to the difficulty of scaling training data and environments to capture the complexity of the web. Recent attempts at data generation include collecting offline trajectories as demonstration data for supervised fine-tuning [wang2025opencua; awadallah2025fara; gupta2026molmoweb], which offers limited support for handling the stochastic nature of the web due to the lack of interaction. As an alternative, building simulated web environments offers interaction, but they commonly cover only a handful of manually defined domains [zhou2023webarena; koh2024visualwebarena], limiting generalization. While training directly on live websites could alleviate that, it suffers from brittleness due to timeouts and bot detection, making training unstable. In addition, the live web is constantly evolving and slow to interact with, making carefully controlled ablations and fully reproducible training difficult. This raises a natural question: how can we scale interactive web environments while maintaining reproducibility?
We propose Weblica, a framework for constructing reproducible and scalable web environments to train visual web agents. Our framework introduces two complementary mechanisms. First, we develop an HTTP-level caching system that records and replays real website interactions, capturing stable visual states while preserving interactive behavior. This enables reproducible training on diverse real-world websites without the brittleness of live web training, though it is limited to domains where stable recordings can be obtained. Second, we present an LLM-based environment synthesis pipeline that generates interactive web environments grounded in real websites and core web navigation skills (e.g., form submission, authentication flows, dynamic search), enabling scaling to broader domains at the cost of a potential sim-to-real gap. All environments are served locally, eliminating network latency and enabling fast training. Together, these approaches provide diverse, reproducible environments at scale.
Using this framework, we scale training to thousands of diverse environments and web navigation tasks. We fine-tune models from the Qwen3-VL [yang2025qwen3] family, which operate purely on screenshots without requiring set-of-marks annotations or DOM access, as these can hurt generalization due to the web's inconsistent underlying structure [yutori2025bitter]. We study the effect of training stages and environment composition, and analyze how performance scales with model size and test-time compute. Our 8B model achieves strong results across multiple web navigation benchmarks. On Online-Mind2Web [xue2025illusion], it reaches 39.2% pass@1 with only 30 steps, outperforming open-weight models that use 3× more steps, and improves further with additional test-time compute.
## 2 Related Work
#### Building Web Agents.
Early web agents relied on text-only language models that process structured representations such as accessibility trees or DOM elements [yao2022webshop; deng2023mind2web; zhou2023webarena; gur2023real]. Later work adopted vision-language models (VLMs) to ground actions visually [koh2024visualwebarena; he2024webvoyager; hong2024cogagent]. Since early VLMs had limited grounding capabilities, initial approaches augmented screenshots with set-of-marks [yang2023set] overlays, which place numbered bounding boxes on interactive elements to simplify action prediction. However, this introduces dependencies on accurate element detection and adds visual clutter that does not reflect natural web perception [zheng2024gpt]. More recent work removes these aids entirely, building agents that operate on raw screenshots and predict actions as pixel coordinates [qin2025ui; wang2025ui; andreux2025surfer; awadallah2025fara; gupta2026molmoweb; openaicomputeruse; geminicomputeruse]. We follow this direction and train visual web agents with screenshot input and coordinate-based actions.
#### Data and Environments for Web Agents.
Several efforts collect supervised fine-tuning (SFT) trajectories through human annotation or model-generated rollouts. Fara [awadallah2025fara] develops a multi-agent data generation system that produces 145K trajectories across 70K domains. MolmoWeb [gupta2026molmoweb] combines over 100K synthetic task trajectories with 30K+ human demonstrations and GUI perception data. OpenCUA [wang2025opencua] and AgentTrek [xu2024agenttrek] similarly collect demonstration data for web tasks. While valuable, SFT data alone provides limited support for the exploration and trial-and-error learning that RL training enables.
Synthetic environments offer an alternative by enabling RL training in controlled settings. WebArena [zhou2023webarena] and VisualWebArena [koh2024visualwebarena] provide self-hosted websites that simulate e-commerce, forums, and content management systems. WebRL [qi2024webrl] and AgentGym-RL [xi2025agentgym] build on these for RL training, yet they cover only a handful of domains and do not capture the diversity of the real web.
Recent work has explored scaling task generation. InstaV3 [trabucco2025insta] develops an LLM-based pipeline to generate web navigation tasks across 146K live websites. WebGym [bai2026webgym] sources several datasets for RL training on live websites, but suffers from reproducibility issues and training instability due to timeouts and bot detection. Our framework addresses this through caching and LLM-based synthesis while remaining grounded in real websites and web navigation skills.
#### Evaluating Web Agents.
Evaluation benchmarks for web agents span visual grounding and end-to-end task completion. For visual grounding, benchmarks like ScreenSpot-v2 [wu2024atlas], ScreenSpot-Pro [li2025screenspot], and MMBench-GUI [wang2025mmbench] evaluate an agent's ability to localize and interact with UI elements. For task completion, benchmarks vary in their use of simulated versus real environments. World of Bits [shi2017world] was an early effort that cached HTTP traffic to create reproducible offline approximations of websites, though limited to simple mini-tasks. WebArena [zhou2023webarena] and VisualWebArena [koh2024visualwebarena] evaluate agents on self-hosted websites with programmatic success checking. While reproducible, they suffer from a sim-to-real gap. Benchmarks on real websites include GAIA [mialon2023gaia], WebVoyager [he2024webvoyager], and Mind2Web [deng2023mind2web], which test agents on live web tasks but face reproducibility challenges as websites change over time. WebVoyager additionally suffers from limited task diversity, with up to 51% of tasks solvable via search shortcuts. Online-Mind2Web [xue2025illusion] addresses these issues with a more realistic setup that evaluates agents on live websites using an LLM-as-Judge for task success. DeepShop [lyu2025deepshop] and WebTailBench [awadallah2025fara] further test agents on e-commerce and long-tail web tasks, respectively.

Figure 2: Framework overview. Weblica-Cache (top): We record a browsing session capturing all HTTP traffic, then identify volatile parameters (e.g., timestamps, session tokens) that cause cache misses during playback. These are used to generate site-specific caching rules that strip volatile parameters from cache keys, enabling deterministic replay under complete network isolation. Weblica-Synth (bottom): We task coding agents to generate web environments parameterized by a navigation capability, website category, and visual style. The agent writes framework-free HTML, CSS, and JavaScript and iterates using tools (e.g., image generation, screenshot validation) until the website and tasks are functional. We apply both approaches at scale to create diverse offline training environments (see [Figure 1](https://arxiv.org/html/2605.06761#S1.F1) for samples).
## 3 Framework
[Figure 2](https://arxiv.org/html/2605.06761#S2.F2) provides an overview of our framework. We first describe the agent formulation for visual web navigation ([Section 3.1](https://arxiv.org/html/2605.06761#S3.SS1)), then detail the two environment construction mechanisms: HTTP-level caching ([Section 3.2](https://arxiv.org/html/2605.06761#S3.SS2)) and LLM-based synthesis ([Section 3.3](https://arxiv.org/html/2605.06761#S3.SS3)).
### 3.1 Agent Formulation
We formulate web navigation as a partially observable Markov decision process (POMDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, T, R)$, where $\mathcal{S}$ is the state space of the browser, $\mathcal{A}$ is the action space, $\mathcal{O}$ is the observation space, $T(s_{t+1} \mid s_t, a_t)$ is the transition function governing how the browser state changes in response to an action (implemented via Playwright), and $R$ is the reward function (defined in [Section 4.1](https://arxiv.org/html/2605.06761#S4.SS1)). At each timestep $t$, the agent receives an observation $o_t \in \mathcal{O}$ and selects an action $a_t \in \mathcal{A}$ conditioned on the task instruction $\tau$ and the history $o_{\leq t}$. See [Figure 3](https://arxiv.org/html/2605.06761#S3.F3) for an example trajectory.
Figure 3: Example trajectory of Weblica-8B solving a data entry task in a Weblica-Synth environment, evaluated by an LLM judge against task-specific criteria (more examples in [Appendix B](https://arxiv.org/html/2605.06761#A2)).

#### Observation Space.
Each observation $o_t = (s_t, u_t)$ consists of a browser screenshot $s_t$ rendered at $1280 \times 720$ pixels and the current URL $u_t$. Unlike approaches that rely on accessibility trees or DOM structures, our agents operate purely on visual input.
#### Action Space.
We adopt a coordinate-based action space following recent work on visual web agents. Coordinate-based actions (click, hover) take pixel positions $(x, y)$ as arguments, while other actions take task-specific arguments (text, keys, direction, etc.). The stop action terminates the episode and optionally returns a response. Please see [Table 6](https://arxiv.org/html/2605.06761#A3.T6) for the full action space.
#### Policy.
The agent policy $\pi_\theta(a_t \mid o_{\leq t}, \tau)$ is parameterized by a vision-language model and follows a ReAct-style [yao2022react] framework. At each step, the model produces a reasoning trace $r_t$ analyzing the current observation, then selects an action $a_t$. Both reasoning traces and actions are appended to the history for subsequent steps. We use Qwen3-VL-Instruct as our base model, which supports coordinate-based action prediction without set-of-marks or other visual annotations.
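The ReAct-style rollout loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `policy` and `env` interfaces and the `Step`/`Episode` containers are hypothetical names standing in for whatever the training harness actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    reasoning: str  # r_t: the model's analysis of the current observation
    action: dict    # a_t: e.g. {"name": "click", "x": 412, "y": 173}

@dataclass
class Episode:
    task: str                                    # instruction tau
    history: list = field(default_factory=list)  # interleaved reasoning and actions

def run_episode(policy, env, task, max_steps=30):
    """Roll out pi_theta(a_t | o_<=t, tau) until `stop` or the step budget."""
    ep = Episode(task=task)
    obs = env.reset()  # obs = (screenshot, url)
    for _ in range(max_steps):
        step = policy(task=ep.task, history=ep.history, observation=obs)
        ep.history.append(step)  # reasoning trace and action both enter the context
        if step.action["name"] == "stop":
            return ep, step.action.get("response")
        obs = env.step(step.action)
    return ep, None
```

Keeping both the reasoning trace and the action in the history mirrors the paper's description that both are appended to the context for subsequent steps.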
### 3.2 HTTP-Level Caching
#### Recording and Replay.
We implement HTTP-level caching using Playwright to record and replay web interactions. During recording, we capture all HTTP traffic and index responses by normalized request signatures. The key challenge is handling volatile parameters (timestamps, session tokens) that change between visits and cause cache misses. We address this with a rule-based normalization system that filters such parameters from URLs, headers, and POST bodies, with domain-specific rules and a multi-level fallback for progressive matching.
#### Automated Rule Generation.
Developing caching rules for each website requires analyzing its traffic patterns. We automate this with a pipeline: first, we record a browsing session performed by a Qwen3-VL-32B-Instruct agent, capturing all request parameters without filtering. A subsequent playback reveals cache misses, which we fuzzy-match against the recording to identify which parameters changed across visits. These reports are used to synthesize site-specific caching rules and synthetic responses for non-essential endpoints (e.g., analytics). Generated rules are validated through playback with complete network isolation. Only sessions where the agent successfully completes the task under cached conditions are retained for training. This automated approach captures the full fidelity of real web content, including dynamic layouts and UI interactions, and scales to thousands of domains.
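To make the cache-key idea concrete, the sketch below strips volatile query parameters from a request before forming its key, so that two visits differing only in timestamps or session tokens hit the same cache entry. It is a simplified illustration under assumed names (`VOLATILE_PARAMS`, `cache_key`); the actual system also normalizes headers and POST bodies and applies multi-level fallback matching, both omitted here.

```python
import hashlib
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical site-specific rules: query parameters whose values change
# between visits (timestamps, session tokens) and must not enter the key.
VOLATILE_PARAMS = {"example.com": {"_ts", "session_id", "cb"}}

def cache_key(method: str, url: str, body: bytes = b"", rules=VOLATILE_PARAMS) -> str:
    """Normalize a request into a deterministic cache key."""
    parts = urlsplit(url)
    volatile = rules.get(parts.hostname, set())
    # Drop volatile query parameters and sort the rest for determinism.
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in volatile)
    normalized = f"{method} {parts.scheme}://{parts.netloc}{parts.path}?{urlencode(kept)}"
    # Hash the (already normalized, here raw) body so POST requests differ.
    digest = hashlib.sha256(body).hexdigest()[:16]
    return f"{normalized}#{digest}"
```

Sorting the surviving parameters makes the key order-independent, which matters because browsers do not always emit query strings in a stable order across visits.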
#### Environments and Tasks.
We leverage the InstaV3 [trabucco2025insta] dataset as our task pool, which provides web navigation tasks across 146K websites generated through an LLM-based pipeline. We match these tasks to cached environments and verify solvability under cached conditions, retaining 15.6K cached environments and tasks. We call the resulting collection Weblica-Cache.
### 3.3 LLM-Based Environment Synthesis
Recent agentic coding tools such as Claude Code [anthropic2025claudecode] have demonstrated strong autonomous coding capabilities. Given task descriptions and verification criteria, these systems can work independently until success requirements are satisfied. We leverage such tools to generate synthetic web environments at scale, enabling on-demand creation of functionality that is difficult to capture via caching alone, e.g., stateful tasks.
#### Capability Extraction.
We target the set of broad web navigation capabilities used across common websites and reflected in benchmarks like Online-Mind2Web [xue2025illusion]. To identify these capabilities, we first collect trajectories from Qwen3-VL-32B-Instruct attempting the Online-Mind2Web tasks. We then use GPT-5.2 [openai2025gpt52] to analyze the successful and failed trajectory screenshots and extract the coarse web interaction capabilities apparent in each. This yields 19,721 fine-grained capabilities (e.g., tab interface navigation, open dropdown menu), which we aggregate into 144 higher-level capability groups (e.g., navigation, form input, date selection, map interaction).
#### Diverse Website Generation.
We use Claude Code (Opus 4.5 [anthropic2025opus45]) to automatically generate self-contained web environments, writing static HTML, JavaScript, and CSS without any external framework dependencies. Each generation is parameterized by a target capability group, a website category randomly sampled from a pool of 1,160 domains (e.g., aviation, banking, yoga studio, zoology), and a visual style sampled from 961 options (e.g., Editorial, Minimalist, Skeuomorphic, Duotone). See [Figure 19](https://arxiv.org/html/2605.06761#A4.F19) for an overview of the distribution of capabilities, domains, and visual styles. Sampling website categories and visual styles is important for diversity, as we found that without them, generated websites converge to a narrow visual style and content. For each website we generate at least 10 tasks of varying difficulty that require performing the target capability and other interactions present on the site. The websites are static and have no backend, and we use JavaScript's localStorage feature to save values (e.g., items in cart) during a session. We use Z-Image-Turbo [Team2025ZImage] to automatically generate relevant visual assets for each website, such as product images and banners. To ensure quality, we instruct Claude Code to self-validate each site by taking screenshots via Playwright and iterating until the output satisfies the desired quality, i.e., until the website and tasks are functional and there are no CSS issues.
#### Environment Server.
Serving synthetically generated websites instead of training on the real web also enables much faster environment interactions, with environment setup/reset and action-to-screenshot times around 50 to 150 ms. Hosting environments locally removes the need to wait for the network to stabilize, and we additionally use Playwright's animation skipping feature. Together, these speed up Playwright by an order of magnitude (from ∼1.5 s to 50-150 ms per action), leading to an overall 30-40% speed improvement in end-to-end RL training.
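The paper does not publish its exact Playwright configuration, but one common way to suppress animation-related waiting is to inject CSS that zeroes animation and transition durations and to emulate the `prefers-reduced-motion` media feature. The helper below is a hedged sketch along those lines; `configure_page` and the CSS constant are hypothetical names, and only `add_init_script` and `emulate_media` from Playwright's Python API are assumed.

```python
# CSS that forces transitions and animations to complete instantly, so the
# page settles immediately after an action instead of animating for ~1 s.
DISABLE_ANIMATIONS_CSS = """
*, *::before, *::after {
  animation-duration: 0s !important;
  animation-delay: 0s !important;
  transition-duration: 0s !important;
  transition-delay: 0s !important;
  scroll-behavior: auto !important;
}
"""

def configure_page(page):
    """Apply speed-oriented settings to a Playwright-like page object.

    `page` only needs to expose `add_init_script` and `emulate_media`,
    matching Playwright's Python Page API; any compatible object works.
    """
    # Inject the CSS into every document before its own scripts run.
    page.add_init_script(
        "const s = document.createElement('style');"
        f"s.textContent = {DISABLE_ANIMATIONS_CSS!r};"
        "document.addEventListener('DOMContentLoaded',"
        " () => document.head.appendChild(s));"
    )
    # Hint to the site that the user prefers reduced motion.
    page.emulate_media(reduced_motion="reduce")
```

Because the environments here are served from localhost, disabling animations removes most of the remaining per-action latency that is not network-bound.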
#### Environments and Tasks.
In total, we generated 310 sites covering the higher-level capability groups, and 2,500 sites targeting the fine-grained capabilities sorted by frequency. We call the resulting collection of synthetic web environments Weblica-Synth. From this full set, we reserve 2,560 web environments for training, covering 44,227 tasks (Weblica-train), and 250 sites for validation, covering 500 tasks (Weblica-val). Please see [Figure 1](https://arxiv.org/html/2605.06761#S1.F1) as well as [Figures 8](https://arxiv.org/html/2605.06761#A2.F8), [9](https://arxiv.org/html/2605.06761#A2.F9), and [10](https://arxiv.org/html/2605.06761#A2.F10) for random examples of synthetically generated websites.
## 4 Training
We describe our training pipeline below. Please see [Appendix C](https://arxiv.org/html/2605.06761#A3) for further details.
### 4.1 Data and Reward
#### RL Data.
We describe task sourcing for RL environments in [Section 3.2](https://arxiv.org/html/2605.06761#S3.SS2) and [Section 3.3](https://arxiv.org/html/2605.06761#S3.SS3). Please see [Section D.1](https://arxiv.org/html/2605.06761#A4.SS1) for environment statistics.
#### LLM-as-Judge Reward.
As many web navigation tasks are open-ended and cannot be evaluated programmatically or via string matching, we implement an LLM-as-judge [zheng2023judging] reward mechanism. Given a task description, the agent's action sequence, and the resulting screenshots, we prompt GPT-4o [hurst2024gpt] to assess whether the agent successfully completed the task. This enables training on the full diversity of web tasks beyond those with programmatic verification. We validate the LLM judge by measuring agreement with human evaluations, finding 88% agreement (see [Section C.4](https://arxiv.org/html/2605.06761#A3.SS4) for details).
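A minimal sketch of such a reward: assemble the trajectory into a judge prompt and map the judge's verdict to a binary reward. The prompt wording and the SUCCESS/FAILURE protocol are illustrative assumptions, not the authors' actual prompt, and the call to the judge model itself (with screenshots attached as images) is omitted.

```python
def build_judge_prompt(task: str, actions: list, final_url: str) -> str:
    """Assemble a judge prompt from the trajectory. Screenshots are omitted
    here; in practice they are attached as images to the judge model."""
    action_log = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(actions))
    return (
        "You are evaluating a web navigation agent.\n"
        f"Task: {task}\n"
        f"Actions taken:\n{action_log}\n"
        f"Final URL: {final_url}\n"
        "Based on the actions and the attached screenshots, did the agent "
        "complete the task? Answer SUCCESS or FAILURE with a short reason."
    )

def parse_reward(judge_reply: str) -> float:
    """Map the judge's free-form reply to a binary RL reward."""
    verdict = judge_reply.strip().upper()
    return 1.0 if verdict.startswith("SUCCESS") else 0.0
```

Constraining the judge to a leading SUCCESS/FAILURE token keeps the reward parsing trivial while still allowing a free-form justification, which is useful when auditing judge-human agreement.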
#### SFT Data.
We generate supervised fine-tuning data by collecting trajectories from a Qwen3-VL-32B-Instruct agent on InstaV3 queries. As not all rollouts succeed, due to agent failures or website-specific issues such as changed task criteria or timeouts, we filter trajectories using the LLM judge to retain only successful completions. To increase coverage, we sample multiple rollouts per query with diverse sampling parameters, yielding 51.7K SFT trajectories in total. Please see [Figure 18](https://arxiv.org/html/2605.06761#A3.F18) for the trajectory length distribution.
### 4.2 Post-Training Stages
#### SFT Warm-Start.
We fine-tune the base VLM on verified trajectories before RL training. This provides a strong initialization that already exhibits reasonable web navigation behavior. We ablate the role of SFT by comparing warm-started RL against cold-start RL that trains directly from the base model in [Section 5.4](https://arxiv.org/html/2605.06761#S5.SS4).
#### RL Training.
We train using Dr. GRPO [liu2025understanding], a variant of Group Relative Policy Optimization (GRPO) [shao2024deepseekmath], with the LLM-as-Judge reward on our diverse environment suite. During RL, we oversample rollouts and filter to retain groups with mixed success signals, similar to DAPO [yu2025dapo], and apply fallback mechanisms to handle errors or timeouts, ensuring stable training.
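As a rough sketch of the two ingredients named above: Dr. GRPO computes group-relative advantages by subtracting the group-mean reward without dividing by the group's standard deviation, and DAPO-style filtering drops groups whose rollouts all share the same outcome, since their advantages would be uniformly zero. Function names are illustrative, and the surrounding policy-gradient machinery is omitted.

```python
def group_advantages(rewards):
    """Dr. GRPO-style advantages: subtract the group-mean reward, but
    (unlike vanilla GRPO) do not divide by the group's reward std."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def keep_group(rewards):
    """DAPO-style filter: keep only groups with mixed outcomes; uniform
    groups (all success or all failure) carry no learning signal."""
    return len(set(rewards)) > 1

def filter_groups(groups):
    """Oversample rollout groups upstream, then retain informative ones
    together with their advantages."""
    return [(g, group_advantages(g)) for g in groups if keep_group(g)]
```

With a binary LLM-judge reward, the filter simply discards groups where every attempt passed or every attempt failed, which is why rollouts are oversampled in the first place.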
## 5 Experiments
We evaluate Weblica on web navigation and UI grounding benchmarks. After describing our setup ([Section 5.1](https://arxiv.org/html/2605.06761#S5.SS1)), we present main results against existing agents ([Section 5.2](https://arxiv.org/html/2605.06761#S5.SS2)), analyze test-time scaling ([Section 5.3](https://arxiv.org/html/2605.06761#S5.SS3)), and ablate training stages, environment choices, and the effect on grounding ([Section 5.4](https://arxiv.org/html/2605.06761#S5.SS4)).
### 5.1 Setup
#### Environments.
Our training environment consists of Weblica-Cache and Weblica-Synth environments, covering diverse domains including e-commerce, information, entertainment, and government services. We subsample 10K tasks from each, yielding 20K tasks for RL training.
Table 1: Comparison across web navigation benchmarks. Online-Mind2Web, DeepShop, and WebTailBench are live benchmarks; Weblica-val is synthetic. Avg. is the mean over the three live benchmarks. Baseline results are reported from [awadallah2025fara; gupta2026molmoweb]; Online-Mind2Web results for the API models are from the auto-eval leaderboard. With only 60 total steps compared to ≥100 for baselines, Weblica-8B already achieves the best average among non-proprietary models, and further improves with additional test-time compute. †Fara-7B and MolmoWeb-8B allow up to 5 and 10 retries per trajectory.

| Model | Total # Steps | Online-Mind2Web | DeepShop | WebTailBench | Avg. | Weblica-val |
|---|---|---|---|---|---|---|
| *API-only* | | | | | | |
| OpenAI CUA | 100 | 58.3 | 24.7 | 25.7 | 36.2 | – |
| Gemini CUA | 100 | 57.3 | 62.0 | 63.0 | 60.8 | – |
| Yutori Navigator [yutori2025navigator] | – | 64.7 | – | – | – | – |
| *Open-weight* | | | | | | |
| Qwen3-VL-Instruct-8B | 30 | 28.6 | 24.1 | 21.8 | 24.8 | 56.9 |
| UI-TARS-1.5-7B [qin2025ui] | 100 | 31.3 | 11.6 | 19.5 | 20.8 | – |
| GLM-4.1V-9B-Thinking [hong2025glm] | 100 | 33.9 | 32.0 | 22.4 | 29.4 | – |
| Fara-7B [awadallah2025fara] | ≥100† | 34.1 | 26.2 | 38.4 | 32.9 | – |
| MolmoWeb-8B [gupta2026molmoweb] | ≥100† | 35.3 | 42.3 | 49.5 | 42.4 | – |
| *Ours (pass@k)* | | | | | | |
| Weblica-8B (k=1) | 30 | 39.2 | 34.2 | 33.5 | 35.6 | 70.6 |
| Weblica-8B (k=2) | 60 | 50.3 | 45.4 | 47.0 | 47.6 | 79.0 |
| Weblica-8B (k=4) | 120 | 60.5 | 55.9 | 60.3 | 58.9 | 84.7 |
| Weblica-8B (k=8) | 240 | 68.8 | 65.8 | 72.2 | 68.9 | 88.6 |
#### Evaluation.
We evaluate on four web navigation benchmarks. Online-Mind2Web (OM2W) [xue2025illusion] tests agents on diverse live websites. DeepShop [lyu2025deepshop] focuses on e-commerce tasks. WebTailBench (WTB) [awadallah2025fara] tests agents on long-tail web tasks. We use the official judges provided by each benchmark for evaluation. For WebTailBench, we follow MolmoWeb [gupta2026molmoweb] and use the WebVoyager [he2024webvoyager] judge. We also evaluate on Weblica-val, 500 held-out tasks on 250 unseen synthesized environments, to measure in-distribution generalization. We study test-time scaling by running $k \in \{1, 2, 4, 8\}$ independent attempts and reporting pass@k, where the total step budget scales linearly with $k$. Unlike retry-based evaluation in prior work [awadallah2025fara; gupta2026molmoweb], where failures attributed to the environment are selectively discarded, this provides a more reliable estimate of task success and a transparent measure of total test-time compute. We repeat each experiment three times and report mean accuracy; standard deviations are reported in plots and in [Appendix A](https://arxiv.org/html/2605.06761#A1). We additionally evaluate UI grounding on MMBench-GUI [wang2025mmbench], ScreenSpot-v2 [wu2024atlas], and ScreenSpot-Pro [li2025screenspot] ([Section 5.4](https://arxiv.org/html/2605.06761#S5.SS4.SSS0.Px3)).
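Because attempts are fully independent, pass@k here reduces to counting tasks where any of the first $k$ attempts was judged successful. A short sketch with an assumed results layout (one list of binary attempt outcomes per task):

```python
def pass_at_k(results, k):
    """results: per-task lists of binary attempt outcomes (1 = judged success).
    A task counts as solved if any of its first k independent attempts succeeds."""
    assert all(len(r) >= k for r in results), "need at least k attempts per task"
    solved = sum(1 for r in results if any(r[:k]))
    return solved / len(results)
```

This is the simple empirical estimator for independent attempts; it is distinct from retry-based evaluation, where environment-attributed failures are discarded before scoring.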
#### Baselines.
We compare against open-weight models including UI-TARS-1.5-7B [qin2025ui], GLM-4.1V-9B-Thinking [hong2025glm], Fara-7B [awadallah2025fara], and MolmoWeb-8B [gupta2026molmoweb], as well as API-based agents including OpenAI computer-use-preview, Gemini computer-use-preview, and Yutori Navigator [yutori2025navigator]. We also report the base Qwen3-VL-Instruct-8B as a reference.
### 5.2 Main Results
[Table 1](https://arxiv.org/html/2605.06761#S5.T1) presents results across four benchmarks, and [Figure 3](https://arxiv.org/html/2605.06761#S3.F3) shows a qualitative example (more in [Appendix B](https://arxiv.org/html/2605.06761#A2)). We summarize the key observations below.
#### Training on Weblica data improves the base model by 44%.
At pass@1 with 30 steps, Weblica-8B improves the base Qwen3-VL-8B from 24.8% to 35.6% average across the three live benchmarks, an absolute gain of 10.8 percentage points. On Weblica-val, our held-out evaluation set of synthesized environments, training improves the base from 56.9% to 70.6%. We observe consistent improvements at the 2B and 4B model sizes as well ([Figure 1](https://arxiv.org/html/2605.06761#S1.F1) and [Section A.1](https://arxiv.org/html/2605.06761#A1.SS1)).
#### Weblica-8B outperforms open-weight baselines with less test-time compute.
With only 60 total steps, Weblica-8B surpasses the best open-weight baseline average (47.6% vs. 42.4% for MolmoWeb-8B at ≥100 steps), and scales to 58.9% and 68.9% with 120 and 240 steps, respectively.
#### Weblica-8B is competitive with API models.
At pass@1 with only 30 steps, Weblica-8B already matches OpenAI computer-use-preview (35.6% vs. 36.2% at 100 steps). Additional test-time compute leads to further improvements: with 120 total steps, Weblica-8B reaches 58.9%, approaching Gemini computer-use-preview (60.8%), and with 240 steps reaches 68.9%. These results demonstrate the effectiveness of training on synthetic Weblica environments, producing strong visual web agents without relying on large-scale human demonstrations.
### 5.3 Test-Time Scaling
Figure 4: (a) Test-time scaling on Online-Mind2Web. Weblica-8B benefits from both increasing the action budget (steps per episode) and parallel attempts (pass@k). The base Qwen3-VL-8B, in contrast, shows minimal gains from longer episodes, suggesting that RL training enables the model to effectively leverage additional steps. (b) Environment ablation across four benchmarks. Both training configurations improve substantially over the base model, with synthesized environments outperforming cache-only on most benchmarks.

We study how test-time compute scaling affects performance through two axes: the number of parallel attempts (pass@k) and the per-episode action budget. [Figure 4](https://arxiv.org/html/2605.06761#S5.F4)a compares Weblica-8B and the base Qwen3-VL-8B on Online-Mind2Web. Weblica-8B improves consistently along both axes: increasing the action budget from 15 to 30 steps improves pass@1 from 32.6% to 39.2%, and pass@8 at 30 steps per attempt reaches 68.8%. In contrast, the base model shows minimal gains from a larger action budget, with its 15- and 30-step curves nearly overlapping. This gap suggests that RL training is key to unlocking effective use of longer episodes.
### 5.4 Analysis
#### Environment Type.
We compare the effect of training with Weblica-Cache only versus Weblica-Synth only. [Figure 4](https://arxiv.org/html/2605.06761#S5.F4)b shows pass@1 results across all four benchmarks. Both configurations improve substantially over the base model. Synthesized environments outperform cache-only on most benchmarks, with the largest gains on Online-Mind2Web (39.2% vs. 35.3%) and WebTailBench (33.5% vs. 30.2%), while cache-only performs slightly better on DeepShop (35.8% vs. 34.2%). Our initial attempts with different mixture ratios of cached and synthesized environments did not yield further improvements; we expect that designing effective training curricula across environment types could lead to additional gains.
#### Training Stages and Model Size.
[Figure 5](https://arxiv.org/html/2605.06761#S5.F5) compares the contribution of each training stage across model sizes on Online-Mind2Web. Both SFT and RL provide consistent gains at all scales. At 2B, the full SFT+RL pipeline improves pass@1 from 13.3% to 24.1%; at 4B, from 23.2% to 35.2%; and at 8B, from 28.6% to 39.2%. RL on top of SFT provides a larger improvement than SFT alone at every scale. Comparing SFT→RL with Base→RL shows that SFT initialization is critical for smaller models, but SFT data has diminishing returns at larger scales.

Figure 5: Training stage ablation on Online-Mind2Web across three model sizes. Both SFT and RL individually improve over the base model, but their combination performs best at all scales.

Table 2: Weblica-8B preserves grounding despite no grounding-specific training data.

| Model | MMBench-GUI | ScreenSpot-v2 | ScreenSpot-Pro |
|---|---|---|---|
| Fara-7B | – | 89.3 | – |
| MolmoWeb-Ground-8B | – | 91.8 | – |
| Qwen3-VL-Instruct-8B | 82.85 | 93.95 | 54.71 |
| Weblica-8B | 83.74 | 94.50 | 55.28 |
#### Effect on Grounding.
As visual grounding is a core capability for web agents, we evaluate whether our training pipeline affects it, given that our training data contains no grounding-specific data. [Table 2](https://arxiv.org/html/2605.06761#S5.T2) shows that grounding is preserved after training, with modest improvements across all three benchmarks. Since grounding performance remains comparable, the gains on web navigation benchmarks stem from improved navigation behavior rather than better visual grounding.
## 6Conclusion and Limitations
We presentedWeblica, a framework for building scalable and reproducible training environments for visual web agents\.Weblicacombines HTTP\-level caching of real websites with LLM\-based synthesis of interactive environments, enabling large\-scale RL training without the instability of live web interaction\. Our best model,Weblica\-8B, achieves strong results across several web navigation benchmarks compared to open\-weight baselines of similar size, and scales favorably with additional test\-time compute\. We outline some limitations of our work and future directions\.
- **Cached environments** provide a narrow and partial view of live websites. They are static snapshots that do not reflect updates over time, and do not capture the full complexity of dynamic web applications. While we observe clear gains from training on cached environments, further exploration of methods to close this gap is an interesting direction.
- **Synthesized environments** capture core navigation patterns but do not yet model all aspects of real websites. A sim-to-real gap remains, which could be further closed with stronger generative models that produce more faithful and diverse website designs.
- **Single-turn tasks.** Our current setup evaluates agents on isolated tasks with a fixed goal defined at the start of the episode. Real-world web usage involves multi-turn sessions with evolving goals, human-in-the-loop interaction where the user provides feedback or corrections mid-session, and personalization aspects such as adapting to user preferences and maintaining memory across sessions. Extending to these settings is an exciting direction.
- **Training.** We obtain promising gains using a vanilla RL training formulation. Further exploration of the RL framework, including long-horizon RL and scaling RL compute [khatri2025art], and richer SFT data such as error-recovery trajectories, are interesting directions for improving web agent training.
- **Beyond web.** Our framework currently targets web navigation. Extending it to other GUI environments such as mobile and desktop applications to build generalist computer-use agents is a promising direction.
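As context for the cached-environment limitation above, HTTP-level record-and-replay can be sketched as follows. The `ReplayCache` class, its cache-key scheme, and all names here are our own illustration of the general technique, not Weblica's actual implementation:

```python
import hashlib

# Illustrative sketch (not Weblica's implementation): responses recorded
# from a live site are keyed by (method, URL, body) and replayed verbatim,
# so a page renders identically across training episodes.

class ReplayCache:
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(method: str, url: str, body: bytes = b"") -> str:
        # Deterministic cache key over the request triple.
        h = hashlib.sha256()
        h.update(method.encode())
        h.update(url.encode())
        h.update(body)
        return h.hexdigest()

    def record(self, method: str, url: str, body: bytes, response: bytes):
        self._store[self._key(method, url, body)] = response

    def replay(self, method: str, url: str, body: bytes = b""):
        # Cache miss returns None; a real proxy might instead synthesize
        # an error response or fall through to the live site.
        return self._store.get(self._key(method, url, body))

cache = ReplayCache()
cache.record("GET", "https://example.com/", b"", b"<html>hello</html>")
print(cache.replay("GET", "https://example.com/"))  # b'<html>hello</html>'
```

The key property is determinism: identical requests always yield identical responses, which is what makes episodes reproducible.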
## Acknowledgements
We thank Jesse Allardice, Mingfei Gao, Rui Tian, and Ege Özsoy for their help with the project, and Andrew Szot, Alexander Toshev, and Kaixin Ma for feedback and discussions\.
† Apple and the Apple logo are trademarks of Apple Inc., registered in the U.S. and other countries and regions.
## References
## Appendix
## Appendix A: Additional Quantitative Results
### A.1 Results across model sizes
[Figure 6](https://arxiv.org/html/2605.06761#A1.F6) shows pass@k results across all three model sizes and four benchmarks, with corresponding numerical values in [Table 3](https://arxiv.org/html/2605.06761#A1.T3). Training consistently improves over the base model at all scales and evaluation points.
Figure 6: Pass@k accuracy for Weblica-2B, Weblica-4B, and Weblica-8B (orange) compared to their respective Qwen3-VL-Instruct base models (gray) across four benchmarks.

Table 3: Pass@k accuracy across model sizes and benchmarks.

| Benchmark | Model | pass@1 | pass@2 | pass@4 | pass@8 |
| --- | --- | --- | --- | --- | --- |
| **Model size: 2B** | | | | | |
| Online-Mind2Web | Qwen3-VL | 13.3 ± 0.3 | 20.1 ± 0.6 | 28.6 ± 0.9 | 38.2 ± 1.3 |
| | Weblica-2B | 24.1 ± 0.8 | 33.9 ± 1.1 | 44.7 ± 1.2 | 55.6 ± 2.2 |
| DeepShop | Qwen3-VL | 3.3 ± 0.4 | 6.3 ± 0.8 | 11.6 ± 1.3 | 19.6 ± 1.9 |
| | Weblica-2B | 18.0 ± 0.6 | 27.8 ± 0.5 | 38.8 ± 0.7 | 50.2 ± 0.8 |
| WebTailBench | Qwen3-VL | 8.2 ± 0.2 | 13.3 ± 0.4 | 19.9 ± 0.7 | 27.5 ± 0.6 |
| | Weblica-2B | 17.0 ± 0.4 | 27.3 ± 0.8 | 40.0 ± 1.4 | 53.8 ± 1.9 |
| Weblica-val | Qwen3-VL | 30.0 ± 0.3 | 40.5 ± 0.1 | 50.1 ± 0.3 | 58.1 ± 0.6 |
| | Weblica-2B | 50.1 ± 0.6 | 62.7 ± 0.5 | 72.9 ± 0.2 | 80.3 ± 0.2 |
| **Model size: 4B** | | | | | |
| Online-Mind2Web | Qwen3-VL | 23.2 ± 0.2 | 31.3 ± 0.4 | 39.4 ± 0.3 | 46.9 ± 0.2 |
| | Weblica-4B | 35.2 ± 0.3 | 46.0 ± 0.6 | 55.8 ± 1.0 | 64.1 ± 1.6 |
| DeepShop | Qwen3-VL | 17.1 ± 0.5 | 25.5 ± 0.9 | 33.8 ± 1.5 | 41.8 ± 2.7 |
| | Weblica-4B | 27.8 ± 0.5 | 38.9 ± 1.5 | 49.7 ± 2.7 | 60.0 ± 3.1 |
| WebTailBench | Qwen3-VL | 16.0 ± 0.7 | 22.8 ± 0.9 | 30.2 ± 1.2 | 37.7 ± 1.4 |
| | Weblica-4B | 23.8 ± 0.4 | 35.4 ± 0.3 | 47.9 ± 0.3 | 59.6 ± 0.3 |
| Weblica-val | Qwen3-VL | 51.6 ± 0.7 | 61.4 ± 0.7 | 69.1 ± 0.7 | 74.9 ± 0.9 |
| | Weblica-4B | 64.2 ± 0.5 | 74.5 ± 0.3 | 81.3 ± 0.2 | 86.2 ± 0.6 |
| **Model size: 8B** | | | | | |
| Online-Mind2Web | Qwen3-VL | 28.6 ± 0.4 | 37.6 ± 0.2 | 46.3 ± 0.3 | 54.4 ± 1.7 |
| | Weblica-8B | 39.2 ± 0.9 | 50.3 ± 0.7 | 60.5 ± 0.9 | 68.8 ± 1.7 |
| DeepShop | Qwen3-VL | 24.1 ± 0.8 | 33.5 ± 1.2 | 41.6 ± 2.0 | 49.1 ± 3.4 |
| | Weblica-8B | 34.2 ± 1.7 | 45.4 ± 2.3 | 55.9 ± 3.1 | 65.8 ± 3.8 |
| WebTailBench | Qwen3-VL | 21.8 ± 0.2 | 31.5 ± 0.5 | 42.0 ± 1.2 | 52.4 ± 1.8 |
| | Weblica-8B | 33.5 ± 0.9 | 47.0 ± 1.1 | 60.3 ± 1.3 | 72.2 ± 1.7 |
| Weblica-val | Qwen3-VL | 56.9 ± 0.7 | 65.4 ± 0.6 | 72.5 ± 0.7 | 77.9 ± 1.0 |
| | Weblica-8B | 70.6 ± 0.5 | 79.0 ± 0.7 | 84.7 ± 1.1 | 88.6 ± 1.5 |
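The paper does not spell out how pass@k is estimated; a common choice, sketched here as an assumption, is the unbiased estimator of Chen et al. (2021), computed from n rollouts per task of which c succeed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts,
    drawn without replacement from n rollouts containing c successes,
    is a success. This estimator is an assumption, not stated in the
    paper."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(8, 3, 1))  # 0.375
print(pass_at_k(8, 3, 8))  # 1.0
```

Per-task pass@k values would then be averaged over tasks (and over seeds, yielding the ± intervals in Table 3).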
### A.2 Downstream performance during training
[Figure 7](https://arxiv.org/html/2605.06761#A1.F7) shows, for a sample training run, steady improvement in pass@1 accuracy on Online-Mind2Web and Weblica-val throughout RL training of Weblica-8B.
Figure 7: Pass@1 accuracy on Online-Mind2Web (left) and Weblica-val (right) during RL training for Weblica-8B. Each training step uses 1024 rollouts. The first data point (0 rollouts) is the SFT-initialized model before RL.
## Appendix B: Additional Qualitative Results
### B.1 Weblica-Synth visualizations
We provide additional randomly sampled screenshots of synthetically generated Weblica-Synth web environments in [Figures 8](https://arxiv.org/html/2605.06761#A2.F8), [9](https://arxiv.org/html/2605.06761#A2.F9) and [10](https://arxiv.org/html/2605.06761#A2.F10). The generated pages target a broad distribution of web capabilities and are visually diverse.
Figure 8: Weblica-Synth samples: Additional visualizations of synthetically generated web environments, capturing a broad set of web capabilities.
Figure 9: Weblica-Synth samples: [Figure 8](https://arxiv.org/html/2605.06761#A2.F8) cont.
Figure 10: Weblica-Synth samples: [Figures 8](https://arxiv.org/html/2605.06761#A2.F8) and [9](https://arxiv.org/html/2605.06761#A2.F9) cont.
### B.2 Solved trajectory examples
In [Figures 11](https://arxiv.org/html/2605.06761#A2.F11), [12](https://arxiv.org/html/2605.06761#A2.F12), [13](https://arxiv.org/html/2605.06761#A2.F13), [14](https://arxiv.org/html/2605.06761#A2.F14), [15](https://arxiv.org/html/2605.06761#A2.F15), [16](https://arxiv.org/html/2605.06761#A2.F16) and [17](https://arxiv.org/html/2605.06761#A2.F17) we show trajectory visualizations of Weblica-8B solving Weblica-val tasks of various difficulties, displaying different visual grounding and web navigation skills.
Figure 11: Example trajectory of Weblica-8B solving a Weblica-val task.
Figure 12: Example trajectory of Weblica-8B solving a Weblica-val task.
Figure 13: Example trajectory of Weblica-8B solving a Weblica-val task.
Figure 14: Example trajectory of Weblica-8B solving a Weblica-val task. This synthetic website was designed to contain substantial visual clutter, such as popups and ad banners, yet the agent successfully navigates the page.
Figure 15: Example trajectory of Weblica-8B solving a Weblica-val task. Although the agent could select the correct dropdown entry directly via clicking, Weblica-Synth tasks are designed to train web navigation capabilities broadly; in this case, the task explicitly requires completion through keyboard navigation.
Figure 16: Example trajectory of Weblica-8B solving a Weblica-val task. The agent attempts to open the dropdown menu but, as it lies near the bottom of the page, must first scroll down to reveal it.
Figure 17: Example trajectory of Weblica-8B solving a Weblica-val task. The agent initially opens two incorrect menus before identifying Manage Medications at the bottom of the screen and successfully editing the medication.
## Appendix C: Additional Training Details
### C.1 Training hyperparameters
We performed a lightweight hyperparameter exploration for both the SFT and RL stages. [Tables 4](https://arxiv.org/html/2605.06761#A3.T4) and [5](https://arxiv.org/html/2605.06761#A3.T5) list the values explored.
Table 4: SFT hyperparameters. **Bold** values indicate selected configurations that performed similarly.

| Hyperparameter | Values |
| --- | --- |
| Base model | Qwen3-VL-8B-Instruct |
| Trainable components | Language model only (vision tower + projector frozen) |
| Learning rate | **1e-5**, **1e-6** |
| LR schedule | Cosine |
| Warmup ratio | 0.1 |
| Epochs | **1**, **2**, **3** |
| Image resolution | **640×360**, **1280×720** |
| Cutoff length | 80,000 |
| Effective batch size | 8 |
| Precision | bf16 |

Table 5: RL hyperparameters. **Bold** values indicate selected configurations that performed similarly.

| Hyperparameter | Values |
| --- | --- |
| Base model | **Qwen3-VL-8B-Instruct**, **SFT model** |
| Algorithm | Dr. GRPO |
| Learning rate | **1e-5**, **5e-6**, 1e-6 |
| LR schedule | Constant |
| KL coefficient | **0**, **0.005**, **0.01**, 0.02, 0.03, 0.05 |
| Batch size | **64**, **128**, **256** |
| Rollouts per prompt (n) | **4**, **8**, 16 |
| Rounds (R) | **15**, **25**, 35 |
| PPO mini-batch size | 32 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Image resolution | 1280×720 |
| Max model context length | 80,000 |
| Max new tokens per action | 512 |
### C.2 Compute
All SFT and RL experiments are conducted on 8 NVIDIA B200 GPUs. Environments are served locally on the same node. Each RL training step generates 1024 rollouts (256 prompts × 4 rollouts per prompt), each with up to 25 interaction rounds, completing in approximately 28 minutes with our local environment setup.
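For concreteness, the stated step composition implies the following rough throughput; this is back-of-the-envelope arithmetic on the numbers above, not a reported measurement:

```python
# Back-of-the-envelope rollout throughput for one RL training step,
# using only the numbers stated above: 256 prompts x 4 rollouts per
# prompt, ~28 minutes per step.
prompts_per_step = 256
rollouts_per_prompt = 4
step_minutes = 28.0

rollouts_per_step = prompts_per_step * rollouts_per_prompt
rollouts_per_hour = rollouts_per_step * 60.0 / step_minutes

print(rollouts_per_step)         # 1024
print(round(rollouts_per_hour))  # 2194, i.e. roughly 2.2K rollouts/hour
```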
### C.3 Action space
[Table 6](https://arxiv.org/html/2605.06761#A3.T6) lists the full action space used by our visual web agent.
Table 6: Action space of visual web agent. Coordinate-based actions take pixel positions as arguments.

| Action | Description |
| --- | --- |
| click(x, y) | Click at pixel coordinates |
| hover(x, y) | Hover at pixel coordinates |
| type(text, [x, y], [enter]) | Type text, optionally at coordinates, and press enter |
| press(key) | Press a keyboard key |
| scroll(direction, [amount]) | Scroll in a given direction |
| go_back() | Navigate back in history |
| go_forward() | Navigate forward in history |
| wait() | Wait for page to load |
| stop(response) | Submit response and end episode |
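As an illustration of how such an action space might be represented in code, the sketch below serializes a few of the actions from Table 6; the class names and serialization format are our own assumptions, not the paper's interface:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical typed representation of part of the agent's action space
# (Table 6). Names and string format are illustrative only.

@dataclass
class Click:
    x: int
    y: int
    def serialize(self) -> str:
        return f"click({self.x}, {self.y})"

@dataclass
class Type:
    text: str
    xy: Optional[Tuple[int, int]] = None  # optional target coordinates
    enter: bool = False                   # optionally press enter after
    def serialize(self) -> str:
        args = [repr(self.text)]
        if self.xy is not None:
            args.append(f"{self.xy[0]}, {self.xy[1]}")
        if self.enter:
            args.append("enter")
        return f"type({', '.join(args)})"

@dataclass
class Scroll:
    direction: str                 # e.g. "up" or "down"
    amount: Optional[int] = None   # optional scroll amount
    def serialize(self) -> str:
        extra = f", {self.amount}" if self.amount is not None else ""
        return f"scroll({self.direction}{extra})"

@dataclass
class Stop:
    response: str
    def serialize(self) -> str:
        return f"stop({self.response!r})"

print(Click(320, 180).serialize())            # click(320, 180)
print(Type("hello", enter=True).serialize())  # type('hello', enter)
```

A typed representation like this makes it easy to validate model outputs before executing them in the browser.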
### C.4 Training reward judge prompt
During RL training, we use a VLM-based judge (GPT-4o) to evaluate whether the agent successfully completed each task. The judge receives the full trajectory (screenshots and actions) and evaluates it against task-specific criteria. Samples labeled as "website failure" are discarded from the training batch. The prompt is shown below.
**Training Judge Prompt**

System: You are an expert evaluator for web navigation tasks. Analyze the provided trajectory steps and determine if the agent successfully completed the following task: {task}

Respond with one of the following based on the task-specific criteria below:
- 'correct' -- The agent successfully completed the task
- 'incorrect' -- The agent failed to complete the task due to its own mistakes
- 'website failure' -- The agent was making reasonable progress but was blocked by technical issues beyond its control

Task-specific evaluation criteria: {criteria}

Technical issues that qualify for 'website failure':
- Page timeouts or loading failures
- Blank or empty pages that fail to render
- Connection errors or server errors (5xx responses)
- CAPTCHA or bot detection blocking the agent
- Pages stuck in infinite loading states
- Elements that fail to become interactive despite multiple attempts

To determine if a website issue occurred, look for these indicators in the trajectory:
- The agent repeatedly tries the same reasonable action without success
- Screenshots show loading spinners, error messages, or blank content
- The agent uses 'wait' actions multiple times without page progress
- The agent's actions are correct but the page state doesn't change as expected

Important: Only use 'website failure' when the agent was on a reasonable path toward completing the task. If the agent made fundamental mistakes before encountering technical issues, respond with 'incorrect'.

These are the actions the agent can take: [Action space description omitted for brevity; see [Table 6](https://arxiv.org/html/2605.06761#A3.T6)]

User:
Step 1 - Screenshot: [image]
Step 1 - Agent Action: [action]
Step 2 - Screenshot: [image]
Step 2 - Agent Action: [action]
…
Based on these trajectory steps, did the agent successfully complete the task? First respond with your decision followed by your reasoning.
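On the consumer side, the judge's free-form reply must be mapped back to one of the three verdicts before rewards are assigned; a minimal parsing sketch (the heuristic below is our own assumption, not the paper's implementation):

```python
# Hypothetical parser for the judge's reply. The prompt asks the judge to
# lead with its decision, so we take the earliest occurrence of any
# verdict string (case-insensitive). Note that "incorrect" contains
# "correct" as a substring, so taking the earliest match resolves the
# ambiguity correctly.
VERDICTS = ("correct", "incorrect", "website failure")

def parse_judge_verdict(reply: str) -> str:
    lowered = reply.lower()
    hits = [(lowered.find(v), v) for v in VERDICTS if v in lowered]
    if not hits:
        raise ValueError(f"No verdict found in judge reply: {reply!r}")
    return min(hits)[1]  # earliest match in the reply wins

print(parse_judge_verdict("Incorrect. The agent clicked the wrong link."))
# incorrect
```

Trajectories parsed as 'website failure' would then be dropped from the batch rather than given a negative reward.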
Figure 18: Distribution of trajectory lengths in the SFT training data.
## Appendix D: Data Statistics
### D.1 Environments
[Figure 19](https://arxiv.org/html/2605.06761#A4.F19) shows the distribution of synthesized environments across capability, domain, and visual style categories.
Figure 19: Weblica-Synth site grouping: Distribution of synthetic web browsing tasks grouped across capability categories (top left), domain categories (top right), and visual style categories (bottom).
### D.2 SFT Trajectories
[Figure 18](https://arxiv.org/html/2605.06761#A3.F18) shows the distribution of trajectory lengths across the 51.7K SFT training trajectories.