AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents
Summary
AsyncWebRL introduces an asynchronous multi-step reinforcement learning system for vision-language web agents. Its asynchronous design delivers up to a 2.9x end-to-end training speedup, and replacing the per-trajectory step-count normalizer with a constant shortens trajectories; together these set a new open-source state of the art on the WebGym out-of-distribution test split.
# AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents
Source: [https://arxiv.org/html/2606.05597](https://arxiv.org/html/2606.05597)
Hao Bai (1,2), Rui Yang (1,2), Chenlu Ye (1), Spencer Whitehead (2), Aviral Kumar (3), Tong Zhang (1)

(1) UIUC, (2) Microsoft, (3) CMU

[Project Page](https://asyncwebrl-website.github.io/) · [Code](https://github.com/microsoft/webgym/tree/async) · [Task Set](https://huggingface.co/datasets/microsoft/webgym_tasks)
###### Abstract
Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, that together deliver up to a $\boldsymbol{2.9\times}$ end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer $1/|\tau_i|$ in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failures are systematically longer than successes, it down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing $1/|\tau_i|$ with a constant $1/k$ breaks this coupling, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split ($\boldsymbol{+5.8\%}$ relative over the $42.9\%$ prior best), with the largest gains on the harder slices ($\boldsymbol{+42\%}$ relative on Medium, $\boldsymbol{+48\%}$ relative on Hard).
## 1 Introduction
Vision-language web agents trained with multi-step reinforcement learning have rapidly become a leading recipe for autonomous browsers that complete real-world tasks across the long tail of the web (Bai et al., [2026](https://arxiv.org/html/2606.05597#bib.bib27); Wang et al., [2025](https://arxiv.org/html/2606.05597#bib.bib10); Wei et al., [2025](https://arxiv.org/html/2606.05597#bib.bib11); Yang et al., [2026](https://arxiv.org/html/2606.05597#bib.bib9)), part of a broader shift from single-turn preference tuning to long-horizon, environment-grounded agent training. The defining feature of this regime is compute: each training run consumes hundreds of GPU-hours against hundreds of concurrent browser sessions, and progress is largely a function of how many trajectories the trainer can profitably consume per wall-clock hour. Against this backdrop, any compute inefficiency translates directly into a lower-performing agent at a fixed budget. This paper targets two forms of inefficiency, GPUs idling under synchronous RL and trajectories that use more steps and tokens than necessary, within a single framework that combines (1) a fully asynchronous system design tailored to per-step visual rollouts with (2) a one-line algorithmic fix that contracts trajectory and token usage without harming final success.
The natural systems solution for GPU idleness is *async RL*, where rollout, gradient update, and policy refresh proceed concurrently across iteration boundaries (Mnih et al., [2016](https://arxiv.org/html/2606.05597#bib.bib17); Espeholt et al., [2018](https://arxiv.org/html/2606.05597#bib.bib18); Hilton et al., [2021](https://arxiv.org/html/2606.05597#bib.bib29)). On the surface this appears to be a solved problem: single-step LLM-RL is already async, and WebGym (Bai et al., [2026](https://arxiv.org/html/2606.05597#bib.bib27)) already makes within-iteration rollouts async. Combining the three properties *visual, multi-step, and fully async*, however, requires a coordinated design that none of the existing open-source systems carry together ([Section 2](https://arxiv.org/html/2606.05597#S2)). The core difficulty has two parts. First, the natural payload of a visual multi-step rollout, tens of high-resolution screenshots per trajectory shared across workers every step, swamps the standard inter-worker data store and pushes the framework into a slow disk-spill path that erases any async benefit. Second, an iteration-synchronous rollout pool of the kind WebGym uses pays a warm-up cost on hundreds of browser sessions every round.
We address both with two web-agent-specific designs: lightweight screenshot handling and an everlasting rollout pool. The former keeps image tensors out of the shared data store entirely, routing only lightweight references between rollout workers and the trainer. The latter keeps workers continuously alive across iteration boundaries so episodes hand off immediately and parameter updates of $\pi_\theta$ never stall collection ([Section 3.1](https://arxiv.org/html/2606.05597#S3.SS1)). Closing the resulting off-policy gap also requires an algorithmic correction. We adopt the decoupled-PPO factorization of Hilton et al. ([2021](https://arxiv.org/html/2606.05597#bib.bib29)), splitting the standard $\pi_\theta/\pi_{\mathrm{behave}}$ ratio into a rollout-staleness term and a current-update term and centering PPO-style clipping (Schulman et al., [2017](https://arxiv.org/html/2606.05597#bib.bib26)) on a proximal policy $\pi_{\mathrm{prox}}\approx\pi_\theta$. As we will show in [Section 3.2](https://arxiv.org/html/2606.05597#S3.SS2) and [Section 4.3](https://arxiv.org/html/2606.05597#S4.SS3), this roughly halves the clip-trigger rate and substantially speeds reward improvement. The resulting system is, to our knowledge, the fastest reported open multi-step RL framework for visual web agents, at a $\boldsymbol{2.4}$ to $\boldsymbol{2.9\times}$ end-to-end speedup over prior open rollout pools (Bai et al., [2026](https://arxiv.org/html/2606.05597#bib.bib27)).
Beyond this form of off-policy correction, we identify per-trajectory step-number normalization (see footnote 1), i.e., $\boldsymbol{1/|\tau_i|}$ in multi-step GRPO (Shao et al., [2024](https://arxiv.org/html/2606.05597#bib.bib28)), as the root cause of trajectory- and token-level inefficiency, and replace it with a constant $\boldsymbol{1/k}$ ([Section 5](https://arxiv.org/html/2606.05597#S5)). The intuition follows Dr. GRPO (Liu et al., [2025b](https://arxiv.org/html/2606.05597#bib.bib4)) but at step-level rather than token-level granularity: in our setting failures average $12.5$ steps against $5.1$ for successes, so $1/|\tau_i|$ attenuates the gradient on failed tokens by $\approx 2.4\times$, and the policy responds by producing verbose memory schemas. Combined with the off-policy correction, this sets a new state of the art on the WebGym OOD test split ([Section 4.2](https://arxiv.org/html/2606.05597#S4.SS2)).

Footnote 1: Throughout the paper, “step-number normalization” refers to the multi-step GRPO factor $1/|\tau_i|$ that normalizes the loss by the number of *steps* (not tokens) in a trajectory. This is distinct from Dr. GRPO’s removal of the single-turn factor $1/|y_i|$, which normalizes by the number of tokens in a response. We call the latter “token-number normalization.”
Our contributions.

(1) System. The first open multi-step RL framework for visual web agents that is fully async end-to-end, via an everlasting rollout pool, lightweight screenshot handling, and a decoupled importance-sampling ratio ([Section 3.1](https://arxiv.org/html/2606.05597#S3.SS1)); this delivers a $2.4$ to $2.9\times$ end-to-end speedup over the previously fastest open sync pipeline and a new open-source state of the art on the WebGym OOD test split ($+5.8\%$ relative over the $42.9\%$ prior best).

(2) Algorithm. A diagnosis of the per-trajectory step-number normalizer $1/|\tau_i|$ in multi-step GRPO as the root cause of trajectory- and token-level inefficiency, traced through length-coupled memory drift ([Section 5](https://arxiv.org/html/2606.05597#S5)), and a one-line fix replacing $1/|\tau_i|$ with a constant $1/k$ that drives the largest gains on the harder OOD slices ($+42\%$ relative on Medium, $+48\%$ relative on Hard over the prior WebGym SOTA).
## 2 Related Work
System: visual, multi-step, and async RL. Multi-step RL on VLM policies has been studied for web browsing (Bai et al., [2026](https://arxiv.org/html/2606.05597#bib.bib27); Wei et al., [2025](https://arxiv.org/html/2606.05597#bib.bib11); Yang et al., [2026](https://arxiv.org/html/2606.05597#bib.bib9)), computer use (Bai et al., [2024](https://arxiv.org/html/2606.05597#bib.bib13), [2025](https://arxiv.org/html/2606.05597#bib.bib12); Wang et al., [2025](https://arxiv.org/html/2606.05597#bib.bib10)), embodied learning (Kim et al., [2025](https://arxiv.org/html/2606.05597#bib.bib15); Liu et al., [2025a](https://arxiv.org/html/2606.05597#bib.bib14)), and sequential decision-making (Zhai et al., [2024](https://arxiv.org/html/2606.05597#bib.bib16)). Open-source frameworks split into two disjoint subsets that AsyncWebRL bridges. *Async LLM-RL on text* and *async single-turn VLM-RL* (Fu et al., [2025](https://arxiv.org/html/2606.05597#bib.bib20); Lu et al., [2025](https://arxiv.org/html/2606.05597#bib.bib23); Sheng et al., [2025](https://arxiv.org/html/2606.05597#bib.bib24); Wu et al., [2025](https://arxiv.org/html/2606.05597#bib.bib5); Han et al., [2025](https://arxiv.org/html/2606.05597#bib.bib22); Zhong et al., [2025](https://arxiv.org/html/2606.05597#bib.bib21); Zhou et al., [2025](https://arxiv.org/html/2606.05597#bib.bib25); Noukhovitch et al., [2024](https://arxiv.org/html/2606.05597#bib.bib19); Sheng et al., [2024](https://arxiv.org/html/2606.05597#bib.bib2)) ship at most one image per training example; pushing multi-turn web-agent rollouts (tens of high-resolution screenshots per trajectory, hundreds concurrent) through their shared data store exhausts its budget and triggers a disk-spill path that erases the async benefit. *Sync multi-step VLM-RL* (Bai et al., [2026](https://arxiv.org/html/2606.05597#bib.bib27); Wei et al., [2025](https://arxiv.org/html/2606.05597#bib.bib11); Bai et al., [2024](https://arxiv.org/html/2606.05597#bib.bib13), [2025](https://arxiv.org/html/2606.05597#bib.bib12)) carries the right workload but resynchronizes at every iteration boundary, leaving GPU bubbles. Switching the latter to fully async execution opens an off-policy gap that sync frameworks never see: training batches contain tokens drawn from several policy versions at once, and the trust region must be re-centered on a proximal policy (Hilton et al., [2021](https://arxiv.org/html/2606.05597#bib.bib29)) so that clipping reflects current-update movement rather than rollout staleness. The closed-source UI-TARS-2 (Wang et al., [2025](https://arxiv.org/html/2606.05597#bib.bib10)) reportedly satisfies all three properties but releases only model weights. AsyncWebRL is, to our knowledge, the first open framework to combine all three.
Algorithm: loss shape under length asymmetry. On the algorithmic side, Dr. GRPO (Liu et al., [2025b](https://arxiv.org/html/2606.05597#bib.bib4)) showed that the single-turn GRPO normalizer $1/|y_i|$ reweights per-token gradients between long and short responses, and that removing it (replacing $1/|y_i|$ with a constant) corrects an unintended length bias inside a single response. We extend the same diagnosis one level up. In multi-step GRPO (Shao et al., [2024](https://arxiv.org/html/2606.05597#bib.bib28); Zeng et al., [2025](https://arxiv.org/html/2606.05597#bib.bib7)), the analogous factor is the per-trajectory step-number normalizer $1/|\tau_i|$, which couples per-token gradient scale to trajectory length. What the single-turn analysis does not see is that the multi-step web-agent setting has two structural features which turn Dr. GRPO’s prediction, applied at the step level, into a self-reinforcing failure mode: failed trajectories are dominated by horizon exhaustion rather than clearly wrong actions, and an additive memory prompt against a multi-step visual environment mechanically lengthens per-step responses as step count grows. We trace these features in [Section 5](https://arxiv.org/html/2606.05597#S5) and show that the multi-step counterpart of Dr. GRPO’s fix breaks the loop at its source.
## 3 AsyncWebRL
### 3.1 System
Figure 1: Multi-step Asynchronous Management. Compared to WebGym, AsyncWebRL eliminates the inter-iteration bubble time caused by reconstructing the rollout pool at every iteration boundary and waiting for the policy refresh of $\pi_t$ to complete. Colored blocks denote concurrent rollout workers producing trajectories, gradient updates on $\pi_t$, and policy refreshes that broadcast new weights to the rollout workers. White gaps under synchronous RL (top) are bubble time. AsyncWebRL (bottom) eliminates these gaps by maintaining an everlasting rollout pool so that rollout, gradient update, and policy refresh overlap continuously.

Our goals are to keep GPUs from waiting on each other, and to keep the shared inter-worker store from saturating under per-step image traffic. On top of the synchronous multi-step rollout pool of WebGym (Bai et al., [2026](https://arxiv.org/html/2606.05597#bib.bib27)), we make two changes: *fully asynchronous execution with an everlasting rollout pool*, and *lightweight screenshot handling*. The first keeps rollout workers continuously alive across iteration boundaries and overlaps rollout, gradient update, and policy refresh end to end: when an episode ends, the next begins immediately on the same worker without waiting for the rest of the batch or for the next iteration. The parameter update of $\pi_\theta$ can happen at any time while rollout continues; new weights are broadcast in place to the inference workers and the next rollout segment is sampled under the updated policy ([Figure 1](https://arxiv.org/html/2606.05597#S3.F1)). The second ([Appendix B](https://arxiv.org/html/2606.05597#A2)) keeps per-step image tensors out of the shared store and routes only lightweight references between rollout workers and the trainer, which prevents the disk-spill path that hundreds of concurrent high-resolution rollouts otherwise induce. Together these eliminate the GPU idle time spent waiting for the slowest trajectory in a batch and the warm-up cost paid on hundreds of browser sessions when the rollout pool is rebuilt each round.
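The two system changes can be illustrated with a toy scheduler. The sketch below is not the AsyncWebRL implementation: it uses Python's `asyncio` in a single process, and every name in it (`rollout_worker`, `trainer`, `screenshot_store`, the reference-string format) is a hypothetical stand-in. It only shows the control flow described above: workers start the next episode as soon as one ends, the trainer bumps the policy version without pausing collection, and only lightweight screenshot references cross the shared queue.

```python
import asyncio
import random


async def rollout_worker(worker_id, queue, policy, screenshot_store, stop):
    """Run episodes back to back; never wait for an iteration boundary."""
    episode = 0
    while not stop.is_set():
        steps = []
        for step in range(random.randint(1, 4)):  # toy episode length
            await asyncio.sleep(0)  # stand-in for one browser action
            ref = f"w{worker_id}-e{episode}-s{step}"
            screenshot_store[ref] = b"<png bytes>"  # image stays out of the queue
            steps.append({"screenshot_ref": ref,
                          "policy_version": policy["version"]})
        await queue.put(steps)  # only lightweight references cross the queue
        episode += 1


async def trainer(queue, policy, batch_size, num_updates):
    """Consume trajectories as they arrive and refresh the policy in place."""
    versions_per_batch = []
    for _ in range(num_updates):
        batch = [await queue.get() for _ in range(batch_size)]
        versions_per_batch.append(
            {step["policy_version"] for traj in batch for step in traj})
        policy["version"] += 1  # workers see the new version on their next step
    return versions_per_batch


async def main(num_workers=4, batch_size=4, num_updates=3):
    queue = asyncio.Queue(maxsize=16)
    policy = {"version": 0}
    store = {}
    stop = asyncio.Event()
    workers = [
        asyncio.create_task(rollout_worker(i, queue, policy, store, stop))
        for i in range(num_workers)
    ]
    versions = await trainer(queue, policy, batch_size, num_updates)
    stop.set()
    for task in workers:
        task.cancel()  # workers may be blocked on a full queue
    await asyncio.gather(*workers, return_exceptions=True)
    return policy["version"], versions, store


final_version, versions_per_batch, store = asyncio.run(main())
```

In this toy, later batches can contain steps sampled under more than one policy version, which is the off-policy gap that the correction in Section 3.2 addresses.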
### 3.2 Algorithm
$$
\mathcal{J}(\theta)=\mathbb{E}_{\tau\sim\pi_{\mathrm{behave}}}\!\left[\frac{1}{G\cdot{\color{red}k}}\sum_{i=1}^{G}\sum_{j=1}^{|\tau_i|}\sum_{t=1}^{|\tau_{i,j}|}\min\!\left(\frac{\pi_\theta}{\pi_{\mathrm{behave}}}\hat{A}_i,\;{\color{red}\frac{\pi_{\mathrm{prox}}}{\pi_{\mathrm{behave}}}}\,\mathrm{clip}\!\left({\color{red}\frac{\pi_\theta}{\pi_{\mathrm{prox}}}},\,1-\epsilon,\,1+\epsilon\right)\hat{A}_i\right)\right]. \tag{1}
$$
Decoupled off-policy correction. Asynchronous execution shifts the sampling distribution: the policy that generated a given token ($\pi_{\mathrm{behave}}$) may be several updates behind the policy $\pi_\theta$ the trainer is now updating, and a single trajectory in a training batch can be stitched from different policy snapshots. The natural correction is importance sampling: weight every token by $\pi_\theta/\pi_{\mathrm{behave}}$ for an unbiased off-policy gradient, and clip this ratio to $[1-\epsilon,1+\epsilon]$ as in PPO. Under async RL the same ratio must capture two quantities at once: how much the policy has moved since the rollout was sampled (*rollout staleness*), and how much the optimizer has moved the policy during the current gradient update. Clipping a single coupled ratio confounds the two, so rollout staleness alone triggers clip events at many token positions and slows training substantially. We adopt the decoupled-PPO factorization of Hilton et al. ([2021](https://arxiv.org/html/2606.05597#bib.bib29)) (also used by AReaL (Fu et al., [2025](https://arxiv.org/html/2606.05597#bib.bib20)) for text RL) and split the ratio into a rollout-staleness factor $\pi_{\mathrm{prox}}/\pi_{\mathrm{behave}}$ and a current-update factor $\pi_\theta/\pi_{\mathrm{prox}}$, where $\pi_{\mathrm{prox}}$ is the policy snapshot at the start of the current update. The PPO clip is then centered around $\pi_{\mathrm{prox}}$ only, so that clip events reflect how much the optimizer has moved the policy during this update rather than staleness inherited from the rollout. Empirically this roughly halves the clip rate relative to the coupled formulation ([Section 4.3](https://arxiv.org/html/2606.05597#S4.SS3)), and enters the final loss as the inner $\min$ of [Equation 1](https://arxiv.org/html/2606.05597#S3.E1) (a negative-advantage dual-clip extension is given in [Equation 2](https://arxiv.org/html/2606.05597#A2.E2)).
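To make the difference between the coupled and decoupled clips concrete, the following sketch evaluates both surrogates for a single token. It is a minimal illustration under stated assumptions, not the training code: the function names are hypothetical, the clip range $\epsilon = 0.2$ is an assumed value, the probabilities are made up, and the dual-clip extension is omitted. A token is flagged as a clip event when its clipped ratio falls outside $[1-\epsilon, 1+\epsilon]$.

```python
import math


def coupled_term(logp_theta, logp_behave, adv, eps=0.2):
    """PPO surrogate with the clip centered on the behavior policy."""
    ratio = math.exp(logp_theta - logp_behave)
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    outside_range = ratio != clipped
    return min(ratio * adv, clipped * adv), outside_range


def decoupled_term(logp_theta, logp_prox, logp_behave, adv, eps=0.2):
    """Decoupled surrogate: staleness is an unclipped weight, only the
    current-update ratio pi_theta / pi_prox is clipped."""
    staleness = math.exp(logp_prox - logp_behave)  # pi_prox / pi_behave
    update = math.exp(logp_theta - logp_prox)      # pi_theta / pi_prox
    clipped = min(max(update, 1 - eps), 1 + eps)
    outside_range = update != clipped
    unclipped_obj = staleness * update * adv       # equals pi_theta / pi_behave * adv
    clipped_obj = staleness * clipped * adv
    return min(unclipped_obj, clipped_obj), outside_range


# Illustrative token: the rollout is stale (0.10 -> 0.15), but the current
# update has barely moved the policy (0.15 -> 0.156).
lp_behave, lp_prox, lp_theta = math.log(0.10), math.log(0.15), math.log(0.156)

coupled_value, coupled_hit = coupled_term(lp_theta, lp_behave, adv=1.0)
decoupled_value, decoupled_hit = decoupled_term(lp_theta, lp_prox, lp_behave, adv=1.0)
```

For this token the coupled ratio is about 1.56, so staleness alone pushes it outside the clip range, while the decoupled current-update ratio is about 1.04 and stays inside it.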
Removing trajectory-length normalization. Separately from the off-policy correction, we change the loss aggregation. The standard multi-step GRPO (Shao et al., [2024](https://arxiv.org/html/2606.05597#bib.bib28)) loss normalizes each rollout by its own step count $|\tau_i|$, which under our setting introduces a length-coupled gradient attenuation we trace in [Section 5](https://arxiv.org/html/2606.05597#S5). The fix is one line: replace $1/|\tau_i|$ with a constant $1/k$, where $k$ is the Easy-difficulty horizon ($10$ throughout this paper). That is, instead of every rollout entering the loss with total weight $1$ regardless of length, each rollout enters with weight $|\tau_i|/k$, restoring full per-token gradient weight on the long failures the policy must learn to avoid. This change appears as the constant $1/k$ outer factor in [Equation 1](https://arxiv.org/html/2606.05597#S3.E1).
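The effect of the one-line change on gradient weights can be checked numerically. The sketch below is illustrative only: `token_weight` and `rollout_weight` are hypothetical helpers, the step counts are the average failure and success lengths quoted in this paper (12.5 and 5.1) rather than real trajectories, and the group size of 8 is an assumption inferred from the 128 trajectories in 16 groups used for dynamic sampling.

```python
def token_weight(num_steps, group_size, k=None):
    """Weight applied to each token's surrogate term.

    k=None reproduces the standard 1 / (G * |tau_i|) normalizer;
    passing k uses the constant 1 / (G * k) normalizer instead.
    """
    denom = num_steps if k is None else k
    return 1.0 / (group_size * denom)


def rollout_weight(num_steps, group_size, k=None):
    """Total step-level weight of one rollout, in units of 1 / G."""
    return num_steps * token_weight(num_steps, group_size, k) * group_size


G, K = 8, 10                      # G is an assumed group size; k = 10 as in the text
fail_steps, success_steps = 12.5, 5.1

# Standard normalizer: tokens in long failures are down-weighted.
attenuation = token_weight(success_steps, G) / token_weight(fail_steps, G)

# Constant normalizer: every token gets the same weight, so longer
# rollouts carry proportionally more total weight (|tau_i| / k).
same_weight = token_weight(fail_steps, G, K) == token_weight(success_steps, G, K)
fail_total = rollout_weight(fail_steps, G, K)
success_total = rollout_weight(success_steps, G, K)
```

Here `attenuation` comes out near 2.45, in line with the roughly 2.4x figure quoted in the paper, while under the constant normalizer the failure and success rollouts carry total weights of 1.25 and 0.51 respectively.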
We treat each trajectory as a bandit and share a single trajectory-level advantage $\hat{A}_i=(r_i-\mathrm{mean}(\mathbf{r}))/\mathrm{std}(\mathbf{r})$ across all tokens in trajectory $i$. With $G$ rollouts per task, $|\tau_i|$ steps per rollout, and $|\tau_{i,j}|$ tokens per step, the AsyncWebRL loss combines the two changes above with the standard GRPO surrogate ([Equation 1](https://arxiv.org/html/2606.05597#S3.E1)).

The red-colored terms in [Equation 1](https://arxiv.org/html/2606.05597#S3.E1) mark these two changes: the constant $1/k$ in place of $1/|\tau_i|$, and the decoupled clip centered around $\pi_{\mathrm{prox}}$ with $\pi_{\mathrm{prox}}/\pi_{\mathrm{behave}}$ as an unclipped weight. Following DAPO (Yu et al., [2025](https://arxiv.org/html/2606.05597#bib.bib6)), we also drop the reference-KL term and apply dynamic sampling, skipping all-success or all-failure groups and gathering $128$ mixed trajectories ($16$ groups) before launching training of that step.
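The advantage computation and the dynamic-sampling filter can be sketched as follows. This is a minimal illustration, not the released code: the helper names are hypothetical, and the use of the population standard deviation is an assumption, since the text only writes $\mathrm{std}(\mathbf{r})$.

```python
import statistics


def group_advantages(rewards):
    """Trajectory-level advantage (r_i - mean) / std within one task group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumed: population std
    return [(r - mean) / std for r in rewards]


def is_mixed(rewards):
    """True when a group of binary rewards has both successes and failures."""
    return 0 < sum(rewards) < len(rewards)


def collect_training_groups(groups, target_groups=16):
    """Dynamic sampling: keep only mixed groups until the target is reached."""
    kept = []
    for rewards in groups:
        if is_mixed(rewards):
            kept.append(group_advantages(rewards))
        if len(kept) == target_groups:
            break
    return kept


# Toy stream of groups with 8 rollouts each (binary rewards).
groups = [
    [1] * 8,                    # all success: skipped, advantage would be 0/0
    [0] * 8,                    # all failure: skipped
    [1, 0, 0, 0, 1, 0, 0, 0],   # mixed: kept
]
kept = collect_training_groups(groups)
```

Skipping uniform groups also avoids the division by zero that the normalized advantage would otherwise hit when every reward in a group is identical.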
RAFT++ as a baseline. As a contrasting off-policy baseline we also consider RAFT++ (Xiong et al., [2025](https://arxiv.org/html/2606.05597#bib.bib8)), which can be viewed as vanilla multi-step GRPO with the same per-trajectory normalizer $1/|\tau_i|$, but with group normalization disabled and the group-relative advantage replaced by a success filter ($r>0$). Only successful trajectories contribute gradient, so RAFT++ effectively performs behavior cloning on a rolling buffer of positives and provides no contrastive signal on below-average trajectories. Decoupled importance sampling is still applied to keep the off-policy gradient unbiased.
## 4 Experiments
### 4.1 Setup
Environment. We train and test under the protocol from WebGym (Bai et al., [2026](https://arxiv.org/html/2606.05597#bib.bib27)). We picked WebGym because it is the largest open multi-step visual web-agent training environment to date: roughly 290k training tasks across 128k real-world websites in three difficulty levels (Easy, Medium, Hard), evaluated on a 1,167-task OOD test split whose websites do not appear in training. The scale of this test split is what makes single-seed comparisons informative here: WebGym’s own analysis reports that two runs differing in early-training trajectory dynamics nonetheless converge to overlapping final test performance, with small error bars across runs (Bai et al., [2026](https://arxiv.org/html/2606.05597#bib.bib27), Figure 8). We inherit this property and report single-seed results throughout, focusing reporting budget on algorithm-level ablations rather than seed sweeps. Rewards are binary, produced by WebGym’s GPT-4o rubric evaluator. Following WebGym, we use the coordinate-based action space $\{\texttt{click},\texttt{type},\texttt{scroll},\texttt{go\_back},\texttt{navigate},\texttt{ANSWER}\}$ on raw screenshots, with per-difficulty horizons of $(10,20,30)$ steps. The Easy-difficulty horizon $k=10$ is the WebGym-recommended trade-off between truncation and rollout cost, and our constant $1/k$ fix uses $k=10$ throughout. All hyperparameter details are in [Section C.5](https://arxiv.org/html/2606.05597#A3.SS5).
Model Variants. We use two Qwen3-VL-8B variants: Instruct (Qwen3-VL-8B-Instruct) and Thinking (Qwen3-VL-8B-Thinking), both with the additive memory prompt template shown in [Section C.4](https://arxiv.org/html/2606.05597#A3.SS4). At each step the policy sees only the previous screenshot and its previous response. For the Thinking variant the previous step’s thinking tokens are hidden from history, following standard practice (Bai et al., [2026](https://arxiv.org/html/2606.05597#bib.bib27)).
Comparisons. REINFORCE-style on-policy algorithms are a type mismatch for an async framework, which is not on-policy by design. We therefore compare to the closest off-policy baseline, RAFT++ (defined in [Section 3.2](https://arxiv.org/html/2606.05597#S3.SS2)). We run three configurations:
- WebGym: the prior synchronous RL pipeline with the Filtered BC objective of Bai et al. ([2026](https://arxiv.org/html/2606.05597#bib.bib27)), using the published WebGym numbers.
- AsyncWebRL-RAFT++: our async framework with RAFT++ as the loss, the async-compatible substitute for WebGym’s sync REINFORCE.
- AsyncWebRL (full): our async framework with the full method from [Section 3](https://arxiv.org/html/2606.05597#S3), namely multi-step GRPO with decoupled importance sampling and the constant $1/k$ step-number normalizer.
### 4.2 Main Results
Table 1: Peak test set success rate (%) on the WebGym (Bai et al., [2026](https://arxiv.org/html/2606.05597#bib.bib27)) OOD test split. Best per column within each variant in bold.

Figure 2: Test success rate vs. training trajectories collected on the WebGym OOD test split. Solid colored curves are our runs under AsyncWebRL: AsyncWebRL (full) and AsyncWebRL-RAFT++. The gray dashed curve is the prior WebGym sync REINFORCE baseline (values from Bai et al. ([2026](https://arxiv.org/html/2606.05597#bib.bib27))). Top: Instruct. Bottom: Thinking. No WebGym curve is shown on Thinking because its baseline was not trained under our $(10,20,30)$ per-difficulty horizons. AsyncWebRL (full) achieves both higher final test reward and higher sample efficiency per training trajectory collected.

AsyncWebRL (full) sets a new SOTA over WebGym, with the largest relative gains on harder slices. On Instruct, AsyncWebRL (full) reaches $45.4\%$ Avg against WebGym’s $42.9\%$, a $\boldsymbol{+5.8\%}$ relative improvement ([Table 1](https://arxiv.org/html/2606.05597#S4.T1)), but Avg understates where the gain lives: Easy barely moves ($50.9\to 52.4$, $+2.9\%$ relative), Medium widens to $\boldsymbol{+42\%}$ relative ($24.1\to 34.3$), and Hard reaches $\boldsymbol{+48\%}$ relative ($4.8\to 7.1$). The same harder-slice pattern shows up on Thinking in [Figure 2](https://arxiv.org/html/2606.05597#S4.F2) (bottom row), where WebGym’s reported numbers are not available at our training horizons. Medium and Hard are the slices on which prior REINFORCE-style baselines have the most room to grow, because successful trajectories there are rarer and a behavior-cloning loss on a slowly changing positives buffer cannot push down the dominant failure modes. GRPO’s group-normalized advantage does ([Section A.2](https://arxiv.org/html/2606.05597#A1.SS2), [Section A.1](https://arxiv.org/html/2606.05597#A1.SS1)).
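Because the gains above are stated as relative improvements over success rates of very different sizes, it can help to recompute them from the quoted before and after values. The snippet below does only that arithmetic on the Instruct numbers from the text; `relative_gain` is a hypothetical helper name.

```python
def relative_gain(old, new):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# (WebGym, AsyncWebRL full) success rates in percent, as quoted in the text.
instruct = {
    "Avg": (42.9, 45.4),
    "Easy": (50.9, 52.4),
    "Medium": (24.1, 34.3),
    "Hard": (4.8, 7.1),
}

gains = {name: relative_gain(old, new) for name, (old, new) in instruct.items()}
absolute = {name: new - old for name, (old, new) in instruct.items()}
```

Note that the large relative gain on Hard corresponds to an absolute change of 2.3 points, whereas Medium moves by 10.2 points.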
Figure 3: Left: End-to-end training trajectory throughput. Right: Off-policyness during GRPO training for the Instruct (blue) and Thinking (red) Qwen3-VL-8B runs: mean and max of the per-token off-policy gap $g$ across a training batch.

Reproducing the WebGym sync REINFORCE pipeline under async. Before comparing algorithms, we use AsyncWebRL-RAFT++ as a like-for-like check against the published WebGym sync REINFORCE numbers (same loss family, async framework substituted for sync). AsyncWebRL-RAFT++ on Instruct reaches 39.3%, against 42.9% for the prior WebGym pipeline ([Table 1](https://arxiv.org/html/2606.05597#S4.T1)). The $3.6$-point gap is consistent with the importance-sampling overhead any async framework has to pay, and the two methods land in the same neighborhood on the same loss family, with the gain reported above being net of this cost. Wall-clock cost moves the other way: AsyncWebRL produces about 3,100 training trajectories per hour on both variants against approximately 1,300 and 1,050 trajectories per hour for the synchronous WebGym baseline, a $\boldsymbol{2.4}$ to $\boldsymbol{2.9\times}$ end-to-end speedup ([Figure 3](https://arxiv.org/html/2606.05597#S4.F3), left). The speedup also incorporates smaller engineering improvements over WebGym, including the distributed inference server and in-memory weight broadcast. Since WebGym (Bai et al., [2026](https://arxiv.org/html/2606.05597#bib.bib27)) itself reports the fastest open-source rollout system for multi-step visual web agents to date, the $2.4$ to $2.9\times$ end-to-end speedup on top of it makes AsyncWebRL, to our knowledge, the fastest reported open multi-step RL framework for visual web agents. We qualify with “reported” because among open multi-step VLM RL frameworks in this category, only WebGym publishes throughput numbers.
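As a quick consistency check on the speedup range, dividing the quoted throughputs gives the ratios below. The inputs are the rounded figures from the text (about 3,100 versus approximately 1,300 and 1,050 trajectories per hour), so the last digit of each ratio is only approximate.

$$
\frac{3100}{1300}\approx 2.38,\qquad \frac{3100}{1050}\approx 2.95
$$

Both ratios are consistent with the reported $2.4$ to $2.9\times$ range once the rounding of the throughput figures is taken into account.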
Off-policyness stays small. A natural worry under fully async RL is that the off-policy gap might be large. AReaL (Fu et al., [2025](https://arxiv.org/html/2606.05597#bib.bib20)), the closest async text-reasoning analog, reports running with a max staleness of $\eta=4$ for coding and $\eta=8$ for math. In our setting, with the max staleness set to $\eta=2$, the mean per-token off-policy gap $g$ stays near 1.5 and the max near 2.0 throughout training ([Figure 3](https://arxiv.org/html/2606.05597#S4.F3), right two panels), well below the cap, and GPUs stay busy throughout. Off-policyness is naturally smaller than in coding and math because web-agent responses are much shorter (Shen et al., [2025](https://arxiv.org/html/2606.05597#bib.bib1)), giving a higher rollout/train speed ratio and leaving few stale trajectories in the training batch.
### 4.3 Efficiency
Figure 4: Effect of the $1/|\tau_i|$ normalizer on GRPO training dynamics. Rows are the two Qwen3-VL-8B variants (top: Instruct, bottom: Thinking); columns are, from left to right, test reward, #steps per trajectory, per-token entropy, and #tokens per step. Test reward is essentially tied between the two losses, but the $1/|\tau_i|$ run produces longer trajectories, longer per-step responses, and lower per-token entropy.

Removing $1/|\tau_i|$ preserves performance and shortens trajectories. The constant $1/k$ replacement does not change what the policy can solve: test reward is essentially tied throughout training on both variants ([Figure 4](https://arxiv.org/html/2606.05597#S4.F4), first column). What changes is trajectory and per-step response length. Under the standard $1/|\tau_i|$ normalizer the average trajectory is consistently longer, and the per-step response grows to around $240$ tokens; the constant $1/k$ run uses fewer steps per trajectory and fewer tokens per step at matched test reward. Measured within each run (same hardware throughout), the trajectory contraction translates into a per-step gradient-update time reduction of $\boldsymbol{11}$–$\boldsymbol{15\%}$ from the first $20$ to the last $20$ training steps under the constant $1/k$ fix, against only $3$–$5\%$ under the standard $1/|\tau_i|$ loss; the corresponding total per-step wall-clock reductions are $\boldsymbol{18}$–$\boldsymbol{19\%}$ vs. $4$–$5\%$ (full table in [Section A.5](https://arxiv.org/html/2606.05597#A1.SS5)). The mechanism behind this, length-coupled memory drift, is analyzed in [Section 5](https://arxiv.org/html/2606.05597#S5).
Figure 5: Coupled vs. decoupled importance sampling under the async RL GRPO loss. From left to right: per-update mean of training reward, fraction of tokens hit by the $\epsilon$-clip, and fraction of tokens hit by the dual clip.

Decoupled importance sampling halves the clip rate. A separate ablation isolates the off-policy correction described in [Section 3.2](https://arxiv.org/html/2606.05597#S3.SS2). We run a coupled-loss baseline that folds rollout staleness and current-update movement into a single $\pi_\theta/\pi_{\mathrm{behave}}$ ratio with PPO-style clipping centered on the coupled ratio. The coupled run sits at roughly $2\times$ the $\epsilon$-clip-trigger rate of the decoupled run at the start of training and improves training reward substantially more slowly ([Figure 5](https://arxiv.org/html/2606.05597#S4.F5)). We therefore keep the decoupled factorization in every AsyncWebRL configuration above.
## 5 Dynamics Analysis
### 5.1 Main Analysis
The $1/|\tau_i|$ factor in multi-step GRPO is the step-level analog of the token-level normalizer Dr. GRPO (Liu et al., [2025b](https://arxiv.org/html/2606.05597#bib.bib4)) identifies as biased: each trajectory enters the loss with total weight $1$ regardless of length, so any token in a long trajectory carries a $1/|\tau_i|$ share. In a setting like ours where failure is dominated by horizon exhaustion rather than clearly wrong actions, this attenuates the penalty on exactly the trajectories the policy needs to learn to avoid. On the base model, only around $30\%$ to $40\%$ of rollouts succeed in the DAPO-filtered training batch, and failures average $12.5$ steps against $5.1$ for successes (a $2.4\times$ gap) when RL starts. The successful population is both rarer and shorter and cannot counterbalance the effect of long failed trajectories, so early in training the policy is pulled toward longer rollouts ([Figure 4](https://arxiv.org/html/2606.05597#S4.F4), second column).
Figure 6: `Memory` at step $4$ of one representative 10/20/30-horizon rollout per checkpoint. $1/|\tau_i|$ accumulates verbose generic keys; the constant $1/k$ fix re-keys `Memory` to task sub-questions.

Figure 7: Trajectory-mean number of `Memory` JSON keys per agent step, split by outcome (Success, Failure) and difficulty (Easy, Medium). $1/|\tau_i|$ tracks the one-new-key-per-step diagonal; the constant $1/k$ fix stays close to Base.

This lengthening of trajectories then propagates into longer per-step responses through the prompt structure. Under the WebGym additive-memory prompt against a multi-step visual environment (Bai et al., [2026](https://arxiv.org/html/2606.05597#bib.bib27)), each step appends to `Memory`, and each new screenshot exposes new information the policy can legitimately log, so `Memory` grows roughly linearly in step count. Under $1/|\tau_i|$ the per-step `Memory` key count tracks the one-new-key-per-step diagonal, while under the constant $1/k$ fix it stays close to the base-model baseline ([Figure 7](https://arxiv.org/html/2606.05597#S5.F7)). Since `Memory` is part of the response, the larger step count drags per-step response length up by $\approx 33\%$ ([Figure 4](https://arxiv.org/html/2606.05597#S4.F4), fourth column), and the entropy drop (third column) follows because the extra response length is mostly low-entropy `Memory` boilerplate (per-token entropy annotations on representative responses in [Section C.3](https://arxiv.org/html/2606.05597#A3.SS3)).
The schema itself also drifts. Under $1/|\tau_i|$, the typical trajectory rewrites `Memory` every step with generic placeholders like `task_1` or `current_step`: $34\%$ of all observed key occurrences are generic slots, only $7\%$ of trajectories preserve their key set end-to-end, and only $36\%$ of step pairs leave the `Memory` key set unchanged. Under the constant $1/k$ fix, the policy declares a small set of task-anchored keys at step $0$ and holds them throughout: $65\%$ of trajectories preserve their key set, $76\%$ of step pairs leave it unchanged, and generic slots drop to $11\%$ ([Table 2](https://arxiv.org/html/2606.05597#S5.T2) and [Figure 6](https://arxiv.org/html/2606.05597#S5.F6); token-level last-step responses in [Sections C.1](https://arxiv.org/html/2606.05597#A3.SS1) and [C.2](https://arxiv.org/html/2606.05597#A3.SS2)). The constant $1/k$ fix restores full per-token weight on long failures, and every downstream symptom above (memory bloat, longer responses, lower entropy, generic schemas) disappears.
Table 2: Schema signatures of `Memory` under GRPO with $1/|\tau_i|$ vs. the constant $1/k$ fix (Instruct, additive prompt, first 30 steps per trajectory). Rows: fraction of trajectories with first-step and last-step key sets equal; fraction of consecutive step pairs with the same key set; fraction of key occurrences matching a generic placeholder regex (e.g. `task`, `current_step`, `search_query`).
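The three row definitions can be made concrete with a short sketch. It assumes each trajectory is available as a non-empty list of per-step `Memory` dicts. The `GENERIC_KEY` pattern is a hypothetical stand-in built from the placeholder examples named in the text (the paper's actual regex is not given here), and the demo data is invented.

```python
import re

# Hypothetical placeholder pattern, built only from the examples named in the text.
GENERIC_KEY = re.compile(r"^(task(_\d+)?|current_step|search_query)$")

def schema_signatures(trajectories):
    """trajectories: list of trajectories, each a non-empty list of Memory dicts."""
    preserved = same_pairs = total_pairs = generic = total_keys = 0
    for steps in trajectories:
        key_sets = [frozenset(memory) for memory in steps]
        preserved += key_sets[0] == key_sets[-1]          # first vs. last key set
        for prev, cur in zip(key_sets, key_sets[1:]):     # consecutive step pairs
            total_pairs += 1
            same_pairs += prev == cur
        for keys in key_sets:                             # every key occurrence
            total_keys += len(keys)
            generic += sum(1 for key in keys if GENERIC_KEY.match(key))
    return {
        "first_last_equal": preserved / len(trajectories),
        "pair_unchanged": same_pairs / max(total_pairs, 1),
        "generic_share": generic / max(total_keys, 1),
    }

demo = [
    [{"price": ""}, {"price": "$395"}, {"price": "$395"}],  # stable, task-anchored keys
    [{"task": "x"}, {"task": "x", "current_step": 1}],      # growing, generic keys
]
signatures = schema_signatures(demo)
print(signatures)
```

On the toy input, half of the trajectories keep their key set end-to-end, two of three step pairs are unchanged, and half of the key occurrences are generic.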
### 5.2 Ablations
We run three ablations to confirm that the diagnosis above identifies the right cause\. The first checks that a closely related algorithm \(RAFT\+\+\) exhibits the same pathology when it shares the same loss factor, ruling out group\-relative advantage as the driver\. The second checks that the cause lives in the loss rather than the prompt, by trying to fix the symptom from the prompt side and observing that it does not work\. The third checks that the effect scales as the mechanism predicts when we make long failures even longer\.
RAFT++ exhibits the same pathology, more mildly. RAFT++ also normalizes per-trajectory loss by $1/|\tau_i|$, and the diagnosis predicts the same memory bloat in kind. [Figure 7](https://arxiv.org/html/2606.05597#S5.F7) (Instruct row) confirms it: the RAFT++ per-step `Memory` key-count curve drifts upward in the same direction as GRPO with $1/|\tau_i|$, closely shadowing the GRPO length-norm curve on Easy successful trajectories and sitting well above the constant-$1/k$ run. The effect is muted on failure trajectories, consistent with RAFT++’s success-conditioned filter ($r>0$) providing no direct gradient on failed rollouts: the same length-coupled attenuation is present in the loss but only propagates through the kept positives. The matched per-step Add/Del/Mod edit-op decomposition ([Section A.3](https://arxiv.org/html/2606.05597#A1.SS3)) tells the same story at a smaller magnitude than full GRPO. RAFT++ and GRPO differ in nearly every other respect, yet both show the same length-coupled drift whenever they share the $1/|\tau_i|$ factor. The factor itself, not the surrounding algorithm, is the driver.
Prompt-level intervention fails. If the loss is what drives the pathology, not the prompt, then intervening at the prompt level should not help. Swapping the additive-memory prompt for a compressive one that explicitly instructs the model to compress `Memory` at every step, while keeping $1/|\tau_i|$ in the loss, leaves per-step Add/Del/Mod rates elevated relative to the Base run ([Table 3](https://arxiv.org/html/2606.05597#A1.T3), “GRPO, compressive”). The prompt change does not reach the cause.
Scaled horizons amplify the effect. The mechanism predicts that anything making long failures even longer should make the symptoms worse. Scaling per-difficulty horizons from 10/20/30 to 20/40/60 while keeping $1/|\tau_i|$ does exactly that, and we see the predicted amplification in [Figure 7](https://arxiv.org/html/2606.05597#S5.F7): the *GRPO (length norm, 20/40/60)* curve tracks the one-new-key-per-step diagonal across both Easy and Medium panels, exceeding the default 10/20/30 length-norm run.
## 6 Conclusion
We presented AsyncWebRL, a fully asynchronous multi-step RL framework for visual web agents. An everlasting rollout pool and lightweight screenshot handling give a $2.4$–$2.9\times$ end-to-end speedup over the fastest prior open pipeline; replacing $1/|\tau_i|$ with a constant $1/k$ then contracts trajectories at matched reward for a further $1.80\times$ per-step speedup. A decoupled importance-sampling correction sets a new open SOTA on WebGym’s OOD split ($45.4\%$ vs. $42.9\%$).
## Acknowledgements
We thank Nan Jiang, Anikait Singh, and Aravind Rajeswaran for their useful discussions on algorithm designs and presentation\. This work is partially supported by NSF under Grant No\. 2416897, Grant No\. 2505932, and by ONR under Grant No\. N000142512318\. This research used both Delta \(NSF award OAC 2005572\) and DeltaAI \(NSF award OAC 2320345\) advanced computing systems, and computing resources provided by Illinois Computes and NAIRR Pilot NAIRR250157\.
## References
- WebGym: scaling training environments for visual web agents with realistic tasks. arXiv preprint arXiv:2601.02439.
- H. Bai, Y. Zhou, M. Cemri, J. Pan, A. Suhr, S. Levine, and A. Kumar (2024) Digirl: training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems 37, pp. 12461–12495.
- H. Bai, Y. Zhou, L. E. Li, S. Levine, and A. Kumar (2025) Digi-q: learning q-value functions for training device-control agents. arXiv preprint arXiv:2502.15760.
- L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. (2018) Impala: scalable distributed deep-rl with importance weighted actor-learner architectures. In International conference on machine learning, pp. 1407–1416.
- W. Fu, J. Gao, X. Shen, C. Zhu, Z. Mei, C. He, S. Xu, G. Wei, J. Mei, J. Wang, T. Yang, B. Yuan, and Y. Wu (2025) AReaL: a large-scale asynchronous reinforcement learning system for language reasoning. arXiv:2505.24298.
- Z. Han, A. You, H. Wang, K. Luo, G. Yang, W. Shi, M. Chen, S. Zhang, Z. Lan, C. Deng, et al. (2025) AsyncFlow: an asynchronous streaming rl framework for efficient llm post-training. arXiv preprint arXiv:2507.01663.
- J. Hilton, K. Cobbe, and J. Schulman (2021) Batch size-invariance for policy optimization. arXiv:2110.00641.
- D. Kim, S. Park, H. Jang, J. Shin, J. Kim, and Y. Seo (2025) Robot-r1: reinforcement learning for enhanced embodied reasoning in robotics. arXiv preprint arXiv:2506.00070.
- J. Liu, F. Gao, B. Wei, X. Chen, Q. Liao, Y. Wu, C. Yu, and Y. Wang (2025a) What can rl bring to vla generalization? an empirical study. arXiv preprint arXiv:2505.19789.
- Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin (2025b) Understanding r1-zero-like training: a critical perspective. arXiv:2503.20783.
- H. Lu, Z. Liu, S. Xiong, Y. He, W. Gao, Y. Wu, W. Wang, J. Liu, Y. Li, H. Zhao, et al. (2025) Part ii: roll flash–accelerating rlvr and agentic training with asynchrony. arXiv preprint arXiv:2510.11345.
- V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937.
- M. Noukhovitch, S. Huang, S. Xhonneux, A. Hosseini, R. Agarwal, and A. Courville (2024) Asynchronous rlhf: faster and more efficient off-policy rl for language models. arXiv preprint arXiv:2410.18252.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- J. Shen, H. Bai, L. Zhang, Y. Zhou, A. Setlur, S. Tong, D. Caples, N. Jiang, T. Zhang, A. Talwalkar, et al. (2025) Thinking vs. doing: agents that reason by scaling test-time interaction. arXiv preprint arXiv:2506.07976.
- G. Sheng, Y. Tong, B. Wan, W. Zhang, C. Jia, X. Wu, Y. Wu, X. Li, C. Zhang, Y. Peng, et al. (2025) Laminar: a scalable asynchronous rl post-training framework. arXiv preprint arXiv:2510.12633.
- G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2024) HybridFlow: a flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256.
- H. Wang, H. Zou, H. Song, J. Feng, J. Fang, J. Lu, L. Liu, Q. Luo, S. Liang, S. Huang, et al. (2025) Ui-tars-2 technical report: advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544.
- Z. Wei, W. Yao, Y. Liu, W. Zhang, Q. Lu, L. Qiu, C. Yu, P. Xu, C. Zhang, B. Yin, et al. (2025) Webagent-r1: training web agents via end-to-end multi-turn reinforcement learning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 7920–7939.
- B. Wu, S. Wang, Y. Tang, J. Ding, E. Helenowski, L. Tan, T. Xu, T. Gowda, Z. Chen, C. Zhu, et al. (2025) LlamaRL: a distributed asynchronous reinforcement learning framework for efficient large-scale llm training. arXiv preprint arXiv:2505.24034.
- W. Xiong, J. Yao, Y. Xu, B. Pang, L. Wang, D. Sahoo, J. Li, N. Jiang, T. Zhang, C. Xiong, et al. (2025) A minimalist approach to llm reasoning: from rejection sampling to reinforce. arXiv preprint arXiv:2504.11343.
- R. Yang, Q. Wu, Z. Wang, H. Chen, K. Yang, H. Cheng, H. Yao, B. Peng, H. Zhang, J. Gao, et al. (2026) Gui-libra: training native gui agents to reason and act with action-aware supervision and partially verifiable rl. arXiv preprint arXiv:2602.22190.
- D. Ye, Z. Liu, M. Sun, B. Shi, P. Zhao, H. Wu, H. Yu, S. Yang, X. Wu, Q. Guo, et al. (2020) Mastering complex control in moba games with deep reinforcement learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34, pp. 6672–6679.
- Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. (2025) Dapo: an open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
- S. Zeng, Q. Wei, W. Brown, O. Frunza, Y. Nevmyvaka, Y. K. Zhao, and M. Hong (2025) Reinforcing multi-turn reasoning in llm agents via turn-level credit assignment. In ICML 2025 Workshop on Computer Use Agents.
- Y. Zhai, H. Bai, Z. Lin, J. Pan, S. Tong, Y. Zhou, A. Suhr, S. Xie, Y. LeCun, Y. Ma, et al. (2024) Fine-tuning large vision-language models as decision-making agents via reinforcement learning. Advances in neural information processing systems 37, pp. 110935–110971.
- Y. Zhong, Z. Zhang, X. Song, H. Hu, C. Jin, B. Wu, N. Chen, Y. Chen, Y. Zhou, C. Wan, et al. (2025) Streamrl: scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation. arXiv preprint arXiv:2504.15930.
- Y. Zhou, J. Li, Y. Su, G. Ramesh, Z. Zhu, X. Long, C. Zhao, J. Pan, X. Yu, Z. Wang, K. Du, J. Wu, X. Sun, J. Liu, Q. Yu, H. Chen, Z. Liu, and E. Barsoum (2025) APRIL: active partial rollouts in reinforcement learning to tame long-tail generation. arXiv preprint arXiv:2509.18521.
## Appendix A Additional Analyses
This appendix collects supporting analyses for the main-text results. [Section A.1](https://arxiv.org/html/2606.05597#A1.SS1) reports a learning-rate ablation on RAFT++. [Section A.2](https://arxiv.org/html/2606.05597#A1.SS2) unpacks the GRPO-vs-RAFT++ gap with entropy and clip-rate diagnostics. [Section A.3](https://arxiv.org/html/2606.05597#A1.SS3) provides the per-step `Memory` Add/Del/Mod table referenced by the [Section 5](https://arxiv.org/html/2606.05597#S5) ablations. [Section A.4](https://arxiv.org/html/2606.05597#A1.SS4) checks batch-size sensitivity.
### A.1 RAFT++ Learning Rate
Figure 8: Learning-rate ablation on RAFT++ (Qwen3-VL-8B-Instruct, $b{=}120$, off-policy $=2$, no KL). From left to right: held-out test success rate (the larger-LR run wins by a wide margin at the peak); training reward, where the trend reverses; pre-clipping gradient L2 norm (consistently bounded for the larger LR); and per-token policy entropy, which decays faster under the larger LR. The train/test flip suggests that low-LR RAFT++ overfits the filtered positives buffer, while noisier optimization generalizes better.

Two RAFT++ runs on Qwen3-VL-8B-Instruct that share dataset, batch size ($b{=}120$), and every other hyperparameter except the learning rate ($5\times 10^{-6}$ vs. $1\times 10^{-5}$) move in *opposite* directions on training and held-out reward ([Figure 8](https://arxiv.org/html/2606.05597#A1.F8)): the smaller-LR run reaches a higher peak training reward but a markedly lower peak test success rate; the larger-LR run gives up some training fit and gains substantially on the test set. We do not have a controlled experiment to attribute the flip causally, but the pattern is consistent with the RAFT++ objective behaving like behavior cloning on a slowly changing positives buffer, where many small gradient steps can overfit the kept successful trajectories. This observation motivates our choice of the higher learning rate ($1\times 10^{-5}$) for the Thinking RAFT++ baseline reported in [Table 5](https://arxiv.org/html/2606.05597#A3.T5).
### A.2 Why GRPO Outperforms RAFT++
Figure 9: GRPO vs. RAFT++ on Qwen3-VL-8B-Instruct. Left: training reward per optimizer step. Middle: per-token policy entropy averaged over the loss mask; the step-zero offset is induced by the conditional vs. unconditional denominator (see text). Right: fraction of tokens hitting the $\epsilon$-clip; both runs sit near 0.3%, ruling out clip mechanics as the driver of the entropy gap.

RAFT++ falls short of GRPO on both variants. We attribute this primarily to RAFT++ providing no contrastive signal on below-average rollouts: its $r>0$ filter only pulls mass *onto* successful trajectories, while GRPO’s group-normalized advantage simultaneously pushes mass *down* on below-average trajectories within the same group. The entropy gap on Instruct in [Figure 9](https://arxiv.org/html/2606.05597#A1.F9) is consistent with this picture: RAFT++’s entropy collapses from $0.057$ to $0.010$ while GRPO’s settles near $0.025$, and $\epsilon$-clip rates are essentially identical ($\approx 0.3\%$ in both runs), so the gap is not driven by clipping mechanics. We do not run a controlled ablation that isolates the contrastive signal from other differences between the two losses.
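The contrast in learning signal can be illustrated on a single rollout group. The sketch below assumes the usual mean/std group normalization for GRPO and reduces RAFT++ to its $r>0$ filter; both functions are simplified stand-ins with hypothetical names, and the rewards are invented.

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center by the group mean, scale by the group std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def raft_weights(rewards):
    """Success-conditioned filter: keep positives, drop everything else."""
    return [1.0 if r > 0 else 0.0 for r in rewards]

group_rewards = [1.0, 0.0, 0.0, 1.0]  # illustrative binary outcomes for one task
advantages = grpo_advantages(group_rewards)
weights = raft_weights(group_rewards)
print([round(a, 2) for a in advantages], weights)
```

Failed rollouts receive a negative advantage under the group-relative scheme but a zero weight under the filter, so only the former pushes probability mass away from them.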
### A.3 Memory Edit-Op Rates
Table 3: Per-step edit operations on `Memory` (instruct prompt, first 30 steps), split by trajectory outcome. Add / Del / Mod count the inserted / removed / value-changed keys per consecutive step pair.

[Table 3](https://arxiv.org/html/2606.05597#A1.T3) reports per-step Add/Del/Mod edit-op rates on the agent’s `Memory` field, split by trajectory outcome, for the algorithm-prompt-loss combinations studied in [Section 5](https://arxiv.org/html/2606.05597#S5). The matched per-step `Memory` key-count curves are plotted in [Figure 7](https://arxiv.org/html/2606.05597#S5.F7); this table reports the operation-level decomposition and supports four qualitative observations. First, RAFT++ edits `Memory` more aggressively than Base, consistent with the shared $1/|\tau_i|$ factor pushing the policy toward longer memory schemas. Second, GRPO with $1/|\tau_i|$ edits even more aggressively than RAFT++, because the group-relative advantage adds direct negative gradient on below-average trajectories that RAFT++’s success-conditioned filter does not provide. Third, the compressive prompt under $1/|\tau_i|$ brings rates down from full additive GRPO but stays above Base, indicating that prompt-level intervention only mitigates the loss-driven effect rather than removing it. Finally, replacing $1/|\tau_i|$ with the constant $1/k$, even with the additive memory prompt, brings the rates back to roughly Base levels (and below on Success), confirming that the loss aggregation is the root cause analyzed in [Section 5](https://arxiv.org/html/2606.05597#S5).
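The Add/Del/Mod counts described above can be sketched as simple dictionary diffs. This assumes each step's `Memory` is a flat dict; the function names and the toy trajectory are hypothetical.

```python
def edit_ops(prev, cur):
    """Count keys added, deleted, and value-modified between two Memory dicts."""
    added = len(cur.keys() - prev.keys())
    deleted = len(prev.keys() - cur.keys())
    modified = sum(1 for key in prev.keys() & cur.keys() if prev[key] != cur[key])
    return added, deleted, modified

def mean_edit_rates(steps):
    """Average (Add, Del, Mod) per consecutive step pair over one trajectory."""
    pairs = list(zip(steps, steps[1:]))
    if not pairs:
        return (0.0, 0.0, 0.0)
    totals = [sum(column) for column in zip(*(edit_ops(p, c) for p, c in pairs))]
    return tuple(total / len(pairs) for total in totals)

toy_trajectory = [
    {"a": 1},
    {"a": 2, "b": 1},  # adds "b", modifies "a"
    {"b": 1},          # deletes "a"
]
rates = mean_edit_rates(toy_trajectory)
print(rates)  # prints: (0.5, 0.5, 0.5)
```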
### A.4 Batch Size Sensitivity
Figure 10: Effect of consumer batch size on GRPO Instruct training, plotted against wall-clock hours: batch=128 (the canonical setting used throughout the paper) versus batch=32, all other hyperparameters held fixed. From left: test reward, training reward, policy entropy, average steps per trajectory, and average response tokens per step.

We chose a consumer batch size of $128$ for the canonical Instruct GRPO run ([Section C.5](https://arxiv.org/html/2606.05597#A3.SS5)). To check whether our conclusions depend on this choice, we re-ran the same configuration with batch size $32$, holding learning rate, group size, off-policy lag cap, and all other hyperparameters fixed ([Figure 10](https://arxiv.org/html/2606.05597#A1.F10)). The small-batch run improves faster early, reaching $\approx 0.38$ test reward several hours ahead of batch-128, since a four-times-higher optimizer-update rate pays off when the gradient signal is large. The two curves then cross at around 10 hours (at $\approx 0.40$), so over the full training horizon the choice is essentially neutral on test reward. The early gain comes with visibly higher per-iteration variance: batch-32 training reward, entropy, and steps per trajectory all swing within a wider envelope than batch-128 at every wall-clock time. Crucially, the qualitative behaviors analyzed in [Section 5](https://arxiv.org/html/2606.05597#S5) (entropy collapse, falling steps per trajectory, rising tokens per step) still reproduce at batch size $32$, so the length-coupled memory drift mechanism is a property of the loss shape rather than of the batch-size choice. The canonical batch size of $128$ is therefore a sensible default, but the main qualitative claims of the paper do not depend on it.
### A.5 Within-Run Per-Step Time Contraction
To quantify per-step compute savings from the constant $1/k$ fix without cross-run hardware confounds, we report *within-run* ratios: the average time per training step over the last $20$ steps divided by the average over the first $20$ ([Table 4](https://arxiv.org/html/2606.05597#A1.T4)). Each run executes on the same hardware throughout, so any change reflects the policy contracting (or expanding) its trajectories over training. We report two measurements. The first is the gradient-update time only (forward, backward, optimizer), which the trainer records directly for every training step. The second is the total time elapsed per training step, also including rollout collection, evaluation, and checkpointing. For the total time, we skip a small number of initial steps whose recorded timing reflects setup or resume overhead rather than steady-state training; for one of the Thinking $1/k$ fragments this means the first window starts a few dozen training steps in.
Table 4: Within-run ratio of last-20 to first-20 per-step time. Lower is more contraction.

Both measurements agree. Under the standard $1/|\tau_i|$ loss, per-step time is essentially flat over the run ($0.95$–$0.97$): the policy’s trajectory-length distribution drifts slowly enough that per-step cost stays roughly constant. Under the constant $1/k$ fix, per-step time contracts noticeably within a single run ($0.81$–$0.89$, an $11$–$19\%$ reduction from start to end of training). This is the within-run signature of the trajectory contraction analyzed in [Section 5](https://arxiv.org/html/2606.05597#S5): as training proceeds, the $1/k$ policy emits shorter trajectories and fewer tokens per step, so each gradient update consumes less compute.
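The within-run ratio is straightforward to compute from a per-step timing log. The sketch below assumes the log is a plain list of seconds per training step. The `skip` parameter stands in for the setup or resume steps the text says are excluded, and requiring two non-overlapping windows is an added assumption. The timing data is synthetic.

```python
def within_run_ratio(step_times, window=20, skip=0):
    """Mean of the last `window` per-step times over the mean of the first `window`."""
    usable = step_times[skip:]  # drop initial steps dominated by setup overhead
    if len(usable) < 2 * window:
        raise ValueError("need at least 2 * window usable steps")
    first = sum(usable[:window]) / window
    last = sum(usable[-window:]) / window
    return last / first

# Synthetic log: per-step time shrinks linearly from 100 s over 100 steps.
synthetic_times = [100.0 - 0.2 * i for i in range(100)]
ratio = within_run_ratio(synthetic_times)
print(round(ratio, 3))
```

A ratio below $1$ indicates that later steps are cheaper than early ones. A flat run would give a value near $1$.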
## Appendix B Implementation Details
This appendix documents the system, loss, prompt, and hyperparameter details needed to reproduce our runs. The “System” subsection below describes the lightweight screenshot handling promised in [Section 3.1](https://arxiv.org/html/2606.05597#S3.SS1). [Section B.2](https://arxiv.org/html/2606.05597#A2.SS2) defines the dual-clip variant used by every AsyncWebRL GRPO run. [Section C.4](https://arxiv.org/html/2606.05597#A3.SS4) provides the full agent prompt template (Instruct and Thinking share it). [Section C.5](https://arxiv.org/html/2606.05597#A3.SS5) lists optimizer, loss, rollout, and hardware-allocation settings for all three main-table runs.
### B.1 System
Figure 11: Lightweight Screenshot Handling. Compared to WebGym, which serializes every screenshot through the shared RPC object store and spills to disk under concurrent rollouts, AsyncWebRL keeps all image tensors in a dedicated in-memory actor and routes only lightweight references through RPC, eliminating the per-step object-store bottleneck.

Lightweight screenshot handling. Unlike WebGym, which embeds raw image tensors directly into each trajectory dictionary and ships the entire bundle through the shared RPC object store, AsyncWebRL stores all screenshots in a dedicated in-memory actor and routes only lightweight references between rollout workers and the trainer, as shown in [Figure 11](https://arxiv.org/html/2606.05597#A2.F11). This redesign is forced by the multi-step web-agent setting: each trajectory carries tens of high-resolution screenshots, and at hundreds of concurrent rollouts the baseline pipeline pushes well over 100 GiB of pixel tensors through the object store per training step. Once the object store exceeds its memory budget, the framework spills to disk and *every* downstream operation (trajectory deserialization, RPC dispatch, and tensor localization) degrades by orders of magnitude. By keeping image tensors out of the RPC payload entirely, AsyncWebRL confines the screenshot footprint to a single bounded actor; the trainer fetches only the slices it needs at gradient-update time and releases them as soon as the optimizer step completes.
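The reference-passing idea can be sketched with a single-process stand-in. This is not the AsyncWebRL implementation: the real system uses a dedicated actor behind RPC, whereas `ScreenshotStore`, its methods, and the byte strings below are hypothetical names and data used only to show the put / fetch / release lifecycle.

```python
import itertools

class ScreenshotStore:
    """Single-process stand-in for a dedicated in-memory screenshot actor."""

    def __init__(self):
        self._images = {}
        self._ids = itertools.count()

    def put(self, image_bytes):
        ref = next(self._ids)
        self._images[ref] = image_bytes
        return ref  # only this small integer travels with the trajectory

    def get(self, ref):
        return self._images[ref]

    def release(self, refs):
        for ref in refs:
            self._images.pop(ref, None)

store = ScreenshotStore()

# Rollout side: the trajectory dict carries references, not pixel data.
trajectory = {"steps": [
    {"action": "click", "screenshot_ref": store.put(b"<pixels for step 0>")},
    {"action": "answer", "screenshot_ref": store.put(b"<pixels for step 1>")},
]}

# Trainer side: fetch only what the update needs, then free it.
refs = [step["screenshot_ref"] for step in trajectory["steps"]]
batch_images = [store.get(ref) for ref in refs]
store.release(refs)
```

The design point is that the payload shipped between workers stays small regardless of screenshot resolution, and the image footprint is bounded by what one store holds at a time.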
### B.2 Optimization
We adopt two algorithm-design choices in our experiments.
Dual-clipped GRPO loss. Let $f(\rho)\hat{A}_{i,t}$ denote the inner $\min$ argument of [Equation 1](https://arxiv.org/html/2606.05597#S3.E1). This advantage-based clipping framework bounds the surrogate objective when $\hat{A}_{i,t}\geq 0$, but fails to do so when $\hat{A}_{i,t}<0$ and $\rho\gg 1+\epsilon$: in this regime the inner $\min$ selects the unclipped term $\rho\hat{A}_{i,t}$, which is unbounded below in $\rho$. A single token assigned a low probability by the rollout policy can therefore drive the per-token contribution arbitrarily negative. Following the dual-clip extension of PPO (Ye et al., [2020](https://arxiv.org/html/2606.05597#bib.bib3)), we cap $f(\rho)$ from below whenever the advantage is negative:
$$
f_{\text{dual}}(\rho)=\begin{cases}f(\rho),&\hat{A}_{i,t}\geq 0,\\[2.0pt] \max\!\left[f(\rho),\;c\cdot\hat{A}_{i,t}\right],&\hat{A}_{i,t}<0,\end{cases}\tag{2}
$$

with constant $c>1$. The dual-clipped GRPO loss is obtained by substituting $f_{\text{dual}}(\rho)$ for $f(\rho)$ in [Equation 1](https://arxiv.org/html/2606.05597#S3.E1), providing an absolute lower bound on the per-token contribution and eliminating the tail risk introduced by rare low-probability tokens with exploded importance sampling ratios.
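A per-token sketch of the dual clip may help. It follows the common dual-clip PPO form and applies the floor $c\cdot\hat{A}$ to the whole surrogate term (ratio times advantage), which is one reading of the equation above. The function name and the values `eps=0.2` and `c=3.0` are illustrative assumptions, not the paper's settings.

```python
def dual_clip_term(ratio, adv, eps=0.2, c=3.0):
    """Per-token surrogate with PPO clipping plus a floor for negative advantages."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    inner = min(ratio * adv, clipped_ratio * adv)  # standard pessimistic min
    if adv < 0:
        # Without this floor, inner = ratio * adv is unbounded below in ratio.
        return max(inner, c * adv)
    return inner

exploded = dual_clip_term(ratio=50.0, adv=-1.0)  # rare token, huge ratio
ordinary = dual_clip_term(ratio=1.1, adv=-1.0)   # near on-policy token
positive = dual_clip_term(ratio=50.0, adv=1.0)   # positive advantage, usual clip
print(exploded, ordinary, positive)  # prints: -3.0 -1.1 1.2
```

With a negative advantage, the token with ratio 50 contributes $-3$ instead of $-50$, while tokens near the trust region are unaffected.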
On not penalizing overlong trajectories specially. A trajectory that exhausts its horizon is marked as a failure. DAPO (Yu et al., [2025](https://arxiv.org/html/2606.05597#bib.bib6)) notes that this loses some potentially useful signal. We accept the loss: given the horizon, a trajectory that exhausted it means some of its actions were suboptimal, at least when another trajectory in the same group did succeed. Actions like `goback` that appear more often in long trajectories are something we want the policy to take only when necessary, not on every long rollout. When every trajectory in the group fails by exhausting the horizon, either those `goback`-style actions were genuinely unhelpful, or the horizon itself is too tight; either way, treating overlong rollouts as failures is the correct default.
## Appendix C Qualitative Examples
This appendix shows sample-level evidence for the length-coupled memory drift mechanism analyzed in [Section 5](https://arxiv.org/html/2606.05597#S5). [Section C.1](https://arxiv.org/html/2606.05597#A3.SS1) and [Section C.2](https://arxiv.org/html/2606.05597#A3.SS2) contrast last-step responses produced by the $1/|\tau_i|$ run against the constant $1/k$ run on the two Qwen3-VL-8B variants, making the generic-slot memory schema versus task-anchored schema split visible at the token level. [Section C.3](https://arxiv.org/html/2606.05597#A3.SS3) overlays the same responses with per-token entropy annotations, which lets the reader see directly that the extra tokens emitted under $1/|\tau_i|$ sit in low-entropy boilerplate regions rather than in decision-bearing positions.
### C.1 Instruct Last-Step Response
These examples illustrate the qualitative difference described in [Section 5](https://arxiv.org/html/2606.05597#S5).
Base · n_steps=10 · reward=1.0 · 178 tokens
Memory: {"Task": "Find out when the next full moon is scheduled according to Space.com’s ’Space Calendar’."} Progress: {"Go to Space.com": "finished", "Navigate to ’Space Calendar’": "finished", "Find next full moon date": "finished"} Intention: The next full moon is scheduled for May 1, 2026, at 1:23 p.m. (EDT). Action: The next full moon is scheduled for May 1, 2026, at 1:23 p.m. (EDT). \<tool_call\> {"name": "computer_use", "arguments": {"action": "answer", "text": "The next full moon is scheduled for May 1, 2026, at 1:23 p.m. (EDT)."}} \</tool_call\>\<|im_end|\>
RAFT++ · n_steps=10 · reward=1.0 · 174 tokens
Memory: {"target_website": "https://www.foodnetwork.com/", "task": "Search for a recipe for a dessert that is suitable for a romantic dinner, such as a chocolate soufflé or a fruit tart.", "search_term": "fruit tart", "results_found": "yes", "recipe_displayed": "Fruit-TartFlag"} Progress: {"Go to foodnetwork.com": "finished", "Search for a romantic dinner dessert recipe": "finished"} Intention: The task is completed as a suitable recipe for a romantic dinner has been found. Action: The task is completed as a suitable recipe for a romantic dinner has been found. \<tool_call\> {"name": "computer_use", "arguments": {"action": "answer", "text": "A suitable recipe for a romantic dinner has been found: Fruit-TartFlag."}} \</tool_call\>\<|im_end|\>
GRPO without length norm · n_steps=10 · reward=1.0 · 150 tokens
Memory: {"accepted_cookies": "finished", "navigated_to_jackets": "finished", "found_leather_jacket_price": "finished"} Progress: {"Go to Diesel.com": "finished", "Accept cookies": "finished", "Navigate to ’Jackets’ category for men": "finished", "Find price of any leather jacket": "finished"} Intention: Complete the task by providing the price of a leather jacket found in the ’Jackets’ category Action: The task is completed. The price of a leather jacket is \$395.00. \<tool_call\> {"name": "computer_use", "arguments": {"action": "answer", "text": "\$395.00"}} \</tool_call\>\<|im_end|\>
GRPO with length norm · n\_steps=10 · reward=1\.0 · 589 tokens
Memory:\{"task":"Searchfornewsarticlesaboutthelatesttechnologicaladvancementsinthefieldofmedicineandsummarizethefindingsofthearticles\.","search\_query":"latesttechnologicaladvancementsinthefieldofmedicine","current\_page":"NIHwebsitearticlepage","search\_results":\["AIteachersandcybertnetics\-whatcouldtheworldlooklikein2050?","’Thereisnotenoughtimeorspaceintheday’"\],"article\_title":"AIteachersandcybertnetics\-whatcouldtheworldlooklikein2050?","popup":"surveypopup","article\_content":"AIteachersandcybertnetics\-whatcouldtheworldlooklikein2050?","subscription\_popup":"active","attempted\_navigation":"https://www\.bbc\.com/news/health\-64810698","error\_page":"500InternalServerError","new\_website":"https://www\.nih\.gov/","search\_bar":"used","search\_results\_count":"2,141results","article\_found":"SLEEP2024,theAssociatedProfessionalSleepSocieties\(APSS\)’s38thAnnualMeeting\|NHLBI,NIH","article\_content\_summary":"TheSLEEP2024meetingwilldiscussthelatestadvancesinsleepscienceandnewdevelopmentsinthediagnosisandtreatmentofsleepdisorders,disseminatecutting\-edgeresearch,andpromotethetranslationofbasicscienceintoclinicalpractice\."\} Progress:\{"GotoBBCNews":"finished","Searchfortechnologicaladvancementsinmedicine":"finished","Reviewsearchresults":"inprogress","Readthearticle":"interruptedbysubscriptionpopupanderrorpage","Summarizethefindings":"notfinished","NavigatetoNIHwebsite":"finished","SearchforarticlesonNIHwebsite":"finished","ReviewNIHsearchresults":"finished","Selectandreadanarticle":"finished","Summarizethefindingsfromtheselectedarticle":"inprogress"\} Intention:Providethefinalanswerbasedonthearticlesummary Action:Thetaskiscompleted\.Thelatesttechnologicaladvancementsinthefieldofmedicine,asdiscussedintheSLEEP2024meeting,includethelatestadvancesinsleepscienceandnewdevelopmentsinthediagnosisandtreatmentofsleepdisorders,disseminatingcutting\-edgeresearch,andpromotingthetranslationofbasicscienceintoclinicalpractice\. 
<tool\_call\> \{"name":"computer\_use","arguments":\{"action":"answer","text":"Thelatesttechnologicaladvancementsinthefieldofmedicine,asdiscussedintheSLEEP2024meeting,includethelatestadvancesinsleepscienceandnewdevelopmentsinthediagnosisandtreatmentofsleepdisorders,disseminatingcutting\-edgeresearch,andpromotingthetranslationofbasicscienceintoclinicalpractice\."\}\} </tool\_call\><\|im\_end\|\>
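The length gap between the two GRPO variants above can be related to how each normalizer weights tokens\. The following is a minimal sketch, not the paper's training code\. It assumes that $|\tau_i|$ counts the tokens in trajectory $i$, and the advantages, lengths, the constant `k = 400`, and the helper names `per_token_weight` and `total_weight` are all hypothetical values and names chosen for illustration\.

```python
from typing import Optional


def per_token_weight(advantage: float, length: int, k: Optional[float] = None) -> float:
    """Weight applied to each token's gradient term in one trajectory.

    k=None  -> per-trajectory normalizer 1/|tau_i| (length normalization)
    k=const -> constant normalizer 1/k, independent of trajectory length
    """
    if length <= 0:
        raise ValueError("trajectory length must be positive")
    denom = float(length) if k is None else k
    return advantage / denom


def total_weight(advantage: float, length: int, k: Optional[float] = None) -> float:
    """Summed weight over all tokens of the trajectory."""
    return length * per_token_weight(advantage, length, k)


# Hypothetical pair of rollouts: a short success and a long failure.
short_success = (+1.0, 150)   # (advantage, token count)
long_failure = (-1.0, 600)

for label, k in [("1/|tau_i|", None), ("1/k, k=400", 400.0)]:
    w_pos = per_token_weight(*short_success, k)
    w_neg = per_token_weight(*long_failure, k)
    print(f"{label:>12}: success token weight {w_pos:+.5f}, "
          f"failure token weight {w_neg:+.5f}, "
          f"failure total {total_weight(*long_failure, k):+.2f}")
```

Under these assumptions, the length\-normalized failure is four times longer than the success, so each of its tokens receives a quarter of the weight in magnitude\. With a constant denominator the per\-token magnitudes match, and the longer failure accumulates a larger total penalty\.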
### C\.2 Thinking Last\-Step Response
Base · n\_steps=10 · reward=1\.0 · 387 tokens
<think\> Memory:WehaveconfirmedthatThinkTokyo\.comisforsaleonHugeDomainsat$3,095\.Thepageshowspurchaseoptionsincluding"Buynow"and"Startpaymentplan,"withadditionaldetailsaboutthepriceandamoney\-backguarantee\.Progress: \-Task:Determinethecurrentstatusofthedomain’thinktokyo\.com’andifitisforsale,identifywhereitislistedforsale\. \-Completed:ConfirmedthedomainisforsaleonHugeDomainswithpriceandpurchaseoptionsidentified\. \-Remaining:Nofurtherstepsneededsincethetaskrequirementsaremet\.Intention:Completethetaskbyprovidingthefinalanswerbasedontheobservedinformation\.Action:Providethefinalanswer\.Giventheinformationgatheredfromthepage,thecurrentstatusofthedomainisclear:itisforsaleonHugeDomains,pricedat$3,095,withspecificpurchaseoptionsavailable\.Noadditionalactionsarenecessarybecausethetask’srequirementshavebeenfullysatisfied\. </think\>Memory:\{"StatusofThinkTokyo\.com":"Forsale","Price":"$3,095","ListingSite":"HugeDomains","PurchaseOptions":\["Buynow","Startpaymentplan"\]\}Progress:\{"Checkdomainstatus":"finished","Identifysaleprice":"finished","Determinelistingsite":"finished","Summarizefindings":"notfinished"\}Intention:Summarizethefindingstoconfirmthedomain’sstatusandsaledetailsAction:Providethefinalanswerbasedontheobservedinformation<tool\_call\> \{"name":"computer\_use","arguments":\{"action":"answer","text":"Thedomain’thinktokyo\.com’iscurrentlyforsaleonHugeDomainsatapriceof$3,095\.Itislistedwithpurchaseoptionsincluding’Buynow’and’Startpaymentplan’\."\}\} </tool\_call\><\|im\_end\|\>
GRPO without length norm · n\_steps=10 · reward=1\.0 · 612 tokens
<think\> Lookingatthecurrentscreenshot,Icanseewe’reontheFoodNetworkwebsiteviewingtheSimpleEggDropSouprecipe\.Thetaskistofindarecipeforapopularsoupdishthatincludesalistofingredientsandcookingtime\.Fromtheprevioussteps,Icanseethat: 1\.WesuccessfullynavigatedtoFoodNetwork 2\.Wesearchedfor"popularsouprecipes" 3\.Weselectedthe"SimpleEggDropSoup"recipe 4\.We’venowscrolleddowntoseetherecipedetailsThecurrentscreenshotshowstherecipepagewith: \-Therecipetitle"NoRecipeRecipe:SimpleEggDropSoup" \-Level:Easy \-Prep:2min \-Cook:10min \-Total:12min \-Ingredientslist\(whichincludeschickenbroth,soysauce,Asianhotsauce,freshginger,andeggs\) \-DirectionsformakingthesoupThetaskrequiresfindingarecipewithalistofingredientsandcookingtime\.Basedonwhat’svisibleinthescreenshot,wehaveboth: \-Thecookingtimeisclearlyshownas"Total:12min"\(withPrep:2minandCook:10min\) \-Theingredientslistisclearlyvisibleinthe"Ingredients:"sectionSincewe’vefoundasouprecipewithboththeingredientslistandcookingtime,thetaskiscomplete\.Therecipemeetsallthecriteriaspecifiedinthetask\.Thenextactionshouldbetoconfirmthatwe’vesuccessfullycompletedthetask\. </think\>Memory:\{"Currentpage":"FoodNetworkwebsite","Task":"Findarecipeforapopularsoupdishthatincludesalistofingredientsandcookingtime","Searchterm":"popularsouprecipes","Results":\["NoRecipeRecipe:SimpleEggDropSoup\(12minutes\)","GoodEatsRoastTurkey"\],"Selectedrecipe":"SimpleEggDropSoup","Cookingtime":"12minutes","Ingredients":\["low\-sodiumchickenbroth","soysauce","Asianhotsauce","freshginger","eggs"\]\}Progress:\{"AccessAllrecipes":"failed","AccessFoodNetwork":"success","Searchforsouprecipes":"completed","Selectedrecipe":"SimpleEggDropSoup","Cookingtimefound":"12minutes","Ingredientslistfound":"yes"\}Intention:Thetaskiscomplete\-we’vefoundasouprecipewithingredientsandcookingtimeAction:Thetaskiscomplete\.We’vefoundapopularsouprecipe\(SimpleEggDropSoup\)withalistofingredientsandcookingtime\(12minutestotal\)\. 
<tool\_call\> \{"name":"computer\_use","arguments":\{"action":"answer","text":"TheSimpleEggDropSouprecipefromFoodNetworkhasacookingtimeof12minutestotal\(2minprep,10mincook\)andincludesthefollowingingredients:low\-sodiumchickenbroth,soysauce,Asianhotsauce,freshginger,andeggs\."\}\} </tool\_call\><\|im\_end\|\>
GRPO with length norm · n\_steps=10 · reward=1\.0 · 911 tokens
<think\> Letmeanalyzethecurrentstateandwhatweneedtodo:1\.ThetaskistofindthemostcommontreespeciesinCentralParkaccordingtomapdata\. 2\.We’vebeentryingtoaccesstheNYCTreeMapwebsite\(https://tree\-map\.nycgovparks\.org\)butencounteredpermissionissues\. 3\.We’veswitchedtousingWikipedia’sCentralParkpageasanalternativesource\. 4\.We’vescrolledthroughtheTreesandplantssectionandfoundsomeinformationabouttreespeciesinCentralPark\.Fromthecurrentscreenshot,Icanseewe’restillontheWikipediapageforCentralPark,specificallyinthe"Treesandplants"section\.Thetextmentions: \-CentralParkhasmorethan20,000trees\(froma2011survey\) \-Morethan170treespecies \-Americanelmsarementionedas"largestremainingstands" \-Otherspeciesmentionedincludesugarmaples,redmaples,oakspecies,whitespruce,tamarack,whiteoak,blackcherryThekeyinformationweneedisthemostcommontreespecies\.Thetextmentions"Americanelmsarethelargestremainingstands"whichsuggeststheymightbethemostcommonspecies,butit’snotexplicitlystatedas"mostcommon"\-just"largestremainingstands\."Thecurrentviewshowsinformationabouttreediversityandecologicalimportance\.Itmentionsthat"Americanelmsarementionedaslargestremainingstands"whichistheclosestwehavetoidentifyingthemostcommonspecies\.Sincewe’vealreadyidentifiedthatAmericanelmsarementionedasthelargestremainingstands\(whichlikelymeansthey’rethemostcommonspecies\),we’veessentiallycompletedourtask\.Thenextlogicalactionwouldbetoconfirmthisinformationandconcludeoursearchsincewe’vefoundtheanswer\(AmericanelmsappeartobethemostcommontreespeciesinCentralParkbasedontheWikipediainformation\)\.We’vealreadyscrolledthroughtherelevantsectionandfoundtheinformationweneed\.Thetaskisessentiallycompletesincewe’veidentifiedthatAmericanelmsarementionedasthelargestremainingstands,whichistheclosestwecangettoidentifyingthemostcommontreespeciesfromtheavailableinformation\. 
</think\>Memory:\{"Initialwebsiteattempt":"https://tree\-map\.nycgovparks\.org","Secondwebsiteattempt":"https://www\.nycgovparks\.org","Errorencountered":"PermissionDenied","Taskobjective":"FindmostcommontreespeciesinCentralParkaccordingtomapdata","Currentstatus":"UnabletoaccessNYCParkswebsitesduetopermissionissues","Alternativesourceaccessed":"Wikipedia’sCentralParkpage","Wikipediapagecontent":"MainarticleaboutCentralParkwithtableofcontentsvisible","Treesandplantssectionaccessed":"success","Informationextracted":"CentralParkhasmorethan20,000trees\(2011survey\),morethan170treespecies,Americanelmsarementionedaslargestremainingstands","Mostcommonspeciesidentification":"Americanelmsarementionedaslargestremainingstands","Additionalspeciesmentioned":"sugarmaples,redmaples,oakspecies\(redoak,willowoak,swampwhiteoak,chestnutoak\),whitespruce,tamarack,whiteoak,blackcherry","Currentsection":"Treesandplants","Taskcompletionstatus":"completed"\}Progress:\{"AttempttoaccessNYCTreeMap":"completed","Initialwebsiteloading":"failed","Secondwebsiteattempt":"completed","Currentwebsiteaccess":"denied","Alternativeinformationsources":"attempted\(Wikipedia\)","Wikipediapageloaded":"success","Relevantsectionidentification":"completed\(Treesandplantssection\)","Informationextracted":"CentralParkhasmorethan20,000trees,morethan170treespecies,Americanelmsarementionedaslargestremainingstands","Mostcommonspeciesidentification":"completed\(Americanelmsidentifiedaslargestremainingstands\)","Taskcompletion":"completed"\}Intention:"ConfirmthatAmericanelmsarethemostcommontreespeciesinCentralParkbasedontheinformationfound"Action:"Thetaskiscompleted\-themostcommontreespeciesinCentralParkisAmericanelms,astheyarementionedasthelargestremainingstandsintheTreesandplantssectionoftheWikipediapage\." <tool\_call\> \{"name":"computer\_use","arguments":\{"action":"answer","text":"Americanelms"\}\} </tool\_call\><\|im\_end\|\>
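The responses above share a common layout: an optional `<think>` block, the Memory, Progress, Intention, and Action fields, and a final `<tool_call>` carrying a JSON object\. The sketch below shows one possible way to pull out the tool call and to drop the thinking span before a step is placed in history\. The helper names `extract_tool_call` and `strip_thinking`, the regular expressions, and the sample response are assumptions for illustration and are not taken from the AsyncWebRL code base\.

```python
import json
import re

# Non-greedy match that still spans nested braces, because the closing
# brace must be followed directly by the </tool_call> tag.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_tool_call(response: str) -> dict:
    """Return the JSON payload of the first <tool_call> block."""
    match = TOOL_CALL_RE.search(response)
    if match is None:
        raise ValueError("no <tool_call> block found")
    return json.loads(match.group(1))


def strip_thinking(response: str) -> str:
    """Remove <think>...</think> spans, e.g. before storing a step in history."""
    return THINK_RE.sub("", response).lstrip()


# Hypothetical response in the same layout as the examples above.
response = (
    "<think> The price is visible on the page. </think>"
    'Memory: {"found_price": "finished"} '
    'Progress: {"Find price": "finished"} '
    "Intention: Provide the final answer "
    "Action: The task is completed. "
    '<tool_call> {"name": "computer_use", "arguments": '
    '{"action": "answer", "text": "$395.00"}} </tool_call>'
)

call = extract_tool_call(response)
print(call["arguments"]["action"], call["arguments"]["text"])
print(strip_thinking(response)[:40])
```

A regular expression is enough here because each sampled response ends with a single tool call\. A response format that allowed several calls per step would need `finditer` or a proper parser instead\.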
### C\.3 Token\-level Entropy
Each high\-entropy token is annotated as token \($P_{\mathrm{sampled}}$\) \[alt1:P1, alt2:P2, …\], where $P_{\mathrm{sampled}}$ is the probability the policy assigned to the token it actually sampled and the bracket lists the top\-$k$ alternative tokens with their probabilities\.
The qualitative results below show the per\-token entropy of one rollout from the Qwen3\-VL\-8B\-Instruct policy trained with the $1/|\tau_i|$ normalizer, on the first and last turn of the trajectory\. Token background opacity is proportional to entropy; high\-entropy tokens carry a subscript listing the chosen token’s probability and the top alternatives\. Mean entropy collapses from 0\.0297 nats at the first turn to 0\.0077 nats at the last, with the few remaining high\-entropy positions concentrated on coordinate slots and free\-form summary fields that must integrate fresh screenshot evidence\.
step 0 · mean entropy=0\.0297
Memory:\{"task":"FindthecontactemailforvendorinquiriesfortheCrawfordsvilleStrawberryFestival\.",\(0\.73\) \[",:0\.27\]"initial\_website":"https://crawfordsvillestrawberryfestival\.com","current\_issue\(0\.90\) \[\_page:0\.07\]":"Blockedfromaccessingthekitchennatchez\.com",\(0\.62\) \["\} :0\.38\]"action\_taken":"Attemptedtonavigate\(0\.10\) \[·access:0\.90\]tothek\(0\.92\) \[·festival:0\.08\]itchennatchez\.combutwas\(0\.49\) \[·encountered:0\.49\]blocked\."\(0\.32\) \["\} :0\.68\]\} Progress:\{"GotoCrawfordsvilleStrawberryFestivalwebsite":"not\(0\.90\) \[attempt:0\.06\]finished","Findcontactemailforvendorinquiries":"notfinished"\} Intention:NavigatetotheCrawfordsvilleStrawberryFestivalwebsite Action:Navigateto\(0\.73\) \[·directly:0\.27\]https://crawfordsvillestrawberryfestival\.com <tool\_call\> \{"name":"computer\_use","arguments":\{"action":"navigate","url":"https://crawfordsvillestrawberryfestival\.com"\}\} </tool\_call\><\|im\_end\|\>
last step · mean entropy=0\.0077
Memory:\{"task":"FindthecontactemailforvendorinquiriesfortheCrawfordsvilleStrawberryFestival\.","initial\_website":"https://crawfordsvillestrawberryfestival\.com","current\_issue":"Blockedfromaccessingthekitchennatchez\.com","action\_taken":"Attemptedtonavigatetothekitchennatchez\.combutwasblocked\.","navigation\_attempted":"https://crawfordsvillestrawberryfestival\.com","alternative\_navigation\_attempted":"https://www\.crawfordsvillestrawberryfestival\.com","current\_location":"Googlesearchhomepage","search\_query":"CrawfordsvilleStrawberryFestivalvendorcontactemail","current\_page":"reCAPTCHAverificationpage","new\_navigation\_attempted":"https://www\.crawfordsville\.in\.gov","search\_bar\_clicked":"finished","search\_query\_entered":"CrawfordsvilleStrawberryFestivalvendorcontact","search\_results\_found":"yes","first\_result\_title":"AnnualCommunityEvents","first\_result\_url":"https://www\.crawfordsville\.in\.gov/428/Annual\-Community\-Events","first\_result\_description":"StrawberryFestival\-SecondweekofJuneTheStrawberryFestivalhasbeenastapleeventinourcommunityforover50yearsfeaturingaweekendfilledwithmusic,vendors,carshows,foodtrucks,andofcourse","page\_navigated\_to":"AnnualCommunityEventspage","content\_viewed":"InformationaboutUSCA’sFriendsofSugarCreekCanoeRace","vendor\_contact\_info\_found":"no","strawberry\_festival\_link\_found":"https://www\.crawfordsville\.in\.gov/strawberryfestival","strawberry\_festival\_description":"StrawberryFestival\-SecondweekofJuneTheStrawberryFestivalhasbeenastapleeventinourcommunityforover50yearsfeaturingaweekendfilledwithmusic,vendors,carshows,foodtrucks,andofcourse","current\_page\_navigated\_to":"CrawfordsvilleStrawberryFestivalwebsite","page\_content":"Welcomemessageandfestivaldates\(June12\-14,2026\)","navigation\_menu":"Home,History,Entertainment,Vendors,Events,Contact","vendors\_page\_navigated\_to":"finished","vendors\_page\_content":"SectionsforArtsandCrafts,Children’sArea,CommercialVendors,FoodVendors","scroll\_act
ion\_performed":"yes","current\_scroll\_position":"Photossection","contact\_email\_found":"no","additional\_scroll\_action\_performed":"yes","current\_scroll\_position\_after\_second\_scroll":"Socialmediaandcopyrightsection","vendor\_contact\_info\_still\_not\_found":"yes","attempted\_to\_click\_on\_contact\_link":"yes","contact\_page\_not\_loaded":"yes","current\_page\_content":"Photossectionandsocialmediastatistics","need\_to\_return\_to\_contact\_page":"yes","went\_back\_to\_previous\_page":"yes","current\_page":"CrawfordsvilleStrawberryFestivalhomepage","contact\_link\_visible":"yes","attempted\_to\_click\_on\_contact\_link\_again":"yes","contact\_page\_loaded":"yes","contact\_page\_content":"Contactformwithfieldsforname,email,andmessage","contact\_email\_found":"no","scrolled\_down\_contact\_page":"yes","individual\(0\.85\) \[contact:0\.15\]\_contacts\_section\_visible":"yes","vendor\(0\.88\) \[contact:0\.12\]\_contact\(0\.77\) \[\_emails:0\.13 \_email:0\.08\]\_emails\(0\.44\) \[\_email:0\.56\]\_found":"yes\(0\.76\) \[no:0\.22\]","food\_vendors\_email":"food@crawfordsvillestrawberryfest\.com",\(0\.88\) \["\} :0\.12\]"arts\_c\(0\.22\) \[\_and:0\.78\]rafts\_vendor\_email":"dcooksey\(0\.78\) \[se:0\.22\]strawberryfest@yahoo\.com","5\(0\.01\) \[chair:0\.81 contact:0\.14\]k\(0\.38\) \[K:0\.62\]\_run\_email":"jthompson@nm\.k12\.in\.us","chairperson\_email":"sfestqueen@gmail\.com","entertainment\_email":"entertainment@crawfordsvillestrawberryfest\.com"\} 
Progress:\{"GotoCrawfordsvilleStrawberryFestivalwebsite":"finished","Findcontactemailforvendorinquiries":"notfinished","NavigatetoGooglesearch":"finished","Typesearchquery":"finished","FacingreCAPTCHA":"notfinished","NavigatetoCrawfordsvilleINgovernmentwebsite":"finished","Clickonsearchbar":"finished","Typequeryintosearchbar":"finished","Reviewsearchresults":"finished","Clickonfirstsearchresult":"finished","Scrolldownpage":"finished","ClickonStrawberryFestivalwebsitelink":"finished","Clickon’Vendors’link":"finished","Scrolldownvendorspage":"finished","Scrolldownvendorspageagain":"finished","Clickon’Contact’link":"finished","Contactpagenotloaded":"yes","Currentpagecontent":"Photossectionandsocialmediastatistics","Gobacktopreviouspage":"finished","Currentpage":"CrawfordsvilleStrawberryFestivalhomepage","Contactlinkvisible":"yes","Attemptedtoclickoncontactlinkagain":"yes","Contactpageloaded":"yes","Contactpagecontent":"Contactformwithfieldsforname,email,andmessage","Contactemailfound":"no","Scrolleddowncontactpage":"yes","Individualcontactssectionvisible":"yes","Vendorcontactemailsfound":"yes","Foodvendorsemail":"food@crawfordsvillestrawberryfest\.com","Arts&Crafts\(0\.85\) \[·crafts:0\.15\]vendoremail":"dcookseystrawberryfest@yahoo\.com","5KRunemail":"jthompson@nm\.k12\.in\.us","Chairpersonemail":"sfestqueen@gmail\.com","Entertainmentemail":"entertainment@crawfordsvillestrawberryfest\.com"\} Intention:Provide\(0\.88\) \[·Find:0\.07\]thecontactemailforvendor\(0\.44\) \[·food:0\.56\]inquiriesastheanswer\(0\.02\) \[·task:0\.92 ·final:0\.06\] Action:Thecontactemailforvendorinquiriesisfood@crawfordsvillestrawberryfest\.com <tool\_call\> \{"name":"computer\_use","arguments":\{"action":"answer","text":"Thecontactemailforvendorinquiriesisfood@crawfordsvillestrawberryfest\.com"\}\} </tool\_call\><\|im\_end\|\>
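The annotation format used in this subsection can be sketched in a few lines\. This is an illustration under stated assumptions rather than the authors' visualization code: each position is given as a small dictionary mapping candidate tokens to probabilities \(a real policy would compute entropy over its full vocabulary\), and the 0\.3\-nat threshold, `top_k=2`, and the function names are hypothetical choices\.

```python
import math
from typing import Dict, Iterable, List


def entropy_nats(probs: Iterable[float]) -> float:
    """Shannon entropy in nats of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def annotate(token: str, dist: Dict[str, float],
             threshold: float = 0.3, top_k: int = 2) -> str:
    """Render a sampled token, adding 'token (P) [alt:P, ...]' when entropy is high."""
    if entropy_nats(dist.values()) < threshold:
        return token  # low-entropy positions are shown without annotation
    alternatives = sorted(
        ((t, p) for t, p in dist.items() if t != token),
        key=lambda pair: pair[1],
        reverse=True,
    )[:top_k]
    alt_text = ", ".join(f"{t}:{p:.2f}" for t, p in alternatives)
    return f"{token} ({dist[token]:.2f}) [{alt_text}]"


def mean_entropy(dists: List[Dict[str, float]]) -> float:
    """Average per-token entropy over one turn."""
    return sum(entropy_nats(d.values()) for d in dists) / len(dists)


# Hypothetical next-token distributions for two positions in a turn.
uncertain = {"to": 0.73, "directly": 0.27}
confident = {"https": 0.999, "http": 0.001}

print(annotate("to", uncertain))
print(annotate("https", confident))
print(round(mean_entropy([uncertain, confident]), 4))
```

Averaging `entropy_nats` over every position of a turn gives the kind of per\-turn mean reported in the headers above, although the exact values depend on the full\-vocabulary distribution rather than the truncated dictionaries used here\.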
### C\.4 Prompts
Both the Instruct and Thinking checkpoints share the same agent prompt template\. The only configurable variation is the single rule line that controls whether Memory is updated additively \(default\) or compressively \(ablation\)\. Both variants are shown inline below, color\-coded\.
`Prompt: AsyncWebRL Agent`
### C\.5 Hyperparameters

Table 5 reports the hyperparameters used for the three main runs on the Instruct variant: GRPO with the $1/|\tau_i|$ normalizer, GRPO with the constant\-$1/k$ fix, and RAFT\+\+\. The upper sub\-table lists settings shared across all three runs; the lower sub\-table lists the settings that differ\. RAFT\+\+ uses a consumer batch size of 120 rather than 128 because we find that training on 6 GPUs is more speed\-optimal for RAFT\+\+, whereas GRPO uses a group size of 8 and therefore naturally takes a consumer batch size of 128\. This is purely a scheduling choice; no other hyperparameter was retuned\.

The Thinking\-variant GRPO runs share the same prompt template \(Section C\.4\) and the same loss, optimizer, and rollout settings as the Instruct GRPO runs; only the base model differs \(Qwen3\-VL\-8B\-Thinking\) and, at runtime, the previous step’s thinking tokens are hidden from history\. The Thinking\-variant RAFT\+\+ run additionally uses a higher learning rate \($1\times 10^{-5}$\), a single update epoch per batch, and a relaxed off\-policy lag cap of 4\.

Table 5: Hyperparameters for the three main Qwen3\-VL\-8B\-Instruct runs\. Values are taken from each run’s wandb config\.