PRO-CUA: Process-Reward Optimization for Computer Use Agents

arXiv cs.AI Papers

Summary

This paper introduces PRO-CUA, a process-reward optimization framework for training Computer Use Agents (CUAs) using iterative step-level reinforcement learning. The method decouples on-policy environment interaction from policy optimization, enabling dense credit assignment without relying on expert trajectories, and demonstrates effectiveness on live web benchmarks.

arXiv:2605.29119v1 Announce Type: new Abstract: Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high-quality supervision. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstration and the absence of negative learning signals. Meanwhile, standard trajectory-level reinforcement learning struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long-horizon GUI interaction. In this work, we propose PRO-CUA, a process-reward optimization framework for training CUAs with iterative step-level reinforcement learning. PRO-CUA decouples on-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step-level feedback from a process reward model (PRM), and is optimized with group-relative advantages. This design enables dense and flexible credit assignment without relying on golden answers or offline expert trajectories, while reducing distribution shift by training on the agent's own execution states. Experiments on live web benchmarks demonstrate the effectiveness of PRO-CUA and the reliability of PRM-guided step-level training.

Cached at: 05/29/26, 09:13 AM

# PRO-CUA: Process-Reward Optimization for Computer Use Agents
Source: [https://arxiv.org/html/2605.29119](https://arxiv.org/html/2605.29119)
Yifei He, Rui Yang, Hao Bai, Tong Zhang, Han Zhao. University of Illinois Urbana-Champaign. [Website](https://yifei-he.github.io/pro-cua-website/) [Code](https://github.com/yifei-he/PRO-CUA) [Model](https://huggingface.co/PRO-CUA)

###### Abstract

Computer use agents \(CUAs\) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high\-quality supervision\. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstration and the absence of negative learning signals\. Meanwhile, standard trajectory\-level reinforcement learning struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long\-horizon GUI interaction\. In this work, we propose PRO\-CUA, a process\-reward optimization framework for training CUAs with iterative step\-level reinforcement learning\. PRO\-CUA decouples on\-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step\-level feedback from a process reward model \(PRM\), and is optimized with group\-relative advantages\. This design enables dense and flexible credit assignment without relying on golden answers or offline expert trajectories, while reducing distribution shift by training on the agent’s own execution states\. Experiments on live web benchmarks demonstrate the effectiveness of PRO\-CUA and the reliability of PRM\-guided step\-level training\.


## 1 Introduction

Driven by rapid breakthroughs in multimodal reasoning, autonomous agents are evolving into economically valuable digital coworkers. Computer use agents (CUAs) (OpenAI, [2025](https://arxiv.org/html/2605.29119#bib.bib7); Anthropic, [2024](https://arxiv.org/html/2605.29119#bib.bib6); Agashe et al., [2025](https://arxiv.org/html/2605.29119#bib.bib15); Qin et al., [2025](https://arxiv.org/html/2605.29119#bib.bib2); Wang et al., [2025a](https://arxiv.org/html/2605.29119#bib.bib26), [c](https://arxiv.org/html/2605.29119#bib.bib35)) have proven highly capable of automating complex, open-ended workflows. By perceiving visual interfaces and executing sequential plans, these agents natively operate across diverse digital ecosystems, including web browsers (Deng et al., [2023](https://arxiv.org/html/2605.29119#bib.bib13); He et al., [2024a](https://arxiv.org/html/2605.29119#bib.bib3); Xue et al., [2025](https://arxiv.org/html/2605.29119#bib.bib8)) and desktop environments (Xie et al., [2024](https://arxiv.org/html/2605.29119#bib.bib14); Bonatti et al., [2025](https://arxiv.org/html/2605.29119#bib.bib12); Wu et al., [2025](https://arxiv.org/html/2605.29119#bib.bib38)). Despite their immense commercial value and recent capability leaps, creating CUAs that generalize reliably remains fundamentally bottlenecked by how they are trained. Specifically, researchers are constrained by two interlocking challenges: the prohibitive latency and computational cost of interacting with live GUI environments, and an acute scarcity of high-quality training data.

The most intuitive and prevailing approach to training CUAs is filtered behavior cloning (FBC) from expert demonstrations (Bai et al., [2024](https://arxiv.org/html/2605.29119#bib.bib25); Xu et al., [2025](https://arxiv.org/html/2605.29119#bib.bib1); He et al., [2024b](https://arxiv.org/html/2605.29119#bib.bib4); Shen et al., [2025](https://arxiv.org/html/2605.29119#bib.bib24); He et al., [2026](https://arxiv.org/html/2605.29119#bib.bib40)). However, FBC inherently suffers from imitation bottlenecks: it over-penalizes reasoning diversity, disproportionately overfits to easy tasks, and lacks negative learning signals for agents to learn from mistakes. To overcome these limitations, reinforcement learning (RL) offers a principled alternative. Yet applying standard trajectory-level RL to CUAs introduces optimization and infrastructural difficulties. In long-horizon computer use tasks, receiving a single sparse reward at task completion makes step-wise credit assignment highly ambiguous, so the agent cannot deduce which specific action among dozens caused the ultimate failure. Furthermore, standard synchronous RL frameworks such as verl (Sheng et al., [2025](https://arxiv.org/html/2605.29119#bib.bib44)) are computationally ill-equipped for multi-turn agent workflows. The latency of live GUI execution, combined with the compounding memory costs of token-heavy image contexts, renders trajectory-level optimization especially challenging.

![Refer to caption](https://arxiv.org/html/2605.29119v1/x3.png)

Figure 1: Overview of the PRO-CUA pipeline. PRO-CUA alternates between two stages across multiple training iterations. In Stage 1, the current policy interacts with the live environment to collect on-policy states. In Stage 2, policy optimization is performed without further environment interaction through three steps: i) Step-level generation: the agent samples multiple candidate actions for each collected state; ii) PRM grading: a process reward model assigns binary step-level rewards; and iii) GRPO update: the policy is optimized using group-relative advantages. The updated policy is then used for the next round of on-policy state collection.

To bypass the sparse reward problem of trajectory-level learning, recent efforts have shifted toward step-level RL paradigms (Luo et al., [2025](https://arxiv.org/html/2605.29119#bib.bib42); Yang et al., [2026](https://arxiv.org/html/2605.29119#bib.bib41)). While step-wise optimization makes credit assignment tractable, existing methods remain constrained by their reward design and data collection mechanisms. They predominantly rely on rule-based rewards that require exact-match accuracy against a golden answer, a strict requirement that drastically limits the scalability of viable training data. Moreover, these pipelines are primarily off-policy, optimizing over states collected by a stronger teacher model rather than the target policy itself. Because an agent’s actions sequentially alter future GUI observations, this off-policy collection introduces compounding distribution shift (Ross et al., [2011](https://arxiv.org/html/2605.29119#bib.bib43)). The offline training data can diverge from the actual, suboptimal states the agent will encounter, resulting in brittle performance and an inability to recover from mistakes.

To tackle the dual challenges of reward scarcity and off-policy distribution shift, we propose an on-policy self-evolving framework termed PRO-CUA (Process-Reward Optimization for Computer Use Agents). As shown in [Figure 1](https://arxiv.org/html/2605.29119#S1.F1), PRO-CUA alternates between two stages across multiple training iterations. In Stage 1, the current policy interacts with the live environment at an elevated sampling temperature to collect on-policy states. In Stage 2, policy optimization is performed without further environment interaction: for each collected state, the agent samples $G$ diverse thought-action pairs, a Process Reward Model (PRM) assigns binary step-level rewards, and the policy is updated with GRPO. This design decouples live environment interaction from policy optimization, allowing each stage to run with infrastructure tailored to its own computational profile while training the agent on its own execution states.

Empirically, we verify the effectiveness of the PRO-CUA pipeline on live web benchmarks, including WebVoyager (He et al., [2024a](https://arxiv.org/html/2605.29119#bib.bib3)), Mind2Web-Live (Pan et al., [2024](https://arxiv.org/html/2605.29119#bib.bib19)), and Online Mind2Web (Xue et al., [2025](https://arxiv.org/html/2605.29119#bib.bib8)). We further demonstrate the reliability of our PRM grading pipeline and the effective data utilization of our training approach.

Overall, our PRO-CUA framework provides a scalable pipeline for training computer use agents. In summary, our main contributions are: i) Tailored CUA infrastructure design: We decouple environment interaction from model training, avoiding the systems challenge of simultaneously performing agent rollout, environment interaction, and policy optimization. ii) On-policy state collection: We eliminate the reliance on offline expert demonstrations and enable the agent to learn from its own execution distribution. iii) Dense and flexible credit assignment via PRMs: We transition from sparse trajectory-level rewards to PRM-graded step-level GRPO, which provides fine-grained supervision without requiring golden answers collected from expert demonstrations.

## 2 Preliminaries

#### Computer use agents \(CUAs\)

CUAs take sequential steps to interact with a graphical user interface (GUI) to complete a task defined in the task instruction. Following the ReAct (Yao et al., [2023](https://arxiv.org/html/2605.29119#bib.bib5)) paradigm, agents typically generate interleaved thoughts and actions, explicitly externalizing their reasoning process to improve task execution. At any given step $n$, the agent receives a state context $x_n$, which comprises the instruction $\mathcal{I}$, the historical sequence of past thoughts and actions $\{(t_i, a_i)\}_{i=1}^{n-1}$, and a truncated window of the $w$ most recent visual observations (screenshots) $\{o_{n-j}\}_{j=0}^{w-1}$. Truncation is applied exclusively to the visual inputs to manage the prohibitive token costs associated with multimodal contexts. In this work, we set $w=1$, so the agent observes only the most recent screenshot, which is memory-efficient and has been shown to be sufficient for GUI perception (Qin et al., [2025](https://arxiv.org/html/2605.29119#bib.bib2)). The agent’s objective is to learn a policy $\pi_\theta(t_n, a_n \mid x_n)$ that iteratively selects actions leading to successful task completion.

#### Filtered Behavior Cloning \(FBC\)

Currently, the most prevalent training paradigm for CUAs is FBC, which is conceptually equivalent to Rejection Sampling Fine-Tuning (RFT). In this approach, a dataset of candidate trajectories is filtered to retain only those that result in a successful final outcome, forming a curated dataset $\mathcal{D}_{\text{succ}}$. The policy is then optimized via standard supervised fine-tuning (SFT) to maximize the log-likelihood of the expert thoughts and actions:

$$
\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{\tau \sim \mathcal{D}_{\text{succ}}} \left[ \sum_{n=1}^{|\tau|} \log \pi_\theta(t_n, a_n \mid x_n) \right].
$$
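The FBC objective above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' training code: the trajectory layout (a dict with a `success` flag and a list of `(x, t, a)` steps) and the `log_prob` callable standing in for $\log \pi_\theta(t_n, a_n \mid x_n)$ are both hypothetical.

```python
from typing import Callable, Dict, List


def fbc_loss(trajectories: List[Dict], log_prob: Callable[[str, str, str], float]) -> float:
    """Negative log-likelihood averaged over successful trajectories.

    Each trajectory is a dict {"success": bool, "steps": [(x, t, a), ...]}
    (an illustrative layout, not the paper's data format).
    """
    # Filtering step: keep only trajectories with a successful final outcome.
    d_succ = [traj for traj in trajectories if traj["success"]]
    if not d_succ:
        return 0.0
    total = 0.0
    for traj in d_succ:
        # Inner sum over the steps n = 1..|tau| of one trajectory.
        total += -sum(log_prob(x, t, a) for (x, t, a) in traj["steps"])
    # Empirical expectation over the filtered dataset.
    return total / len(d_succ)
```

Note that failed trajectories drop out before any loss is computed, which is the "absence of negative learning signals" the text refers to.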

#### Reinforcement Learning \(RL\)

While FBC optimizes the likelihood of both the thought tokens and action tokens, RL only requires a reward signal for the generated action\. This distinction is important for CUAs: intermediate thoughts can be long and diverse, whereas task progress is ultimately determined by the executed action\. RL therefore avoids forcing the policy to reproduce the reference reasoning trace, and instead reinforces any generation that leads to a rewarded action\.

In single-turn domains like mathematical reasoning, GRPO relies only on a sparse outcome reward (e.g., verifying the answer). However, this trajectory-level approach is often suboptimal for long-horizon computer use tasks due to credit assignment challenges. To address this issue, recent works deploy step-level RL. Similar to SFT, given a step context $x = \left(\mathcal{I}, \{(t_i, a_i)\}_{i=1}^{n-1}, \{o_{n-j}\}_{j=0}^{w-1}\right)$, GRPO samples a group of $G$ candidate thought-action pairs $\{(t_k, a_k)\}_{k=1}^{G} \sim \pi_\theta(\cdot \mid x)$ and computes the reward $r_k = \mathcal{R}(x, a_k)$, where $\mathcal{R}$ is predominantly a rule-based reward that compares the action with the golden answer obtained through expert demonstration. It measures the accuracy of the action based on the action type, the input text, and the coordinates the agent interacts with (detailed formulation in Appendix [C](https://arxiv.org/html/2605.29119#A3)). The overall objective is

$$
\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}_{\substack{x \sim D \\ \{a_k\}_{k=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}} \left[ \frac{1}{G} \sum_{k=1}^{G} \min\Big( \rho_k \hat{A}_k,\ \text{clip}(\rho_k, 1-\epsilon, 1+\epsilon)\, \hat{A}_k \Big) - \beta \cdot \text{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \right] \tag{1}
$$

where $\rho_k = \frac{\pi_\theta(a_k \mid x)}{\pi_{\theta_{\text{old}}}(a_k \mid x)}$ represents the importance sampling ratio, $\hat{A}_k$ is the relative advantage computed within the sampled group, and $\beta$ controls the KL divergence penalty against the reference model $\pi_{\text{ref}}$ to prevent policy collapse.
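To make the clipped term of the GRPO objective concrete, the sketch below evaluates it for a single state and its group of $G$ candidates. It is a hedged illustration only: the text does not spell out how $\hat{A}_k$ is computed, so dividing the mean-centered reward by the group standard deviation is an assumption borrowed from common GRPO practice, the function names are hypothetical, and the KL penalty is left out.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Mean-center rewards within one group and scale by the group std (assumed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(
    rewards: List[float],
    logp_new: List[float],
    logp_old: List[float],
    clip_eps: float = 0.2,
) -> float:
    """Average of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) over one group.

    The KL penalty term of the objective is omitted here for brevity.
    """
    advantages = group_advantages(rewards)
    terms = []
    for adv, lp_new, lp_old in zip(advantages, logp_new, logp_old):
        rho = math.exp(lp_new - lp_old)  # importance sampling ratio
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, rho))
        terms.append(min(rho * adv, clipped * adv))
    return sum(terms) / len(terms)
```

The training loss is the negative of this quantity, averaged over sampled states. When every candidate in a group receives the same reward, all advantages are zero and the state contributes nothing to the update.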

## 3 PRO-CUA

### 3.1 On-policy State Collection

Prior works primarily rely on distillation from strong teacher models or human demonstrations (He et al., [2026](https://arxiv.org/html/2605.29119#bib.bib40); Yang et al., [2026](https://arxiv.org/html/2605.29119#bib.bib41); Luo et al., [2025](https://arxiv.org/html/2605.29119#bib.bib42)). However, these expert trajectories suffer from severe distribution shift: they fail to represent the sub-optimal states a developing agent actually encounters. Because an agent’s primitive actions alter subsequent observations, it inevitably drifts into catastrophic or out-of-distribution states (e.g., stuck websites) that are entirely absent from expert data.

While online RL offers a principled solution by allowing the agent to explore via its own policy, it introduces severe infrastructural bottlenecks in the CUA setting. Synchronizing high-throughput LLM inference (e.g., vLLM (Kwon et al., [2023](https://arxiv.org/html/2605.29119#bib.bib46))), high-latency web browser interactions, and dedicated training frameworks (e.g., verl (Sheng et al., [2025](https://arxiv.org/html/2605.29119#bib.bib44))) into a single loop results in prohibitive latency, I/O overhead, and hardware idling.

To overcome these bottlenecks while preserving the necessity of on-policy exploration, we propose a decoupled state collection paradigm. As illustrated in Stage 1 of [Figure 1](https://arxiv.org/html/2605.29119#S1.F1), we disentangle the slow environment interactions from the compute-heavy policy optimization loop. We deploy the current agent policy to interact with live environments at an elevated sampling temperature, encouraging the discovery of diverse paths. During the agent rollout, we continuously harvest these exploratory trajectories, logging the task instructions, visual observations, and action histories into a state dataset $\mathcal{D}_{\text{state}}$. Formally, let $\mathcal{T}$ represent a set of collected trajectories (which may include both successful and failed rollouts). A single trajectory $\tau \in \mathcal{T}$ of length $|\tau|$ is composed of a sequence of step-level interactions. At each step $n \in \{1, \dots, |\tau|\}$, the environment and the agent’s history dictate a state context $x_n^{(\tau)}$. The state dataset is the union of these contexts over all steps of all collected trajectories:

$$
\mathcal{D}_{\text{state}} = \bigcup_{\tau \in \mathcal{T}} \bigcup_{n=1}^{|\tau|} \left\{ x_n^{(\tau)} \right\},
$$

where $x_n^{(\tau)} = \left( \mathcal{I}^{(\tau)}, \{(t_i, a_i)\}_{i=1}^{n-1}, \{o_{n-j}\}_{j=0}^{w-1} \right)$. This dataset is used in the next stage to optimize the policy.
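As a rough sketch of how the state dataset could be assembled from logged rollouts, the Python below flattens trajectories into step-level contexts. The trajectory layout (a dict with an `instruction` and an ordered list of `(observation, thought, action)` steps) and the function name are assumptions made for illustration; the paper does not specify a storage format.

```python
from typing import Dict, List, Tuple

# (instruction, ((t_1, a_1), ..., (t_{n-1}, a_{n-1})), (o_n, ..., o_{n-w+1}))
StateContext = Tuple[str, Tuple[Tuple[str, str], ...], Tuple[str, ...]]


def build_state_dataset(trajectories: List[Dict], w: int = 1) -> List[StateContext]:
    """Flatten rollouts (successful or failed) into step-level state contexts."""
    d_state: List[StateContext] = []
    for traj in trajectories:
        steps = traj["steps"]  # list of (observation, thought, action), in order
        for n in range(1, len(steps) + 1):
            # History of thoughts and actions from steps 1..n-1 (never truncated).
            history = tuple((t, a) for (_, t, a) in steps[: n - 1])
            # Window of the w most recent screenshots o_n, o_{n-1}, ...
            window = tuple(steps[n - 1 - j][0] for j in range(w) if n - 1 - j >= 0)
            d_state.append((traj["instruction"], history, window))
    return d_state
```

The thought and action taken at step $n$ are deliberately left out of $x_n$: only the context the agent saw before acting is stored, so the optimization stage can sample fresh candidates for it.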

This decoupled mechanism solves two critical problems\. First, it allows for highly parallelized, scalable data generation without bottlenecking the training hardware\. Second, by populating the dataset with the agent’s own exploratory rollouts rather than expert demonstrations, the policy is forced to confront its own mistakes, natively generating the failure states required to learn error\-recovery during the subsequent optimization stage\.

### 3.2 Step-level Rollout

Given the on-policy state dataset collected in the previous stage, PRO-CUA proceeds to the step-level rollout process. In each training iteration, the agent samples states from $\mathcal{D}_{\text{state}}$ and, for each state, generates $G$ candidate thought-action pairs. These candidates are newly sampled from the current policy and are not the actions originally executed during state collection. Thus, the collected trajectories serve only to provide the state distribution encountered by the agent, rather than reference demonstrations for imitation.

Crucially, the candidate actions are not executed in the live environment. This avoids repeatedly invoking costly browser interactions inside the optimization loop, but also means that the resulting state transitions are not directly observed. We therefore defer evaluation to the PRM, as described in the next section.
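The step-level rollout can be pictured as a plain sampling loop with no environment in it. The sketch below is illustrative only: `sample` is a hypothetical stand-in for one draw from the current policy, and states are assumed to be hashable so they can key the result.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Candidate = Tuple[str, str]  # (thought, action)


def step_level_rollout(
    states: List[Hashable],
    sample: Callable[[Hashable], Candidate],
    group_size: int,
) -> Dict[Hashable, List[Candidate]]:
    """Sample `group_size` fresh candidates per state; nothing is executed."""
    groups: Dict[Hashable, List[Candidate]] = {}
    for state in states:
        # Candidates come from the current policy, not from the logged rollout.
        groups[state] = [sample(state) for _ in range(group_size)]
    return groups
```

Because no browser call appears in the loop, this stage can in principle be batched like ordinary offline generation.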

### 3.3 Process Reward Model Grading

![Refer to caption](https://arxiv.org/html/2605.29119v1/plots/prm.png)

Figure 2: Process Reward Model (PRM) grading pipeline. The PRM receives a multimodal step context comprising the task instruction, the agent’s action history, the proposed current action, and an annotated screenshot. For readability, the figure shows a zoomed-in crop, while the actual PRM input contains the full web interface. Based on this augmented context, the PRM generates a reasoning trace to assess whether the proposed action functionally advances the task, culminating in a binary step-level reward.

A primary limitation of rule-based step rewards is their dependence on golden reference actions. Such rewards assign positive feedback only when the proposed action matches an expert action in action type, target element, and textual input. Consequently, they can only be applied to states from successful trajectories where verified reference actions are available, and may penalize valid alternative actions that differ from the demonstration. This makes rule-based rewards data-inefficient and prevents the agent from learning from failure states.

PRO\-CUA replaces these static heuristics with PRM grading\. Instead of checking exact agreement with a predefined action, the PRM evaluates whether a proposed action functionally advances the task under the current state\. This decouples step\-level supervision from expert demonstrations and allows both successful and failed on\-policy trajectories to contribute training states\.

As illustrated in [Figure 2](https://arxiv.org/html/2605.29119#S3.F2), we formulate the PRM evaluation as a multimodal reasoning task. For any given step, the PRM is provided with a comprehensive context: the task instruction, the action history, and the agent’s proposed current action. Since the candidate action is not executed in the live environment, the PRM does not observe the next state. To ground the evaluation, we annotate the screenshot at the target coordinates of the proposed action, allowing the PRM to identify the intended UI element. The PRM then reasons about whether the proposed action is visually grounded, non-redundant, and useful for task progress, and outputs a binary step-level reward. The detailed grading prompt is in Appendix [F](https://arxiv.org/html/2605.29119#A6).
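The grading step described above might be wired together roughly as follows. Everything named here is hypothetical: `annotate` stands in for whatever draws the marker on the screenshot, `prm_judge` stands in for the PRM call plus verdict parsing, and the dict keys are invented for illustration. The actual prompt lives in the paper's appendix and is not reproduced, and the sketch assumes every candidate carries target coordinates.

```python
from typing import Callable, Dict, List, Tuple


def grade_group(
    instruction: str,
    history: List[str],
    screenshot: str,
    candidates: List[Dict],
    annotate: Callable[[str, Tuple[int, int]], str],
    prm_judge: Callable[[Dict], bool],
) -> List[int]:
    """Return one binary reward per candidate action for a single state."""
    rewards = []
    for cand in candidates:
        # Mark the target coordinates so the judge can see the intended element.
        marked = annotate(screenshot, cand["xy"])
        step_context = {
            "instruction": instruction,
            "history": history,
            "proposed_action": cand["action"],
            "annotated_screenshot": marked,
        }
        # The judge never sees the next state: the action was not executed.
        rewards.append(1 if prm_judge(step_context) else 0)
    return rewards
```

Passing the annotator and judge in as callables keeps the sketch independent of any particular vision-language model or image library.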

Because the PRM is used as a training signal rather than an inference-time controller, PRO-CUA does not require perfectly calibrated rewards. GRPO distinguishes acceptable from unacceptable candidates within a sampled group, and the resulting updates aggregate reward signals across many states, sampled actions, and optimization steps. This makes the training process more tolerant to PRM noise than conventional test-time Best-of-N selection (Chae et al., [2026](https://arxiv.org/html/2605.29119#bib.bib61); Zhang et al., [2026](https://arxiv.org/html/2605.29119#bib.bib60)), where a single incorrect PRM preference can directly determine the executed action.

### 3.4 Policy Optimization

We optimize the policy using the step-level GRPO objective ([Equation 1](https://arxiv.org/html/2605.29119#S2.E1)). Unlike Filtered Behavior Cloning (FBC), which relies on exactly following expert demonstrations, GRPO offers two advantages. First, by relying on the PRM for functional evaluation rather than strict imitation, it preserves reasoning diversity, allowing the agent to freely explore novel thought paths as long as the resulting action is rewarded. Second, the group-relative advantage naturally acts as a dynamic curriculum. Because advantages are mean-centered within the sampled group, gradients vanish for trivial states where every sampled action receives the same reward, while difficult states (where most sampled actions fail) generate disproportionately high positive advantages for successful rollouts, focusing optimization where the policy struggles most.
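A small worked example shows the curriculum effect with binary rewards and a group of eight candidates. It uses mean-centering only, as described above; any additional scaling (for example by the group standard deviation, as some GRPO implementations do) is left out, and the function name is illustrative.

```python
from typing import List


def mean_centered_advantages(rewards: List[int]) -> List[float]:
    """Subtract the group mean reward from each candidate's reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


easy = mean_centered_advantages([1, 1, 1, 1, 1, 1, 1, 1])    # every candidate rewarded
mostly = mean_centered_advantages([1, 1, 1, 1, 1, 1, 1, 0])  # 7 of 8 rewarded
hard = mean_centered_advantages([1, 0, 0, 0, 0, 0, 0, 0])    # 1 of 8 rewarded

print(easy[0])    # 0.0
print(mostly[0])  # 0.125
print(hard[0])    # 0.875
```

The rewarded action in the hard state receives seven times the advantage of a rewarded action in the mostly-solved state, and the fully solved state contributes no gradient at all.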

## 4 Experiments

### 4.1 Setup

#### Rollout

We use Qwen3-VL-4B-Instruct and Qwen3-VL-8B-Instruct (Bai et al., [2025b](https://arxiv.org/html/2605.29119#bib.bib45)) as base models. Following Shen et al. ([2025](https://arxiv.org/html/2605.29119#bib.bib24)), we use their synthetic task set derived from WebVoyager. At each training iteration, we sample 256 tasks and roll out the current policy to collect on-policy trajectories, running 10 iterations in total. During rollout, we cap the maximum trajectory length at 20 steps, as most tasks are designed to be completed within this horizon, and use a temperature of 1.0 to encourage exploration.

#### Baselines

We compare PRO-CUA with two baseline training algorithms. Both baselines follow the same iterative setting: at each iteration, the current policy interacts with the environment to collect trajectories and obtain task-level outcome rewards. The collected data is then filtered or converted into training examples as follows. Filtered Behavior Cloning (FBC) keeps only successful trajectories and trains the policy on their observed thought-action sequences using SFT. Step-level RL with rule-based rewards retains states from successful trajectories and treats the corresponding executed actions as golden references. For each retained state, it samples $G$ candidate thought-action pairs from the current policy, assigns rule-based rewards by comparing each candidate action with the reference action, and optimizes the policy with GRPO. In both baselines, failed trajectories are discarded and do not contribute to training.

#### Training

For FBC experiments, we use LlamaFactory (Zheng et al., [2024](https://arxiv.org/html/2605.29119#bib.bib20)) for training with a learning rate of 1e-5 for 2 epochs. For RL experiments, we use verl (Sheng et al., [2025](https://arxiv.org/html/2605.29119#bib.bib44)) with a learning rate of 5e-6 for 1 epoch. All experiments are conducted on NVIDIA A6000 GPUs with 48 GB of memory. Unlike standard offline training, our setting alternates between live environment rollouts and policy optimization, and the number of optimization steps in each iteration depends on the dynamically collected trajectories and their lengths. The total number of updates is therefore not known before rollout, making a pre-specified cosine decay schedule less natural. We instead use a constant learning rate for all RL methods to ensure a simple and consistent comparison across iterations.

#### Evaluation

We evaluate PRO-CUA on three online web benchmarks: WebVoyager (He et al., [2024a](https://arxiv.org/html/2605.29119#bib.bib3)), Mind2Web-Live (Pan et al., [2024](https://arxiv.org/html/2605.29119#bib.bib19)), and Online Mind2Web (Xue et al., [2025](https://arxiv.org/html/2605.29119#bib.bib8)). For each task, we allow the agent to take up to 30 steps. Due to website updates, some domains have introduced anti-scraping mechanisms, anti-bot checks, or access restrictions that were not present during benchmark construction. Following common practice for live web evaluation, we exclude domains that cannot be reliably accessed by automated agents, with details provided in Appendix [A](https://arxiv.org/html/2605.29119#A1). Following He et al. ([2024a](https://arxiv.org/html/2605.29119#bib.bib3)), we provide the full agent trajectory and corresponding screenshots to GPT-5 (Singh et al., [2025](https://arxiv.org/html/2605.29119#bib.bib49)), which serves as an automatic evaluator for task success.

### 4.2 Reliability of the PRM Reward Signal

Table 1: Comparison of reward sources for step-level RL. Under the same successful-trajectory training subset, PRM-based rewards outperform rule-based rewards, suggesting that visually grounded PRM feedback provides useful supervision for policy optimization.

A key question for PRO-CUA is whether PRM feedback provides a useful training signal compared with traditional rule-based rewards. To isolate the effect of the reward source, we conduct a controlled ablation on WebVoyager where all methods train on the same subset of states from successful trajectories. This setting favors rule-based rewards, since golden reference actions are available and failed trajectories are excluded. We compare rule-based rewards with two visually grounded PRMs, Qwen3-VL-4B and GPT-5-mini. As shown in [Table 1](https://arxiv.org/html/2605.29119#S4.T1), both PRM-based variants outperform the rule-based baseline, indicating that PRM feedback can provide effective step-level supervision even under the same data regime as rule-based rewards.

![Refer to caption](https://arxiv.org/html/2605.29119v1/x4.png)

Figure 3: Step-level rewards assigned during training, with moving average. GPT-5-mini assigns more conservative rewards, while Qwen3-VL-4B is more lenient on average. Despite this calibration gap, both PRMs achieve similar downstream policy performance, suggesting that GRPO is robust to differences in reward strictness through group normalization.

Table 2: Task success rates (%) on three online web benchmarks. We compare PRO-CUA with open CUA models trained on large-scale expert or closed data, and with controlled baselines using Qwen3-VL backbones. PRO-CUA improves the base policy through iterative on-policy self-evolution without relying on expert demonstrations.

| Training Paradigm | Method | Ext. Expert Steps | WebVoyager | Mind2Web-Live | Online Mind2Web |
| --- | --- | --- | --- | --- | --- |
| External expert / closed data | UI-TARS-1.5-7B | Closed data | 30.3 | 18.1 | 14.6 |
| | WebSTAR-7B | 100K | 47.0 | 17.0 | 17.0 |
| | WebSTAR-32B | 100K | 53.5 | 20.4 | 23.8 |
| | GUI-Libra-4B | 81K | – | – | 20.0 |
| | GUI-Libra-8B | 81K | – | – | 19.3 |
| Self-evolving 4B | Qwen3-VL-4B-Instruct | 0 | 27.5 | 18.1 | 16.7 |
| | FBC | 0 | 29.7 | 26.4 | 23.7 |
| | Rule-based Step-RL | 0 | 34.7 | 27.8 | 29.9 |
| | PRO-CUA | 0 | 42.4 | 34.7 | 28.8 |
| Self-evolving 8B | Qwen3-VL-8B-Instruct | 0 | 25.6 | 20.8 | 12.2 |
| | FBC | 0 | 31.8 | 23.6 | 26.9 |
| | Rule-based Step-RL | 0 | 33.8 | 25.0 | 26.2 |
| | PRO-CUA-8 | 0 | 43.2 | 30.6 | 28.2 |

Furthermore, we observe that the lightweight Qwen3-VL-4B PRM achieves similar downstream performance to GPT-5-mini. To better understand this result, we plot the moving average of step-level rewards assigned during optimization. As shown in [Figure 3](https://arxiv.org/html/2605.29119#S4.F3), the two PRMs exhibit substantially different reward calibration: GPT-5-mini is more conservative, with an average reward around 0.5, whereas Qwen3-VL-4B is more lenient, with an average reward above 0.7. One possible explanation is evaluator pedantry (Zheng et al., [2023](https://arxiv.org/html/2605.29119#bib.bib47); Li et al., [2024](https://arxiv.org/html/2605.29119#bib.bib48)), a phenomenon often observed in strong LLM-as-a-judge systems, where valid exploratory actions may be penalized for deviating from the evaluator’s internal preference or stylistic prior.

Importantly, despite calibration differences, both PRMs lead to comparable final policy performance\. This suggests that PRO\-CUA does not require perfectly calibrated absolute rewards\. Since GRPO computes mean\-centered advantages within each sampled group, binary PRM feedback only needs to provide useful local discrimination between acceptable and unacceptable candidate actions\. The resulting updates aggregate these signals across many states, sampled candidates, and optimization steps, making the training process robust to differences in reward strictness across PRM backbones\.

### 4.3 Main Results

#### Benchmark evaluation

[Table 2](https://arxiv.org/html/2605.29119#S4.T2) reports the performance on three online web benchmarks. We include both representative open CUA models trained with large-scale expert or closed data, and controlled baselines initialized from the same Qwen3-VL backbones. Compared with prior open models, PRO-CUA is self-evolving: it improves the base policy through on-policy rollouts and PRM-guided step-level RL without relying on external expert demonstrations. Despite this weaker supervision assumption, PRO-CUA achieves competitive performance with expert-trained systems, while substantially improving over its own base models and controlled training baselines.

Within the controlled comparison, PRO\-CUA achieves the strongest overall performance across both model sizes, with particularly large gains on WebVoyager and Mind2Web\-Live\. For the 4B model, PRO\-CUA improves over FBC by 12\.7% on WebVoyager and 8\.3% on Mind2Web\-Live in success rate, and improves over rule\-based Step\-RL by 7\.7% and 6\.9%, respectively\. The gains remain consistent for the 8B model, where PRO\-CUA outperforms rule\-based Step\-RL by 9\.4% on WebVoyager and 5\.6% on Mind2Web\-Live\. These results suggest that process\-reward\-guided step\-level RL substantially improves the agent’s ability to interact with dynamic web environments\. The improvement is especially pronounced on WebVoyager, likely because its web\-domain distribution is closest to our training queries\.

Notably, PRO\-CUA consistently improves over rule\-based Step\-RL even though both methods perform step\-level reinforcement learning with GRPO\. This suggests that the key advantage of PRO\-CUA comes from the reward source rather than the optimization objective\. Rule\-based rewards provide reliable supervision only when golden answers exist, whereas our PRM grading can evaluate intermediate steps in both successful and failed trajectories\. We analyze this advantage in more detail below\.

![Refer to caption](https://arxiv.org/html/2605.29119v1/x5.png)

Figure 4: Data utilization across training iterations\. PRO\-CUA consistently yields more usable step\-level training data than FBC and rule\-based Step\-RL because process rewards allow learning from both successful and failed finished trajectories, while the baselines rely on successful rollouts\.

#### Data utilization

One key advantage of PRO\-CUA is its improved data utilization: it learns from both successful and failed trajectories instead of being restricted to successful ones\. To illustrate this, we compare the data utilization of different training paradigms\. In PRO\-CUA, we apply a lightweight filtering strategy to remove redundant states\. In particular, agents may occasionally get stuck on the same page for many steps, producing highly repetitive states and actions that provide little additional training signal\. To reduce such redundancy, we retain only states from finished trajectories, where the agent terminates the task and reports an answer within the maximum step budget\. Importantly, finished trajectories are not necessarily successful: the final answer may still be incorrect\.
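The difference between filtering by completion and filtering by success can be sketched as follows\. The Trajectory record and both helper functions are hypothetical names introduced for illustration; the paper does not specify its data structures\.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    states: list    # on-policy states visited during the rollout
    answered: bool  # True if the agent terminated and reported an answer
    success: bool   # True if the reported answer was judged correct


def finished_states(trajectories, max_steps):
    """Keep states from finished trajectories, whether or not they succeeded."""
    kept = []
    for traj in trajectories:
        if traj.answered and len(traj.states) <= max_steps:
            kept.extend(traj.states)
    return kept


def successful_states(trajectories):
    """Success-only filter, as used by outcome-filtered baselines."""
    kept = []
    for traj in trajectories:
        if traj.answered and traj.success:
            kept.extend(traj.states)
    return kept


rollouts = [
    Trajectory(states=["s0", "s1"], answered=True, success=True),
    Trajectory(states=["s2", "s3", "s4"], answered=True, success=False),
    Trajectory(states=["s5"] * 4, answered=False, success=False),  # stuck, never answered
]

print(len(finished_states(rollouts, max_steps=4)), len(successful_states(rollouts)))
```

In this toy batch the finished\-trajectory filter keeps the states of the failed but completed rollout and drops only the stuck rollout, whereas the success\-only filter keeps a single trajectory\.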

[Figure 4](https://arxiv.org/html/2605.29119#S4.F4) shows the number of deployable step\-level training examples at each iteration after filtering\. PRO\-CUA consistently yields substantially more deployable training steps than FBC and rule\-based Step\-RL\. This is because both baselines rely on successful trajectories, while PRO\-CUA can learn from both successful and failed trajectories by assigning process rewards to intermediate steps\. This enables PRO\-CUA to convert a larger fraction of on\-policy interactions into trainable supervision, leading to better data utilization throughout iterative training\.

## 5 Related Works

#### Reinforcement learning for computer use agents

Filtered Behavior Cloning \(FBC\) \(He et al\., [2026](https://arxiv.org/html/2605.29119#bib.bib40); Xu et al\., [2025](https://arxiv.org/html/2605.29119#bib.bib1); Bai et al\., [2026](https://arxiv.org/html/2605.29119#bib.bib62)\) is the prevailing training paradigm for computer use agents\. FBC first collects candidate trajectories from human annotators, stronger teacher models, or the agent itself, filters them by final task success, and then trains the policy with SFT on the retained successful trajectories\. Recent work has explored RL for GUI and device\-control agents to address the limitations of static imitation learning in dynamic environments\. DigiRL \(Bai et al\., [2024](https://arxiv.org/html/2605.29119#bib.bib25)\) uses offline\-to\-online RL with advantage\-weighted regression, where the actor is updated by imitating filtered high\-advantage actions\. Digi\-Q \(Bai et al\., [2025a](https://arxiv.org/html/2605.29119#bib.bib39)\) learns a Q\-function from offline interaction data and extracts a policy by imitating the highest\-scored candidate action\. These methods go beyond standard behavior cloning, but they still rely on outcome\-derived value estimates\. For long\-horizon GUI tasks, such outcome rewards are sparse and delayed, making it difficult to assign credit to individual intermediate actions\.

This motivates recent work on step\-level RL for GUI agents\. GUI\-R1 \(Luo et al\., [2025](https://arxiv.org/html/2605.29119#bib.bib42)\) and UI\-R1 \(Lu et al\., [2026](https://arxiv.org/html/2605.29119#bib.bib63)\) apply R1\-style rule\-based RL for GUI action prediction, while GUI\-Libra \(Yang et al\., [2026](https://arxiv.org/html/2605.29119#bib.bib41)\) improves training under partially verifiable GUI rewards\. However, the reward design of those methods still depends on verifiable rules or reference actions\. Moreover, they typically construct training states from teacher\-generated trajectories, which are often mismatched with the states encountered by the target policy\. In contrast, PRO\-CUA collects states on\-policy from the current agent and uses visually grounded PRM feedback to score candidate thought\-action pairs at those states\. This preserves the credit\-assignment benefit of step\-level RL while reducing off\-policy distribution shift and avoiding the rigidity of rule\-based rewards, enabling learning from both successful and failed trajectories\.

#### Process reward models

PRMs have achieved widespread success in reasoning\-intensive domains such as mathematics \(Lightman et al\., [2024](https://arxiv.org/html/2605.29119#bib.bib51); Wang et al\., [2024](https://arxiv.org/html/2605.29119#bib.bib50); Zhang et al\., [2025b](https://arxiv.org/html/2605.29119#bib.bib52)\) and code generation \(Dai et al\., [2024](https://arxiv.org/html/2605.29119#bib.bib54)\)\. They are commonly used in two ways: test\-time scaling, where a PRM ranks sampled solutions or actions through Best\-of\-$N$ selection or reward\-guided search \(Wang et al\., [2024](https://arxiv.org/html/2605.29119#bib.bib50), [2025b](https://arxiv.org/html/2605.29119#bib.bib55); Xiong et al\., [2025](https://arxiv.org/html/2605.29119#bib.bib58)\), and training\-time supervision, where PRM scores serve as intermediate rewards for reinforcement learning \(Setlur et al\., [2025](https://arxiv.org/html/2605.29119#bib.bib57); Zhang et al\., [2025a](https://arxiv.org/html/2605.29119#bib.bib56)\)\. Computer use tasks are structurally suited for step\-level supervision because GUI navigation offers natively atomic steps \(discrete clicks/typing\), and a single erroneous action can derail the entire trajectory \(e\.g\., closed tabs or failed verifications\)\. However, test\-time scaling is particularly costly in this setting: each candidate action may require environment interaction, screenshot processing, and PRM evaluation, making search\-style methods substantially more expensive than single\-policy execution\.

Recent work has developed PRMs for web and computer\-use agents\. WebShepherd \(Chae et al\., [2026](https://arxiv.org/html/2605.29119#bib.bib61)\) trains a checklist\-based PRM for web navigation, while WebArbiter \(Zhang et al\., [2026](https://arxiv.org/html/2605.29119#bib.bib60)\) improves it with principle\-guided reasoning\. However, these works primarily use PRMs as inference\-time evaluators for action selection\. This does not internalize the reward signal into the policy and is sensitive to PRM noise, since the agent directly executes the action preferred by the reward model\. As CUARewardBench \(Lin et al\., [2025](https://arxiv.org/html/2605.29119#bib.bib59)\) suggests, even proprietary models remain primitive in step\-level judging for CUA tasks, and most open\-sourced PRMs are distilled from proprietary teachers rather than serving as reliable ground\-truth evaluators\. In contrast, PRO\-CUA uses PRM as a source of training rewards in iterative on\-policy RL\. By aggregating noisy step\-level signals across many states, sampled actions, and optimization updates, PRO\-CUA is more robust than committing to a single PRM\-selected action at inference time, as supported by our validation that Qwen3\-VL\-4B performs comparably to GPT\-5\-mini\. Compared with offline filtering methods such as WebSTAR \(He et al\., [2026](https://arxiv.org/html/2605.29119#bib.bib40)\), PRO\-CUA directly optimizes the policy with PRM scores, enabling both successful and failed on\-policy trajectories to provide supervision\.

## 6 Conclusion

We introduced PRO\-CUA, an iterative step\-level reinforcement learning framework for training computer use agents with process rewards\. PRO\-CUA decouples slow live environment interaction from policy optimization by first collecting on\-policy states from the current agent and then optimizing the policy over PRM\-graded candidate thought\-action pairs\. This design addresses two central bottlenecks in existing CUA training: the distribution shift caused by relying on teacher\-generated states, and the limited data utilization caused by rule\-based rewards that require successful trajectories or golden actions\. By using flexible PRM feedback as step\-level rewards, PRO\-CUA can learn from both successful and failed finished trajectories and convert a larger fraction of on\-policy interactions into trainable supervision\. Experiments on three online web benchmarks show that PRO\-CUA consistently improves over filtered behavior cloning and rule\-based Step\-RL, demonstrating the promise of process\-reward\-guided on\-policy training for building more capable computer use agents\.

## Limitations

While PRO\-CUA provides an effective framework for process\-reward\-guided training of computer use agents, several directions remain beyond the scope of this work\.

First, we follow the standard CUA setup where the policy observes the task instruction, action history, and the current screenshot\. We do not incorporate more advanced harnessing strategies such as long\-term memory, retrieval, or explicit context engineering\. These components are complementary to our training framework and could further improve performance on tasks that require long\-range information tracking\.

Second, our experiments only focus on web\-based computer\-use tasks\. This setting provides a realistic and widely used testbed for CUA research because web environments are dynamic, visually diverse, and require long\-horizon interaction\. However, computer use agents are broader than web navigation alone, and many practical applications involve desktop software and mobile apps\. These environments may introduce different interaction patterns, interface conventions, safety constraints, and state representations\. While PRO\-CUA is designed as a general training framework and does not rely on web\-specific assumptions beyond the environment interface, validating its effectiveness beyond web benchmarks remains an important direction for future work\.

## References

- Agent s: an open agentic framework that uses computers like a human\. In The Thirteenth International Conference on Learning Representations, Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p1.1)\.
- Anthropic \(2024\) Developing a computer use model\. Note: [https://www\.anthropic\.com/news/developing\-computer\-use](https://www.anthropic.com/news/developing-computer-use) Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p1.1)\.
- H\. Bai, A\. Taymanov, T\. Zhang, A\. Kumar, and S\. Whitehead \(2026\) WebGym: scaling training environments for visual web agents with realistic tasks\. arXiv preprint arXiv:2601\.02439\. Cited by: [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px1.p1.1)\.
- H\. Bai, Y\. Zhou, L\. E\. Li, S\. Levine, and A\. Kumar \(2025a\) Digi\-q: learning VLM q\-value functions for training device\-control agents\. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=CjfQssZtAb) Cited by: [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px1.p1.1)\.
- H\. Bai, Y\. Zhou, J\. Pan, M\. Cemri, A\. Suhr, S\. Levine, and A\. Kumar \(2024\) Digirl: training in\-the\-wild device\-control agents with autonomous reinforcement learning\. Advances in Neural Information Processing Systems 37, pp\. 12461–12495\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p2.1), [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px1.p1.1)\.
- S\. Bai, Y\. Cai, R\. Chen, K\. Chen, X\. Chen, Z\. Cheng, L\. Deng, W\. Ding, C\. Gao, C\. Ge, et al\. \(2025b\) Qwen3\-vl technical report\. arXiv preprint arXiv:2511\.21631\. Cited by: [Appendix B](https://arxiv.org/html/2605.29119#A2.p2.1), [§4\.1](https://arxiv.org/html/2605.29119#S4.SS1.SSS0.Px1.p1.1)\.
- R\. Bonatti, D\. Zhao, F\. Bonacci, D\. Dupont, S\. Abdali, Y\. Li, Y\. Lu, J\. Wagle, K\. Koishida, A\. Bucker, et al\. \(2025\) Windows agent arena: evaluating multi\-modal os agents at scale\. In Forty\-second International Conference on Machine Learning, Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p1.1)\.
- H\. Chae, S\. Kim, J\. Cho, S\. Kim, S\. Moon, G\. Hwangbo, D\. Lim, M\. Kim, Y\. Hwang, M\. Gwak, D\. Choi, M\. Kang, G\. Im, B\. Cho, H\. Kim, J\. H\. Han, T\. Kwon, M\. Kim, B\. Kwak, D\. Kang, and J\. Yeo \(2026\) Web\-shepherd: advancing PRMs for reinforcing web agents\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=G2kMroO9UV) Cited by: [§3\.3](https://arxiv.org/html/2605.29119#S3.SS3.p4.1), [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px2.p2.1)\.
- N\. Dai, Z\. Wu, R\. Zheng, Z\. Wei, W\. Shi, X\. Jin, G\. Liu, C\. Dun, L\. Huang, and L\. Yan \(2024\) Process supervision\-guided policy optimization for code generation\. arXiv preprint arXiv:2410\.17621\. Cited by: [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px2.p1.1)\.
- X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su \(2023\) Mind2web: towards a generalist agent for the web\. Advances in Neural Information Processing Systems 36, pp\. 28091–28114\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p1.1)\.
- H\. He, W\. Yao, K\. Ma, W\. Yu, Y\. Dai, H\. Zhang, Z\. Lan, and D\. Yu \(2024a\) WebVoyager: building an end\-to\-end web agent with large multimodal models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 6864–6890\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p1.1), [§1](https://arxiv.org/html/2605.29119#S1.p5.1), [§4\.1](https://arxiv.org/html/2605.29119#S4.SS1.SSS0.Px4.p1.1)\.
- H\. He, W\. Yao, K\. Ma, W\. Yu, H\. Zhang, T\. Fang, Z\. Lan, and D\. Yu \(2024b\) Openwebvoyager: building multimodal web agents via iterative real\-world exploration, feedback and optimization\. arXiv preprint arXiv:2410\.19609\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p2.1)\.
- Y\. He, P\. Chawla, Y\. Souri, S\. Som, and X\. Song \(2026\) WebSTAR: scalable data synthesis for computer use agents with step\-level filtering\. External Links: 2512\.10962, [Link](https://arxiv.org/abs/2512.10962) Cited by: [Appendix B](https://arxiv.org/html/2605.29119#A2.p1.1), [§1](https://arxiv.org/html/2605.29119#S1.p2.1), [§3\.1](https://arxiv.org/html/2605.29119#S3.SS1.p1.1), [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px2.p2.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\) Efficient memory management for large language model serving with pagedattention\. In Proceedings of the 29th symposium on operating systems principles, pp\. 611–626\. Cited by: [§3\.1](https://arxiv.org/html/2605.29119#S3.SS1.p2.1)\.
- H\. Li, Q\. Dong, J\. Chen, H\. Su, Y\. Zhou, Q\. Ai, Z\. Ye, and Y\. Liu \(2024\) Llms\-as\-judges: a comprehensive survey on llm\-based evaluation methods\. arXiv preprint arXiv:2412\.05579\. Cited by: [§4\.2](https://arxiv.org/html/2605.29119#S4.SS2.p2.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\) Let’s verify step by step\. In International Conference on Learning Representations, Vol\. 2024, pp\. 39578–39601\. Cited by: [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px2.p1.1)\.
- H\. Lin, X\. Tan, Y\. Qin, Z\. Xu, Y\. Shi, Z\. Li, G\. Li, S\. Cai, S\. Cai, C\. Fu, et al\. \(2025\) Cuarewardbench: a benchmark for evaluating reward models on computer\-using agent\. arXiv preprint arXiv:2510\.18596\. Cited by: [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px2.p2.1)\.
- Z\. Lu, Y\. Chai, Y\. Guo, X\. Yin, L\. Liu, H\. Wang, H\. Xiao, S\. Ren, P\. Zhao, G\. Liu, et al\. \(2026\) Ui\-r1: enhancing efficient action prediction of gui agents by reinforcement learning\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 40, pp\. 17608–17616\. Cited by: [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px1.p2.1)\.
- R\. Luo, L\. Wang, W\. He, L\. Chen, J\. Li, and X\. Xia \(2025\) Gui\-r1: a generalist r1\-style vision\-language action model for gui agents\. arXiv preprint arXiv:2504\.10458\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p3.1), [§3\.1](https://arxiv.org/html/2605.29119#S3.SS1.p1.1), [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px1.p2.1)\.
- OpenAI \(2025\) Introducing operator\. Note: [https://openai\.com/index/introducing\-operator/](https://openai.com/index/introducing-operator/) Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p1.1)\.
- Y\. Pan, D\. Kong, S\. Zhou, C\. Cui, Y\. Leng, B\. Jiang, H\. Liu, Y\. Shang, S\. Zhou, T\. Wu, et al\. \(2024\) Webcanvas: benchmarking web agents in online environments\. arXiv preprint arXiv:2406\.12373\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p5.1), [§4\.1](https://arxiv.org/html/2605.29119#S4.SS1.SSS0.Px4.p1.1)\.
- Y\. Qin, Y\. Ye, J\. Fang, H\. Wang, S\. Liang, S\. Tian, J\. Zhang, J\. Li, Y\. Li, S\. Huang, et al\. \(2025\) Ui\-tars: pioneering automated gui interaction with native agents\. arXiv preprint arXiv:2501\.12326\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p1.1), [§2](https://arxiv.org/html/2605.29119#S2.SS0.SSS0.Px1.p1.8)\.
- S\. Ross, G\. Gordon, and D\. Bagnell \(2011\) A reduction of imitation learning and structured prediction to no\-regret online learning\. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp\. 627–635\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p3.1)\.
- A\. Setlur, C\. Nagpal, A\. Fisch, X\. Geng, J\. Eisenstein, R\. Agarwal, A\. Agarwal, J\. Berant, and A\. Kumar \(2025\) Rewarding progress: scaling automated process verifiers for llm reasoning\. In International Conference on Learning Representations, Vol\. 2025, pp\. 60808–60838\. Cited by: [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px2.p1.1)\.
- J\. Shen, H\. Bai, L\. Zhang, Y\. Zhou, A\. Setlur, S\. Tong, D\. Caples, N\. Jiang, T\. Zhang, A\. Talwalkar, et al\. \(2025\) Thinking vs\. doing: agents that reason by scaling test\-time interaction\. arXiv preprint arXiv:2506\.07976\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p2.1), [§4\.1](https://arxiv.org/html/2605.29119#S4.SS1.SSS0.Px1.p1.1)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2025\) Hybridflow: a flexible and efficient rlhf framework\. In Proceedings of the Twentieth European Conference on Computer Systems, pp\. 1279–1297\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p2.1), [§3\.1](https://arxiv.org/html/2605.29119#S3.SS1.p2.1), [§4\.1](https://arxiv.org/html/2605.29119#S4.SS1.SSS0.Px3.p1.1)\.
- A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram, et al\. \(2025\) Openai gpt\-5 system card\. arXiv preprint arXiv:2601\.03267\. Cited by: [§4\.1](https://arxiv.org/html/2605.29119#S4.SS1.SSS0.Px4.p1.1)\.
- H\. Wang, H\. Zou, H\. Song, J\. Feng, J\. Fang, J\. Lu, L\. Liu, Q\. Luo, S\. Liang, S\. Huang, et al\. \(2025a\) Ui\-tars\-2 technical report: advancing gui agent with multi\-turn reinforcement learning\. arXiv preprint arXiv:2509\.02544\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p1.1)\.
- P\. Wang, L\. Li, Z\. Shao, R\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui \(2024\) Math\-shepherd: verify and reinforce llms step\-by\-step without human annotations\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 9426–9439\. Cited by: [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px2.p1.1)\.
- W\. Wang, Z\. Gao, L\. Chen, Z\. Chen, J\. Zhu, X\. Zhao, Y\. Liu, Y\. Cao, S\. Ye, X\. Zhu, et al\. \(2025b\) Visualprm: an effective process reward model for multimodal reasoning\. arXiv preprint arXiv:2503\.10291\. Cited by: [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px2.p1.1)\.
- X\. Wang, B\. Wang, D\. Lu, J\. Yang, T\. Xie, J\. Wang, J\. Deng, X\. Guo, Y\. Xu, C\. H\. Wu, et al\. \(2025c\) Opencua: open foundations for computer\-use agents\. arXiv preprint arXiv:2508\.09123\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p1.1)\.
- Q\. Wu, K\. Cheng, R\. Yang, C\. Zhang, J\. Yang, H\. Jiang, J\. Mu, B\. Peng, B\. Qiao, R\. Tan, et al\. \(2025\) GUI\-actor: coordinate\-free visual grounding for gui agents\. arXiv preprint arXiv:2506\.03143\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p1.1)\.
- T\. Xie, D\. Zhang, J\. Chen, X\. Li, S\. Zhao, R\. Cao, T\. J\. Hua, Z\. Cheng, D\. Shin, F\. Lei, et al\. \(2024\) Osworld: benchmarking multimodal agents for open\-ended tasks in real computer environments\. Advances in Neural Information Processing Systems 37, pp\. 52040–52094\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p1.1)\.
- T\. Xiong, X\. Hu, Y\. Chen, Y\. Liu, C\. Wu, P\. Gao, W\. Liu, J\. Luan, and S\. Zhang \(2025\) GUI\-pra: process reward agent for gui tasks\. arXiv preprint arXiv:2509\.23263\. Cited by: [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px2.p1.1)\.
- Y\. Xu, Z\. Wang, J\. Wang, D\. Lu, T\. Xie, A\. Saha, D\. Sahoo, T\. Yu, and C\. Xiong \(2025\) Aguvis: unified pure vision agents for autonomous gui interaction\. In Forty\-second International Conference on Machine Learning, Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p2.1), [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px1.p1.1)\.
- T\. Xue, W\. Qi, T\. Shi, C\. H\. Song, B\. Gou, D\. Song, H\. Sun, and Y\. Su \(2025\) An illusion of progress? assessing the current state of web agents\. arXiv preprint arXiv:2504\.01382\. Cited by: [§1](https://arxiv.org/html/2605.29119#S1.p1.1), [§1](https://arxiv.org/html/2605.29119#S1.p5.1), [§4\.1](https://arxiv.org/html/2605.29119#S4.SS1.SSS0.Px4.p1.1)\.
- R\. Yang, Q\. Wu, Z\. Wang, H\. Chen, K\. Yang, H\. Cheng, H\. Yao, B\. Peng, H\. Zhang, J\. Gao, et al\. \(2026\) GUI\-libra: training native gui agents to reason and act with action\-aware supervision and partially verifiable rl\. arXiv preprint arXiv:2602\.22190\. Cited by: [Appendix C](https://arxiv.org/html/2605.29119#A3.p1.1), [§1](https://arxiv.org/html/2605.29119#S1.p3.1), [§3\.1](https://arxiv.org/html/2605.29119#S3.SS1.p1.1), [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px1.p2.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao \(2023\) ReAct: synergizing reasoning and acting in language models\. In The Eleventh International Conference on Learning Representations, Cited by: [§2](https://arxiv.org/html/2605.29119#S2.SS0.SSS0.Px1.p1.8)\.
- Y\. Zhang, S\. Tang, Z\. Li, Z\. Han, and V\. Tresp \(2026\) WebArbiter: a generative reasoning process reward model for web agents\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=canA6Ef0RP) Cited by: [§3\.3](https://arxiv.org/html/2605.29119#S3.SS3.p4.1), [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px2.p2.1)\.
- Y\. Zhang, M\. Fan, J\. Fan, M\. Yi, Y\. Luo, J\. Tan, and G\. Li \(2025a\) Reward\-sql: boosting text\-to\-sql via stepwise reasoning and process\-supervised rewards\. arXiv preprint arXiv:2505\.04671\. Cited by: [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px2.p1.1)\.
- Z\. Zhang, C\. Zheng, Y\. Wu, B\. Zhang, R\. Lin, B\. Yu, D\. Liu, J\. Zhou, and J\. Lin \(2025b\) The lessons of developing process reward models in mathematical reasoning\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 10495–10516\. Cited by: [§5](https://arxiv.org/html/2605.29119#S5.SS0.SSS0.Px2.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing, et al\. \(2023\) Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\. Advances in neural information processing systems 36, pp\. 46595–46623\. Cited by: [§4\.2](https://arxiv.org/html/2605.29119#S4.SS2.p2.1)\.
- Y\. Zheng, R\. Zhang, J\. Zhang, Y\. Ye, and Z\. Luo \(2024\) LlamaFactory: unified efficient fine\-tuning of 100\+ language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\), pp\. 400–410\. Cited by: [§4\.1](https://arxiv.org/html/2605.29119#S4.SS1.SSS0.Px3.p1.1)\.

## Appendix A Evaluation Details

Table 3: Websites with access problems in Mind2Web queries\.

In [Table 3](https://arxiv.org/html/2605.29119#A1.T3), we detail specific websites that currently employ strict anti\-bot mechanisms, preventing automated access\. Consequently, we exclude all tasks requiring interaction with these domains from both our training and evaluation sets\. It is important to note that these accessibility issues reflect the state of these platforms at the time of our experiments \(May 2026\); future updates may resolve these barriers and render the tasks viable again\. Additionally, we omit tasks associated with Google, GitHub, and Allrecipes from the WebVoyager evaluation, as these platforms mandate rigorous human verification that falls outside the scope of our automated agent\.

## Appendix B Experimental Details

For the WebVoyager results in [Table 2](https://arxiv.org/html/2605.29119#S4.T2), we take the numbers directly from the original WebSTAR paper \(He et al\., [2026](https://arxiv.org/html/2605.29119#bib.bib40)\)\. Since Allrecipes and GitHub are deprecated as mentioned above, we compute the average performance for UI\-TARS\-1\.5\-7B, WebSTAR\-7B, and WebSTAR\-32B without those domains to ensure a fair comparison\.

Following the definition in Qwen3\-VL \(Bai et al\., [2025b](https://arxiv.org/html/2605.29119#bib.bib45)\), we present the action space of the computer use task in [Table 4](https://arxiv.org/html/2605.29119#A2.T4)\.

Table 4: Action space for computer use agents\.

## Appendix C Implementation Details of Rule\-Based Rewards

For the rule\-based Step\-RL baseline, we implement an automated verifier following GUI\-Libra \(Yang et al\., [2026](https://arxiv.org/html/2605.29119#bib.bib41)\)\. Each sampled rollout produces a structured prediction string $y$ following the required output format:

$$y = \texttt{<think>} \cdots \texttt{</think>}\,\texttt{<answer>}\, a \,\texttt{</answer>},$$

where the <answer\> block contains a structured action $a$ that can be parsed into a JSON object:

$$a = \{\texttt{action\_type},\ \texttt{description},\ \texttt{value},\ \texttt{point\_2d}\}.$$

Here, action\_type specifies the primitive GUI operation, value denotes the text input when applicable, and $\texttt{point\_2d} \in \mathbb{R}^{2}$ denotes the target screen coordinate, or none when the action does not require grounding\.

We define the rule\-based reward as a weighted combination of a format reward and an action\-correctness reward:

$$\tilde{r}(s,a) = w_{\mathrm{fmt}}\, r_{\mathrm{fmt}} + (1 - w_{\mathrm{fmt}})\, r_{\mathrm{acc}},$$

where $w_{\mathrm{fmt}} \in [0,1]$\. In our experiments, we set $w_{\mathrm{fmt}} = 0.1$, so the reward primarily reflects action correctness while still encouraging valid structured outputs\.
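As a quick check on the weighting, note that both component rewards are binary \(see the definitions in the following subsections\), so with a format weight of 0\.1 the combined reward takes only a few distinct values\. The third case assumes that an unparseable output earns no accuracy reward, which the text does not state explicitly:

$$
\tilde{r}(s,a) =
\begin{cases}
0.1 \cdot 1 + 0.9 \cdot 1 = 1.0 & \text{well-formed output, action matches the reference},\\
0.1 \cdot 1 + 0.9 \cdot 0 = 0.1 & \text{well-formed output, action does not match},\\
0.1 \cdot 0 + 0.9 \cdot 0 = 0.0 & \text{malformed output (assumed } r_{\mathrm{acc}} = 0\text{)}.
\end{cases}
$$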

#### Format reward

The format reward $r_{\mathrm{fmt}}$ checks whether the model output can be parsed correctly\. We set $r_{\mathrm{fmt}} = 1$ if the output contains valid <think\> and <answer\> tags and the <answer\> block can be parsed into the required JSON schema\. Otherwise, $r_{\mathrm{fmt}} = 0$\.

#### Accuracy reward

The accuracy reward $r_{\mathrm{acc}}$ evaluates whether the predicted action matches the reference action from a successful trajectory\. We decompose it into three components:

$$r_{\mathrm{acc}} = r_{\mathrm{type}} \cdot r_{\mathrm{value}} \cdot r_{\mathrm{ground}},$$

where $r_{\mathrm{type}}$ evaluates the action type, $r_{\mathrm{value}}$ evaluates the textual input, and $r_{\mathrm{ground}}$ evaluates coordinate grounding\.

First, the action\-type reward checks whether the predicted action type matches the reference action type:

$$r_{\mathrm{type}} = \mathbbm{1}\left[\texttt{action\_type}(a) = \texttt{action\_type}(a^{\star})\right],$$

where $a^{\star}$ denotes the reference action\.

Second, for actions involving text input, the value reward compares the predicted value $v$ with the reference value $v^{\star}$ using word\-level F1:

$$r_{\mathrm{value}} = \mathbbm{1}\left[\mathrm{F1}(v, v^{\star}) > 0.5\right].$$

For actions that do not require a textual value, we set $r_{\mathrm{value}} = 1$\.

Third, the grounding reward evaluates whether the predicted coordinate $\mathbf{u}$ falls inside the reference bounding box $b^{\star}$:

$$r_{\mathrm{ground}} = \mathbbm{1}\left[\mathbf{u} \in b^{\star}\right].$$

For actions that do not require coordinate grounding, we set $r_{\mathrm{ground}} = 1$\.

This rule\-based reward provides reliable supervision when a reference action is available\. However, it is inherently tied to successful trajectories with golden actions\. It may assign low rewards to valid alternative actions that deviate from the reference, and cannot be applied to states from failed trajectories where no verified reference action is available\. This limitation motivates our use of PRM\-based rewards in PRO\-CUA\.
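The verifier described in this appendix can be sketched in a few functions\. This is a minimal illustration under stated assumptions, not the authors’ code: the regular expression, the whitespace tokenization used for word\-level F1, the \(x0, y0, x1, y1\) bounding\-box layout, and the choice to return 0 for malformed outputs are all assumptions, and every function name is hypothetical\.

```python
import json
import re
from collections import Counter

W_FMT = 0.1  # format-reward weight reported in the text
OUTPUT_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)
REQUIRED_KEYS = {"action_type", "description", "value", "point_2d"}


def parse_action(output):
    """Return the parsed action dict, or None when the format check fails."""
    match = OUTPUT_RE.search(output)
    if match is None:
        return None
    try:
        action = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict) or not REQUIRED_KEYS.issubset(action):
        return None
    return action


def word_f1(pred, ref):
    """Word-level F1 between two strings (lowercased whitespace tokens)."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def in_bbox(point, bbox):
    """True if point = [x, y] lies inside bbox = (x0, y0, x1, y1)."""
    if not (isinstance(point, (list, tuple)) and len(point) == 2):
        return False
    if not all(isinstance(c, (int, float)) for c in point):
        return False
    x0, y0, x1, y1 = bbox
    return x0 <= point[0] <= x1 and y0 <= point[1] <= y1


def rule_based_reward(output, ref_action, ref_bbox=None):
    """Weighted format + accuracy reward against a reference action."""
    action = parse_action(output)
    if action is None:
        return 0.0  # assumption: r_fmt = 0 and no accuracy credit
    r_type = float(action["action_type"] == ref_action["action_type"])
    ref_value = ref_action.get("value")
    if ref_value:
        r_value = float(word_f1(str(action["value"] or ""), ref_value) > 0.5)
    else:
        r_value = 1.0  # no textual value required
    if ref_bbox is None:
        r_ground = 1.0  # no coordinate grounding required
    else:
        r_ground = float(in_bbox(action["point_2d"], ref_bbox))
    r_acc = r_type * r_value * r_ground
    return W_FMT * 1.0 + (1 - W_FMT) * r_acc
```

Because every branch needs a reference action \(and a reference box for grounded actions\), the function cannot score a state for which no verified reference exists, which is the limitation discussed above\.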

## Appendix D Potential Risks

This paper presents work whose goal is to advance the field of NLP\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.

## Appendix E Licenses

Both WebVoyager and OpenWebVoyager are under the Apache\-2\.0 license\. Both Online\-Mind2Web and Mind2Web\-Live are under the CC\-BY\-4\.0 license\.

## Appendix F Prompt for PRM Grading

PRM grading prompt

You are an expert evaluator grading a Computer\-Use Agent\. Your role is to evaluate whether the agent’s proposed next action is the strictly correct and necessary step to advance the given task\.

You are provided with:

1. The overarching task instruction\.
2. The history of actions taken so far\.
3. The CURRENT screenshot \(the state immediately BEFORE the proposed action\), annotated to show the proposed target of the action\.
4. The proposed Thought and Action Code\.

The screenshot is an annotated visualization of the proposed action, not a raw screenshot:

- Red marks, arrows, or points indicate where the proposed action is targeting\.
- Small index labels and overlay text are part of the annotation\.
- Use these annotations to judge whether the proposed action is correctly grounded on the UI\.
- Do not confuse the annotation itself with a native page element\.

<task\_instruction\> instruction </task\_instruction\>

<history\_actions\> history\_actions </history\_actions\>

<proposed\_action\> Step step\_index: action\_code </proposed\_action\>

Evaluation Criteria

You must evaluate the proposed action and output a binary decision: is the action CORRECT or INCORRECT?

An action is INCORRECT if it exhibits ANY of the following flaws:

- Grounding Failure: The code targets the wrong coordinates, a non\-existent element, or the wrong input field based on the provided screenshot\.
- Hallucination: The agent assumes a state that is not visually present\.
- Inefficiency/Redundancy: The action needlessly repeats a past step from the history, performs useless scrolling, or wastes a step without advancing the task\.
- Logical Progression Failure: The action executes successfully but does not move the agent closer to the final goal\.

An action is CORRECT ONLY if it is visually grounded, mathematically accurate, and actively advances the task toward completion\.

Output Format

Provide a rigorous step\-by\-step reflection\. You must perform a "mental rollout" to predict the consequences of the action before determining if it facilitates task completion\. Think about other alternatives that might result in better outcome than the proposed action, and if there exists such alternative with strictly better outcome, make the action as incorrect\. Then, output a strictly valid JSON block\.

<analysis\_process\>

1. \[Current State Assessment\]: What is currently visible on the screen? What is the immediate blocker to completing the task?
2. \[Target Verification\]: Does the proposed code correctly and accurately target the intended UI element in the screenshot?
3. \[Mental Rollout\]: If this exact code is executed, what will happen? \(e\.g\., "A dropdown menu will appear," "The page will scroll down," "The text ’shoes’ will be typed"\)\.
4. \[Task Alignment\]: Does this predicted outcome meaningfully and efficiently advance the task? Or is it a redundant/wasteful action given the history?
5. \[Final Verdict\]: Conclude whether the step is Correct or Incorrect\.

</analysis\_process\>

\`\`\`json
\{"is\_correct": boolean, "reflection": "A 1\-2 sentence summary of why the action was marked correct or incorrect\."\}
\`\`\`
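The prompt above asks the PRM to end its response with a JSON verdict, which has to be mapped to a step\-level reward\. The sketch below shows one way to do that\. The function name prm\_reward, the regular expression, and the choice to return None for unparseable responses are assumptions; the paper does not describe its parsing logic or how malformed PRM outputs are handled\.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime so this listing stays readable
JSON_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"json\s*(\{.*?\})\s*" + re.escape(FENCE),
    re.DOTALL,
)


def prm_reward(response):
    """Map a PRM response to a binary step reward, or None if it cannot be parsed."""
    match = JSON_BLOCK_RE.search(response)
    if match is None:
        return None
    try:
        verdict = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    is_correct = verdict.get("is_correct")
    if not isinstance(is_correct, bool):
        return None  # assumption: reject verdicts that are not strict booleans
    return 1.0 if is_correct else 0.0


example = (
    "5. [Final Verdict]: The step is Correct.\n"
    + FENCE + 'json\n{"is_correct": true, "reflection": "Targets the search box."}\n' + FENCE
)
print(prm_reward(example))
```

Returning None instead of a default reward lets the caller decide whether to drop the candidate or query the PRM again\.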
