# ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
Source: [https://arxiv.org/html/2605.11212](https://arxiv.org/html/2605.11212)
Amirhossein Abaskohi¹, Yuhang He², Peter West¹, Giuseppe Carenini¹, Pranit Chawla², Vibhav Vineet²
¹University of British Columbia, ²Microsoft Research

###### Abstract

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. Unlike in other domains, adding history has therefore yielded little or no performance improvement. We address this inefficiency by introducing ReVision, which trains multimodal language models on trajectories where redundant visual patches are removed using a learned patch selector that compares patch representations across consecutive screenshots while preserving the spatial structure required by the model. Across three benchmarks (OSWorld, WebTailBench, and AgentNetBench), when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by approximately 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated once redundancy is removed. This suggests that the commonly observed saturation in visual history is not due to limited usefulness of past information, but rather a consequence of inefficient token representations.

## 1 Introduction

Multimodal large language models (MLLMs) have enabled agents that interact with graphical user interfaces (GUIs) by combining visual understanding with language-based reasoning (Wang et al., 2025b; a). These computer-use agents (CUAs) operate on screenshots to generate grounded actions such as clicks, typing, and navigation for multi-step tasks. Benchmarks such as VisualWebArena (Koh et al., 2024b), OSWorld (Xie et al., 2024), and AgentNetBench (Wang et al., 2025b) demonstrate their ability to handle complex workflows across web and desktop environments. Most systems rely primarily on the current screenshot, sometimes with limited history (Sager et al., 2025), despite many tasks requiring memory of past states or actions. Scaling such long-horizon reasoning remains challenging due to the need to process extended visual trajectories under constrained context budgets (Chen et al., 2026).

![Refer to caption](https://arxiv.org/html/2605.11212v1/x1.png)

Figure 1: Token efficiency with ReVision. Left: ReVision removes redundant patches across steps, reducing token accumulation while preserving spatial structure. Right: ReVision achieves higher success rates on OSWorld and WebTailBench at a maximum of 100 steps, with lower token cost across models. Circle size indicates average steps to complete tasks.

A straightforward way to provide memory is to append past screenshots to the model context. However, this approach is highly inefficient: each additional image introduces hundreds or thousands of visual tokens, quickly exhausting the context budget. In practice, much of this cost is redundant, as consecutive GUI screenshots largely overlap (Figure 1, left). As a result, the model repeatedly processes unchanged visual content, wasting computation and limiting its ability to incorporate longer, more informative histories. Reducing this redundancy is therefore not only an efficiency optimization but a key enabler of better decision-making: by freeing up context budget, the model can incorporate longer and more informative histories, improving its ability to reason over long-horizon interactions.

To address this inefficiency, we introduce ReVision, a redundancy-aware training framework for MLLMs that operates on trajectories where redundant visual patches are removed across consecutive screenshots. At the core of ReVision is a learned patch selector that compares patch-level representations over time and filters out visually redundant regions while preserving the spatial structure required by the model. Rather than treating token reduction as a post-processing step, we train the model directly on these filtered trajectories, enabling it to reason over compact visual histories and rely on temporally distributed evidence. This design allows ReVision to reduce unnecessary visual tokens without modifying the underlying architecture, while maintaining compatibility with existing MLLMs.

This efficiency improvement directly translates to better performance. Across OSWorld (Xie et al., 2024), WebTailBench (Awadallah et al., 2025), and AgentNetBench (Wang et al., 2025b), when using 5 history screenshots with Qwen2.5-VL-7B (Bai et al., 2025b), ReVision reduces token usage by approximately 46% on average while achieving a +3% gain in success rate over the no-drop baseline. With only 3 history images, ReVision performs close to the best baselines while using roughly half the visual tokens. As the history length increases, the gains become more pronounced: with 5 or more images, ReVision consistently outperforms most baselines of the same size by at least 2% on average (Figure 1, right). ReVision shifts the efficiency frontier by enabling longer visual histories under similar compute budgets while achieving higher success rates. Furthermore, removing redundant tokens reveals that performance continues to improve with additional history rather than saturating early, indicating that the previously observed saturation is driven by inefficient visual representations rather than by the limited usefulness of history.

Our contributions are as follows: (i) we identify and quantify substantial temporal redundancy in sequential screenshots from long computer-use trajectories, showing that a large fraction of visual tokens remains unchanged across consecutive steps; (ii) we introduce ReVision, a Qwen2.5-VL-7B-based model trained with a temporal patch scorer that performs patch-level filtering across consecutive screenshots, enabling the model to reason over compact visual histories without modifying the underlying architecture; (iii) we demonstrate across long-horizon computer-use benchmarks that redundancy-aware history filtering reduces token usage, improves success rates, and delays the saturation point of visual history, revealing that longer histories are more useful than previously recognized.

## 2 Related Work

**Computer-use agents and benchmarks.** Recent progress in CUAs has been driven by multimodal models that interact with digital environments through screenshots and natural language. Early systems such as WebShop and WebArena rely on structured representations like DOM or accessibility trees (Yao et al., 2022; Zhou et al., 2023). In contrast, a growing line of work adopts a vision-first paradigm, reasoning directly over pixels. Methods such as CogAgent, AGUVIS, OpenCUA, FARA, WebSTAR, and UI-TARS operate purely on visual inputs (Hong et al., 2023; Xu et al., 2024; Wang et al., 2025b; Awadallah et al., 2025; He et al., 2026; Qin et al., 2025; Wang et al., 2025a). Other approaches, including WebVoyager, SeeAct, and ScaleCUA, incorporate both visual observations and structured signals to improve robustness in complex environments (He et al., 2024; Zheng et al., 2024; Liu et al., 2025). Benchmarks including WebArena, VisualWebArena, OSWorld, and AgentNetBench enable evaluation in long-horizon settings (Zhou et al., 2023; Koh et al., 2024b; Xie et al., 2024; Wang et al., 2025b). Despite this progress, agents typically rely on limited visual history, and increasing history length yields diminishing returns, highlighting inefficiencies in naive context scaling (Abhyankar et al., 2025; Kerboua et al., 2025). Our setting instead requires filtering visual history at patch granularity across consecutive screenshots while preserving temporally distributed evidence for long-horizon decision making.

**Visual token pruning and context compression.** Prior work reduces visual token usage either within images or across trajectory steps. Methods such as ShowUI and FocusUI prune spatially redundant regions within a single screenshot (Lin et al., 2024; Ouyang et al., 2026), while approaches like Focus-Scan-Refine and adaptive compression further remove tokens based on saliency or importance (Tong et al., 2026; Huang et al., 2026a; b). At the trajectory level, methods such as FocusAgent reduce the number of past steps included in context (Kerboua et al., 2025). However, these approaches operate either spatially within images or temporally at the step level, without explicitly modeling redundancy across consecutive screenshots, leading to repeated processing of unchanged visual regions.

**Temporal redundancy in sequential visual data.** Temporal redundancy has been extensively studied in video understanding, where consecutive frames share similar content. Prior work addresses this via keyframe selection, feature reuse, and token compression (Choudhury et al., 2024; Korbar et al., 2019; Choi et al., 2024; Tao et al., 2025; [33]). However, computer-use agents differ: screenshots evolve through agent actions and must be processed jointly with textual reasoning. Existing approaches operate at the frame or feature level within vision models, whereas our setting requires patch-level filtering in the token space of multimodal LLMs while preserving temporally distributed evidence for long-horizon decision making.

## 3 Temporal Visual Redundancy

CUAs operate over sequences of screenshots that capture the evolving state of a digital environment. At each step, the model encodes the current screenshot into a large number of visual tokens and processes them together with accumulated textual context to predict the next action. However, consecutive screenshots in a trajectory often exhibit substantial visual overlap: in most steps, only a small region of the interface changes (e.g., a button click or text update), while the majority of the screen remains unchanged. Despite this, standard multimodal models process each image independently, resulting in repeated encoding and consumption of nearly identical visual tokens across time.

To quantify this, we analyze pairs of consecutive screenshots $(I_{t-1}, I_t)$ across multiple benchmarks and measure redundancy by comparing corresponding patches. As shown in Table 1, redundancy is consistently high, with an average of 45.4% of patches unchanged across steps and over 56% in some settings. This corresponds to over 1,000 redundant patches per step on average. These findings show that a large portion of computation is spent on repeated visual content and that the context budget is dominated by redundant tokens, limiting the model's ability to incorporate useful history. This motivates ReVision, which removes redundant visual tokens across time while preserving task-relevant information.
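As a rough illustration of this kind of measurement, the sketch below counts near-identical patch pairs between two consecutive screenshots. Note that the paper's actual redundancy labels come from region-level matching with OmniParserV2 (Section 4.2); the pixel-level comparison, the 28-pixel patch size, and the tolerance threshold here are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def redundancy_ratio(prev_img: np.ndarray, curr_img: np.ndarray,
                     patch: int = 28, tol: float = 1.0) -> float:
    """Fraction of spatially corresponding patches that are (near-)unchanged
    between two same-sized screenshots of shape (H, W, C), values in 0..255.
    Patch size and tolerance are illustrative, not from the paper."""
    H, W = curr_img.shape[:2]
    H, W = H - H % patch, W - W % patch  # crop to a whole number of patches
    unchanged, total = 0, 0
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            a = prev_img[y:y + patch, x:x + patch].astype(np.float32)
            b = curr_img[y:y + patch, x:x + patch].astype(np.float32)
            # a patch counts as redundant if its mean absolute pixel
            # difference from the previous step falls below the tolerance
            if np.abs(a - b).mean() < tol:
                unchanged += 1
            total += 1
    return unchanged / total
```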

| Dataset/Benchmark | Avg. Steps/Task | Avg. # Patches/Image | Avg. Redundant Patches/Image | Avg. % Redundant Patches/Image |
|---|---|---|---|---|
| AgentNetBench (2025b) | 12.1 | 2,284 | 1,014 | 44.4% |
| OSWorld (2024) | 16.9 | 2,769 | 1,556 | 56.2% |
| WindowsAgentArena (2024) | 11.7 | 2,769 | 1,462 | 52.8% |
| WebTailBench (2025) | 22.4 | 2,769 | 1,174 | 42.4% |
| Mind2Web2 (2025) | 13.4 | 2,769 | 1,199 | 43.3% |
| VisualWebArena (2024a) | 6.8 | 2,769 | 1,373 | 49.6% |
| AndroidWorld (2024) | 7.6 | 1,196 | 456 | 38.2% |
| GUIAct (2024) | 5.5 | 1,196 | 435.3 | 36.4% |
| **Average** | 12.1 | 2,315 | 1,083 | 45.4% |

Table 1: Dataset-level visual redundancy in computer-use benchmarks. We report average steps, patches per image, and redundant patches across environments. While GUIAct and AgentNetBench are offline benchmarks with fixed steps, the others depend on agent performance. We use GPT-5.4 to ensure minimal and consistent trajectories for fair comparison. Results show that 36%–56% of visual tokens are redundant across steps, motivating ReVision.
## 4 Method

As illustrated in Figure 2, ReVision reduces redundant visual tokens across sequential GUI observations by learning to selectively retain only informative patches. Our approach consists of two main components. First, we train a lightweight three-layer MLP classifier, referred to as ReVision Token Selection (RTS), which takes as input the embeddings of two corresponding patches from consecutive screenshots and predicts whether the patch in the current image is redundant given the previous one. Second, we integrate RTS into the pipeline of an MLLM and fine-tune the model on AgentNet (Wang et al., 2025b) trajectories (with a fixed history image window) where redundant patches are removed from all but the first image. This training setup encourages the model to recover omitted visual information from earlier observations, enabling efficient use of longer visual histories.

![Refer to caption](https://arxiv.org/html/2605.11212v1/x2.png)

Figure 2: Overview of ReVision. (a) ReVision removes redundant patches by comparing corresponding tokens across consecutive screenshots, reducing visual tokens while preserving spatial alignment before passing them to the LLM. (b) The model learns to attend to relevant regions in previous images, enabling effective reasoning with reduced visual input.

### 4.1 Problem Formulation

CUAs operate over trajectories $\{(I_t, T_t, A_t)\}_{t=1}^{T}$, where $I_t$ is the screenshot, $T_t$ is the accumulated textual context (all actions and reasoning from the previous steps plus the task description), and $A_t$ is the predicted action at step $t$. Each image is encoded into visual tokens $V_t = \{v_1^t, \dots, v_N^t\}$, and the model conditions on $\{V_1, \dots, V_t\}$ and $\{T_1, \dots, T_t\}$ to generate the next reasoning and action. As $t$ increases, the number of visual tokens grows linearly, introducing substantial redundancy due to high similarity between consecutive screenshots. As illustrated in Figure 2, our goal is to construct a reduced set of tokens $V_t' \subseteq V_t$ that preserves task-relevant information while removing redundancy. To achieve this, for each step $t$, we compute a binary mask $\mathbf{m}_t \in \{0,1\}^N$ by comparing corresponding patches from $I_{t-1}$ and $I_t$, where $\mathbf{m}_t[j] = 1$ indicates that the $j$-th patch in $I_t$ is retained. The filtered tokens are then given by $V_t' = V_t[\mathbf{m}_t]$, and the model uses $\{V_1', \dots, V_t'\}$ together with $\{T_1, \dots, T_t\}$ to generate the next action and reasoning.
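A minimal sketch of the filtering step $V_t' = V_t[\mathbf{m}_t]$ follows, assuming per-patch position ids are carried alongside the tokens (as in Qwen2.5-VL's multimodal rotary position scheme) so the surviving patches keep their original spatial coordinates; the tensor shapes are assumptions for illustration.

```python
import torch

def filter_visual_tokens(v_t: torch.Tensor, pos_t: torch.Tensor,
                         m_t: torch.Tensor):
    """Apply the binary keep-mask m_t to one step's visual tokens.

    v_t:   (N, d) visual token embeddings for screenshot I_t
    pos_t: (N, 3) per-patch position ids (e.g., temporal/row/column)
    m_t:   (N,)  boolean mask; True means the patch is retained
    Returns V'_t and the matching position ids, so the LLM still sees
    where each surviving patch sat in the original screenshot.
    """
    keep = m_t.bool()
    return v_t[keep], pos_t[keep]
```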

### 4.2 ReVision Training

**Training the RTS classifier.** We train the ReVision Token Selection (RTS) module as a lightweight three-layer MLP that predicts whether a patch in the current image is redundant given the corresponding patch from the previous image. To obtain supervision, we use OmniParserV2 (Lu et al., 2024) to segment screenshots into semantic regions and match regions across consecutive images. This allows us to generate labels based on region overlap (IoU), identifying which patches correspond to unchanged content. We adopt this approach instead of relying on raw pixel or embedding similarity, as region-level matching is more robust to small visual variations (e.g., rendering noise, cursor movement) while still capturing semantic redundancy.
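The paper specifies RTS only as a three-layer MLP over the embeddings of two corresponding patches; the sketch below is one plausible instantiation, where the hidden width, the choice to concatenate the pair, and the 0.5 decision threshold are our assumptions.

```python
import torch
import torch.nn as nn

class RTSClassifier(nn.Module):
    """Per-patch redundancy scorer: given the embeddings of two spatially
    corresponding patches from consecutive screenshots, predict the
    probability that the current patch is redundant."""

    def __init__(self, patch_dim: int, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(          # three linear layers, as in the paper
            nn.Linear(2 * patch_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, f_prev: torch.Tensor, f_curr: torch.Tensor) -> torch.Tensor:
        # f_prev, f_curr: (N, d) features of corresponding patches
        logits = self.mlp(torch.cat([f_prev, f_curr], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)       # (N,) redundancy probabilities

def rts_keep_mask(rts: RTSClassifier, f_prev, f_curr, thresh: float = 0.5):
    # Keep the patches the scorer does NOT consider redundant.
    return rts(f_prev, f_curr) < thresh    # True = keep
```

During training, the binary targets would come from the IoU-based region matching described above, fit with a standard binary classification loss (our assumption; the paper does not name the loss).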

Following the formulation in Section 4.1 and the procedure outlined in Algorithm 1, we construct training data from AgentNet trajectories by applying RTS to each pair of consecutive images. At each step $t$, RTS compares each patch in $I_t$ with its corresponding patch in $I_{t-1}$ to produce a mask $\mathbf{m}_t$, which is then used to obtain the filtered tokens $V_t'$. The first image in each window is kept unchanged. This results in sequences where the model observes $\{V_1', \dots, V_t'\}$ together with the full textual context $\{T_1, \dots, T_t\}$ to generate the next reasoning and action. We then fine-tune the MLLM on these filtered trajectories, training it to operate under partially observed visual inputs where redundant patches are removed and must be implicitly recovered from previous steps.

**Training the MLLM with filtered trajectories.** RTS is applied as a plug-in token filtering mechanism on top of MLLMs, following the same formulation and pipeline used during both training and inference (Algorithm 1). At each step $t$, the model conditions on the full textual context $\{T_1, \dots, T_t\}$, while only the most recent $k$ images within a fixed history window are included from the trajectory. Each image $I_t$ is encoded into visual tokens $V_t$, and for consecutive images, RTS compares each patch in $I_t$ with its corresponding patch in $I_{t-1}$ to produce a binary mask $\mathbf{m}_t$, which is used to construct the filtered tokens $V_t' = V_t[\mathbf{m}_t]$. The first image in the window is kept unchanged, while subsequent images retain only non-redundant patches. The model then operates on $\{V_1', \dots, V_t'\}$ together with $\{T_1, \dots, T_t\}$ to generate the next reasoning and action. We construct training samples from AgentNet (Wang et al., 2025b) trajectories using this sliding window of size $k$, ensuring that all trajectories are used while only images within the history window are provided at each step. By training under the same filtered setting used at inference time, the model learns to recover missing visual information from earlier images, enabling efficient use of longer histories without requiring full observations at every step.

**Algorithm 1** ReVision token filtering (refer to Algorithm 2 for the detailed version)

**Require:** current step $N$, history window $k$, text context $\mathbf{x}$, recent images $\mathcal{I}_{N-k+1:N}$

1. $\mathcal{V}, \mathcal{P} \leftarrow [\,], [\,]$
2. **for** $t = N-k+1$ **to** $N$ **do**
   1. $(\mathbf{v}_t, \mathbf{f}_t, \mathbf{p}_t) \leftarrow \textsc{EncodeImage}(I_t)$
   2. **if** $t > N-k+1$ **then**
      1. $\mathbf{m}_t \leftarrow \textsc{ReVisionTokenSelection}(\mathbf{f}_{t-1}, \mathbf{f}_t)$
      2. $\mathbf{v}_t, \mathbf{p}_t \leftarrow \mathbf{v}_t[\mathbf{m}_t], \mathbf{p}_t[\mathbf{m}_t]$
   3. **end if**
   4. append $\mathbf{v}_t$ to $\mathcal{V}$; append $\mathbf{p}_t$ to $\mathcal{P}$
3. **end for**
4. **return** $\textsc{LLMDecoder}(\textsc{BuildMultimodalInput}(\mathbf{x}, \mathcal{V}, \mathcal{P}))$
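For concreteness, the following is a direct Python rendering of Algorithm 1; `encode_image`, `rts`, `build_input`, and `decoder` are hypothetical stand-ins for the backbone's vision encoder, the RTS keep-mask, the multimodal input builder, and the LLM decoder.

```python
def revision_filtering(images, text_ctx, encode_image, rts, build_input, decoder):
    """Filter a window of screenshots before decoding the next action.

    images:        the last k screenshots I_{N-k+1..N}
    encode_image:  img -> (v, f, p): visual tokens, RTS features, position ids
    rts:           (f_prev, f_curr) -> boolean keep-mask over current patches
    """
    V, P = [], []
    f_prev = None
    for i, img in enumerate(images):
        v, f, p = encode_image(img)
        if i > 0:                  # the first image in the window is kept whole
            m = rts(f_prev, f)     # True = patch is non-redundant
            v, p = v[m], p[m]
        V.append(v)
        P.append(p)
        f_prev = f                 # compare against *unfiltered* features
    return decoder(build_input(text_ctx, V, P))
```

Note that the comparison features of the previous image stay unfiltered even when its tokens are pruned, so every patch of $I_t$ has a counterpart to compare against.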

## 5 Experiments and Results

**Models, training, and implementation details.** We build on the OpenCUA framework (Wang et al., 2025b) and adopt its default setup. We train our model on AgentNet (Wang et al., 2025b). For each history window size $k$, we train a separate model using trajectory segments with up to $k$ images, ensuring consistency between training and inference. We use standard autoregressive next-token prediction with the same optimizer and hyperparameters as OpenCUA, and fix the decoding temperature to $T = 0.0$ to isolate the effect of token filtering. Additional training details, including metrics, are provided in Appendix D.

**Benchmarks.** We evaluate on OSWorld (Xie et al., 2024), AgentNetBench (Wang et al., 2025b), and WebTailBench (Awadallah et al., 2025), which cover long-horizon desktop and web-based tasks. OSWorld reports results under step budgets (15, 50, 100), while AgentNetBench is an offline benchmark with fixed trajectories. For WebTailBench, we execute tasks in the OSWorld environment and use an LLM-as-a-judge (GPT-4o; OpenAI, 2024) to assess step-level correctness and compute success rates.

**Baselines.** We compare ReVision against general vision-language models, including Qwen2.5-VL (Bai et al., 2025b), Qwen3-VL (Team, 2025; Bai et al., 2025a), and Kimi-VL-A3B (Team et al., 2025), as well as specialized UI agents such as UI-TARS (Wang et al., 2025a; Qin et al., 2025) and OpenCUA (Wang et al., 2025b). For all baselines, we naively pass all $k$ history images without removing any patches. To isolate the effect of our method, we include ReVision No Drop, which shares the same training setup but disables token removal at inference time. All baselines and ReVision receive the full history of reasoning and actions. All ReVision models (including the No Drop baseline) use Qwen2.5-VL-7B unless otherwise noted.

![Refer to caption](https://arxiv.org/html/2605.11212v1/x3.png)

Figure 3: Success rate versus average tokens per step across OSWorld at 100 steps, AgentNetBench, and WebTailBench at 100 steps. ReVision consistently achieves high success rates at comparable or lower token budgets, effectively shifting the efficiency frontier. Detailed numerical results are provided in the tables of Appendix A. See Figure 7 in Appendix H for results on OSWorld at 15 and 50 steps and WebTailBench at 50 steps.

![Refer to caption](https://arxiv.org/html/2605.11212v1/x4.png)

Figure 4: Success rate versus average trajectory length (number of steps) for OSWorld and WebTailBench at 100 steps. ReVision achieves higher success rates with fewer steps, indicating more efficient decision-making. Detailed numerical results are provided in the tables of Appendix A. See Figure 8 in Appendix H for results on OSWorld and WebTailBench at 50 steps.

### 5.1 Efficiency-Performance Trade-offs

We analyze the trade-off between task performance and computational cost through two views: success rate versus token usage, and success rate versus trajectory length. At each step, the agent receives the last $k$ images along with the full textual context (reasoning and actions); the history constraint applies only to images.

**Performance vs. token usage.** Figure 3 shows that increasing the number of history images improves success rate only marginally across all baselines, at a substantial increase in token usage. This behavior indicates that simply scaling visual context is inefficient, as redundant information accumulates across consecutive observations. ReVision exhibits a fundamentally different scaling behavior. Despite using significantly fewer tokens, it consistently matches or outperforms baselines across all benchmarks. In particular, ReVision improves success rate by up to 7 points on OSWorld and AgentNetBench, and up to 14 points on WebTailBench. This is achieved while using 34% fewer tokens per image, enabling 9-image histories compared to 5 images for baselines under a similar token budget. On WebTailBench, ReVision reaches nearly 50% success rate (a 42% improvement), compared to below 30% for strong baselines, despite operating under a significantly reduced token budget per image. Even in shorter-horizon settings such as AgentNetBench, ReVision achieves +6 point gains with fewer tokens, demonstrating that efficiency gains translate directly into improved decision quality. This demonstrates that a large portion of visual tokens in GUI trajectories is redundant. More importantly, reducing tokens does not degrade performance; in many cases, it improves it. We attribute this to improved context quality: by removing temporally redundant patches, the model operates on a more compact and informative representation, allowing it to focus on relevant trajectory signals. Refer to Appendix I for a case study with ReVision.

**Performance vs. trajectory length.** Figure 4 shows the relationship between success rate and trajectory length (number of steps), highlighting how efficiently models solve tasks as more context is incorporated. ReVision consistently achieves higher success rates with fewer steps. Quantitatively, on OSWorld, ReVision reduces the average trajectory length by 4 steps while achieving higher success rates. On WebTailBench, the improvements are even more pronounced, with reductions of up to 4 steps alongside gains of up to 14 points in success rate. Notably, while strong baselines require up to 33–37 steps and still remain below 40% success rate, ReVision achieves nearly 50% success rate with only ~25–30 steps. Interestingly, on OSWorld, when using 9-image histories, we observe a slight increase in the number of steps for ReVision despite strong performance. This may be due to over-reasoning with longer histories, where additional context occasionally leads to unnecessary actions. We note that this behavior is not observed at lower step budgets (SR@15 and SR@50) or on other benchmarks, suggesting it primarily affects the longer-horizon settings of this benchmark. These results demonstrate that redundancy-aware token filtering not only improves performance but also enables faster and more effective decision-making.

### 5.2 Effect of Using Different Token Selection Strategies

We compare ReVision against several token selection strategies, including random dropping, spiral dropping, pixel-based similarity, and embedding-based filtering using Qwen2.5-VL-7B and DINOv2. Naive strategies consistently degrade performance despite reducing tokens: moderate dropping lowers token usage but reduces success rate, while aggressive removal (Random 90%) causes catastrophic failure. Pixel-based similarity achieves stronger compression but underperforms due to noisy, low-level comparisons. Embedding-based methods offer a better trade-off, maintaining near-baseline performance while reducing tokens, but do not surpass the no-drop setting. In contrast, ReVision (RTS) achieves the best performance-efficiency balance, improving success rate (e.g., 73.8 vs. 72.5 on AgentNet and 34.0 vs. 32.3 on OSWorld) while reducing tokens per step by 48% on average. Although region-based filtering with OmniParserV2 at inference time yields the highest performance, it incurs prohibitively high latency (>550 ms), whereas ReVision achieves comparable gains with low latency (~22 ms). These results highlight the importance of *semantic and temporal awareness* for effective token selection (see Appendix G).
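As a point of comparison, the embedding-similarity baselines in Table 2 can be sketched as a single thresholded cosine similarity between corresponding patch embeddings; the 0.9 threshold here is illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def cosine_keep_mask(f_prev: torch.Tensor, f_curr: torch.Tensor,
                     thresh: float = 0.9) -> torch.Tensor:
    """Drop a patch when its embedding barely moved since the last step.

    f_prev, f_curr: (N, d) per-patch embeddings from a frozen encoder
    (e.g., DINOv2 or the VLM's own vision tower). Returns True = keep.
    """
    sim = F.cosine_similarity(f_prev, f_curr, dim=-1)  # (N,)
    return sim < thresh  # only patches that changed enough are retained
```

Unlike RTS, such a fixed threshold has no notion of which changes matter for the task, which is consistent with these baselines trailing the learned selector in Table 2.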

Columns 2–5 report AgentNet; columns 6–9 report OSWorld.

| Strategy | SR | Tok/Step | Vis% | Latency (ms) | SR@100 | Tok/Step | Vis% | Latency (ms) |
|---|---|---|---|---|---|---|---|---|
| No Drop | 72.5 | 15,076 | 92.9 | 0 | 32.3 | 15,071 | 92.9 | 0 |
| Random (50%) | 67.9 | 9,952 | 87.0 | 0 | 27.8 | 9,788 | 86.6 | 0 |
| Random (90%) | 18.9 | 4,234 | 92.6 | 0 | 4.6 | 4,385 | 89.4 | 0 |
| Spiral (50%) | 69.4 | 9,821 | 86.8 | 0 | 29.0 | 9,662 | 86.3 | 0 |
| Pixel | 68.4 | 8,213 | 81.6 | 18 | 28.6 | 6,125 | 80.3 | 22 |
| Qwen2.5-VL-7B (Cosine Similarity) | 72.3 | 9,424 | 85.6 | 6 | 32.1 | 7,624 | 84.1 | 8 |
| DINOv2-base (Cosine Similarity) | 71.7 | 9,682 | 86.2 | 26 | 31.4 | 7,915 | 84.9 | 31 |
| RTS + OmniParserV2 | 74.6 | 8,420 | 79.8 | 558 | 35.2 | 6,485 | 78.9 | 572 |
| **RTS (Ours)** | 73.8 | 8,975 | 83.4 | 23 | 34.0 | 6,963 | 82.9 | 22 |

Table 2: Token selection strategies. Moderate dropping reduces tokens but degrades performance, while aggressive removal (Random 90%) causes catastrophic failure. ReVision (RTS) achieves the best performance-efficiency trade-off with low latency, whereas region-based methods (OmniParserV2) improve performance at significantly higher cost. See Appendix B for qualitative analysis of different removal strategies.
### 5.3 History Scaling and Saturation in Visual Agents

![Refer to caption](https://arxiv.org/html/2605.11212v1/x5.png)

Figure 5: Saturation vs. history length. As the number of history images increases, the No Drop baseline saturates early due to rising token usage, while ReVision removes redundant tokens, delaying saturation and achieving higher performance under a similar budget.

We analyze how increasing the number of history images affects performance and token usage for the No Drop baseline and ReVision. As shown in Figure 5, performance initially improves with longer histories but eventually saturates: the No Drop baseline peaks earlier (around 7 images) and then declines, while ReVision continues to improve up to larger windows (around 11 images) before plateauing. Notably, saturation aligns more closely with total context length than with the number of images, occurring at approximately 23k tokens across benchmarks. By removing redundant visual tokens, ReVision compresses the context and delays saturation, enabling more effective use of longer histories within the same token budget. See Appendix F for analysis of how ReVision generalizes when the history window size at inference differs from training.

## 6 Ablations

### 6.1 ReVision Generalizes Across Models

We evaluate whether ReVision generalizes beyond a single backbone by comparing its performance across two model families: Qwen2.5-VL-7B and Qwen3-VL-8B. We report results for history window sizes of 3 and 5 images on WebTailBench, OSWorld, and AgentNetBench. As shown in Table 3, increasing the history size consistently improves performance for both model families while maintaining predictable scaling in token usage. Notably, the relative gains are consistent across benchmarks and architectures, indicating that ReVision generalizes effectively beyond a specific backbone. We observe that the improvement margins for Qwen3-VL-8B are slightly smaller, which may be attributed to its stronger baseline performance on computer-use tasks, leaving less room for improvement.

| ReVision Base Model | Hist. | SR@100 (WebTailBench) | Tok/Step | SR@100 (OSWorld) | Tok/Step | Avg SR (AgentNetBench) | Tok/Step |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 3 | 35.2 | 6,731 | 30.5 | 5,074 | 70.7 | 6,235 |
| Qwen2.5-VL-7B | 5 | 40.2 | 9,651 | 34.0 | 6,963 | 73.8 | 8,975 |
| Qwen3-VL-8B | 3 | 42.1 | 7,209 | 34.1 | 5,396 | 73.5 | 6,654 |
| Qwen3-VL-8B | 5 | 46.6 | 10,941 | 36.7 | 7,258 | 76.0 | 9,218 |

Table 3: Generalization of ReVision across model families using 3- and 5-image history windows. Performance consistently improves with larger history while maintaining predictable token scaling. Full results are provided in Appendix E.
### 6.2 ReVision Does Not Hurt Performance with a Single Image

To verify that training with ReVision does not degrade performance in the standard single-image setting, we evaluate our models on four GUI grounding benchmarks: OSWorld-G (Xie et al., 2025), ScreenSpot-Pro (Li et al., 2025), ScreenSpot-V2 (Wu et al., 2024), and UI-Vision (Nayak et al., 2025). In this setting, no historical images are provided, and thus ReVision's token filtering mechanism is effectively inactive at inference time. This experiment isolates whether the modified training distribution, where redundant visual tokens are removed across trajectories, negatively impacts performance when only a single screenshot is available. As shown in Table 4, ReVision achieves performance comparable to the base models across all benchmarks, with only minor variations. These results indicate that training with filtered visual inputs does not harm the model's grounding ability in the single-image regime.

| Model | OSWorld-G | ScreenSpot-Pro | ScreenSpot-V2 | UI-Vision |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 31.3 | 27.8 | 88.4 | 0.85 |
| Qwen3-VL-8B | 57.8 | 55.3 | 90.1 | 27.6 |
| ReVision (Qwen2.5-VL-7B) | 31.1 | 27.6 | 88.7 | 0.83 |
| ReVision (Qwen3-VL-8B) | 57.5 | 55.6 | 89.8 | 27.2 |

Table 4: Performance comparison in the single-image setting across four GUI grounding benchmarks. ReVision achieves comparable performance to the base models, confirming that training with temporal token filtering does not degrade single-image grounding ability.

## 7 Conclusion

In this work, we introduced ReVision, a redundancy-aware history representation for computer-use agents that reduces unnecessary visual tokens by explicitly modeling temporal redundancy across consecutive screenshots. Our results show that a substantial portion of visual context in GUI trajectories is redundant, and that removing these tokens improves both efficiency and task performance. Across multiple benchmarks, ReVision consistently achieves higher success rates while using fewer tokens and shorter trajectories, demonstrating that better history representation can improve decision-making in long-horizon computer-use tasks. More broadly, our findings suggest that the key challenge in scaling visual reasoning is not simply the number of past images, but how much useful information can be preserved within a limited context budget. Looking forward, an important direction is to extend redundancy modeling beyond time to also capture spatial redundancy within screenshots, and to better understand the mechanisms behind performance saturation in long-context multimodal reasoning.

## Ethics Statement

We acknowledge several ethical considerations associated with this work. First, while ReVision improves efficiency by filtering redundant visual tokens, there is a risk that important but subtle visual information may be removed, potentially affecting downstream decision-making in safety-critical settings. We mitigate this by preserving spatial structure and training the model under the same filtering regime used at inference, but careful validation remains necessary for high-stakes applications. Second, our approach is evaluated on existing benchmarks such as OSWorld and AgentNetBench, which may not fully capture the diversity of real-world interfaces or user populations, and could introduce biases in performance assessment. Third, increased efficiency may enable longer or more autonomous interactions, raising concerns about unintended actions or misuse in real-world systems. We emphasize that ReVision is intended as a research contribution to improve computational efficiency and should be deployed with appropriate safeguards, monitoring, and human oversight.

## Reproducibility Statement

We aim to ensure full reproducibility of our results. We will release the training and inference code, along with the implementation of the ReVision token filtering pipeline, upon acceptance. We will also release the trained model checkpoints to enable direct replication of our reported results. All experiments are conducted using publicly available base models (e.g., the Qwen-VL family) and benchmarks, including OSWorld, WebTailBench, and AgentNetBench, with clearly specified evaluation protocols. For training, we use trajectories constructed from these environments with the same token filtering applied as in inference; we will provide details and access to the processed training data or scripts to regenerate it. We include comprehensive descriptions of model configurations, training procedures, hyperparameters, and data preprocessing steps in the paper and appendix. In addition, we report all key metrics, including success rates, token usage, and latency, and provide evaluation scripts, prompts, and settings to ensure that other researchers can reliably reproduce and build upon our work.

## References

- R. Abhyankar, Q. Qi, and Y. Zhang (2025). OSWorld-Human: Benchmarking the efficiency of computer-use agents. arXiv:2506.16042.
- A. Awadallah, Y. Lara, R. Magazine, H. Mozannar, A. Nambi, Y. Pandya, A. Rajeswaran, C. Rosset, A. Taymanov, V. Vineet, et al. (2025). FARA-7B: An efficient agentic model for computer use. arXiv:2511.19663.
- S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. (2025a). Qwen3-VL technical report. arXiv:2511.21631.
- S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. (2025b). Qwen2.5-VL technical report. arXiv:2502.13923.
- R. Bonatti, D. Zhao, F. Bonacci, D. Dupont, S. Abdali, Y. Li, J. Wagle, K. Koishida, A. Bucker, L. Jang, and Z. Hui (2024). Windows Agent Arena: Evaluating multi-modal OS agents at scale.
- G. Chen, Z. Qiao, X. Chen, D. Yu, H. Xu, X. Zhao, R. Song, W. Yin, H. Yin, L. Zhang, K. Li, M. Liao, Y. Jiang, P. Xie, F. Huang, and J. Zhou (2026). IterResearch: Rethinking long-horizon agents via Markovian state reconstruction. In The Fourteenth International Conference on Learning Representations.
- W. Chen, J. Cui, J. Hu, Y. Qin, J. Fang, Y. Zhao, C. Wang, J. Liu, G. Chen, Y. Huo, Y. Yao, Y. Lin, Z. Liu, and M. Sun (2024). GUICourse: From general vision language models to versatile GUI agents.
- J. Choi, S. Lee, J. Chu, M. Choi, and H. J. Kim (2024). Vid-TLDR: Training-free token merging for light-weight video transformer. In Conference on Computer Vision and Pattern Recognition.
- R. Choudhury, G. Zhu, S. Liu, K. Niinuma, K. M. Kitani, and L. A. Jeni (2024). Don't look twice: Faster video transformers with run-length tokenization. Advances in Neural Information Processing Systems 37, pp. 28127–28149.
- B. Gou, Z. Huang, Y. Ning, Y. Gu, M. Lin, B. Yu, A. Kopanev, W. Qi, Y. Shu, J. Wu, et al. (2025). Mind2Web 2: Evaluating agentic search with agent-as-a-judge. In NeurIPS Datasets and Benchmarks Track.
- H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, and D. Yu (2024). WebVoyager: Building an end-to-end web agent with large multimodal models. arXiv:2401.13919.
- Y. He, P. Chawla, Y. Souri, S. Som, and X. Song (2026). WebSTAR: Scalable data synthesis for computer use agents with step-level filtering. arXiv:2512.10962.
- W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y. Wang, Z. Wang, Y. Dong, M. Ding, and J. Tang (2023). CogAgent: A visual language model for GUI agents. arXiv:2312.08914.
- M. Huang, B. Jiang, D. Zheng, H. Hu, K. Han, and X. Chen (2026a). PPE: Positional preservation embedding for token compression in multimodal large language models. In The Fourteenth International Conference on Learning Representations.
- Y. Huang, F. Ma, Y. Shao, J. Guo, Z. Yu, L. Cui, and Q. Tian (2026b). Nüwa: Mending the spatial integrity torn by VLM token pruning. In The Fourteenth International Conference on Learning Representations.
- I. Kerboua, S. O. Shayegan, M. Thakkar, X. H. Lù, L. Boisvert, M. Caccia, J. Espinas, A. Aussem, V. Eglin, and A. Lacoste (2025). FocusAgent: Simple yet effective ways of trimming the large context of web agents. arXiv:2510.03204.
- J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. C. Lim, P. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried (2024a). VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. arXiv:2401.13649.
- J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. Lim, P. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried (2024b). VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, pp. 881–905.
- B. Korbar, D. Tran, and L. Torresani (2019). SCSampler: Sampling salient clips from video for efficient action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6232–6242.
- K. Li, M. Ziyang, H. Lin, Z. Luo, Y. Tian, J. Ma, Z. Huang, and T. Chua (2025). ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. In Workshop on Reasoning and Planning for Large Language Models.
- K. Q. Lin, L. Li, D. Gao, Z. Yang, S. Wu, Z. Bai, W. Lei, L. Wang, and M. Z. Shou (2024). ShowUI: One vision-language-action model for GUI visual agent. arXiv:2411.17465.
- Z. Liu, J. Xie, Z. Ding, Z. Li, B. Yang, Z. Wu, X. Wang, Q. Sun, S. Liu, W. Wang, et al. (2025). ScaleCUA: Scaling open-source computer use agents with cross-platform data. arXiv:2509.15221.
- Y. Lu, J. Yang, Y. Shen, and A. Awadallah (2024). OmniParser for pure vision based GUI agent. arXiv:2408.00203.
- S. Nayak, X. Jian, K. Q. Lin, J. A. Rodriguez, M. Kalsi, R. Awal, N. Chapados, M. T. Özsu, A. Agrawal, D. Vazquez, C. Pal, P. Taslakian, S. Gella, and S. Rajeswar (2025). UI-Vision: A desktop-centric GUI benchmark for visual perception and interaction. arXiv:2503.15661.
- OpenAI (2024). GPT-4o. https://openai.com/index/hello-gpt-4o/
- M. Ouyang, K. Q. Lin, M. Z. Shou, and H. T. Ng (2026). FocusUI: Efficient UI grounding via position-preserving visual token selection. arXiv:2601.03928.
- Y. Qin, Y. Ye, J. Fang, H. Wang, S. Liang, S. Tian, J. Zhang, J. Li, Y. Li, S. Huang, et al. (2025). UI-TARS: Pioneering automated GUI interaction with native agents. arXiv:2501.12326.
- C. Rawles, S. Clinckemaillie, Y. Chang, J. Waltz, G. Lau, M. Fair, A. Li, W. Bishop, W. Li, F. Campbell-Ajala, D. Toyama, R. Berry, D. Tyamagundlu, T. Lillicrap, and O. Riva (2024). AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv:2405.14573.
- P. J. Sager, B. Meyer, P. Yan, R. von Wartburg-Kottler, L. Etaiwi, A. Enayati, G. Nobel, A. Abdulkadir, B. F. Grewe, and T. Stadelmann (2025). A comprehensive survey of agents for computer use: Foundations, challenges, and future directions. arXiv:2501.16150.
- K. Tao, C. Qin, H. You, Y. Sui, and H. Wang (2025). DyCoke: Dynamic compression of tokens for fast video large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 18992–19001.
- K. Team, A. Du, B. Yin, B. Xing, et al. (2025). Kimi-VL technical report. arXiv:2504.07491.
- Q. Team (2025). Qwen3 technical report. arXiv:2505.09388.
- [33] TimeChat-Online: 80% visual tokens are naturally redundant in streaming videos.
- E. Tong, Y. Bai, Y. Zhu, J. Jiang, and X. Liu (2026). Focus-Scan-Refine: From human visual perception to efficient visual token pruning. arXiv:2602.05809.
- H. Wang, H. Zou, H. Song, J. Feng, J. Fang, J. Lu, L. Liu, Q. Luo, S. Liang, S. Huang, et al. (2025a). UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning. arXiv:2509.02544.
- X. Wang, B. Wang, D. Lu, J. Yang, T. Xie, J. Wang, J. Deng, X. Guo, Y. Xu, C. H. Wu, et al. (2025b). OpenCUA: Open foundations for computer-use agents. arXiv:2508.09123.
- Z. Wu, Z. Wu, F. Xu, Y. Wang, Q. Sun, C. Jia, K. Cheng, Z. Ding, L. Chen, P. P. Liang, et al. (2024). OS-Atlas: A foundation action model for generalist GUI agents. arXiv:2410.23218.
- T. Xie, J. Deng, X. Li, J. Yang, H. Wu, J. Chen, W. Hu, X. Wang, Y. Xu, Z. Wang, Y. Xu, J. Wang, D. Sahoo, T. Yu, and C. Xiong (2025). Scaling computer-use grounding via user interface decomposition and synthesis. arXiv:2505.13227.
- T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu (2024). OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv:2404.07972.
- Y. Xu, Z. Wang, J. Wang, D. Lu, T. Xie, A. Saha, D. Sahoo, T. Yu, and C. Xiong (2024). AGUVIS: Unified pure vision agents for autonomous GUI interaction. arXiv:2412.04454.
- S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022). WebShop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems, Vol. 35, pp. 20744–20757.
- B. Zheng, B. Gou, J. Kil, H. Sun, and Y. Su (2024). GPT-4V(ision) is a generalist web agent, if grounded. arXiv:2401.01614.
- S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, Y. Bisk, D. Fried, U. Alon, et al. (2023). WebArena: A realistic web environment for building autonomous agents. arXiv:2307.13854.

## Appendix A: Detailed Results Tables

We provide the full numerical results corresponding to the efficiency–performance trade-off analysis in Section [5.1](https://arxiv.org/html/2605.11212#S5.SS1).

#### OSWorld\.

Table [5](https://arxiv.org/html/2605.11212#A1.SS0.SSS0.Px1) reports detailed results on OSWorld across general VLMs, UI agents, and ReVision. Across all model families, increasing the number of history images leads to consistent but diminishing improvements in success rate. Moving from one to three images provides the largest gains, while increasing to five images yields only marginal improvements despite a substantial increase in token usage (e.g., from ~4k to ~15k tokens per step). This trend suggests that short-term visual context is beneficial, but additional history quickly becomes redundant. A similar pattern is observed in trajectory length, where increasing history results in only minor reductions in the number of steps, indicating limited improvements in action efficiency.

ReVision exhibits a different scaling behavior. Without token dropping, it matches the performance of the corresponding baselines, confirming that the training procedure does not degrade performance. When redundant visual tokens are removed, ReVision consistently improves both efficiency and performance. For example, with Qwen2.5-VL-7B and a 5-image history, ReVision improves 50-step success rate from 34.5 to 35.9 while reducing tokens per step by more than 2× (15,071 → 6,963) and decreasing the average number of steps (22.7 → 19.8). Similar trends are observed across step budgets and backbones. These results indicate that a significant portion of visual tokens is redundant, and that removing them not only reduces computational cost but also leads to more efficient decision-making by enabling the model to better focus on relevant actions and observations.

| Model | History | Drop | SR@15 | Steps@15 | SR@50 | Steps@50 | SR@100 | Steps@100 | Avg Tokens/Step |
|---|---|---|---|---|---|---|---|---|---|
| **General VLMs** | | | | | | | | | |
| Qwen2.5-VL-7B | 1 | ✗ | 1.6 | 14.5 | 2.2 | 29.6 | 2.7 | 38.2 | 4,013 |
| Qwen2.5-VL-7B | 3 | ✗ | 1.9 | 13.6 | 2.5 | 28.4 | 3.2 | 37.1 | 11,176 |
| Qwen2.5-VL-7B | 5 | ✗ | 2.3 | 13.2 | 2.7 | 28.1 | 3.5 | 36.5 | 15,062 |
| Qwen2.5-VL-32B | 1 | ✗ | 2.5 | 13.8 | 3.3 | 28.5 | 3.9 | 36.7 | 4,067 |
| Qwen2.5-VL-32B | 3 | ✗ | 2.9 | 12.7 | 3.6 | 27.4 | 4.2 | 35.6 | 11,238 |
| Qwen2.5-VL-32B | 5 | ✗ | 3.0 | 12.5 | 3.8 | 27.0 | 4.4 | 35.2 | 15,087 |
| Qwen2.5-VL-72B | 1 | ✗ | 3.6 | 13.4 | 4.4 | 28.1 | 5.0 | 36.2 | 4,109 |
| Qwen2.5-VL-72B | 3 | ✗ | 3.9 | 12.6 | 4.7 | 27.3 | 5.3 | 35.4 | 11,307 |
| Qwen2.5-VL-72B | 5 | ✗ | 4.1 | 12.4 | 4.9 | 27.0 | 5.5 | 35.0 | 15,183 |
| Kimi-VL-A3B | 1 | ✗ | 8.5 | 12.6 | 9.7 | 26.5 | 10.3 | 34.1 | 4,094 |
| Kimi-VL-A3B | 3 | ✗ | 9.2 | 11.8 | 10.0 | 25.9 | 10.7 | 33.5 | 11,221 |
| Kimi-VL-A3B | 5 | ✗ | 9.5 | 11.6 | 10.2 | 25.6 | 10.9 | 33.1 | 15,136 |
| Qwen-3-VL-8B | 1 | ✗ | 29.8 | 12.8 | 34.2 | 20.7 | 33.9 | 27.3 | 4,061 |
| Qwen-3-VL-8B | 3 | ✗ | 30.2 | 11.9 | 35.1 | 19.8 | 34.8 | 26.2 | 11,284 |
| Qwen-3-VL-8B | 5 | ✗ | 30.9 | 11.6 | 35.6 | 19.2 | 35.3 | 25.8 | 15,171 |
| Qwen-3-VL-32B | 1 | ✗ | 33.2 | 10.4 | 41.1 | 21.4 | 41.0 | 28.4 | 4,133 |
| Qwen-3-VL-32B | 3 | ✗ | 39.8 | 9.3 | 42.7 | 19.6 | 42.4 | 27.8 | 11,346 |
| Qwen-3-VL-32B | 5 | ✗ | 40.3 | 9.1 | 42.8 | 18.9 | 42.5 | 27.3 | 15,229 |
| Qwen3-VL-30B-A3B | 1 | ✗ | 27.9 | 10.6 | 32.4 | 23.6 | 31.0 | 30.8 | 4,117 |
| Qwen3-VL-30B-A3B | 3 | ✗ | 35.2 | 9.5 | 39.8 | 22.6 | 38.6 | 30.0 | 11,318 |
| Qwen3-VL-30B-A3B | 5 | ✗ | 35.8 | 9.3 | 40.4 | 22.2 | 39.2 | 29.6 | 15,198 |
| **UI Agents** | | | | | | | | | |
| OpenCUA | 1 | ✗ | 23.8 | 11.2 | 27.4 | 24.6 | 26.0 | 31.8 | 3,994 |
| OpenCUA | 3 | ✗ | 30.5 | 10.1 | 34.1 | 23.8 | 32.2 | 31.1 | 11,193 |
| OpenCUA | 5 | ✗ | 30.8 | 10.0 | 34.4 | 23.6 | 32.5 | 30.9 | 15,075 |
| UI-TARS-72B-DPO | 1 | ✗ | 20.7 | 12.2 | 23.7 | 25.6 | 23.5 | 32.9 | 3,967 |
| UI-TARS-72B-DPO | 3 | ✗ | 20.9 | 11.7 | 24.4 | 25.0 | 24.2 | 32.6 | 11,159 |
| UI-TARS-72B-DPO | 5 | ✗ | 21.1 | 11.4 | 24.9 | 24.6 | 24.6 | 31.4 | 15,032 |
| UI-TARS-1.5-7B | 1 | ✗ | 24.1 | 11.4 | 27.3 | 25.2 | 27.1 | 29.1 | 4,058 |
| UI-TARS-1.5-7B | 3 | ✗ | 25.5 | 10.9 | 28.5 | 24.6 | 28.5 | 28.6 | 11,227 |
| UI-TARS-1.5-7B | 5 | ✗ | 26.0 | 10.7 | 29.3 | 24.3 | 28.8 | 28.2 | 15,141 |
| **ReVision** | | | | | | | | | |
| ReVision | 3 | ✗ | 30.6 | 9.8 | 34.1 | 22.9 | 32.2 | 29.8 | 11,078 |
| ReVision | 3 | ✓ | 28.6 | 8.9 | 32.0 | 21.0 | 30.5 | 27.9 | 5,074 |
| ReVision | 5 | ✗ | 30.7 | 9.7 | 34.5 | 22.7 | 32.3 | 29.7 | 15,071 |
| ReVision | 5 | ✓ | 32.1 | 8.3 | 35.9 | 19.8 | 34.0 | 26.4 | 6,963 |

Table 5: OSWorld results across general VLMs, UI agents, and ReVision. For ReVision, we only report settings where the training window size matches the number of history images. A checkmark (✓) indicates redundant history token dropping, while a cross (✗) indicates no dropping.

#### AgentNetBench.

Table [6](https://arxiv.org/html/2605.11212#A1.SS0.SSS0.Px2) reports results on AgentNetBench across general VLMs, UI agents, and ReVision. Similar to OSWorld, increasing the number of history images leads to consistent improvements across all models, with gains saturating beyond 3 to 5 images. This trend holds across coordinate, content, and functional success rates, indicating that additional short-term visual context is beneficial but provides diminishing returns as history grows. Compared to OSWorld, the improvements from longer history are more stable and less sensitive to the exact history length, reflecting the offline nature of AgentNetBench, where trajectories are fixed and less prone to compounding errors.

ReVision again exhibits a different scaling behavior\. Without token dropping, it matches the corresponding baselines, confirming that the training setup does not degrade performance\. When redundant visual tokens are removed, ReVision consistently improves performance across all metrics\. For example, with Qwen2\.5\-VL\-7B and a 5\-image history, ReVision improves average success rate from 72\.5 to 73\.8, with gains observed across coordinate, content, and functional metrics\. Similar improvements are observed with stronger backbones, such as Qwen\-3\-VL\-8B\. These results suggest that removing redundant visual tokens does not harm any specific aspect of the task and instead enables more effective use of the available context\. Overall, the improvements are more modest compared to OSWorld, but remain consistent, indicating that redundancy\-aware token filtering provides reliable gains even in less challenging, offline evaluation settings\.

| Model | History | Drop | Coord. SR | Content SR | Func. SR | Avg. SR |
|---|---|---|---|---|---|---|
| **General VLMs** | | | | | | |
| Qwen2.5-VL-7B | 1 | ✗ | 25.1 | 18.3 | 11.9 | 23.4 |
| Qwen2.5-VL-7B | 3 | ✗ | 26.4 | 19.5 | 12.7 | 25.0 |
| Qwen2.5-VL-7B | 5 | ✗ | 27.0 | 20.1 | 13.3 | 25.8 |
| Qwen2.5-VL-32B | 1 | ✗ | 38.5 | 28.2 | 18.4 | 35.6 |
| Qwen2.5-VL-32B | 3 | ✗ | 40.2 | 29.8 | 19.5 | 37.9 |
| Qwen2.5-VL-32B | 5 | ✗ | 41.0 | 30.5 | 20.2 | 38.7 |
| Qwen2.5-VL-72B | 1 | ✗ | 42.1 | 30.5 | 20.2 | 38.7 |
| Qwen2.5-VL-72B | 3 | ✗ | 43.6 | 31.8 | 21.0 | 40.8 |
| Qwen2.5-VL-72B | 5 | ✗ | 44.4 | 32.4 | 21.7 | 41.5 |
| Kimi-VL-A3B | 1 | ✗ | 51.6 | 37.8 | 25.9 | 47.2 |
| Kimi-VL-A3B | 3 | ✗ | 54.0 | 39.5 | 27.2 | 49.8 |
| Kimi-VL-A3B | 5 | ✗ | 55.2 | 40.4 | 28.1 | 50.9 |
| Qwen-3-VL-8B | 1 | ✗ | 73.4 | 54.9 | 40.2 | 68.7 |
| Qwen-3-VL-8B | 3 | ✗ | 74.6 | 58.8 | 43.1 | 71.9 |
| Qwen-3-VL-8B | 5 | ✗ | 75.2 | 59.3 | 43.4 | 72.2 |
| Qwen-3-VL-72B | 1 | ✗ | 79.2 | 64.5 | 48.1 | 76.9 |
| Qwen-3-VL-72B | 3 | ✗ | 82.5 | 67.4 | 50.8 | 79.3 |
| Qwen-3-VL-72B | 5 | ✗ | 83.6 | 68.3 | 51.9 | 80.2 |
| Qwen3-VL-30B-A3B | 1 | ✗ | 76.5 | 61.3 | 44.6 | 74.1 |
| Qwen3-VL-30B-A3B | 3 | ✗ | 80.4 | 65.1 | 48.2 | 77.0 |
| Qwen3-VL-30B-A3B | 5 | ✗ | 81.3 | 66.0 | 49.1 | 77.8 |
| **UI Agents** | | | | | | |
| OpenCUA | 1 | ✗ | 72.0 | 55.6 | 39.8 | 68.4 |
| OpenCUA | 3 | ✗ | 75.8 | 59.4 | 42.5 | 72.1 |
| OpenCUA | 5 | ✗ | 76.1 | 59.7 | 42.8 | 72.4 |
| UI-TARS-72B-DPO | 1 | ✗ | 66.2 | 48.1 | 34.7 | 62.3 |
| UI-TARS-72B-DPO | 3 | ✗ | 67.1 | 49.0 | 35.5 | 63.2 |
| UI-TARS-72B-DPO | 5 | ✗ | 67.8 | 49.6 | 36.0 | 63.8 |
| UI-TARS-1.5-7B | 1 | ✗ | 70.4 | 52.3 | 38.8 | 66.9 |
| UI-TARS-1.5-7B | 3 | ✗ | 71.8 | 53.8 | 40.2 | 68.4 |
| UI-TARS-1.5-7B | 5 | ✗ | 72.3 | 54.5 | 40.6 | 68.9 |
| **ReVision** | | | | | | |
| ReVision | 3 | ✗ | 75.9 | 59.3 | 42.4 | 72.0 |
| ReVision | 3 | ✓ | 74.4 | 58.1 | 41.7 | 70.7 |
| ReVision | 5 | ✗ | 76.2 | 59.8 | 42.9 | 72.5 |
| ReVision | 5 | ✓ | 77.0 | 61.0 | 44.2 | 73.8 |

Table 6: AgentNetBench results across general VLMs, UI agents, and ReVision. For ReVision, we only report settings where the training window size matches the number of history images.

#### WebTailBench.

Table [7](https://arxiv.org/html/2605.11212#A1.SS0.SSS0.Px3) reports results on WebTailBench, a long-horizon benchmark designed to evaluate performance on complex multi-step tasks. Compared to OSWorld and AgentNetBench, improvements from increasing history are limited for the standard baselines. While moving from one to three images provides moderate gains, further increasing to five images yields marginal or even diminishing returns despite a substantial increase in token usage. For example, OpenCUA improves from 25.8 to 29.5 at 50-step success rate when increasing history from one to three images, but slightly drops to 29.1 at five images while token cost continues to grow significantly. This pattern indicates that longer visual histories introduce redundancy that models struggle to utilize effectively.

ReVision demonstrates a markedly different behavior in this setting\. When redundant visual tokens are removed, performance improves consistently as the history length increases, while maintaining significantly lower token usage\. For instance, with Qwen2\.5\-VL\-7B, ReVision improves from 28\.4 to 40\.8 at 50\-step success rate as history increases from three to nine images, and from 35\.2 to 48\.9 at 100 steps\. At the same time, it reduces the average number of steps required to complete tasks \(e\.g\., from 27\.9 to 23\.6 at 5 images\), indicating more efficient decision\-making\. These results show that, unlike baselines, ReVision is able to effectively leverage longer histories by removing redundant visual information\. As a result, longer context becomes beneficial rather than detrimental, highlighting that redundancy\-aware token filtering is critical for scaling performance in long\-horizon computer\-use tasks\.

| Model | History | Drop | SR@50 | Avg. Steps@50 | SR@100 | Avg. Steps@100 | Avg Tokens/Step |
|---|---|---|---|---|---|---|---|
| **General VLMs** | | | | | | | |
| Qwen2.5-VL-7B | 1 | ✗ | 6.5 | 31.8 | 7.8 | 37.2 | 4,213 |
| Qwen2.5-VL-7B | 3 | ✗ | 7.9 | 31.1 | 9.2 | 36.5 | 12,067 |
| Qwen2.5-VL-7B | 5 | ✗ | 8.6 | 30.7 | 10.1 | 36.0 | 15,361 |
| Qwen2.5-VL-32B | 1 | ✗ | 6.2 | 32.1 | 7.1 | 37.4 | 4,287 |
| Qwen2.5-VL-32B | 3 | ✗ | 6.8 | 31.4 | 7.9 | 36.8 | 12,183 |
| Qwen2.5-VL-32B | 5 | ✗ | 7.5 | 31.0 | 8.6 | 36.1 | 15,472 |
| Qwen2.5-VL-72B | 1 | ✗ | 7.5 | 31.7 | 8.6 | 36.9 | 4,349 |
| Qwen2.5-VL-72B | 3 | ✗ | 8.2 | 31.0 | 9.4 | 36.2 | 12,244 |
| Qwen2.5-VL-72B | 5 | ✗ | 9.0 | 30.6 | 10.1 | 35.7 | 15,561 |
| Kimi-VL-A3B | 1 | ✗ | 13.2 | 31.0 | 15.0 | 36.1 | 4,318 |
| Kimi-VL-A3B | 3 | ✗ | 14.8 | 30.3 | 16.9 | 35.4 | 12,217 |
| Kimi-VL-A3B | 5 | ✗ | 15.9 | 29.9 | 18.1 | 35.0 | 15,438 |
| Qwen-3-VL-8B | 1 | ✗ | 26.0 | 26.2 | 32.8 | 35.3 | 4,266 |
| Qwen-3-VL-8B | 3 | ✗ | 29.8 | 28.7 | 36.6 | 34.5 | 12,089 |
| Qwen-3-VL-8B | 5 | ✗ | 30.4 | 28.4 | 37.1 | 34.0 | 15,392 |
| Qwen-3-VL-32B | 1 | ✗ | 34.8 | 27.5 | 41.9 | 34.3 | 4,331 |
| Qwen-3-VL-32B | 3 | ✗ | 39.6 | 29.6 | 46.8 | 33.4 | 12,214 |
| Qwen-3-VL-32B | 5 | ✗ | 40.2 | 29.2 | 47.3 | 32.9 | 15,487 |
| Qwen3-VL-30B-A3B | 1 | ✗ | 32.5 | 27.1 | 39.2 | 34.8 | 4,309 |
| Qwen3-VL-30B-A3B | 3 | ✗ | 37.4 | 29.1 | 44.3 | 34.0 | 12,173 |
| Qwen3-VL-30B-A3B | 5 | ✗ | 38.0 | 28.8 | 44.9 | 33.5 | 15,456 |
| **UI Agents** | | | | | | | |
| OpenCUA | 1 | ✗ | 25.8 | 29.7 | 32.4 | 26.4 | 4,237 |
| OpenCUA | 3 | ✗ | 29.5 | 28.9 | 36.2 | 30.2 | 12,041 |
| OpenCUA | 5 | ✗ | 29.1 | 28.6 | 35.8 | 29.8 | 15,329 |
| UI-TARS-72B-DPO | 1 | ✗ | 15.3 | 26.8 | 17.4 | 34.6 | 4,248 |
| UI-TARS-72B-DPO | 3 | ✗ | 16.9 | 28.4 | 18.2 | 33.7 | 12,107 |
| UI-TARS-72B-DPO | 5 | ✗ | 17.4 | 28.1 | 19.5 | 33.2 | 15,376 |
| UI-TARS-1.5-7B-DPO | 1 | ✗ | 16.5 | 26.6 | 20.4 | 35.0 | 4,169 |
| UI-TARS-1.5-7B-DPO | 3 | ✗ | 18.1 | 28.8 | 22.2 | 34.0 | 11,982 |
| UI-TARS-1.5-7B-DPO | 5 | ✗ | 18.7 | 28.5 | 26.8 | 33.5 | 15,214 |
| **ReVision** | | | | | | | |
| ReVision | 3 | ✗ | 29.6 | 28.1 | 36.3 | 30.3 | 11,907 |
| ReVision | 3 | ✓ | 28.4 | 25.4 | 35.2 | 29.2 | 6,731 |
| ReVision | 5 | ✗ | 29.3 | 27.9 | 36.0 | 30.0 | 15,362 |
| ReVision | 5 | ✓ | 32.8 | 23.6 | 40.2 | 27.4 | 9,651 |

Table 7: WebTailBench results on long-horizon tasks. We report success rates at 50 and 100 steps, average trajectory length, and average tokens per step. For ReVision, we report configurations with matched training window and history size, together with the corresponding no-drop controls when available.

## Appendix B: Qualitative Comparison of Token Selection Strategies

Figure [6](https://arxiv.org/html/2605.11212#A2.F6) provides a qualitative comparison of different token selection strategies on a representative trajectory step. We visualize which patches are retained across methods by overlaying the patch grid on the screenshots, where removed regions are suppressed and retained patches are highlighted.

Naive strategies such as random and spiral dropping remove patches without considering semantic consistency across time. As a result, they often discard important regions (e.g., UI elements relevant to the task) while retaining redundant background content. This leads to fragmented visual context, where critical information may be missing or only partially preserved.

Pixel-based similarity performs more structured filtering by removing patches with low pixel-level changes. However, it is sensitive to small visual variations (e.g., rendering noise, cursor movement), which causes it to either retain redundant regions or remove semantically important details. Consequently, although it achieves stronger token reduction, it often harms downstream reasoning.

Embedding-based methods (DINO and Qwen) provide improved consistency by comparing patch embeddings. These approaches better preserve semantically meaningful regions, but they still struggle to precisely localize task-relevant changes. In particular, they may retain large redundant areas or fail to capture fine-grained updates in the interface.

In contrast, ReVision explicitly models temporal redundancy between corresponding patches across consecutive images. As shown in the figure, it effectively removes unchanged regions while preserving newly updated and task-relevant content. This results in a cleaner and more focused visual representation, where the model can rely on previous images for redundant information while attending to only the necessary updates in the current step.
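To make these contrasts concrete, the sketch below implements simplified versions of the baseline strategies (random dropping, pixel-level similarity, and embedding-level cosine similarity). It is a minimal illustration rather than our implementation: the patch size, pixel tolerance, and tensor shapes are assumptions, and only the 0.95 cosine threshold matches the setting used in our ablations (Appendix G).

```python
import torch
import torch.nn.functional as F

def random_keep_mask(num_patches: int, keep_ratio: float) -> torch.Tensor:
    """Random dropping: discards patches with no regard for content."""
    keep = torch.zeros(num_patches, dtype=torch.bool)
    k = int(num_patches * keep_ratio)
    keep[torch.randperm(num_patches)[:k]] = True
    return keep

def pixel_keep_mask(prev_img: torch.Tensor, curr_img: torch.Tensor,
                    patch: int = 14, tol: float = 1e-2) -> torch.Tensor:
    """Pixel-based similarity: keep a patch only if its pixels changed.
    Sensitive to rendering noise and cursor movement."""
    # images: (C, H, W); mean absolute difference, pooled per patch
    diff = (curr_img - prev_img).abs().mean(dim=0, keepdim=True)   # (1, H, W)
    per_patch = F.avg_pool2d(diff.unsqueeze(0), kernel_size=patch, stride=patch)
    return per_patch.flatten() > tol   # True = changed, keep

def embedding_keep_mask(prev_feats: torch.Tensor, curr_feats: torch.Tensor,
                        threshold: float = 0.95) -> torch.Tensor:
    """Embedding-based similarity: keep a patch whose feature differs from
    its spatially aligned counterpart in the previous screenshot."""
    # feats: (num_patches, dim), aligned by spatial position
    sim = F.cosine_similarity(prev_feats, curr_feats, dim=-1)
    return sim < threshold   # True = not redundant, keep
```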

![Refer to caption](https://arxiv.org/html/2605.11212v1/x6.png)

Figure 6: Qualitative comparison of token selection strategies. We show patch retention across different methods for two consecutive steps ($t{-}1$ and $t$). For visualization purposes, we use lower-resolution images, resulting in fewer patches and clearer overlays. Random and spiral strategies remove patches indiscriminately, often discarding important UI elements. Pixel-based similarity removes more patches but fails to preserve fine-grained semantic details. Embedding-based methods (DINO and Qwen) improve consistency but still retain redundant regions. In contrast, ReVision selectively removes temporally redundant patches while preserving task-relevant updates, leading to a more informative and compact visual representation.

## Appendix C: Detailed ReVision Token Filtering Procedure

Algorithm [1](https://arxiv.org/html/2605.11212#alg1) in the main paper presents a simplified view of the ReVision pipeline. Here, we provide a detailed version that expands each step and clarifies the interaction between components.

#### Multimodal input construction\.

At each step $N$, the model receives the task instruction $q$, previous actions $\mathcal{A}_{1:N-1}$, and reasoning $\mathcal{R}_{1:N-1}$. These are tokenized together with $k$ image placeholders corresponding to the most recent images $\mathcal{I}_{N-k+1:N}$. This ensures that textual context is fully preserved while only a fixed number of recent images are included.

#### Visual token extraction\.

Each image $I_i$ is processed by the vision encoder to produce patch representations $\mathbf{u}_i$ and their spatial position indices $\mathbf{p}_i$. These are then passed through a projection layer to obtain visual tokens $\mathbf{v}_i$ and intermediate patch features $\mathbf{f}_i$. In the notation of Section [4.1](https://arxiv.org/html/2605.11212#S4.SS1), this corresponds to constructing $V_i = \{v_1^i, \dots, v_N^i\}$.

#### Temporal token filtering\.

For the first image in the window, all tokens are retained. For subsequent images, ReVision Token Selection (RTS) compares each patch feature in $\mathbf{f}_i$ with its corresponding patch in $\mathbf{f}_{i-1}$ and produces a binary mask $\mathbf{m}_i \in \{0,1\}^N$. This mask identifies redundant patches in $I_i$ relative to $I_{i-1}$. The filtered tokens are then obtained as $V_i' = V_i[\mathbf{m}_i]$, where only non-redundant patches are kept.

#### Position preservation\.

Importantly, token filtering does not modify positional indices. The retained tokens keep their original position IDs, i.e., $\hat{\mathbf{p}}_i = \mathbf{p}_i[\mathbf{m}_i]$. This ensures that spatial alignment is preserved across time and avoids disrupting the attention mechanism or the positional encoding (e.g., M-RoPE).

#### Decoding\.

The filtered visual tokens from all images in the window are concatenated and combined with textual tokens to form the final multimodal input. The decoder then generates the next reasoning and action conditioned on $\{V_1', \dots, V_N'\}$ and $\{T_1, \dots, T_N\}$.

Algorithm 2: ReVision visual token dropping during training and inference

1: Input: current step $N$, image window size $k$, task information $q$, previous actions $\mathcal{A}_{1:N-1}$, previous reasoning $\mathcal{R}_{1:N-1}$, recent images $\mathcal{I}_{N-k+1:N} = [I_{N-k+1}, \dots, I_N]$
2: $\mathbf{x} \leftarrow \mathrm{Tokenizer}(q, \mathcal{A}_{1:N-1}, \mathcal{R}_{1:N-1}, \langle\text{image}\rangle^k)$ ▷ insert one <image> placeholder per image at its actual step
3: $\mathcal{V} \leftarrow [\,]$, $\mathcal{P} \leftarrow [\,]$
4: for $i = N-k+1$ to $N$ do
5:  $(\mathbf{u}_i, \mathbf{p}_i) \leftarrow \mathrm{VisionEncoder}(I_i)$
6:  $(\mathbf{v}_i, \mathbf{f}_i) \leftarrow \mathrm{ProjectionLayer}(\mathbf{u}_i)$ ▷ $\mathbf{v}_i$: visual tokens, $\mathbf{f}_i$: patch features, $\mathbf{p}_i$: original position ids
7:  if $i = N-k+1$ then
8:   $\hat{\mathbf{v}}_i \leftarrow \mathbf{v}_i$
9:   $\hat{\mathbf{p}}_i \leftarrow \mathbf{p}_i$ ▷ keep the first image in the window intact
10:  else
11:   $\mathbf{m}_i \leftarrow \mathrm{ReVisionTokenSelection}(\mathbf{f}_{i-1}, \mathbf{f}_i)$ ▷ $\mathbf{m}_i$ is a binary indicator of which features/tokens to keep
12:   $\hat{\mathbf{v}}_i \leftarrow \mathbf{v}_i[\mathbf{m}_i]$
13:   $\hat{\mathbf{p}}_i \leftarrow \mathbf{p}_i[\mathbf{m}_i]$ ▷ retained tokens keep their original position ids
14:  end if
15:  append $\hat{\mathbf{v}}_i$ to $\mathcal{V}$
16:  append $\hat{\mathbf{p}}_i$ to $\mathcal{P}$
17: end for
18: $(\mathbf{e}, \mathbf{pos}) \leftarrow \mathrm{BuildMultimodalInput}(\mathbf{x}, \mathcal{V}, \mathcal{P})$ ▷ pass all previous actions and reasoning, but only the last $k$ images
19: return $\mathrm{LLMDecoder}(\mathbf{e}, \mathbf{pos})$
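For concreteness, the following Python sketch mirrors Algorithm 2. It is illustrative only: `vision_encoder`, `projection`, `rts`, `build_input`, and `decoder` are stand-ins for the components described above, and the tensor shapes in the comments are assumptions rather than the exact implementation.

```python
from typing import Callable, List
import torch

def revision_step(images: List[torch.Tensor],   # the last k screenshots, oldest first
                  text_ids: torch.Tensor,       # tokenized q, A_{1:N-1}, R_{1:N-1} with one <image> slot per image
                  vision_encoder: Callable, projection: Callable,
                  rts: Callable, build_input: Callable, decoder: Callable):
    """Encode each image, drop temporally redundant patches (the first image
    in the window is kept intact), and decode the next reasoning/action."""
    kept_tokens, kept_pos = [], []
    prev_feats = None
    for i, img in enumerate(images):
        patches, pos_ids = vision_encoder(img)   # (P, d_enc) patches, (P, ...) position ids
        tokens, feats = projection(patches)      # (P, d_model) tokens, (P, d_sel) features
        if i == 0:
            mask = torch.ones(feats.size(0), dtype=torch.bool)   # keep first image intact
        else:
            mask = rts(prev_feats, feats)        # True = non-redundant, keep
        kept_tokens.append(tokens[mask])
        kept_pos.append(pos_ids[mask])           # retained tokens keep original position ids
        prev_feats = feats                       # compare against the unfiltered previous features
    embeds, positions = build_input(text_ids, kept_tokens, kept_pos)
    return decoder(embeds, positions)
```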

## Appendix D: Training and Implementation Details

#### Training setup\.

We follow the OpenCUA framework (Wang et al., [2025b](https://arxiv.org/html/2605.11212#bib.bib38)) and use the same training configuration for fair comparison. All models are trained on a cluster of 8× NVIDIA H200 GPUs. We use standard autoregressive next-token prediction, where the loss is applied only to text tokens (reasoning and actions), while visual tokens are used as conditioning input. For each history window size $k$, we train a separate model using trajectory segments with up to $k$ images, ensuring that the training distribution matches the inference setting.
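The sketch below illustrates the loss masking described above: next-token cross-entropy is applied only at text positions, while visual tokens receive the standard ignore index. The boolean `is_text` mask is a hypothetical input marking reasoning and action tokens.

```python
import torch
import torch.nn.functional as F

def text_only_lm_loss(logits: torch.Tensor,      # (B, T, vocab) decoder outputs
                      input_ids: torch.Tensor,   # (B, T) token ids
                      is_text: torch.Tensor) -> torch.Tensor:   # (B, T) bool, False at visual tokens
    """Autoregressive loss over text tokens only; visual tokens condition
    the model but contribute no target signal."""
    labels = input_ids.clone()
    labels[~is_text] = -100                      # ignored by cross_entropy
    shift_logits = logits[:, :-1, :]             # predict token t+1 from prefix up to t
    shift_labels = labels[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=-100)
```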

#### Data construction\.

Training samples are constructed from AgentNet trajectories by applying a sliding window over interaction sequences. At each step $t$, we form a sample consisting of the most recent $k$ images $\mathcal{I}_{t-k+1:t}$ together with all previous reasoning and actions. This converts trajectories into step-level supervision and increases the number of training samples.
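A minimal sketch of this sliding-window construction is shown below; the record fields (`screenshot`, `reasoning`, `action`) are hypothetical placeholders for the actual AgentNet record format.

```python
from typing import Dict, List

def build_step_samples(trajectory: List[Dict], k: int) -> List[Dict]:
    """Convert one trajectory into per-step samples: each sample keeps the
    full text history but only the most recent k screenshots."""
    samples = []
    for t in range(len(trajectory)):
        start = max(0, t - k + 1)                # sliding window over images
        samples.append({
            "images": [s["screenshot"] for s in trajectory[start : t + 1]],
            "past_reasoning": [s["reasoning"] for s in trajectory[:t]],
            "past_actions": [s["action"] for s in trajectory[:t]],
            "target_reasoning": trajectory[t]["reasoning"],
            "target_action": trajectory[t]["action"],
        })
    return samples
```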

#### Token filtering\.

During training, we apply the same ReVision token filtering pipeline used at inference time (Algorithm [1](https://arxiv.org/html/2605.11212#alg1)). For each pair of consecutive images, RTS compares corresponding patches and produces a binary mask to remove redundant tokens. The first image in each window is kept intact, while subsequent images retain only non-redundant patches. This ensures that the model learns to operate under partially observed visual inputs.

#### Decoding and evaluation\.

During inference, we use a fixed decoding temperature of $T=0.0$ to ensure deterministic behavior and isolate the effect of token filtering. All reported results are averaged over three runs. For WebTailBench, we adopt an LLM-as-a-judge protocol using gpt-4o to evaluate step-level correctness and compute final success rates.

#### Metrics\.

We report success rate \(SR\) as the primary metric across all benchmarks\. For OSWorld and WebTailBench, we additionally report results under different step budgets to capture long\-horizon performance\. To better understand efficiency, we also analyze the relationship between success rate, token usage, and the average number of interaction steps required to complete tasks\.

## Appendix E: Generalization Across Different Models

We provide the full results for ReVision across different history window sizes (3, 5, 7, and 9 images) and model families. These results extend Table [3](https://arxiv.org/html/2605.11212#S6.T3) in the main paper.

| Base Model | Hist. | WebTailBench SR@100 | WebTailBench Tok/Step | OSWorld SR@100 | OSWorld Tok/Step | AgentNetBench Avg SR | AgentNetBench Tok/Step |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 3 | 35.2 | 6,731 | 30.5 | 5,074 | 70.7 | 6,235 |
| Qwen2.5-VL-7B | 5 | 40.2 | 9,651 | 34.0 | 6,963 | 73.8 | 8,975 |
| Qwen2.5-VL-7B | 7 | 45.6 | 13,528 | 36.7 | 8,802 | 75.9 | 11,284 |
| Qwen2.5-VL-7B | 9 | 48.9 | 15,283 | 38.4 | 10,241 | 77.1 | 13,012 |
| Qwen3-VL-8B | 3 | 42.1 | 7,209 | 34.1 | 5,396 | 73.5 | 6,654 |
| Qwen3-VL-8B | 5 | 46.6 | 10,941 | 36.7 | 7,258 | 76.0 | 9,218 |
| Qwen3-VL-8B | 7 | 49.8 | 13,462 | 40.4 | 8,994 | 78.6 | 11,759 |
| Qwen3-VL-8B | 9 | 52.4 | 16,031 | 41.6 | 10,723 | 80.2 | 13,921 |

Table 8: Generalization results of ReVision across model families and history sizes.

## Appendix F: Training on a Fixed Context Window Generalizes to Other Window Sizes

We analyze whether models trained with ReVision under a fixed visual context window generalize to different context sizes at inference time. Although ReVision is trained with a fixed number of history images, agents in practice may operate under varying context budgets. To study this, we train models with window sizes $w \in \{3, 5\}$ and evaluate them under both matched and mismatched inference windows. Table [9](https://arxiv.org/html/2605.11212#A6.T9) reports results on OSWorld (100-step success rate) and AgentNetBench (average success rate), along with average tokens per step. Performance is best when the training and inference window sizes match, but the drop under mismatched settings is modest. This suggests that ReVision-trained models are robust to changes in the number of visual context images and generalize beyond their training configuration without substantial loss in performance.

| Train Window Size | Inference Window Size | OSWorld SR@100 | OSWorld Tok/Step | AgentNetBench Avg SR | AgentNetBench Tok/Step |
|---|---|---|---|---|---|
| 3 | 3 | 30.5 | 5,074 | 70.7 | 6,235 |
| 3 | 5 | 29.1 | 6,755 | 69.7 | 8,437 |
| 5 | 3 | 29.4 | 5,394 | 69.9 | 6,959 |
| 5 | 5 | 34.0 | 6,963 | 73.8 | 8,975 |

Table 9: Cross-window generalization of ReVision on Qwen2.5-VL-7B. Models perform best when training and inference window sizes match, but maintain competitive performance under mismatched settings, indicating robustness to varying visual context sizes.

## Appendix G: Region-Level Grouping and Learned Filtering

#### Learned filtering vs\. cosine similarity\.

We first analyze the impact of replacing cosine similarity with a lightweight classifier for redundancy detection. Across both Qwen and DINOv2 embeddings, classifier-based filtering consistently improves performance (e.g., Qwen: 72.3 → 72.9 SR; DINOv2: 71.7 → 72.1), while slightly reducing token usage. This improvement stems from the classifier's ability to learn an adaptive decision boundary, in contrast to cosine similarity, which relies on a fixed threshold (set to 0.95 in our experiments). These results suggest that even simple learned filtering can better capture context-dependent redundancy.
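The two decision rules are contrasted in the sketch below. The fixed-threshold rule follows the 0.95 cosine setting above; the classifier is a simplified illustration (an MLP over concatenated pair features), not necessarily the exact parameterization used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_keep_mask(prev: torch.Tensor, curr: torch.Tensor,
                     threshold: float = 0.95) -> torch.Tensor:
    """Fixed rule: a patch is redundant if its embedding is nearly identical
    to the spatially aligned patch in the previous screenshot."""
    return F.cosine_similarity(prev, curr, dim=-1) < threshold   # True = keep

class PatchRedundancyClassifier(nn.Module):
    """Learned rule: scores aligned patch pairs with a small MLP, yielding an
    adaptive, context-dependent decision boundary instead of a fixed cutoff."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, 1))

    def forward(self, prev: torch.Tensor, curr: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([prev, curr, prev - curr], dim=-1)   # (P, 3*dim)
        keep_logit = self.mlp(pair).squeeze(-1)               # (P,)
        # Training would supervise keep_logit (e.g., with BCE);
        # at inference we threshold the keep probability.
        return torch.sigmoid(keep_logit) > 0.5                # True = keep
```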

#### Effect of region\-level grouping \(OmniParser\)\.

We next evaluate the effect of incorporating region-level structure using OmniParser. By grouping semantically coherent UI elements, OmniParser enables more structured redundancy detection, leading to improved performance and stronger token reduction (74.6 SR on AgentNetBench and 35.2 SR@100 on OSWorld, with lower visual token ratios). However, this improvement comes at a significant computational cost. OmniParser introduces substantial latency (~558–572 ms), which is an order of magnitude higher than embedding-based approaches. In contrast, ReVision (RTS) avoids region parsing at inference time and maintains low latency (~22–23 ms) while achieving competitive performance (73.8 SR). These results highlight a key trade-off: while structured region-level grouping can improve redundancy detection, it incurs high overhead that may limit practical deployment. Our approach instead leverages such structure during training, enabling efficient inference without sacrificing performance.
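The latency comparison above can be approximated with a simple timing harness such as the sketch below, which averages per-call wall-clock time of any selector callable; the iteration count and synchronization details are illustrative choices, not the exact measurement protocol.

```python
import time
import torch

@torch.no_grad()
def selector_latency_ms(selector, prev_feats: torch.Tensor,
                        curr_feats: torch.Tensor, iters: int = 100) -> float:
    """Average per-step latency (ms) of a redundancy selector."""
    selector(prev_feats, curr_feats)             # warm-up call
    if torch.cuda.is_available():
        torch.cuda.synchronize()                 # flush pending GPU work
    start = time.perf_counter()
    for _ in range(iters):
        selector(prev_feats, curr_feats)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3
```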

| Strategy | AgentNetBench SR | AgentNetBench Tok/Step | AgentNetBench Vis Tok % | AgentNetBench Latency (ms) | OSWorld SR@100 | OSWorld Tok/Step | OSWorld Vis Tok % | OSWorld Latency (ms) |
|---|---|---|---|---|---|---|---|---|
| No Drop | 72.5 | 15,076 | 92.9 | 0 | 32.3 | 15,071 | 92.9 | 0 |
| Qwen2.5-VL-7B (Cosine) | 72.3 | 9,424 | 85.6 | 6 | 32.1 | 7,624 | 84.1 | 8 |
| Qwen2.5-VL-7B (Classifier) | 72.9 | 9,301 | 85.2 | 10 | 32.6 | 7,480 | 83.8 | 13 |
| DINOv2-base (Cosine) | 71.7 | 9,682 | 86.2 | 26 | 31.4 | 7,915 | 84.9 | 31 |
| DINOv2-base (Classifier) | 72.1 | 9,520 | 85.9 | 29 | 31.9 | 7,768 | 84.5 | 36 |
| RTS + OmniParser | 74.6 | 8,420 | 79.8 | 558 | 35.2 | 6,485 | 78.9 | 572 |
| RTS (Ours) | 73.8 | 8,975 | 83.4 | 23 | 34.0 | 6,963 | 82.9 | 22 |

Table 10: Ablation on learned filtering and region-level grouping. We compare cosine similarity and classifier-based filtering across embedding backbones, as well as region-based grouping with OmniParser. Classifier-based methods provide consistent improvements over cosine similarity by learning adaptive thresholds. Incorporating OmniParser further improves performance and reduces visual redundancy, but introduces substantial latency overhead. ReVision (RTS) achieves a strong balance, maintaining competitive performance with significantly lower inference cost.
## Appendix H: Additional Efficiency Results

Figures [7](https://arxiv.org/html/2605.11212#A8.F7) and [8](https://arxiv.org/html/2605.11212#A8.F8) provide additional views of the efficiency-performance trade-offs across benchmarks and step budgets. Consistent with the main results, ReVision achieves a more favorable trade-off, reaching higher success rates with substantially fewer tokens per step.

![Refer to caption](https://arxiv.org/html/2605.11212v1/x7.png)

Figure 7: Success rate versus average tokens per step for OSWorld at 15 and 50 steps and WebTailBench at 50 steps. Detailed numerical results are provided in Tables 5–7 in Appendix [A](https://arxiv.org/html/2605.11212#A1).

![Refer to caption](https://arxiv.org/html/2605.11212v1/x8.png)

Figure 8: Success rate versus average trajectory length (number of steps) for WebTailBench and OSWorld at 50 steps. Detailed numerical results are provided in the corresponding tables in Appendix [A](https://arxiv.org/html/2605.11212#A1).

## Appendix I: Case Study

To provide a concrete illustration of how ReVision improves efficiency during sequential decision-making, we present a step-by-step case study on OSWorld in Table 11. The task requires removing tracking data from Amazon by navigating browser settings and disabling privacy-related options. As shown in the table, consecutive screenshots contain substantial visual overlap, particularly in static UI regions such as navigation bars, menus, and background areas. Without token filtering, all patches are retained across steps, leading to significant redundancy in the visual context.

ReVision addresses this by selectively removing visually redundant patches between consecutive frames while preserving regions that are relevant to the current action\. This results in a progressively more compact representation, as seen in the “After Token Removal” column, where large portions of unchanged interface elements are omitted\. Importantly, despite this reduction, the agent continues to produce correct actions and coherent reasoning at each step\. For example, the model successfully identifies and navigates to browser settings, selects the appropriate privacy options, and proceeds to disable ad personalization, all while operating on a reduced visual context\.

This example highlights two key advantages of ReVision: \(i\) it effectively eliminates temporal redundancy without disrupting spatial alignment or task\-relevant information, and \(ii\) it enables the model to rely on temporally distributed evidence, where previously observed content can be implicitly recalled rather than repeatedly re\-encoded\. Overall, the case study demonstrates that substantial token savings can be achieved in realistic multi\-step interactions without sacrificing decision quality\.

Table 11: Case study of ReVision on OSWorld.

![[Uncaptioned image]](https://arxiv.org/html/2605.11212v1/x9.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.11212v1/x10.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.11212v1/x11.png)
![[Uncaptioned image]](https://arxiv.org/html/2605.11212v1/x12.png)
