Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

arXiv cs.CL Papers

Summary

This academic paper proposes a unified architecture-lifecycle framework for securing computer-use agents (CUAs) as they transition from benchmarks to real-world software environments. It analyzes reliability challenges across the perception, decision, and execution layers and across the creation, deployment, operation, and maintenance stages.

arXiv:2605.07110v1 Announce Type: new Abstract: Computer-use agents (CUAs) are moving from bounded benchmarks toward real software environments, where they operate browsers, desktops, mobile applications, filesystems, terminals, and tool backends. In such settings, reliability is no longer captured by task success alone: perception errors, planning drift, memory use, tool mediation, permission scope, and runtime oversight jointly determine whether agent actions remain aligned with user intent. Existing surveys organize the CUA landscape by methods, platforms, benchmarks, or security threats, but less explicitly connect capability formation, authority exposure, failure manifestation, and control placement. To address this gap, the article develops an architecture-lifecycle framework for deployment-grounded reliability in CUAs. The architectural view analyzes Perception, Decision, and Execution as coupled layers that transform software observations into authority-bearing actions. The lifecycle view examines Creation, Deployment, Operation, and Maintenance as stages in which priors are learned, tools and permissions are bound, runtime trajectories are stressed, and assurance must be preserved under drift. Using this lens, the analysis synthesizes representative systems, benchmarks, and security/privacy studies; distinguishes where failures become visible from where their enabling conditions are introduced; and maps recurring intervention surfaces for control, oversight, and assurance. OpenClaw is used only as a public motivating example of an open deployment pattern, not as a verified internal case study. The conclusion highlights open challenges in controllable grounding, long-horizon constraint preservation, safe authority binding, mixed-trust runtime defense, privacy-preserving memory, and continual assurance.

Cached at: 05/11/26, 06:48 AM

# Securing Computer-Use Agents: A Unified Architecture–Lifecycle Framework for Deployment-Grounded Reliability
Source: [https://arxiv.org/html/2605.07110](https://arxiv.org/html/2605.07110)
Zejian Chen1, Zhanyuan Liu1, Chaozhuo Li1, Mengxiang Han2, Songyang Liu1, Litian Zhang1, Feng Gao2, Yiming Hei3, Xi Zhang1

###### Abstract

Computer\-use agents \(CUAs\) are moving from bounded benchmarks toward real software environments, where they operate browsers, desktops, mobile applications, filesystems, terminals, and tool backends\. In such settings, reliability is no longer captured by task success alone: perception errors, planning drift, memory use, tool mediation, permission scope, and runtime oversight jointly determine whether agent actions remain aligned with user intent\. Existing surveys organize the CUA landscape by methods, platforms, benchmarks, or security threats, but less explicitly connect capability formation, authority exposure, failure manifestation, and control placement\. To address this gap, the article develops an architecture–lifecycle framework for deployment\-grounded reliability in CUAs\. The architectural view analyzes *Perception*, *Decision*, and *Execution* as coupled layers that transform software observations into authority\-bearing actions\. The lifecycle view examines *Creation*, *Deployment*, *Operation*, and *Maintenance* as stages in which priors are learned, tools and permissions are bound, runtime trajectories are stressed, and assurance must be preserved under drift\. Using this lens, the analysis synthesizes representative systems, benchmarks, and security/privacy studies; distinguishes where failures become visible from where their enabling conditions are introduced; and maps recurring intervention surfaces for control, oversight, and assurance\. OpenClaw is used only as a public motivating example of an open deployment pattern, not as a verified internal case study\. The conclusion highlights open challenges in controllable grounding, long\-horizon constraint preservation, safe authority binding, mixed\-trust runtime defense, privacy\-preserving memory, and continual assurance\.

## I Introduction

Computer\-use agents \(CUAs\) are increasingly studied as agents that can operate real software environments rather than answer prompts alone\. Recent systems already span browsers, desktops, mobile applications, filesystems, terminals, and tool backends\[[136](https://arxiv.org/html/2605.07110#bib.bib98),[39](https://arxiv.org/html/2605.07110#bib.bib99),[148](https://arxiv.org/html/2605.07110#bib.bib128),[186](https://arxiv.org/html/2605.07110#bib.bib132),[149](https://arxiv.org/html/2605.07110#bib.bib133),[5](https://arxiv.org/html/2605.07110#bib.bib119),[41](https://arxiv.org/html/2605.07110#bib.bib125)\]\. This transition changes the meaning of reliability\. In a dialogue\-only setting, a local mistake often remains textual\. In a live software environment, the same mistake can become a deleted file, a leaked secret, an unintended transfer, or a persistent misconfiguration\. Reliable computer use is therefore a systems problem as much as a modeling problem, because perception, planning, execution authority, memory, tool use, and oversight interact under live software conditions\.

The recent expansion of CUA deployment settings makes an integrative survey timely\. Benchmarks have moved from bounded website tasks toward visually grounded, enterprise, personalized, and open\-environment settings\[[25](https://arxiv.org/html/2605.07110#bib.bib19),[187](https://arxiv.org/html/2605.07110#bib.bib18),[62](https://arxiv.org/html/2605.07110#bib.bib36),[28](https://arxiv.org/html/2605.07110#bib.bib38),[9](https://arxiv.org/html/2605.07110#bib.bib180),[160](https://arxiv.org/html/2605.07110#bib.bib135),[177](https://arxiv.org/html/2605.07110#bib.bib150),[16](https://arxiv.org/html/2605.07110#bib.bib151),[101](https://arxiv.org/html/2605.07110#bib.bib138),[146](https://arxiv.org/html/2605.07110#bib.bib21)\]\. At the same time, system\-building and evaluation work has diversified across grounding, memory, long\-horizon planning, tool\-augmented execution, safety evaluation, and open\-deployment stacks\[[136](https://arxiv.org/html/2605.07110#bib.bib98),[39](https://arxiv.org/html/2605.07110#bib.bib99),[130](https://arxiv.org/html/2605.07110#bib.bib115),[158](https://arxiv.org/html/2605.07110#bib.bib114),[175](https://arxiv.org/html/2605.07110#bib.bib116)\]\. The difficulty is no longer only the lack of evidence about CUA capability\. It is also the lack of a common coordinate system for interpreting how capability is formed, how it is bound to operational authority, where failures first become visible, and where controls can intervene\.

Existing surveys clarify important slices of this landscape\. Current overviews summarize method families, platform coverage, benchmark inventories, and high\-level ecosystem structure\[[100](https://arxiv.org/html/2605.07110#bib.bib79),[50](https://arxiv.org/html/2605.07110#bib.bib80),[125](https://arxiv.org/html/2605.07110#bib.bib81),[133](https://arxiv.org/html/2605.07110#bib.bib109),[36](https://arxiv.org/html/2605.07110#bib.bib108),[115](https://arxiv.org/html/2605.07110#bib.bib82)\]\. More focused surveys examine reinforcement\-learning enhancement, phone automation, WebAgents, or safety and security threats\[[71](https://arxiv.org/html/2605.07110#bib.bib105),[82](https://arxiv.org/html/2605.07110#bib.bib106),[102](https://arxiv.org/html/2605.07110#bib.bib107),[12](https://arxiv.org/html/2605.07110#bib.bib104)\]\. These works are valuable, but they usually organize the field by method, platform, benchmark, or threat category\. Less explicit is a deployment\-grounded account of how learned capabilities become authority\-bearing actions, how similar runtime failures can originate from different upstream conditions, and how security, privacy, and oversight mechanisms should be placed across both system architecture and lifecycle stage\.

To address that gap, the article develops an analytical framework for deployment\-grounded reliability in general\-purpose CUAs across web, desktop, mobile, and cross\-application settings\. The framework combines two views\. The architectural view analyzes CUAs through three coupled layers: *Perception*, which reconstructs actionable state from software observations; *Decision*, which preserves task\-conditioned intent under uncertainty and long\-horizon pressure; and *Execution*, which converts decisions into authority\-bearing operations\. The lifecycle view analyzes four stages: *Creation*, where priors, grounding habits, objectives, and action abstractions are formed; *Deployment*, where tools, sessions, permissions, and observation channels are bound to the agent; *Operation*, where active trajectories are stressed by mixed\-trust inputs, partial observability, and asynchronous change; and *Maintenance*, where models, interfaces, tools, and ecosystems drift after release\. Together, these two views provide a diagnostic lens for distinguishing where a failure manifests from where its enabling condition was introduced\.

The scope here is deployment\-grounded reliability rather than CUA capability alone\. The analysis focuses on systems, benchmarks, deployment patterns, and security or privacy studies that materially inform how CUAs behave once they interact with real software surfaces and operational authority\. OpenClaw is used only as a public motivating example of an open deployment pattern, not as a verified internal case study\[[105](https://arxiv.org/html/2605.07110#bib.bib96),[95](https://arxiv.org/html/2605.07110#bib.bib97)\]\. More generally, the analysis targets the broader class of CUAs that combine interface interaction, tooling, memory, and authority\-bearing execution under real\-world deployment conditions\.

The resulting contributions are:

- A deployment\-grounded reliability scope for CUAs\. CUA reliability is characterized as a joint property of software observation, task\-conditioned decision making, authority\-bearing execution, memory, tool use, and oversight, rather than as benchmark task success alone\.
- An architecture–lifecycle framework\. The literature is organized through a tri\-layer architecture of *Perception*, *Decision*, and *Execution*, together with a four\-stage lifecycle of *Creation*, *Deployment*, *Operation*, and *Maintenance*\.
- A failure\-origin analysis\. The analysis distinguishes where failures become visible from where their enabling conditions are introduced, relating grounding errors, planning drift, over\-privilege, memory misuse, mixed\-trust inputs, privacy leakage, and ecosystem drift within one diagnostic view\.
- A control and governance map\. Recurring intervention surfaces are mapped to architectural layers and lifecycle stages, including data and reward design, permission scoping, provenance\-aware tool mediation, runtime verification, human escalation, rollback, and continual assurance\.

#### Scope, Evidence Base, and Non\-Goals

The analysis is organizational and diagnostic\. It does not attempt to verify OpenClaw internals, introduce a new benchmark, or provide a formal safety proof\. Its goal is to synthesize existing CUA literature and public deployment patterns through a common analytical frame\. The coverage is weighted toward recent CUA work, while retaining earlier studies when they contribute concepts that remain necessary for computer\-use reasoning, grounding, execution, or oversight\. The selection prioritizes representative systems, benchmarks, surveys, deployment\-oriented stacks, and security or privacy studies that materially inform deployed CUA behavior\[[133](https://arxiv.org/html/2605.07110#bib.bib109),[124](https://arxiv.org/html/2605.07110#bib.bib110),[59](https://arxiv.org/html/2605.07110#bib.bib113),[188](https://arxiv.org/html/2605.07110#bib.bib137),[160](https://arxiv.org/html/2605.07110#bib.bib135),[181](https://arxiv.org/html/2605.07110#bib.bib164),[163](https://arxiv.org/html/2605.07110#bib.bib155),[80](https://arxiv.org/html/2605.07110#bib.bib193)\]\. Adjacent agent\-security, tool\-use, or governance literature is included only when it clarifies a CUA\-relevant mechanism, and is treated as adjacent evidence rather than direct CUA evidence\.

TABLE I: Heuristic positioning map for the proposed architecture–lifecycle view within recent CUA literature\. The table foregrounds the analytical gap addressed here rather than serving as an exhaustive related\-work inventory\.

Table [I](https://arxiv.org/html/2605.07110#S1.T1) situates the proposed architecture–lifecycle view within the literature; it is not meant to rank prior work\. It is intentionally selective: the goal is to indicate the analytical scope here rather than to compress the full CUA literature into a single table\. Many prior studies already address important aspects of safety, lifecycle, and governance\. The contribution here is not to replace those lines of work, but to relate them through a common analytical frame centered on capability formation, authority exposure, failure emergence, and control placement\.

The remainder develops that frame in sequence\. Section [II](https://arxiv.org/html/2605.07110#S2) defines the problem space and its main design axes; Section [III](https://arxiv.org/html/2605.07110#S3) introduces the analytical framework; Sections [IV](https://arxiv.org/html/2605.07110#S4) and [V](https://arxiv.org/html/2605.07110#S5) develop the architectural and lifecycle dimensions, respectively; Section [VI](https://arxiv.org/html/2605.07110#S6) analyzes security and privacy through that joint perspective; and Section [VII](https://arxiv.org/html/2605.07110#S7) draws out the resulting governance and control implications\.

## II Problem Definition and Design Axes

Before discussing architecture or lifecycle, this section fixes the control problem that a computer\-use agent \(CUA\) is meant to solve\. Given a user goal $g$, a constraint set $c$, and a stream of partial observations from a live software environment $\mathcal{E}$, the agent must maintain task\-relevant state and emit actions through an authority\-bearing interface so that the task progresses without violating user intent or system permissions\. The object of study is therefore not interface understanding in isolation\. It is control in partially observable, dynamically changing, permission\-constrained software environments\. The section formalizes that setting and isolates the two design axes that matter most for the following analysis: how the agent constructs executable state from software observations, and how it binds intent to authority\-bearing action\.

### II\-A Problem Formulation: Software Interaction as Partially Observable Control

A CUA can be written abstractly as a policy $\pi(a_t \mid h_t, o_t, g, c)$ operating over a software\-facing environment $\mathcal{E}$\. At step $t$, the agent receives a partial observation $o_t$, updates an internal state $h_t$, and emits an action $a_t \in \mathcal{A}$ through the environment’s available control surface\. The output is therefore not text alone, but an action with real operational effect\. The objective is to reach a task\-complete state while preserving user\-specified constraints, respecting authority boundaries, and remaining recoverable when uncertainty is high\.
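The formulation above can be read as a simple rollout loop. The following is a minimal Python sketch under stated assumptions, not the paper's implementation: `run_episode`, `AgentState`, and `toy_policy` are hypothetical names, the internal state is reduced to a list of (observation, action) pairs, and the constraint set is simplified to a list of forbidden action names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class AgentState:
    """Internal state h_t, reduced here to a history of (o_t, a_t) pairs."""
    history: List[Tuple[str, str]] = field(default_factory=list)


# Hypothetical policy type: maps (h_t, o_t, g, c) to an action, or None to stop.
Policy = Callable[[AgentState, str, str, List[str]], Optional[str]]


def run_episode(policy: Policy,
                observe: Callable[[], str],
                act: Callable[[str], None],
                goal: str,
                constraints: List[str],
                max_steps: int = 10) -> AgentState:
    """Roll out pi(a_t | h_t, o_t, g, c) until the policy stops or the budget ends."""
    state = AgentState()
    for _ in range(max_steps):
        o_t = observe()                              # partial observation of the environment
        a_t = policy(state, o_t, goal, constraints)
        if a_t is None:                              # policy declares the task complete
            break
        if a_t in constraints:                       # toy check: c is a list of forbidden actions
            raise PermissionError(f"action {a_t!r} violates a user constraint")
        act(a_t)                                     # authority-bearing effect on the environment
        state.history.append((o_t, a_t))             # update h_t
    return state


# Toy environment (an assumption for illustration): a counter the agent increments.
env = {"count": 0}


def toy_policy(h: AgentState, o: str, g: str, c: List[str]) -> Optional[str]:
    return None if o == g else "increment"


final = run_episode(
    toy_policy,
    observe=lambda: str(env["count"]),
    act=lambda a: env.update(count=env["count"] + 1),
    goal="3",
    constraints=["delete"],
)
```

The constraint check sits between the policy and `act` on purpose: the action the policy selects and the effect the system permits are separate steps, which mirrors the distinction between model intent and exercised authority drawn in the text.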

What makes this setting distinctive is that software control is only partially observable and only weakly synchronized\. A realistic task can move across browsers, operating\-system windows, filesystems, terminals, APIs, and tool backends\[[144](https://arxiv.org/html/2605.07110#bib.bib26),[118](https://arxiv.org/html/2605.07110#bib.bib7),[136](https://arxiv.org/html/2605.07110#bib.bib98),[148](https://arxiv.org/html/2605.07110#bib.bib128)\]\. Even with screenshots, DOM or accessibility trees, OCR output, memory traces, and tool responses, relevant state may remain hidden, delayed, or already stale: windows change asynchronously, permission prompts appear after a plan is formed, and tool state may become visible only after an invocation succeeds\[[187](https://arxiv.org/html/2605.07110#bib.bib18),[44](https://arxiv.org/html/2605.07110#bib.bib28),[146](https://arxiv.org/html/2605.07110#bib.bib21)\]\.

Three properties follow\. First, software environments are semantically dense: a small icon, checkbox, or command argument can carry disproportionate consequences\. Second, they are mixed\-trust by default: the same observation stream may contain user intent, benign interface content, deceptive prompts, retrieved text, and attacker\-controlled or compromised tool output\[[87](https://arxiv.org/html/2605.07110#bib.bib77),[35](https://arxiv.org/html/2605.07110#bib.bib46),[183](https://arxiv.org/html/2605.07110#bib.bib100),[150](https://arxiv.org/html/2605.07110#bib.bib48)\]\. Third, they are layered with authority: the action the model selects is not identical to the permission the system ultimately exercises\. This is why CUA behavior is treated here as controlled interaction under partial observability and constrained authority, not as next\-step prediction alone\.
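One way to make the mixed\-trust property concrete is to tag each item in the observation stream with its provenance. The sketch below is illustrative only: the `Trust` categories, the `Observation` type, and the rule that only user\-provenance content may act as an instruction are assumptions chosen for the example, not a mechanism described in the surveyed work.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Tuple


class Trust(Enum):
    USER = "user"            # direct user instruction
    INTERFACE = "interface"  # content rendered by the application
    RETRIEVED = "retrieved"  # text pulled from documents or the web
    TOOL = "tool"            # output returned by a tool backend


@dataclass(frozen=True)
class Observation:
    text: str
    source: Trust


def may_issue_instructions(obs: Observation) -> bool:
    """Toy rule (an assumption): only user-provenance content counts as an instruction."""
    return obs.source is Trust.USER


def split_stream(stream: Iterable[Observation]) -> Tuple[List[Observation], List[Observation]]:
    """Separate instruction-bearing items from items that are treated as data only."""
    items = list(stream)
    instructions = [o for o in items if may_issue_instructions(o)]
    data = [o for o in items if not may_issue_instructions(o)]
    return instructions, data
```

Under this rule, a retrieved page containing the text "ignore previous instructions" is still visible to the agent, but only as data; it never enters the instruction list.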

### II\-B Observation Modalities and Task\-Relevant State Representations

The first major design choice is how raw software observations become task\-relevant state\. In some environments, DOM trees, accessibility metadata, or view hierarchies provide precise handles for localization and control\[[187](https://arxiv.org/html/2605.07110#bib.bib18)\]\. In others, screenshot\-grounded interaction dominates, which turns computer use into a multimodal grounding problem shaped by OCR quality, scale, localization, theme variation, and small\-target robustness\. Hybrid approaches sit between those extremes by invoking parsers or specialist grounders only when they improve downstream control\[[157](https://arxiv.org/html/2605.07110#bib.bib31),[117](https://arxiv.org/html/2605.07110#bib.bib1),[33](https://arxiv.org/html/2605.07110#bib.bib5),[161](https://arxiv.org/html/2605.07110#bib.bib83)\]\.

Task\-relevant state is also broader than whatever is currently visible\. A deployed CUA may need to reason over memory traces, retrieved documents, execution logs, tool outputs, file context, and permission status alongside the active interface\[[144](https://arxiv.org/html/2605.07110#bib.bib26),[118](https://arxiv.org/html/2605.07110#bib.bib7),[91](https://arxiv.org/html/2605.07110#bib.bib84),[147](https://arxiv.org/html/2605.07110#bib.bib127)\]\. Environment class then determines which observation assumptions remain viable: web settings stress structure and untrusted content, desktop and OS settings mix GUI state with account authority, mobile settings compress semantics into small targets and permission prompts, and cross\-app workflows create handoff errors across surfaces\. Benchmarks such as Mind2Web, WebArena, and OSWorld are useful here mainly because they reveal which observation assumptions a design can actually survive\[[25](https://arxiv.org/html/2605.07110#bib.bib19),[187](https://arxiv.org/html/2605.07110#bib.bib18),[146](https://arxiv.org/html/2605.07110#bib.bib21)\]\.

The main takeaway of this axis is direct: observation design determines what the agent can know, what it can verify, and what it can trust strongly enough to act on\. That is why perception later appears as the first architectural layer rather than as a preprocessing detail\.

### II\-C Execution Interfaces, Action Spaces, and Authority Binding

The second major design choice is how model intent becomes operational authority\. Action design is therefore not only a question of granularity\. Two action spaces may look equally expressive while exposing very different blast radii, auditability, and rollback properties\.

Low\-level mouse, keyboard, and touch actions are the most general\. They allow the agent to operate almost any visible interface, but they make success depend on precise grounding and stable timing: a coordinate error can become a misclick, and a short delay can become a time\-of\-check/time\-of\-use failure\. Higher\-level actions such as browser APIs, application macros, shell commands, or code execution compress many low\-level steps into fewer semantic operations\. They often improve efficiency and logging, but they also increase the consequence of each mistake because one invocation may encode a wider authority surface\[[144](https://arxiv.org/html/2605.07110#bib.bib26),[118](https://arxiv.org/html/2605.07110#bib.bib7),[186](https://arxiv.org/html/2605.07110#bib.bib132),[149](https://arxiv.org/html/2605.07110#bib.bib133)\]\.

This is why *action abstraction* and *authority binding* are treated as related but distinct axes\. A click is low\-level and weakly semantic, but it can still be high\-risk if it confirms a destructive operation\. A shell command or API invocation is high\-level and often more auditable, but it usually carries more direct authority over files, networks, or processes\. Modern CUAs increasingly combine both forms of action, which makes execution design inseparable from permission scope, auditability, and recoverability\.
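The independence of the two axes can be sketched as a data structure in which abstraction level and authority are separate fields. This is a hypothetical illustration: the `Abstraction` and `Authority` levels, the `Action` type, and the confirmation threshold are assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Abstraction(IntEnum):
    LOW = 0    # mouse, keyboard, touch
    HIGH = 1   # API call, shell command, code execution


class Authority(IntEnum):
    READ = 0
    REVERSIBLE_WRITE = 1
    DESTRUCTIVE = 2


@dataclass(frozen=True)
class Action:
    name: str
    abstraction: Abstraction
    authority: Authority


def needs_confirmation(action: Action,
                       threshold: Authority = Authority.DESTRUCTIVE) -> bool:
    """Gate on authority only; the abstraction level does not enter the decision."""
    return action.authority >= threshold


# A low-level click that confirms a deletion, and a high-level read-only command.
confirm_click = Action("click 'Delete forever'", Abstraction.LOW, Authority.DESTRUCTIVE)
list_files = Action("shell: ls", Abstraction.HIGH, Authority.READ)
```

Here the low\-level click requires confirmation while the high\-level shell command does not, which is the point of treating the axes separately: a gate keyed to abstraction level alone would get both cases wrong.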

The main takeaway of this second axis is equally direct: execution design determines what the agent can cause, how tightly those consequences are bound to authority, and how easy they are to inspect or reverse\. That is why execution later appears as a distinct architectural layer rather than as the final step of a generic control loop\.

#### Takeaway

The central problem of computer use is therefore not only recognizing what is on the screen\. It is deciding what should count as executable state, and binding that state to action interfaces that carry real authority\. Observation design determines what the agent can know; execution design determines what the agent can cause\. Their coupling determines which failures appear first as ambiguity, which appear first as overreach, and why the same error can have very different consequences across architectures and deployment stages\. The next section turns that problem definition into an explicit analytical framework\.

## III Analytical Framework

The previous section defined the problem space and its main design axes\. This section defines how the following analysis reasons about that space\. The proposed joint framework combines an architectural view with a lifecycle view\. The architectural view explains how a CUA is built across *Perception*, *Decision*, and *Execution*\. The lifecycle view explains when capability is created, bound to authority, stressed in live use, and revised over time through *Creation*, *Deployment*, *Operation*, and *Maintenance*\. The framework is used as a heuristic analytical lens for organizing literature and locating intervention surfaces in deployed settings\[[125](https://arxiv.org/html/2605.07110#bib.bib81),[71](https://arxiv.org/html/2605.07110#bib.bib105),[36](https://arxiv.org/html/2605.07110#bib.bib108)\]\.

It should not be read as a formal causal model, nor as a claim that real systems or incidents fall neatly into mutually exclusive classes\. Its purpose is narrower and more practical: to provide a working coordinate system for reasoning about how capability, authority, and oversight become coupled in deployed computer\-use settings\.

### III\-A Why a Component View Is Not Enough

Architecture alone remains necessary, but it is incomplete\. It tells us how an agent reconstructs interface state, forms plans, and acts\. That is enough to compare systems at the component level\. It is not enough to explain why the same model behaves well in one environment and fails in another, why one action interface is tolerable under one deployment and materially riskier under another, or why a failure first seen at runtime may in fact originate in data curation, permission binding, or post\-release drift\.

The lifecycle view adds the missing temporal dimension\. *Creation* asks what priors the system learns before release\. *Deployment* asks how those priors are coupled to live channels, tools, sessions, and permissions\. *Operation* asks how the system behaves under long horizons, mixed\-trust inputs, and asynchronous change\. *Maintenance* asks whether the same architecture remains valid as interfaces, tools, workflows, and threat models evolve\. This second dimension helps separate *where a failure becomes visible* from *where the enabling condition was introduced*\. That distinction is a main analytical commitment of the framework\.

TABLE II: Architecture–lifecycle coordinate system for the analytical framework\. The table maps lifecycle stages to the objects they shape and to the corresponding pressure points in the perception, decision, and execution layers\.

### III\-B What the Joint Framework Claims

The joint framework makes three claims that organize the rest of the analysis\.

First, capability is lifecycle\-dependent rather than purely model\-dependent\. What practitioners often call “agent ability” is jointly shaped by representation choices, post\-training, execution interfaces, deployment mediation, memory, and recovery mechanisms\. Runtime performance is therefore not an intrinsic property of the base model alone\.

Second, failure origin and failure manifestation are different analytical questions\. A brittle recovery path visible during operation may reflect weak grounding supervision in creation, over\-broad or weakly mediated permission scope in deployment, or untested drift in maintenance\. Treating every higher\-risk action as a runtime reasoning defect hides the real control surface\.

Third, control placement must follow both layer and stage\. Perception errors, decision errors, and execution errors do not respond to the same controls, and creation\-stage controls differ fundamentally from deploy\-time, runtime, or maintenance\-time ones\. Security and governance are therefore not appended after the agent is built; they are about where constraints attach relative to the system’s structure and evolution\.

Table [II](https://arxiv.org/html/2605.07110#S3.T2) summarizes the working coordinate system used throughout the article\. Later tables are narrower instantiations of the same logic rather than competing taxonomies\. Whenever later sections compare systems, failures, or controls, the guiding questions remain the same: what stage has changed the object of analysis, and which architectural layer now carries the main pressure?

#### Assignment Rule

An issue is assigned primarily by the earliest stage at which it is materially introduced or could still have been constrained\. Creation applies when pre\-release data, objectives, or action abstractions are the main source of the problem\. Deployment applies when live binding decisions about tools, permissions, sessions, observation channels, or mediation dominate\. Operation applies when online trajectory evolution under uncertainty is the main source of failure\. Maintenance applies when post\-release change to the model, environment, or ecosystem is the primary driver\. Persistent memory, for example, is a deployment choice; misuse of that memory during a live trajectory is operational; and memory\-related regressions after updates belong to maintenance\.
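The assignment rule can be expressed as a small function. The sketch below assumes an analyst has already judged, for a given issue, the set of stages at which it is materially introduced or could still have been constrained; `Stage` and `assign_stage` are hypothetical names, and the rule itself remains a heuristic rather than a formal classifier.

```python
from enum import IntEnum
from typing import Iterable


class Stage(IntEnum):
    # Ordered so that earlier lifecycle stages compare as smaller.
    CREATION = 0
    DEPLOYMENT = 1
    OPERATION = 2
    MAINTENANCE = 3


def assign_stage(candidate_stages: Iterable[Stage]) -> Stage:
    """Return the earliest stage among those judged relevant for an issue."""
    stages = list(candidate_stages)
    if not stages:
        raise ValueError("at least one candidate stage is required")
    return min(stages)


# The memory example from the text, with candidate sets chosen by hand.
persistent_memory_choice = assign_stage([Stage.DEPLOYMENT])
memory_misuse_in_trajectory = assign_stage([Stage.OPERATION])
memory_regression_after_update = assign_stage([Stage.MAINTENANCE])
```

The function only encodes the "earliest stage" tie\-break; deciding which stages belong in the candidate set is the analytical judgment the framework asks the reader to make.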

### III\-C Deployment Archetypes and an Open\-Deployment Example

To keep the framework concrete without collapsing the analysis into a single case study, three recurring deployment archetypes are useful\.

Benchmark\-centered research agents are optimized for bounded evaluation\. They typically expose narrow authority, controlled observability, and reproducible task structure\[[160](https://arxiv.org/html/2605.07110#bib.bib135),[2](https://arxiv.org/html/2605.07110#bib.bib136),[177](https://arxiv.org/html/2605.07110#bib.bib150)\]\.

Consumer or assistant\-style deployed agents sit inside user workflows such as search, browsing, productivity, or messaging\. Their defining pressure is that user intent, real accounts, and high\-impact actions coexist in one interface loop\[[148](https://arxiv.org/html/2605.07110#bib.bib128),[186](https://arxiv.org/html/2605.07110#bib.bib132),[149](https://arxiv.org/html/2605.07110#bib.bib133)\]\.

Self\-hosted gateway or tool\-binding agents can make CUA capability available through persistent ingress channels, longer\-lived memory, tool registries, or automation layers\. Their defining pressure is the coupling among authority, provenance, and ecosystem evolution rather than raw capability alone\[[21](https://arxiv.org/html/2605.07110#bib.bib118),[168](https://arxiv.org/html/2605.07110#bib.bib179),[42](https://arxiv.org/html/2605.07110#bib.bib181)\]\.

OpenClaw is used only as a motivating deployment setting\. Public materials describe it as a local\-first assistant with persistent ingress, tool connectivity, and visible participation in a broader agent community\[[105](https://arxiv.org/html/2605.07110#bib.bib96),[95](https://arxiv.org/html/2605.07110#bib.bib97),[14](https://arxiv.org/html/2605.07110#bib.bib71)\]\. That public framing is treated as an illustrative deployment pattern rather than as verified system evidence: once one deployment surface is presented as combining channels, tools, sessions, and longer\-lived context, deployment choices can account for a large share of the resulting risk profile\. The analysis remains about the broader class of systems that share a similar structure\.

### III\-D How To Read the Rest of the Paper

The remaining sections form a guided traversal of Table [II](https://arxiv.org/html/2605.07110#S3.T2)\. Section [IV](https://arxiv.org/html/2605.07110#S4) develops the architectural dimension by asking how perception, decision, and execution interact\. Section [V](https://arxiv.org/html/2605.07110#S5) develops the temporal dimension by asking how capability, failure, and control move across Creation, Deployment, Operation, and Maintenance\. Section [VI](https://arxiv.org/html/2605.07110#S6) maps threats back onto both dimensions\. Section [VII](https://arxiv.org/html/2605.07110#S7) then converts that analysis into control surfaces and governance implications\. The intended reading is cumulative rather than parallel: each later section reuses the same framework to answer a different layer of the same reliability problem\.

## IV Architectural Evolution of Computer\-Use Agents: A Tri\-Layer Framework

The architectural question in CUAs is simple to state and difficult to answer well: how does an agent convert a changing software environment into justified, executable action? A useful synthesis is structural\. A deployed CUA must reconstruct actionable state, preserve task\-conditioned intent, and finally translate that intent into authority\-bearing operations\. This section therefore analyzes CUA architectures through three coupled layers: *Perception*, *Decision*, and *Execution*\. The value of this decomposition is not that every system literally contains three modules\. It is that the decomposition helps indicate where capability is created, where uncertainty accumulates, and where small errors become operationally expensive\[[125](https://arxiv.org/html/2605.07110#bib.bib81),[115](https://arxiv.org/html/2605.07110#bib.bib82),[71](https://arxiv.org/html/2605.07110#bib.bib105)\]\.

The recent literature makes this tri\-layer view increasingly necessary\. Perception work spans structured web grounding, parser\-augmented visual understanding, native screenshot\-grounded policies, and newer grounding methods that adapt zoom, attention, or symbolic structure to interface complexity\[[89](https://arxiv.org/html/2605.07110#bib.bib39),[51](https://arxiv.org/html/2605.07110#bib.bib88),[117](https://arxiv.org/html/2605.07110#bib.bib1),[33](https://arxiv.org/html/2605.07110#bib.bib5),[106](https://arxiv.org/html/2605.07110#bib.bib167),[161](https://arxiv.org/html/2605.07110#bib.bib83)\]\. Planning and control work now covers memory\-augmented reasoning, long\-horizon decomposition, rollback, reflection, and hybrid planners that mix GUI control with programmatic execution\[[184](https://arxiv.org/html/2605.07110#bib.bib35),[166](https://arxiv.org/html/2605.07110#bib.bib30),[118](https://arxiv.org/html/2605.07110#bib.bib7),[45](https://arxiv.org/html/2605.07110#bib.bib111),[55](https://arxiv.org/html/2605.07110#bib.bib175),[158](https://arxiv.org/html/2605.07110#bib.bib114)\]\. Execution work is widening at the same time, from atomic GUI actions to code\-as\-action, dexterous control, and bundled routines, which means architectural choice is now inseparable from authority design\[[144](https://arxiv.org/html/2605.07110#bib.bib26),[56](https://arxiv.org/html/2605.07110#bib.bib11),[99](https://arxiv.org/html/2605.07110#bib.bib10),[49](https://arxiv.org/html/2605.07110#bib.bib131),[159](https://arxiv.org/html/2605.07110#bib.bib120)\]\. The field does not lack methods; instead, recent capability gains often shift risk across layers rather than remove it\.

![Refer to caption](https://arxiv.org/html/2605.07110v1/images/p3.1new.png)

Figure 1: Tri\-layer pressure map for deployed CUAs\. The figure frames deployed CUA behavior as a recurrent control loop between user goals and constraints and a live, partially observable software environment\. Perception reconstructs operationally trustworthy state, decision maintains task\-conditioned intent under long\-horizon pressure, and execution converts the resulting trajectory into authority\-bearing action\.

Figure [1](https://arxiv.org/html/2605.07110#S4.F1) depicts a recurrent control loop rather than a one\-way pipeline\. The live environment supplies dynamic and partially observable state to perception, user goals and constraints shape decision, and execution not only acts on the environment but can also return verification, feedback, rollback, or replanning signals to decision\. Perception shapes what the system can treat as actionable state\. Decision shapes whether that state is converted into a coherent and constrained trajectory\. Execution shapes how much authority is exercised when that trajectory is enacted\. Failure pressure can propagate accordingly: weak grounding can distort planning, weak planning can amplify execution risk, and high\-authority execution can raise the cost of even small perceptual or decisional mistakes\.
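The recurrent loop described above can be made concrete with a small sketch. It is illustrative only and is not taken from any surveyed system: the names `run_loop`, `StepResult`, and `LoopTrace`, and the toy environment, are assumptions introduced here. The point it shows is that execution returns a verification signal to decision, so a failed step changes the next decision instead of being silently ignored.

```python
from dataclasses import dataclass, field


@dataclass
class StepResult:
    ok: bool            # did post-action verification pass?
    detail: str = ""


@dataclass
class LoopTrace:
    actions: list = field(default_factory=list)
    replans: int = 0
    completed: bool = False


def run_loop(perceive, decide, execute, max_steps=10):
    """Perceive -> decide -> execute, feeding execution results back to decide."""
    trace = LoopTrace()
    feedback = None
    for _ in range(max_steps):
        state = perceive()                 # perception: rebuild actionable state
        action = decide(state, feedback)   # decision: uses state and prior feedback
        if action is None:                 # decision layer reports the goal is met
            trace.completed = True
            break
        result = execute(action)           # execution: authority-bearing step
        trace.actions.append(action)
        if result.ok:
            feedback = None
        else:
            feedback = result              # failure is returned to the decision layer
            trace.replans += 1
    return trace


# Toy environment (hypothetical): the first save attempt has no effect.
env = {"saved": False, "attempts": 0}


def perceive():
    return dict(env)


def decide(state, feedback):
    if state["saved"]:
        return None
    return "retry_save" if feedback else "click_save"


def execute(action):
    env["attempts"] += 1
    if env["attempts"] == 1:
        return StepResult(ok=False, detail="no visible change after click")
    env["saved"] = True
    return StepResult(ok=True)


trace = run_loop(perceive, decide, execute)
```

In this toy run the first action fails verification, the decision layer sees the feedback and chooses a different action, and the loop stops once perception reports the goal state.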

TABLE III: Architectural archetypes in contemporary CUAs\. The table summarizes recurring design patterns, their primary capability gains, the layers they chiefly strengthen, and the reliability pressures they introduce in deployed stacks\.

Table [III](https://arxiv.org/html/2605.07110#S4.T3) summarizes heuristic architectural bets rather than ranking individual systems\. The same deployed CUA may combine several rows at once, which is precisely why capability gains often reappear later as cross\-layer reliability pressure\.

### IV\-A Perception Layer: From Raw Interfaces to Actionable State

The perception layer answers the first architectural question: what does the agent believe the interface contains, and how trustworthy is that belief for downstream action? In CUAs, perception is not only recognition\. It is the construction of operational state from screenshots, DOM trees, accessibility metadata, OCR output, parser\-derived layouts, and retrieved contextual traces\. The design problem is therefore not simply to maximize visual accuracy\. It is to decide what form of state remains precise enough to act on and portable enough to survive across environments\.

The literature now falls into four recurring families\. Structured\-state agents rely on DOM, accessibility, or view\-hierarchy signals\. Their strength is precision when the environment exposes reliable structure, which is why Mind2Web and WebArena remain important reference points for disciplined web control\[[25](https://arxiv.org/html/2605.07110#bib.bib19),[187](https://arxiv.org/html/2605.07110#bib.bib18)\]\. Their weakness is portability: once interfaces become rendering\-heavy, streamed, or weakly instrumented, the same structural advantage can disappear\.

Parser\-augmented visual agents recover part of that structure from rendered interfaces\. Systems such as OmniParser, ScreenAI, SeeClick, Ferret\-UI, TRISHUL, and newer complete\-screen parsing approaches add OCR, icon captioning, region proposals, or layout parsing before downstream reasoning\[[89](https://arxiv.org/html/2605.07110#bib.bib39),[6](https://arxiv.org/html/2605.07110#bib.bib16),[22](https://arxiv.org/html/2605.07110#bib.bib17),[157](https://arxiv.org/html/2605.07110#bib.bib31),[117](https://arxiv.org/html/2605.07110#bib.bib1),[43](https://arxiv.org/html/2605.07110#bib.bib2)\]\. This family is attractive because it retains visual generality while recovering some of the handles that make execution easier\. Its main failure mode is familiar: the parser becomes a bottleneck, and downstream reasoning can remain confidently wrong when the parsed state is incomplete or distorted\.

Native end\-to\-end visual agents push farther toward portability by leaving more of the grounding inside one multimodal policy\. CogAgent, Fuyu\-style VLMs, Qwen\-VL\-style systems, WinClick, MAI\-UI, Step\-GUI, Mobile\-Agent\-v3\.5, and high\-resolution\-aware agents such as AFRAgent illustrate this direction\[[47](https://arxiv.org/html/2605.07110#bib.bib14),[8](https://arxiv.org/html/2605.07110#bib.bib15),[7](https://arxiv.org/html/2605.07110#bib.bib3),[51](https://arxiv.org/html/2605.07110#bib.bib88),[186](https://arxiv.org/html/2605.07110#bib.bib132),[149](https://arxiv.org/html/2605.07110#bib.bib133),[148](https://arxiv.org/html/2605.07110#bib.bib128),[4](https://arxiv.org/html/2605.07110#bib.bib134)\]\. The gain is broad environment coverage\. The cost is entanglement\. OCR, localization, affordance inference, and actionability all live inside one representation, which makes it harder to know whether a failure arose from vision, reasoning, or the coupling between the two\.

Hybrid or compositional grounding agents mix several channels rather than committing to one\. They use structured state when it is available, screenshots when it is not, and specialist grounders only when those tools improve downstream control\[[144](https://arxiv.org/html/2605.07110#bib.bib26),[166](https://arxiv.org/html/2605.07110#bib.bib30),[33](https://arxiv.org/html/2605.07110#bib.bib5),[5](https://arxiv.org/html/2605.07110#bib.bib119)\]\. This family is especially important for cross\-platform and open\-deployment settings because the observation channel itself may vary within a single task\.
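The channel-selection logic of the hybrid family can be sketched as a small fallback rule. This is a minimal illustration, not the method of any cited system: `Observation`, `choose_channel`, and the `needs_fine_grounding` flag are hypothetical names, and the flag stands in for whatever signal a real system would use to decide that a specialist grounder is worth invoking.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Observation:
    structured_nodes: Optional[Sequence[str]] = None  # DOM / accessibility nodes, if exposed
    screenshot: Optional[bytes] = None                # rendered pixels, if captured


def choose_channel(obs: Observation, needs_fine_grounding: bool = False) -> str:
    """Prefer structured state, fall back to pixels, escalate only when asked."""
    if obs.structured_nodes:                # non-empty structure is the most precise channel
        return "structured"
    if obs.screenshot is not None:
        # Assumption: a specialist grounder is used only for hard targets.
        return "specialist_grounder" if needs_fine_grounding else "screenshot"
    return "no_observation"


rich = choose_channel(Observation(structured_nodes=["button#save"], screenshot=b"\x89PNG"))
pixels_only = choose_channel(Observation(screenshot=b"\x89PNG"))
small_target = choose_channel(Observation(screenshot=b"\x89PNG"), needs_fine_grounding=True)
```

Because the rule is evaluated per observation, the selected channel can change from step to step within one task, which is the situation the hybrid family is designed for.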

These families are best compared along four axes: element\-grounding precision, small\-target robustness, robustness to scale/theme/localization changes, and downstream executability\. No current approach simultaneously maximizes all four\. Benchmarks such as ScreenSpot\-Pro help make the trade\-off concrete by indicating that high\-resolution professional interfaces still punish small\-target grounding even when broader GUI capability looks competitive elsewhere\[[73](https://arxiv.org/html/2605.07110#bib.bib224)\]\. The most precise observation channel is rarely the most portable, and the most portable channel is rarely the easiest to verify\. That tension is why perception in CUAs should be judged not by recognition accuracy alone, but by whether the resulting state remains trustworthy enough for safe execution\.

#### Observation\-channel mismatch

One under\-discussed architectural risk is mismatch between the observation channel assumed during training and the one encountered in deployment\. A model trained on parser\-normalized or DOM\-rich state can look strong on benchmarks yet fail in screenshot\-only or streamed settings\. A screenshot\-trained model can make the opposite mistake by ignoring structure that would have improved precision\. The architectural point is that a policy can look transferable while the real grounding problem has already changed underneath it\.

This classification remains heuristic rather than exhaustive\. Many deployed systems mix structured state, screenshots, parsers, and specialist grounders within one task, so the perception families above capture dominant observation bets rather than clean system boundaries\.

### IV\-B Decision Layer: Preserving Intent Under Long\-Horizon Pressure

If perception answers “what is happening now,” the decision layer answers “what should happen next, and how can the system verify that it still matches the user’s goal?” In deployed CUAs, this is where autonomy becomes either stable or brittle\. The core difficulty is not merely local next\-step selection\. It is preserving task\-conditioned intent across long trajectories, incomplete observations, changing interfaces, and sometimes hostile runtime content\.

Four pressures organize the literature here\. The first is long\-horizon coherence: can the system keep the task moving in the right direction across many screens and tool calls? The second is constraint retention: can it preserve instructions such as “draft, do not send” or “ask before deleting” after dozens of intermediate steps? The third is verification and uncertainty handling: does it know when to check, backtrack, or ask for clarification? The fourth is decomposition: does it solve the task in one policy loop, or distribute work across roles, tools, or subagents?

Memory\-augmented systems are an early answer to the coherence problem\. Synapse illustrated that state abstraction, trajectory exemplars, and retrieval can materially stabilize computer\-control behavior under limited context\[[184](https://arxiv.org/html/2605.07110#bib.bib35)\]\. AppAgent and InfiGUIAgent push the same idea toward reusable operating knowledge rather than disposable context\[[167](https://arxiv.org/html/2605.07110#bib.bib24),[88](https://arxiv.org/html/2605.07110#bib.bib4)\]\. Newer systems such as WebATLAS, SecAgent, ColorBrowserAgent, and anchored\-memory benchmarks such as AndroTMem make a similar point from different directions by treating summarization, simulation, or anchored recall as first\-class planning supports rather than optional add\-ons\[[21](https://arxiv.org/html/2605.07110#bib.bib118),[147](https://arxiv.org/html/2605.07110#bib.bib127),[130](https://arxiv.org/html/2605.07110#bib.bib115),[114](https://arxiv.org/html/2605.07110#bib.bib163)\]\. OSWorld evaluations are consistent with the same concern: long tasks can degrade quickly without some mechanism for compression, recall, or structured recovery\[[146](https://arxiv.org/html/2605.07110#bib.bib21)\]\. These systems matter because they suggest that CUA performance is not only a function of local reasoning quality\. It also depends on how continuity is carried forward\.

Constraint retention is the more safety\-critical version of the same problem\. A CUA can appear productive while gradually dropping confirmation requirements, user preferences, or exposure limits\. Reflection\-oriented work and safe\-planning or interruptibility evaluations are therefore informative not only because they improve task completion, but because they ask whether the original task boundary stays attached to later steps\[[141](https://arxiv.org/html/2605.07110#bib.bib85),[74](https://arxiv.org/html/2605.07110#bib.bib86),[15](https://arxiv.org/html/2605.07110#bib.bib67),[191](https://arxiv.org/html/2605.07110#bib.bib157),[153](https://arxiv.org/html/2605.07110#bib.bib161)\]\. The more concerning failure mode is often not that the model stops progressing\. It is that it keeps progressing after silently forgetting what should have constrained it\.
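One way to picture constraint retention is to hold the user's constraints as explicit checks that are re-applied to every proposed action, so they do not depend on the planner still remembering them after many steps. The sketch below is an assumption-laden illustration, not a mechanism described in the cited work: `Action`, `Constraint`, and `gate` are hypothetical names, and the two constraints mirror the examples used earlier in this section.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Action:
    kind: str     # e.g. "type", "click", "send", "delete"
    target: str


@dataclass(frozen=True)
class Constraint:
    name: str
    violates: Callable[[Action], bool]  # True when the action would break the constraint
    on_violation: str                   # "block" or "ask_user"


def gate(action: Action, constraints: Sequence[Constraint]) -> Tuple[str, Optional[str]]:
    """Check a proposed action against every task-level constraint."""
    for c in constraints:
        if c.violates(action):
            return (c.on_violation, c.name)
    return ("allow", None)


# Constraints stay attached to the task, independent of planner context length.
constraints = (
    Constraint("draft, do not send", lambda a: a.kind == "send", "block"),
    Constraint("ask before deleting", lambda a: a.kind == "delete", "ask_user"),
)

send_verdict = gate(Action("send", "email:draft-1"), constraints)
delete_verdict = gate(Action("delete", "file:report.docx"), constraints)
type_verdict = gate(Action("type", "email:body"), constraints)
```

The gate runs at step 1 and at step 50 with the same inputs, which is the property that purely prompt-carried constraints cannot guarantee.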

Verification and uncertainty handling are what keep that drift from becoming irreversible\. General agent methods such as ReAct, Tree of Thoughts, and Reflexion contributed three ideas that recur in CUA planners: interleaving reasoning with action, branching over alternatives, and using prior errors as planning signals\[[156](https://arxiv.org/html/2605.07110#bib.bib32),[155](https://arxiv.org/html/2605.07110#bib.bib33),[116](https://arxiv.org/html/2605.07110#bib.bib34)\]\. WebVoyager, You Only Look at Screens, ReInAgent, Mobile\-Agent, BacktrackAgent, TreeCUA, stable\-planner modules, and action\-effect verification systems carry those ideas into live interfaces through backtracking, chain\-of\-action reasoning, active questioning, replanning, explicit recovery after detected errors, and post\-action verification\[[44](https://arxiv.org/html/2605.07110#bib.bib28),[179](https://arxiv.org/html/2605.07110#bib.bib6),[54](https://arxiv.org/html/2605.07110#bib.bib8),[131](https://arxiv.org/html/2605.07110#bib.bib25),[142](https://arxiv.org/html/2605.07110#bib.bib124),[55](https://arxiv.org/html/2605.07110#bib.bib175),[98](https://arxiv.org/html/2605.07110#bib.bib177),[178](https://arxiv.org/html/2605.07110#bib.bib153)\]\. The literature does not point to one universally superior planner\. It does suggest, however, that robust computer use cannot remain purely feedforward once trajectories become long and mixed\-trust\.

Decomposition offers a second route to scale\. UFO separates application routing from local action selection, while CoAct\-1 coordinates GUI operation with programmatic execution through a central planner\[[166](https://arxiv.org/html/2605.07110#bib.bib30),[118](https://arxiv.org/html/2605.07110#bib.bib7)\]\. EE\-MCP, GraphPilot, LiteWebAgent, Agent\-SAMA, PolySkill, and earlier planning\-oriented web agents extend the same design space by balancing GUI interaction against tool calls, reusable skills, finite\-state planning, or program synthesis, which makes the coordination problem partly an interface\-selection and skill\-selection problem rather than only a planning problem\[[45](https://arxiv.org/html/2605.07110#bib.bib111),[158](https://arxiv.org/html/2605.07110#bib.bib114),[168](https://arxiv.org/html/2605.07110#bib.bib179),[40](https://arxiv.org/html/2605.07110#bib.bib123),[159](https://arxiv.org/html/2605.07110#bib.bib120),[42](https://arxiv.org/html/2605.07110#bib.bib181)\]\. These designs can make complex workflows tractable, but they do not eliminate difficulty\. They move difficulty from single\-policy context management into coordination, attribution, and trust management across components\.

The architectural lesson is therefore narrow but important: decision quality in CUAs is not the ability to choose the next plausible click\. It is the ability to preserve user\-intended constraints while converting uncertain, partial state into a trajectory that remains coherent over time\. Stronger planning without stronger verification usually means stronger drift\.

This decision\-layer classification also has limits\. Real systems often fuse memory, reflection, skills, planners, and verification into one controller, so the categories above capture recurring planning pressures rather than strict module types\.

### IV\-C Execution Layer: Action Abstraction, Authority, and Recoverability

The execution layer is where the reliability argument becomes concrete\. A CUA does not merely choose what should happen\. It acts through a specific interface that determines how much authority the decision carries, how easy the action is to audit, and how recoverable the outcome remains when something goes wrong\. In other words, execution is where intent becomes consequence\.

Three execution styles recur across the literature\. Atomic input execution uses clicks, drags, taps, and keystrokes\. It is maximally general and often the only option when no trusted higher\-level interface exists\. Cradle and AndroidEnv illustrate the breadth of this style, while many visual desktop and mobile agents inherit the same strengths and weaknesses\[[123](https://arxiv.org/html/2605.07110#bib.bib23),[128](https://arxiv.org/html/2605.07110#bib.bib9),[129](https://arxiv.org/html/2605.07110#bib.bib12),[49](https://arxiv.org/html/2605.07110#bib.bib131)\]\. Its advantage is portability\. Its failure modes are misclicks, action rebinding, clickjacking, and time\-of\-check/time\-of\-use gaps\.

Programmatic execution compresses many low\-level steps into semantically richer operations\. OS\-Copilot and CoAct\-1 illustrate how file operations, application macros, shell commands, or code execution can bypass brittle GUI sequences in complex workflows\[[144](https://arxiv.org/html/2605.07110#bib.bib26),[118](https://arxiv.org/html/2605.07110#bib.bib7)\]\. GUI\-360, MAI\-UI, and Step\-GUI add complementary evidence that hybrid GUI\+API or GUI\+MCP action spaces are a recurring part of the current CUA design landscape rather than isolated systems choices\[[99](https://arxiv.org/html/2605.07110#bib.bib10),[186](https://arxiv.org/html/2605.07110#bib.bib132),[149](https://arxiv.org/html/2605.07110#bib.bib133)\]\. The gain is speed, auditability, and leverage\. The trade\-off is stronger authority binding: one invocation can modify files, alter configurations, or contact external systems directly\.

Bundled or macro\-action execution sits between those extremes\. AppAgentX is a direct example because it evolves recurrent action sequences into higher\-level routines that substitute for repeated low\-level interaction\[[56](https://arxiv.org/html/2605.07110#bib.bib11)\]\. PolySkill provides a related skill\-abstraction view in which reusable polymorphic skills stand in for repeated interaction fragments across tasks\[[159](https://arxiv.org/html/2605.07110#bib.bib120)\]\. This can reduce planning burden and improve efficiency, but it also makes failure harder to localize because several consequential substeps may be bundled into one command\. More general screenshot\-grounded systems such as ScreenAgent and OmegaUse reinforce the field’s movement toward broad end\-to\-end GUI control, but they do not resolve on their own where rollback and verification boundaries should sit\[[104](https://arxiv.org/html/2605.07110#bib.bib27),[173](https://arxiv.org/html/2605.07110#bib.bib13)\]\.
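The time-of-check/time-of-use gap noted for atomic input execution can be narrowed, though not closed, by re-observing the target immediately before acting. The sketch below is a hypothetical illustration: `Element`, `safe_click`, the `reobserve` and `click` callables, and the pixel `tolerance` are all assumptions, and a real agent would supply its own perception and input backends.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Element:
    label: str
    bbox: Tuple[int, int, int, int]  # (x, y, width, height)


def safe_click(
    planned: Element,
    reobserve: Callable[[Tuple[int, int, int, int]], Optional[Element]],
    click: Callable[[int, int], None],
    tolerance: int = 4,
) -> bool:
    """Click only if the element seen at planning time is still what is on screen."""
    current = reobserve(planned.bbox)               # second look, just before acting
    if current is None or current.label != planned.label:
        return False                                # target vanished or was rebound
    if any(abs(a - b) > tolerance for a, b in zip(current.bbox, planned.bbox)):
        return False                                # layout shifted beyond tolerance
    x, y, w, h = current.bbox
    click(x + w // 2, y + h // 2)                   # click the re-observed centre
    return True


clicks = []
planned = Element("Save", (100, 200, 80, 24))

# Case 1: the element moved by one pixel, within tolerance.
stable = safe_click(planned, lambda b: Element("Save", (101, 200, 80, 24)),
                    lambda x, y: clicks.append((x, y)))

# Case 2: a different control now occupies the same region.
rebound = safe_click(planned, lambda b: Element("Delete", (100, 200, 80, 24)),
                     lambda x, y: clicks.append((x, y)))
```

A small window still exists between the re-observation and the click itself, so this check reduces exposure to rebinding rather than eliminating it.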

These three styles are best compared through two coupled questions: how abstract is the action, and how much authority is bound to it? That distinction clarifies why shell execution belongs in the execution layer rather than in GUI control\. It is a high\-authority interface, not a visual interaction primitive\. It also clarifies why recoverability deserves its own place in CUA design\. Atomic inputs are easy to interrupt but hard to replay semantically\. Programmatic actions are easier to log and replay but often harder to roll back safely\. Bundled actions reduce step count while blurring rollback boundaries\. For deployed CUAs, recoverability can matter more than raw task completion because it determines whether a failure remains local or becomes irreversible\.

This execution grouping should not be read as a set of mutually exclusive modes\. In practice, many deployed CUAs mix atomic, programmatic, and bundled execution within the same workflow, which is exactly why authority binding and rollback boundaries remain hard to reason about\.

### IV\-D Cross\-Layer Coupling and Architectural Implications

The three layers should be read together\. Perception determines what state is available for planning\. Decision determines whether that state is converted into a coherent and constrained trajectory\. Execution determines how much real authority that trajectory acquires\. Better performance in one layer can therefore shift pressure to another: more portable perception can increase ambiguity, stronger autonomy can increase the need for verification, and more efficient execution can enlarge the blast radius of mistakes\.

This is why architecture already contains a latent governance question\. A system that combines ambiguous perception, long\-horizon autonomy, and high\-authority execution is not only an engineering design\. It can also create a distinctive configuration of risk\. Reliability problems in CUAs often become visible as cross\-layer coupling before they appear as isolated bugs\[[115](https://arxiv.org/html/2605.07110#bib.bib82),[152](https://arxiv.org/html/2605.07110#bib.bib87)\]\. The tri\-layer framework is therefore the structural backbone of the analysis\. It explains where capability resides, where ambiguity enters, and where authority is exercised\. The next section adds the missing temporal backbone by asking how those same layers are pressured differently across Creation, Deployment, Operation, and Maintenance\.

## V A Lifecycle Framework for Computer\-Use Agents: From Capability Formation to Continual Adaptation

The previous section explained what a CUA is made of\. This section explains how that architecture becomes a live system over time\. This distinction matters because the same visible failure can emerge from different upstream causes\. A misleading click may reflect weak grounding supervision during training, unsafe permission binding at release, brittle recovery under runtime drift, or a regression introduced during maintenance\. The lifecycle view is therefore used not to retell the development timeline, but to distinguish when capability is formed, when authority is attached, when risk first becomes visible, and where effective controls can still enter\.

Recent benchmarks and environments make this lifecycle sensitivity easier to see\. Newer evaluations increasingly include personalized mobile interaction, privacy\-sensitive workflows, collaborative assistance, verification\-centered replay, software environments generated at scale, and human\-like long\-horizon behavior\[[109](https://arxiv.org/html/2605.07110#bib.bib154),[188](https://arxiv.org/html/2605.07110#bib.bib137),[101](https://arxiv.org/html/2605.07110#bib.bib138),[160](https://arxiv.org/html/2605.07110#bib.bib135),[190](https://arxiv.org/html/2605.07110#bib.bib142),[153](https://arxiv.org/html/2605.07110#bib.bib161),[52](https://arxiv.org/html/2605.07110#bib.bib143)\]\. These settings do not by themselves establish a lifecycle theory\. They do suggest, however, that capability formation, deployment binding, runtime pressure, and maintenance drift leave different empirical signatures once agents are evaluated outside narrow one\-shot tasks\.

Throughout this section, each stage is read through the same analytical template: *What primary object does this stage shape? What risks can first enter here? Why might they only become visible later? Where is the earliest control surface that can still matter?* Figure [2](https://arxiv.org/html/2605.07110#S5.F2) makes that template explicit by organizing every stage into three aligned bands: *Primary Object*, *Main Entry Risks*, and *Earliest Controls*\. The categories inside each band are not meant as exhaustive engineering checklists\. They summarize the main ways in which a stage can change the eventual behavior of a deployed CUA\.

![Refer to caption](https://arxiv.org/html/2605.07110v1/images/p4.1new.png)

Figure 2: Revised four\-stage lifecycle framework for deployed CUAs\. Each stage is organized into three aligned bands: the *primary object* shaped at that stage, the *main entry risks* that can first be introduced there, and the *earliest controls* that may still change downstream outcomes\. The bottom guide emphasizes the main analytical point: failures may be introduced earlier and become visible later, so failure origin can precede failure manifestation\.

### V\-A Why These Four Stages and Why These Categories

The four\-stage split is intentionally minimal\. *Creation* is separated out because it shapes the policy’s priors before the system meets any live account or interface\. *Deployment* is distinct because it changes the binding between learned capability and real authority\. *Operation* is where the active trajectory is exposed to temporal pressure, mixed\-trust inputs, and partial observability\. *Maintenance* is separate because post\-release drift affects the model, the environment, and the surrounding ecosystem differently\. In the revised figure, those differences are captured as different *primary objects*: priors and grounding habits in creation, authority\-bearing bindings in deployment, active trajectories under uncertainty in operation, and validity of the model–environment–ecosystem stack in maintenance\.

This separation is analytically useful because coarser decompositions hide important differences in failure origin\. If creation and deployment are merged, training\-time bias becomes hard to distinguish from over\-authority introduced only at release\. If operation and maintenance are merged, live runtime stress is confused with regressions or ecosystem drift\. The four stages are therefore analytical categories tied to when different enabling conditions are introduced, not just milestones on a project timeline\. The revised diagram also makes the analytical sequence more concrete: once the stage\-specific object is identified, the next question is what risks first enter there, and then which controls can still act early enough to matter\.

The subcategories inside each stage follow the same logic\. *Creation* is decomposed by the mechanisms that shape priors before release, such as data bias, weak supervision, objective bias, and unsafe action abstractions\. *Deployment* is decomposed by the bindings through which capability acquires operational effect, including authority binding, observation channels, tools, sessions, and permissions\. *Operation* is decomposed by the runtime pressures that can turn local uncertainty into higher\-risk trajectories, such as long\-horizon drift, mixed\-trust inputs, TOCTOU, and intent dilution\. *Maintenance* is decomposed by the post\-release objects that drift independently, including the model, the environment, and the surrounding ecosystem\. Read this way, the lifecycle framework is not an expanded project timeline\. It is a map from failure origin to intervention point\.

The stage boundaries are operational rather than merely conceptual\. Creation is used when a problem is primarily driven by pre\-release data, objective design, or action abstraction\. Deployment is used when live binding decisions about tools, permissions, sessions, observation channels, or mediation dominate\. Operation is used when online trajectory evolution under uncertainty is the main source of failure\. Maintenance is used when post\-release change to the model, environment, or ecosystem becomes the primary driver\. Persistent memory, for example, is a deployment choice; misuse of that memory during a live trajectory is an operational phenomenon; and memory\-related regressions after updates belong to maintenance\.
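The attribution rule in the preceding paragraph can be written down as a lookup from the dominant driver of a failure to the lifecycle stage where it originates. The sketch is illustrative: the `Stage` enum follows the four stages named in the text, while the driver labels and the `attribute` helper are hypothetical names chosen here to mirror the examples given, including the three memory cases.

```python
from enum import Enum


class Stage(Enum):
    CREATION = "creation"
    DEPLOYMENT = "deployment"
    OPERATION = "operation"
    MAINTENANCE = "maintenance"


# Driver labels are illustrative, derived from the examples in the text.
DRIVER_TO_STAGE = {
    "pre_release_data": Stage.CREATION,
    "objective_design": Stage.CREATION,
    "action_abstraction": Stage.CREATION,
    "tool_binding": Stage.DEPLOYMENT,
    "permission_binding": Stage.DEPLOYMENT,
    "session_binding": Stage.DEPLOYMENT,
    "observation_channel_binding": Stage.DEPLOYMENT,
    "persistent_memory_enabled": Stage.DEPLOYMENT,
    "online_trajectory_drift": Stage.OPERATION,
    "memory_misused_in_live_trajectory": Stage.OPERATION,
    "model_update": Stage.MAINTENANCE,
    "environment_change": Stage.MAINTENANCE,
    "ecosystem_change": Stage.MAINTENANCE,
    "memory_regression_after_update": Stage.MAINTENANCE,
}


def attribute(driver: str) -> Stage:
    """Map the primary driver of a failure to its stage of origin."""
    return DRIVER_TO_STAGE[driver]


memory_cases = [
    attribute("persistent_memory_enabled"),
    attribute("memory_misused_in_live_trajectory"),
    attribute("memory_regression_after_update"),
]
```

The three memory cases resolve to three different stages even though they concern the same component, which is the distinction between failure origin and failure manifestation that the framework is meant to preserve.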

### V\-B Creation: Building Priors Before Release

Creation shapes the object that later gets deployed\. It determines what counts as actionable state, what trajectories look normal, and what trade\-offs the agent learns when task success competes with caution, confirmation, or recoverability\. Failures that enter here may remain latent for a long time\. They often surface only after the system is trusted with real authority\. That is precisely why this stage matters\.

#### Training data source and trajectory quality

Training data determines which behaviors and edge cases are even visible to the learner\. Human trajectory corpora such as AITW, web interaction datasets such as Mind2Web, and generalized agent tuning pipelines suggested that CUAs benefit from task\-conditioned action traces rather than generic multimodal pairs\[[110](https://arxiv.org/html/2605.07110#bib.bib20),[25](https://arxiv.org/html/2605.07110#bib.bib19),[162](https://arxiv.org/html/2605.07110#bib.bib22)\]\. Newer resources such as OpenCUA, TongUI, WebChain, MolmoWeb, and OS\-Genesis increase scale, platform diversity, interface coverage, or synthetic trajectory breadth\[[136](https://arxiv.org/html/2605.07110#bib.bib98),[165](https://arxiv.org/html/2605.07110#bib.bib91),[32](https://arxiv.org/html/2605.07110#bib.bib141),[41](https://arxiv.org/html/2605.07110#bib.bib125),[121](https://arxiv.org/html/2605.07110#bib.bib198)\]\. Yet scale alone does not solve the main bias problem\. Human traces often encode hesitation, confirmation, and recovery\. Synthetic or tutorial\-derived traces scale faster, but they can overrepresent clean completion paths and underrepresent ambiguity or conservative stopping\. Later runtime failures such as overconfident continuation or brittle recovery are often seeded here before they are ever observed live\.

#### Grounding supervision and observation priors

Creation also decides what the model is taught to trust as executable state\. Structured supervision can improve precise element grounding, whereas screenshot\-first training improves portability\. Systems such as CogAgent, ScreenAI, SeeClick, WinClick, Ferret\-UI, TRISHUL, and newer screen\-parsing or exploration\-based grounding pipelines suggest that downstream reliability can still depend heavily on that choice\[[47](https://arxiv.org/html/2605.07110#bib.bib14),[6](https://arxiv.org/html/2605.07110#bib.bib16),[22](https://arxiv.org/html/2605.07110#bib.bib17),[51](https://arxiv.org/html/2605.07110#bib.bib88),[157](https://arxiv.org/html/2605.07110#bib.bib31),[117](https://arxiv.org/html/2605.07110#bib.bib1),[43](https://arxiv.org/html/2605.07110#bib.bib2),[33](https://arxiv.org/html/2605.07110#bib.bib5),[161](https://arxiv.org/html/2605.07110#bib.bib83)\]\. Explicit grounding supervision makes the operational target clearer\. Weak supervision can still produce good benchmark performance, but it often leaves the system relying on unstable salience, parser artifacts, or theme\-specific cues that later fail in deployment\.

#### Objective design, reward shaping, and action priors

Objective design determines what the model is implicitly rewarded to optimize\. Work on UI\-R1, GUI\-R1, ProgRM, Web\-Shepherd, UI\-Genie, MagicGUI\-RMS, Video\-Based Reward Modeling, and CUARewardBench suggests that reward\-shaped post\-training and reward\-model design can materially shift action efficiency, verification quality, and task completion behavior\[[90](https://arxiv.org/html/2605.07110#bib.bib89),[92](https://arxiv.org/html/2605.07110#bib.bib90),[169](https://arxiv.org/html/2605.07110#bib.bib195),[11](https://arxiv.org/html/2605.07110#bib.bib196),[145](https://arxiv.org/html/2605.07110#bib.bib194),[76](https://arxiv.org/html/2605.07110#bib.bib192),[119](https://arxiv.org/html/2605.07110#bib.bib140),[80](https://arxiv.org/html/2605.07110#bib.bib193)\]\. The lifecycle implication is an inference rather than a direct claim of those papers: if training increasingly rewards low action count, fast completion, or shortcut\-taking without parallel incentives for confirmation and recoverability, then a convenience bias may be introduced before any live account is touched\. Action abstractions matter for the same reason\. If the action vocabulary normalizes broad, weakly mediated authority, that authority may already be treated as ordinary at creation time\.
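The convenience-bias inference can be illustrated with a toy shaped reward. Everything in this sketch is an assumption made for illustration: the function `shaped_reward`, the coefficients, and the two trajectories are invented here and do not reproduce the reward design of any cited paper. It only shows the direction of the effect when step count is penalized and confirmation is not rewarded.

```python
def shaped_reward(success, steps, confirmed, step_cost=0.05, confirm_bonus=0.0):
    """Toy reward: task success, minus a per-step cost, plus an optional confirmation bonus."""
    base = 1.0 if success else 0.0
    return base - step_cost * steps + (confirm_bonus if confirmed else 0.0)


# Two hypothetical trajectories that both finish the task.
cautious = dict(success=True, steps=6, confirmed=True)    # pauses to confirm with the user
shortcut = dict(success=True, steps=5, confirmed=False)   # skips the confirmation step

# Efficiency-only shaping: the extra confirmation step is pure cost.
efficiency_only = {
    "cautious": shaped_reward(**cautious),
    "shortcut": shaped_reward(**shortcut),
}

# Adding a parallel incentive for confirmation reverses the ordering.
with_confirmation = {
    "cautious": shaped_reward(**cautious, confirm_bonus=0.1),
    "shortcut": shaped_reward(**shortcut, confirm_bonus=0.1),
}
```

Under the efficiency-only setting the shortcut trajectory scores higher, so a learner optimizing it would be nudged away from confirming. With the confirmation term the cautious trajectory scores higher. The specific numbers carry no empirical weight.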

#### Interactive environments as pre\-release stress

Creation is also where the field can expose delayed consequences before release\. Benchmarks such as WebArena, WebForge, Gym\-Anything, ClawBench, WorkArena\+\+, and OSWorld matter not only because they measure capability, but because they reveal multi\-step recovery, delayed effects, open\-environment noise, and broader software coverage while the system is still under development\[[187](https://arxiv.org/html/2605.07110#bib.bib18),[160](https://arxiv.org/html/2605.07110#bib.bib135),[2](https://arxiv.org/html/2605.07110#bib.bib136),[177](https://arxiv.org/html/2605.07110#bib.bib150),[9](https://arxiv.org/html/2605.07110#bib.bib180),[146](https://arxiv.org/html/2605.07110#bib.bib21)\]\. They are best read as pre\-release stress instruments that surface creation\-stage weakness earlier than closed instruction\-following benchmarks can\.

#### Why creation\-stage failures stay hidden

The reason creation matters so much is that its failures often look fine until authority is attached\. A model can appear capable in benchmark loops while already carrying the wrong observation priors, the wrong convenience bias, or the wrong action habits\. The earliest effective control point for creation\-stage failures therefore lies in data design, supervision quality, objective shaping, and safe action abstraction\.

### V\-C Deployment: Binding Capability to Authority

Deployment changes the object of analysis from a capable policy into a consequential system\. The model is now connected to real observation channels, real tools, real sessions, real permissions, and real users\. Capability is not merely exposed to the world here; it is bound to authority\. This is why deployment deserves its own stage rather than being collapsed into generic release engineering\.

#### Observation binding

Observation binding determines what state reaches the model once it leaves the training environment\. Screenshot\-only deployments maximize reach but accept ambiguity, latency, and mixed\-trust content\. Screenshot\-first systems such as Cradle, Mobile\-Agent, Surfer 2, and Mobile\-Agent\-v3\.5 illustrate that reach\[[123](https://arxiv.org/html/2605.07110#bib.bib23),[131](https://arxiv.org/html/2605.07110#bib.bib25),[5](https://arxiv.org/html/2605.07110#bib.bib119),[148](https://arxiv.org/html/2605.07110#bib.bib128)\]\. Controlled environments such as WebArena reduce ambiguity by exposing cleaner state or more deterministic execution paths\[[187](https://arxiv.org/html/2605.07110#bib.bib18)\]\. The key trade\-off is not simply convenience\. It is portability versus controllability\.

#### Tool binding

Tool binding determines how many external interfaces the agent can call and through what mediation\. Systems such as OS\-Copilot, UFO, CoAct\-1, GraphPilot, LiteWebAgent, MAI\-UI, and Step\-GUI illustrate how tool augmentation can expand what CUAs can achieve, especially across applications and action channels\[[144](https://arxiv.org/html/2605.07110#bib.bib26),[166](https://arxiv.org/html/2605.07110#bib.bib30),[118](https://arxiv.org/html/2605.07110#bib.bib7),[158](https://arxiv.org/html/2605.07110#bib.bib114),[168](https://arxiv.org/html/2605.07110#bib.bib179),[186](https://arxiv.org/html/2605.07110#bib.bib132),[149](https://arxiv.org/html/2605.07110#bib.bib133)\]\. At the same time, every new tool call introduces another trust boundary\. Tool outputs influence reasoning, tool invocations affect authority, and the tool ecosystem itself becomes part of the deployment surface\.

#### Permission binding

Permission scope is where abstract capability turns into blast radius\. A model that can view or modify local files, send messages, change settings, or install software is not merely a stronger benchmark policy\. It can function as a live principal acting under inherited authority\. Least privilege, sandboxing, and execution mediation therefore belong to deployment by design, not only to incident response\.
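
As a minimal sketch of least privilege applied at deployment time, the snippet below checks each requested action against an explicit per\-task grant before it reaches the execution layer\. The `PermissionScope` class, the action names, and the example paths are hypothetical illustrations introduced here, not part of any system surveyed in this article\.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class PermissionScope:
    """Hypothetical per-task grant: explicit actions plus allowed path roots."""
    allowed_actions: frozenset
    allowed_roots: tuple = ()

    def permits(self, action: str, target: str = "") -> bool:
        # Deny by default: an action must be granted explicitly.
        if action not in self.allowed_actions:
            return False
        if not target:
            return True
        parts = PurePosixPath(target).parts
        # Reject parent-directory hops instead of trying to resolve them.
        if ".." in parts:
            return False
        # The target must sit under one of the granted roots.
        for root in self.allowed_roots:
            root_parts = PurePosixPath(root).parts
            if parts[:len(root_parts)] == root_parts:
                return True
        return False


# Assumed example grant: read-only access to one invoice folder.
invoice_scope = PermissionScope(
    allowed_actions=frozenset({"read_file"}),
    allowed_roots=("/home/user/invoices",),
)
```

Under this assumed grant, `invoice_scope.permits("read_file", "/home/user/invoices/2024/a.pdf")` holds, while a read under `/home/user/.ssh` or any `delete_file` request is refused\. The point of the sketch is that blast radius is set by the grant, not by how capable the policy is\.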

#### Memory, session, and channel binding

Deployment also decides whether context is ephemeral, task\-local, cross\-session, or persistent across users and channels\. Persistent memory can improve continuity and make assistants feel more competent\. It can also enlarge the security and privacy surface because old state remains operationally available long after the original task boundary has passed\. Personalized and socially embedded benchmarks such as PSPA\-Bench, KnowU\-Bench, and CowCorpus help make that shift visible by foregrounding long\-lived user context, collaborative intervention, and context carry\-over across tasks\[[101](https://arxiv.org/html/2605.07110#bib.bib138),[16](https://arxiv.org/html/2605.07110#bib.bib151),[52](https://arxiv.org/html/2605.07110#bib.bib143)\]\. The same is true of channel exposure\. Messaging routes, automation triggers, browser sessions, shared desktops, and internal tools differ sharply in provenance and trust assumptions\.

#### Why deployment changes the risk profile

Public materials describe OpenClaw as a gateway\-style assistant with persistent ingress, local\-first execution, tool connectivity, and longer\-lived context\[[105](https://arxiv.org/html/2605.07110#bib.bib96),[95](https://arxiv.org/html/2605.07110#bib.bib97)\]\. That motivating deployment setting is used only as an illustrative deployment pattern rather than as verified system evidence: once one deployment surface is presented as binding channels, tools, sessions, and permissions together, deployment configuration can account for a large share of the resulting risk profile\. Comparable design pressures can also be seen in MCP\-enabled assistants, local automation gateways, and other multi\-channel CUA deployments\[[170](https://arxiv.org/html/2605.07110#bib.bib70),[77](https://arxiv.org/html/2605.07110#bib.bib57),[144](https://arxiv.org/html/2605.07110#bib.bib26)\]\. The earliest effective control point for deployment\-stage failures therefore lies in mediation, permission scoping, session isolation, provenance, and safe defaults\.

### V\-D Operation: Runtime Pressure, Mixed Trust, and Oversight

Operation is where the deployed system is stressed in live use\. The architecture has been chosen, authority has been bound, and the agent now has to remain coherent while the environment changes, the task unfolds, and new content enters the loop\. Many failures become visible here even when they did not originate here\. That is why runtime incidents should not automatically be read as runtime causes\.

#### Long\-horizon drift and stale context

One source of operational failure is trajectory degradation\. WorldGUI, MobileUse, ActionEngine, the Amazing Agent Race, ClawBench, MobileWorldBench, and AndroTMem suggest in different ways that step\-wise competence does not automatically produce reliable long\-horizon behavior\[[180](https://arxiv.org/html/2605.07110#bib.bib92),[74](https://arxiv.org/html/2605.07110#bib.bib86),[185](https://arxiv.org/html/2605.07110#bib.bib95),[61](https://arxiv.org/html/2605.07110#bib.bib147),[177](https://arxiv.org/html/2605.07110#bib.bib150),[75](https://arxiv.org/html/2605.07110#bib.bib176),[114](https://arxiv.org/html/2605.07110#bib.bib163)\]\. An agent can keep taking plausible local steps while losing the original goal, working off stale state, or repeating low\-value recovery loops\. This is often the main path by which capable\-looking systems become operationally unreliable\.

#### Mixed\-trust runtime inputs

Another source is mixed\-trust input\. During live use, the agent consumes page text, retrieved snippets, messages, document content, and tool outputs that do not share one trust level\. Runtime operation is therefore where injection and deceptive content stop being abstract evaluation categories and become active task conditions\[[134](https://arxiv.org/html/2605.07110#bib.bib208),[151](https://arxiv.org/html/2605.07110#bib.bib211),[17](https://arxiv.org/html/2605.07110#bib.bib213)\]\.
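
One way to make the mixed\-trust condition concrete is to tag every context item with its provenance and let only user\-authored items carry instructions\. The sketch below assumes that simplification; the `Source` categories mirror the inputs listed above, while the `ContextItem` type and the `partition` helper are hypothetical names introduced for illustration\.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    USER = "user"
    PAGE_TEXT = "page_text"
    RETRIEVED = "retrieved"
    MESSAGE = "message"
    DOCUMENT = "document"
    TOOL_OUTPUT = "tool_output"


@dataclass(frozen=True)
class ContextItem:
    source: Source
    text: str


def partition(items):
    """Split context into goal-bearing instructions and labeled, inert data.

    Assumption: only user-authored items may set or change the goal.
    Everything else is passed on as quoted data with its provenance label.
    """
    instructions = [item.text for item in items if item.source is Source.USER]
    data = [
        f"[{item.source.value}] {item.text}"
        for item in items
        if item.source is not Source.USER
    ]
    return instructions, data
```

Provenance labels of this kind do not by themselves stop a model from following injected text, but they give downstream guardrails and audit layers something explicit to check\.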

#### TOCTOU and delayed execution

Operation also exposes the temporal gap between decision and effect\. Interface state can change after the plan is formed but before the click lands, the macro runs, or the command executes\. When this happens, a locally sensible action can become globally misbound or otherwise unsafe without the goal itself ever changing\. This is one reason execution verification belongs in operation rather than only in design\.
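
A minimal sketch of execution verification against this gap: the plan records a fingerprint of the interface state it depended on, and the executor re\-observes and compares before acting\. The `fingerprint`, `PlannedAction`, and `execute_if_unchanged` names, and the dictionary representation of UI state, are assumptions made for illustration\.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


def fingerprint(ui_state: dict) -> str:
    """Stable digest of the UI fields the decision depended on."""
    canonical = "|".join(f"{key}={ui_state[key]}" for key in sorted(ui_state))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class PlannedAction:
    name: str
    expected_fingerprint: str  # captured when the plan was formed


def execute_if_unchanged(
    action: PlannedAction,
    observe: Callable[[], dict],
    perform: Callable[[str], None],
) -> str:
    """Re-observe just before acting; hand back to the planner on mismatch."""
    if fingerprint(observe()) != action.expected_fingerprint:
        return "replan"
    perform(action.name)
    return "executed"
```

This narrows the window between check and use but does not close it, since state can still change between the final observation and the action itself\. Where the platform allows it, binding the action to a stable element identifier rather than to screen coordinates reduces the remaining exposure\.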

#### User\-intent dilution in persistent workflows

Once sessions persist across tasks, channels, or users, the system can blur ownership and scope\. Personalized long\-lived evaluations and deployment\-oriented analyses are consistent with that concern in settings where memory, task identity, and ingress are modeled as longer\-lived than a single interaction\[[138](https://arxiv.org/html/2605.07110#bib.bib103),[139](https://arxiv.org/html/2605.07110#bib.bib101)\]\. The issue is not only security\. It is whether user intent remains the dominant organizing constraint in a long\-lived runtime context\.

#### Why runtime control must be explicit

These pressures are why operation needs an explicit oversight ladder rather than vague references to “human in the loop\.” In increasing order of control strength, that ladder typically includes logging only, post\-action verification, plan preview, step confirmation, and high\-impact approval\. Systems such as CORA further suggest that this runtime control point can be operationalized through calibrated execute\-versus\-abstain decisions under an explicit risk budget rather than only through informal approval heuristics\[[34](https://arxiv.org/html/2605.07110#bib.bib149)\]\. The correct runtime control depends on blast radius, not only on task difficulty\. The earliest effective control point for operation\-stage failures therefore lies in verification, scoped memory, calibrated escalation, and impact\-aware oversight\.
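
The ladder can be sketched as an ordered enumeration selected by blast radius\. The five oversight levels follow the order given above; the `BlastRadius` categories and the specific mapping between the two are illustrative assumptions, not a calibrated policy such as the risk\-budgeted approach attributed to CORA\.

```python
from enum import IntEnum


class Oversight(IntEnum):
    # Ordered by increasing control strength, as in the text.
    LOG_ONLY = 1
    POST_ACTION_VERIFICATION = 2
    PLAN_PREVIEW = 3
    STEP_CONFIRMATION = 4
    HIGH_IMPACT_APPROVAL = 5


class BlastRadius(IntEnum):
    # Hypothetical impact categories for illustration.
    READ_ONLY = 0
    LOCAL_REVERSIBLE = 1
    LOCAL_IRREVERSIBLE = 2
    EXTERNAL_EFFECT = 3   # e.g. sending a message or uploading a file
    PRIVILEGED = 4        # e.g. payment, credential entry, permission change


# Assumed mapping: stronger control as potential impact grows.
_LADDER = {
    BlastRadius.READ_ONLY: Oversight.LOG_ONLY,
    BlastRadius.LOCAL_REVERSIBLE: Oversight.POST_ACTION_VERIFICATION,
    BlastRadius.LOCAL_IRREVERSIBLE: Oversight.PLAN_PREVIEW,
    BlastRadius.EXTERNAL_EFFECT: Oversight.STEP_CONFIRMATION,
    BlastRadius.PRIVILEGED: Oversight.HIGH_IMPACT_APPROVAL,
}


def required_oversight(
    radius: BlastRadius, floor: Oversight = Oversight.LOG_ONLY
) -> Oversight:
    """Pick the control level from impact; a deployment may raise the floor."""
    return max(_LADDER[radius], floor)
```

Keeping task difficulty out of the function signature is deliberate in this sketch: a trivial task with a privileged effect still lands on the top rung\.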

### V\-E Maintenance: Preserving Validity After Release

Maintenance keeps the deployment from silently decaying\. A released CUA does not operate against a frozen environment\. Models are adapted, interfaces drift, permissions change, tools are updated, and ecosystems evolve around the base system\. Treating maintenance as a generic “update” phase hides too much of that complexity\. In practice, maintenance governs whether earlier assurance survives change\.

#### Model maintenance

Model maintenance concerns adaptation, retraining, evaluator changes, and post\-release policy updates\. Work such as MAGNET, UI\-Oceanus, and PolySkill points to the difficulty of updating interface\-specific competence without erasing stable knowledge or reintroducing old failure modes\[[120](https://arxiv.org/html/2605.07110#bib.bib93),[140](https://arxiv.org/html/2605.07110#bib.bib129),[159](https://arxiv.org/html/2605.07110#bib.bib120)\]\. Maintenance is therefore not only about getting a stronger model\. It is about preserving the validity of earlier safety and reliability gains under change\.

#### Environment maintenance

Environment maintenance concerns everything outside the model that changes underneath it: UI redesign, localization drift, new permission prompts, tool API changes, and altered workflow ordering\. Benchmarks such as MemGUI\-Bench, TimeWarp, Risky\-Bench, WebForge, Vision2Web, WebTestBench, GUITester, and OpeFlo matter here because they illustrate how quickly evaluation assumptions can become stale once the deployment environment moves on\[[83](https://arxiv.org/html/2605.07110#bib.bib94),[53](https://arxiv.org/html/2605.07110#bib.bib230),[183](https://arxiv.org/html/2605.07110#bib.bib100),[160](https://arxiv.org/html/2605.07110#bib.bib135),[46](https://arxiv.org/html/2605.07110#bib.bib159),[63](https://arxiv.org/html/2605.07110#bib.bib160),[37](https://arxiv.org/html/2605.07110#bib.bib117),[122](https://arxiv.org/html/2605.07110#bib.bib112)\]\. Continuous re\-evaluation is therefore a basic operational requirement rather than a benchmarking luxury\.

#### Ecosystem maintenance

Open deployment adds a third maintenance problem: the surrounding ecosystem changes independently of both model and UI\. Plugins, skills, registries, agent identities, watcher layers, and sharing channels all create new drift paths\. Your Agent, Their Asset suggests that persistent capability, identity, and knowledge can remain exposed after deployment, while ClawKeeper focuses more directly on skills, plugins, and watcher layers\[[139](https://arxiv.org/html/2605.07110#bib.bib101),[86](https://arxiv.org/html/2605.07110#bib.bib102)\]\. PASB adds a complementary evaluation frame for long\-lived personalized agents rather than a direct extension\-governance mechanism\[[138](https://arxiv.org/html/2605.07110#bib.bib103)\]\. In open CUA systems, maintenance can become inseparable from governance\.

#### Why maintenance governs trust continuity

The key implication is that maintenance governs not only quality, but trust continuity\. If models, tools, and registries evolve faster than the assurance stack around them, then the deployment can reopen risks even when the base architecture is unchanged\. The earliest effective control point for maintenance\-stage failures therefore lies in continual evaluation, controlled adaptation, extension governance, and coordinated patching\.
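
One simple way to detect that the assurance stack has fallen behind is to record which component versions were actually evaluated and compare them with what is deployed\. The sketch below assumes versions are tracked as plain strings per component; the function names and the example component names are hypothetical\.

```python
def stale_components(evaluated: dict, deployed: dict) -> list:
    """Deployed components whose current version was never evaluated."""
    return sorted(
        name
        for name, version in deployed.items()
        if name not in evaluated or evaluated[name] != version
    )


def assurance_is_current(evaluated: dict, deployed: dict) -> bool:
    """True only if every deployed component matches its evaluated version."""
    return not stale_components(evaluated, deployed)


# Assumed example: a tool was updated and a plugin was added after evaluation.
evaluated = {"model": "v3", "browser_tool": "1.4.0"}
deployed = {"model": "v3", "browser_tool": "1.5.0", "calendar_plugin": "0.2.1"}
```

In this example `stale_components(evaluated, deployed)` returns `["browser_tool", "calendar_plugin"]`, which marks exactly the parts of the deployment whose earlier assurance no longer applies even though the model itself is unchanged\.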

TABLE IV: Lifecycle diagnostic and intervention map for deployed CUAs\. The table links stage\-specific failure patterns to their first observable manifestations, likely diagnosis targets, and early intervention priorities\.

### V\-F Lifecycle Coupling and Its Analytical Payoff

The four stages are distinct, but not independent\. Creation shapes the priors that deployment later binds to authority\. Deployment determines which runtime mistakes can become consequential\. Operation reveals which assumptions fail under live stress\. Maintenance determines whether earlier fixes remain valid as the world changes\. Table [IV](https://arxiv.org/html/2605.07110#S5.T4) serves as an operational troubleshooting aid: start from the symptom, then work backward to the most likely stage\-level diagnosis and earliest useful intervention surface\.

The larger lesson is that CUA reliability cannot be inferred from architecture alone\. It depends on when capability is formed, when authority is attached, when trajectories are stressed, and when the system is re\-evaluated after change\. Lifecycle analysis is therefore not supplementary to the architecture chapter\. It is the temporal backbone that explains why similar visible failures can demand very different interventions\.

## VI Security and Privacy Analysis

Security is one of the clearest reasons CUAs need the joint framework developed here\. These systems read mixed\-trust content, retain state across steps, and act through interfaces that may reach files, accounts, tools, and external services\. A useful security account must therefore answer three questions at once: where corruption enters, how it becomes operational, and at what lifecycle stage it could have been constrained earlier\. Attack names alone are not enough\.

The recent CUA security corpus reflects that widening scope\. It spans visual prompt injection, harmful\-task benchmarking, action rebinding, adversarial backdoors, runtime guardrails, permission scoping, dark\-pattern manipulation, privacy\-focused evaluation, and runtime monitoring or mediation\[[10](https://arxiv.org/html/2605.07110#bib.bib207),[93](https://arxiv.org/html/2605.07110#bib.bib200),[18](https://arxiv.org/html/2605.07110#bib.bib216),[94](https://arxiv.org/html/2605.07110#bib.bib220),[24](https://arxiv.org/html/2605.07110#bib.bib219),[72](https://arxiv.org/html/2605.07110#bib.bib202),[48](https://arxiv.org/html/2605.07110#bib.bib206),[134](https://arxiv.org/html/2605.07110#bib.bib208),[3](https://arxiv.org/html/2605.07110#bib.bib210),[181](https://arxiv.org/html/2605.07110#bib.bib164),[135](https://arxiv.org/html/2605.07110#bib.bib204)\]\. Safety benchmarks such as MobileSafetyBench, ST\-WebAgentBench, OS\-BLIND, and RiosWorld further indicate that harmful\-task completion, policy\-noncompliant behavior, and benign\-intent failure have become explicit evaluation targets rather than incidental by\-products of general task completion\[[69](https://arxiv.org/html/2605.07110#bib.bib225),[70](https://arxiv.org/html/2605.07110#bib.bib226),[26](https://arxiv.org/html/2605.07110#bib.bib146),[150](https://arxiv.org/html/2605.07110#bib.bib48)\]\. That literature expands the attack inventory and suggests that risk conditions can enter through different layers and become visible at different stages\.

This section therefore proceeds in three steps\. It first fixes a threat model\. It then introduces a working taxonomy of *input\-side corruption*, *decision\-side corruption*, and *execution\-side abuse*, together with a heuristic attribution lens of *scope overreach* \(SO\), *objective corruption* \(OC\), and *environmental misbinding* \(EM\)\. Finally, it maps those patterns back onto the lifecycle so that control placement remains tied to timing rather than only to attack naming\.

### VI\-A Threat Model

The baseline attacker may control any combination of *screen content*, *retrieved content*, and *tool outputs*\. This includes attacker\-controlled or deceptive pages, rendered documents, OCR\-visible instructions, search results, clipboard contents, agent messages, and tool or MCP responses consumed during execution\[[27](https://arxiv.org/html/2605.07110#bib.bib52),[57](https://arxiv.org/html/2605.07110#bib.bib40),[35](https://arxiv.org/html/2605.07110#bib.bib46),[171](https://arxiv.org/html/2605.07110#bib.bib47),[127](https://arxiv.org/html/2605.07110#bib.bib68),[150](https://arxiv.org/html/2605.07110#bib.bib48)\]\. In creation\-stage settings, the attacker may additionally influence training data or reward signals\. In open deployments, the attacker may also reach the system through persistent communication channels, shared tools, or capability\-sharing surfaces\.

Protected assets include execution integrity, data confidentiality, and authority boundaries around files, credentials, tools, networks, and operating\-system functions\. Some realistic CUA deployments expose high\-impact actions such as external messaging, deletion, payment, credential entry, or permission changes\. The threat model does *not* assume that every deployment enforces a strong execution\-policy layer\. Some systems can insert sandboxing, approval hooks, or action gating\[[18](https://arxiv.org/html/2605.07110#bib.bib216),[34](https://arxiv.org/html/2605.07110#bib.bib149)\]\. Others expose the model more directly to the execution surface\. That distinction matters because the same planning error can be much harder to contain once runtime mediation is weak\.

Three trust boundaries recur throughout the section: the boundary between user\-authored intent and runtime content, the boundary between trusted execution channels and external tools or services, and the boundary between task\-local memory and persistent cross\-session state\. Weakness at any of those boundaries can turn seemingly ordinary autonomy into unauthorized execution or disclosure\.

Figure [3](https://arxiv.org/html/2605.07110#S6.F3) presents the stage\-first version of that argument\. The upper band identifies the salient threat surfaces at each stage, the middle band identifies the system object under configuration or stress, and the lower band shows the control surfaces that can still act early enough to matter\. The figure therefore links the lifecycle analysis of Section [V](https://arxiv.org/html/2605.07110#S5) to the more detailed threat discussion below\.

![Refer to caption](https://arxiv.org/html/2605.07110v1/x1.png)

Figure 3: Lifecycle\-aligned CUA threats and controls\. The figure serves as an intervention map: for each stage, it identifies the salient threat surfaces, the system object under pressure, and the earliest practical control surfaces that may still alter the eventual outcome\.

### VI\-B Structural Threat Taxonomy and Attribution Lens

The working taxonomy answers *where* corruption enters the loop\. The heuristic attribution lens answers *how* a risk becomes operational\. These two views are complementary rather than competing\. Together they provide an organizing vocabulary rather than a definitive or exhaustive classification\.

Input\-side corruption covers cases where the observation stream itself is adversarial or misleading\. Indirect prompt injection, deceptive UI, attacker\-controlled documents, hostile retrieval results, and poisoned tool outputs all fall in this family\. The corruption enters through what the agent reads\.

Decision\-side corruption covers cases where the agent’s working objective, planning substrate, or memory becomes attacker\-steered or otherwise compromised\. Poisoned memory, attacker\-steered subplans, and latent training\-time triggers whose effect appears during inference all belong here\. The corruption enters through what the agent optimizes or retains\.

Execution\-side abuse covers cases where the interface translating intent into action becomes the main risk surface\. Over\-privilege, weak isolation, action rebinding, TOCTOU, and unsafe tool invocation all fall in this family\. The corruption enters through how the chosen action is operationalized\.

The attribution lens asks a different question\. Scope overreach is used when the agent expands beyond the user’s intended task boundary without strong evidence that the operative goal itself has been replaced\. Objective corruption is used when attacker\-controlled content, memory, tooling, or training signal becomes the dominant planning objective\. Environmental misbinding is used when the apparent goal remains intact, but the observation, execution, or authority layer redirects the realized outcome in unsafe ways\. Some incidents may involve more than one dominant attribution pathway\.

#### Operational decision rule

The attribution lens is applied pragmatically\. If attacker\-controlled state has become the operative goal, the dominant class is objective corruption\. If the goal appears intact but the observation–action or authority binding redirects the outcome, the dominant class is environmental misbinding\. If neither condition clearly holds and the main problem is task\-boundary expansion or unnecessary data or action collection, the dominant class is scope overreach\. The purpose of the rule is to support control placement, not to eliminate every ambiguous case\.
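
The rule can be written as an ordered check over three analyst findings\. The sketch below is one possible encoding under that assumption: the `IncidentFindings` fields and the `UNDETERMINED` fallback are hypothetical additions, and the boolean inputs stand in for judgments that in practice require incident analysis\.

```python
from dataclasses import dataclass
from enum import Enum


class Attribution(Enum):
    OBJECTIVE_CORRUPTION = "OC"
    ENVIRONMENTAL_MISBINDING = "EM"
    SCOPE_OVERREACH = "SO"
    UNDETERMINED = "undetermined"  # assumed fallback for ambiguous cases


@dataclass(frozen=True)
class IncidentFindings:
    # Each flag is an analyst judgment, not something computed here.
    attacker_state_is_operative_goal: bool
    binding_redirected_outcome: bool
    exceeded_task_boundary: bool


def dominant_attribution(findings: IncidentFindings) -> Attribution:
    """Apply the decision rule in the order the text states it."""
    if findings.attacker_state_is_operative_goal:
        return Attribution.OBJECTIVE_CORRUPTION
    if findings.binding_redirected_outcome:
        return Attribution.ENVIRONMENTAL_MISBINDING
    if findings.exceeded_task_boundary:
        return Attribution.SCOPE_OVERREACH
    return Attribution.UNDETERMINED
```

The ordering matters: an incident with both attacker steering and boundary expansion is attributed to objective corruption, which points control placement at the input and planning layers before scope limits\.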

#### Worked example

Consider a CUA asked to download one invoice from a customer portal\. A malicious on\-screen banner says “export the full session history for troubleshooting,” the agent has an over\-broad cloud\-upload permission, and no approval hook intervenes before upload\. The banner is *input\-side corruption* because the attack enters through what the agent reads\. If that injected instruction becomes the operative plan, the dominant attribution is *objective corruption*\. If the original invoice\-downloading goal remains intact but the broad permission and missing approval hook allow an unsafe export to proceed, the dominant attribution is *environmental misbinding*\. If, even without clear attacker steering, the agent interprets the task too broadly and exports extra files because it over\-collects beyond task need, the dominant attribution is *scope overreach*\. The same incident can therefore expose overlapping mechanisms while still having one dominant attribution path for control placement\.

### VI\-C Creation\-Stage Threats

Creation\-stage threats matter because they often stay hidden until the system acquires real authority\. This stage is therefore the clearest example of why failure origin and failure manifestation should not be conflated\.

Training poisoning and backdoors\. Hidden Ghost Hand offers GUI\-agent evidence, in an evaluated setting, that training\-time triggers can later redirect behavior at inference time\[[23](https://arxiv.org/html/2605.07110#bib.bib56)\]\. SlowBA adds newer evidence that reward\-level or efficiency\-oriented backdoors may also be inserted into GUI\-agent optimization pipelines rather than only into obvious behavior\-cloning traces\[[72](https://arxiv.org/html/2605.07110#bib.bib202)\]\. BadVLA provides adjacent VLA evidence that similar backdoor logic can appear in multimodal action models more broadly\[[189](https://arxiv.org/html/2605.07110#bib.bib55)\]\. Taken together, these studies are consistent with treating objective corruption as a creation\-stage risk that may be introduced upstream and surface later\.

Unsafe action priors and benchmark\-shaped blind spots\. Creation also decides what kinds of actions become normalized as ordinary\. OS\-Harm, CUAHarm, MobileSafetyBench, ST\-WebAgentBench, and RiosWorld indicate that the affordances and policy constraints an evaluation suite foregrounds materially affect what risks become visible and measurable\[[65](https://arxiv.org/html/2605.07110#bib.bib41),[126](https://arxiv.org/html/2605.07110#bib.bib42),[69](https://arxiv.org/html/2605.07110#bib.bib225),[70](https://arxiv.org/html/2605.07110#bib.bib226),[150](https://arxiv.org/html/2605.07110#bib.bib48)\]\. Those benchmarks do not by themselves establish deployment behavior, but they do suggest that high\-authority operations can be normalized in the development loop before corresponding safety abstractions are in place\.

The main security point is therefore upstream: some of the most consequential later failures may already be latent at creation time\. Failures that appear at runtime can often be read more accurately as creation\-stage objective or action\-space problems that only become visible once the system is trusted with real authority\.

### VI\-D Deployment\-Stage Threats

Deployment\-stage threats appear when learned capability is connected to real tools, permissions, and ingress routes\. The central question at this stage is whether capability is bound to authority faster than it is bound to control\.

Over\-privilege and weak isolation\. CaMeLs, CSAgent, and CellMate illustrate a shared deployment concern: in the studied settings, once a CUA is granted broad access without strong mediation, even moderate reasoning mistakes can lead to higher\-impact outcomes\[[35](https://arxiv.org/html/2605.07110#bib.bib46),[38](https://arxiv.org/html/2605.07110#bib.bib44),[96](https://arxiv.org/html/2605.07110#bib.bib64)\]\. These cases are best read as environmental misbinding\. The model may not be pursuing an adversarial goal, but the authority configuration allows a local error to become a consequential one\.

Supply\-chain risk in tools, skills, and MCP servers\. Les Dissonances and MCP Security Bench suggest that weakly governed tool or extension ecosystems create a second deployment path to failure\[[77](https://arxiv.org/html/2605.07110#bib.bib57),[170](https://arxiv.org/html/2605.07110#bib.bib70)\]\. The two papers support different parts of the claim\. Les Dissonances highlights cross\-tool control\-flow corruption in multi\-tool agents\. MCP Security Bench suggests that tool descriptions, standardized metadata, and protocol\-level interfaces can carry attacker influence into planning and invocation\. Together they motivate treating the tool layer as a deployment\-time trust choice rather than as a neutral capability add\-on\.

Insufficient deployment\-grounded evaluation\. RedTeamCUA and HackWorld suggest that many hybrid web–OS attack paths and exploit\-oriented failure modes can remain invisible in narrower evaluation settings\[[78](https://arxiv.org/html/2605.07110#bib.bib76),[111](https://arxiv.org/html/2605.07110#bib.bib205)\]\. The broader lesson is methodological: when a system goes live without testing the actual coupling among observation, tool outputs, permissions, and oversight, the release process itself can become a security weakness\.

Taken together, deployment\-stage threats can be read as forms of misbound authority\. They arise when tools, channels, permissions, and sessions are opened before provenance, isolation, and mediation are strong enough to constrain them\.

### VI\-E Operation\-Stage Threats

Operation is where many CUA attack settings become consequential because mixed\-trust inputs, live authority, and long\-horizon planning coexist in one loop\. OS\-BLIND is a useful complement here because it suggests that higher\-risk operational outcomes can also emerge under benign user instructions once contextual cues and subtask decomposition obscure harm\[[26](https://arxiv.org/html/2605.07110#bib.bib146)\]\. The order below follows the typical path by which corruption becomes consequential: from what the agent reads, to what it retains or optimizes, to how it acts, and finally to how risk\-bearing behavior may travel beyond one task instance\.

Input\-side corruption: injection and deceptive interfaces\. Indirect prompt injection and deceptive UI attacks remain central because they exploit the same content\-readiness that makes CUAs useful\. InjecAgent, GhostEI\-Bench, Chameleon, WebInject, attacker\-controlled image\-patch attacks, and active environmental injection studies provide evidence that attacker\-controlled instructions or cues can be embedded in pages, screenshots, documents, mobile notifications, or on\-screen image regions in ways that influence downstream action in evaluated settings\[[164](https://arxiv.org/html/2605.07110#bib.bib53),[13](https://arxiv.org/html/2605.07110#bib.bib60),[176](https://arxiv.org/html/2605.07110#bib.bib43),[134](https://arxiv.org/html/2605.07110#bib.bib208),[3](https://arxiv.org/html/2605.07110#bib.bib210),[17](https://arxiv.org/html/2605.07110#bib.bib213)\]\. WAInjectBench and WASP add complementary evaluation evidence that these failures remain measurable in web\-agent settings\[[87](https://arxiv.org/html/2605.07110#bib.bib77),[30](https://arxiv.org/html/2605.07110#bib.bib49)\]\. EIA sharpens the privacy angle by suggesting that environmental injection can steer web agents toward leakage of user information rather than only generic task derailment\[[79](https://arxiv.org/html/2605.07110#bib.bib227)\]\. Benchmarks centered on semantic\-level UI injection, task\-redirection, and dark\-pattern manipulation further suggest that attacker\-controlled input can be persuasive or attention\-capturing rather than explicitly imperative\[[154](https://arxiv.org/html/2605.07110#bib.bib214),[64](https://arxiv.org/html/2605.07110#bib.bib218),[29](https://arxiv.org/html/2605.07110#bib.bib222)\]\. Depending on how the attack succeeds, the dominant attribution may be objective corruption or environmental misbinding\.

Defensive monitoring and mitigation\. The corresponding defense literature also makes clear that prompt\-injection resilience is not one thing\. In\-context defenses, localized attack detection, adversarial safety training, real\-time audit frameworks, diagnostic guardrails, and policy\-reasoning layers all aim at different points in the pipeline\[[151](https://arxiv.org/html/2605.07110#bib.bib211),[135](https://arxiv.org/html/2605.07110#bib.bib204),[85](https://arxiv.org/html/2605.07110#bib.bib203),[48](https://arxiv.org/html/2605.07110#bib.bib206),[81](https://arxiv.org/html/2605.07110#bib.bib66),[19](https://arxiv.org/html/2605.07110#bib.bib65)\]\. That diversity is itself informative for the framework developed here: it suggests that input\-side corruption is unlikely to be fully managed at one layer or one stage\.

Decision\-side corruption: memory poisoning and long\-horizon steering\. Once the agent maintains persistent state, poisoning memory or long\-horizon planning can become as valuable as poisoning a single prompt\. AgentPoison offers direct evidence for memory or knowledge\-base poisoning\[[20](https://arxiv.org/html/2605.07110#bib.bib54)\]\. LPS\-Bench indicates that planning\-time safety awareness can erode over long trajectories even when the attack surface is framed at the planning layer rather than as explicit memory poisoning\[[15](https://arxiv.org/html/2605.07110#bib.bib67)\]\. PASB adds complementary evidence that personalized long\-horizon deployments create broader attack surfaces in which attacker\-favored behavior can persist across realistic toolchains and contexts\[[138](https://arxiv.org/html/2605.07110#bib.bib103)\]\. Preference\-redirection and benign\-input elicitation studies add a related warning: the operative objective may be steered without always looking like a classic memory\-poisoning event\[[113](https://arxiv.org/html/2605.07110#bib.bib199),[58](https://arxiv.org/html/2605.07110#bib.bib215)\]\. Some cases more clearly replace the operative goal; others expand beyond the intended task boundary while still appearing superficially helpful\.

Execution\-side abuse: TOCTOU and action rebinding\. Zero\-Permission Manipulation and AgentHazard illustrate what can happen when the environment changes after the decision is formed but before execution lands\[[108](https://arxiv.org/html/2605.07110#bib.bib58),[84](https://arxiv.org/html/2605.07110#bib.bib59)\]\. These are representative cases of environmental misbinding\. The goal may remain the same, yet the realized outcome is redirected by timing and interface instability\.

Sharing and coordination surfaces\. Open agent ecosystems introduce a final operational surface: risk\-bearing content or behavior can move through interaction and sharing\. Prompt Infection and broader multi\-agent security work suggest that prompts, coordination patterns, or other risk\-bearing behaviors can propagate beyond one\-shot exploitation in multi\-agent settings\[[67](https://arxiv.org/html/2605.07110#bib.bib72),[107](https://arxiv.org/html/2605.07110#bib.bib74)\]\. Agent communities such as the Moltbook setting studied in \[[14](https://arxiv.org/html/2605.07110#bib.bib71)\] make those sharing channels easier to discuss as a deployment concern, even though that paper is not itself a security evaluation\. The implication is not that every capability\-sharing surface is attacker\-controlled by default\. It is that coordination itself becomes part of the trust boundary\.

The operational pattern is cumulative\. Attacker\-controlled or misleading content can enter through observation, distort memory or planning, ride a high\-authority execution channel, and then spread through coordination or sharing surfaces if no runtime control interrupts the chain\.

### VI\-F Privacy as a Parallel Objective

Privacy deserves separate treatment because it is not reducible to successful attack detection\. Many privacy failures can occur even when the system is functioning as designed, yet retains too much, reads too broadly, or exports state through tools and telemetry in ways that exceed user expectations or task scope\.

The major privacy surfaces recur across current CUA designs\. Credential exposure may arise whenever passwords, session tokens, or secrets become visible in the observation loop\. Cross\-session leakage may arise when state learned in one task influences another\. Memory retention risk may arise when the system keeps or retrieves more data than the task requires, even without adversarial prompting\[[182](https://arxiv.org/html/2605.07110#bib.bib50),[132](https://arxiv.org/html/2605.07110#bib.bib61)\]\. Screenshot\-, OCR\-, and retrieval\-induced disclosure may arise because incidental interface content can still be operationalized once it is observed\. Third\-party tool telemetry may arise when the tool ecosystem transmits or logs sensitive state beyond the base model’s local context\[[77](https://arxiv.org/html/2605.07110#bib.bib57),[139](https://arxiv.org/html/2605.07110#bib.bib101)\]\. Privacy\-preserving deployment frameworks such as CORE further suggest that even ordinary inference routing can change how much UI state is exposed upstream or retained externally\[[31](https://arxiv.org/html/2605.07110#bib.bib121)\]\. GUIGuard frames this granularity more explicitly by separating privacy recognition, privacy protection, and protected task execution in cross\-platform GUI settings\[[137](https://arxiv.org/html/2605.07110#bib.bib231)\]\. EIA complements that picture by suggesting that hidden environmental content can induce targeted privacy leakage even when the user\-facing task appears ordinary\[[79](https://arxiv.org/html/2605.07110#bib.bib227)\]\. Finally, data minimization and retention policy help determine whether any of these exposures become persistent\.
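
As a rough illustration of how credential exposure and memory retention might be limited before observations reach persistent state, the Python sketch below redacts secret\-like strings and keeps a bounded, session\-scoped memory\. The regular expressions and the `SessionMemory` class are assumptions made for this example, not mechanisms taken from the cited work, and pattern matching of this kind would miss many real secrets\.

```python
import re
from collections import deque
from typing import List

# Illustrative patterns only (assumption); a real deployment would need a vetted detector.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(?:password|passwd|token|api[_-]?key|secret)\b\s*[:=]\s*\S+"),
    re.compile(r"\bBearer\s+[A-Za-z0-9._\-]+"),
]


def redact(text: str) -> str:
    """Mask secret-like substrings before the text is stored or forwarded."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


class SessionMemory:
    """Hypothetical memory that is bounded in size and cleared at session end."""

    def __init__(self, max_items: int = 50) -> None:
        self._items = deque(maxlen=max_items)  # oldest entries are dropped first

    def remember(self, observation: str) -> None:
        self._items.append(redact(observation))

    def recall(self) -> List[str]:
        return list(self._items)

    def end_session(self) -> None:
        # Prevents state learned in one task from leaking into the next.
        self._items.clear()
```

Redaction at write time addresses only what is stored; screenshots or tool telemetry that bypass this path would need their own controls\.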

Privacy failures map differently onto the attribution lens than many attack cases do\. Some are straightforward cases of scope overreach, where the system over\-collects or over\-remembers while still “trying” to help\. Others may be better read as objective corruption when attacker\-controlled content induces disclosure or exfiltration\. The important point is that many privacy controls sit earlier than generic attack detection: in memory scope, screenshot discipline, retention policy, secret handling, and tool\-telemetry governance\.

### VI\-G Maintenance\-Stage Threats

Maintenance\-stage threats are what remain after the first release\. They matter because a deployed CUA is not static\. Interfaces change, registries evolve, extensions are updated, and controls that were once adequate can silently become stale\.

Regression and stale assurance\. TimeWarp and Risky\-Bench suggest that deployment\-grounded evaluation can reveal failure modes that narrower or more static settings miss, especially once interfaces and workflows evolve across versions\[[53](https://arxiv.org/html/2605.07110#bib.bib230),[183](https://arxiv.org/html/2605.07110#bib.bib100)\]\. The lesson is broader than one benchmark: without re\-evaluation and release gating after change, old controls can decay faster than deployment teams notice\.
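
Release gating after change can be sketched as a comparison between a stored baseline and a fresh evaluation run\. The Python example below is a minimal sketch under stated assumptions: the suite names, the score scale \(higher is better\), and the `max_drop` tolerance are hypothetical and are not drawn from TimeWarp or Risky\-Bench\.

```python
from typing import Dict, List, Tuple


def release_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: float = 0.02,
) -> Tuple[bool, List[str]]:
    """Block a release if any baseline suite regressed or was not re-run.

    Scores are assumed to be in [0, 1] with higher meaning better.
    """
    failures: List[str] = []
    for suite, base_score in baseline.items():
        new_score = candidate.get(suite)
        if new_score is None:
            # A suite that silently disappears is treated as a failure.
            failures.append(f"{suite}: not re-evaluated")
        elif base_score - new_score > max_drop:
            failures.append(f"{suite}: {base_score:.2f} -> {new_score:.2f}")
    return (not failures, failures)
```

Treating a missing suite as a failure reflects the stale\-assurance concern: a control that is no longer measured cannot be assumed to still hold\.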

Persistent\-state abuse and ecosystem drift\. Your Agent, Their Asset and ClawKeeper suggest more directly that post\-release security can be governed as much by persistent memory, extension hygiene, identity, and watcher or registry discipline as by base policy alone\[[139](https://arxiv.org/html/2605.07110#bib.bib101),[86](https://arxiv.org/html/2605.07110#bib.bib102)\]\. PASB complements that picture by indicating how personalized long\-lived deployments can reopen these risks in evaluation settings that are closer to real use\[[138](https://arxiv.org/html/2605.07110#bib.bib103)\]\. The same maintenance problem applies to the assurance stack itself: policy reasoners and diagnostic defenses such as ShieldAgent, AgentSentinel, AgentDoG, WebSentinel, and DMAST only help if their detectors remain synchronized with evolving interfaces and attack patterns\[[19](https://arxiv.org/html/2605.07110#bib.bib65),[48](https://arxiv.org/html/2605.07110#bib.bib206),[81](https://arxiv.org/html/2605.07110#bib.bib66),[135](https://arxiv.org/html/2605.07110#bib.bib204),[85](https://arxiv.org/html/2605.07110#bib.bib203)\]\. In open deployments, maintenance becomes the stage at which governance either keeps pace with capability diffusion or falls behind it\.

Table [V](https://arxiv.org/html/2605.07110#S6.T5) compresses the chapter into one lifecycle\-grounded reference map\. It is intentionally selective rather than exhaustive: the evidence column names anchor papers, while the analytical value of the table comes from locating each risk by entry surface, operative mechanism, and lifecycle stage\.

TABLE V: Working lifecycle\-grounded security and privacy reference map for CUAs\. Each row identifies where a risk condition enters, how it becomes operational, and one or two anchor papers\. “SO” = scope overreach, “OC” = objective corruption, and “EM” = environmental misbinding\.

Taken together, the security and privacy analysis reinforces the organizing view developed here\. CUA failures become more informative when read by where corruption entered, how it became operational, and at which lifecycle stage earlier controls might still have changed the outcome\. Security is therefore not peripheral to the architecture–lifecycle framework; it is one of the clearer domains in which that framework helps organize evidence and control placement\.

## VII From Threats to Controls: Practical Security Implications

The analysis so far suggests a simple rule: the best control is usually the earliest one that still acts on the relevant failure mechanism\. In deployed CUAs, that means control placement must follow both layer and stage\. Creation shapes priors\. Deployment binds authority\. Operation determines whether higher\-risk trajectories complete\. Maintenance determines whether earlier gains survive environmental and ecosystem drift\. A flat checklist cannot capture those differences\.

This section converts the preceding analysis into governance implications\. Its aim is not to catalogue every possible defense\. Instead, it identifies which control surfaces remain meaningful once capability, authority, and mixed\-trust content are already interacting in a live system\. The broader implication is that open deployment is better understood through a deployment\-oriented control framework than through model scaling alone\.

### VII\-A Lifecycle\-Aligned Control Surfaces

The most defensible control surfaces line up with the four lifecycle stages because each stage changes a different object\. Creation changes priors\. Deployment changes authority binding\. Operation changes the active trajectory\. Maintenance changes whether the whole stack remains valid after release\. Organizing controls by these stages is therefore more informative than organizing them only by tool category or attack name\.

Design\-time controls\. At creation time, the relevant controls are those that shape priors before the system acquires live authority: grounding supervision, safe action abstractions, release gating on harmful\-task suites, and policy design that values confirmation and recoverability rather than completion alone\[[65](https://arxiv.org/html/2605.07110#bib.bib41),[126](https://arxiv.org/html/2605.07110#bib.bib42),[97](https://arxiv.org/html/2605.07110#bib.bib63),[80](https://arxiv.org/html/2605.07110#bib.bib193),[169](https://arxiv.org/html/2605.07110#bib.bib195),[11](https://arxiv.org/html/2605.07110#bib.bib196)\]\. The decisive question is whether the system is being optimized merely to finish tasks, or to finish them under explicit operational constraints\.

Deploy\-time controls\. At deployment time, the most important controls govern authority binding: least privilege, sandboxing, provenance\-aware middleware, tool allowlists, session scoping, channel separation, and explicit trust labeling for inputs and outputs\[[38](https://arxiv.org/html/2605.07110#bib.bib44),[96](https://arxiv.org/html/2605.07110#bib.bib64),[35](https://arxiv.org/html/2605.07110#bib.bib46),[31](https://arxiv.org/html/2605.07110#bib.bib121)\]\. These are not wrappers around the model\. They are part of what determines whether the resulting system is appropriate to operate in a given setting at all\.
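
To illustrate what a tool allowlist combined with session scoping might look like, the Python sketch below checks every tool call against a per\-session policy\. `SessionPolicy` and `is_authorized` are hypothetical names, paths are assumed to be absolute POSIX paths, and the check normalizes `..` segments but does not resolve symbolic links, so it is a starting point rather than a sandbox\.

```python
import posixpath
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class SessionPolicy:
    """Hypothetical per-session authority binding (deny by default)."""
    allowed_tools: FrozenSet[str]
    allowed_roots: Tuple[str, ...] = ()


def _within(path: str, root: str) -> bool:
    # Normalize first so that "/work/task/../secret" cannot escape the root.
    norm = posixpath.normpath(path)
    root = posixpath.normpath(root)
    return norm == root or norm.startswith(root + "/")


def is_authorized(policy: SessionPolicy, tool: str, path: Optional[str] = None) -> bool:
    """Allow a call only if the tool is allowlisted and the path is in scope."""
    if tool not in policy.allowed_tools:
        return False
    if path is not None:
        return any(_within(path, root) for root in policy.allowed_roots)
    return True
```

Because the policy is bound to the session rather than to the model, the same agent can run with narrow authority in one setting and broader authority in another without retraining\.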

Runtime controls\. During operation, the goal is to reduce the chance that uncertainty silently turns into irreversible action\. Policy mediation, post\-action verification, plan preview, confirmation at sensitive steps, interrupt or takeover paths, and explanation of why an action is being proposed all belong here\[[60](https://arxiv.org/html/2605.07110#bib.bib62),[103](https://arxiv.org/html/2605.07110#bib.bib51),[174](https://arxiv.org/html/2605.07110#bib.bib45),[178](https://arxiv.org/html/2605.07110#bib.bib153)\]\. Recent safeguarded\-execution systems make that design space more concrete: CORA calibrates execute\-versus\-abstain decisions under an explicit risk budget, while VeriSafe Agent verifies proposed mobile actions against formalized task constraints before they fire\[[34](https://arxiv.org/html/2605.07110#bib.bib149),[68](https://arxiv.org/html/2605.07110#bib.bib228)\]\. Runtime security stacks such as AgentSentinel, AgentDoG, WebSentinel, and GEM add a complementary layer of monitoring, diagnosis, and uncertainty\-aware escalation around the agent loop itself\[[48](https://arxiv.org/html/2605.07110#bib.bib206),[81](https://arxiv.org/html/2605.07110#bib.bib66),[135](https://arxiv.org/html/2605.07110#bib.bib204),[143](https://arxiv.org/html/2605.07110#bib.bib223)\]\. Their purpose is to help keep user intent attached to the loop while the environment is changing underneath the system\.
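
The general idea of an execute\-versus\-abstain decision under a risk budget can be sketched in a few lines of Python\. This is not CORA's algorithm, which is not described here; the `RiskBudget` class, the per\-step cap, and the assumption that a calibrated risk estimate is available for each action are all illustrative\.

```python
from dataclasses import dataclass


@dataclass
class RiskBudget:
    """Hypothetical session-level budget for estimated action risk."""
    total: float          # maximum cumulative risk accepted in one session
    per_step_cap: float   # any single action above this goes to a human
    spent: float = 0.0

    def decide(self, estimated_risk: float) -> str:
        """Return 'execute', 'abstain', or 'escalate' for one proposed action."""
        if estimated_risk > self.per_step_cap:
            return "escalate"  # too risky for autonomous execution
        if self.spent + estimated_risk > self.total:
            return "abstain"   # budget exhausted; stop acting autonomously
        self.spent += estimated_risk
        return "execute"
```

The budget is charged only when an action executes, so abstained or escalated steps do not consume it\.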

Maintenance controls\. After release, the important controls are those that preserve validity over time: regression testing, red teaming, extension review, version pinning, coordinated patching, registry governance, and continuous re\-evaluation against changing interfaces and threats\[[78](https://arxiv.org/html/2605.07110#bib.bib76),[53](https://arxiv.org/html/2605.07110#bib.bib230),[183](https://arxiv.org/html/2605.07110#bib.bib100),[139](https://arxiv.org/html/2605.07110#bib.bib101),[86](https://arxiv.org/html/2605.07110#bib.bib102),[85](https://arxiv.org/html/2605.07110#bib.bib203)\]\. A deployed CUA is more likely to remain governable if its assurance stack evolves alongside its capability stack\.

Under the attribution lens, the same logic becomes even more specific\. Scope overreach is most directly addressed by task scoping, confirmation thresholds, memory boundaries, and data minimization\. Objective corruption is most directly addressed by provenance separation, tool vetting, output mediation, and memory hygiene\. Environmental misbinding is most directly addressed by execution mediation, TOCTOU\-aware verification, least privilege, and auditable action binding\. Control placement therefore follows both *what failed* and *when the enabling condition entered*\.

### VII\-B Human Oversight Should Be Designed, Not Assumed

Human oversight in CUAs should not be treated as a vague safety slogan\. If the system is expected to handle long\-horizon tasks under mixed\-trust inputs, then the user or operator needs explicit product and policy surfaces through which they can inspect, stop, or redirect execution\. Oversight is not a fallback for when the model fails\. It is one of the primary mechanisms by which user intent remains attached to machine autonomy\.

Five mechanisms recur as the most practical runtime oversight surfaces\. Plan preview exposes the intended workflow before consequential execution begins\. Step confirmation introduces a checkpoint when the next action is irreversible, ambiguous, or outside ordinary scope\. Interrupt, pause, or takeover keeps changing environments from outrunning human correction\. Action explanation preserves provenance about why the system believes a step is appropriate\. Reversible\-action preference biases the system toward drafts, dry runs, recycle\-bin semantics, or staged execution whenever the task permits it\. Benchmarks for interruptibility, collaborative assistance, and human interaction styles suggest that these are not merely interface niceties but measurable dimensions of deployed\-agent quality\[[191](https://arxiv.org/html/2605.07110#bib.bib157),[153](https://arxiv.org/html/2605.07110#bib.bib161),[52](https://arxiv.org/html/2605.07110#bib.bib143)\]\.

These mechanisms are not just usability features\. They are the points at which human intent remains operational inside the loop\. The same logic should scale with action impact\. Financial transfer, deletion, credential handling, permission changes, and external messaging do not need the same oversight regime, but they should all be mapped to one explicitly\. Low\-risk navigation may tolerate logging plus post\-action verification\. High\-impact actions generally require preview, confirmation, or explicit approval\.
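
Mapping each action class to an explicit oversight regime can be expressed as a small lookup with a strict default\. In the Python sketch below, the tier names and the specific assignment of action kinds to tiers are assumptions chosen for illustration; the text above only requires that every high\-impact action be mapped to some regime explicitly\.

```python
from enum import IntEnum


class Oversight(IntEnum):
    """Hypothetical oversight tiers, ordered from lightest to strictest."""
    LOG_AND_VERIFY = 1      # logging plus post-action verification
    PREVIEW = 2             # show the plan or draft before acting
    CONFIRM = 3             # step confirmation by the user
    EXPLICIT_APPROVAL = 4   # separate, explicit approval


# Illustrative assignment (assumption); a deployment would set its own tiers.
OVERSIGHT_BY_ACTION = {
    "navigation": Oversight.LOG_AND_VERIFY,
    "external_message": Oversight.PREVIEW,
    "deletion": Oversight.CONFIRM,
    "permission_change": Oversight.CONFIRM,
    "credential_handling": Oversight.EXPLICIT_APPROVAL,
    "financial_transfer": Oversight.EXPLICIT_APPROVAL,
}


def required_oversight(action_kind: str) -> Oversight:
    """Unmapped action kinds fall back to the strictest tier."""
    return OVERSIGHT_BY_ACTION.get(action_kind, Oversight.EXPLICIT_APPROVAL)
```

Defaulting unknown actions to the strictest tier keeps a newly added tool from inheriting a permissive regime by omission\.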

### VII\-C Control Ownership Is Distributed Across Actors

No single team can absorb the whole governance burden of a deployed CUA\. Control ownership is distributed because the control surfaces themselves are distributed across the lifecycle\. Model and system developers shape priors\. Deployment teams bind the model to channels, permissions, tools, and sessions\. Operators and end users govern escalation and approval during runtime\. Ecosystem stewards govern registries, plugins, skills, and identity surfaces after release\.

This separation matters because different failures trace back to different control owners\. Insufficiently constrained development can introduce risky priors before anyone else has a chance to intervene\. Loose deployment practice can expose too much authority even when the model itself is reasonable\. Weak runtime oversight can let user intent dissolve during long trajectories\. Loose ecosystem governance can reopen a previously bounded system through extensions, registries, or agent\-to\-agent exchange\.

These roles are analytically distinct but operationally coupled\. Careful training may not rescue a deployment that binds the model to broad permissions\. Strong runtime approval may not fully rescue a system whose extension ecosystem is weakly governed\. Public materials describing OpenClaw\-like gateway assistants help illustrate this coupling in settings where persistent channels, tool bindings, sessions, and community interaction are presented as converging in one assistant surface\[[105](https://arxiv.org/html/2605.07110#bib.bib96),[95](https://arxiv.org/html/2605.07110#bib.bib97)\]\. Such public descriptions are used only as illustrative deployment patterns rather than as verified system evidence\. The broader point is that any open\-deployment CUA that combines ingress, tooling, memory, and execution authority needs control ownership to remain traceable across those surfaces\.

### VII\-D A Conservative Baseline Stack for Open Deployment

The literature does not support any single defense as sufficient\. It does, however, suggest a conservative baseline stack that repeatedly appears across safer deployment patterns\. The list below is not an exhaustive defense catalogue\. It is a compact control stack for open deployment: constrain what the system can do, preserve where instructions come from, bind authority safely, verify live execution, and keep assurance current after change\.

- Constrained action interfaces\. High\-impact operations should prefer semantically narrow, auditable interfaces over unrestricted low\-level authority whenever possible\[[57](https://arxiv.org/html/2605.07110#bib.bib40),[38](https://arxiv.org/html/2605.07110#bib.bib44),[97](https://arxiv.org/html/2605.07110#bib.bib63)\]\.
- Provenance\-aware mediation\. User instructions, retrieved content, tool outputs, and environmental text should remain distinguishable inside the control loop so that both the model and the operator can inspect source\[[60](https://arxiv.org/html/2605.07110#bib.bib62),[172](https://arxiv.org/html/2605.07110#bib.bib75)\]\.
- Sandboxed and scoped authority\. Filesystem, network, tool, and account permissions should be deliberately bounded before deployment rather than tightened only after an incident\[[96](https://arxiv.org/html/2605.07110#bib.bib64),[38](https://arxiv.org/html/2605.07110#bib.bib44),[35](https://arxiv.org/html/2605.07110#bib.bib46)\]\.
- Runtime verification plus escalation\. Post\-action checking, plan preview, impact\-aware approval, takeover paths, explicit safeguarded\-execution layers, and runtime diagnosis are necessary to keep autonomy aligned once live use begins\[[103](https://arxiv.org/html/2605.07110#bib.bib51),[174](https://arxiv.org/html/2605.07110#bib.bib45),[34](https://arxiv.org/html/2605.07110#bib.bib149),[68](https://arxiv.org/html/2605.07110#bib.bib228),[178](https://arxiv.org/html/2605.07110#bib.bib153),[81](https://arxiv.org/html/2605.07110#bib.bib66)\]\.
- Continuous assurance\. Regression suites, red teaming, extension review, replay against changed interfaces, detector updates, and signed or reviewed registries are needed so that maintenance\-stage drift does not quietly reopen known weaknesses\[[78](https://arxiv.org/html/2605.07110#bib.bib76),[53](https://arxiv.org/html/2605.07110#bib.bib230),[183](https://arxiv.org/html/2605.07110#bib.bib100),[135](https://arxiv.org/html/2605.07110#bib.bib204),[85](https://arxiv.org/html/2605.07110#bib.bib203),[86](https://arxiv.org/html/2605.07110#bib.bib102)\]\.

These controls are cumulative\. Constraining actions without provenance still leaves instruction confusion unresolved\. Provenance without sandboxing still leaves broad authority intact\. Runtime verification without maintenance discipline still decays as tools, interfaces, and ecosystems change\. The practical implication is that dependable CUA deployment is unlikely to be achieved by any single alignment technique alone\. It is more likely to depend on whether capability, authority, and oversight are connected through one coherent control stack\.
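
As one illustration of the provenance\-aware mediation layer in this stack, the Python sketch below tags every context segment with its source and lets only user\-sourced text act as instructions\. The `Source` labels follow the four input kinds named in the list above, while `Segment` and `split_context` are hypothetical names; real systems would also need to defend the labels themselves against spoofing\.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Source(Enum):
    USER = "user"
    RETRIEVED = "retrieved"
    TOOL_OUTPUT = "tool_output"
    ENVIRONMENT = "environment"


@dataclass(frozen=True)
class Segment:
    """A piece of context that keeps its origin attached."""
    source: Source
    text: str


def split_context(segments: List[Segment]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Separate instructions from evidence by provenance.

    Only user-sourced segments are treated as instructions; everything else
    is passed along as labeled evidence that must not be obeyed directly.
    """
    instructions = [s.text for s in segments if s.source is Source.USER]
    evidence = [(s.source.value, s.text) for s in segments if s.source is not Source.USER]
    return instructions, evidence
```

Keeping the label on the evidence side means an operator reviewing a trajectory can still see which channel a suspicious string arrived through\.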

## VIII Open Problems and Future Directions

The framework clarifies deployed CUA reliability without claiming that the problem has been closed\. Even if models continue to improve, several important questions remain unresolved\. The hard part is no longer only capability\. It is also attribution, evaluation, control placement, and post\-release governance\.

### VIII\-A Limits of the Analytical Lens

The lens emphasizes deployment\-grounded diagnosis: where reliability issues are introduced, how they become operational, and where meaningful controls can still be attached\. That emphasis leaves some questions only partially covered\. The article does not aim to provide a formal causal model of agent failure, an exhaustive benchmark comparison, or a substitute for learning\-theoretic, formal\-methods, or proof\-oriented security analysis\. Its value is organizational and diagnostic rather than complete in every theoretical dimension\.

### VIII\-B Attribution Still Lags Observation

The distinction among scope overreach, objective corruption, and environmental misbinding is analytically useful, but operationalizing it remains difficult\. In realistic deployments, off\-task helpfulness, silent goal drift, and attacker\-induced exfiltration may produce similar surface behavior\. Better provenance, telemetry, and causal tracing are still needed if attribution is to guide control placement reliably rather than merely explain incidents after the fact\[[57](https://arxiv.org/html/2605.07110#bib.bib40),[35](https://arxiv.org/html/2605.07110#bib.bib46)\]\.

### VIII\-C Benchmark Success Is Not Yet Release Readiness

Current evaluation suites capture important parts of the CUA problem, but they still underrepresent long\-lived sessions, changing permissions, maintenance\-stage drift, and open ecosystem interaction\. As a result, benchmark scores should not yet be read as translating directly into deployment readiness\. A major open direction is to connect benchmark design more tightly to release gating, regression testing, and post\-release monitoring\[[65](https://arxiv.org/html/2605.07110#bib.bib41),[126](https://arxiv.org/html/2605.07110#bib.bib42),[15](https://arxiv.org/html/2605.07110#bib.bib67),[183](https://arxiv.org/html/2605.07110#bib.bib100),[160](https://arxiv.org/html/2605.07110#bib.bib135),[2](https://arxiv.org/html/2605.07110#bib.bib136),[177](https://arxiv.org/html/2605.07110#bib.bib150),[16](https://arxiv.org/html/2605.07110#bib.bib151),[150](https://arxiv.org/html/2605.07110#bib.bib48)\]\.

### VIII\-D Cross\-Modal Robustness Remains Immature

Injection in CUAs is not only text injection\. It may travel through screenshots, layout cues, parser artifacts, OCR\-visible instructions, time\-varying UI state, adversarial image patches, and attention\-steering interface elements\[[87](https://arxiv.org/html/2605.07110#bib.bib77),[176](https://arxiv.org/html/2605.07110#bib.bib43),[108](https://arxiv.org/html/2605.07110#bib.bib58),[134](https://arxiv.org/html/2605.07110#bib.bib208),[3](https://arxiv.org/html/2605.07110#bib.bib210),[17](https://arxiv.org/html/2605.07110#bib.bib213),[113](https://arxiv.org/html/2605.07110#bib.bib199),[154](https://arxiv.org/html/2605.07110#bib.bib214)\]\. Defenses that work on prompt text alone are therefore insufficient\. A robust CUA will likely need perception and execution defenses that explicitly account for how multimodal grounding and authority binding interact\.

### VIII\-E Capability Gains Still Create Policy Tension

Stronger models, richer tools, and better post\-training do not automatically reduce policy tension; in some settings they can increase it\. The more capable the agent becomes, the less obvious it is how that capability should translate into authority, autonomy, and approval requirements\. Future work therefore needs to evaluate safety as a function of tool access, impact level, and runtime pressure rather than as a static property of the base model\[[126](https://arxiv.org/html/2605.07110#bib.bib42)\]\. Broader agent\-safety evidence such as PropensityBench points in the same direction, even though it is not CUA\-specific: risk depends on both what a model can do in principle and what it is likely to attempt once consequential tools are available\[[112](https://arxiv.org/html/2605.07110#bib.bib69)\]\.

### VIII\-F Open Ecosystems Need Stronger Governance Primitives

A growing subset of open\-deployment CUAs is embedded in tool registries, extension marketplaces, MCP\-style ecosystems, and agent\-to\-agent environments\. These settings appear to require stronger primitives for identity, attestation, registry hygiene, and capability containment than the field currently has\[[67](https://arxiv.org/html/2605.07110#bib.bib72),[170](https://arxiv.org/html/2605.07110#bib.bib70),[139](https://arxiv.org/html/2605.07110#bib.bib101),[86](https://arxiv.org/html/2605.07110#bib.bib102)\]\. At the same time, stronger isolation can reduce collaboration quality, which means governance cannot be framed as pure restriction without considering utility trade\-offs\[[107](https://arxiv.org/html/2605.07110#bib.bib74)\]\. The field still lacks a stable answer for how open capability ecosystems should remain both useful and governable\.

These open problems reinforce the central claim\. Progress in CUAs is unlikely to come from model quality alone\. It will more likely depend on better ways of linking architecture, lifecycle, authority, and control so that deployment remains intelligible after release rather than only impressive before it\.

## IX Conclusion

Computer\-use agents can be productively analyzed as deployed interactive systems rather than as benchmark policies with stronger prompting alone\. Their reliability depends jointly on how they reconstruct interface state, preserve task intent, execute through real authority surfaces, and continue to evolve after release\.

The preceding sections developed that argument through a joint architecture–lifecycle framework\. The tri\-layer architecture traced where actionable state is reconstructed, where intent is stabilized, and where authority is exercised\. The lifecycle account traced when capability is formed, when authority is bound, when failures first become visible, and where controls may still intervene earliest\. The value of combining the two is organizational and diagnostic: it helps separate visible failure from upstream cause and ties practical control to the right stage of system evolution\.

Three tensions remain central\. The first is *portability versus controllability*: the observation and execution channels that generalize most broadly are often the hardest to mediate and verify precisely\. The second is *autonomy versus oversight*: long\-horizon execution becomes valuable only when user intent, provenance, and approval remain attached to the loop\. The third is *adaptation versus regression*: deployed CUAs need to evolve with changing interfaces and ecosystems, yet every update can reopen earlier weaknesses or create new ones\.

Those tensions suggest three priorities\. First, the field needs controllable visual grounding that remains portable without sacrificing auditable execution\. Second, it needs deployment\-aware evaluation that measures authority binding, misuse resistance, and oversight behavior alongside task capability\. Third, it needs maintenance\-aware governance so that models, tools, memory, identities, and extensions remain governable after release instead of drifting into insecurity\.

Taken together, the surveyed literature and public deployment patterns suggest that dependable computer use is unlikely to be achieved by stronger models alone\. It also depends on how capability is bound to authority, how runtime uncertainty is mediated, and how assurance is maintained after deployment\. The main value of the architecture–lifecycle view is therefore to help locate where problems are introduced, where they become visible, and where meaningful controls can still be applied\.

## References

- \[1\] S\. T\. R\. Adapala and Y\. R\. Alugubelly \(2025\) The aegis protocol: a foundational security framework for autonomous ai agents\. arXiv preprint arXiv:2508\.19267\. Cited by: [TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.7.6.2.1.1)\.
- \[2\] P\. Aggarwal, G\. Neubig, and S\. Welleck \(2026\) Gym\-anything: turn any software into an agent environment\. arXiv preprint arXiv:2604\.06126\. Cited by: [TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.5.4.2.1.1), [§III\-C](https://arxiv.org/html/2605.07110#S3.SS3.p2.1), [§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px4.p1.1), [§VIII\-C](https://arxiv.org/html/2605.07110#S8.SS3.p1.1)\.
- \[3\] L\. Aichberger, A\. Paren, G\. Li, P\. Torr, Y\. Gal, and A\. Bibi \(2025\) MIP against agent: malicious image patches hijacking multimodal os agents\. arXiv preprint arXiv:2503\.10809\. Cited by: [§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p2.1), [§VI](https://arxiv.org/html/2605.07110#S6.p2.1), [§VIII\-D](https://arxiv.org/html/2605.07110#S8.SS4.p1.1)\.
- \[4\] N\. Anand, R\. Jain, S\. Patnaik, B\. Krishnamurthy, and M\. Sarkar \(2026\) AFRAgent: an adaptive feature renormalization based high resolution aware gui agent\. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp\. 1147–1158\. Cited by: [§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p4.1)\.
- \[5\] M\. Andreux, M\. Bakler, Y\. Barbier, H\. Benchekroun, E\. Biré, A\. Bonnet, R\. Bordie, N\. Bout, M\. Brunel, A\. Cambray, et al\. \(2025\) Surfer 2: the next generation of cross\-platform computer use agents\. arXiv preprint arXiv:2510\.19949\. Cited by: [§I](https://arxiv.org/html/2605.07110#S1.p1.1), [§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p5.1), [§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px1.p1.1)\.
- \[6\] G\. Baechler, S\. Sunkara, M\. Wang, F\. Zubach, H\. Mansoor, V\. Etter, V\. Cărbune, J\. Lin, J\. Chen, and A\. Sharma \(2024\) Screenai: a vision\-language model for ui and infographics understanding\. arXiv preprint arXiv:2402\.04615\. Cited by: [§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p3.1), [§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px2.p1.1)\.
- \[7\] J\. Bai, S\. Bai, S\. Yang, S\. Wang, S\. Tan, P\. Wang, J\. Lin, C\. Zhou, and J\. Q\. Zhou \(2023\) A versatile vision\-language model for understanding, localization, text reading, and beyond\. arXiv preprint arXiv:2308\.129666, pp\. 3\. Cited by: [§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p4.1)\.
- \[8\] R\. Bavishi, E\. Elsen, C\. Hawthorne, M\. Nye, A\. Odena, A\. Somani, and S\. Taşırlar \(2023\) Introducing our multimodal models\. External Links: [Link](https://www.adept.ai/blog/fuyu-8b) Cited by: [§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p4.1)\.
- \[9\] L\. Boisvert, M\. Thakkar, M\. Gasse, M\. Caccia, T\. L\. De Chezelles, Q\. Cappart, N\. Chapados, A\. Lacoste, and A\. Drouin \(2024\) Workarena\+\+: towards compositional planning and reasoning\-based common knowledge work tasks\. Advances in Neural Information Processing Systems 37, pp\. 5996–6051\. Cited by: [§I](https://arxiv.org/html/2605.07110#S1.p2.1), [§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px4.p1.1)\.
- \[10\] T\. Cao, B\. Lim, Y\. Liu, Y\. Sui, Y\. Li, S\. Deng, L\. Lu, N\. Oo, S\. Yan, and B\. Hooi \(2025\) Vpi\-bench: visual prompt injection attacks for computer\-use agents\. arXiv preprint arXiv:2506\.02456\. Cited by: [§VI](https://arxiv.org/html/2605.07110#S6.p2.1)\.
- \[11\] H\. Chae, S\. Kim, J\. Cho, S\. Kim, S\. Moon, G\. Hwangbo, D\. Lim, M\. Kim, Y\. Hwang, M\. Gwak, et al\. \(2025\) Web\-shepherd: advancing prms for reinforcing web agents\. arXiv preprint arXiv:2505\.15277\. Cited by: [§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px3.p1.1), [§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p2.1)\.
- \[12\] A\. Chen, Y\. Wu, J\. Zhang, J\. Xiao, S\. Yang, J\. Huang, K\. Wang, W\. Wang, and S\. Wang \(2025\) A survey on the safety and security threats of computer\-using agents: jarvis or ultron?\. arXiv preprint arXiv:2505\.10924\. Cited by: [TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.3.2.2.1.1), [§I](https://arxiv.org/html/2605.07110#S1.p3.1)\.
- \[13\] C\. Chen, X\. Song, Y\. Chai, Y\. Yao, H\. Zhao, L\. Li, J\. Li, Y\. Teng, G\. Liu, and Y\. Wang \(2025\) GhostEI\-bench: do mobile agents resilience to environmental injection in dynamic on\-device environments?\. arXiv preprint arXiv:2510\.20333\. Cited by: [§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p2.1)\.
- \[14\] E\. Chen, C\. Guan, A\. Elshafiey, Z\. Zhao, J\. Zekeri, A\. E\. Shaibu, and E\. O\. Prince \(2026\) When openclaw ai agents teach each other: peer learning patterns in the moltbook community\. arXiv preprint arXiv:2602\.14477\. Cited by: [§III\-C](https://arxiv.org/html/2605.07110#S3.SS3.p5.1), [§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p6.1)\.
- \[15\] T\. Chen, C\. Hu, G\. Gao, D\. Liu, X\. Hu, and W\. Wang \(2026\) LPS\-bench: benchmarking safety awareness of computer\-use agents in long\-horizon planning under benign and adversarial scenarios\. arXiv preprint arXiv:2602\.03255\. Cited by: [§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p4.1), [§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p4.1), [TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.8.6.5.1.1), [§VIII\-C](https://arxiv.org/html/2605.07110#S8.SS3.p1.1)\.
- \[16\] T\. Chen, Z\. Lu, Z\. Xu, G\. Shao, S\. Zhao, F\. Tang, Y\. Du, K\. Song, Y\. Liu, Y\. Yan, et al\. \(2026\) KnowU\-bench: towards interactive, proactive, and personalized mobile agent evaluation\. arXiv preprint arXiv:2604\.08455\. Cited by: [§I](https://arxiv.org/html/2605.07110#S1.p2.1), [§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px4.p1.1), [§VIII\-C](https://arxiv.org/html/2605.07110#S8.SS3.p1.1)\.
- \[17\] Y\. Chen, X\. Hu, K\. Yin, J\. Li, and S\. Zhang \(2025\) Evaluating the robustness of multimodal agents against active environmental injection attacks\. In Proceedings of the 33rd ACM International Conference on Multimedia, pp\. 11648–11656\. Cited by: [§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px2.p1.1), [§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p2.1), [§VIII\-D](https://arxiv.org/html/2605.07110#S8.SS4.p1.1)\.
- \[18\] Y\. Chen, Z\. Liao, P\. Yin, T\. Xie, K\. Yin, and S\. Zhang \(2026\) SafePred: a predictive guardrail for computer\-using agents via world models\. arXiv preprint arXiv:2602\.01725\. Cited by: [§VI\-A](https://arxiv.org/html/2605.07110#S6.SS1.p2.1), [§VI](https://arxiv.org/html/2605.07110#S6.p2.1)\.
- \[19\]Z\. Chen, M\. Kang, and B\. Li\(2025\)Shieldagent: shielding agents via verifiable safety policy reasoning\.arXiv preprint arXiv:2503\.22738\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p3.1),[§VI\-G](https://arxiv.org/html/2605.07110#S6.SS7.p3.1)\.
- \[20\]Z\. Chen, Z\. Xiang, C\. Xiao, D\. Song, and B\. Li\(2024\)Agentpoison: red\-teaming llm agents via poisoning memory or knowledge bases\.Advances in Neural Information Processing Systems37,pp\. 130185–130213\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p4.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.8.6.5.1.1)\.
- \[21\]J\. Cheng, A\. Kumar, R\. Lal, R\. Rajasekaran, H\. Ramezani, O\. Z\. Khan, O\. Rokhlenko, S\. Chiu\-Webster, G\. Hua, and H\. Amiri\(2025\)WebATLAS: an llm agent with experience\-driven memory and action simulation\.arXiv preprint arXiv:2510\.22732\.Cited by:[§III\-C](https://arxiv.org/html/2605.07110#S3.SS3.p4.1),[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p3.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.5.4.2.1.1)\.
- \[22\]K\. Cheng, Q\. Sun, Y\. Chu, F\. Xu, L\. YanTao, J\. Zhang, and Z\. Wu\(2024\)Seeclick: harnessing gui grounding for advanced visual gui agents\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 9313–9332\.Cited by:[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p3.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.3.2.2.1.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px2.p1.1)\.
- \[23\]P\. Cheng, H\. Hu, Z\. Wu, Z\. Wu, T\. Ju, Z\. Zhang, and G\. Liu\(2025\)Hidden ghost hand: unveiling backdoor vulnerabilities in mllm\-powered mobile gui agents\.arXiv preprint arXiv:2505\.14418\.Cited by:[§VI\-C](https://arxiv.org/html/2605.07110#S6.SS3.p2.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.3.1.5.1.1)\.
- \[24\]P\. Cuvin, H\. Zhu, and D\. Yang\(2025\)DECEPTICON: how dark patterns manipulate web agents\.arXiv preprint arXiv:2512\.22894\.Cited by:[§VI](https://arxiv.org/html/2605.07110#S6.p2.1)\.
- \[25\]X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su\(2023\)Mind2web: towards a generalist agent for the web\.Advances in Neural Information Processing Systems36,pp\. 28091–28114\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.5.4.2.1.1),[§I](https://arxiv.org/html/2605.07110#S1.p2.1),[§II\-B](https://arxiv.org/html/2605.07110#S2.SS2.p2.1),[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p2.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.2.1.2.1.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px1.p1.1)\.
- \[26\]X\. Ding, S\. Zhai, L\. Song, J\. Li, T\. Shi, N\. Meade, S\. Reddy, J\. Kang, and J\. Zhao\(2026\)The blind spot of agent safety: how benign user instructions expose critical vulnerabilities in computer\-use agents\.arXiv preprint arXiv:2604\.10577\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p1.1),[§VI](https://arxiv.org/html/2605.07110#S6.p2.1)\.
- \[27\]J\. Dong, S\. Guo, H\. Wang, X\. Chen, Z\. Liu, T\. Zhang, K\. Xu, M\. Huang, and H\. Qiu\(2025\)SafeSearch: automated red\-teaming for the safety of llm\-based search agents\.arXiv preprint arXiv:2509\.23694\.Cited by:[§VI\-A](https://arxiv.org/html/2605.07110#S6.SS1.p1.1)\.
- \[28\]A\. Drouin, M\. Gasse, M\. Caccia, I\. H\. Laradji, M\. Del Verme, T\. Marty, L\. Boisvert, M\. Thakkar, Q\. Cappart, D\. Vazquez,et al\.\(2024\)Workarena: how capable are web agents at solving common knowledge work tasks?\.arXiv preprint arXiv:2403\.07718\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p2.1)\.
- \[29\]D\. Ersoy, B\. Lee, A\. Shreekumar, A\. Arunasalam, M\. Ibrahim, A\. Bianchi, and Z\. B\. Celik\(2025\)Investigating the impact of dark patterns on llm\-based web agents\.arXiv preprint arXiv:2510\.18113\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p2.1)\.
- \[30\]I\. Evtimov, A\. Zharmagambetov, A\. Grattafiori, C\. Guo, and K\. Chaudhuri\(2025\)Wasp: benchmarking web agent security against prompt injection attacks\.arXiv preprint arXiv:2504\.18575\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p2.1)\.
- \[31\]G\. Fan, C\. Niu, C\. Lyu, F\. Wu, and G\. Chen\(2025\)CORE: reducing ui exposure in mobile agents via collaboration between cloud and local llms\.arXiv preprint arXiv:2510\.15455\.Cited by:[§VI\-F](https://arxiv.org/html/2605.07110#S6.SS6.p2.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p3.1)\.
- \[32\]S\. Fan, R\. Wan, Y\. Leng, G\. Liang, L\. Ling, Y\. Shang, and D\. Kong\(2026\)WebChain: a large\-scale human\-annotated dataset of real\-world web interaction traces\.arXiv preprint arXiv:2603\.05295\.Cited by:[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px1.p1.1)\.
- \[33\]Y\. Fan, H\. Zhao, R\. Zhang, Y\. Shen, X\. E\. Wang, and G\. Wu\(2025\)Gui\-bee: align gui action grounding to novel environments via autonomous exploration\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 33249–33266\.Cited by:[§II\-B](https://arxiv.org/html/2605.07110#S2.SS2.p1.1),[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p5.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px2.p1.1)\.
- \[34\]Y\. Feng, J\. Du, Q\. Wang, Z\. Ma, Q\. Niu, Y\. Matsuo, L\. Feng, and L\. Yu\(2026\)CORA: conformal risk\-controlled agents for safeguarded mobile gui automation\.arXiv preprint arXiv:2604\.09155\.Cited by:[§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px5.p1.1),[§VI\-A](https://arxiv.org/html/2605.07110#S6.SS1.p2.1),[4th item](https://arxiv.org/html/2605.07110#S7.I1.i4.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p4.1)\.
- \[35\]H\. Foerster, R\. Mullins, T\. Blanchard, N\. Papernot, K\. Nikolić, F\. Tramèr, I\. Shumailov, C\. Zhang, and Y\. Zhao\(2026\)CaMeLs can use computers too: system\-level security for computer use agents\.arXiv preprint arXiv:2601\.09923\.Cited by:[§II\-A](https://arxiv.org/html/2605.07110#S2.SS1.p3.1),[§VI\-A](https://arxiv.org/html/2605.07110#S6.SS1.p1.1),[§VI\-D](https://arxiv.org/html/2605.07110#S6.SS4.p2.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.5.3.5.1.1),[3rd item](https://arxiv.org/html/2605.07110#S7.I1.i3.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p3.1),[§VIII\-B](https://arxiv.org/html/2605.07110#S8.SS2.p1.1)\.
- \[36\]M\. Gao, W\. Bu, B\. Miao, Y\. Wu, Y\. Li, J\. Li, S\. Tang, Q\. Wu, Y\. Zhuang, and M\. Wang\(2024\)Generalist virtual agents: a survey on autonomous agents across digital platforms\.arXiv preprint arXiv:2411\.10943\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p3.1),[§III](https://arxiv.org/html/2605.07110#S3.p1.1)\.
- \[37\]Y\. Gao, J\. Wu, X\. Chen, Y\. Yang, Z\. Cui, T\. Ma, J\. Zhang, and J\. Sang\(2026\)GUITester: enabling gui agents for exploratory defect discovery\.arXiv preprint arXiv:2601\.04500\.Cited by:[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px2.p1.1)\.
- \[38\]H\. Gong, C\. Li, R\. Chang, and W\. Shen\(2025\)Secure and efficient access control for computer\-use agents via context space\.arXiv preprint arXiv:2509\.22256\.Cited by:[§VI\-D](https://arxiv.org/html/2605.07110#S6.SS4.p2.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.5.3.5.1.1),[1st item](https://arxiv.org/html/2605.07110#S7.I1.i1.p1.1),[3rd item](https://arxiv.org/html/2605.07110#S7.I1.i3.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p3.1)\.
- \[39\]G\. Gonzalez\-Pumariega, V\. Tu, C\. Lee, J\. Yang, A\. Li, and X\. E\. Wang\(2025\)The unreasonable effectiveness of scaling agents for computer use\.arXiv preprint arXiv:2510\.02250\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.4.3.2.1.1),[§I](https://arxiv.org/html/2605.07110#S1.p1.1),[§I](https://arxiv.org/html/2605.07110#S1.p2.1)\.
- \[40\]L\. Guo, W\. Liu, Y\. W\. Heng, T\. P\. Chen, and Y\. Wang\(2026\)Agent\-sama: state\-aware mobile assistant\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 29459–29467\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p6.1)\.
- \[41\]T\. Gupta, P\. Wolters, Z\. Ma, P\. Sushko, R\. Y\. Pang, D\. Llanes, Y\. Yang, T\. Anderson, B\. Zheng, Z\. Ren,et al\.\(2026\)MolmoWeb: open visual web agent and open data for the open web\.arXiv preprint arXiv:2604\.08516\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p1.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px1.p1.1)\.
- \[42\]I\. Gur, H\. Furuta, A\. Huang, M\. Safdari, Y\. Matsuo, D\. Eck, and A\. Faust\(2023\)A real\-world webagent with planning, long context understanding, and program synthesis\.arXiv preprint arXiv:2307\.12856\.Cited by:[§III\-C](https://arxiv.org/html/2605.07110#S3.SS3.p4.1),[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p6.1)\.
- \[43\]A\. S\. Gurbuz, S\. Hong, A\. Nassar, M\. Pollefeys, and P\. Staar\(2026\)Moving beyond sparse grounding with complete screen parsing supervision\.arXiv preprint arXiv:2602\.14276\.Cited by:[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p3.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px2.p1.1)\.
- \[44\]H\. He, W\. Yao, K\. Ma, W\. Yu, Y\. Dai, H\. Zhang, Z\. Lan, and D\. Yu\(2024\)Webvoyager: building an end\-to\-end web agent with large multimodal models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 6864–6890\.Cited by:[§II\-A](https://arxiv.org/html/2605.07110#S2.SS1.p2.1),[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p5.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.3.2.2.1.1)\.
- \[45\]T\. He, Y\. Chen, K\. Jiang, K\. Y\. Lee, K\. Zhou, K\. Shao, and S\. Wang\(2026\)EE\-mcp: self\-evolving mcp\-gui agents via automated environment generation and experience learning\.arXiv preprint arXiv:2604\.09815\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p6.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1)\.
- \[46\]Z\. He, W\. Hong, Z\. Yang, Z\. Pan, M\. Liu, X\. Gu, and J\. Tang\(2026\)Vision2Web: a hierarchical benchmark for visual website development with agent verification\.arXiv preprint arXiv:2603\.26648\.Cited by:[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px2.p1.1)\.
- \[47\]W\. Hong, W\. Wang, Q\. Lv, J\. Xu, W\. Yu, J\. Ji, Y\. Wang, Z\. Wang, Y\. Dong, M\. Ding,et al\.\(2024\)Cogagent: a visual language model for gui agents\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 14281–14290\.Cited by:[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p4.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px2.p1.1)\.
- \[48\]H\. Hu, P\. Chen, Y\. Zhao, and Y\. Chen\(2025\)Agentsentinel: an end\-to\-end and real\-time security defense framework for computer\-use agents\.InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security,pp\. 3535–3549\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p3.1),[§VI\-G](https://arxiv.org/html/2605.07110#S6.SS7.p3.1),[§VI](https://arxiv.org/html/2605.07110#S6.p2.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p4.1)\.
- \[49\]S\. Hu, K\. Q\. Lin, and M\. Z\. Shou\(2025\)ShowUI\-π: flow\-based generative models as gui dexterous hands\.arXiv preprint arXiv:2512\.24965\.Cited by:[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p2.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1)\.
- \[50\]X\. Hu, T\. Xiong, B\. Yi, Z\. Wei, R\. Xiao, Y\. Chen, J\. Ye, M\. Tao, X\. Zhou, Z\. Zhao,et al\.\(2025\)Os agents: a survey on mllm\-based agents for computer, phone and browser use\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 7436–7465\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.2.1.2.1.1),[§I](https://arxiv.org/html/2605.07110#S1.p3.1)\.
- \[51\]Z\. Hui, Y\. Li, T\. Chen, C\. Banbury, K\. Koishida,et al\.\(2025\)Winclick: gui grounding with multimodal large language models\.arXiv preprint arXiv:2503\.04730\.Cited by:[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p4.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.4.3.2.1.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px2.p1.1)\.
- \[52\]F\. Huq, Z\. Z\. Wang, Z\. Guo, V\. A\. Arangarajan, T\. Ou, F\. Xu, S\. Zhou, G\. Neubig, and J\. P\. Bigham\(2026\)Modeling distinct human interaction in web agents\.arXiv preprint arXiv:2602\.17588\.Cited by:[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px4.p1.1),[§V](https://arxiv.org/html/2605.07110#S5.p2.1),[§VII\-B](https://arxiv.org/html/2605.07110#S7.SS2.p2.1)\.
- \[53\]M\. F\. Ishmam and K\. Marino\(2026\)TimeWarp: Evaluating Web Agents by Revisiting the Past\.arXiv preprint arXiv:2603\.04949\.Cited by:[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px2.p1.1),[§VI\-G](https://arxiv.org/html/2605.07110#S6.SS7.p2.1),[5th item](https://arxiv.org/html/2605.07110#S7.I1.i5.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p5.1)\.
- \[54\]H\. Jia, M\. He, Z\. Yin, L\. Wu, J\. Fan, and J\. Sang\(2025\)ReInAgent: a context\-aware gui agent enabling human\-in\-the\-loop mobile task navigation\.arXiv preprint arXiv:2510\.07988\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p5.1)\.
- \[55\]D\. Jiang, J\. Huang, X\. Zhao, L\. Chen, L\. Zheng, F\. Liu, H\. Qiu, P\. Shi, and Z\. Zeng\(2026\)TreeCUA: efficiently scaling gui automation with tree\-structured verifiable evolution\.arXiv preprint arXiv:2602\.09662\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p5.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1)\.
- \[56\]W\. Jiang, Y\. Zhuang, C\. Song, X\. Yang, J\. T\. Zhou, and C\. Zhang\(2025\)Appagentx: evolving gui agents as proficient smartphone users\.arXiv preprint arXiv:2503\.02268\.Cited by:[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p4.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.7.6.2.1.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1)\.
- \[57\]D\. Jones, G\. Severi, M\. Pouliot, G\. Lopez, J\. de Gruyter, S\. Zanella\-Beguelin, J\. Song, B\. Bullwinkel, P\. Cortez, and A\. Minnich\(2025\)A systematization of security vulnerabilities in computer use agents\.arXiv preprint arXiv:2507\.05445\.Cited by:[§VI\-A](https://arxiv.org/html/2605.07110#S6.SS1.p1.1),[1st item](https://arxiv.org/html/2605.07110#S7.I1.i1.p1.1),[§VIII\-B](https://arxiv.org/html/2605.07110#S8.SS2.p1.1)\.
- \[58\]J\. Jones, Z\. Zhang, Y\. Ning, E\. Fosler\-Lussier, P\. St\-Charles, Y\. Bengio, D\. Song, Y\. Su, and H\. Sun\(2026\)When benign inputs lead to severe harms: eliciting unsafe unintended behaviors of computer\-use agents\.arXiv preprint arXiv:2602\.08235\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p4.1)\.
- \[59\]B\. Kang, S\. Wen, Y\. Bi, S\. Wu, X\. Yuan, R\. Shao, J\. Wang, and Z\. Tian\(2026\)LongHorizonUI: A Unified Framework for Robust long\-horizon Task Automation of GUI Agent\.InInternational Conference on Learning Representations,Note:PosterExternal Links:[Link](https://openreview.net/forum?id=BK7Mk5d4WE)Cited by:[§I](https://arxiv.org/html/2605.07110#S1.SS0.SSS0.Px1.p1.1),[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.4.3.2.1.1)\.
- \[60\]M\. Kang, C\. Xiang, S\. Kariyappa, C\. Xiao, B\. Li, and E\. Suh\(2025\)Mitigating indirect prompt injection via instruction\-following intent analysis\.arXiv preprint arXiv:2512\.00966\.Cited by:[2nd item](https://arxiv.org/html/2605.07110#S7.I1.i2.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p4.1)\.
- \[61\]Z\. M\. Kim, D\. Lee, J\. Kim, V\. Raheja, and D\. Kang\(2026\)The amazing agent race: strong tool users, weak navigators\.arXiv preprint arXiv:2604\.10261\.Cited by:[§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px1.p1.1)\.
- \[62\]J\. Y\. Koh, R\. Lo, L\. Jang, V\. Duvvur, M\. Lim, P\. Huang, G\. Neubig, S\. Zhou, R\. Salakhutdinov, and D\. Fried\(2024\)Visualwebarena: evaluating multimodal agents on realistic visual web tasks\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 881–905\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p2.1)\.
- \[63\]F\. Kong, J\. Zhang, Y\. Yue, C\. Sun, Y\. Tian, S\. Feng, X\. Yang, D\. Wang, Y\. Tian, J\. Du,et al\.\(2026\)WebTestBench: evaluating computer\-use agents towards end\-to\-end automated web testing\.arXiv preprint arXiv:2603\.25226\.Cited by:[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px2.p1.1)\.
- \[64\]K\. Korgul, Y\. Yang, A\. Drohomirecki, W\. Howard, L\. Aichberger, C\. Russell, P\. H\. Torr, A\. Mahdi, A\. Bibi,et al\.\(2025\)It’s a trap\! task\-redirecting agent persuasion benchmark for web agents\.arXiv preprint arXiv:2512\.23128\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p2.1)\.
- \[65\]T\. Kuntz, A\. Duzan, H\. Zhao, F\. Croce, Z\. Kolter, N\. Flammarion, and M\. Andriushchenko\(2025\)Os\-harm: a benchmark for measuring safety of computer use agents\.arXiv preprint arXiv:2506\.14866\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.6.5.2.1.1),[§VI\-C](https://arxiv.org/html/2605.07110#S6.SS3.p3.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.4.2.5.1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p2.1),[§VIII\-C](https://arxiv.org/html/2605.07110#S8.SS3.p1.1)\.
- \[66\]S\. J\. Lazer, K\. Aryal, M\. Gupta, and E\. Bertino\(2026\)A survey of agentic ai and cybersecurity: challenges, opportunities and use\-case prototypes\.arXiv preprint arXiv:2601\.05293\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.7.6.2.1.1)\.
- \[67\]D\. Lee and M\. Tiwari\(2024\)Prompt infection: llm\-to\-llm prompt injection within multi\-agent systems\.arXiv preprint arXiv:2410\.07283\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p6.1),[§VIII\-F](https://arxiv.org/html/2605.07110#S8.SS6.p1.1)\.
- \[68\]J\. Lee, D\. Lee, C\. Choi, Y\. Im, J\. Wi, K\. Heo, S\. Oh, S\. Lee, and I\. Shin\(2025\)VeriSafe Agent: Safeguarding Mobile GUI Agent via Logic\-based Action Verification\.InProceedings of the 31st Annual International Conference on Mobile Computing and Networking,pp\. 817–831\.External Links:[Document](https://dx.doi.org/10.1145/3680207.3765248),[Link](https://doi.org/10.1145/3680207.3765248)Cited by:[4th item](https://arxiv.org/html/2605.07110#S7.I1.i4.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p4.1)\.
- \[69\]J\. Lee, D\. Hahm, J\. S\. Choi, W\. B\. Knox, and K\. Lee\(2026\)MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control\.InFortieth AAAI Conference on Artificial Intelligence,pp\. 37565–37573\.External Links:[Document](https://dx.doi.org/10.1609/AAAI.V40I44.41090),[Link](https://doi.org/10.1609/AAAI.V40I44.41090)Cited by:[§VI\-C](https://arxiv.org/html/2605.07110#S6.SS3.p3.1),[§VI](https://arxiv.org/html/2605.07110#S6.p2.1)\.
- \[70\]I\. Levy, B\. Wiesel, S\. Marreed, A\. Oved, A\. Yaeli, N\. Mashkif, and S\. Shlomov\(2026\)ST\-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents\.InInternational Conference on Learning Representations,Note:PosterExternal Links:[Link](https://openreview.net/forum?id=MuCDzH0ctf)Cited by:[§VI\-C](https://arxiv.org/html/2605.07110#S6.SS3.p3.1),[§VI](https://arxiv.org/html/2605.07110#S6.p2.1)\.
- \[71\]J\. Li and K\. Huang\(2025\)A survey on gui agents with foundation models enhanced by reinforcement learning\.arXiv preprint arXiv:2504\.20464\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.3.2.2.1.1),[§I](https://arxiv.org/html/2605.07110#S1.p3.1),[§III](https://arxiv.org/html/2605.07110#S3.p1.1),[§IV](https://arxiv.org/html/2605.07110#S4.p1.1)\.
- \[72\]J\. Li, T\. Lan, H\. Tan, Y\. Meng, and H\. Zhu\(2026\)SlowBA: an efficiency backdoor attack towards vlm\-based gui agents\.arXiv preprint arXiv:2603\.08316\.Cited by:[§VI\-C](https://arxiv.org/html/2605.07110#S6.SS3.p2.1),[§VI](https://arxiv.org/html/2605.07110#S6.p2.1)\.
- \[73\]K\. Li, Z\. Meng, H\. Lin, Z\. Luo, Y\. Tian, J\. Ma, Z\. Huang, and T\. Chua\(2025\)ScreenSpot\-Pro: GUI Grounding for Professional High\-Resolution Computer Use\.InProceedings of the 33rd ACM International Conference on Multimedia,pp\. 8778–8786\.External Links:[Document](https://dx.doi.org/10.1145/3746027.3755688),[Link](https://doi.org/10.1145/3746027.3755688)Cited by:[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p6.1)\.
- \[74\]N\. Li, X\. Qu, J\. Zhou, J\. Wang, M\. Wen, K\. Du, X\. Lou, Q\. Peng, J\. Wang, and W\. Zhang\(2025\)MobileUse: a gui agent with hierarchical reflection for autonomous mobile operation\.arXiv preprint arXiv:2507\.16853\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p4.1),[§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px1.p1.1)\.
- \[75\]S\. Li, K\. Kallidromitis, A\. Gokul, Y\. Kato, K\. Kozuka, and A\. Grover\(2025\)MobileWorldBench: towards semantic world modeling for mobile agents\.arXiv preprint arXiv:2512\.14014\.Cited by:[§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px1.p1.1)\.
- \[76\]Z\. Li, Z\. Cao, W\. Huang, Y\. Zhang, K\. Qi, R\. Wang, Z\. Zheng, J\. Zhao, H\. Zhu, H\. Wu,et al\.\(2026\)MagicGUI\-rms: a multi\-agent reward model system for self\-evolving gui agents via automated feedback reflux\.arXiv preprint arXiv:2601\.13060\.Cited by:[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px3.p1.1)\.
- \[77\]Z\. Li, J\. Cui, X\. Liao, and L\. Xing\(2025\)Les dissonances: cross\-tool harvesting and polluting in pool\-of\-tools empowered llm agents\.arXiv preprint arXiv:2504\.03111\.Cited by:[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px5.p1.1),[§VI\-D](https://arxiv.org/html/2605.07110#S6.SS4.p3.1),[§VI\-F](https://arxiv.org/html/2605.07110#S6.SS6.p2.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.6.4.5.1.1)\.
- \[78\]Z\. Liao, J\. Jones, L\. Jiang, Y\. Ning, E\. Fosler\-Lussier, Y\. Su, Z\. Lin, and H\. Sun\(2025\)Redteamcua: realistic adversarial testing of computer\-use agents in hybrid web\-os environments\.arXiv preprint arXiv:2505\.21936\.Cited by:[§VI\-D](https://arxiv.org/html/2605.07110#S6.SS4.p4.1),[5th item](https://arxiv.org/html/2605.07110#S7.I1.i5.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p5.1)\.
- \[79\]Z\. Liao, L\. Mo, C\. Xu, M\. Kang, J\. Zhang, C\. Xiao, Y\. Tian, B\. Li, and H\. Sun\(2025\)EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage\.InInternational Conference on Learning Representations,Note:PosterExternal Links:[Link](https://openreview.net/forum?id=xMOLUzo2Lk)Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p2.1),[§VI\-F](https://arxiv.org/html/2605.07110#S6.SS6.p2.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.10.8.5.1.1)\.
- \[80\]H\. Lin, X\. Tan, Y\. Qin, Z\. Xu, Y\. Shi, Z\. Li, G\. Li, S\. Cai, S\. Cai, C\. Fu,et al\.\(2025\)Cuarewardbench: a benchmark for evaluating reward models on computer\-using agent\.arXiv preprint arXiv:2510\.18596\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.SS0.SSS0.Px1.p1.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px3.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p2.1)\.
- \[81\]D\. Liu, Q\. Ren, C\. Qian, S\. Shao, Y\. Xie, Y\. Li, Z\. Yang, H\. Luo, P\. Wang, Q\. Liu,et al\.\(2026\)AgentDoG: a diagnostic guardrail framework for ai agent safety and security\.arXiv preprint arXiv:2601\.18491\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p3.1),[§VI\-G](https://arxiv.org/html/2605.07110#S6.SS7.p3.1),[4th item](https://arxiv.org/html/2605.07110#S7.I1.i4.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p4.1)\.
- \[82\]G\. Liu, P\. Zhao, Y\. Liang, L\. Liu, Y\. Guo, H\. Xiao, W\. Lin, Y\. Chai, Y\. Han, S\. Ren,et al\.\(2025\)Llm\-powered gui agents in phone automation: surveying progress and prospects\.arXiv preprint arXiv:2504\.19838\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.3.2.2.1.1),[§I](https://arxiv.org/html/2605.07110#S1.p3.1)\.
- \[83\]G\. Liu, P\. Zhao, Y\. Liang, Q\. Luo, S\. Tang, Y\. Chai, W\. Lin, H\. Xiao, W\. Wang, S\. Chen,et al\.\(2026\)MemGUI\-bench: benchmarking memory of mobile gui agents in dynamic environments\.arXiv preprint arXiv:2602\.06075\.Cited by:[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px2.p1.1)\.
- \[84\]G\. Liu, J\. Ye, J\. Liu, Y\. Li, W\. Liu, P\. Gao, J\. Luan, and Y\. Liu\(2025\)Hijacking jarvis: benchmarking mobile gui agents against unprivileged third parties\.InProceedings of the 2nd International Workshop on Edge and Mobile Foundation Models,pp\. 12–18\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p5.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.9.7.5.1.1)\.
- \[85\]H\. Liu, D\. Li, L\. Rutishauser, and Z\. Zheng\(2026\)Dual\-modality multi\-stage adversarial safety training: robustifying multimodal web agents against cross\-modal attacks\.arXiv preprint arXiv:2603\.04364\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p3.1),[§VI\-G](https://arxiv.org/html/2605.07110#S6.SS7.p3.1),[5th item](https://arxiv.org/html/2605.07110#S7.I1.i5.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p5.1)\.
- \[86\]S\. Liu, C\. Li, C\. Wang, J\. Hou, Z\. Chen, L\. Zhang, Z\. Liu, Q\. Ye, Y\. Hei, X\. Zhang,et al\.\(2026\)ClawKeeper: comprehensive safety protection for openclaw agents through skills, plugins, and watchers\.arXiv preprint arXiv:2603\.24414\.Cited by:[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px3.p1.1),[§VI\-G](https://arxiv.org/html/2605.07110#S6.SS7.p3.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.11.9.5.1.1),[5th item](https://arxiv.org/html/2605.07110#S7.I1.i5.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p5.1),[§VIII\-F](https://arxiv.org/html/2605.07110#S8.SS6.p1.1)\.
- \[87\]Y\. Liu, R\. Xu, X\. Wang, Y\. Jia, and N\. Z\. Gong\(2025\)WAInjectBench: benchmarking prompt injection detections for web agents\.arXiv preprint arXiv:2510\.01354\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.6.5.2.1.1),[§II\-A](https://arxiv.org/html/2605.07110#S2.SS1.p3.1),[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p2.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.7.5.5.1.1),[§VIII\-D](https://arxiv.org/html/2605.07110#S8.SS4.p1.1)\.
- \[88\]Y\. Liu, P\. Li, Z\. Wei, C\. Xie, X\. Hu, X\. Xu, S\. Zhang, X\. Han, H\. Yang, and F\. Wu\(2025\)Infiguiagent: a multimodal generalist gui agent with native reasoning and reflection\.arXiv preprint arXiv:2501\.04575\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p3.1)\.
- \[89\]Y\. Lu, J\. Yang, Y\. Shen, and A\. Awadallah\(2024\)Omniparser for pure vision based gui agent\.arXiv preprint arXiv:2408\.00203\.Cited by:[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p3.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.3.2.2.1.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1)\.
- \[90\]Z\. Lu, Y\. Chai, Y\. Guo, X\. Yin, L\. Liu, H\. Wang, H\. Xiao, S\. Ren, G\. Xiong, and H\. Li\(2025\)Ui\-r1: enhancing efficient action prediction of gui agents by reinforcement learning\.arXiv preprint arXiv:2503\.21620\.Cited by:[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px3.p1.1)\.
- \[91\]D\. Luo, B\. Tang, K\. Li, G\. Papoudakis, J\. Song, S\. Gong, J\. Hao, J\. Wang, and K\. Shao\(2025\)ViMo: a generative visual gui world model for app agent\.arXiv preprint arXiv:2504\.13936\.Cited by:[§II\-B](https://arxiv.org/html/2605.07110#S2.SS2.p2.1)\.
- \[92\]R\. Luo, L\. Wang, W\. He, L\. Chen, J\. Li, and X\. Xia\(2025\)Gui\-r1: a generalist r1\-style vision\-language action model for gui agents\.arXiv preprint arXiv:2504\.10458\.Cited by:[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px3.p1.1)\.
- \[93\]Y\. Luo, H\. Zhu, S\. Pang, Z\. Lu, T\. Dong, Y\. Zhou, and M\. Xue\(2026\)AgentRAE: remote action execution through notification\-based visual backdoors against screenshots\-based mobile gui agents\.arXiv preprint arXiv:2603\.23007\.Cited by:[§VI](https://arxiv.org/html/2605.07110#S6.p2.1)\.
- \[94\]S\. Marro, A\. Chan, X\. Ren, L\. Hammond, J\. Wright, G\. Wanga, T\. Piccardi, N\. Campos, T\. South, J\. Yu,et al\.\(2025\)Permission manifests for web agents\.arXiv preprint arXiv:2601\.02371\.Cited by:[§VI](https://arxiv.org/html/2605.07110#S6.p2.1)\.
- \[95\]A\. McConnon\(2026\)OpenClaw, moltbook and the future of ai agents\.Note:https://www\.ibm\.com/think/news/clawdbot\-ai\-agent\-testing\-limits\-vertical\-integrationIBM Think article, accessed: 2026\-04\-11Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p5.1),[§III\-C](https://arxiv.org/html/2605.07110#S3.SS3.p5.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px5.p1.1),[§VII\-C](https://arxiv.org/html/2605.07110#S7.SS3.p3.1)\.
- \[96\]L\. Meng, H\. Feng, I\. Shumailov, and E\. Fernandes\(2025\)Cellmate: sandboxing browser ai agents\.arXiv preprint arXiv:2512\.12594\.Cited by:[§VI\-D](https://arxiv.org/html/2605.07110#S6.SS4.p2.1),[3rd item](https://arxiv.org/html/2605.07110#S7.I1.i3.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p3.1)\.
- \[97\]L\. Miculicich, M\. Parmar, H\. Palangi, K\. D\. Dvijotham, M\. Montanari, T\. Pfister, and L\. T\. Le\(2025\)Veriguard: enhancing llm agent safety via verified code generation\.arXiv preprint arXiv:2510\.05156\.Cited by:[1st item](https://arxiv.org/html/2605.07110#S7.I1.i1.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p2.1)\.
- \[98\]F\. Mo, J\. Chen, H\. Zhu, and X\. Hu\(2025\)Building a stable planner: an extended finite state machine based planning module for mobile gui agent\.arXiv preprint arXiv:2505\.14141\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p5.1)\.
- \[99\]J\. Mu, C\. Zhang, C\. Ni, L\. Wang, B\. Qiao, K\. Mathur, Q\. Wu, Y\. Xie, X\. Ma, M\. Zhou,et al\.\(2025\)GUI\-360°: a comprehensive dataset and benchmark for computer\-using agents\.arXiv preprint arXiv:2511\.04307\.Cited by:[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p3.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1)\.
- \[100\]D\. Nguyen, J\. Chen, Y\. Wang, G\. Wu, N\. Park, Z\. Hu, H\. Lyu, J\. Wu, R\. Aponte, Y\. Xia,et al\.\(2025\)Gui agents: a survey\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 22522–22538\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.2.1.2.1.1),[§I](https://arxiv.org/html/2605.07110#S1.p3.1)\.
- \[101\]H\. Nie, X\. Liu, Y\. Bai, Y\. Wang, Y\. Liu, Q\. Yao, and Z\. Wang\(2026\)PSPA\-bench: a personalized benchmark for smartphone gui agent\.arXiv preprint arXiv:2603\.29318\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p2.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px4.p1.1),[§V](https://arxiv.org/html/2605.07110#S5.p2.1)\.
- \[102\]L\. Ning, Z\. Liang, Z\. Jiang, H\. Qu, Y\. Ding, W\. Fan, X\. Wei, S\. Lin, H\. Liu, P\. S\. Yu,et al\.\(2025\)A survey of webagents: towards next\-generation ai agents for web automation with large foundation models\.In31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining \(KDD 2025\),pp\. 6140–6150\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.3.2.2.1.1),[§I](https://arxiv.org/html/2605.07110#S1.p3.1)\.
- \[103\]Y\. Ning, J\. Jones, Z\. Zhang, C\. Ye, W\. Ruan, J\. Li, R\. Gupta, and H\. Sun\(2026\)When actions go off\-task: detecting and correcting misaligned actions in computer\-use agents\.arXiv preprint arXiv:2602\.08995\.Cited by:[4th item](https://arxiv.org/html/2605.07110#S7.I1.i4.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p4.1)\.
- \[104\]R\. Niu, J\. Li, S\. Wang, Y\. Fu, X\. Hu, X\. Leng, H\. Kong, Y\. Chang, and Q\. Wang\(2024\)Screenagent: a vision language model\-driven computer control agent\.arXiv preprint arXiv:2402\.07945\.Cited by:[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p4.1)\.
- \[105\]OpenClaw\(2026\)OpenClaw — personal ai assistant\.Note: https://openclaw\.ai/ \(website, accessed 2026\-04\-11\)\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p5.1),[§III\-C](https://arxiv.org/html/2605.07110#S3.SS3.p5.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px5.p1.1),[§VII\-C](https://arxiv.org/html/2605.07110#S7.SS3.p3.1)\.
- \[106\]S\. Pei, L\. Tang, T\. Duan, L\. Chen, S\. Li, K\. Huang, Y\. Jing, Y\. Yan, B\. Zhang, C\. Jiang,et al\.\(2026\)AdaZoom\-gui: adaptive zoom\-based gui grounding with instruction refinement\.arXiv preprint arXiv:2603\.17441\.Cited by:[§IV](https://arxiv.org/html/2605.07110#S4.p2.1)\.
- \[107\]P\. Peigné, M\. Kniejski, F\. Sondej, M\. David, J\. Hoelscher\-Obermaier, C\. S\. de Witt, and E\. Kran\(2025\)Multi\-agent security tax: trading off security and collaboration capabilities in multi\-agent systems\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 27573–27581\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p6.1),[§VIII\-F](https://arxiv.org/html/2605.07110#S8.SS6.p1.1)\.
- \[108\]Y\. Qian, K\. Qian, X\. He, L\. Chen, J\. Zhang, T\. Zhang, H\. Wei, L\. Wang, H\. Wu, and B\. Mao\(2026\)Zero\-permission manipulation: can we trust large multimodal model powered gui agents?\.arXiv preprint arXiv:2601\.12349\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p5.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.9.7.5.1.1),[§VIII\-D](https://arxiv.org/html/2605.07110#S8.SS4.p1.1)\.
- \[109\]G\. V\. Ramesh, A\. Nayak, B\. Siddique, and K\. Fawaz\(2026\)WebSP\-eval: evaluating web agents on website security and privacy tasks\.arXiv preprint arXiv:2604\.06367\.Cited by:[§V](https://arxiv.org/html/2605.07110#S5.p2.1)\.
- \[110\]C\. Rawles, A\. Li, D\. Rodriguez, O\. Riva, and T\. Lillicrap\(2023\)Androidinthewild: a large\-scale dataset for android device control\.Advances in Neural Information Processing Systems36,pp\. 59708–59728\.Cited by:[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px1.p1.1)\.
- \[111\]X\. Ren, P\. Jiang, K\. Li, Z\. Huang, X\. Du, J\. Jiang, Z\. Xing, J\. Sun, and T\. Y\. Zhuo\(2025\)HackWorld: evaluating computer\-use agents on exploiting web application vulnerabilities\.arXiv preprint arXiv:2510\.12200\.Cited by:[§VI\-D](https://arxiv.org/html/2605.07110#S6.SS4.p4.1)\.
- \[112\]U\. M\. Sehwag, S\. Shabihi, A\. McAvoy, V\. Sehwag, Y\. Xu, D\. Towers, and F\. Huang\(2025\)PropensityBench: evaluating latent safety risks in large language models via an agentic approach\.arXiv preprint arXiv:2511\.20703\.Cited by:[§VIII\-E](https://arxiv.org/html/2605.07110#S8.SS5.p1.1)\.
- \[113\]D\. Seip and M\. Hein\(2026\)Preference redirection via attention concentration: an attack on computer use agents\.arXiv preprint arXiv:2604\.08005\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p4.1),[§VIII\-D](https://arxiv.org/html/2605.07110#S8.SS4.p1.1)\.
- \[114\]Y\. Shi, J\. Li, L\. Zhang, Z\. Dongfang, B\. Wu, S\. Tao, Y\. Yan, C\. Qin, W\. Liu, Z\. Lin,et al\.\(2026\)AndroTMem: from interaction trajectories to anchored memory in long\-horizon gui agents\.arXiv preprint arXiv:2603\.18429\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p3.1),[§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px1.p1.1)\.
- \[115\]Y\. Shi, W\. Yu, W\. Yao, W\. Chen, and N\. Liu\(2025\)Towards trustworthy gui agents: a survey\.arXiv preprint arXiv:2503\.23434\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.2.1.2.1.1),[§I](https://arxiv.org/html/2605.07110#S1.p3.1),[§IV\-D](https://arxiv.org/html/2605.07110#S4.SS4.p2.1),[§IV](https://arxiv.org/html/2605.07110#S4.p1.1)\.
- \[116\]N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao\(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in neural information processing systems36,pp\. 8634–8652\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p5.1)\.
- \[117\]K\. Singh, S\. Singh, and M\. Khanna\(2025\)Trishul: towards region identification and screen hierarchy understanding for large vlm based gui agents\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 170–179\.Cited by:[§II\-B](https://arxiv.org/html/2605.07110#S2.SS2.p1.1),[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p3.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px2.p1.1)\.
- \[118\]L\. Song, Y\. Dai, V\. Prabhu, J\. Zhang, T\. Shi, L\. Li, J\. Li, S\. Savarese, Z\. Chen, J\. Zhao,et al\.\(2025\)Coact\-1: computer\-using agents with coding as actions\.arXiv preprint arXiv:2508\.03923\.Cited by:[§II\-A](https://arxiv.org/html/2605.07110#S2.SS1.p2.1),[§II\-B](https://arxiv.org/html/2605.07110#S2.SS2.p2.1),[§II\-C](https://arxiv.org/html/2605.07110#S2.SS3.p2.1),[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p6.1),[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p3.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.7.6.2.1.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px2.p1.1)\.
- \[119\]L\. Song, J\. Zhang, H\. Sheng, T\. Shi, G\. Rahul, Y\. Liu, R\. Krishna, J\. Kang, and J\. Zhao\(2026\)Video\-based reward modeling for computer\-use agents\.arXiv preprint arXiv:2603\.10178\.Cited by:[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px3.p1.1)\.
- \[120\]L\. Sun, J\. Zhang, S\. Wang, and Z\. Wei\(2026\)MAGNET: towards adaptive gui agents with memory\-driven knowledge evolution\.arXiv preprint arXiv:2601\.19199\.Cited by:[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px1.p1.1)\.
- \[121\]Q\. Sun, K\. Cheng, Z\. Ding, C\. Jin, Y\. Wang, F\. Xu, Z\. Wu, C\. Jia, L\. Chen, Z\. Liu,et al\.\(2025\)Os\-genesis: automating gui agent trajectory construction via reverse task synthesis\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 5555–5579\.Cited by:[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px1.p1.1)\.
- \[122\]W\. J\. Tan, Z\. R\. L\. Lim, S\. Durgad, K\. Obegi, and A\. Y\. Li\(2026\)OpeFlo: automated ux evaluation via simulated human web interaction with gui grounding\.arXiv preprint arXiv:2604\.09581\.Cited by:[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px2.p1.1)\.
- \[123\]W\. Tan, W\. Zhang, X\. Xu, H\. Xia, Z\. Ding, B\. Li, B\. Zhou, J\. Yue, J\. Jiang, Y\. Li,et al\.\(2024\)Cradle: empowering foundation agents towards general computer control\.arXiv preprint arXiv:2403\.03186\.Cited by:[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p2.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px1.p1.1)\.
- \[124\]F\. Tang, Z\. Lu, B\. Zhang, W\. Lu, J\. Xiao, Y\. Zhuang, and Y\. Shen\(2026\)ClawGUI: a unified framework for training, evaluating, and deploying gui agents\.arXiv preprint arXiv:2604\.11784\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.SS0.SSS0.Px1.p1.1),[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.4.3.2.1.1)\.
- \[125\]F\. Tang, H\. Xu, H\. Zhang, S\. Chen, X\. Wu, Y\. Shen, W\. Zhang, G\. Hou, Z\. Tan, Y\. Yan,et al\.\(2025\)A survey on \(m\)llm\-based gui agents\.arXiv preprint arXiv:2504\.13865\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.2.1.2.1.1),[§I](https://arxiv.org/html/2605.07110#S1.p3.1),[§III](https://arxiv.org/html/2605.07110#S3.p1.1),[§IV](https://arxiv.org/html/2605.07110#S4.p1.1)\.
- \[126\]A\. X\. Tian, R\. Zhang, J\. Tang, J\. Wang, T\. Shi, and J\. Wen\(2025\)Measuring harmfulness of computer\-using agents\.arXiv preprint arXiv:2508\.00935\.Cited by:[§VI\-C](https://arxiv.org/html/2605.07110#S6.SS3.p3.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.4.2.5.1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p2.1),[§VIII\-C](https://arxiv.org/html/2605.07110#S8.SS3.p1.1),[§VIII\-E](https://arxiv.org/html/2605.07110#S8.SS5.p1.1)\.
- \[127\]Y\. Tian, X\. Yang, J\. Zhang, Y\. Dong, and H\. Su\(2023\)Evil geniuses: delving into the safety of llm\-based agents\.arXiv preprint arXiv:2311\.11855\.Cited by:[§VI\-A](https://arxiv.org/html/2605.07110#S6.SS1.p1.1)\.
- \[128\]D\. Toyama, P\. Hamel, A\. Gergely, G\. Comanici, A\. Glaese, Z\. Ahmed, T\. Jackson, S\. Mourad, and D\. Precup\(2021\)Androidenv: a reinforcement learning platform for android\.arXiv preprint arXiv:2105\.13231\.Cited by:[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p2.1)\.
- \[129\]H\. Wang, H\. Zou, H\. Song, J\. Feng, J\. Fang, J\. Lu, L\. Liu, Q\. Luo, S\. Liang, S\. Huang,et al\.\(2025\)Ui\-tars\-2 technical report: advancing gui agent with multi\-turn reinforcement learning\.arXiv preprint arXiv:2509\.02544\.Cited by:[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p2.1)\.
- \[130\]J\. Wang, J\. Zhou, W\. Zhang, T\. Wang, W\. Liu, Z\. Zhang, X\. Lou, W\. Zhang, H\. DENG, and J\. Wang\(2026\)ColorBrowserAgent: complex long\-horizon browser agent with adaptive knowledge evolution\.InThe 64th Annual Meeting of the Association for Computational Linguistics–Industry Track,Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p2.1),[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p3.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.5.4.2.1.1)\.
- \[131\]J\. Wang, H\. Xu, J\. Ye, M\. Yan, W\. Shen, J\. Zhang, F\. Huang, and J\. Sang\(2024\)Mobile\-agent: autonomous multi\-modal mobile device agent with visual perception\.arXiv preprint arXiv:2401\.16158\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p5.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.4.3.2.1.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px1.p1.1)\.
- \[132\]S\. Wang, F\. Yu, X\. Liu, X\. Qin, J\. Zhang, Q\. Lin, D\. Zhang, and S\. Rajmohan\(2025\)Privacy in action: towards realistic privacy mitigation and evaluation for llm\-powered agents\.arXiv preprint arXiv:2509\.17488\.Cited by:[§VI\-F](https://arxiv.org/html/2605.07110#S6.SS6.p2.1)\.
- \[133\]S\. Wang, W\. Liu, J\. Chen, Y\. Zhou, W\. Gan, X\. Zeng, Y\. Che, S\. Yu, X\. Hao, K\. Shao,et al\.\(2024\)Gui agents with foundation models: a comprehensive survey\.arXiv preprint arXiv:2411\.04890\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.SS0.SSS0.Px1.p1.1),[§I](https://arxiv.org/html/2605.07110#S1.p3.1)\.
- \[134\]X\. Wang, J\. Bloch, Z\. Shao, Y\. Hu, S\. Zhou, and N\. Z\. Gong\(2025\)Webinject: prompt injection attack to web agents\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 2010–2030\.Cited by:[§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px2.p1.1),[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p2.1),[§VI](https://arxiv.org/html/2605.07110#S6.p2.1),[§VIII\-D](https://arxiv.org/html/2605.07110#S8.SS4.p1.1)\.
- \[135\]X\. Wang, Y\. Liu, Z\. Wang, D\. Song, and N\. Gong\(2026\)WebSentinel: detecting and localizing prompt injection attacks for web agents\.arXiv preprint arXiv:2602\.03792\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p3.1),[§VI\-G](https://arxiv.org/html/2605.07110#S6.SS7.p3.1),[§VI](https://arxiv.org/html/2605.07110#S6.p2.1),[5th item](https://arxiv.org/html/2605.07110#S7.I1.i5.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p4.1)\.
- \[136\]X\. Wang, B\. Wang, D\. Lu, J\. Yang, T\. Xie, J\. Wang, J\. Deng, X\. Guo, Y\. Xu, C\. H\. Wu,et al\.\(2025\)Opencua: open foundations for computer\-use agents\.arXiv preprint arXiv:2508\.09123\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.4.3.2.1.1),[§I](https://arxiv.org/html/2605.07110#S1.p1.1),[§I](https://arxiv.org/html/2605.07110#S1.p2.1),[§II\-A](https://arxiv.org/html/2605.07110#S2.SS1.p2.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px1.p1.1)\.
- \[137\]Y\. Wang, Z\. Zhang, W\. Zhou, W\. Zhang, J\. Zhang, Q\. Zhu, Y\. Shi, S\. Zheng, and J\. He\(2026\-01\)GUIGuard: Toward a General Framework for Privacy\-Preserving GUI Agents\.arXiv preprint arXiv:2601\.18842\.Cited by:[§VI\-F](https://arxiv.org/html/2605.07110#S6.SS6.p2.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.10.8.5.1.1)\.
- \[138\]Y\. Wang, F\. Xu, Z\. Lin, G\. He, Y\. Huang, H\. Gao, Z\. Niu, S\. Lian, and Z\. Liu\(2026\)From assistant to double agent: formalizing and benchmarking attacks on openclaw for personalized local ai agent\.arXiv preprint arXiv:2602\.08412\.Cited by:[§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px4.p1.1),[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px3.p1.1),[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p4.1),[§VI\-G](https://arxiv.org/html/2605.07110#S6.SS7.p3.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.8.6.5.1.1)\.
- \[139\]Z\. Wang, H\. Tu, L\. Zhang, H\. Chen, J\. Wu, X\. Liu, Z\. Yuan, T\. Pang, M\. Q\. Shieh, F\. Liu,et al\.\(2026\)Your agent, their asset: a real\-world safety analysis of openclaw\.arXiv preprint arXiv:2604\.04759\.Cited by:[§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px4.p1.1),[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px3.p1.1),[§VI\-F](https://arxiv.org/html/2605.07110#S6.SS6.p2.1),[§VI\-G](https://arxiv.org/html/2605.07110#S6.SS7.p3.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p5.1),[§VIII\-F](https://arxiv.org/html/2605.07110#S8.SS6.p1.1)\.
- \[140\]M\. Wu, Y\. Guo, Y\. Cao, H\. Lu, S\. Zhu, P\. Qu, X\. Chen, K\. Qin, Z\. Wang, X\. Zhang,et al\.\(2026\)UI\-oceanus: scaling gui agents with synthetic environmental dynamics\.arXiv preprint arXiv:2604\.02345\.Cited by:[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px1.p1.1)\.
- \[141\]P\. Wu, S\. Ma, B\. Wang, J\. Yu, L\. Lu, and Z\. Liu\(2025\)GUI\-reflection: empowering multimodal gui models with self\-reflection behavior\.arXiv preprint arXiv:2506\.08012\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p4.1)\.
- \[142\]Q\. Wu, P\. Gao, W\. Liu, and J\. Luan\(2025\)Backtrackagent: enhancing gui agent with error detection and backtracking mechanism\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 4250–4272\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p5.1)\.
- \[143\]Z\. Wu, P\. Cheng, Z\. Wu, L\. Dong, and Z\. Zhang\(2026\)Gem: gaussian embedding modeling for out\-of\-distribution detection in gui agents\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 33989–33997\.Cited by:[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p4.1)\.
- \[144\]Z\. Wu, C\. Han, Z\. Ding, Z\. Weng, Z\. Liu, S\. Yao, T\. Yu, and L\. Kong\(2024\)Os\-copilot: towards generalist computer agents with self\-improvement\.arXiv preprint arXiv:2402\.07456\.Cited by:[§II\-A](https://arxiv.org/html/2605.07110#S2.SS1.p2.1),[§II\-B](https://arxiv.org/html/2605.07110#S2.SS2.p2.1),[§II\-C](https://arxiv.org/html/2605.07110#S2.SS3.p2.1),[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p5.1),[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p3.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.7.6.2.1.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px2.p1.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px5.p1.1)\.
- \[145\]H\. Xiao, G\. Wang, Y\. Chai, Z\. Lu, W\. Lin, H\. He, L\. Fan, L\. Bian, R\. Hu, L\. Liu,et al\.\(2025\)Ui\-genie: a self\-improving approach for iteratively boosting mllm\-based mobile gui agents\.arXiv preprint arXiv:2505\.21496\.Cited by:[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px3.p1.1)\.
- \[146\]T\. Xie, D\. Zhang, J\. Chen, X\. Li, S\. Zhao, R\. Cao, T\. J\. Hua, Z\. Cheng, D\. Shin, F\. Lei,et al\.\(2024\)Osworld: benchmarking multimodal agents for open\-ended tasks in real computer environments\.Advances in Neural Information Processing Systems37,pp\. 52040–52094\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.5.4.2.1.1),[§I](https://arxiv.org/html/2605.07110#S1.p2.1),[§II\-A](https://arxiv.org/html/2605.07110#S2.SS1.p2.1),[§II\-B](https://arxiv.org/html/2605.07110#S2.SS2.p2.1),[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p3.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px4.p1.1)\.
- \[147\]Y\. Xie, S\. Chen, J\. Xing, W\. Jiang, Z\. Zhu, Y\. Wang, P\. Bu, J\. Song, Y\. Jiang, and B\. Zheng\(2026\)SecAgent: efficient mobile gui agent with semantic context\.arXiv preprint arXiv:2603\.08533\.Cited by:[§II\-B](https://arxiv.org/html/2605.07110#S2.SS2.p2.1),[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p3.1)\.
- \[148\]H\. Xu, X\. Zhang, H\. Liu, J\. Wang, Z\. Zhu, S\. Zhou, X\. Hu, F\. Gao, J\. Cao, Z\. Wang,et al\.\(2026\)Mobile\-agent\-v3\.5: multi\-platform fundamental gui agents\.arXiv preprint arXiv:2602\.16855\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p1.1),[§II\-A](https://arxiv.org/html/2605.07110#S2.SS1.p2.1),[§III\-C](https://arxiv.org/html/2605.07110#S3.SS3.p3.1),[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p4.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px1.p1.1)\.
- \[149\]H\. Yan, J\. Wang, X\. Huang, Y\. Shen, Z\. Meng, Z\. Fan, K\. Tan, J\. Gao, L\. Shi, M\. Yang,et al\.\(2025\)Step\-gui technical report\.arXiv preprint arXiv:2512\.15431\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p1.1),[§II\-C](https://arxiv.org/html/2605.07110#S2.SS3.p2.1),[§III\-C](https://arxiv.org/html/2605.07110#S3.SS3.p3.1),[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p4.1),[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p3.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px2.p1.1)\.
- \[150\]J\. Yang, S\. Shao, D\. Liu, and J\. Shao\(2025\)Riosworld: benchmarking the risk of multimodal computer\-use agents\.arXiv preprint arXiv:2506\.00618\.Cited by:[§II\-A](https://arxiv.org/html/2605.07110#S2.SS1.p3.1),[§VI\-A](https://arxiv.org/html/2605.07110#S6.SS1.p1.1),[§VI\-C](https://arxiv.org/html/2605.07110#S6.SS3.p3.1),[§VI](https://arxiv.org/html/2605.07110#S6.p2.1),[§VIII\-C](https://arxiv.org/html/2605.07110#S8.SS3.p1.1)\.
- \[151\]P\. Yang, H\. Ci, and M\. Z\. Shou\(2025\)In\-context defense in computer agents: an empirical study\.arXiv preprint arXiv:2503\.09241\.Cited by:[§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px2.p1.1),[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p3.1)\.
- \[152\]P\. Yang, H\. Ci, and M\. Z\. Shou\(2025\)MacOSWorld: a multilingual interactive benchmark for gui agents\.arXiv preprint arXiv:2506\.04135\.Cited by:[§IV\-D](https://arxiv.org/html/2605.07110#S4.SS4.p2.1)\.
- \[153\]S\. Yang, J\. Yu, Y\. Peng, K\. Q\. Lin, J\. W\. Cho, Y\. Song, and J\. Kim\(2026\)GUIDE: a benchmark for understanding and assisting users in open\-ended gui tasks\.arXiv preprint arXiv:2603\.25864\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p4.1),[§V](https://arxiv.org/html/2605.07110#S5.p2.1),[§VII\-B](https://arxiv.org/html/2605.07110#S7.SS2.p2.1)\.
- \[154\]W\. Yang, C\. Jin, H\. Zhu, W\. Luo, D\. Yuen, K\. Shao, H\. Huang, J\. Duan, J\. Cao, and R\. He\(2026\)Are gui agents focused enough? automated distraction via semantic\-level ui element injection\.arXiv preprint arXiv:2604\.07831\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p2.1),[§VIII\-D](https://arxiv.org/html/2605.07110#S8.SS4.p1.1)\.
- \[155\]S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan\(2023\)Tree of thoughts: deliberate problem solving with large language models\.Advances in neural information processing systems36,pp\. 11809–11822\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p5.1)\.
- \[156\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao\(2022\)React: synergizing reasoning and acting in language models\.InThe eleventh international conference on learning representations,Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p5.1)\.
- \[157\]K\. You, H\. Zhang, E\. Schoop, F\. Weers, A\. Swearngin, J\. Nichols, Y\. Yang, and Z\. Gan\(2024\)Ferret\-ui: grounded mobile ui understanding with multimodal llms\.InEuropean Conference on Computer Vision,pp\. 240–255\.Cited by:[§II\-B](https://arxiv.org/html/2605.07110#S2.SS2.p1.1),[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p3.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px2.p1.1)\.
- \[158\]M\. Yu, S\. Luo, and X\. Chen\(2026\-01\)GraphPilot: GUI Task Automation with One\-Step LLM Reasoning Powered by Knowledge Graph\.Journal of Intelligent Computing and Networking\.arXiv preprint arXiv:2601\.17418\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p2.1),[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p6.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.6.5.2.1.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px2.p1.1)\.
- \[159\]S\. Yu, G\. Li, W\. Shi, and P\. Qi\(2025\)Polyskill: learning generalizable skills through polymorphic abstraction\.arXiv preprint arXiv:2510\.15863\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p6.1),[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p4.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1),[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px1.p1.1)\.
- \[160\]P\. Yuan, Y\. Yin, Y\. Cai, and Z\. Wei\(2026\)WebForge: breaking the realism\-reproducibility\-scalability trilemma in browser agent benchmark\.arXiv preprint arXiv:2604\.10988\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.SS0.SSS0.Px1.p1.1),[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.5.4.2.1.1),[§I](https://arxiv.org/html/2605.07110#S1.p2.1),[§III\-C](https://arxiv.org/html/2605.07110#S3.SS3.p2.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px4.p1.1),[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px2.p1.1),[§V](https://arxiv.org/html/2605.07110#S5.p2.1),[§VIII\-C](https://arxiv.org/html/2605.07110#S8.SS3.p1.1)\.
- \[161\]X\. Yuan, J\. Zhang, K\. Li, Z\. Cai, L\. Yao, J\. Chen, E\. Wang, Q\. Hou, J\. Chen, P\. Jiang, and B\. Li\(2025\)Enhancing visual grounding for gui agents via self\-evolutionary reinforcement learning\.arXiv preprint arXiv:2505\.12370\.Cited by:[§II\-B](https://arxiv.org/html/2605.07110#S2.SS2.p1.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px2.p1.1)\.
- \[162\]A\. Zeng, M\. Liu, R\. Lu, B\. Wang, X\. Liu, Y\. Dong, and J\. Tang\(2024\)Agenttuning: enabling generalized agent abilities for llms\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 3053–3077\.Cited by:[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px1.p1.1)\.
- \[163\]Y\. Zhai, R\. Li, L\. Wang, N\. Shi, L\. Xu, W\. Zhang, R\. Lin, B\. Xu, and B\. Cui\(2026\)GUIDE: interpretable gui agent evaluation via hierarchical diagnosis\.arXiv preprint arXiv:2604\.04399\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.SS0.SSS0.Px1.p1.1)\.
- \[164\]Q\. Zhan, Z\. Liang, Z\. Ying, and D\. Kang\(2024\)Injecagent: benchmarking indirect prompt injections in tool\-integrated large language model agents\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 10471–10506\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p2.1)\.
- \[165\]B\. Zhang, Z\. Shang, Z\. Gao, W\. Zhang, R\. Xie, X\. Ma, T\. Yuan, X\. Wu, S\. Zhu, and Q\. Li\(2025\)TongUI: internet\-scale trajectories from multimodal web tutorials for generalized gui agents\.arXiv preprint arXiv:2504\.12679\.Cited by:[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px1.p1.1)\.
- \[166\]C\. Zhang, L\. Li, S\. He, X\. Zhang, B\. Qiao, S\. Qin, M\. Ma, Y\. Kang, Q\. Lin, S\. Rajmohan,et al\.\(2025\)Ufo: a ui\-focused agent for windows os interaction\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 597–622\.Cited by:[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p5.1),[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p6.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.6.5.2.1.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px2.p1.1)\.
- \[167\]C\. Zhang, Z\. Yang, J\. Liu, Y\. Li, Y\. Han, X\. Chen, Z\. Huang, B\. Fu, and G\. Yu\(2025\)Appagent: multimodal agents as smartphone users\.InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems,pp\. 1–20\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p3.1)\.
- \[168\]D\. Zhang, B\. Rama, J\. Ni, S\. He, F\. Zhao, K\. Chen, A\. Chen, and J\. Cao\(2025\)Litewebagent: the open\-source suite for vlm\-based web\-agent applications\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(System Demonstrations\),pp\. 449–455\.Cited by:[§III\-C](https://arxiv.org/html/2605.07110#S3.SS3.p4.1),[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p6.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.6.5.2.1.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px2.p1.1)\.
- \[169\]D\. Zhang, S\. Zhang, Z\. Yang, Z\. Zhu, Z\. Zhao, R\. Cao, L\. Chen, and K\. Yu\(2025\)Progrm: build better gui agents with progress rewards\.arXiv preprint arXiv:2505\.18121\.Cited by:[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px3.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p2.1)\.
- \[170\]D\. Zhang, Z\. Li, X\. Luo, X\. Liu, P\. Li, and W\. Xu\(2025\)MCP security bench \(msb\): benchmarking attacks against model context protocol in llm agents\.arXiv preprint arXiv:2510\.15994\.Cited by:[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px5.p1.1),[§VI\-D](https://arxiv.org/html/2605.07110#S6.SS4.p3.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.6.4.5.1.1),[§VIII\-F](https://arxiv.org/html/2605.07110#S8.SS6.p1.1)\.
- \[171\]H\. Zhang, J\. Huang, K\. Mei, Y\. Yao, Z\. Wang, C\. Zhan, H\. Wang, and Y\. Zhang\(2024\)Agent security bench \(asb\): formalizing and benchmarking attacks and defenses in llm\-based agents\.arXiv preprint arXiv:2410\.02644\.Cited by:[§VI\-A](https://arxiv.org/html/2605.07110#S6.SS1.p1.1)\.
- \[172\]K\. Zhang, M\. Tenenholtz, K\. Polley, J\. Ma, D\. Yarats, and N\. Li\(2025\)Browsesafe: understanding and preventing prompt injection within ai browser agents\.arXiv preprint arXiv:2511\.20597\.Cited by:[2nd item](https://arxiv.org/html/2605.07110#S7.I1.i2.p1.1)\.
- \[173\]L\. Zhang, Y\. Xiao, X\. Lu, J\. Cao, Y\. Zhao, J\. Zhou, L\. An, Z\. Feng, W\. Sha, Y\. Shi,et al\.\(2026\)OmegaUse: building a general\-purpose gui agent for autonomous task execution\.arXiv preprint arXiv:2601\.20380\.Cited by:[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p4.1)\.
- \[174\]W\. Zhang, Y\. Shen, C\. Jiang, J\. Dai, G\. Hong, and X\. Pan\(2026\)MirrorGuard: toward secure computer\-use agents via simulation\-to\-real reasoning correction\.arXiv preprint arXiv:2601\.12822\.Cited by:[4th item](https://arxiv.org/html/2605.07110#S7.I1.i4.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p4.1)\.
- \[175\]Y\. Zhang, X\. Guo, Y\. Goh, J\. Hu, Z\. Chen, X\. Wang, D\. Gao, and M\. Z\. Shou\(2026\)ShowUI\-aloha: human\-taught gui agent\.arXiv preprint arXiv:2601\.07181\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p2.1)\.
- \[176\]Y\. Zhang, X\. Li, L\. Cai, and J\. Li\(2026\)Environmental injection attacks against gui agents in realistic dynamic environments\.arXiv preprint arXiv:2509\.11250\.Cited by:[§VI\-E](https://arxiv.org/html/2605.07110#S6.SS5.p2.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.7.5.5.1.1),[§VIII\-D](https://arxiv.org/html/2605.07110#S8.SS4.p1.1)\.
- \[177\]Y\. Zhang, Y\. Wang, Y\. Zhu, P\. Du, J\. Miao, X\. Lu, W\. Xu, Y\. Hao, S\. Cai, X\. Wang,et al\.\(2026\)ClawBench: can ai agents complete everyday online tasks?\.arXiv preprint arXiv:2604\.08523\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p2.1),[§III\-C](https://arxiv.org/html/2605.07110#S3.SS3.p2.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px4.p1.1),[§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px1.p1.1),[§VIII\-C](https://arxiv.org/html/2605.07110#S8.SS3.p1.1)\.
- \[178\]Y\. Zhang, X\. Xue, X\. Wu, M\. Chen, C\. Liu, X\. He, R\. Shao, F\. Liu, H\. Xu, Q\. Pan,et al\.\(2026\)Don’t act blindly: robust gui automation via action\-effect verification and self\-correction\.arXiv preprint arXiv:2604\.05477\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p5.1),[4th item](https://arxiv.org/html/2605.07110#S7.I1.i4.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p4.1)\.
- \[179\]Z\. Zhang and A\. Zhang\(2024\)You only look at screens: multimodal chain\-of\-action agents\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 3132–3149\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p5.1)\.
- \[180\]H\. H\. Zhao, K\. Yang, W\. Yu, D\. Gao, and M\. Z\. Shou\(2025\)Worldgui: an interactive benchmark for desktop gui automation from any starting point\.arXiv preprint arXiv:2502\.08047\.Cited by:[§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px1.p1.1)\.
- \[181\]N\. Zhao\(2026\)WebPII: benchmarking visual pii detection for computer\-use agents\.arXiv preprint arXiv:2603\.17357\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.SS0.SSS0.Px1.p1.1),[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.6.5.2.1.1),[§VI](https://arxiv.org/html/2605.07110#S6.p2.1)\.
- \[182\]A\. Zharmagambetov, C\. Guo, I\. Evtimov, M\. Pavlova, R\. Salakhutdinov, and K\. Chaudhuri\(2025\)Agentdam: privacy leakage evaluation for autonomous web agents\.arXiv preprint arXiv:2503\.09780\.Cited by:[§VI\-F](https://arxiv.org/html/2605.07110#S6.SS6.p2.1)\.
- \[183\]J\. Zheng, Y\. Luo, J\. Xu, B\. Liu, Y\. Chen, C\. Cui, G\. Deng, C\. Lu, X\. Wang, A\. Zhang,et al\.\(2026\)Risky\-bench: probing agentic safety risks under real\-world deployment\.arXiv preprint arXiv:2602\.03100\.Cited by:[TABLE I](https://arxiv.org/html/2605.07110#S1.T1.3.6.5.2.1.1),[§II\-A](https://arxiv.org/html/2605.07110#S2.SS1.p3.1),[§V\-E](https://arxiv.org/html/2605.07110#S5.SS5.SSS0.Px2.p1.1),[§VI\-G](https://arxiv.org/html/2605.07110#S6.SS7.p2.1),[TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.11.9.5.1.1),[5th item](https://arxiv.org/html/2605.07110#S7.I1.i5.p1.1),[§VII\-A](https://arxiv.org/html/2605.07110#S7.SS1.p5.1),[§VIII\-C](https://arxiv.org/html/2605.07110#S8.SS3.p1.1)\.
- \[184\]L\. Zheng, R\. Wang, X\. Wang, and B\. An\(2023\)Synapse: trajectory\-as\-exemplar prompting with memory for computer control\.arXiv preprint arXiv:2306\.07863\.Cited by:[§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p3.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.5.4.2.1.1),[§IV](https://arxiv.org/html/2605.07110#S4.p2.1)\.
- \[185\]H\. Zhong, F\. Faisal, L\. França, T\. Leesatapornwongsa, A\. Szekeres, K\. Rong, and S\. Nath\(2026\)ActionEngine: from reactive to programmatic gui agents via state machine memory\.arXiv preprint arXiv:2602\.20502\.Cited by:[§V\-D](https://arxiv.org/html/2605.07110#S5.SS4.SSS0.Px1.p1.1)\.
- \[186\]H\. Zhou, X\. Zhang, P\. Tong, J\. Zhang, L\. Chen, Q\. Kong, C\. Cai, C\. Liu, Y\. Wang, J\. Zhou,et al\.\(2025\)MAI\-ui technical report: real\-world centric foundation gui agents\.arXiv preprint arXiv:2512\.22047\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p1.1),[§II\-C](https://arxiv.org/html/2605.07110#S2.SS3.p2.1),[§III\-C](https://arxiv.org/html/2605.07110#S3.SS3.p3.1),[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p4.1),[§IV\-C](https://arxiv.org/html/2605.07110#S4.SS3.p3.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.4.3.2.1.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px2.p1.1)\.
- \[187\]S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried,et al\.\(2023\)Webarena: a realistic web environment for building autonomous agents\.arXiv preprint arXiv:2307\.13854\.Cited by:[§I](https://arxiv.org/html/2605.07110#S1.p2.1),[§II\-A](https://arxiv.org/html/2605.07110#S2.SS1.p2.1),[§II\-B](https://arxiv.org/html/2605.07110#S2.SS2.p1.1),[§II\-B](https://arxiv.org/html/2605.07110#S2.SS2.p2.1),[§IV\-A](https://arxiv.org/html/2605.07110#S4.SS1.p2.1),[TABLE III](https://arxiv.org/html/2605.07110#S4.T3.3.2.1.2.1.1),[§V\-B](https://arxiv.org/html/2605.07110#S5.SS2.SSS0.Px4.p1.1),[§V\-C](https://arxiv.org/html/2605.07110#S5.SS3.SSS0.Px1.p1.1)\.
- \[188\] S\. Zhou \(2026\-03\) WebArena\-infinity: generating browser environments with verifiable tasks at scale\. shuyanzhou\.com\. External Links: [Link](https://webarena.dev/webarena-infinity/) Cited by: [§I](https://arxiv.org/html/2605.07110#S1.SS0.SSS0.Px1.p1.1), [§V](https://arxiv.org/html/2605.07110#S5.p2.1)\.
- \[189\] X\. Zhou, G\. Tie, G\. Zhang, H\. Wang, P\. Zhou, and L\. Sun \(2025\) BadVLA: towards backdoor attacks on vision\-language\-action models via objective\-decoupled optimization\. arXiv preprint arXiv:2505\.16640\. Cited by: [§VI\-C](https://arxiv.org/html/2605.07110#S6.SS3.p2.1), [TABLE V](https://arxiv.org/html/2605.07110#S6.T5.3.3.1.5.1.1)\.
- \[190\] J\. Zhu, L\. Yang, R\. Shan, C\. Zheng, Z\. Zheng, W\. Liu, Y\. Yu, W\. Zhang, and J\. Lin \(2026\) Turing test on screen: a benchmark for mobile GUI agent humanization\. arXiv preprint arXiv:2604\.09574\. Cited by: [§V](https://arxiv.org/html/2605.07110#S5.p2.1)\.
- \[191\] H\. P\. Zou, C\. Miao, W\. Huang, Y\. Chen, Y\. Zhou, H\. Zhang, Y\. Wu, L\. Fang, Z\. Gu, Z\. Zhang, et al\. \(2026\) When users change their mind: evaluating interruptible agents in long\-horizon web navigation\. arXiv preprint arXiv:2604\.00892\. Cited by: [§IV\-B](https://arxiv.org/html/2605.07110#S4.SS2.p4.1), [§VII\-B](https://arxiv.org/html/2605.07110#S7.SS2.p2.1)\.
