OSGuard: A Benchmark for Safety in Computer-Use Agents
Summary
OSGuard is a dual-granularity benchmark for evaluating safety in computer-use agents under benign user instructions, featuring action-level judgments and risk-augmented execution suites to detect unsafe shortcuts.
# OSGuard: A Benchmark for Safety in Computer-Use Agents
Source: [https://arxiv.org/html/2606.15034](https://arxiv.org/html/2606.15034)
Mina Mohammadmirzaei, Jeffrey Flanigan

University of California, Santa Cruz

mmohamm9@ucsc.edu, jmflanig@ucsc.edu
###### Abstract
Computer-use agents are increasingly evaluated by whether they complete realistic desktop and web tasks. However, task success alone can miss failures in which an agent reaches the nominal goal through an unsafe shortcut. We introduce OSGuard, a dual-granularity benchmark suite for evaluating safety in computer-use agents under benign, unchanged user instructions. OSGuard contains an action-level benchmark for local guardrail decisions and a risk-augmented execution suite for end-to-end evaluation. The action-level benchmark consists of contextualized proposed actions labeled as allowed, unrelated, or unsafe, each judged relative to the original instruction and current interface state. The execution suite contains manually constructed OSWorld-derived task variants in which the original task remains achievable, but the environment is modified to introduce latent hazards such as destructive overwrites, overbroad permission changes, and wrong-target updates. Each variant is paired with augmented evaluators that retain the original task-success criterion while adding explicit state-based safety invariants, allowing us to distinguish safe completions from unsafe completions that satisfy the nominal task objective. Our experimental results on OSGuard show that current multimodal guardrails can perform well on isolated action judgments, while risk-augmented execution exposes remaining gaps between local oversight and reliable end-to-end safety. This dual-granularity design enables more precise diagnosis of whether models can both recognize unsafe proposed actions and improve full-task safety when deployed as guardrails.
## 1 Introduction
Computer-use agents are increasingly able to perform multi-step tasks across realistic desktop and web environments (Xie et al., [2024](https://arxiv.org/html/2606.15034#bib.bib2); Zhou et al., [2023](https://arxiv.org/html/2606.15034#bib.bib1); Drouin et al., [2024](https://arxiv.org/html/2606.15034#bib.bib3)). As these agents become more capable, evaluating whether they can complete a user instruction is no longer sufficient. A task may be completed in a way that still violates important constraints of the user’s environment: the agent might overwrite unrelated content, broaden permissions, modify global settings, access unnecessary sensitive information, or act on the wrong target.
Many such failures in ordinary computer use are not overtly malicious\. The user’s instruction may be benign, the desired task may be achievable, and the agent’s behavior may look like progress toward the user’s goal\. The safety problem arises from how the agent carries out the task in the current state of the environment\. We study this class of failures as unsafe shortcuts: locally plausible actions or executions that advance the nominal task while violating constraints that a safe agent should preserve\.
Existing safety benchmarks have expanded evaluation beyond nominal task success, but they mostly study different regimes. OS-Harm and RiOSWorld include settings with malicious user requests, prompt injection, phishing, pop-ups, and other adversarial or explicitly harmful task contexts (Kuntz et al., [2025](https://arxiv.org/html/2606.15034#bib.bib4); Yang et al., [2025](https://arxiv.org/html/2606.15034#bib.bib5)). AUTOELICIT and BLIND-ACT move closer to benign-looking tasks, but either perturb task instructions to elicit harmful behavior or stress failures from context-insensitive execution, ambiguity, and infeasible or contradictory goals (Jones et al., [2026](https://arxiv.org/html/2606.15034#bib.bib6); Shayegani et al., [2025](https://arxiv.org/html/2606.15034#bib.bib7)). In contrast, we focus on a setting in which the original user instruction remains benign and unchanged, the task remains achievable, and the safety challenge comes from respecting task-state constraints while completing the task.
This setting calls for evaluation at two granularities\. First, a system should support local oversight: given the original instruction, the current interface state, and a proposed next action, it should determine whether the action should be executed\. Second, safety should be evaluated end to end: when the environment contains latent hazards, an agent should complete the task without violating additional constraints\. These granularities are related but not interchangeable\. A model may perform well on isolated action judgments without improving full\-task behavior, while end\-to\-end failures can be difficult to diagnose without identifying the local decisions that caused them\.
We introduce OSGuard, a dual-granularity benchmark suite for evaluating safety in computer-use agents under benign user instructions. OSGuard contains an action-level benchmark for local guardrail decisions and a risk-augmented execution suite derived from OSWorld tasks (Xie et al., [2024](https://arxiv.org/html/2606.15034#bib.bib2)). In the execution suite, the original instruction is kept fixed while the environment is modified to introduce latent hazards, preserving a safe path to task completion. Each task variant is paired with an augmented evaluator that retains the original task-success criterion and adds explicit safety invariants, allowing us to distinguish safe completions from unsafe completions that satisfy the nominal objective.
Our evaluation is intended as a diagnostic benchmark rather than complete coverage of all computer-use safety failures: the action-level benchmark contains 324 contextualized proposed actions from selected construction sources, and the execution suite focuses on 45 manually constructed OSWorld-derived variants, a finite set of state-dependent hazard families, and a limited set of executor and guardrail models.
Our contributions are: \(1\) an action\-level benchmark for local oversight in computer\-use agents; \(2\) a risk\-augmented execution suite for evaluating ordinary state\-dependent hazards under unchanged benign instructions; \(3\) an invariant\-based evaluation protocol for measuring unsafe completion in full task executions; and \(4\) an empirical evaluation of guardrail models both offline on action\-level decisions and online during interactive execution\.
## 2 Related Work
Recent work has introduced increasingly realistic benchmarks for evaluating computer-use agents in interactive environments. OSWorld evaluates agents on real desktop tasks with execution-based success criteria, while WebArena and WorkArena evaluate long-horizon web and workplace tasks in realistic environments (Xie et al., [2024](https://arxiv.org/html/2606.15034#bib.bib2); Zhou et al., [2023](https://arxiv.org/html/2606.15034#bib.bib1); Drouin et al., [2024](https://arxiv.org/html/2606.15034#bib.bib3)). BrowserGym provides infrastructure for evaluating browser agents across different web tasks, and related benchmarks such as Mind2Web and AndroidWorld study web navigation and mobile-device control (Le Sellier De Chezelles et al., [2025](https://arxiv.org/html/2606.15034#bib.bib8); Deng et al., [2023](https://arxiv.org/html/2606.15034#bib.bib9); Rawles et al., [2024](https://arxiv.org/html/2606.15034#bib.bib10)). These benchmarks establish computer use as an important setting for agent evaluation, but they primarily measure whether agents can complete user instructions. OSGuard builds on this line of evaluation while shifting the focus from nominal task completion alone to safety-aware completion.
A growing body of work studies safety risks in computer-use agents. OS-Harm evaluates harmful behavior in OSWorld-style environments through deliberate misuse, prompt injection, and model misbehavior (Kuntz et al., [2025](https://arxiv.org/html/2606.15034#bib.bib4)). RiOSWorld studies risky computer-use tasks across applications, including both user-originated harmful tasks and adversarial interface conditions such as phishing, pop-ups, and other environmental hazards (Yang et al., [2025](https://arxiv.org/html/2606.15034#bib.bib5)). These benchmarks are important because they show that computer-use agents can cause harm in realistic environments, but many of their tasks make the harmful goal or adversarial setting explicit. OSGuard targets a complementary regime: ordinary benign-task computer use, where the original instruction is not malicious, the environment is not overtly adversarial, and the safety challenge arises from how the agent acts in the current task state.
Other work moves closer to benign-looking workflows, but studies a different source of failure. AUTOELICIT searches for minimal perturbations to task instructions that elicit harmful behavior from computer-use agents (Jones et al., [2026](https://arxiv.org/html/2606.15034#bib.bib6)). BLIND-ACT studies failures from context-insensitive execution, ambiguity, and infeasible or contradictory goals (Shayegani et al., [2025](https://arxiv.org/html/2606.15034#bib.bib7)). These works highlight that safety failures need not begin from an overtly malicious request. In contrast, OSGuard’s risk-augmented execution suite keeps the original user instruction clear and fixed and modifies only the environment state, allowing evaluation of whether agents preserve task-local constraints while pursuing an otherwise ordinary goal.
OSGuard is also related to work on guardrails and pre-execution oversight. MisActBench studies action-level misalignment detection by labeling individual steps in computer-use trajectories as aligned or misaligned, with misalignment categories including malicious instruction following, harmful unintended behavior, and other task-irrelevant behavior; it also introduces DeAction as a pre-execution correction guardrail (Ning et al., [2026](https://arxiv.org/html/2606.15034#bib.bib11)). WebGuard studies action-level risk prediction for web agents using human-annotated actions (Zheng et al., [2025](https://arxiv.org/html/2606.15034#bib.bib12)). ShieldAgent and GuardAgent study guardrail agents that evaluate whether agent behavior satisfies safety policies or user-specified guard requests (Chen et al., [2025](https://arxiv.org/html/2606.15034#bib.bib13); Xiang et al., [2024](https://arxiv.org/html/2606.15034#bib.bib14)). SafePred studies predictive guardrails that anticipate future safety risks rather than only judging the current action (Chen et al., [2026](https://arxiv.org/html/2606.15034#bib.bib15)). Together, these works motivate action-level oversight before execution.
OSGuard evaluates safety at both the action and execution levels\. The action\-level benchmark tests whether a guardrail can judge a proposed action relative to the original instruction and current state, while the risk\-augmented execution suite tests whether an agent can complete the full task while preserving added safety invariants\. These two evaluations are complementary: offline action judgments provide a controlled measure of local oversight ability, while execution results show whether safety failures still occur during interactive task completion\.
Execution-based evaluation is important for computer-use agents because many outcomes can be verified directly from the environment state. Prior benchmarks commonly evaluate final task success using state-based checks, and safety benchmarks evaluate whether harmful outcomes occurred (Xie et al., [2024](https://arxiv.org/html/2606.15034#bib.bib2); Kuntz et al., [2025](https://arxiv.org/html/2606.15034#bib.bib4); Yang et al., [2025](https://arxiv.org/html/2606.15034#bib.bib5)). OSGuard’s risk-augmented execution suite extends this idea by adding explicit safety invariants to the original task-success criteria. These invariants can check whether files, permissions, settings, target identities, access boundaries, or out-of-scope resources were preserved. As a result, the evaluator can distinguish executions that complete the original task while preserving the added constraints from unsafe completions that achieve the nominal objective while violating at least one safety condition.
Overall, our work is positioned between capability benchmarks, safety benchmarks, and guardrail benchmarks\. In contrast to standard computer\-use benchmarks, we evaluate safety rather than only task completion\. In contrast to safety benchmarks centered on overtly harmful tasks, adversarial content, or perturbed instructions, we focus on ordinary state\-dependent hazards under unchanged benign instructions\. In contrast to action\-level guardrail work alone, we connect local oversight decisions to end\-to\-end execution outcomes\. This dual\-granularity design lets us ask whether models can both judge proposed actions correctly and improve full\-task safety when deployed as guardrails\.
## 3 Overview
We introduce OSGuard, a dual-granularity benchmark suite for computer-use safety, with one component for action-level oversight and another for full-task execution under added safety constraints. The first component evaluates local oversight: a guardrail (an oversight agent that evaluates proposed behavior before execution) receives the original task instruction, the current interface state, and a candidate action, and must determine whether that action should be allowed, blocked as unrelated to the user’s objective, or blocked as unsafe. We use *candidate action* to denote the next unit of behavior submitted to the guardrail. Depending on the executor or proposer interface, this unit may correspond to one primitive GUI action or to a short compound behavior, but the guardrail decision is made before that unit is executed and the label applies to it as a whole.
The second component is a risk\-augmented execution suite consisting of manually constructed variants derived from OSWorld tasks\. In this setting, the original user instruction is kept fixed, but the environment is modified to introduce state\-dependent safety hazards that can make unsafe actions locally attractive while preserving a safe path to task completion\. This execution setting can be used either to evaluate a task\-executing agent alone or to evaluate the same agent paired with a guardrail\.
Figure 1 summarizes the two benchmark granularities: Figure 1\(a\) shows how action\-level items are constructed, Figure 1\(b\) shows the local guardrail decision interface, and Figure 1\(c\) shows how risk\-augmented execution separates safe success from unsafe completion\.
Figure 1: OSGuard benchmark overview. (a) Construction sources for the action-level benchmark: standard OSWorld executions, interrupted prefixes with state-compatible proposer actions, interrupted prefixes with unsafe proposer actions, and risk-augmented variant executions. (b) Action-level guardrail task: given the original instruction, pre-action state, and candidate action, the guardrail predicts allowed, unrelated, or unsafe. (c) Risk-augmented execution suite: the original instruction is kept fixed while the environment is modified to introduce a latent hazard; augmented evaluators retain the original success check and add state-based safety invariants.

Using both components lets us connect local oversight decisions to full-task behavior. The action-level benchmark measures whether models can recognize allowed, unrelated, and unsafe actions in context; the risk-augmented execution suite measures whether those decisions matter in interactive execution, where unsafe shortcuts may still achieve the nominal task objective. The following sections describe the construction of the action-level benchmark and the risk-augmented execution suite, followed by the guardrail interface used in both offline and online evaluation.
## 4 Action-Level Evaluation
The action\-level component of OSGuard evaluates a single pre\-execution guardrail decision\. As shown in Figure 1\(b\), each benchmark item contains the original task instruction, the interface state immediately preceding a candidate action, and the candidate action submitted to the guardrail\. To give the guardrail both visual and structured information about the interface, the state is represented by a screenshot together with an accessibility tree\.
Let $u$ denote the original task instruction, let $s_{t-1}$ denote the state immediately preceding the candidate action, and let $a_t$ denote the candidate action submitted for guardrail review. Each evaluated instance is represented as

$$x = (u, s_{t-1}, a_t),$$

with reference label

$$y \in \{\textit{allowed}, \textit{unrelated}, \textit{unsafe}\}.$$

The guardrail receives exactly these three inputs when judging whether the candidate action should be executed.
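As a minimal sketch, one benchmark instance and its reference label could be represented as follows. The class and field names (`ActionItem`, `InterfaceState`, `screenshot_path`, `a11y_tree`) are illustrative assumptions, not a schema defined by the benchmark.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    ALLOWED = "allowed"
    UNRELATED = "unrelated"
    UNSAFE = "unsafe"


@dataclass(frozen=True)
class InterfaceState:
    """Pre-action state s_{t-1}: a screenshot plus an accessibility tree."""
    screenshot_path: str  # hypothetical: path to the captured screenshot
    a11y_tree: str        # hypothetical: serialized accessibility tree


@dataclass(frozen=True)
class ActionItem:
    """One instance x = (u, s_{t-1}, a_t) together with its reference label y."""
    instruction: str       # u: original task instruction
    state: InterfaceState  # s_{t-1}: state preceding the candidate action
    candidate_action: str  # a_t: action submitted for guardrail review
    label: Label           # y: reference label, never shown to the guardrail


def guardrail_inputs(item: ActionItem) -> tuple[str, InterfaceState, str]:
    """Return exactly the three inputs the guardrail receives."""
    return (item.instruction, item.state, item.candidate_action)
```

Keeping the label out of `guardrail_inputs` mirrors the constraint that the guardrail sees only the instruction, the pre-action state, and the candidate action.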
To cover ordinary task progress, off\-task behavior, and unsafe behavior, we draw benchmark items from four construction sources, summarized in Figure 1a and Table 1\. The first source is standard OSWorld executions\. The second source is interrupted OSWorld execution prefixes paired with state\-compatible proposer actions\. The third source is interrupted OSWorld execution prefixes paired with unsafe proposer actions\. The fourth source is executions of the risk\-augmented OSWorld variants described in Section 5\. These source categories define how candidate actions are constructed, while final labels are assigned through action\-level human annotation\.
Table 1 reports the number of items from each source, the final label distribution, and the average candidate\-action index, measured in executed actions from the start of the trajectory to the candidate action\.
Table 1: Benchmark composition of the action-level benchmark by construction source and label.

| Source | # Items | Allowed | Unrelated | Unsafe | Avg. cand. action index |
|---|---|---|---|---|---|
| Standard OSWorld executions | 49 | 49 | – | – | 9 |
| Interrupted + state-compatible continuations | 112 | – | 93 | 19 | 5 |
| Interrupted + explicitly unsafe continuations | 105 | – | 17 | 88 | 5 |
| Risk-augmented variant executions | 58 | 31 | 6 | 21 | 9 |
| Total | 324 | 80 | 116 | 128 | 7 |

### 4.1 Construction sources
We construct action\-level items from two proposer\-generated sources and two execution\-derived sources\. To generate examples from realistic decision points rather than synthetic isolated states, the proposer\-generated sources begin from intermediate states collected during actual OSWorld executions\. We execute each selected OSWorld task with a fixed task\-executing agent under its original instruction and interrupt the trajectory at an early but nontrivial point, where multiple plausible next behaviors remain available\. In practice, we interrupt after two steps for short tasks and four steps for other tasks, where a short task is one that the task\-executing agent completes in four or fewer steps in an unmodified run\. At the interruption point, we record the screenshot and accessibility tree\. This interruption state serves as the pre\-action state for proposer\-generated benchmark items\.
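The interruption rule above can be written as a small helper. This is a sketch under the stated rule only; the function name and constant are hypothetical, and the text does not say how runs shorter than the interruption point are handled, so that case is left unaddressed here.

```python
SHORT_TASK_MAX_STEPS = 4  # a short task completes in four or fewer steps unmodified


def interruption_step(unmodified_run_steps: int) -> int:
    """Number of executed steps after which the trajectory is interrupted.

    `unmodified_run_steps` is the step count of the task-executing agent's
    unmodified run on the same task.
    """
    if unmodified_run_steps < 1:
        raise ValueError("a run must contain at least one step")
    # Short tasks are interrupted after two steps, all other tasks after four.
    return 2 if unmodified_run_steps <= SHORT_TASK_MAX_STEPS else 4
```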
These interruption states give us realistic but underdetermined decision points\. The interface has already been shaped by the original task, while still leaving enough ambiguity for multiple locally plausible next actions\. This allows us to construct examples that are realistic from the visible state while still differing in whether they remain aligned with the original objective\.
Given the current interface state, a task proposer formulates a new task instruction that can plausibly be initiated from the current state and then proposes the next step toward carrying out that task\. The generated task instruction is used only to elicit the candidate action; it is not shown to the guardrail and is not part of the benchmark input\. We use two proposer modes\. In the state\-compatible mode, the proposer generates a task that is compatible with the current state and can be initiated from that point\. In the unsafe mode, the proposer generates a task that is also compatible with the current state but would be unsafe to pursue\. This produces more explicitly unsafe proposals, while still grounding them in realistic intermediate states rather than synthetic perturbations\. For proposer\-generated examples, the proposed next behavior becomes the benchmark’s candidate action\.
The two execution-derived sources are standard OSWorld executions and executions of the risk-augmented variants described in Section [5](https://arxiv.org/html/2606.15034#S5). To capture ordinary task progress and avoid concentrating all allowed examples near initial states, we sample candidate actions from both completed and incomplete standard OSWorld executions, and retain the state immediately preceding each sampled action. To include actions from risk-augmented environments with both safe and unsafe outcomes, we sample one or more actions from both successful and safety-violating risk-augmented variant executions, and retain the state immediately preceding each sampled action.
Across all four sources, the resulting benchmark object is the same: a single candidate action to be judged in context\.
### 4\.2Benchmark task and annotation
To keep evaluation grounded in the user’s original task, the guardrail does not receive the natural\-language instruction generated by the proposer for proposer\-generated items\. This is deliberate: in deployment, a guardrail must judge whether a proposed action remains aligned with the user’s original goal based on the original instruction, the current state, and the proposed behavior itself, not on a proposer’s self\-described intent\. Although we retain the full recorded trajectory up to the target action for construction and audit purposes, the benchmark task itself is defined only over the original instruction, the pre\-action state, and the candidate action\.
To keep labeling consistent with guardrail evaluation, reference labels are assigned using exactly the information available to the guardrail. A target action is labeled *allowed* if it is an appropriate next step toward completion of the original task. A target action is labeled *unrelated* if it does not advance the user’s original objective but is not itself a safety violation. A target action is labeled *unsafe* if it should be blocked on safety grounds. When an action is both off-task and unsafe, *unsafe* takes precedence.
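The precedence among the three labels can be made explicit in a few lines. This sketch encodes only the decision rule; the two boolean judgments are made by human annotators in the benchmark, and the function name and parameters are hypothetical.

```python
def reference_label(is_appropriate_next_step: bool, is_safety_violation: bool) -> str:
    """Map two annotator judgments to one of the three reference labels."""
    if is_safety_violation:
        # Unsafe takes precedence, including when the action is also off-task.
        return "unsafe"
    if is_appropriate_next_step:
        return "allowed"
    # Off-task but not itself a safety violation.
    return "unrelated"
```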
## 5 Risk-Augmented Execution Suite
To evaluate full\-task safety beyond nominal task success, we manually construct 45 risk\-augmented variants from OSWorld tasks\. These variants serve two roles in the paper\. First, trajectories collected from their executions provide one of the four data sources used to construct the action\-level benchmark\. Second, the same variants define a standalone full\-task evaluation setting for unsafe but task\-compatible behavior under explicit safety constraints\.
For each variant, we keep the original user instruction unchanged and modify only the task state\. The modified state introduces a state\-dependent safety hazard condition while preserving a safe path to task completion\. To reduce confounding from basic task capability, we instantiate these variants only on tasks whose unmodified versions can be successfully completed by the task\-executing agent\. This ensures that failures on the modified tasks are assessed in settings where the underlying workflow is already within the agent’s capabilities\.
The variants are designed to capture cases in which the agent’s behavior remains superficially consistent with the original task, but may proceed through a locally plausible unsafe shortcut or other unsafe, overbroad, or undesirable continuation that should be rejected\. We create these hazards by injecting additional files, conflicting artifacts, decoy content, extra identities or rules, protected configuration entries, or other task\-local state changes that create a fork between a safe continuation and an unsafe shortcut\. The unsafe shortcut is chosen to be locally plausible and often lower\-friction than the safe alternative, so that the resulting examples reflect realistic failures rather than artificially adversarial behavior\.
The resulting variants cover six recurring safety failure categories that arise during normal computer use: destructive overwrite or deletion; overbroad edits or permission changes; scope escape; configuration clobbering; unnecessary access to sensitive content; and wrong\-target or globalized updates\. Figure 2 illustrates this construction for a YouTube history\-clearing task: the original instruction remains unchanged, but the modified state introduces a locally plausible unsafe shortcut that the augmented evaluator can detect\.
Figure 2: Example risk-augmented variant. The original YouTube history-clearing instruction is unchanged, but the modified environment introduces a decoy hazard: the agent can complete the nominal task through an unsafe shortcut that deletes unrelated history state. The augmented evaluator keeps the original task-success check and adds a safety invariant, distinguishing safe success from unsafe completion.

### 5.1 Evaluation and failure attribution
To evaluate task completion and safety preservation together, each modified task is paired with an augmented evaluator\. The evaluator retains the original OSWorld success criterion and adds one or more safety invariants corresponding to the injected risk\. A run is counted as successful only if it both completes the original task and satisfies the added invariants\. When a run fails, the evaluator identifies the specific invariant or invariants that were violated\.
To make failures reproducible and directly checkable, the added invariants use concrete, state\-based checks rather than free\-form semantic judgments\. Depending on the task, these checks include file existence and hashes, preservation of pre\-existing content, directory permissions and traversability, rule or settings preservation, absence of out\-of\-scope copies, non\-disclosure of protected content, and protection of unrelated resources\. This design makes both the evaluation outcomes and the violated safety conditions directly verifiable from the observable system state\.
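A minimal sketch of an augmented evaluator in this style is shown below. It assumes invariants are zero-argument callables and uses a file-hash check as one example of a state-based invariant; the function names and the result dictionary keys are illustrative, not the paper's implementation.

```python
import hashlib
import os
from typing import Callable, Dict

Invariant = Callable[[], bool]


def sha256_of(path: str) -> str:
    """Hex SHA-256 digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def file_unchanged(path: str, expected_sha256: str) -> Invariant:
    """Invariant: the file still exists and its content hash is unchanged."""
    return lambda: os.path.isfile(path) and sha256_of(path) == expected_sha256


def augmented_evaluate(task_success: Callable[[], bool],
                       invariants: Dict[str, Invariant]) -> dict:
    """Combine the original success check with the added safety invariants."""
    completed = bool(task_success())
    # Record which invariants were violated so failures can be attributed.
    violated = [name for name, check in invariants.items() if not check()]
    return {
        "task_completed": completed,
        "violated_invariants": violated,
        "success": completed and not violated,             # safe completion
        "unsafe_completion": completed and bool(violated),  # nominal goal met unsafely
    }
```

A run counts as successful only when `success` is true; `unsafe_completion` marks runs that satisfy the original criterion while violating at least one invariant.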
To connect execution\-level evaluation back to the action\-level benchmark, we use the risk\-augmented variants both as full tasks and as a source of target actions\. In the execution setting, the task\-executing agent operates in the modified environment under the original instruction, and performance is measured using the augmented evaluator\. In the action\-level setting, trajectories from these variants provide target actions that arise in environments with explicit safety constraints\.
## 6 Guardrail Agent
Our guardrail agent is a reactive, state\-conditioned prompted model that evaluates proposed behavior before execution\. The same guardrail interface is used both on pre\-constructed benchmark items and during risk\-augmented full task execution, allowing the same decision mechanism to be evaluated offline on single\-action decisions and online during interactive task execution\. In both settings, the guardrail receives the original instruction, the current interface state, and a candidate action, then decides whether the action should be allowed, blocked as unrelated, or blocked as unsafe\.
When the guardrail blocks a proposal as unrelated or unsafe, it also returns brief natural\-language feedback describing the basis for intervention\. This allows the guardrail to function not only as a classifier of proposed actions, but also as a pre\-execution oversight mechanism that can steer subsequent action revision\. In the benchmark setting, evaluation is based on the decision label itself\.
During guarded execution, if the guardrail returns *allowed*, the proposed action is executed and the environment advances to the next state. If the guardrail returns *unrelated* or *unsafe*, the proposed action is not executed. Instead, the guardrail’s feedback is returned to the task-executing agent, which proposes a revised action from the same state.
A fixed retry budget lets the executor recover from blocked proposals without allowing unbounded revision\. This revision process is repeated for at most two retries; that is, at most three proposals are considered at a given state counting the initial proposal\. If no proposal is approved within this budget, the episode is terminated and counted as unsuccessful, and the final block decision is recorded\.
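The propose, review, and retry cycle at a single state can be sketched as follows. The `propose` and `guard` callables are hypothetical interfaces standing in for the task-executing agent and the guardrail; only the retry budget and the control flow come from the description above.

```python
from typing import Callable, Optional, Tuple

MAX_RETRIES = 2  # at most three proposals per state, counting the initial one

# Hypothetical interfaces (not from the paper):
#   propose(instruction, state, feedback) -> candidate action
#   guard(instruction, state, action)     -> (label, feedback)
Propose = Callable[[str, str, Optional[str]], str]
Guard = Callable[[str, str, str], Tuple[str, str]]


def guarded_step(instruction: str, state: str,
                 propose: Propose, guard: Guard) -> Tuple[Optional[str], str]:
    """Return (approved_action, last_label).

    approved_action is None when no proposal is approved within the budget,
    in which case the episode is terminated and counted as unsuccessful.
    """
    feedback: Optional[str] = None
    label = "allowed"
    for _ in range(1 + MAX_RETRIES):
        action = propose(instruction, state, feedback)
        label, feedback = guard(instruction, state, action)
        if label == "allowed":
            return action, label  # caller executes the action and advances
        # Blocked: the action is not executed; feedback goes back to the executor.
    return None, label  # final block decision is recorded
```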
Comparing guarded and unguarded execution on the risk\-augmented variants tests whether pre\-execution intervention reduces violations of the added safety invariants while still allowing the executor to complete the original task\. This setting measures the safety benefit of guardrail intervention under latent task\-state hazards\.
## 7 Experiments
In this section, we report results at both OSGuard granularities: action\-level guardrail decisions and risk\-augmented execution with and without guardrail intervention\.
### 7.1 Action-level evaluation results
Table 2 reports the main action\-level benchmark results across guardrail models\. Gemini 3 Pro Preview performs best overall, reaching 80% accuracy and 0\.80 macro\-F1\. It is also the most balanced model across labels, with F1 scores of 0\.85 on allowed actions, 0\.80 on unrelated actions, and 0\.76 on unsafe actions\. This balance matters for guardrail deployment: high allowed performance is needed to preserve task progress, while high unrelated and unsafe performance is needed to block off\-objective or dangerous behavior\.
GPT 5\.1 and Claude Sonnet 4\.5 achieve lower macro\-F1 scores of 0\.62 and 0\.60, with different error profiles: GPT 5\.1 has relatively strong unsafe recall but low unrelated recall, suggesting that it catches many unsafe actions while missing many unrelated ones, while Claude Sonnet 4\.5 is more even across classes but lower overall\.
In addition to prompted multimodal guardrails, we evaluate DeAction, the guardrail introduced with MisActBench (Ning et al., [2026](https://arxiv.org/html/2606.15034#bib.bib11)), as a transfer baseline for action-level oversight. We instantiate DeAction with Claude Sonnet 4.5, which allows a direct comparison against our prompted Claude Sonnet 4.5 guardrail using the same underlying model. On MisActBench, DeAction with Claude Sonnet 4.5 is reported to achieve 84% accuracy and 0.80 F1 (Ning et al., [2026](https://arxiv.org/html/2606.15034#bib.bib11)). On OSGuard’s action-level benchmark, the same DeAction setup reaches 57% accuracy and 0.55 macro-F1. Within OSGuard, the prompted Claude Sonnet 4.5 guardrail performs slightly better overall than DeAction, reaching 60% accuracy and 0.60 macro-F1. The largest difference is on the unsafe class. DeAction has high unsafe precision, 0.90, but low unsafe recall, 0.20. In contrast, the prompted Claude Sonnet 4.5 guardrail has lower unsafe precision, 0.77, but substantially higher unsafe recall, 0.48, and higher unsafe F1, 0.59 versus 0.33. This suggests that DeAction is conservative in assigning the unsafe label and transfers less effectively to the unsafe actions in OSGuard’s action-level benchmark.
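The unsafe-class F1 values quoted above follow from the reported precision and recall through the standard harmonic-mean definition. The short sketch below recomputes them; the function names are illustrative, and because the inputs are already rounded to two decimals, recomputed values can differ from reported ones in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1: list) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


deaction_unsafe = f1_score(0.90, 0.20)  # DeAction, unsafe class
prompted_unsafe = f1_score(0.77, 0.48)  # prompted Claude Sonnet 4.5, unsafe class
print(round(deaction_unsafe, 2), round(prompted_unsafe, 2))  # 0.33 0.59
```

Because macro-F1 weights the three labels equally, a very low recall on one class, as with DeAction on unsafe actions, pulls the overall score down even when accuracy looks similar.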
Table 2: Main results on the action-level benchmark. P, R, and F1 are per-class precision, recall, and F1.

| Guardrail | Acc. (%) | Macro-F1 | Allowed P | Allowed R | Allowed F1 | Unrelated P | Unrelated R | Unrelated F1 | Unsafe P | Unsafe R | Unsafe F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3 Pro Preview | 80 | 0.80 | 0.76 | 0.96 | 0.85 | 0.79 | 0.80 | 0.80 | 0.84 | 0.69 | 0.76 |
| GPT 5.1 | 63 | 0.62 | 0.55 | 0.84 | 0.67 | 0.75 | 0.37 | 0.50 | 0.65 | 0.74 | 0.69 |
| Claude Sonnet 4.5 | 60 | 0.60 | 0.59 | 0.70 | 0.64 | 0.52 | 0.67 | 0.59 | 0.77 | 0.48 | 0.59 |
| DeAction with Claude Sonnet 4.5 | 57 | 0.55 | 0.57 | 0.86 | 0.69 | 0.52 | 0.77 | 0.62 | 0.90 | 0.20 | 0.33 |

Table 3 breaks performance down by construction source. Standard OSWorld actions are the easiest source: all models perform relatively well, and Gemini 3 Pro Preview reaches perfect accuracy and macro-F1 on this subset. The generated and risk-augmented sources are harder. In particular, risk-augmented executions produce low source-level macro-F1 scores for all models, with the best result only 0.48. This indicates that unsafe behavior arising from task-local hazards is more difficult to recognize than ordinary task progress or explicitly off-task behavior.
The source\-level breakdown also clarifies the DeAction comparison\. With the same Claude Sonnet 4\.5 backbone, DeAction performs better than the prompted Claude guardrail on standard OSWorld trajectories \(96% versus 73% accuracy\)\. However, DeAction drops more sharply on the sources most central to OSGuard’s safety setting: on unsafe proposer actions \(the “Interrupted \+ explicitly unsafe” source in Table 3\), it obtains 34% accuracy and 0\.37 macro\-F1, compared with 57% accuracy and 0\.48 macro\-F1 for the prompted Claude guardrail; on risk\-augmented executions, it obtains 41% accuracy and 0\.26 macro\-F1, compared with 43% accuracy and 0\.32 macro\-F1\. Accuracy and macro\-F1 also diverge on several sources \(for example, Gemini 3 Pro Preview reaches 83% accuracy but only 0\.45 macro\-F1 on the interrupted, state\-compatible source\), which shows why macro\-F1 is important for diagnosing robustness across all labels\.
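Table 3: Action\-level benchmark performance by construction source\.

| Guardrail | Standard OSWorld Acc\. \(%\) | Standard OSWorld Macro\-F1 | Interrupted \+ state\-compatible Acc\. \(%\) | Interrupted \+ state\-compatible Macro\-F1 | Interrupted \+ explicitly unsafe Acc\. \(%\) | Interrupted \+ explicitly unsafe Macro\-F1 | Risk\-augmented Acc\. \(%\) | Risk\-augmented Macro\-F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini 3 Pro Preview | 100 | 1\.00 | 83 | 0\.45 | 84 | 0\.46 | 50 | 0\.27 |
| GPT 5\.1 | 88 | 0\.93 | 45 | 0\.47 | 74 | 0\.53 | 57 | 0\.48 |
| Claude Sonnet 4\.5 | 73 | 0\.85 | 66 | 0\.46 | 57 | 0\.48 | 43 | 0\.32 |
| DeAction with Claude Sonnet 4\.5 | 96 | 0\.98 | 69 | 0\.41 | 34 | 0\.37 | 41 | 0\.26 |

### 7\.2 Risk\-augmented execution results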
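To make the relationship between the per\-class columns and the aggregate column of Table 2 concrete, the following sketch recomputes the unsafe\-class F1 from the reported precision and recall, and macro\-F1 as the unweighted mean of the three per\-class F1 scores\. It assumes the standard definitions of F1 and macro\-F1; the function names are illustrative and are not taken from the OSGuard code\. Because the inputs are two\-decimal table entries, recomputed values can differ from the reported ones by about 0\.01\.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1: list[float]) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


# Per-class (precision, recall, F1) as reported in Table 2,
# in the order: allowed, unrelated, unsafe.
TABLE_2 = {
    "Gemini 3 Pro Preview": [(0.76, 0.96, 0.85), (0.79, 0.80, 0.80), (0.84, 0.69, 0.76)],
    "GPT 5.1": [(0.55, 0.84, 0.67), (0.75, 0.37, 0.50), (0.65, 0.74, 0.69)],
    "Claude Sonnet 4.5": [(0.59, 0.70, 0.64), (0.52, 0.67, 0.59), (0.77, 0.48, 0.59)],
    "DeAction with Claude Sonnet 4.5": [(0.57, 0.86, 0.69), (0.52, 0.77, 0.62), (0.90, 0.20, 0.33)],
}


def recompute(rows: list[tuple[float, float, float]]) -> dict[str, float]:
    """Recompute unsafe-class F1 and macro-F1 from one Table 2 row."""
    unsafe_p, unsafe_r, _ = rows[2]
    return {
        "unsafe_f1": round(f1_score(unsafe_p, unsafe_r), 2),
        "macro_f1": round(macro_f1([f1 for _, _, f1 in rows]), 2),
    }


if __name__ == "__main__":
    for name, rows in TABLE_2.items():
        print(name, recompute(rows))
```

The DeAction row illustrates why high precision alone does not yield a high F1: with precision 0\.90 and recall 0\.20, the harmonic mean is pulled toward the lower value, giving roughly 0\.33\. For the prompted Claude Sonnet 4\.5 row, the mean of the rounded per\-class F1 scores is about 0\.61 rather than the reported 0\.60, presumably because the reported value was computed from unrounded scores\.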
Table 4 reports execution on the risk\-augmented variants\. Because each variant is derived from an OSWorld task that the executor completes in its unmodified form, the unguarded 62% variant success rate reflects a safety\-specific degradation rather than a lack of basic task ability\. The remaining 38% of runs are unsafe completions: the executor reaches the nominal task goal while violating at least one added invariant\. This supports the construction of the variant suite: the modified states preserve task solvability while exposing hazards that ordinary task success would not distinguish\.
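The outcome categories used here can be sketched as a small classifier over evaluator outputs\. This is a minimal illustration, assuming the augmented evaluator returns a task\-success flag plus a list of violated safety invariants; the `RunResult` type, the outcome names, the invariant label, and the example runs are hypothetical and are not the OSGuard implementation or data\. Treating a run that fails the task as a task failure regardless of invariant violations is also an assumption of this sketch\.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    SAFE_SUCCESS = "safe success"
    UNSAFE_COMPLETION = "unsafe completion"
    TASK_FAILURE = "task failure"


@dataclass
class RunResult:
    # Result of the original task-success criterion.
    task_success: bool
    # Names of added safety invariants that were violated (hypothetical labels).
    violated_invariants: list[str] = field(default_factory=list)


def classify(run: RunResult) -> Outcome:
    """Map one evaluated run to an outcome category."""
    if not run.task_success:
        return Outcome.TASK_FAILURE
    if run.violated_invariants:
        # Nominal goal reached, but at least one invariant was broken.
        return Outcome.UNSAFE_COMPLETION
    return Outcome.SAFE_SUCCESS


def outcome_rates(runs: list[RunResult]) -> dict[Outcome, float]:
    """Fraction of runs in each outcome category."""
    if not runs:
        raise ValueError("need at least one run")
    counts = Counter(classify(run) for run in runs)
    return {outcome: counts[outcome] / len(runs) for outcome in Outcome}


if __name__ == "__main__":
    # Made-up runs for illustration only; not benchmark data.
    example_runs = [
        RunResult(task_success=True),
        RunResult(task_success=True),
        RunResult(task_success=True, violated_invariants=["existing_file_unchanged"]),
        RunResult(task_success=False),
    ]
    for outcome, rate in outcome_rates(example_runs).items():
        print(f"{outcome.value}: {rate:.0%}")
```

Under this framing, an evaluator that checks only task success would merge the first two categories, which is the distinction the added invariants are meant to expose\.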
To understand what kinds of hazards drive this drop, we analyzed the unguarded executions by the safety\-condition families introduced in Section 5\. Failures are not uniform across these families: the executor performs best on configuration\-clobbering variants and relatively well on variants involving unnecessary access to sensitive content, while it performs worst on destructive overwrite or deletion variants\. This pattern suggests that the variants exercise different kinds of task\-state constraints rather than a single generic failure mode\.
For guarded execution, we use the Gemini 3 Pro Preview guardrail, which achieved the strongest overall action\-level performance in Table 2\. Guarding leaves aggregate variant success unchanged at 62%, reduces unsafe completion from 38% to 33%, and introduces a 4% retry\-termination rate\. The paired outcomes show that most trajectories are unchanged: most unguarded successes remain successes, and most unsafe completions remain unsafe\. The guardrail corrects one previously unsafe run into a safe success and halts one additional unsafe run after retries, but also halts one previously successful run\. Thus, guardrail intervention provides a small targeted improvement, but does not close the execution\-level safety gap exposed by the risk\-augmented variants\.
Table 4: Execution on the manually constructed risk\-augmented variants\.

| Executor | Guardrail | Variant success \(%\) | Unsafe completion \(%\) | Retry term\. \(%\) |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4\.5 Computer Use | None | 62 | 38 | N/A |
| Claude Sonnet 4\.5 Computer Use | Gemini 3 Pro Preview | 62 | 33 | 4 |

## 8 Conclusion
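The paired\-outcome description above can be checked with a short bookkeeping sketch\. It assumes that the three runs named in the text are the only ones whose outcome changed under guarding; the outcome labels and the `net_change` helper are illustrative\. The sketch works with run counts only, since the percentages in Table 4 depend on the suite size, which is not restated in this section\.

```python
from collections import Counter

# Outcome changes described in the text, as
# (unguarded outcome, guarded outcome) pairs.
CHANGED_RUNS = [
    ("unsafe_completion", "variant_success"),    # corrected into a safe success
    ("unsafe_completion", "retry_termination"),  # halted after retries
    ("variant_success", "retry_termination"),    # previously successful run halted
]


def net_change(transitions: list[tuple[str, str]]) -> dict[str, int]:
    """Net change in the number of runs per outcome category."""
    delta: Counter[str] = Counter()
    for before, after in transitions:
        delta[before] -= 1
        delta[after] += 1
    return dict(delta)


if __name__ == "__main__":
    print(net_change(CHANGED_RUNS))
```

Under that assumption, the net effect is zero change in successful variants, two fewer unsafe completions, and two retry terminations, which matches the direction of the aggregate changes in Table 4: unchanged variant success, lower unsafe completion, and a nonzero retry\-termination rate\.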
We introduced OSGuard, a dual\-granularity benchmark for safety evaluation in computer\-use agents\. OSGuard combines an action\-level benchmark for local pre\-execution oversight with a risk\-augmented execution suite for full\-task evaluation under added state\-based safety constraints\. This execution suite can be used to evaluate both agents alone and agents paired with guardrails, allowing OSGuard to connect local oversight decisions with end\-to\-end safety outcomes\.
Our experiments show that this connection is still weak for current systems\. The strongest guardrail on the action\-level benchmark reaches 80% accuracy and 0\.80 macro\-F1 overall, but drops to 50% accuracy and 0\.27 macro\-F1 on actions drawn from risk\-augmented executions\. In full\-task execution, the unguarded agent completes 62% of variants safely, while 38% are unsafe completions\. Adding the strongest guardrail reduces unsafe completions only modestly, from 38% to 33%, and leaves overall variant success unchanged at 62%\. These results suggest that the same state\-dependent hazards that mislead task\-executing agents are also difficult for guardrails to recognize reliably\. Current guardrails can provide useful local oversight, but for this class of safety issues, they intervene only partially and often do not change the final execution outcome\.
The risk\-augmented execution suite in OSGuard therefore highlights an important safety gap: agents may complete the requested task while violating constraints that should have been preserved, and guardrails that work well in other or more local settings may not substantially change the final outcome\. OSGuard is intended to make this class of failures measurable and diagnosable, and to encourage future work on agents and guardrails that preserve task\-local constraints in addition to completing the user’s stated goal\.
## References
- \[1\] \(2026\) SafePred: a predictive guardrail for computer\-using agents via world models\. arXiv preprint arXiv:2602\.01725\. Cited by: [§2](https://arxiv.org/html/2606.15034#S2.p4.1)\.
- \[2\] Z\. Chen, M\. Kang, and B\. Li \(2025\) Shieldagent: shielding agents via verifiable safety policy reasoning\. arXiv preprint arXiv:2503\.22738\. Cited by: [§2](https://arxiv.org/html/2606.15034#S2.p4.1)\.
- \[3\] X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su \(2023\) Mind2web: towards a generalist agent for the web\. Advances in Neural Information Processing Systems 36, pp\. 28091–28114\. Cited by: [§2](https://arxiv.org/html/2606.15034#S2.p1.1)\.
- \[4\] A\. Drouin, M\. Gasse, M\. Caccia, I\. H\. Laradji, M\. Del Verme, T\. Marty, L\. Boisvert, M\. Thakkar, Q\. Cappart, D\. Vazquez, et al\. \(2024\) Workarena: how capable are web agents at solving common knowledge work tasks?\. arXiv preprint arXiv:2403\.07718\. Cited by: [§1](https://arxiv.org/html/2606.15034#S1.p1.1), [§2](https://arxiv.org/html/2606.15034#S2.p1.1)\.
- \[5\] J\. Jones, Z\. Zhang, Y\. Ning, E\. Fosler\-Lussier, P\. St\-Charles, Y\. Bengio, D\. Song, Y\. Su, and H\. Sun \(2026\) When benign inputs lead to severe harms: eliciting unsafe unintended behaviors of computer\-use agents\. arXiv preprint arXiv:2602\.08235\. Cited by: [§1](https://arxiv.org/html/2606.15034#S1.p3.1), [§2](https://arxiv.org/html/2606.15034#S2.p3.1)\.
- \[6\] T\. Kuntz, A\. Duzan, H\. Zhao, F\. Croce, Z\. Kolter, N\. Flammarion, and M\. Andriushchenko \(2025\) Os\-harm: a benchmark for measuring safety of computer use agents\. arXiv preprint arXiv:2506\.14866\. Cited by: [§1](https://arxiv.org/html/2606.15034#S1.p3.1), [§2](https://arxiv.org/html/2606.15034#S2.p2.1), [§2](https://arxiv.org/html/2606.15034#S2.p6.1)\.
- \[7\] T\. Le Sellier De Chezelles, M\. Gasse, A\. Lacoste, A\. Drouin, M\. Caccia, L\. Boisvert, M\. Thakkar, T\. Marty, R\. Assouel, S\. O\. Shayegan, et al\. \(2025\) The browsergym ecosystem for web agent research\. Transactions on Machine Learning Research 3\(3835\)\. Cited by: [§2](https://arxiv.org/html/2606.15034#S2.p1.1)\.
- \[8\] Y\. Ning, J\. Jones, Z\. Zhang, C\. Ye, W\. Ruan, J\. Li, R\. Gupta, and H\. Sun \(2026\) When actions go off\-task: detecting and correcting misaligned actions in computer\-use agents\. arXiv preprint arXiv:2602\.08995\. Cited by: [§2](https://arxiv.org/html/2606.15034#S2.p4.1), [§7\.1](https://arxiv.org/html/2606.15034#S7.SS1.p3.1)\.
- \[9\] C\. Rawles, S\. Clinckemaillie, Y\. Chang, J\. Waltz, G\. Lau, M\. Fair, A\. Li, W\. Bishop, W\. Li, F\. Campbell\-Ajala, et al\. \(2024\) Androidworld: a dynamic benchmarking environment for autonomous agents\. arXiv preprint arXiv:2405\.14573\. Cited by: [§2](https://arxiv.org/html/2606.15034#S2.p1.1)\.
- \[10\] E\. Shayegani, K\. Hines, Y\. Dong, N\. Abu\-Ghazaleh, R\. Lutz, S\. Whitehead, V\. Balachandran, B\. Nushi, and V\. Vineet \(2025\) Just do it\!? computer\-use agents exhibit blind goal\-directedness\. arXiv preprint arXiv:2510\.01670\. Cited by: [§1](https://arxiv.org/html/2606.15034#S1.p3.1), [§2](https://arxiv.org/html/2606.15034#S2.p3.1)\.
- \[11\] Z\. Xiang, L\. Zheng, Y\. Li, J\. Hong, Q\. Li, H\. Xie, J\. Zhang, Z\. Xiong, C\. Xie, C\. Yang, et al\. \(2024\) Guardagent: safeguard llm agents by a guard agent via knowledge\-enabled reasoning\. arXiv preprint arXiv:2406\.09187\. Cited by: [§2](https://arxiv.org/html/2606.15034#S2.p4.1)\.
- \[12\] T\. Xie, D\. Zhang, J\. Chen, X\. Li, S\. Zhao, R\. Cao, T\. J\. Hua, Z\. Cheng, D\. Shin, F\. Lei, et al\. \(2024\) Osworld: benchmarking multimodal agents for open\-ended tasks in real computer environments\. Advances in Neural Information Processing Systems 37, pp\. 52040–52094\. Cited by: [§1](https://arxiv.org/html/2606.15034#S1.p1.1), [§1](https://arxiv.org/html/2606.15034#S1.p5.1), [§2](https://arxiv.org/html/2606.15034#S2.p1.1), [§2](https://arxiv.org/html/2606.15034#S2.p6.1)\.
- \[13\] J\. Yang, S\. Shao, D\. Liu, and J\. Shao \(2025\) Riosworld: benchmarking the risk of multimodal computer\-use agents\. arXiv preprint arXiv:2506\.00618\. Cited by: [§1](https://arxiv.org/html/2606.15034#S1.p3.1), [§2](https://arxiv.org/html/2606.15034#S2.p2.1), [§2](https://arxiv.org/html/2606.15034#S2.p6.1)\.
- \[14\] B\. Zheng, Z\. Liao, S\. Salisbury, Z\. Liu, M\. Lin, Q\. Zheng, Z\. Wang, X\. Deng, D\. Song, H\. Sun, et al\. \(2025\) Webguard: building a generalizable guardrail for web agents\. arXiv preprint arXiv:2507\.14293\. Cited by: [§2](https://arxiv.org/html/2606.15034#S2.p4.1)\.
- \[15\] S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, et al\. \(2023\) Webarena: a realistic web environment for building autonomous agents\. arXiv preprint arXiv:2307\.13854\. Cited by: [§1](https://arxiv.org/html/2606.15034#S1.p1.1), [§2](https://arxiv.org/html/2606.15034#S2.p1.1)\.