BraveGuard: From Open-World Threats to Safer Computer-Use Agents
Summary
BraveGuard is a self-evolving defense framework that trains guard models using open-world threat signals and realistic agent trajectories to improve safety detection in computer-use agents, achieving significant accuracy gains on the AgentHazard benchmark.
View Cached Full Text
Cached at: 06/04/26, 03:41 AM
Paper page - BraveGuard: From Open-World Threats to Safer Computer-Use Agents
Source: https://huggingface.co/papers/2606.01166 Authors:
,
,
,
,
,
,
,
,
,
,
,
,
,
,
Abstract
BraveGuard is a self-evolving defense framework that trains guard models using open-world threat signals and realistic agent trajectories to improve safety detection in computer-use agents.
Computer-use agentsextend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift createssafety risksthat are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for trainingguard modelsfromopen-world threat signalsand realisticagent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them asexecutable computer-use tasks, collects agent rollouts, and derivestrajectory-level supervisionfor guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding anadaptive defense looprather than a static, benchmark-driven training process. We instantiate BraveGuard by training multipleguard backbones, includingQwen3-GuardandLlama-Guardvariants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improvessafety detectionacross computer-use trajectories. OnAgentHazard, it substantially improves detection accuracy over off-the-shelfguard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses forcomputer-use agentsfacing evolving real-world risks.
View arXiv pageView PDFGitHub27Add to collection
Get this paper in your agent:
hf papers read 2606\.01166
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper1
#### Yunhao-Feng/BraveGuard Text Generation• Updatedabout 21 hours ago • 5
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2606.01166 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2606.01166 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
OSGuard: A Benchmark for Safety in Computer-Use Agents
OSGuard is a dual-granularity benchmark for evaluating safety in computer-use agents under benign user instructions, featuring action-level judgments and risk-augmented execution suites to detect unsafe shortcuts.
OpenGuardrails: An Open-Source Context-Aware AI Guardrails Platform
OpenGuardrails is an open-source platform for AI safety, offering context-aware content-safety and manipulation detection (e.g., prompt injection, jailbreaking) via a unified model, plus a separate NER pipeline for data-leakage identification. It achieves state-of-the-art performance on safety benchmarks and supports private, enterprise-grade deployment.
SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
SafeHarbor is a novel framework for LLM agent safety that uses hierarchical memory and self-evolution to balance safety and utility, achieving state-of-the-art performance on benign and malicious tasks.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World introduces a self-evolving training framework for general agent intelligence that autonomously discovers real-world environments and tasks via the Model Context Protocol, enabling continuous learning. Agent-World-8B and 14B models outperform strong proprietary models across 23 challenging agent benchmarks.
Armorer Guard Learning Loop: local live feedback for AI-agent security
Armorer Guard introduces a Rust-native learning overlay for AI-agent security that enables local live feedback without silent cloud upload or model weight mutation, featuring CLI modes for feedback recording and offline retraining.