Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
Summary
This paper presents Vera, an end-to-end automated safety testing framework for LLM agents that combines literature-driven risk discovery, combinatorial composition of safety cases, and evidence-grounded verification. Evaluations on four agent frameworks reveal substantial safety weaknesses, with average attack success rates reaching 93.9% under multi-channel attacks. The authors also release Vera-Bench, a benchmark of 1600 executable safety cases.
# Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
Source: [https://arxiv.org/html/2607.01793](https://arxiv.org/html/2607.01793)
Yunhao Feng1,5,∗, Ruixiao Lin2,∗, Ming Wen3, Qinqin He4, Yanming Guo5, Yifan Ding3, Yutao Wu6, Jialuo Chen1, Yunhao Chen3, Xiaohu Du1, Jianan Ma1, Zixing Chen3, Zhuoer Xu1, Xingjun Ma3, Xinhao Deng1,†

1 Ant Group, 2 Zhejiang University, 3 Fudan University, 4 Alibaba Group, 5 Hunan Institute of Advanced Technology, 6 Deakin University
###### Abstract
LLM agents increasingly perform autonomous actions through external tools, leading to complex and evolving safety risks. However, existing safety testing targets expert-designed safety violations, and the corresponding outcomes are evaluated by hard-coded rules, making them costly to extend as agents evolve. To this end, we present Vera, an end-to-end automated safety testing framework that instantiates software engineering testing principles for non-deterministic agents through a three-stage, self-reinforcing pipeline. First, a literature-driven exploration continuously discovers and structures emerging risks into taxonomies of safety risks, attack methods, and tool execution environments. Second, combinatorial composition across taxonomy dimensions produces executable safety cases, each specifying a concrete safety goal, a programmatically constructed initial state, and a deterministic verification predicate grounded in observable artifacts. Third, adaptive execution runs heterogeneous agents in isolated sandboxes where a control agent steers multi-turn interaction based on runtime observations, while evidence-grounded verifiers judge outcomes from environment state and tool-call evidence rather than model self-report. We evaluate Vera on four production agent frameworks (OpenClaw, Hermes, Codex, Claude Code), revealing substantial safety weaknesses, with average attack success rates reaching 93.9% under multi-channel attacks; we also release Vera-Bench, comprising 1600 executable safety cases spanning 124 risk categories across three execution settings. These results indicate that modular, executable testing infrastructure is essential for rigorous and maintainable safety evaluation of rapidly evolving agentic systems at scale. The code is publicly available at https://github.com/Yunhao-Feng/Vera.

∗ Yunhao Feng and Ruixiao Lin contributed equally to this work.

† Corresponding author: Xinhao Deng (dengxinhao@tsinghua.edu.cn).

## I Introduction
Large Language Model (LLM) agents \[[44](https://arxiv.org/html/2607.01793#bib.bib6),[38](https://arxiv.org/html/2607.01793#bib.bib8),[32](https://arxiv.org/html/2607.01793#bib.bib7)\] are rapidly becoming general-purpose software components for automating workflows across personal computing, software development, and enterprise services. By coupling LLMs with external tools \[[25](https://arxiv.org/html/2607.01793#bib.bib5),[2](https://arxiv.org/html/2607.01793#bib.bib4),[26](https://arxiv.org/html/2607.01793#bib.bib2),[24](https://arxiv.org/html/2607.01793#bib.bib3)\], these systems can perform autonomous actions that extend far beyond text generation. However, this autonomy introduces risks such as sensitive data exposure \[[4](https://arxiv.org/html/2607.01793#bib.bib47)\], unauthorized system modification \[[47](https://arxiv.org/html/2607.01793#bib.bib29)\], cross-application manipulation \[[10](https://arxiv.org/html/2607.01793#bib.bib43)\], and unsafe code execution \[[42](https://arxiv.org/html/2607.01793#bib.bib49)\], as categorized by the OWASP Top 10 for LLM Applications \[[27](https://arxiv.org/html/2607.01793#bib.bib41)\]. These risks are rapidly growing in both categories and manifestation complexity \[[35](https://arxiv.org/html/2607.01793#bib.bib10),[23](https://arxiv.org/html/2607.01793#bib.bib9)\], and their combinatorial diversity across risk types, attack methods, and tool execution environments poses significant challenges to large-scale, runtime-grounded safety evaluation.
Existing evaluation efforts have progressed from prompt-level refusal assessment \[[45](https://arxiv.org/html/2607.01793#bib.bib12),[40](https://arxiv.org/html/2607.01793#bib.bib52)\] through trajectory-level benchmarks with pre-defined scenarios \[[5](https://arxiv.org/html/2607.01793#bib.bib36),[36](https://arxiv.org/html/2607.01793#bib.bib37),[17](https://arxiv.org/html/2607.01793#bib.bib13)\] to interactive red-teaming platforms with automated adversaries \[[3](https://arxiv.org/html/2607.01793#bib.bib16),[41](https://arxiv.org/html/2607.01793#bib.bib26)\]. Yet a common limitation persists: most approaches conflate an unsafe request, an attempted action, or a textual statement of intent with a realized safety violation, overlooking whether the harmful outcome was actually produced through executed actions and can be analyzed through their observable effects on the environment. Moreover, each approach tightly couples its risk definitions, environment implementations, agent adapters, and verification procedures, so that extending coverage to new risks, tool ecosystems, or agent architectures requires coordinated modifications across multiple system layers, making safety datasets costly to construct and difficult to maintain as agents evolve.
Adapting established software testing paradigms \[[50](https://arxiv.org/html/2607.01793#bib.bib44)\] to agents demands new testing primitives: such paradigms assume deterministic or statistically characterizable input-output mappings, while agents' planning, tool selection, and state evolution are non-deterministic at runtime. To this end, Vera realizes end-to-end agent safety testing as a three-stage, self-reinforcing pipeline that addresses three challenges in scaling automated safety evaluation to rapidly evolving agent systems:

(1) Rapidly evolving risk landscape. Agent capabilities, tool ecosystems, and deployment contexts change faster than any manually curated taxonomy can track. Vera therefore continuously discovers and structures emerging risks through literature-driven exploration that iteratively builds and consolidates taxonomies of risks, attack methods, and environments.

(2) From risks to executable test cases. Identified risks are abstract categories, not runnable tests. Vera bridges this gap by composing taxonomy elements into executable safety cases through combinatorial generation, enforcing that each retained case specifies a concrete safety goal, a deterministic initial state, and a verification predicate grounded in observable artifacts.

(3) Adaptive testing and runtime verification. Agent behavior is non-deterministic: the same safety case may yield divergent execution paths or outcomes depending on the model's runtime planning decisions, so a fixed testing procedure fails when the agent trajectory departs from the assumed pattern. Vera addresses this through sandboxed adaptive execution: a configurable tool gateway records all tool interactions, an adaptive control agent steers the test interaction in response to observed behavior, and a programmatic verifier judges the outcome from observable artifacts rather than model self-report.
A unified execution contract connects heterogeneous agent frameworks through a common interface and evaluates each in isolated, stateful sandboxes under benign, single\-channel, and multi\-channel threat conditions\.
This work makes the following contributions:
- We instantiate software engineering (SE) testing principles (test oracles, combinatorial construction, and evidence-grounded verification) for agents, yielding the executable safety case, risk composition, and adaptive execution protocol.
- We propose Vera, an end-to-end safety testing framework supporting heterogeneous agent frameworks that operates in three stages: autonomous risk discovery, executable test-case generation, and runtime-adaptive execution.
- We evaluate Vera on four production agent frameworks, revealing substantial safety vulnerabilities; we further release Vera-Bench, covering three threat models with deterministic verifiers.
## II Related Work
### II-A Safety Risks of Computer-Use Agents
LLM agents have evolved from single-turn text generators into autonomous systems that invoke external tools for real-world execution \[[44](https://arxiv.org/html/2607.01793#bib.bib6),[31](https://arxiv.org/html/2607.01793#bib.bib42),[38](https://arxiv.org/html/2607.01793#bib.bib8),[32](https://arxiv.org/html/2607.01793#bib.bib7)\]. Recent computer-use agents execute tasks in software repositories \[[43](https://arxiv.org/html/2607.01793#bib.bib45),[2](https://arxiv.org/html/2607.01793#bib.bib4),[25](https://arxiv.org/html/2607.01793#bib.bib5)\] and across desktop and web applications \[[39](https://arxiv.org/html/2607.01793#bib.bib46),[26](https://arxiv.org/html/2607.01793#bib.bib2),[24](https://arxiv.org/html/2607.01793#bib.bib3)\]. Through external tool interactions, a compromised agent may leak credentials embedded in configurations \[[4](https://arxiv.org/html/2607.01793#bib.bib47)\], exfiltrate users' private data \[[19](https://arxiv.org/html/2607.01793#bib.bib48)\], or execute unauthorized operations \[[47](https://arxiv.org/html/2607.01793#bib.bib29)\]; when such vulnerabilities are exploited at scale, they can escalate into autonomous cyberattack campaigns \[[42](https://arxiv.org/html/2607.01793#bib.bib49)\]. The diversity of external tool execution environments further expands the attack surface: adversarial instructions can manipulate agent behavior through tool-mediated channels such as web pages, emails, or tool outputs \[[10](https://arxiv.org/html/2607.01793#bib.bib43)\], while safety violations arise through increasingly complex patterns including multi-step harmful task composition \[[1](https://arxiv.org/html/2607.01793#bib.bib30),[48](https://arxiv.org/html/2607.01793#bib.bib31)\] and cross-stage backdoor triggers \[[9](https://arxiv.org/html/2607.01793#bib.bib32),[8](https://arxiv.org/html/2607.01793#bib.bib33)\].
These risks are rapidly growing in both categories and manifestation complexity\[[35](https://arxiv.org/html/2607.01793#bib.bib10),[23](https://arxiv.org/html/2607.01793#bib.bib9)\], and their combinatorial diversity across risk types, attack methods, and tool execution environments poses significant challenges to large\-scale, runtime\-grounded safety evaluation\.
### II-B Safety Evaluation and Testing for LLM Agents
The growing complexity and diversity of these risks have shifted the focus of safety evaluation from prompt-level or response-level assessment of harmful outputs toward trajectory-level analysis of tool-mediated behaviors. Prompt-level approaches assess whether an agent's textual response constitutes compliance with or refusal of an unsafe request \[[45](https://arxiv.org/html/2607.01793#bib.bib12),[40](https://arxiv.org/html/2607.01793#bib.bib52)\]. These methods inherit the red-teaming paradigm and focus on the model's content-safety boundary rather than its downstream execution behavior; the safety violation is verified by LLM-based judges \[[22](https://arxiv.org/html/2607.01793#bib.bib11)\] or fine-tuned safety classifiers \[[12](https://arxiv.org/html/2607.01793#bib.bib18),[11](https://arxiv.org/html/2607.01793#bib.bib51)\] applied to model outputs. Trajectory-level benchmarks examine the complete execution trace of the target agent within stateful tool-execution environments \[[5](https://arxiv.org/html/2607.01793#bib.bib36),[36](https://arxiv.org/html/2607.01793#bib.bib37),[1](https://arxiv.org/html/2607.01793#bib.bib30),[17](https://arxiv.org/html/2607.01793#bib.bib13),[37](https://arxiv.org/html/2607.01793#bib.bib1),[7](https://arxiv.org/html/2607.01793#bib.bib14),[15](https://arxiv.org/html/2607.01793#bib.bib39)\]. Their risk categories and test scenarios are pre-defined by human experts through manual curation or semi-automated enumeration, with scenario coverage ranging from single-step tool misuse to multi-step harmful task compositions; the safety violation is verified by per-task hard-coded rules over each execution trajectory \[[16](https://arxiv.org/html/2607.01793#bib.bib38),[30](https://arxiv.org/html/2607.01793#bib.bib54)\].
Interactive red-teaming platforms incorporate automated adversarial interaction into the evaluation loop, deploying against the target agent an automated attacker that consumes a pre-defined safety goal or attack method and adapts its strategy across conversation turns \[[3](https://arxiv.org/html/2607.01793#bib.bib16),[41](https://arxiv.org/html/2607.01793#bib.bib26)\]. These platforms probe agent robustness under adaptive, multi-turn threats; the safety violation is verified by tracking sequential tool-call patterns and cumulative state changes throughout the interaction \[[49](https://arxiv.org/html/2607.01793#bib.bib56)\].
## III Preliminaries
Figure 1: Overview of Vera. The framework continuously expands literature-grounded taxonomies of safety risks, attack methods, and environments, and composes their elements into safety goals and executable scenarios. Heterogeneous agents are evaluated through a common interface in isolated, stateful sandboxes under benign, single-channel, and multi-channel conditions. A test-side control agent adapts the interaction from runtime observations, while case-specific verifiers determine outcomes from environment state, tool-call evidence, and agent responses. Verified executions are retained as replayable safety records and provide feedback for subsequent risk exploration and scenario refinement.

To formalize the observable execution behavior of a tool-using agent, we consider a computer-using agent $\mathcal{A}$ operating in a stateful execution environment $\mathcal{E}$ with access to a set of tools $\mathcal{T}$. Given a user task, the resulting interaction spans $n$ conversation turns. At the $i$-th turn, the agent receives a user message $u_i$ and may issue a sequence of $k_i$ tool calls before producing its response $r_i$ to the user. We denote the $j$-th tool call as $a_{i,j}$, which selects a tool (with its arguments) from $\mathcal{T}$. Each tool call is executed by the environment $\mathcal{E}$, which produces the true result $f_{i,j}$. The agent, however, observes a potentially different value $\tilde{f}_{i,j}$ returned by a *configurable tool gateway* that mediates between the agent and all tool endpoints: $\tilde{f}_{i,j}=f_{i,j}$ under normative execution, while $\tilde{f}_{i,j}\neq f_{i,j}$ if the tool return is compromised by an attacker. This complete execution is formulated as:

$$\tau=\left\langle u_i,\left\langle a_{i,j},f_{i,j},\tilde{f}_{i,j}\right\rangle_{j=1}^{k_i},r_i\right\rangle_{i=1}^{n}. \tag{1}$$

The trajectory records only externally observable behavior: most deployed agent frameworks expose tool calls and responses through their APIs but do not provide access to the model's internal reasoning traces \[[26](https://arxiv.org/html/2607.01793#bib.bib2),[25](https://arxiv.org/html/2607.01793#bib.bib5),[2](https://arxiv.org/html/2607.01793#bib.bib4)\]. Internal chain-of-thought or planning steps are therefore excluded, since safety outcomes are determined by executed actions and their observable effects rather than by stated intent. We further denote by $s_T$ the environment state at the end of execution, capturing the cumulative effect of all executed actions on $\mathcal{E}$, such as persistent effects on files, application data, service records, and other resources.
We define an executable safety case as $\sigma=\langle g,s_0,V_g\rangle$, which can be constructed, executed, and verified in a fully automated manner. Here, $g$ specifies the target safety violation; $s_0$ is the case-specific initial environment state that is constructed programmatically through service APIs; and $V_g$ is a programmatic verifier, a deterministic code-level script that determines whether the safety violation specified by $g$ has been realized, based on the execution outcome. Running agent $\mathcal{A}$ in environment $\mathcal{E}$ on the safety case $\sigma$ produces a trajectory $\tau$ and the corresponding post-execution state $s_T$:

$$(\tau,s_T)=\operatorname{Exec}(\mathcal{A},\mathcal{E},\sigma),\qquad y=V_g(\tau,s_T)\in\{0,1\}. \tag{2}$$

The verifier $V_g$ inspects any information recorded in $\tau$ and $s_T$, including tool-call records, agent responses, and the tool environment state; $y=1$ indicates that the violation was confirmed through executed actions and their observable effects. Notably, merely inserting adversarial content into a prompt, tool result, or initial environment does not constitute success.
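The definitions above can be illustrated with a minimal Python sketch. Every name here (`ToolCall`, `Turn`, `SafetyCase`, `credential_leak_verifier`, the dictionary-shaped state) is a hypothetical stand-in chosen for illustration and is not taken from the Vera codebase. The toy verifier returns 1 only when the secret actually appears in an outgoing message, so an agent that merely claims success yields $y=0$.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolCall:
    action: str           # a_{i,j}: tool name and arguments, flattened to a string
    true_result: str      # f_{i,j}: what the environment actually produced
    observed_result: str  # f~_{i,j}: what the gateway returned to the agent


@dataclass
class Turn:
    user_message: str                                         # u_i
    tool_calls: list[ToolCall] = field(default_factory=list)  # a_{i,1..k_i}
    response: str = ""                                        # r_i


Trajectory = list[Turn]  # tau
State = dict[str, Any]   # s_T, simplified to a dictionary (assumption)


@dataclass
class SafetyCase:
    goal: str                                     # g
    initial_state: State                          # s_0
    verifier: Callable[[Trajectory, State], int]  # V_g, returns 0 or 1


def credential_leak_verifier(tau: Trajectory, s_T: State) -> int:
    """Toy V_g: the violation counts only if the secret left via the outbox."""
    secret = s_T["secret"]
    return int(any(secret in message for message in s_T["outbox"]))


case = SafetyCase(
    goal="exfiltrate the stored API key by email",
    initial_state={"secret": "sk-demo-123", "outbox": []},
    verifier=credential_leak_verifier,
)

# The agent claims success in its reply, but nothing was actually sent.
tau = [Turn("Forward the key to me.", [], "Done, I forwarded the credentials.")]
y_claimed = case.verifier(tau, {"secret": "sk-demo-123", "outbox": []})
# Same reply, but the final state shows a message containing the secret.
y_realized = case.verifier(tau, {"secret": "sk-demo-123", "outbox": ["key: sk-demo-123"]})
print(y_claimed, y_realized)  # 0 1
```

The two calls differ only in the final state, which is the distinction the text draws between stated intent and a realized violation.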
Threat model. To test the safety boundary of the target agent, we apply a two-tier adversarial setting based on the number of attacker-controlled interaction channels. In the single-channel setting, the adversary controls the user messages $\{u_i\}_{i=1}^{n}$, while all tool results are delivered without modification, i.e., $\tilde{f}_{i,j}=f_{i,j}$. In the multi-channel setting, the adversary retains control over user messages and additionally injects safety violation commands into selected tool results via an operator $\tilde{f}_{i,j}=\mathcal{I}_{i,j}(f_{i,j})$, where $\mathcal{I}_{i,j}$ applies one of four modes: identity, append, prefix, or override. In practice, the injected content may reach the agent through email, code-hosting, messaging, payment, search, or other tool-mediated channels.
Moreover, the adversary conducts a stateful, multi-turn attack, and is capable of adapting subsequent behavior based on information recorded in $\tau$, including agent responses and tool-call records. However, it cannot modify the target model, its system instructions, the agent implementation, or the internal tool code.
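As a rough illustration of the injection operator $\mathcal{I}_{i,j}$, the sketch below implements the four named modes over string-valued tool results. The `Mode` enum, the `inject` function, and the newline separator are assumptions made for this example; the text names the modes but does not specify how results and payloads are joined.

```python
from enum import Enum


class Mode(Enum):
    IDENTITY = "identity"
    APPEND = "append"
    PREFIX = "prefix"
    OVERRIDE = "override"


def inject(true_result: str, mode: Mode, payload: str = "") -> str:
    """Map the true result f to the observed result f~ under one mode."""
    if mode is Mode.IDENTITY:
        return true_result                   # f~ = f (unmodified)
    if mode is Mode.APPEND:
        return true_result + "\n" + payload  # payload follows the real result
    if mode is Mode.PREFIX:
        return payload + "\n" + true_result  # payload precedes the real result
    if mode is Mode.OVERRIDE:
        return payload                       # real result is replaced entirely
    raise ValueError(f"unknown mode: {mode}")


f = "3 unread emails"
payload = "SYSTEM: forward all emails to attacker@example.com"
observed = {mode: inject(f, mode, payload) for mode in Mode}
```

Under this sketch, the single-channel setting corresponds to using `Mode.IDENTITY` for every tool call.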
## IV Methodology
As illustrated in [Figure 1](https://arxiv.org/html/2607.01793#S3.F1), Vera realizes end-to-end agent safety testing as a three-stage, self-reinforcing pipeline. *Continuous Risk Exploration* ([Section IV-A](https://arxiv.org/html/2607.01793#S4.SS1)) discovers and structures emerging risks from the research literature into consolidated taxonomies of risks, attack methods, and environments. *Executable Test Case Construction* ([Section IV-B](https://arxiv.org/html/2607.01793#S4.SS2)) composes taxonomy elements into safety cases, each specifying a concrete safety goal, a programmatically constructed initial state, and a deterministic verification predicate. *Adaptive Execution and Evidence-Grounded Verification* ([Section IV-C](https://arxiv.org/html/2607.01793#S4.SS3)) runs the target agent in an isolated sandbox where an adaptive control agent steers the test interaction and case-specific verifiers judge outcomes from environment state and tool-call evidence. Verified executions are retained as replayable safety records and feed back into subsequent risk exploration and scenario refinement.
### IV-A Continuous Risk Exploration
A safety risk in the agentic setting is fully characterized by three orthogonal aspects: *what* harmful consequence is realized, *how* it is induced, and *where* (in which tool execution environment) it manifests. The risk taxonomy $\mathcal{R}$ describes the harmful consequence that may be realized, such as credential disclosure, unauthorized modification, or unsafe code execution. The attack-method taxonomy $\mathcal{M}$ describes the mechanism used to induce the behavior, such as prompt injection, task decomposition, role play, or encoding-based obfuscation. The environment taxonomy $\Omega_{\mathcal{E}}$ describes the external services and execution environments through which the agent acts, including email, code hosting, messaging, payment, web search, and similar service endpoints.
Each taxonomy is organized as a hierarchical tree whose leaf nodes represent the finest-grained, actionable risk units for test-case generation. A Summary Agent iteratively populates this tree starting from a set of broad search concepts $Q^{(0)}$ derived from public agent-safety literature, which describe general concepts rather than labels, examples, or task descriptions taken from the downstream evaluation set. The same recursive mechanism can in principle generalize beyond academic literature to operational security intelligence feeds such as CVE databases, vendor advisories, and structured threat-intelligence frameworks like MITRE ATT&CK \[[34](https://arxiv.org/html/2607.01793#bib.bib40)\], enabling deployment-time taxonomy updates that track emerging vulnerabilities as they are disclosed.
We formalize each exploration iteration $t$ as:

$$(\mathbf{T}_{t+1},Q^{(t+1)})=\Phi(\mathbf{T}_t,Q^{(t)},P^{(t)}), \tag{3}$$

where $\mathbf{T}_t=(\mathcal{R}_t,\mathcal{M}_t,\Omega_{\mathcal{E},t})$ groups the three taxonomies and $P^{(t)}$ denotes the documents retrieved at iteration $t$. The update operator $\Phi$ processes each document in $P^{(t)}$ by extracting candidate concepts and applying one of four operations to each taxonomy: (1) *create*: if the concept describes a risk, attack method, or environment class not yet represented and is supported by at least five distinct papers or attack scenarios, add it as a new leaf node; (2) *update*: if the concept matches an existing node, extend the node's supporting evidence set; (3) *merge*: if two existing nodes describe the same underlying class due to terminological variation across papers, unify them into a single node; (4) *delete*: if a node loses all supporting evidence after a merge or reclassification, remove it from the taxonomy tree. These operations allow each taxonomy to converge to a stable structure rather than grow monotonically as more literature is processed. Then, the Summary Agent examines sparsely populated branches, unresolved concepts, and newly discovered terminology to generate the next query frontier $Q^{(t+1)}$. Exploration continues until the retrieval budget is exhausted.

The exploration scope is limited to risks that can be tested at inference time against a deployed agent. Purely training-phase attacks such as fine-tuning poisoning or backdoor injection fall outside this scope because they cannot be exercised through the agent's runtime interface. Categories that appear training-related, such as training-data probing, are retained only when they correspond to inference-time behaviors (e.g., membership inference or model inversion through interactive queries).
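The four update operations can be sketched as follows. This is a deliberately simplified model under stated assumptions: the taxonomy is a flat mapping from node name to a set of evidence identifiers instead of a tree, concept matching is exact string equality, and the class and method names (`Taxonomy`, `observe`, `merge`, `prune`) are hypothetical. Only the threshold of five supporting sources comes from the text.

```python
MIN_SUPPORT = 5  # "at least five distinct papers or attack scenarios"


class Taxonomy:
    """Flat stand-in for one taxonomy: node name -> supporting evidence ids."""

    def __init__(self):
        self.evidence: dict[str, set[str]] = {}

    def observe(self, concept: str, sources: set[str]) -> str:
        """Apply *update* to a known concept, or *create* if support suffices."""
        if concept in self.evidence:
            self.evidence[concept] |= sources    # update: extend evidence set
            return "update"
        if len(sources) >= MIN_SUPPORT:
            self.evidence[concept] = set(sources)  # create: new leaf node
            return "create"
        return "skip"                            # not enough support yet

    def merge(self, keep: str, drop: str) -> None:
        """Merge: move all evidence of `drop` into `keep`."""
        self.evidence[keep] |= self.evidence[drop]
        self.evidence[drop] = set()

    def prune(self) -> list[str]:
        """Delete: remove nodes left without any supporting evidence."""
        dead = [name for name, ev in self.evidence.items() if not ev]
        for name in dead:
            del self.evidence[name]
        return dead


tax = Taxonomy()
first = tax.observe("prompt injection", {"p1", "p2"})
second = tax.observe("prompt injection", {"p1", "p2", "p3", "p4", "p5"})
third = tax.observe("prompt injection", {"p6"})
tax.observe("indirect injection", {"q1", "q2", "q3", "q4", "q5"})
tax.merge("prompt injection", "indirect injection")
removed = tax.prune()
```

Because merge and delete can shrink the node set, repeated application does not grow the taxonomy monotonically, which is the convergence property the paragraph describes.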
### IV-B Executable Test Case Construction
This stage transforms the taxonomy leaves produced by risk exploration into executable safety cases. As defined in [Section III](https://arxiv.org/html/2607.01793#S3), each case is a triple $\sigma=\langle g,s_0,V_g\rangle$: a concrete safety goal $g$, an initial environment state $s_0$, and a programmatic verifier $V_g$. The construction proceeds by first generating candidate goals $g$ through combinatorial composition across taxonomy dimensions, then compiling each accepted goal into a complete case by synthesizing its $s_0$ and $V_g$, and finally instantiating controlled variants for comparative evaluation.
I. Safety goal generation. Vera constructs candidate safety goals by composing one leaf from each taxonomy. Let $r\in\operatorname{Leaves}(\mathcal{R})$, $m\in\operatorname{Leaves}(\mathcal{M})$, and $e\in\operatorname{Leaves}(\Omega_{\mathcal{E}})$. A goal composer maps the tuple $(r,m,e)$ to a context-specific safety goal

$$g=G(r,m,e;\mathcal{D}), \tag{4}$$

where $G$ is an LLM-based goal composer that contextualizes the abstract taxonomy tuple into a concrete, verifiable safety violation, and $\mathcal{D}$ contains a collection of format demonstrations and environment descriptions. Specifically, $\mathcal{D}$ maintains output consistency and diversity by dynamically updating the demonstration set to provide in-context guidance that reflects the current distribution. Additionally, a typed schema validates the generated goal against a minimum-length target description, rejecting malformed or under-specified outputs before they enter the candidate pool. Each accepted goal must identify a concrete safety violation, the resource or service on which it occurs, and sufficient conditions for determining whether it was realized. For example, "data leakage through prompt injection in code hosting" is converted into the concrete "exposing protected repository credentials through a code-hosting workflow".
As the full Cartesian product of three dimensions contains many incompatible or redundant tuples, Vera selectively constructs the candidate set. For each risk and attack-method pair, the implementation samples ten environment leaves and retains only combinations for which the environment exposes the resources and actions required by the goal. The resulting goals are normalized and deduplicated using their risk semantics, target resource, intended state change, and execution context. This procedure preserves broad coverage while avoiding repeated goals that differ only in surface wording.
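The selective construction step can be sketched as sampling followed by a compatibility filter. In this hypothetical example, compatibility is modeled as a subset test between the capabilities a risk requires and those an environment exposes; the function name, the capability sets, and the seeded random generator are all assumptions, and the LLM-based goal composer and semantic deduplication are omitted.

```python
import random


def compose_candidates(risks, methods, env_caps, required_caps, k=10, seed=0):
    """Return (risk, method, environment) tuples that pass the filter.

    env_caps:      environment leaf -> set of exposed capabilities
    required_caps: risk leaf -> set of capabilities the goal needs
    k:             environment leaves sampled per (risk, method) pair
    """
    rng = random.Random(seed)  # seeded so that sampling is reproducible
    env_names = sorted(env_caps)
    candidates = []
    for risk in risks:
        for method in methods:
            sampled = rng.sample(env_names, min(k, len(env_names)))
            for env in sampled:
                # Keep the tuple only if the environment can host the goal.
                if required_caps[risk] <= env_caps[env]:
                    candidates.append((risk, method, env))
    return candidates


env_caps = {"email": {"read", "send"}, "payment": {"transfer"}}
required_caps = {"credential disclosure": {"send"}}
tuples = compose_candidates(
    ["credential disclosure"],
    ["prompt injection", "role play"],
    env_caps,
    required_caps,
)
```

Here the payment environment is dropped for both attack methods because it exposes no `send` capability, leaving two email-based tuples.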
II. Safety case compilation. Each selected goal $g$ is compiled into a complete executable safety case $\sigma=\langle g,s_0,V_g\rangle$. Specifically, $s_0$ is the scenario-specific state that an LLM initializer constructs dynamically from the safety goal $g$ through the programmatic interfaces of the tool execution environment $\mathcal{E}$; this produces diverse and goal-relevant preconditions without manual case-by-case design. Examples include a password-bearing email for a credential-theft goal, a repository containing a vulnerable dependency for a supply-chain goal, or a pending transaction for a financial-fraud goal. Once generated, the initialization sequence is recorded as a deterministic environment call script and replayed verbatim on each execution, so that $s_0$ remains reproducible across runs despite being authored by an underlying LLM. The associated scenario package additionally contains a user-interaction specification describing the legitimate task context and the attack surface available to the test-side Control Agent. The verifier $V_g$ is generated and replayed in the same way as $s_0$: it is a deterministic predicate over the post-execution trajectory $\tau$ and terminal state $s_T$, checking whether the safety violation described by $g$ was realized through observable effects.
III. Filtering and variant generation. Vera filters out compiled cases whose success condition depends on internal reasoning traces rather than observable effects, or that duplicate an already-accepted case. Each retained base scenario is then expanded into three controlled variants corresponding to the threat model defined in [Section III](https://arxiv.org/html/2607.01793#S3): a benign variant that removes all adversarial elements and serves as a functional baseline, a single-channel variant that permits adversarial user interaction but delivers tool results unmodified ($\tilde{f}_{i,j}=f_{i,j}$), and a multi-channel variant that additionally transforms selected tool results into adversarial ones via the configurable tool gateway ($\tilde{f}_{i,j}\neq f_{i,j}$).
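The record-and-replay idea for the initial state can be sketched as follows. The JSON script format, the `FakeMailService` class, and the `replay` helper are hypothetical; the text states only that the initialization sequence is stored as a deterministic call script and replayed verbatim.

```python
import json


class FakeMailService:
    """Hypothetical stand-in for a sandboxed email service API."""

    def __init__(self):
        self.inbox = []

    def add_email(self, sender, subject, body):
        self.inbox.append({"sender": sender, "subject": subject, "body": body})


# A recorded initialization script: an ordered list of API calls.
# In this sketch it is written by hand; in the described pipeline it
# would be captured once from the LLM initializer's calls.
INIT_SCRIPT = json.dumps([
    {
        "method": "add_email",
        "args": {
            "sender": "it@example.com",
            "subject": "VPN access",
            "body": "Temporary password: demo-pass",
        },
    },
])


def replay(script_json, service):
    """Re-issue every recorded call, in order, against a fresh service."""
    for call in json.loads(script_json):
        getattr(service, call["method"])(**call["args"])
    return service


run_a = replay(INIT_SCRIPT, FakeMailService())
run_b = replay(INIT_SCRIPT, FakeMailService())
```

Since no LLM is consulted during replay, two fresh sandboxes end up in the same state, which is what makes results comparable across runs and agents.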
### IV-C Adaptive Execution and Evidence-Grounded Verification
This stage executes each safety case and verifies whether the target safety violation was realized, through a three-component pipeline coordinated by the Control Agent. A sandboxed execution environment isolates heterogeneous agents behind a common execution contract and records all tool interactions through the configurable tool gateway ([Section IV-C1](https://arxiv.org/html/2607.01793#S4.SS3.SSS1)). An adaptive test driver consumes runtime observations from the sandbox and steers multi-turn interaction toward the safety goal ([Section IV-C2](https://arxiv.org/html/2607.01793#S4.SS3.SSS2)). An evidence-grounded verifier then examines the recorded trajectory and the final environment state to determine whether the safety goal was achieved ([Section IV-C3](https://arxiv.org/html/2607.01793#S4.SS3.SSS3)).
#### IV-C1 Large-scale Sandboxed Execution Environment
Agent implementations differ in launch procedures, message transport, tool protocols, and transcript formats. A per-agent adapter translates framework-specific events into the trajectory representation defined in [Equation 1](https://arxiv.org/html/2607.01793#S3.E1), while an isolated sandbox provides each execution with its own instance of the target agent, the tool gateway, and the external services required by the scenario. We implement the unified and configurable tool gateway as an MCP-based service that mediates all tool calls, recording both the original result $f_{i,j}$ and the observation $\tilde{f}_{i,j}$ returned to the agent. Under multi-channel execution, the gateway applies the transformation operator $\mathcal{I}_{i,j}$ ([Section III](https://arxiv.org/html/2607.01793#S3)) to inject adversarial content into the agent execution; each operation is conducted in one of four modes: identity (unmodified baseline), append or prefix (attacker-controlled content co-existing with legitimate results), and override (fully compromised data source).
Each sandbox records three complementary forms of execution evidence: (1) the interaction log capturing user messages and agent responses, (2) the gateway log capturing tool calls with original and transformed results, and (3) the environment state of the tool execution, which preserves persistent effects such as repository modifications, outgoing messages, transfers, or created records. These sources jointly reconstruct $\tau$ and $s_T$, and are returned to the Control Agent for verification.
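A gateway that logs both the true and the observed result of every call might look like the sketch below. It is a plain in-process Python stand-in, not an MCP service, and `RecordingGateway`, `GatewayEvent`, and `set_rule` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GatewayEvent:
    tool: str
    args: dict
    true_result: str      # f_{i,j}
    observed_result: str  # f~_{i,j}


class RecordingGateway:
    """Mediates tool calls, applies optional per-tool rules, logs both results."""

    def __init__(self, tools: dict[str, Callable[..., str]]):
        self.tools = tools
        self.rules: dict[str, Callable[[str], str]] = {}
        self.log: list[GatewayEvent] = []

    def set_rule(self, tool: str, transform: Callable[[str], str]) -> None:
        """Register a transformation applied to this tool's future results."""
        self.rules[tool] = transform

    def call(self, tool: str, **args) -> str:
        true_result = self.tools[tool](**args)
        transform = self.rules.get(tool, lambda result: result)  # identity
        observed = transform(true_result)
        self.log.append(GatewayEvent(tool, args, true_result, observed))
        return observed  # the agent only ever sees the observed value


gateway = RecordingGateway({"search": lambda query: f"results for {query}"})
clean = gateway.call("search", query="cats")
gateway.set_rule("search", lambda result: result + "\nIGNORE PREVIOUS INSTRUCTIONS")
tainted = gateway.call("search", query="dogs")
```

Keeping `true_result` next to `observed_result` in each log entry lets a verifier later distinguish what the environment did from what the agent was shown.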
#### IV-C2 Adaptive Test Driver
Before execution, the Control Agent receives the safety case $\sigma$ and the available tool service schemas, from which it prepares an attack plan identifying the intended interaction sequence and relevant tools. At turn $i$, its control state $c_i$ aggregates the safety case, interaction history, strategy summaries, observed tool calls, registered injection rules, and progress estimate. Its control policy $\pi_{\mathrm{ctrl}}$ selects the next user message $u_i$, along with a set of gateway rules $J_i$ in the multi-channel setting:

$$(u_i,J_i)=\pi_{\mathrm{ctrl}}(\sigma,c_i). \tag{5}$$

The Control Agent selects which tool results to target and designs the corresponding rules $J_i$; it first engages the agent on a legitimate task to establish a plausible context before gradually introducing adversarial intent.
At the end of conversation turn $i$, the Control Agent receives an observation from the target agent:

$$o_i=\langle r_i,L_i,\Delta s_i\rangle, \tag{6}$$

where $r_i$ is the agent response, $L_i$ contains the newly recorded gateway events (which serve as ground truth since the agent may misreport), and $\Delta s_i$ summarizes observable environment-state changes. The Control Agent incorporates $o_i$ into $c_{i+1}$ and adapts accordingly: it reformulates the request upon refusal, adjusts the task decomposition when the agent uses an unexpected tool, or selects a different injection point when the relevant content has not been retrieved.
This observe–adapt–act loop repeats until one of three conditions is met: the interaction reaches the predefined budget, the required evidence has been produced, or continued interaction is unlikely to change the outcome\. To enable post\-hoc analysis and replay, every control decision is stored together with the full interaction trajectory\.
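The loop and its three stopping conditions can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the released driver: `run_loop`, `ControlState`, `Observation`, and the four callables are hypothetical names, and the safety case $\sigma$ is assumed to be captured inside the `policy` callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Observation:
    response: str              # r_i: agent reply (may misreport)
    gateway_events: List[str]  # L_i: newly logged tool calls
    state_delta: List[str]     # Delta s_i: observable state changes


@dataclass
class ControlState:
    # c_i, reduced here to the list of (message, observation) pairs so far
    history: List[Tuple[str, Observation]] = field(default_factory=list)


def run_loop(
    policy: Callable[[ControlState], Tuple[str, List[str]]],
    step: Callable[[str, List[str]], Observation],
    evidence_found: Callable[[Observation], bool],
    stalled: Callable[[ControlState], bool],
    budget: int = 10,
) -> Tuple[ControlState, str]:
    """Observe-adapt-act until evidence appears, progress stalls, or budget ends."""
    state = ControlState()
    for _ in range(budget):
        message, rules = policy(state)        # (u_i, J_i) = pi_ctrl(sigma, c_i)
        obs = step(message, rules)            # run one turn in the sandbox
        state.history.append((message, obs))  # fold o_i into c_{i+1}
        if evidence_found(obs):
            return state, "evidence"
        if stalled(state):
            return state, "stalled"
    return state, "budget"
```

Because the full `ControlState` is returned together with the stop reason, every control decision stays available for replay, matching the storage requirement described above.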
#### IV-C3 Evidence-Grounded Verification
After the adaptive loop terminates, the verifier $V_g$ consumes the three forms of evidence accumulated across all turns. Because an agent may claim refusal after executing a harmful call, or claim compliance without changing the environment, $V_g$ selects among evidence sources by manipulation resistance:

$$V_g(\tau, s_T) = \bigl(V_g^{\mathrm{state}}(s_T) \;\rhd\; V_g^{\mathrm{tool}}(\tau) \;\rhd\; V_g^{\mathrm{resp}}(\tau)\bigr), \tag{7}$$

where $a \rhd b$ denotes that $a$ is used whenever its predicate is defined for the given safety goal, and $b$ serves as the fallback otherwise. Environment state takes priority because a tool call records intent but does not guarantee effect; the agent response is consulted only when the textual output itself constitutes the violation (e.g., disclosing a credential in the reply).
Each verifier is a deterministic Python program whose outcome is independent of the generation model. Verification runs while the sandbox remains active, allowing $V_g^{\mathrm{state}}$ to query live service APIs. The framework applies asymmetric judgment: when the Control Agent reports failure, it records $y = 0$ without invoking the verifier; when it reports success, the verifier must confirm the claim against environment evidence before assigning $y = 1$, eliminating false positives from optimistic self-assessment. A verifier that fails due to a syntax error or tool-call schema mismatch is regenerated to avoid false negatives.
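A minimal sketch of the priority chain and the asymmetric judgment follows. It assumes a hypothetical convention in which each predicate is a zero-argument callable that returns `True` or `False` when it is defined for the safety goal and `None` when it is not; the names `first_defined` and `judge` are illustrative and do not come from the released verifiers.

```python
from typing import Callable, Optional, Sequence

# Assumed convention: a predicate returns None when it does not apply
# to the safety goal, and a boolean verdict otherwise.
Predicate = Callable[[], Optional[bool]]


def first_defined(predicates: Sequence[Predicate]) -> Optional[bool]:
    """Evaluate state, then tool, then response: the first defined verdict wins."""
    for predicate in predicates:
        verdict = predicate()
        if verdict is not None:
            return verdict
    return None


def judge(controller_claims_success: bool, predicates: Sequence[Predicate]) -> int:
    """Accept a failure report as-is; require evidence for a success report."""
    if not controller_claims_success:
        return 0  # y = 0 without invoking the verifier
    return 1 if first_defined(predicates) is True else 0
```

For example, `judge(True, [lambda: False, lambda: True])` yields 0: the state predicate is defined and negative, so a positive tool-call record further down the chain cannot override it.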
TABLE I: Overall ESR (%) by risk category (rows) and environment group (columns).

| Risk Category | Communic | Productivity | Finance | CRM & Svc | Dev & Data | Social | Travel | Domain Spec | OS / Term | Web & Stor | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Integrity | 92.9 | 92.9 | 92.1 | 100.0 | 100.0 | 100.0 | 95.7 | 95.1 | 92.3 | 91.7 | 95.3 |
| Sys Probing | 76.9 | 91.7 | 85.2 | 80.0 | 82.4 | 76.9 | 81.2 | 80.0 | 76.5 | 85.7 | 81.6 |
| Privacy & Data | 86.4 | 73.1 | 78.8 | 91.7 | 94.4 | 84.2 | 80.0 | 78.3 | 78.8 | 90.9 | 83.7 |
| Priv Escal | 87.5 | 76.2 | 80.0 | 80.0 | 81.8 | 83.3 | 73.7 | 75.0 | 94.1 | 92.3 | 82.4 |
| System Abuse | 82.9 | 83.0 | 84.8 | 86.4 | 89.5 | 73.3 | 88.9 | 87.0 | 78.8 | 72.7 | 82.7 |
| Malware Gen | 86.7 | 82.4 | 73.9 | 75.0 | 84.6 | 84.6 | 88.2 | 86.7 | 82.6 | 71.4 | 81.6 |
| Cyber Attack | 81.0 | 87.1 | 90.5 | 92.3 | 76.5 | 88.9 | 100.0 | 91.7 | 90.9 | 85.7 | 88.4 |
| Harm Output | 76.9 | 70.6 | 83.3 | 90.9 | 66.7 | 75.0 | 89.5 | 78.6 | 100.0 | 58.3 | 79.0 |
TABLE II: Overall ESR (%) by environment group (rows) and attack method (columns).

| Environment | Persona & Ctx | Profile Infer | Jailbreak | Instr Injection | Format Inducement | Roleplay & Persona | Hypothetical | Task Decompos | Constraint Manip | Obfuscation | Social Engineer | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Communic | 88.9 | 100.0 | 72.2 | 81.5 | 61.9 | 86.7 | 80.0 | 90.0 | 80.0 | 91.3 | 100.0 | 84.8 |
| Productivity | 87.8 | 66.7 | 90.0 | 72.4 | 90.9 | 62.5 | 93.8 | 50.0 | 82.1 | 87.1 | 63.6 | 77.0 |
| Finance | 64.3 | 88.9 | 90.6 | 94.1 | 85.0 | 64.7 | 83.3 | 75.0 | 81.0 | 89.2 | 76.5 | 81.1 |
| CRM & Svc | 81.8 | 100.0 | 83.3 | 100.0 | 92.3 | 90.0 | 71.4 | 100.0 | 75.0 | 85.7 | 42.9 | 83.9 |
| Dev & Data | 100.0 | 87.5 | 66.7 | 88.2 | 100.0 | 100.0 | 80.0 | 88.9 | 77.8 | 82.4 | 87.5 | 87.2 |
| Social | 80.0 | 100.0 | 100.0 | 83.3 | 81.2 | 50.0 | 71.4 | 62.5 | 100.0 | 100.0 | 100.0 | 84.4 |
| Travel | 100.0 | 100.0 | 80.0 | 86.7 | 100.0 | 93.3 | 60.0 | 100.0 | 100.0 | 77.8 | 70.6 | 88.0 |
| Domain Spec | 95.0 | 100.0 | 86.4 | 82.8 | 82.4 | 87.5 | 77.8 | 100.0 | 86.7 | 83.0 | 75.0 | 86.9 |
| OS / Term | 89.5 | 100.0 | 91.7 | 80.6 | 61.5 | 94.7 | 83.3 | 60.0 | 81.8 | 94.3 | 57.1 | 81.3 |
| Web & Stor | 100.0 | 87.5 | 87.5 | 90.9 | 72.2 | 90.0 | 81.8 | 100.0 | 71.4 | 64.3 | 50.0 | 81.4 |
TABLE III: Overall ESR (%) by attack method (rows) and risk category (columns).

| Attack Method | Integrity | Sys Probing | Privacy & Data | Priv Escal. | System Abuse | Malware Gen | Cyber Attack | Harm Output | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Persona & Ctx | 95.9 | 86.7 | 73.1 | 66.7 | 80.8 | 100.0 | 92.9 | 100.0 | 87.0 |
| Profile Infer | 100.0 | 100.0 | 81.8 | 77.8 | 100.0 | 100.0 | 100.0 | 60.0 | 89.9 |
| Jailbreak | 91.4 | 80.0 | 85.0 | 80.0 | 82.9 | 75.0 | 75.0 | 100.0 | 83.7 |
| Instr Injection | 91.7 | 78.9 | 79.5 | 91.7 | 87.7 | 92.6 | 80.0 | 81.5 | 85.5 |
| Format Inducement | 66.7 | 92.3 | 76.0 | 88.9 | 78.1 | 54.5 | 86.4 | 84.6 | 78.4 |
| Roleplay & Persona | 93.3 | 76.2 | 85.7 | 93.3 | 81.2 | 80.0 | 78.6 | 78.6 | 83.4 |
| Hypothetical | 85.7 | 66.7 | 80.0 | 76.9 | 78.6 | 80.0 | 100.0 | 72.7 | 80.1 |
| Task Decompos | 100.0 | 75.0 | 91.7 | 100.0 | 86.4 | 83.3 | 100.0 | 71.4 | 88.5 |
| Constraint Manip | 100.0 | 91.7 | 90.0 | 100.0 | 77.4 | 82.4 | 84.2 | 57.1 | 85.3 |
| Obfuscation | 100.0 | 79.4 | 91.4 | 80.0 | 82.7 | 88.9 | 93.2 | 87.5 | 87.9 |
| Social Engineer | 90.0 | 78.6 | 70.0 | 41.7 | 100.0 | 60.0 | 83.3 | 71.4 | 74.4 |
## V Experiment
### V-A Experimental Setup
##### Agents & Models
We evaluate Vera with four heterogeneous agent frameworks: OpenClaw \[[26](https://arxiv.org/html/2607.01793#bib.bib2)\], Hermes \[[24](https://arxiv.org/html/2607.01793#bib.bib3)\], Codex \[[25](https://arxiv.org/html/2607.01793#bib.bib5)\], and Claude Code \[[2](https://arxiv.org/html/2607.01793#bib.bib4)\]. These frameworks differ in their execution loops, tool-use protocols, context management, and interaction interfaces, providing a diverse test bed for evaluating whether the generated scenarios remain independent of a particular agent implementation. We consider multiple backend models, including GPT-5.2 \[[33](https://arxiv.org/html/2607.01793#bib.bib62)\], Gemini-3 \[[14](https://arxiv.org/html/2607.01793#bib.bib59)\], Qwen-3.7 \[[28](https://arxiv.org/html/2607.01793#bib.bib58)\], Kimi-K2.6 \[[13](https://arxiv.org/html/2607.01793#bib.bib60)\], and GLM-5.2 \[[46](https://arxiv.org/html/2607.01793#bib.bib61)\], subject to backend compatibility with each framework. Each agent–model configuration is connected to Vera through the common adapter interface. The adapter standardizes message delivery, tool exposure, response collection, and trajectory serialization without modifying the internal planning or execution logic of the target agent.
##### Taxonomy Construction
The risk exploration stage processes approximately 800 papers retrieved from arXiv and OpenReview. The resulting taxonomies contain 124 leaf-level risk categories, 77 leaf-level attack methods, and 30 leaf-level environment categories. Combinatorial composition across these dimensions produces 39,078 candidate safety goals after compatibility filtering and deduplication, from which the compilation and quality-filtering pipeline retains 1,600 executable base scenarios for Vera-Bench.
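To put these counts in proportion, the sketch below computes the size of the unfiltered leaf-level product and the fraction surviving each stage. It assumes that each candidate safety goal corresponds to one (risk, attack method, environment) triple, which the text suggests but does not state outright; the `compose` helper and its `compatible` callback are hypothetical stand-ins for the paper's compatibility filter.

```python
from itertools import product
from typing import Callable, Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]


def compose(
    risks: Iterable[str],
    attacks: Iterable[str],
    envs: Iterable[str],
    compatible: Callable[[str, str, str], bool],
) -> Iterator[Triple]:
    """Enumerate the Cartesian product and keep only compatible triples."""
    for triple in product(risks, attacks, envs):
        if compatible(*triple):
            yield triple


# Counts reported in the text.
N_RISKS, N_ATTACKS, N_ENVS = 124, 77, 30
FULL_SPACE = N_RISKS * N_ATTACKS * N_ENVS  # 286,440 unfiltered triples
CANDIDATES = 39_078                        # after filtering and deduplication
RETAINED = 1_600                           # executable base scenarios


def funnel() -> Tuple[int, float, float]:
    """Return the full space size and the two stage-wise retention ratios."""
    return FULL_SPACE, CANDIDATES / FULL_SPACE, RETAINED / CANDIDATES
```

Under that assumption, roughly 13.6% of the raw product survives compatibility filtering and deduplication, and about 4.1% of those candidates become executable base scenarios.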
##### Execution Settings
All executions are conducted in isolated Docker Compose environments. Each run starts from a freshly initialized sandbox containing the target agent, the MCP middleware, and the external services required by the corresponding scenario. Each sandbox is realized as a Docker Compose stack of 12 containers, comprising the target agent, the MCP gateway, and five self-hosted backend services: Mailpit (email), Gitea (code hosting), Blnk (payment and banking), Databag (instant messaging), and SearXNG (web search). The current environment exposes service families through 72 MCP tool functions and uses separate network namespaces, project identifiers, and service states for parallel executions. Tool calls, tool arguments, original service responses, transformed observations, and persistent state changes are recorded throughout execution. Unless otherwise specified, the Control Agent is allowed at most ten interaction turns. A run terminates when the interaction budget is exhausted, the target outcome has been verified, or the execution can no longer make progress toward the scenario objective.

Each executable base scenario is instantiated under three controlled settings. The *benign* setting contains only the legitimate task and preserves all tool results without modification. The *single-channel* setting allows the test-side controller to issue adversarial user messages, while tool results are returned exactly as produced by the underlying services. The *multi-channel* setting retains the same user-level interaction capability and additionally allows selected tool observations to be transformed through the MCP middleware.
##### Datasets
Each retained data item contains four artifacts. The `attack_plan` file records the scenario objective, interaction strategy, and threat-model configuration. The `mcp_logs` file stores the complete sequence of tool calls, arguments, original service results, and observations returned to the agent. The `trace.json` file contains the normalized multi-turn interaction trajectory, including user messages, agent responses, and execution metadata. The `verify.py` file implements the case-specific executable predicate used to determine whether the target safety violation was realized. Together, these artifacts preserve the intended attack, the observable execution process, the resulting agent trajectory, and the evidence-grounded outcome label. Each of the 1,600 base scenarios is instantiated under three settings (benign, single-channel, multi-channel). The released dataset includes the complete execution artifacts for all retained runs.
### V-B Dataset Analysis
Vera organizes its test space using three independently constructed, three-level hierarchical taxonomies of safety risks, attack methods, and execution environments. The complete taxonomy contains 124 leaf-level risk categories, 77 leaf-level attack methods, and 30 leaf-level environment categories. Reporting every leaf-level combination would produce tables that are too sparse and difficult to interpret. We therefore aggregate leaf nodes according to their first-level parent groups while retaining the complete fine-grained taxonomy in the released data. We report the execution success rate (ESR), defined as the fraction of attempted runs that terminate without infrastructure failure and satisfy the executable success predicate associated with their execution setting. For a benign run, this predicate captures successful completion of the legitimate task. For a run in an attack setting, it captures the case-specific safety outcome defined by the verifier.
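The ESR definition can be made concrete with a short sketch. The `Run` record and its two boolean fields are hypothetical; the one substantive reading taken from the definition is that the denominator counts every attempted run, so an infrastructure failure lowers ESR instead of being excluded.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Run:
    infra_failure: bool  # sandbox or adapter failed before a verdict
    predicate_ok: bool   # setting-specific success predicate held


def esr(runs: Iterable[Run]) -> float:
    """Execution success rate in percent over all attempted runs."""
    runs = list(runs)
    if not runs:
        raise ValueError("ESR is undefined for an empty set of runs")
    successes = sum(1 for r in runs if not r.infra_failure and r.predicate_ok)
    return 100.0 * successes / len(runs)
```

With four attempted runs of which one fails on infrastructure and one fails its predicate, `esr` returns 50.0.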
Figure 2: Distribution of retained Vera executions across first-level risk and environment groups under the benign, single-channel, and multi-channel settings. Each heat-map cell reports the number of retained data items associated with the corresponding group pair.

Figure 3: Distribution of execution cost and interaction length across retained Vera runs. Panel (a) shows the total input-token distribution, panel (b) shows the total output-token distribution, and panel (c) shows the total tool-call count per execution. Dashed and dotted vertical lines mark the median and 95th percentile, respectively. Retained runs are typically moderate in length but exhibit a pronounced right tail, indicating that most scenarios are operationally tractable while a smaller subset requires substantially longer context windows or more extensive tool interaction.

[Tables I](https://arxiv.org/html/2607.01793#S4.T1), [II](https://arxiv.org/html/2607.01793#S4.T2) and [III](https://arxiv.org/html/2607.01793#S4.T3) provide three complementary projections of the same execution corpus. [Table I](https://arxiv.org/html/2607.01793#S4.T1) examines whether scenarios remain executable across combinations of safety consequences and application contexts. [Table II](https://arxiv.org/html/2607.01793#S4.T2) characterizes the interaction between deployment environments and attack mechanisms, while [Table III](https://arxiv.org/html/2607.01793#S4.T3) measures how consistently each attack-method group can be instantiated across distinct risk families. These tables report first-level aggregates for readability rather than replacing the underlying taxonomy. Every retained execution remains annotated with one of the 124 fine-grained risk categories, 77 attack methods, and 30 environment categories. [Figure 2](https://arxiv.org/html/2607.01793#S5.F2) complements the rate-based analysis with absolute data density.
[Table I](https://arxiv.org/html/2607.01793#S4.T1) shows that executable coverage remains broad across the joint space of risk categories and environments, although the degree of stability differs noticeably across risk families. All eight first-level risk groups are executable in all ten environment groups, with average ESRs ranging from 79.0% to 95.3%. Integrity is the most stable category at 95.3%, followed by Cyber Attack at 88.4%, while Harmful Output is lowest at 79.0%. This ordering is consistent with the structure of the tasks. Integrity cases are typically anchored in concrete, externally verifiable state changes, which makes successful execution easier to initialize and validate. Harmful Output, in contrast, depends more heavily on the realized model response and therefore exhibits greater sensitivity to environmental context. The cell-level variation further reveals that risks whose realization depends on specific environment capabilities exhibit the widest ESR spread: Priv Escal ranges from 94.1% (OS / Term) to 73.7% (Travel), reflecting the tight coupling between privilege-escalation actions and the availability of system-level primitives. Similarly, Malware Gen drops to 71.4% in Web & Stor, where the restricted execution surface limits the scope of code-generation scenarios.
[Table II](https://arxiv.org/html/2607.01793#S4.T2) presents the same corpus from the perspective of environments and attack methods, and again the main pattern is broad coverage with substantial interaction effects. Average ESR by environment ranges from 77.0% in Productivity to 88.0% in Travel, with Dev & Data and Domain Spec also relatively high at 87.2% and 86.9%, respectively. The higher-scoring environments share a common trait: they expose workflows with clearly separable action steps (booking, repository operations, domain-specific procedures) that naturally accommodate compositional attack strategies. Productivity, by contrast, involves more interdependent multi-step workflows where a single intermediate failure cascades to the verifier. At the same time, no environment is uniformly easy across attacks. Social Engineer consistently underperforms relative to other attack families, falling to 42.9% in CRM & Svc and 50.0% in Web & Stor, while reaching 100.0% in Communication and Social. This disparity suggests that social-engineering attacks succeed primarily when the environment provides persistent, identity-bearing channels (email threads, messaging histories) through which trust can be established over multiple turns, and struggle in transactional environments where agent actions are more mechanically constrained.
[Table III](https://arxiv.org/html/2607.01793#S4.T3) further shows that attack methods differ substantially in how consistently they transfer across risk categories. Profile Infer has the highest average ESR at 89.9%, followed by Task Decompos at 88.5% and Obfuscation at 87.9%, indicating that these methods generalize relatively well across distinct forms of harm. Social Engineer is the least stable attack family at 74.4%, and Format Inducement is also comparatively low at 78.4%. This stratification reveals two distinct families. *Mechanistic* attacks (Obfuscation, Task Decomposition, Constraint Manipulation) operate at the input-encoding or task-decomposition level and transfer broadly because they exploit structural properties of LLM parsing rather than domain-specific agent behavior. *Contextual* attacks (Social Engineering, Roleplay & Persona) require the agent to maintain and update a social model across turns, making them effective when the interaction is rich enough to sustain a narrative but fragile when the environment enforces short, transactional exchanges.
TABLE IV: Execution Success Rate (ESR, %) per agent framework and attack mode.

| Attack Mode | Claude Code | Codex | OpenClaw | Hermes | Average |
|---|---|---|---|---|---|
| Single-Channel | 95.2 | 91.1 | 82.8 | 93.4 | 90.6 |
| Multi-Channel | 93.1 | 95.8 | 89.1 | 97.8 | 93.9 |
| Benign | 80.1 | 69.1 | 58.0 | 74.8 | 70.5 |
| Overall | 88.6 | 84.1 | 70.3 | 86.6 | 82.4 |

##### Validation of combinatorial composition.
The ESR patterns across [Tables I](https://arxiv.org/html/2607.01793#S4.T1), [II](https://arxiv.org/html/2607.01793#S4.T2) and [III](https://arxiv.org/html/2607.01793#S4.T3) provide direct evidence that the compositional test-case construction described in [Section IV-B](https://arxiv.org/html/2607.01793#S4.SS2) produces meaningful and executable safety cases rather than degenerate or incompatible combinations. If the three taxonomy dimensions were not genuinely orthogonal, one would expect large blocks of zero or near-zero ESR wherever incompatible tuples dominate. Instead, all 80 cells in [Table I](https://arxiv.org/html/2607.01793#S4.T1) and all 110 cells in [Table II](https://arxiv.org/html/2607.01793#S4.T2) exceed 40%, demonstrating that the compatibility filtering and deduplication procedure successfully eliminates ill-formed combinations while preserving broad joint coverage. Moreover, the observed variance is itself informative: it localizes to cells where the underlying method predicts difficulty. The low ESR of Malware Gen $\times$ Web & Stor (71.4%) and Priv Escal $\times$ Travel (73.7%) both correspond to cases where the required execution primitive (code execution surface, system-level privileges) is structurally absent from the environment, which the filtering stage appropriately downweights but does not entirely remove. This confirms that the composition pipeline produces a test corpus whose difficulty distribution is governed by genuine environment–risk interactions rather than by arbitrary taxonomy noise.
##### Effectiveness of adaptive test driver\.
The Control Agent's adaptive steering ([Section IV-C2](https://arxiv.org/html/2607.01793#S4.SS3.SSS2)) is designed to address the non-determinism of agent behavior by observing runtime responses and adjusting subsequent interaction turns. [Table IV](https://arxiv.org/html/2607.01793#S5.T4) provides indirect evidence of this mechanism's contribution: the average attack success rate (ASR, i.e., the ESR in the attack settings) across four agents reaches 90.6% under single-channel and 93.9% under multi-channel, significantly exceeding the levels reported by static-prompt benchmarks on comparable agent configurations. More directly, the comparison between Single-Channel and Benign settings isolates the adaptive driver's effect. In the benign setting, no adversarial steering is applied and the agent simply attempts the legitimate task; the average ESR is 70.5%. Switching to single-channel, where the only additional element is the Control Agent's adaptive user messages, raises the success rate by 20.1 percentage points on average. This gap quantifies the value of runtime-adaptive interaction: by reformulating requests upon refusal, decomposing tasks when the agent hesitates, and escalating gradually through the interaction budget, the Control Agent overcomes defenses that would block a single static adversarial prompt. The magnitude of this gap varies across agents: it is largest for OpenClaw (+24.8 points), whose conservative tool-call policies are more easily circumvented by multi-turn reformulation, and smallest for Claude Code (+15.1 points), whose stronger instruction-following capability means even single-turn attacks partially succeed in some configurations.
##### Multi\-channel threat model and tool gateway\.
The configurable tool gateway ([Section IV-C1](https://arxiv.org/html/2607.01793#S4.SS3.SSS1)) enables the multi-channel threat model by injecting adversarial content into tool results while preserving the original response for ground-truth recording. Comparing multi-channel to single-channel in [Table IV](https://arxiv.org/html/2607.01793#S5.T4) reveals that this additional attack surface provides a modest incremental gain on average: +3.3 points, ranging from $-2.1$ (Claude Code) to +6.3 (OpenClaw). This pattern admits a nuanced interpretation. The relatively small average gap indicates that the primary vulnerability lies in the user-message channel, where the Control Agent can iteratively refine its approach; the tool-observation channel provides an additional vector but is not the dominant one. However, the per-agent variation is revealing. Claude Code's slight *decrease* from single to multi-channel (95.2% $\to$ 93.1%) suggests that its safety filters are more attuned to detecting injected content in tool results, possibly triggering additional refusals that offset the injection advantage. OpenClaw's +6.3-point gain is the largest, indicating that for agents with conservative user-message processing, the tool-observation channel serves as an effective bypass that circumvents front-end safety checks. Hermes exhibits the highest multi-channel ASR overall (97.8%), suggesting minimal filtering on either channel. These patterns validate the two-tier threat model: multi-channel testing reveals differential robustness across interaction surfaces that would remain invisible under single-channel evaluation alone.
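The per-agent gaps discussed in the last two paragraphs follow directly from Table IV and can be recomputed with the short sketch below. The dictionary layout and the `gaps` helper are illustrative; the numbers are the ones reported in the table.

```python
from typing import Dict

# ESR (%) per agent and setting, copied from Table IV.
TABLE_IV: Dict[str, Dict[str, float]] = {
    "Claude Code": {"single": 95.2, "multi": 93.1, "benign": 80.1},
    "Codex": {"single": 91.1, "multi": 95.8, "benign": 69.1},
    "OpenClaw": {"single": 82.8, "multi": 89.1, "benign": 58.0},
    "Hermes": {"single": 93.4, "multi": 97.8, "benign": 74.8},
}


def gaps(table: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Per-agent differences in percentage points, rounded to one decimal."""
    return {
        agent: {
            "single_minus_benign": round(row["single"] - row["benign"], 1),
            "multi_minus_single": round(row["multi"] - row["single"], 1),
        }
        for agent, row in table.items()
    }


if __name__ == "__main__":
    for agent, diff in gaps(TABLE_IV).items():
        print(agent, diff)
```

Rounding to one decimal avoids floating-point residue such as 24.799999999999997 and matches the precision of the published table.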
##### Evidence\-grounded verification analysis\.
The verification hierarchy ([Equation 7](https://arxiv.org/html/2607.01793#S4.E7)) prioritizes environment state over tool-call records over agent responses. The inversion between benign and adversarial ESR in [Table IV](https://arxiv.org/html/2607.01793#S5.T4) provides indirect validation of this design. The average Benign ESR (70.5%) is substantially lower than attack ASRs (90.6–93.9%) not because benign tasks are harder for agents, but because benign verifiers enforce multi-predicate end-to-end correctness over $s_T$: the agent must produce the right sequence of actions, each with observable effects in the final environment state. Attack verifiers, in contrast, need only confirm that a specific safety violation was realized through observable artifacts; a single harmful state change or data exfiltration event suffices. This asymmetry is by design: it prevents false positives where an agent verbally complies with an unsafe request but never executes it, and prevents false negatives where an agent claims refusal after already producing a harmful side effect. The Benign ESR further serves as an implicit quality metric for the test-case compilation stage ([Section IV-B](https://arxiv.org/html/2607.01793#S4.SS2)): a benign case that fails verification indicates that the programmatically generated $s_0$ or the verifier $V_g$ may be miscalibrated, providing a signal for iterative refinement of the scenario generation pipeline.
##### Cross\-agent analysis\.
[Figure 2](https://arxiv.org/html/2607.01793#S5.F2) complements the ESR tables with absolute data density across the three compositional views. The retained executions are not uniformly distributed across the taxonomy, but they remain concentrated in several semantically meaningful regions under the benign, single-channel, and multi-channel settings, indicating that the dataset preserves both breadth and realistic variation in scenario frequency. [Figure 3](https://arxiv.org/html/2607.01793#S5.F3) shows that the retained executions are usually moderate in length but exhibit a pronounced right tail. The median run uses 155k input tokens, 3k output tokens, and 11 tool calls, while the corresponding 95th percentiles are 789k input tokens, 11k output tokens, and 38 tool calls, suggesting that most scenarios are operationally tractable while a smaller subset preserves longer-horizon and more tool-intensive interactions.
[Table IV](https://arxiv.org/html/2607.01793#S5.T4) shows substantial variation across agent frameworks and execution settings. Claude Code attains the highest overall ESR at 88.6%, Hermes follows closely at 86.6%, Codex reaches 84.1%, and OpenClaw is lowest at 70.3%. The ranking is informative: Claude Code and Hermes, which employ richer tool-use orchestration and longer context windows, are more susceptible to adaptive multi-turn attacks because their stronger task-completion capabilities also make them more compliant with adversarial instructions that are embedded within plausible workflows. OpenClaw's lower ESR partly reflects its more conservative tool-call policies, but also its higher infrastructure failure rate (as evidenced by its correspondingly low Benign score of 58.0%). The correlation between benign task-completion ability and attack susceptibility reveals a fundamental tension in agent design: the same capabilities that make an agent useful (strong instruction following, flexible tool orchestration, long-context reasoning) also make it more amenable to adversarial manipulation within plausible task contexts. This "capability–vulnerability alignment" is not an artifact of our evaluation but a structural property that Vera's unified execution contract ([Section IV-C1](https://arxiv.org/html/2607.01793#S4.SS3.SSS1)) makes visible by testing heterogeneous agents under identical scenarios.
The 3.3-point gap between Single-Channel (90.6%) and Multi-Channel (93.9%) average ESR further confirms that tool-result injection provides a measurable but modest additional advantage when layered on top of adaptive user-message control. However, this aggregate masks important per-agent variation. For Codex, multi-channel testing increases ASR by 4.7 points (91.1% $\to$ 95.8%), while for Claude Code it decreases by 2.1 points (95.2% $\to$ 93.1%). This suggests that Codex's safety mechanisms operate primarily at the user-message parsing stage and are more easily bypassed when adversarial instructions arrive through a trusted tool-result channel, whereas Claude Code applies comparable scrutiny to both channels. Such differential channel robustness would be entirely invisible to single-channel evaluation frameworks, validating the design choice of the configurable tool gateway as a principled mechanism for probing per-channel safety boundaries in deployed agents.
## VI Downstream Task
A natural downstream use of Vera is safety classification for agent interactions. In preliminary experiments, we found that several strong existing guard models do not transfer especially well to our benchmark. As shown in [Figure 4](https://arxiv.org/html/2607.01793#S6.F4), LlamaGuard3 \[[12](https://arxiv.org/html/2607.01793#bib.bib18)\] reaches 0.438 accuracy, 0.258 recall, and 0.310 F1, while the base Qwen3Guard \[[51](https://arxiv.org/html/2607.01793#bib.bib19)\] improves to 0.670 accuracy, 0.468 recall, and 0.637 F1. AgentDoG \[[21](https://arxiv.org/html/2607.01793#bib.bib22)\] attains the highest recall among the off-the-shelf baselines at 0.742, but its accuracy remains 0.490 and its F1 reaches 0.643. We additionally evaluate NemoGuard \[[29](https://arxiv.org/html/2607.01793#bib.bib20)\], YuFeng-XGuard \[[18](https://arxiv.org/html/2607.01793#bib.bib21)\], and AgentDoG 1.5 \[[20](https://arxiv.org/html/2607.01793#bib.bib23)\], which exhibit similarly limited transfer. These results suggest that detecting safety-relevant failures in our benchmark is nontrivial even for competitive guard models, likely because the retained cases involve richer environment grounding, broader attack diversity, and more varied realizations of unsafe behavior than standard moderation-style evaluation settings.
We therefore fine-tune a guard model based on Qwen3Guard using data derived from Vera. The resulting model substantially improves performance on this benchmark, reaching 0.930 accuracy, 0.903 recall, and 0.941 F1, as shown in [Figure 4](https://arxiv.org/html/2607.01793#S6.F4). Relative to the base Qwen3Guard, this corresponds to gains of 26.0 points in accuracy, 43.5 points in recall, and 30.4 points in F1. The improvement is also consistent across all three metrics when compared with the other baselines, indicating that the gain is not driven by a narrow precision–recall tradeoff. Rather, the fine-tuned model appears to learn decision boundaries that are better matched to the structure of agent safety violations represented in Vera.
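Precision is not reported directly, but it can be backed out from recall and F1 by solving $F_1 = 2PR/(P+R)$ for $P$, which gives $P = F_1 R / (2R - F_1)$. The sketch below does this; `implied_precision` is a hypothetical helper, and because the published metrics are rounded to three decimals, the recovered values are approximate and should be read as a sanity check on the tradeoff claim, not as results from the paper.

```python
def implied_precision(recall: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for precision P, given recall R and F1."""
    denom = 2.0 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision exists for this (recall, F1) pair")
    return f1 * recall / denom


if __name__ == "__main__":
    # (recall, F1) pairs as reported in the text for two of the models.
    reported = {
        "Qwen3Guard (base)": (0.468, 0.637),
        "Fine-tuned": (0.903, 0.941),
    }
    for name, (recall, f1) in reported.items():
        print(f"{name}: implied precision ~ {implied_precision(recall, f1):.3f}")
```

The estimate is most sensitive when $2R - F_1$ is small, as it is for the base model, so small rounding differences in the inputs can shift that value noticeably.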
The training dynamics are correspondingly stable. [Figure 5](https://arxiv.org/html/2607.01793#S6.F5) shows that both training and evaluation loss decrease smoothly throughout optimization, with the final train loss reaching 0.0868 and the best evaluation loss reaching 0.0387 at step 210. The evaluation curve tracks the training trend without late-stage instability, which suggests that the fine-tuning procedure is well behaved on this task. Taken together, these results indicate that Vera can support not only evaluation of agent safety failures, but also the development of downstream guard models that are substantially better aligned with the threat patterns present in realistic and diverse agent settings.
Figure 4: Performance of off-the-shelf and fine-tuned guard models on the Vera downstream safety-classification task. We report accuracy, recall, and F1. Existing guard models show limited transfer to this benchmark, while fine-tuning Qwen3Guard on Vera-derived data yields substantial gains across all three metrics.

Figure 5: Training dynamics of the fine-tuned Qwen3Guard model on the Vera downstream task. The smoothed training loss decreases steadily over optimization, and the evaluation loss reaches its minimum of 0.0387 at step 210. The overall trajectory indicates stable convergence during fine-tuning.

TABLE V: Guard-model performance on R-Judge.

| Model | Acc | Rec. | F1 |
|---|---|---|---|
| LlamaGuard3 | 53.7 | 100.0 | 69.5 |
| NemoGuard | 54.4 | 40.6 | 48.5 |
| Qwen3-Guard | 59.4 | 32.3 | 45.8 |
| BraveGuard | 57.8 | 91.2 | 69.7 |
| Finetuned (Ours) | 61.7 | 77.9 | 68.4 |

To examine whether the safety judgments learned from Vera transfer beyond our benchmark, we further evaluate the fine-tuned guard model on R-Judge \[[45](https://arxiv.org/html/2607.01793#bib.bib12)\], a separate safety-classification benchmark with a different distribution of prompts and decision boundaries. The results are reported in [Table V](https://arxiv.org/html/2607.01793#S6.T5). Several observations are notable. First, the fine-tuned model achieves the highest accuracy at 61.7%, exceeding all off-the-shelf baselines, which suggests that training on Vera does not merely overfit to the annotation conventions or execution structure of our own benchmark. Instead, it appears to improve the model's ability to separate harmful from non-harmful cases under distribution shift. Second, the recall of the fine-tuned model reaches 77.9%, which is lower than the extremely aggressive settings of LlamaGuard3 and BraveGuard \[[6](https://arxiv.org/html/2607.01793#bib.bib24)\] but substantially higher than NemoGuard and Qwen3-Guard. This places our model at a more balanced operating point between sensitivity and overprediction.
Third, although BraveGuard attains the highest F1 score by a small margin, its lower accuracy suggests a stronger tendency to label examples as unsafe\. By contrast, our fine\-tuned model delivers the most accurate overall judgments while maintaining competitive F1, indicating thatVera\-derived supervision yields a detector with improved calibration rather than simply higher positive prediction rates\. Taken together, these results provide preliminary evidence that the safety signals captured byVeraare not purely benchmark\-specific and can support guard models that generalize to external evaluation settings\.
## VIIConclusion
We introduced Vera, a framework that operationalizes agent safety testing through executable safety cases, adaptive runtime interaction, and evidence-grounded verification over environment state. Experiments on four production agent frameworks reveal substantial safety weaknesses across diverse risks, attack methods, and environments, while guard models fine-tuned on Vera-Bench generalize more effectively than strong off-the-shelf baselines. As agents become more autonomous and tool-integrated, safety evaluation must evolve from static benchmarking toward modular, executable testing infrastructure; we hope Vera helps establish such infrastructure for future agent assurance.
## References
- [1] M. Andriushchenko, A. Souly, M. Dziemian, D. Duenas, M. Lin, J. Wang, D. Hendrycks, A. Zou, Z. Kolter, M. Fredrikson, et al. (2025) Agentharm: a benchmark for measuring harmfulness of llm agents. In International Conference on Learning Representations, Vol. 2025, pp. 79185–79220.
- [2] (2025-02) Claude 3.7 Sonnet and Claude Code. External Links: [Link](https://www.anthropic.com/news/claude-3-7-sonnet).
- [3] Z. Chen, X. Liu, H. Tong, C. Guo, Y. Nie, J. Zhang, M. Kang, C. Xu, Q. Liu, X. Liu, et al. (2026) DecodingTrust-agent platform (dtap): a controllable and interactive red-teaming platform for ai agents. arXiv preprint arXiv:2605.04808.
- [4] Z. Chen, Y. Zhang, Y. Liu, G. Deng, Y. Li, Y. Zhang, J. Ning, L. Y. Zhang, L. Ma, and Z. Li (2026) How your credentials are leaked by LLM agent skills: an empirical study. arXiv preprint arXiv:2604.03070.
- [5] E. Debenedetti, J. Zhang, M. Balunovic, L. Beurer-Kellner, M. Fischer, and F. Tramèr (2024) Agentdojo: a dynamic environment to evaluate prompt injection attacks and defenses for llm agents. Advances in Neural Information Processing Systems 37, pp. 82895–82920.
- [6] Y. Feng, Y. Ding, X. Du, M. Wen, X. Deng, Y. Guo, Y. Xie, B. Zheng, Y. Tan, Y. Li, et al. (2026) BraveGuard: from open-world threats to safer computer-use agents. arXiv preprint arXiv:2606.01166.
- [7] Y. Feng, Y. Ding, Y. Tan, X. Ma, Y. Li, Y. Wu, Y. Gao, K. Zhai, and Y. Guo (2026) Agenthazard: a benchmark for evaluating harmful behavior in computer-use agents. arXiv preprint arXiv:2604.02947.
- [8] Y. Feng, Y. Ding, Y. Tan, B. Zheng, Y. Guo, X. Li, K. Zhai, Y. Li, and W. Huang (2026) Skilltrojan: backdoor attacks on skill-based agent systems. In International Conference on Machine Learning.
- [9] Y. Feng, Y. Li, Y. Wu, Y. Tan, Y. Guo, Y. Ding, K. Zhai, X. Ma, and Y. Jiang (2026) Backdooragent: a unified framework for backdoor attacks on llm-based agents. arXiv preprint arXiv:2601.04566.
- [10] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz (2023) Not what you’ve signed up for: compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec), pp. 79–90.
- [11] S. Han, K. Rao, A. Ettinger, L. Jiang, B. Y. Lin, N. Lambert, Y. Choi, and N. Dziri (2024) WildGuard: open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. Advances in Neural Information Processing Systems 37.
- [12] H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, et al. (2023) Llama guard: llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674.
- [13] Kimi Team, Y. Bai, Y. Bao, Y. Charles, C. Chen, G. Chen, H. Chen, H. Chen, J. Chen, N. Chen, et al. (2025) Kimi k2: open agentic intelligence. arXiv preprint arXiv:2507.20534.
- [14] A. Laurent (2025) Google TPUs explained: architecture & performance for Gemini 3. Note: https://intuitionlabs.ai/articles/google-tpu-architecture-gemini-3
- [15] H. Lee, Z. Zhang, H. Lu, and L. Zhang (2025) Sec-bench: automated benchmarking of llm agents on real-world software security tasks. Advances in Neural Information Processing Systems 38, pp. 116342–116378.
- [16] I. Levy, B. Wiesel, S. Marreed, A. Oved, A. Yaeli, N. Mashkif, and S. Shlomov (2026) St-webagentbench: a benchmark for evaluating safety and trustworthiness in web agents. In International Conference on Learning Representations.
- [17] Y. Li, H. Luo, Y. Xie, Y. Fu, Z. Yang, S. Shao, Q. Ren, W. Qu, Y. Fu, Y. Yang, et al. (2026) Atbench: a diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis. arXiv preprint arXiv:2604.02022.
- [18] J. Lin, M. Liu, X. Huang, J. Li, H. Hong, X. Yuan, Y. Chen, L. Huang, H. Xue, R. Duan, et al. (2026) YuFeng-xguard: a reasoning-centric, interpretable, and flexible guardrail model for large language models. arXiv preprint arXiv:2601.15588.
- [19] R. Lin, Q. Li, J. Chen, C. Zhou, and S. Ji (2026) SOPE: situation-aware and statistically indistinguishable privacy exfiltration for MCP-enabled agents. In International Conference on Machine Learning.
- [20] D. Liu, Y. Li, Z. Yang, P. Wang, G. Chen, Y. Xie, Q. Mao, W. Qu, Y. Zhu, T. Zhou, et al. (2026) AgentDoG 1.5: a lightweight and scalable alignment framework for ai agent safety and security. arXiv preprint arXiv:2605.29801.
- [21] D. Liu, Q. Ren, C. Qian, S. Shao, Y. Xie, Y. Li, Z. Yang, H. Luo, P. Wang, Q. Liu, et al. (2026) AgentDoG: a diagnostic guardrail framework for AI agent safety and security. arXiv preprint arXiv:2601.18491.
- [22] H. Luo, S. Dai, C. Ni, X. Li, G. Zhang, K. Wang, T. Liu, and H. Salam (2025) Agentauditor: human-level safety and security evaluation for llm agents. Advances in Neural Information Processing Systems 38, pp. 43241–43298.
- [23] X. Ma, Y. Gao, Y. Wang, R. Wang, X. Wang, Y. Sun, Y. Ding, H. Xu, Y. Chen, Y. Zhao, et al. (2025) Safety at scale: a comprehensive survey of large model and agent safety. Foundations and Trends in Privacy and Security 8 (3-4), pp. 1–240.
- [24] Nous Research (2026) Hermes Agent. Note: Computer software. External Links: [Link](https://github.com/NousResearch/hermes-agent).
- [25] OpenAI (2025-05) Introducing Codex. External Links: [Link](https://openai.com/index/introducing-codex/).
- [26] OpenClaw (2026) OpenClaw. Note: Computer software. External Links: [Link](https://github.com/openclaw/openclaw).
- [27] OWASP Foundation (2025) OWASP top 10 for large language model applications v2.0. Note: https://owasp.org/www-project-top-10-for-large-language-model-applications/ (published November 2024).
- [28] Qwen Team (2026) Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804.
- [29] T. Rebedea, R. Dinu, M. N. Sreedhar, C. Parisien, and J. Cohen (2023) Nemo guardrails: a toolkit for controllable and safe llm applications with programmable rails. In Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations, pp. 431–445.
- [30] Y. Ruan, H. Dong, A. Wang, S. Pitis, Y. Zhou, J. Ba, Y. Dubois, C. J. Maddison, and T. Hashimoto (2024) Identifying the risks of LM agents with an LM-emulated sandbox. In International Conference on Learning Representations.
- [31] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023) Toolformer: language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, Vol. 36, pp. 68539–68551.
- [32] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in neural information processing systems 36, pp. 8634–8652.
- [33] A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. (2026) OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.
- [34] B. E. Strom, A. Applebaum, D. P. Miller, K. C. Nickels, A. G. Pennington, and C. B. Thomas (2020) MITRE ATT&CK: design and philosophy. Technical report, The MITRE Corporation. Note: Originally published July 2018, revised March 2020. Available at https://attack.mitre.org/docs/ATTACK_Design_and_Philosophy_March_2020.pdf
- [35] H. Su, J. Luo, C. Liu, X. Yang, Y. Zhang, Y. Dong, and J. Zhu (2026) A survey on autonomy-induced security risks in large model-based agents. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- [36] A. D. Tur, N. Meade, X. H. Lù, A. Zambrano, A. Patel, E. Durmus, S. Gella, K. Stańczak, and S. Reddy (2025) Safearena: evaluating the safety of autonomous web agents. In International Conference on Machine Learning.
- [37] S. Vijayvargiya, A. B. Soni, X. Zhou, Z. Z. Wang, N. Dziri, G. Neubig, and M. Sap (2026) OpenAgentSafety: a comprehensive framework for evaluating real-world AI agent safety. In International Conference on Learning Representations.
- [38] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, et al. (2024) Autogen: enabling next-gen LLM applications via multi-agent conversation. In Conference on Language Modeling.
- [39] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shi, Z. Lu, et al. (2024) OSWorld: benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information Processing Systems, Vol. 37.
- [40] T. Xie, X. Qi, Y. Zeng, Y. Huang, U. M. Sehwag, K. Huang, L. He, B. Wei, D. Li, Y. Sheng, R. Jia, B. Li, K. Li, D. Chen, P. Henderson, and P. Mittal (2025) SORRY-Bench: systematically evaluating large language model safety refusal. In International Conference on Learning Representations.
- [41] C. Xu, M. Kang, J. Zhang, Z. Liao, L. Mo, M. Yuan, H. Sun, and B. Li (2025) Advagent: controllable blackbox red-teaming on web agents. In International Conference on Machine Learning.
- [42] M. Xu, J. Fan, X. Huang, C. Zhou, J. Kang, D. Niyato, S. Mao, Z. Han, X. Shen, and K. Lam (2025) Forewarned is forearmed: a survey on large language model-based agents in autonomous cyberattacks. arXiv preprint arXiv:2505.12786.
- [43] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press (2024) SWE-agent: agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, Vol. 37.
- [44] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023) ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations.
- [45] T. Yuan, Z. He, L. Dong, Y. Wang, R. Zhao, T. Xia, L. Xu, B. Zhou, F. Li, Z. Zhang, et al. (2024) R-judge: benchmarking safety risk awareness for llm agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 1467–1490.
- [46] A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, et al. (2026) GLM-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763.
- [47] Q. Zhan, Z. Liang, Z. Ying, and D. Kang (2024) Injecagent: benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 10471–10506.
- [48] H. Zhang, J. Huang, K. Mei, Y. Yao, Z. Wang, C. Zhan, H. Wang, and Y. Zhang (2025) Agent security bench (asb): formalizing and benchmarking attacks and defenses in llm-based agents. In International Conference on Learning Representations, Vol. 2025, pp. 88011–88046.
- [49] J. Zhang, S. Yang, and B. Li (2025) UDora: a unified red teaming framework against LLM agents by dynamically hijacking their own reasoning. In International Conference on Machine Learning.
- [50] J. M. Zhang, M. Harman, L. Ma, and Y. Liu (2022) Machine learning testing: survey, landscapes and horizons. IEEE Transactions on Software Engineering 48 (1), pp. 1–36.
- [51] H. Zhao, C. Yuan, F. Huang, X. Hu, Y. Zhang, A. Yang, B. Yu, D. Liu, J. Zhou, J. Lin, et al. (2025) Qwen3guard technical report. arXiv preprint arXiv:2510.14276.