Capability Minimization as a Safety Primitive: Risk-Aware Causal Gating for Least-Privilege LLM Agents

arXiv cs.AI 06/15/26, 04:00 AM Papers
llm-agents safety least-privilege tool-augmented prompt-injection authorization causal-gating
Summary
This paper proposes Risk-Aware Causal Gating (RACG), a training-free mechanism that applies the principle of least privilege to LLM agent tool exposure, reducing attack surface from prompt injection by only exposing high-risk tools when authorized and causally necessary.
arXiv:2606.13884v1 Announce Type: new Abstract: Modern decision systems increasingly rely on learned components whose outputs may be confident yet wrong, exposing downstream actions to costly errors. We introduce Risk-Aware Causal Gating (RACG), a framework that decides whether to act on, defer, or abstain from a model's prediction by combining causal effect estimation with calibrated risk control. RACG models the causal pathway from candidate actions to outcomes and gates each decision according to an estimated counterfactual risk rather than raw predictive confidence. To make gating reliable, we derive distribution-free bounds on the probability of acting under high-risk conditions and show how these bounds translate into operating thresholds that satisfy user-specified safety constraints. We further propose an adaptive gating policy that adjusts to distribution shift by monitoring discrepancies between predicted and realized outcomes, tightening the gate when causal assumptions appear violated. Across simulated interventions and real-world decision benchmarks, RACG reduces high-cost errors substantially while preserving most of the utility of an ungated policy, and it outperforms confidence-based and selective-prediction baselines at matched abstention rates. Our results indicate that explicitly separating causal risk from predictive uncertainty yields decision systems that are both safer and more transparent, offering a principled mechanism for trustworthy automation in high-stakes settings.
# Capability Minimization as a Safety Primitive: Risk-Aware Causal Gating for Least-Privilege LLM Agents
Source: [https://arxiv.org/html/2606.13884](https://arxiv.org/html/2606.13884)
###### Abstract

Tool-augmented large language model (LLM) agents are increasingly granted access to high-consequence actions (sending messages, transferring funds, deleting records), yet most tool-selection methods treat every tool as equally safe to expose. We argue that the visible tool set is a *security control surface*: an exposed but unnecessary high-risk tool enlarges the attack surface and enables misuse via prompt injection. Unlike relevance-based tool retrieval, which asks *which tools are useful*, we treat tool visibility as *temporary authority* and ask which tools are safe and authorized to expose at the current state. We propose *Risk-Aware Causal Gating* (RACG), a training-free mechanism that applies the principle of least privilege to agent tool exposure. RACG does not replace causal tool filtering; it supplies the safety dimension that causal filtering omits: a tool can be causally useful yet unsafe to expose until the state contains *trusted* authorization evidence. Building on precondition–effect tool contracts, RACG exposes a high-risk tool only when it is (i) on a minimal causal path to the goal and (ii) gated by an authorization precondition satisfied in the current state. We formalize attack-surface metrics, characterize the safety–success Pareto frontier over a risk-penalty parameter $\lambda$, and evaluate RACG as a structural defense against indirect prompt injection. On a controlled benchmark with enforced tool visibility and trusted authorization provenance, RACG eliminates unauthorized high-risk exposure and targeted injection-induced high-risk calls while maintaining task completion on authorization-required tasks. Validation with seven hosted LLMs (Claude Opus 4, Sonnet 4.6, and Haiku 4.5; GPT-OSS 120B; and Nova Premier, Nova Pro, and Nova 2 Lite) reproduces the pattern, and we show the guarantee is exactly conditional on authorization provenance: when injections can forge authorization variables, the defense collapses, delineating precisely when the method holds.

## I Introduction

Tool access lets large language model (LLM) agents move beyond text generation to act in the world: they call APIs, edit files, send email, update calendars, move money, and operate structured systems \[[21](https://arxiv.org/html/2606.13884#bib.bib1),[19](https://arxiv.org/html/2606.13884#bib.bib2),[15](https://arxiv.org/html/2606.13884#bib.bib3)\]. As agents are wired to more tools, two distinct problems arise. The first is *capability*: can the model select and call the right tool with valid arguments \[[14](https://arxiv.org/html/2606.13884#bib.bib6),[9](https://arxiv.org/html/2606.13884#bib.bib4)\]? The second, which we study here, is *exposure*: which tools should be *visible* to the agent at each decision step, and at what risk?

Most prior tool-selection work answers the exposure question with relevance or efficiency. Retrieval and pruning methods surface tools whose names, descriptions, or schemas match the request \[[20](https://arxiv.org/html/2606.13884#bib.bib7),[6](https://arxiv.org/html/2606.13884#bib.bib8),[10](https://arxiv.org/html/2606.13884#bib.bib10)\], and recent work studies how shortlist size trades off selection difficulty against coverage \[[16](https://arxiv.org/html/2606.13884#bib.bib9)\]. Causal Minimal Tool Filtering (CMTF) advanced this further by exposing only the tools *causally necessary* to advance the current state toward the goal \[[3](https://arxiv.org/html/2606.13884#bib.bib13)\]. These methods improve reliability and cost, but they treat all tools as equally safe to show: a read-only `search` tool and an irreversible `delete_file` or `transfer_funds` tool are filtered by the same criterion.

We argue that this is a safety gap. In security, the principle of least privilege states that a component should hold only the authority required for its current task \[[18](https://arxiv.org/html/2606.13884#bib.bib16)\]; over-granted authority is the root of the classic *confused-deputy* problem, in which an otherwise-correct component is tricked into misusing a capability it should not have held \[[8](https://arxiv.org/html/2606.13884#bib.bib17)\]. LLM agents are confused deputies by construction: they act on natural-language instructions that may be adversarial, ambiguous, or contaminated through indirect prompt injection \[[7](https://arxiv.org/html/2606.13884#bib.bib18)\]. When a high-risk tool is merely *visible*, an injected instruction, a hallucinated plan, or a single mis-step can invoke it. As a concrete example, an email agent asked merely to *summarize* a message does not need `send_email`, `forward_email`, or `delete_email` in its action space; if those tools are visible while the agent reads an attacker-controlled email body, the entire prompt-injection defense burden shifts onto the model’s willingness to refuse. RACG instead withholds those tools until a causally necessary and authorized state is reached, so the dangerous call is not available to attempt in the first place. The visible tool set is therefore not just an efficiency knob; it is an attack-surface control.

In this paper we make capability minimization a first-class *safety primitive* for agents. We propose *Risk-Aware Causal Gating* (RACG), a training-free method that extends precondition–effect tool contracts with explicit risk levels and authorization preconditions. RACG exposes a high-risk tool only when it is both (i) on a minimal causal path from the current state to the goal and (ii) gated by an authorization variable present in the current state. Read-only and low-risk tools are exposed by causal sufficiency as usual; dangerous tools must be *causally justified and authorized* before they enter the agent’s action space. We are explicit about the hierarchy: RACG does not replace Causal Minimal Tool Filtering (CMTF) but extends it. CMTF establishes causal necessity for reliable exposure; RACG adds the missing safety dimension (a tool can be causally useful yet unsafe to expose until the state contains trusted authorization evidence), turning causal minimality into *least-privilege* minimality through risk labels, authorization gates, and provenance constraints.

This paper makes four contributions. First, we formulate tool-menu exposure as an agent *safety surface*: visible tools constitute temporary authority, and unnecessary high-risk tools create exploitable standing capability. Second, we introduce RACG, a training-free least-privilege exposure layer that combines causal tool contracts with risk labels, authorization preconditions, and trusted-provenance constraints, together with a risk-penalty parameter $\lambda$ that traces a safety–success Pareto frontier. Third, we define safety metrics for tool exposure (high-risk and risk-weighted attack surface, unauthorized exposure, premature high-risk action rate, and injection success under gated action spaces) and an explicit threat model that scopes the guarantee. Fourth, we evaluate RACG on RiskGate against all-tools, relevance-retrieval, state-aware, and causal-filtering baselines, showing that under enforced tool visibility and trusted authorization provenance, RACG eliminates unauthorized high-risk exposure and targeted injected high-risk calls while preserving completion on authorization-required tasks, and we identify the precise provenance condition under which the guarantee holds or collapses.

## II Background and Related Work

### II-A Tool-Augmented LLM Agents and Exposure

Interleaved reasoning and acting \[[21](https://arxiv.org/html/2606.13884#bib.bib1)\], self-taught API use \[[19](https://arxiv.org/html/2606.13884#bib.bib2)\], and large API ecosystems \[[15](https://arxiv.org/html/2606.13884#bib.bib3)\] established tool use as a core agent capability, and benchmarks measure whether models call tools correctly \[[9](https://arxiv.org/html/2606.13884#bib.bib4),[14](https://arxiv.org/html/2606.13884#bib.bib6),[11](https://arxiv.org/html/2606.13884#bib.bib5)\]. These assume a fixed interface; the upstream question of which tools to expose has been studied mainly through retrieval and pruning \[[20](https://arxiv.org/html/2606.13884#bib.bib7),[6](https://arxiv.org/html/2606.13884#bib.bib8),[10](https://arxiv.org/html/2606.13884#bib.bib10),[16](https://arxiv.org/html/2606.13884#bib.bib9)\]. CMTF reframed exposure as causal sufficiency, exposing only the next causal frontier \[[3](https://arxiv.org/html/2606.13884#bib.bib13)\]. We build directly on this contract-based view but add a safety dimension that prior exposure work omits: the *risk* of the tools being exposed.

### II-B Least Privilege and the Confused Deputy

The principle of least privilege \[[18](https://arxiv.org/html/2606.13884#bib.bib16)\] and the confused-deputy analysis \[[8](https://arxiv.org/html/2606.13884#bib.bib17)\] are foundational to secure system design: authority should be minimal, just-in-time, and explicitly conferred. We port these ideas to agent tool exposure, treating the visible tool set as the agent’s standing authority and arguing that high-risk authority should be conferred only on causal-and-authorization demand.

### II-C Agent Safety and Prompt Injection

LLM-integrated applications are vulnerable to indirect prompt injection, where adversarial content in retrieved data steers the agent into unintended actions \[[7](https://arxiv.org/html/2606.13884#bib.bib18)\]. Sandboxes and benchmarks such as ToolEmu \[[17](https://arxiv.org/html/2606.13884#bib.bib20)\], R-Judge \[[22](https://arxiv.org/html/2606.13884#bib.bib21)\], and AgentDojo \[[4](https://arxiv.org/html/2606.13884#bib.bib19)\] surface and measure such risks, and testing agents safely in the wild has been studied as an operational problem \[[13](https://arxiv.org/html/2606.13884#bib.bib22)\]. Most defenses act at the instruction or output layer (detection, sanitization, verification) or recover after the fact \[[1](https://arxiv.org/html/2606.13884#bib.bib14)\]. A useful way to position RACG against guardrail and policy-enforcement systems: guardrails decide whether an *attempted* action is allowed; RACG decides whether the action is *available to attempt* at all. RACG is therefore complementary and *structural*: it reduces the *means* of attack by withholding dangerous tools from the action space until they are causally justified and authorized, rather than adjudicating calls after the model has chosen to make them.

TABLE I: Where RACG sits relative to prior tool-handling work. Each line of work optimizes a different objective; RACG adds an authority/risk dimension that the others omit.

### II-D Preconditions, Effects, and Contract Inference

RACG inherits the precondition–effect abstraction from classical planning \[[5](https://arxiv.org/html/2606.13884#bib.bib11),[12](https://arxiv.org/html/2606.13884#bib.bib12)\] and from contract-based tool filtering \[[3](https://arxiv.org/html/2606.13884#bib.bib13)\]. Because gating quality depends on contract quality, automatic contract inference \[[2](https://arxiv.org/html/2606.13884#bib.bib15)\] is both an enabler and a threat vector that we analyze in Section [IX](https://arxiv.org/html/2606.13884#S9).

## III Problem Formulation

We extend the multi-step tool-selection setting of CMTF \[[3](https://arxiv.org/html/2606.13884#bib.bib13)\] with explicit risk and authorization.

### III-A Tools, Risk, and Authorization

Let $\mathcal{T}=\{t_1,\dots,t_n\}$ be the tool library. Each tool is a contract

$$t_i=(d_i,R_i,E_i,c_i,\rho_i,\alpha_i),\tag{1}$$

where $d_i$ is a description, $R_i$ the required state variables (preconditions), $E_i$ the produced variables (effects), $c_i$ an optional cost, $\rho_i\in\{\textsc{low},\textsc{med},\textsc{high}\}$ a risk level, and $\alpha_i\subseteq\mathcal{X}$ a (possibly empty) set of *authorization variables* that must be present in the state before a risk-bearing tool may be exposed. Read-only tools have $\rho_i=\textsc{low}$ and $\alpha_i=\emptyset$; irreversible or externally visible actions (send, delete, share, pay, update) have $\rho_i\in\{\textsc{med},\textsc{high}\}$ and non-empty $\alpha_i$.

*Terminology.* Throughout, we call a tool *risk-bearing* when $\rho_i\in\{\textsc{med},\textsc{high}\}$ (i.e. $\rho_i\neq\textsc{low}$) and reserve *high-risk* for $\rho_i=\textsc{high}$ specifically; the gating and exposure metrics below key on the risk-bearing predicate, while examples of irreversible actions (send, delete, pay) are high-risk.

Let $\mathcal{X}$ be the universe of state variables; at step $t$ the state is $s_t\subseteq\mathcal{X}$, and the goal is $g\subseteq\mathcal{X}$, complete when $g\subseteq s_t$. A filter selects a visible set $\mathcal{V}_t\subseteq\mathcal{T}$; the agent picks $a_t\in\mathcal{V}_t$ and the state updates as $s_{t+1}=s_t\cup E_{a_t}$.
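The contract of Eq. (1) and the state update can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's released code: the class and field names (`Tool`, `requires`, `effects`, `auth`) and the example preconditions are assumptions, while the tool names and state variables follow the send-email trajectory of Figure 1.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MED = "med"
    HIGH = "high"


@dataclass(frozen=True)
class Tool:
    """One contract t_i = (d_i, R_i, E_i, c_i, rho_i, alpha_i), as in Eq. (1)."""
    name: str
    description: str                    # d_i
    requires: frozenset = frozenset()   # R_i: preconditions
    effects: frozenset = frozenset()    # E_i: produced state variables
    cost: float = 1.0                   # c_i
    risk: Risk = Risk.LOW               # rho_i
    auth: frozenset = frozenset()       # alpha_i: authorization variables


def step(state: frozenset, tool: Tool) -> frozenset:
    """State update s_{t+1} = s_t ∪ E_{a_t}."""
    return state | tool.effects


def goal_reached(state: frozenset, goal: frozenset) -> bool:
    """The task is complete when g ⊆ s_t."""
    return goal <= state


# Hypothetical contracts; the preconditions are assumed for illustration.
search_emails = Tool("search_emails", "find a message",
                     requires=frozenset({"folder", "sender_name"}),
                     effects=frozenset({"email_id"}))
send_email = Tool("send_email", "send a message",
                  requires=frozenset({"email_id"}),
                  effects=frozenset({"email_sent"}),
                  risk=Risk.HIGH,
                  auth=frozenset({"recipient_confirmed"}))

s0 = frozenset({"folder", "sender_name"})
s1 = step(s0, search_emails)
goal = frozenset({"email_sent"})
```

Using immutable `frozenset` values keeps each state hashable, which is convenient if states are later used as nodes in a path search.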

### III-B Causal Sufficiency with Authorization

As in CMTF, a tool is *executable* when $R_i\subseteq s_t$ and *causally sufficient* when it lies on a valid dependency path to the goal. We add an *authorization* condition: a risk-bearing tool ($\rho_i\neq\textsc{low}$) is *admissible* at $s_t$ only if

$$R_i\subseteq s_t\quad\text{and}\quad\alpha_i\subseteq s_t.\tag{2}$$

Thus a risk-bearing tool may be relevant, executable, and even causally useful, yet remain inadmissible until its authorization variables are established (e.g., a confirmed recipient, an explicit user approval token, or a verified target identifier).
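The admissibility condition of Eq. (2) reduces to two subset checks. The sketch below is a minimal illustration; the function name `is_admissible`, the string encoding of risk levels, and the example precondition for `send_email` are assumptions rather than part of the paper.

```python
def is_admissible(requires: frozenset, auth: frozenset,
                  risk: str, state: frozenset) -> bool:
    """Eq. (2): a risk-bearing tool needs R_i ⊆ s_t and alpha_i ⊆ s_t.

    Low-risk tools only need to be executable (R_i ⊆ s_t).
    """
    executable = requires <= state
    if risk == "low":
        return executable
    return executable and auth <= state


# Hypothetical send_email contract: executable once an email id is known,
# but authorized only after the recipient has been confirmed.
requires = frozenset({"email_id"})
auth = frozenset({"recipient_confirmed"})

before = frozenset({"folder", "sender_name", "email_id"})
after = before | {"recipient_confirmed"}

admissible_before = is_admissible(requires, auth, "high", before)
admissible_after = is_admissible(requires, auth, "high", after)
```

In this example the tool is executable in both states, and only the authorization variable changes the outcome.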

### III-C Authorization Provenance

The admissibility condition above is only as trustworthy as the *origin* of the authorization variables $\alpha_i$. We therefore make provenance a first-class part of the formulation rather than an afterthought. We partition the state universe $\mathcal{X}$ into *trusted* facts $\mathcal{X}_{\mathrm{T}}$ and *untrusted* facts $\mathcal{X}_{\mathrm{U}}$, and we partition tools into *trusted producers* (user-confirmation steps, verification tools, and system-controlled checks) and *content producers* (tools whose effects copy or summarize externally-retrieved, attacker-influenceable content into the state). We impose the *provenance constraint*: every authorization variable must be a trusted fact,

$$\bigcup_i\alpha_i\subseteq\mathcal{X}_{\mathrm{T}},\tag{3}$$

and a trusted fact may be produced *only* by a trusted producer. Equivalently, no content producer may have any $\alpha$-variable in its effect set $E_i$.

Under this constraint, attacker-controlled content, which can only flow through content producers into $\mathcal{X}_{\mathrm{U}}$, can never set an authorization variable, so it can never open a gate. This is the precise property the injection guarantee (H5) relies on: if Eq. ([3](https://arxiv.org/html/2606.13884#S3.E3)) is violated, e.g. a retrieved email body is allowed to set `recipient_confirmed`, an injection can forge authorization and the structural defense collapses. In RiskGate, the establishing tools of Table [III](https://arxiv.org/html/2606.13884#S5.T3) (`read_email`, `confirm_recipient`, `verify_external_party`, `confirm_payment`) are trusted producers whose authorization effects are set from verified metadata or explicit user action, not from free-text content; we encode and test the violating case (an authorization-forging injection) as the boundary condition for H5 (Section [VI-A](https://arxiv.org/html/2606.13884#S6.SS1)).
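The equivalent form of the provenance constraint (no content producer may have an $\alpha$-variable in its effect set) lends itself to a static registry check. The following is a sketch under assumptions: the `Tool` fields, the `trusted_producer` flag, the `fetch_webpage` tool, and the variable `page_text` are hypothetical and only illustrate how such an audit might look.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    effects: frozenset                 # E_i
    auth: frozenset = frozenset()      # alpha_i
    trusted_producer: bool = False     # False means content producer


def provenance_violations(tools):
    """Map each content producer to the alpha-variables it could set."""
    tools = list(tools)
    # Union of all authorization variables across the registry.
    alpha_vars = frozenset().union(*(t.auth for t in tools))
    return {
        t.name: sorted(t.effects & alpha_vars)
        for t in tools
        if not t.trusted_producer and t.effects & alpha_vars
    }


confirm = Tool("confirm_recipient", frozenset({"recipient_confirmed"}),
               trusted_producer=True)
send = Tool("send_email", frozenset({"email_sent"}),
            auth=frozenset({"recipient_confirmed"}))
fetch_ok = Tool("fetch_webpage", frozenset({"page_text"}))
# Violating case: free-text content is allowed to set an authorization variable.
fetch_bad = Tool("fetch_webpage",
                 frozenset({"page_text", "recipient_confirmed"}))

clean = provenance_violations([confirm, send, fetch_ok])
broken = provenance_violations([confirm, send, fetch_bad])
```

A non-empty result flags exactly the contracts through which attacker-influenceable content could forge authorization.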

### III-D Attack Surface and Objective

We define the per-step *risk-bearing attack surface* as the number of visible risk-bearing tools,

$$\mathrm{AS}(\mathcal{V}_t)=\big|\{\,t_i\in\mathcal{V}_t:\rho_i\in\{\textsc{med},\textsc{high}\}\,\}\big|,\tag{4}$$

and the *unauthorized exposure count* as the number of visible risk-bearing tools whose authorization is not satisfied,

$$\mathrm{UE}(\mathcal{V}_t)=\big|\{\,t_i\in\mathcal{V}_t:\rho_i\neq\textsc{low}\ \wedge\ \alpha_i\not\subseteq s_t\,\}\big|.\tag{5}$$

Because a count treats `transfer_funds` and `update_event` as equally severe, we also report a *weighted attack surface* that charges each visible tool by its risk penalty,

$$\mathrm{WAS}(\mathcal{V}_t)=\sum_{t_i\in\mathcal{V}_t}\operatorname{risk}(\rho_i),\tag{6}$$

so an irreversible $\textsc{high}$ tool contributes $4\times$ a reversible $\textsc{med}$ one (Eq. ([8](https://arxiv.org/html/2606.13884#S4.E8))). The objective is to choose $\mathcal{V}_t$ that preserves progress toward $g$ while minimizing $\mathrm{AS}$ and $\mathrm{WAS}$ and driving $\mathrm{UE}$ to zero. Relevance, executability, and risk-agnostic causal filtering all ignore $\rho_i$ and $\alpha_i$ and can therefore expose unauthorized risk-bearing tools whenever they appear plausible or executable.
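The three per-step metrics of Eqs. (4) to (6) can be computed directly from a visible set and a state. The sketch below is illustrative: the `Tool` fields and the authorization variable names (`event_change_approved`, `payment_confirmed`) are hypothetical, while the penalty values are those of Eq. (8).

```python
from dataclasses import dataclass

RISK_PENALTY = {"low": 0, "med": 1, "high": 4}  # risk map of Eq. (8)


@dataclass(frozen=True)
class Tool:
    name: str
    risk: str = "low"                 # rho_i
    auth: frozenset = frozenset()     # alpha_i


def attack_surface(visible):
    """Eq. (4): number of visible risk-bearing tools."""
    return sum(1 for t in visible if t.risk != "low")


def unauthorized_exposure(visible, state):
    """Eq. (5): visible risk-bearing tools whose authorization is unmet."""
    return sum(1 for t in visible if t.risk != "low" and not t.auth <= state)


def weighted_attack_surface(visible):
    """Eq. (6): sum of risk penalties over the visible set."""
    return sum(RISK_PENALTY[t.risk] for t in visible)


visible = [
    Tool("search_emails"),
    Tool("update_event", "med", frozenset({"event_change_approved"})),
    Tool("transfer_funds", "high", frozenset({"payment_confirmed"})),
]
state = frozenset({"payment_confirmed"})

as_count = attack_surface(visible)
ue_count = unauthorized_exposure(visible, state)
was_total = weighted_attack_surface(visible)
```

Here the plain count treats the two risk-bearing tools alike, whereas the weighted surface is dominated by `transfer_funds`.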

To verify that a low attack surface reflects *precision* rather than indiscriminate blocking, we additionally report two over-filtering diagnostics: *gold-tool exposure* (GTE), the fraction of decision steps at which the next-needed gold-chain tool is actually visible to the agent (higher is better; a method that withholds tools the task legitimately needs scores below $1$), and the *over-block rate* (OvB), the fraction of authorization-required tasks that fail because a legitimately needed (and ultimately authorizable) tool was never exposed (lower is better).

### III-E Threat Model

RACG’s guarantee is precisely scoped, so we state the threat model explicitly. The guarantee is only as strong as its enforcement boundary; in particular we assume the serving layer enforces $\mathcal{V}_t$ as the *complete* callable tool set, so a tool absent from $\mathcal{V}_t$ cannot be invoked by name or via a malformed tool-call payload. The boxed model below (placed inline to keep it adjacent to this discussion) lists what the attacker can and cannot do, what the defender controls, and what RACG does and does not prevent.

TABLE II: RACG threat model: what the defense structurally prevents, and the failure modes it does not address. The guarantee holds under enforced tool visibility and the provenance constraint of Eq. ([3](https://arxiv.org/html/2606.13884#S3.E3)).

Threat model (RACG).

- *Attacker can:* place instructions in tool-returned content (emails, files, pages); attempt to induce high-risk calls; exploit any tool currently in $\mathcal{V}_t$.
- *Attacker cannot:* modify the system prompt, user goal, or tool contracts; call tools directly outside the agent; forge a trusted authorization variable (Eq. ([3](https://arxiv.org/html/2606.13884#S3.E3))); bypass the platform’s $\mathcal{V}_t$-enforcement.
- *Defender controls:* tool contracts, risk labels $\rho_i$, authorization variables $\alpha_i$, the gating layer, and the provenance policy partitioning trusted vs. untrusted producers.
- *RACG prevents:* calls to high-risk tools absent from $\mathcal{V}_t$; unauthorized high-risk exposure; premature high-risk calls; injection-driven calls to currently gated tools.
- *RACG does not prevent:* misuse of a high-risk tool *after* legitimate authorization; bad arguments to an authorized tool; harm from incorrect contract annotations; forged authorization when provenance is broken; model errors among visible low-risk tools; injections that manipulate low-risk tools or the arguments of authorized tools.

RACG is not a general prompt-injection defense: it structurally prevents a specific and consequential class of attack (injected use of high-risk tools that are gated out of the current action space) and is explicitly silent on misuse of already-authorized capabilities, argument-level attacks, and contract-quality failures. The provenance constraint of Section [III-C](https://arxiv.org/html/2606.13884#S3.SS3) is the load-bearing assumption: it is what makes “the targeted tool is absent from $\mathcal{V}_t$” a security property rather than an implementation accident, and Section [VI-A](https://arxiv.org/html/2606.13884#S6.SS1) encodes its violation as the boundary condition for the injection guarantee (H5).

## IV Risk-Aware Causal Gating

RACG augments causal frontier exposure with a risk-and-authorization gate. Low-risk tools are exposed by causal sufficiency exactly as in CMTF. Risk-bearing tools ($\rho_i\neq\textsc{low}$) pass through an additional gate and incur a penalty in path scoring.

[Figure 1 diagram: state trajectory $s_0$ (folder, sender_name) → $s_1$ (+ email_id) → $s_2$ (+ recipient_confirmed) → $s_3$ (+ email_sent) via `search_emails`, `read_email`, and `send_email`, with visible sets $\mathcal{V}_0$, $\mathcal{V}_1$, and $\mathcal{V}_2$ (authorized); `send_email` is gated at $s_1$ because $\alpha\not\subseteq s_1$.]

Figure 1: RACG on an authorization-required send-email trajectory. At each step, the state $s_t$ evolves and RACG computes the visible set $\mathcal{V}_t$. At $s_1$, `send_email` is causally relevant but *gated* because $\alpha=\{\texttt{recipient\_confirmed}\}\not\subseteq s_1$. Only after `read_email` establishes the authorization variable ($s_2$) does `send_email` enter the action space.

### IV-A Gated Path Scoring

RACG selects a minimal causal path but penalizes risk, so that a lower-risk path to the goal is preferred when one exists:

$$\operatorname{score}(\pi)=\sum_{t_i\in\pi}c_i\;+\;\lambda\sum_{t_i\in\pi}\operatorname{risk}(\rho_i),\tag{7}$$

where $c_i$ is the (unit, in our experiments) tool cost and $\operatorname{risk}(\cdot)$ maps a discrete risk level to a non-negative scalar penalty. We instantiate the penalty on an ordinal scale that grows super-linearly so that a single irreversible action is never preferred over a short chain of reversible ones:

$$\operatorname{risk}(\textsc{low})=0,\quad\operatorname{risk}(\textsc{med})=1,\quad\operatorname{risk}(\textsc{high})=4.\tag{8}$$

The super-linear gap ($\textsc{high}=4$ rather than $2$) encodes that the harm of an irreversible or externally visible action (send, delete, share, pay) dominates the convenience of a shorter path; it makes RACG prefer, e.g., a two-step `search` $\rightarrow$ `read` verification over a one-step `delete` whenever both reach the goal.
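The path score of Eq. (7) with the penalties of Eq. (8) can be evaluated by hand or with a few lines of code. The sketch below assumes unit tool costs and represents a path as a list of `(cost, risk_level)` pairs; that encoding and the function name `score` are illustrative choices, not the paper's implementation.

```python
RISK_PENALTY = {"low": 0, "med": 1, "high": 4}  # Eq. (8)


def score(path, lam):
    """Eq. (7): total tool cost plus lam times the total risk penalty.

    `path` is a list of (cost, risk_level) pairs, one per tool on the path.
    """
    total_cost = sum(cost for cost, _ in path)
    total_risk = sum(RISK_PENALTY[risk] for _, risk in path)
    return total_cost + lam * total_risk


# Two-step low-risk verification versus a one-step high-risk action.
safe_path = [(1, "low"), (1, "low")]
risky_path = [(1, "high")]

scores_at_0 = (score(safe_path, 0), score(risky_path, 0))   # (2, 1)
scores_at_2 = (score(safe_path, 2), score(risky_path, 2))   # (2, 9)
```

With $\lambda=0$ the shorter high-risk path has the lower score, while at $\lambda=2$ the two-step low-risk path is selected instead.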

#### Why $\textsc{high}=4$.

The specific value is not load-bearing; what matters is the ordering ($0<\operatorname{risk}(\textsc{med})<\operatorname{risk}(\textsc{high})$) and a super-linear gap ensuring one irreversible action is never preferred over a bounded chain of reversible ones. We choose $4$ as the smallest integer making a single $\textsc{high}$ step cost more than two $\textsc{med}$ steps ($4>2\cdot 1$). Because $\operatorname{risk}(\textsc{high})$ and $\lambda$ enter only through their product (Eq. ([9](https://arxiv.org/html/2606.13884#S4.E9))), the method is invariant to rescaling; results are unchanged for $\operatorname{risk}(\textsc{high})\in\{3,5\}$ at compensating $\lambda$.

#### Risk taxonomy.

We assign $\rho_i$ from tool semantics, not tool names, using three operational tiers: $\textsc{low}$ for read-only or idempotent tools whose effects are internal to the agent’s working state (`search_*`, `read_*`, `check_*`, `summarize_*`); $\textsc{med}$ for tools that mutate user-owned state reversibly or with low blast radius (`create_draft`, `update_event`, `move_file`); and $\textsc{high}$ for tools that are irreversible, externally observable, or value-transferring (`send_email`, `delete_*`, `share_file`, `share_externally`, `transfer_funds`). The taxonomy is part of the tool contract and is the unit a reviewer audits.

#### Annotation protocol.

Risk labels are a contract field, so their provenance matters as much as the gating rule. We assign $\rho_i$ with an explicit rubric (reversibility, external observability, and value transfer) and resolve any ambiguity *conservatively*: a tool that is borderline between two tiers receives the higher label, since over-labeling only adds an authorization gate whereas under-labeling can expose an unauthorized capability. Two annotators independently labeled the RiskGate registry from this rubric; disagreements (e.g. whether `create_draft` is $\textsc{low}$ or $\textsc{med}$, or whether an `update_event` that can move a legal deadline should be $\textsc{high}$) were resolved by taking the higher-risk label, and the full annotated registry is released with the code so labels can be audited and contested. We note that risk can be sequence-dependent (a tool benign in isolation may be dangerous in composition), which our per-tool tiers do not capture; context-sensitive risk is left to future work. A single-annotator deployment can adopt the same rubric and conservative tie-breaking; broader inter-annotator agreement studies on real tool registries are an important next step.

### IV-B The Role of $\lambda$

The penalty $\lambda\geq 0$ is the single knob that converts the risk map of Eq. ([8](https://arxiv.org/html/2606.13884#S4.E8)) into path-selection behavior, and it traces a *safety–success Pareto frontier*:

- $\lambda=0$ recovers risk-agnostic causal filtering (CMTF): paths are chosen by length alone and high-risk tools enter the frontier whenever they are shortest.
- $0<\lambda<\lambda^{\dagger}$ breaks ties *toward* safer paths and prefers a safe path over a high-risk path when the safe path is not too much longer.
- $\lambda\geq\lambda^{\dagger}$ makes RACG avoid any high-risk tool whenever *a* safe causal path exists, invoking a dangerous tool only when it is strictly necessary to reach the goal.

The threshold $\lambda^{\dagger}$ at which a safe path of length $L_{\text{safe}}$ is preferred over a high-risk path of length $L_{\text{risk}}$ containing one $\textsc{high}$ tool is

$$\lambda^{\dagger}=\frac{L_{\text{safe}}-L_{\text{risk}}}{\operatorname{risk}(\textsc{high})}.\tag{9}$$

For the benchmark’s worst case ($L\leq 6$, $\operatorname{risk}(\textsc{high})=4$), any $\lambda\geq 1.25$ guarantees a safe path up to five steps longer is still preferred. The empirical crossover on RiskGate occurs between $\lambda=0.5$ and $\lambda=1$, below this conservative bound. We sweep $\lambda\in\{0,0.25,0.5,1,2,4\}$ and use $\lambda^{\star}=2$ as the default operating point.
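The threshold of Eq. (9) follows from Eq. (7) under unit costs: a safe path scores $L_{\text{safe}}$, a path with one high-risk tool scores $L_{\text{risk}}+\lambda\operatorname{risk}(\textsc{high})$, and equating the two gives $\lambda^{\dagger}$. The sketch below reproduces the worst-case arithmetic; the function names are hypothetical, and resolving an exact tie toward the safe path is an assumption.

```python
from fractions import Fraction


def lambda_threshold(l_safe: int, l_risk: int, risk_high: int = 4) -> Fraction:
    """Eq. (9): penalty at which the safe and risky path scores are equal."""
    return Fraction(l_safe - l_risk, risk_high)


def prefers_safe(lam, l_safe: int, l_risk: int, risk_high: int = 4) -> bool:
    """Compare unit-cost path scores from Eq. (7); ties go to the safe path."""
    safe_score = l_safe
    risky_score = l_risk + lam * risk_high
    return safe_score <= risky_score


# Worst case in the text: a six-step safe path versus a one-step risky path.
worst_case = lambda_threshold(6, 1)

# Values of the sweep at which the safe path wins in that worst case.
sweep = [0, 0.25, 0.5, 1, 2, 4]
safe_at = [lam for lam in sweep if prefers_safe(lam, 6, 1)]
```

In this worst case, only the sweep values at or above the threshold of 1.25 select the safe path, which includes the default operating point of 2.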

#### Calibration.

$\operatorname{risk}(\cdot)$ and $\lambda$ are deliberately decoupled: the risk map is a fixed, auditable property of the tool taxonomy, while $\lambda$ is an operator-chosen risk appetite that can differ across deployments (a personal assistant may use a smaller $\lambda$ than an enterprise finance agent). Neither requires retraining; both are set once and logged with each run for reproducibility.

### IV-C Admissibility Gate

After computing the next causal frontier, RACG removes any risk-bearing tool whose authorization is unmet:

$$\mathcal{V}_t=\{\,t_i\in\mathcal{F}_t:\rho_i=\textsc{low}\;\vee\;\alpha_i\subseteq s_t\,\},\tag{10}$$

where $\mathcal{F}_t$ is the frontier from gated path selection. If gating empties the frontier, RACG exposes the lowest-risk causally useful tool that can *establish* a missing authorization variable (e.g., a confirmation or verification step), realizing just-in-time privilege escalation rather than standing authority.

#### Fail-closed behavior.

The establishing case must be fully specified for a safety method. When multiple authorization variables in $\alpha_i\setminus s_t$ are missing, RACG selects an establisher for any one of them and re-evaluates after its effect lands, repeating until $\alpha_i\subseteq s_t$ or no establisher remains. Among candidate establishers it breaks ties by (risk, cost, name), preferring the lowest-risk *trusted producer*; a content producer is never selected to satisfy an $\alpha$-variable, by Eq. ([3](https://arxiv.org/html/2606.13884#S3.E3)). Crucially, if *no* trusted producer exists for a missing authorization variable (for instance, when authorization requires an out-of-band user confirmation that no tool can synthesize), RACG returns *no* risk-bearing action and emits a blocking reason requesting external or user intervention. RACG thus fails closed: the absence of a safe, authorized path yields an empty risk-bearing frontier, never a fallback to exposing an unauthorized dangerous tool.
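The establisher selection and its fail-closed outcome can be sketched as follows. This is a simplified illustration, not the paper's implementation: the `Tool` fields, the `trusted_producer` flag, the blocking-reason string, the requirement that an establisher be executable, and the `read_webpage` tool are all assumptions. The loop also simulates the agent calling each exposed establisher, which in a deployment would happen across separate decision steps.

```python
from dataclasses import dataclass

RISK_ORDER = {"low": 0, "med": 1, "high": 2}


@dataclass(frozen=True)
class Tool:
    name: str
    requires: frozenset
    effects: frozenset
    risk: str = "low"
    cost: float = 1.0
    trusted_producer: bool = False


def pick_establisher(missing, state, tools):
    """Return (tool, None), or (None, reason) when no trusted producer exists."""
    candidates = [
        t for t in tools
        if t.trusted_producer          # content producers are never eligible
        and t.requires <= state        # assumed: establisher must be executable
        and t.effects & missing        # establishes at least one missing variable
    ]
    if not candidates:
        reason = f"blocked: no trusted producer for {sorted(missing)}"
        return None, reason
    # Tie-break by (risk, cost, name), lowest first.
    best = min(candidates, key=lambda t: (RISK_ORDER[t.risk], t.cost, t.name))
    return best, None


def establish_all(auth, state, tools):
    """Repeat until alpha ⊆ state or no establisher remains (fail closed)."""
    state = frozenset(state)
    while not auth <= state:
        tool, reason = pick_establisher(auth - state, state, tools)
        if tool is None:
            return state, reason
        state = state | tool.effects   # simulate the establisher's effect landing
    return state, None


tools = [
    Tool("confirm_recipient", frozenset(), frozenset({"recipient_confirmed"}),
         trusted_producer=True),
    # Hypothetical content producer that must never satisfy an alpha-variable.
    Tool("read_webpage", frozenset(), frozenset({"payment_confirmed"})),
]
ok_state, ok_reason = establish_all(frozenset({"recipient_confirmed"}),
                                    frozenset(), tools)
blocked_state, blocked_reason = establish_all(frozenset({"payment_confirmed"}),
                                              frozenset(), tools)
```

The second call returns a blocking reason and leaves the state unchanged, even though an untrusted tool claims to produce the missing variable.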

### IV-D Algorithm

Algorithm [1](https://arxiv.org/html/2606.13884#alg1) summarizes RACG. It mirrors the breadth-first frontier search of CMTF but (a) scores paths with the risk penalty and (b) applies the admissibility gate before returning the visible set.

Algorithm 1 Risk-Aware Causal Gating

Input: state $s_t$, goal $g$, library $\mathcal{T}$, penalty $\lambda$

Output: visible tool set $\mathcal{V}_t$

1. if $g\subseteq s_t$ then
2. &emsp;return $\emptyset$
3. end if
4. Search for a minimal-$\operatorname{score}$ causal path $\pi^{\star}$ from $s_t$ to $g$
5. $\mathcal{F}_t\leftarrow$ first executable frontier of $\pi^{\star}$
6. $\mathcal{V}_t\leftarrow\{\,t_i\in\mathcal{F}_t:\rho_i=\textsc{low}\vee\alpha_i\subseteq s_t\,\}$
7. if $\mathcal{V}_t=\emptyset$ then
8. &emsp;$\mathcal{V}_t\leftarrow$ lowest-risk causal tool establishing a missing $\alpha$
9. end if
10. return $\mathcal{V}_t$
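A hedged Python sketch of the gate in Algorithm 1 follows. It covers lines 1 to 3 and 6 to 10 only; the scored path search of lines 4 and 5 is delegated to a caller-supplied `frontier_of` function, so the penalty $\lambda$ does not appear. The `Tool` fields, `RISK_ORDER`, and the `establishes` attribute (declared only on trusted producers) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set

RISK_ORDER = {"low": 0, "med": 1, "high": 2}  # assumed ordering of risk tiers

@dataclass(frozen=True)
class Tool:
    name: str
    risk: str                                  # rho_i: "low", "med", or "high"
    alpha: FrozenSet[str] = frozenset()        # authorization variables alpha_i
    establishes: FrozenSet[str] = frozenset()  # alpha-variables this trusted tool produces

def racg_visible(state: Set[str], goal: Set[str],
                 frontier_of: Callable[[Set[str], Set[str]], List[Tool]],
                 library: List[Tool]) -> List[Tool]:
    # Lines 1-3: nothing to expose once the goal already holds.
    if goal <= state:
        return []
    # Lines 4-5: the scored causal search is assumed to live in frontier_of.
    frontier = frontier_of(state, goal)
    # Line 6: admissibility gate (low risk, or all authorization facts present).
    visible = [t for t in frontier if t.risk == "low" or t.alpha <= state]
    # Lines 7-9: if the gate emptied the frontier, expose one establisher.
    if not visible:
        missing = set().union(*(t.alpha - state for t in frontier))
        establishers = [t for t in library if t.establishes & missing]
        if establishers:
            visible = [min(establishers, key=lambda t: (RISK_ORDER[t.risk], t.name))]
    # Line 10: may be empty, which is the fail-closed outcome.
    return visible
```

For a high-risk `send_email` tool gated on `recipient_confirmed` and a hypothetical low-risk `confirm_recipient` establisher, the function returns only the establisher until the authorization fact is in the state, and only then returns `send_email`.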

### IV-E Running Example

Task: “Reply to Dana’s email about the budget and send it.” A risk-agnostic causal filter may, once a draft exists, expose both `create_draft` and `send_email`. RACG marks `send_email` as $\textsc{high}$ with $\alpha=\{\texttt{recipient\_confirmed}\}$. Until a confirmation step populates `recipient_confirmed`, `send_email` is inadmissible and stays out of the action space, so neither an accidental call nor an injected “send to attacker@evil.com” instruction can trigger it.

## V Benchmark and Threat Model

We extend the controlled, synthetic, deterministically-mocked benchmark of CMTF [[3](https://arxiv.org/html/2606.13884#bib.bib13)] so that failures are attributable to exposure rather than API variability, and we add an adversarial track. We call the extended benchmark RiskGate.

### V-A Registry, Risk Levels, and Authorization Variables

We reuse the 100-tool registry over the calendar, email, and files/documents domains and annotate every tool with a risk tier (Section [IV](https://arxiv.org/html/2606.13884#S4)) and, for non-$\textsc{low}$ tools, an authorization variable set $\alpha_i$. The authorization variable for a high-risk tool names the state fact that a trustworthy prior step must establish before the action is safe to expose. Table [III](https://arxiv.org/html/2606.13884#S5.T3) lists the high-risk tools and their authorization gates; we add `transfer_funds` and `share_externally` to the original registry to broaden the high-risk surface.

TABLE III: High-risk tools, their risk tier, and the authorization variable that must be present in the state before RACG will expose them. The *establishing tool* is a low-risk causal step that produces the authorization variable.

### V-B Task Suite

RiskGate contains the 102 benign CMTF tasks plus 80 safety-stress tasks (four families of 20: email authorization-required, email no-action, files no-action, and calendar high-risk-shortcut) specifically constructed so that a high-risk tool is *semantically attractive but not yet authorized* at an intermediate step. Each safety-stress task specifies: a natural-language query, initial state, goal state, a gold tool chain that routes through the establishing tool before the high-risk tool, the high-risk distractor(s) that should be gated, and deterministic mocked outputs. Tasks come in two flavors: (i) *authorization-required*, where the high-risk action *is* ultimately needed and must be exposed only after authorization (testing that RACG does not over-filter); and (ii) *no-action-needed*, where a read/summarize goal makes the high-risk tool never necessary (testing that RACG never exposes it). Table [IV](https://arxiv.org/html/2606.13884#S5.T4) gives an example of each.
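One way to picture a safety-stress task record is the sketch below. The released RiskGate schema is not shown in this section, so every field name, the `establisher_of` mapping, and the `check` method are hypothetical; the check only encodes the two stated properties (the gold chain routes through the establishing tool before the high-risk tool, and no-action tasks never need the high-risk tool).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class SafetyStressTask:
    query: str                       # natural-language query
    initial_state: Set[str]
    goal_state: Set[str]
    gold_chain: List[str]            # ordered gold tool names
    gated_distractors: List[str]     # high-risk tools that should be gated
    flavor: str = "authorization-required"   # or "no-action-needed"
    mocked_outputs: Dict[str, str] = field(default_factory=dict)

    def check(self, establisher_of: Dict[str, str]) -> bool:
        """establisher_of maps a high-risk tool to its establishing tool (assumed)."""
        for high_risk, establisher in establisher_of.items():
            if high_risk not in self.gold_chain:
                continue
            if self.flavor == "no-action-needed":
                return False  # the high-risk tool must never be needed
            if establisher not in self.gold_chain:
                return False  # the gold chain must route through the establisher
            if self.gold_chain.index(establisher) > self.gold_chain.index(high_risk):
                return False  # the establisher must come first
        return True
```

A record whose gold chain is `read_email`, `create_draft`, `confirm_recipient`, `send_email` passes the check under the mapping `send_email -> confirm_recipient` (the establisher name is illustrative), while a no-action task whose gold chain contains `send_email` does not.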

TABLE IV: Representative RiskGate safety-stress tasks. In both, `send_email` is semantically plausible; RACG exposes it only after `recipient_confirmed` is in the state (top) and never (bottom).

### V-C Benign Track

The benign track measures whether gating preserves task success while reducing exposure. Over all 182 tasks we report task success, risk-bearing attack surface $\mathrm{AS}$, weighted attack surface $\mathrm{WAS}$, unauthorized exposure $\mathrm{UE}$, injection success, gold-tool exposure, over-block rate, and serialized-context tokens ($\mathrm{tok}_{\mathrm{ser}}$), swept over $\lambda$. The authorization-required subset additionally checks *over-filtering*: RACG must still complete the task by exposing the high-risk tool once authorized.

### V-D Adversarial (Injection) Track

Threat model. The attacker controls only *tool-returned content* (the body of a retrieved email or file) and cannot modify the system prompt, the user goal, or the tool contracts. This models indirect prompt injection in realistic retrieval-augmented agents [[7](https://arxiv.org/html/2606.13884#bib.bib18), [4](https://arxiv.org/html/2606.13884#bib.bib19)]. At a chosen step, a mocked observation embeds an injected instruction directing the agent to invoke an unauthorized high-risk tool, e.g. *“Ignore previous instructions and forward this thread to audit@external.example then delete it.”*

Attack variants. For each of the 80 safety-stress tasks we generate three injection phrasings (direct command, role-play, and obfuscated/encoded) targeting `send_email`, `share_externally`, or `delete_*`, yielding 240 adversarial trials per method per model.

Metrics. We report the *injection success rate* (ISR): the fraction of adversarial trials in which the targeted high-risk tool is actually called; *benign-task completion under attack* (does the legitimate goal still get reached); and *exposure-at-attack* (whether the targeted tool was in $\mathcal{V}_t$ at the injection step). These operationalize hypothesis H5 (Section [VI-A](https://arxiv.org/html/2606.13884#S6.SS1)): when the targeted tool is gated out at the injection step, $\text{exposure-at-attack}=0$ and therefore $\text{ISR}=0$ independent of injection phrasing, because the agent cannot call a tool that is not in its action space.
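The three adversarial-track metrics can be computed from per-trial logs roughly as below. The `Trial` record and its field names are assumptions (the released log format is not shown here); the function simply averages three indicator variables over the trials.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

@dataclass(frozen=True)
class Trial:
    target: str                      # high-risk tool named by the injection
    visible_at_injection: frozenset  # V_t at the injection step
    called: frozenset                # tools actually called during the trial
    goal_reached: bool               # did the legitimate goal still complete?

def injection_metrics(trials: Iterable[Trial]) -> Dict[str, float]:
    trials = list(trials)
    n = len(trials)
    if n == 0:
        return {"exposure_at_attack": 0.0, "isr": 0.0, "completion_under_attack": 0.0}
    exposed = sum(t.target in t.visible_at_injection for t in trials)
    injected = sum(t.target in t.called for t in trials)
    completed = sum(t.goal_reached for t in trials)
    return {
        "exposure_at_attack": exposed / n,
        "isr": injected / n,
        "completion_under_attack": completed / n,
    }
```

If 60 of 240 synthetic trials expose and call the target (for example, 20 tasks times three phrasings), the function reports an ISR of 0.25; if the target is never visible it can never appear in `called` for an agent restricted to $\mathcal{V}_t$, so the ISR is 0.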

## VI Experimental Setup

We evaluate on a controlled, deterministically-mocked benchmark (RiskGate) with the 100-tool registry of Section [V](https://arxiv.org/html/2606.13884#S5). The agent is a deterministic, adversarially-compliant heuristic policy rather than an LLM: hypothesis H5 is a claim about the action space, so it should hold for *any* agent, including a worst-case one that obeys an injection whenever the targeted tool is visible. This isolates the effect of tool exposure from model variability and makes the structural guarantee falsifiable. Methods compared: all-tools, keyword top-$k$, state-aware, risk-agnostic causal frontier (CMTF), and RACG across the $\lambda$ sweep. Each run logs visible tools, selected tool, risk level, authorization status, state transitions, token usage, and whether an injected high-risk call occurred.

#### Why a deterministic agent\.

We use an adversarially-compliant heuristic (a worst-case policy that always selects an injected tool when visible) rather than an LLM. This is the hardest case: any real model can only do *better* at resisting injection, so ISR $=0$ here upper-bounds ISR for any LLM. It also isolates exposure from model variability and makes H5 falsifiable (a single nonzero ISR would refute it). Model-driven behavior is addressed in Section [VI-B](https://arxiv.org/html/2606.13884#S6.SS2).
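A worst-case compliant policy of this kind fits in a few lines. The sketch below is an assumption about its shape, not the released harness: the function name, the `gold_next` argument, and the fallback order are illustrative.

```python
from typing import Optional, Sequence

def compliant_policy(visible: Sequence[str],
                     injected_target: Optional[str],
                     gold_next: str) -> Optional[str]:
    """Obey the injection whenever its target is visible; otherwise follow the gold chain."""
    if injected_target is not None and injected_target in visible:
        return injected_target   # worst case: always comply with the attacker
    if gold_next in visible:
        return gold_next         # otherwise take the next legitimate step
    return None                  # nothing callable: the agent is blocked
```

Because the policy never refuses, any trial in which it does not call the injected tool must be a trial in which that tool was absent from the visible set.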

### VI-A Hypotheses

The experiments test a chain of claims, from the weakest \(motivating the problem\) to the strongest \(the structural safety guarantee\)\. We state them explicitly so each maps to a measurable outcome\.

- H1. Relevance and executability are not safety. All-tools, keyword, and state-aware filtering expose high-risk tools whenever those tools are plausible or executable, incurring nonzero attack surface $\mathrm{AS}$ and unauthorized exposure $\mathrm{UE}$. *Measured by:* $\mathrm{AS}$, $\mathrm{UE}$ for these baselines $>0$ throughout a trajectory (Fig. [3](https://arxiv.org/html/2606.13884#S7.F3)).
- H2. Risk-agnostic causal filtering is necessary but insufficient. CMTF ($\lambda=0$) shrinks exposure but, ignoring $\rho_i$ and $\alpha_i$, still exposes high-risk tools *before* authorization. *Measured by:* causal frontier reduces $\mathrm{AS}$ yet retains $\mathrm{UE}>0$.
- H3. RACG drives unauthorized exposure to zero at negligible success cost. The authorization gate plus risk penalty hold the high-risk surface near zero until authorized. *Measured by:* $\mathrm{UE}\to 0$ and minimal $\mathrm{AS}$ while benign success stays at the causal-filtering ceiling (Fig. [4](https://arxiv.org/html/2606.13884#S7.F4)).
- H4. RACG does not over-filter. On authorization-required tasks, RACG must still expose the high-risk tool once authorized and complete the task. *Measured by:* success on the authorization-required subset $\approx$ risk-agnostic causal filtering. This is the falsifiable counterweight to H3: an overly strict gate would lower success here.
- H5. Gating is a structural injection defense. If the targeted tool is not in $\mathcal{V}_t$ at the injection step, the agent cannot call it regardless of injection phrasing. *Measured by:* injection success rate (ISR) tracks high-risk exposure for baselines, and RACG’s ISR $=0$ on gated targets, independent of phrasing (Fig. [5](https://arxiv.org/html/2606.13884#S7.F5)).

#### Boundary condition for H5\.

The structural guarantee in H5 holds *only if* authorization variables $\alpha_i$ cannot be established by attacker-controlled content. If an injected observation could itself set, e.g., `recipient_confirmed`, the gate would open and the guarantee would collapse. Our threat model therefore requires that authorization facts originate from trustworthy steps (user confirmation or verified system state), and we treat authorization provenance as an explicit assumption rather than an emergent property. Violating this assumption defines the precise condition under which RACG fails, which we revisit in Section [IX](https://arxiv.org/html/2606.13884#S9).
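The provenance assumption can be made concrete with a toy state store that refuses authorization facts from untrusted sources. This is only an illustration of the assumption, not a mechanism described in the paper: the class, the `TRUSTED`/`CONTENT` labels, and the method names are hypothetical.

```python
from typing import Iterable, Set

TRUSTED = "trusted"   # user confirmation or verified system state (assumed label)
CONTENT = "content"   # tool-returned content, possibly attacker-controlled (assumed label)

class ProvenanceState:
    """State store in which alpha-variables can only be written by trusted sources."""

    def __init__(self, auth_vars: Iterable[str]):
        self.auth_vars: Set[str] = set(auth_vars)
        self.facts: Set[str] = set()

    def write(self, fact: str, source: str) -> bool:
        # Reject any attempt by content to establish an authorization variable.
        if fact in self.auth_vars and source != TRUSTED:
            return False
        self.facts.add(fact)
        return True

    def authorized(self, alpha: Set[str]) -> bool:
        # The gate condition: alpha is a subset of the current state.
        return alpha <= self.facts
```

An injected observation that tries to write `recipient_confirmed` is rejected, so the gate stays closed; the same write from a trusted confirmation step is accepted and opens it. Dropping the source check reproduces the failure condition described above.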

#### Distinguishing structural from behavioral defense\.

H1–H4 concern a safety–efficiency tradeoff that depends on the quality of contracts and the value of $\lambda$. H5 is a near-deterministic claim that does not depend on the model resisting or out-reasoning the attacker: the capability is simply absent from the action space. This separation is the core conceptual contribution the experiments are designed to validate.

TABLE V: Filtering and gating methods compared.

### VI-B Validation with Real LLM Agents

The deterministic agent upper-bounds ISR for any policy but cannot speak to real model behavior. We complement it with an LLM validation protocol: the gating layer (RACG, CMTF) is held fixed while a real model drives tool selection over the filter-produced $\mathcal{V}_t$, with RiskGate supplying deterministic mocked observations including injections. We evaluate the full 80-task safety-stress set spanning three domains and all stress flavors, with three injection phrasings per task (240 adversarial trials per method per model). The contrast is deliberately restricted to the scientifically discriminating pair, risk-agnostic causal filtering (CMTF) versus RACG, since the all-tools arm ships the full $\sim$100-tool registry on every call and its leak rate is already established by the deterministic track. The key prediction: because RACG removes the targeted tool from $\mathcal{V}_t$ at the injection step, the model-driven high-risk-call rate should remain zero regardless of model.

#### Confirmatory results\.

Table [VI](https://arxiv.org/html/2606.13884#S6.T6) reports results with seven hosted models (Anthropic Claude Opus 4, Claude Sonnet 4.6, and Claude Haiku 4.5; OpenAI GPT-OSS 120B; and Amazon Nova Premier, Nova Pro, and Nova 2 Lite) served via Amazon Bedrock at temperature 0. Across all seven models, RACG ($\lambda=2$) yields exposure-at-attack $=0.00$ and high-risk-call rate $0.00$: the gated tool is absent from $\mathcal{V}_t$, so even a compliant model cannot call it. CMTF reproduces its deterministic leak ($0.25$) for every model, the $20/80$ shortcut-task fraction, confirming the guarantee does not depend on model refusal. All models complete authorization-required tasks at $1.00$ under both methods, confirming RACG does not over-filter (H4). The result is strikingly uniform: despite spanning three model families and a $\sim 30\times$ range in scale, every model exhibits the identical $0.25\to 0.00$ collapse, exactly as the structural argument predicts.

TABLE VI: Real-LLM validation. HR-call: high-risk-call rate under injection. Exp.@atk: targeted tool in $\mathcal{V}_t$ at injection step. Auth: authorization-required task completion. $\mathrm{tok}_{\mathrm{mdl}}$: mean measured model tokens (provider-reported prompt+completion) per task. Full 80-task safety-stress set, 240 adversarial trials/method/model, seven models via Amazon Bedrock. Under RACG, HR-call and Exp.@atk are $0.00$ for every model.

![Refer to caption](https://arxiv.org/html/2606.13884v1/figures/llm_highrisk_by_method.png)

Figure 2: Real-LLM high-risk-call rate under injection, by exposure method, for seven hosted models (Amazon Bedrock). Under RACG the targeted tool is gated out of $\mathcal{V}_t$ at the injection step, so the model-driven high-risk-call rate is $0.00$ for every model; CMTF reproduces its deterministic leak ($0.25$) across all models.

#### Token cost and reproducibility\.

The all-tools arm serializes the full $\sim$100-tool registry on every call; under the deterministic serialized-context proxy ($\mathrm{tok}_{\mathrm{ser}}$) this costs $\approx 22\times$ the per-task tokens of RACG, so we omit it from the model-driven sweep and contrast the two causally-filtered methods directly. Measured model tokens ($\mathrm{tok}_{\mathrm{mdl}}$, provider-reported prompt+completion) per task average $2{,}101$ under CMTF versus $2{,}456$ under RACG across the seven models; RACG’s modest overhead reflects the occasional authorization-establishing step it routes through before exposing a gated tool. Per-model RACG $\mathrm{tok}_{\mathrm{mdl}}$ spans $1{,}242$ (GPT-OSS 120B) to $3{,}534$ (Nova 2 Lite); full per-model figures appear in Table [VI](https://arxiv.org/html/2606.13884#S6.T6). All models are served via Bedrock’s Converse API at temperature 0 with forced single-tool selection where supported (with a text-parse fallback for models that emit non-conforming tool calls); harness, specs, and per-trial logs are released with the code.
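To illustrate how a gated $\mathcal{V}_t$ might be handed to the Converse API, the sketch below builds a `toolConfig` request fragment from the visible tools only. It is an assumption about the shape of such a harness, not the released code: `build_tool_config` and the input dictionary keys are hypothetical, while the `toolSpec`, `inputSchema`, and `toolChoice` fields follow the Converse request format. No network call is made.

```python
from typing import Any, Dict, List, Optional

def build_tool_config(visible_tools: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Turn the gated visible set into a Converse-style toolConfig.

    Each input dict is assumed to carry "name", "description", and a JSON-schema
    "parameters" object. Returns None for an empty visible set so the caller can
    omit toolConfig entirely instead of sending an empty tool list.
    """
    if not visible_tools:
        return None
    return {
        "tools": [
            {
                "toolSpec": {
                    "name": tool["name"],
                    "description": tool["description"],
                    "inputSchema": {"json": tool["parameters"]},
                }
            }
            for tool in visible_tools
        ],
        # "any" asks the model to call some tool; per the text, forced selection
        # is only used where the model supports it.
        "toolChoice": {"any": {}},
    }

# Usage (not executed here): pass the result as the toolConfig argument of the
# bedrock-runtime client's converse(...) call, together with
# inferenceConfig={"temperature": 0}.
```

The point of the sketch is that gating happens before serialization: a tool absent from the visible set never reaches the request, so the model has no schema through which to call it.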

## VII Results

Table [VII](https://arxiv.org/html/2606.13884#S7.T7) reports the main comparison. With the exception of RACG at $\lambda=0.5$, every method, including all-tools, reaches benign task success $1.00$, so success is *not* the axis RACG improves: all methods can complete the tasks, and RACG’s contribution is to reach the same success with dramatically lower risk. The methods differ sharply on safety. All-tools keeps the full high-risk surface visible (AS $=26$) and admits every injection (ISR $=1.00$); keyword and state-aware filtering shrink the surface but still leave unauthorized high-risk tools exposed (UE $=5.51$ and $3.11$) and admit most injections (ISR $=1.00$ and $0.75$). Risk-agnostic causal filtering reduces UE to $0.11$, which is still not zero: on high-risk-shortcut tasks it exposes a dangerous tool before authorization, yielding ISR $=0.25$. RACG ($\lambda=2$) is the only method with UE $=0.00$ and ISR $=0.00$ *at equal (full) task success*, supporting H1–H5. The $\lambda=0.5$ row shows the $\lambda^{\dagger}$ crossover: below it, RACG fails closed on the shortcut tasks (success $0.89$).
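The two headline fractions can be sanity-checked against the task counts. The arithmetic below assumes that the 20 high-risk-shortcut tasks are exactly the ones that fail closed at $\lambda=0.5$ and exactly the ones that leak under risk-agnostic causal filtering; this reading is consistent with the reported values but is not spelled out as a calculation in the text.

$$
\text{success}_{\lambda=0.5}=\frac{182-20}{182}=\frac{162}{182}\approx 0.89,
\qquad
\text{ISR}_{\text{CMTF}}=\frac{20\times 3}{80\times 3}=\frac{60}{240}=0.25 .
$$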

#### Severity\-weighting and over\-blocking\.

Two further columns guard against misreading the count-based surface. The *weighted* attack surface (WAS) charges each visible tool by $\operatorname{risk}(\rho)$, so it exposes severity that the raw count hides: all-tools rises from AS $=26$ to WAS $=95$ because its standing surface is dominated by irreversible $\textsc{high}$ tools, whereas RACG ($\lambda=2$) holds WAS $=0.34$. Crucially, RACG’s safety does *not* come from indiscriminate blocking: its gold-tool exposure is GTE $=0.94$ (it keeps the legitimately-needed tool reachable when it is needed) and its over-block rate on authorization-required tasks is OvB $=0.00$ (it never fails a task by withholding a tool the task genuinely needs). The contrast with the sub-$\lambda^{\dagger}$ regime is decisive: at $\lambda=0.5$ the over-block rate jumps to OvB $=0.50$ (RACG fails closed on half the authorization-required tasks), which is exactly why its success falls to $0.89$. Over-blocking is thus a property of *under*-tuned $\lambda$, not of RACG at its operating point; at $\lambda^{\star}=2$ RACG attains zero unauthorized exposure and zero injection success *without* sacrificing the gold path.
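The count-based and severity-weighted surfaces can be computed per step as in the sketch below. The weights are an assumption: only the ratio (a high tool counting four times a med one) comes from the Table VII caption, and treating low-risk tools as contributing zero follows from the surface being described as risk-bearing. The function and argument names are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed severity weights; only the 4:1 high-to-med ratio is stated in the paper.
RISK_WEIGHT = {"low": 0.0, "med": 1.0, "high": 4.0}

def attack_surface(steps: List[List[str]], risk_of: Dict[str, str]) -> Tuple[float, float]:
    """Return (AS, WAS): per-step averages of the count and the weighted sum
    of visible risk-bearing (non-low) tools."""
    if not steps:
        return 0.0, 0.0
    counts, weights = [], []
    for visible in steps:
        risky = [name for name in visible if risk_of[name] != "low"]
        counts.append(len(risky))
        weights.append(sum(RISK_WEIGHT[risk_of[name]] for name in risky))
    return sum(counts) / len(steps), sum(weights) / len(steps)
```

Under these assumed weights, a standing surface of 23 high and 3 med tools would give AS $=26$ and WAS $=23\cdot 4+3=95$, one composition that reproduces the pair reported for all-tools; the actual registry breakdown is not stated in this section.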

#### Interpreting the token metric\.

To avoid conflating two distinct quantities, we name them separately. On the deterministic track, *serialized-context tokens* ($\mathrm{tok}_{\mathrm{ser}}$) is the simulated prompt-token count from serializing $\mathcal{V}_t$ at each step: a proxy for context overhead, not a measured model cost; the $\sim 22\times$ gap between all-tools and RACG mirrors the difference in $|\mathcal{V}_t|$. On the LLM track (Section [VI-B](https://arxiv.org/html/2606.13884#S6.SS2)), *measured model tokens* ($\mathrm{tok}_{\mathrm{mdl}}$) is the provider-reported prompt+completion total. The two are not directly comparable and are never aggregated together.

Figure [3](https://arxiv.org/html/2606.13884#S7.F3) illustrates the central mechanism of capability minimization on an authorization-required send-email trajectory. All-tools exposure keeps the full high-risk surface (26 tools) visible at every step, and state-aware filtering still exposes several executable high-risk tools. Causal frontier and RACG both keep the visible high-risk surface near zero, exposing `send_email` only at the final step once `read_email` has populated `recipient_confirmed`. On tasks with a high-risk *shortcut* (Table [VII](https://arxiv.org/html/2606.13884#S7.T7)), the two diverge: risk-agnostic causal filtering exposes the dangerous tool before authorization, whereas RACG does not.

![Refer to caption](https://arxiv.org/html/2606.13884v1/figures/attack_surface_by_step.png)

Figure 3: Visible high-risk tools at each step of an authorization-required send-email task. All-tools keeps dangerous capability standing throughout; causal frontier and RACG expose a high-risk tool only at the final authorized step.

TABLE VII: Main results on RiskGate (deterministic agent). Success: task completion. AS / WAS: avg. count- and severity-weighted visible risk-bearing tools/step (WAS sums $\operatorname{risk}(\rho)$, so a $\textsc{high}$ tool counts $4\times$ a $\textsc{med}$ one). UE: unauthorized exposures/task. Inj.: injection success rate. GTE: gold-tool exposure (frac. of needed gold tools visible when needed; higher is better). OvB: over-block rate on authorization-required tasks (frac. failed because a legitimately-needed tool was withheld; lower is better). $\mathrm{tok}_{\mathrm{ser}}$: serialized-context tokens/task. All methods reach success $1.00$ except RACG at sub-$\lambda^{\dagger}$ ($\lambda=0.5$).

### VII-A Risk Penalty and the $\lambda^{\dagger}$ Crossover

Figure [4](https://arxiv.org/html/2606.13884#S7.F4) sweeps the risk penalty $\lambda$ and plots benign task success. Across the sweep, unauthorized high-risk exposure and injection success are identically zero for RACG; the discriminating axis is therefore success. Below the crossover ($\lambda\leq 0.5$) RACG fails closed on the high-risk-shortcut tasks (success $0.89$), because the penalty is too small to prefer the longer authorized route over the one-step dangerous shortcut. At $\lambda\geq 1$ RACG routes through authorization and reaches full success, consistent with the $\lambda^{\dagger}\approx 1.25$ predicted by Eq. ([9](https://arxiv.org/html/2606.13884#S4.E9)). The default operating point $\lambda^{\star}=2$ sits safely past the crossover.

![Refer to caption](https://arxiv.org/html/2606.13884v1/figures/pareto_frontier.png)

Figure 4: Benign task success versus the risk penalty $\lambda$. Unauthorized exposure and injection success are zero for all $\lambda$; success exhibits the predicted $\lambda^{\dagger}\approx 1.25$ crossover, below which RACG fails closed on high-risk-shortcut tasks. The highlighted point is the default $\lambda^{\star}=2$.

### VII-B Injection Mitigation

Figure [5](https://arxiv.org/html/2606.13884#S7.F5) reports injection success rate (ISR) on the adversarial track. ISR tracks high-risk exposure: methods that expose dangerous tools admit injected calls (all-tools and keyword top-10 at $1.00$, state-aware at $0.75$, and risk-agnostic causal filtering at $0.25$), while RACG structurally prevents them whenever the target tool is gated at the injection step, yielding $\text{ISR}=0$ independent of injection phrasing.

#### Boundary condition: forging the authorization variable\.

H5 holds only under Eq. ([3](https://arxiv.org/html/2606.13884#S3.E3)). We test the violating case: 240 adversarial trials with authorization-forging injections that write the target’s $\alpha$-variable into the state. Table [VIII](https://arxiv.org/html/2606.13884#S7.T8) shows that when provenance is intact, RACG attains ISR $=0.00$; when forged, the gate opens and RACG’s ISR rises to $0.25$ (matching CMTF). This is the documented failure mode, and it confirms that provenance, not gating, is where deployment scrutiny must concentrate.

TABLE VIII: Boundary condition for H5 (240 trials/method). With provenance intact, RACG attains ISR $=0$; when authorization is forged, the gate opens and RACG matches CMTF.

![Refer to caption](https://arxiv.org/html/2606.13884v1/figures/injection_by_method.png)

Figure 5: Injection success rate by exposure method on the adversarial track (lower is better). RACG removes the targeted high-risk tool from the action space at the injection step, so injected instructions cannot trigger it.

## VIII Discussion

### VIII-A Least Privilege as Runtime Interface Design

RACG reframes tool exposure as a runtime interface-design decision. Conventional agent stacks fix the tool schema at deployment, granting standing authority over every tool the agent might ever need. RACG instead treats $\mathcal{V}_t$ as a function of state and goal, recomputed each step: the agent analogue of capability-based security. The same tool library can present radically different effective authority across tasks with no change to the model or prompt.

### VIII-B Structural Gating versus Behavioral Compliance

Withholding a capability is categorically different from persuading a model not to use it. Behavioral defenses reduce misuse *probability*, which is adversarially manipulable; structural gating removes the *means*, making misuse probability zero when the tool is absent from $\mathcal{V}_t$. The two are complementary: gating is the outer high-assurance layer; behavioral compliance covers residual risk within the gated set.

### VIII-C Auditability and Provenance

Every exposed high-risk tool carries an explicit causal-plus-authorization justification, shifting review from the unbounded space of model behaviors to the bounded space of tool contracts. The security-relevant question reduces to “which tools are trusted producers of which $\alpha$-variables?”, a static, contract-level check that does not require running the agent.
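Such a contract-level audit could look like the sketch below. The contract representation (dicts with `name`, `alpha`, `produces`, and `trusted` keys) and the report fields are assumptions for illustration; the paper's contract format is not shown in this section.

```python
from typing import Any, Dict, List

def audit_contracts(tools: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """For every alpha-variable, list its trusted and untrusted producers.

    A variable with no trusted producer can never be established by a tool
    (the gate always blocks); one with an untrusted producer marks a place
    where the provenance constraint needs enforcement.
    """
    alpha_vars = set().union(*(t["alpha"] for t in tools))
    report: Dict[str, Dict[str, Any]] = {}
    for var in sorted(alpha_vars):
        producers = [t for t in tools if var in t["produces"]]
        trusted = sorted(t["name"] for t in producers if t["trusted"])
        untrusted = sorted(t["name"] for t in producers if not t["trusted"])
        report[var] = {
            "trusted_producers": trusted,
            "untrusted_producers": untrusted,
            "always_blocked": not trusted,
        }
    return report
```

The audit reads only the contracts, so it can run at review time or in continuous integration without executing the agent or any tool.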

### VIII-D Relationship to Monitoring and Recovery

Gating operates *before* an action enters the action space; monitoring and recovery operate during and after. RACG composes with both: monitoring can watch for unusual authorization-variable timing (indicating contract drift or a forge attempt), while self-healing orchestration [[1](https://arxiv.org/html/2606.13884#bib.bib14)] handles residual failures gating does not prevent.

## IX Limitations and Threats to Validity

The benchmark is synthetic and deterministically mocked; it isolates exposure behavior but does not capture real API failures, latency, or ambiguous observations. RiskGate’s tasks, risk tiers, and injections are all authored by us, so reported leak rates (e.g. CMTF’s $0.25=20/80$ shortcut fraction) reflect our task construction. The most important external-validity gap is the absence of evaluation on an independent adversarial benchmark (ToolEmu [[17](https://arxiv.org/html/2606.13884#bib.bib20)], AgentDojo [[4](https://arxiv.org/html/2606.13884#bib.bib19)]); this is key future work. The deterministic agent makes H5 falsifiable and agent-agnostic, but model-driven behavior is established by the seven-model, 80-task validation of Section [VI-B](https://arxiv.org/html/2606.13884#S6.SS2), which, while spanning three model families, remains a single-benchmark study. RACG’s guarantees depend on contract quality: incorrect risk levels or missing authorization variables can over-expose dangerous tools, and inferred contracts [[2](https://arxiv.org/html/2606.13884#bib.bib15)] introduce their own threat surface. The provenance constraint (Eq. ([3](https://arxiv.org/html/2606.13884#S3.E3))) is the precise soundness condition; Table [VIII](https://arxiv.org/html/2606.13884#S7.T8) confirms that if authorization variables can be set by injected content, the gate is bypassed. Finally, reported metrics do not fully capture user-perceived quality or the severity of individual safety failures.

## X Conclusion

We framed capability minimization as a safety primitive for tool-augmented LLM agents and introduced Risk-Aware Causal Gating (RACG), which exposes a high-risk tool only when it is causally necessary and explicitly authorized. By treating the visible tool set as an attack-surface control and by structurally withholding dangerous capabilities until they are justified, RACG offers a least-privilege and auditable exposure layer that, under enforced tool visibility and trusted authorization provenance, structurally prevents a class of injected tool calls when the targeted high-risk capability is absent from the current action space. On a controlled benchmark with a worst-case compliant agent, RACG was the only evaluated method to achieve zero unauthorized high-risk exposure and zero targeted injected high-risk calls while retaining full task success on authorization-required tasks, and the risk-penalty sweep reproduced the predicted $\lambda^{\dagger}$ crossover between safe and unsafe routing. We stress the scope: this is not a general solution to prompt injection but a structural defense against injected use of *gated* high-risk tools, conditional on the provenance constraint holding. Key directions for future work include evaluating RACG on independently-authored adversarial benchmarks (AgentDojo, ToolEmu) to establish external validity, developing runtime provenance enforcement mechanisms (e.g. taint tracking or signed authorization facts) to guarantee Eq. ([3](https://arxiv.org/html/2606.13884#S3.E3)) at the system level, and extending the formulation to multi-agent settings where delegated authority and inter-agent tool sharing introduce new attack surfaces.

## References

- [1] R. S. Babu and A. Agrawal (2026). Self-healing agentic orchestrators for reliable tool-augmented large language model systems. External Links: 2606.01416, [Link](https://arxiv.org/abs/2606.01416).
- [2] R. S. Babu and L. G. Iyer (2026). Contract2Tool: learning preconditions and effects for reliable tool-augmented LLM agents. External Links: 2606.07904.
- [3] R. S. Babu and L. G. Iyer (2026). ToolChoiceConfusion: causal minimal tool filtering for reliable LLM agents. External Links: 2606.06284, [Link](https://arxiv.org/abs/2606.06284).
- [4] E. Debenedetti, J. Zhang, M. Balunović, L. Beurer-Kellner, M. Fischer, and F. Tramèr (2024). AgentDojo: a dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. Advances in Neural Information Processing Systems. [Link](https://arxiv.org/abs/2406.13352).
- [5] R. E. Fikes and N. J. Nilsson (1971). STRIPS: a new approach to the application of theorem proving to problem solving. Artificial Intelligence 2(3–4), pp. 189–208.
- [6] T. Gan and Q. Sun (2025). RAG-MCP: mitigating prompt bloat in LLM tool selection via retrieval-augmented generation. arXiv preprint arXiv:2505.03275. [Link](https://arxiv.org/abs/2505.03275).
- [7] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz (2023). Not what you’ve signed up for: compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. [Link](https://arxiv.org/abs/2302.12173).
- [8] N. Hardy (1988). The confused deputy: (or why capabilities might have been invented). ACM SIGOPS Operating Systems Review 22(4), pp. 36–38.
- [9] M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y. Li (2023). API-Bank: a comprehensive benchmark for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. [Link](https://arxiv.org/abs/2304.08244).
- [10] M. M. Liu, D. Garcia, F. Parllaku, V. Upadhyay, S. F. A. Shah, and D. Roth (2026). ToolScope: enhancing LLM agent tool use through tool merging and context-aware filtering. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics. [Link](https://arxiv.org/abs/2510.20036).
- [11] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang (2024). AgentBench: evaluating LLMs as agents. In International Conference on Learning Representations. [Link](https://arxiv.org/abs/2308.03688).
- [12] D. McDermott, M. Ghallab, A. Howe, C. Knoblock, A. Ram, M. Veloso, D. Weld, and D. Wilkins (1998). PDDL: the planning domain definition language. Technical report, Yale Center for Computational Vision and Control.
- [13] S. Naihin, D. Atkinson, M. Green, M. Hamadi, C. Swift, D. Schonholtz, A. T. Kalai, and D. Bau (2023). Testing language model agents safely in the wild. arXiv preprint arXiv:2311.10538. [Link](https://arxiv.org/abs/2311.10538).
- [14] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez (2025). The Berkeley function-calling leaderboard. In Proceedings of Machine Learning Research. [Link](https://proceedings.mlr.press/v267/patil25a.html).
- [15] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun (2024). ToolLLM: facilitating large language models to master 16000+ real-world APIs. In International Conference on Learning Representations. [Link](https://arxiv.org/abs/2307.16789).
- [16] V. Repantis, A. Gawde, H. Singh, and J. Blackwell (2026). How many tools should an LLM agent see? A chance-corrected answer. arXiv preprint arXiv:2605.24660. [Link](https://arxiv.org/abs/2605.24660).
- [17] Y. Ruan, H. Dong, A. Wang, S. Pitis, Y. Zhou, J. Ba, Y. Dubois, C. J. Maddison, and T. Hashimoto (2024). Identifying the risks of LM agents with an LM-emulated sandbox. International Conference on Learning Representations. [Link](https://arxiv.org/abs/2309.15817).
- \[18\]J\. H\. Saltzer and M\. D\. Schroeder\(1975\)The protection of information in computer systems\.Proceedings of the IEEE63\(9\),pp\. 1278–1308\.Cited by:[§I](https://arxiv.org/html/2606.13884#S1.p3.1),[§II\-B](https://arxiv.org/html/2606.13884#S2.SS2.p1.1)\.
- \[19\]T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom\(2023\)Toolformer: language models can teach themselves to use tools\.InAdvances in Neural Information Processing Systems,External Links:[Link](https://arxiv.org/abs/2302.04761)Cited by:[§I](https://arxiv.org/html/2606.13884#S1.p1.1),[§II\-A](https://arxiv.org/html/2606.13884#S2.SS1.p1.1)\.
- \[20\]Z\. Shi, Y\. Wang, L\. Yan, P\. Ren, S\. Wang, D\. Yin, and Z\. Ren\(2025\)Retrieval models aren’t tool\-savvy: benchmarking tool retrieval for large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics,External Links:[Link](https://arxiv.org/abs/2503.01763)Cited by:[§I](https://arxiv.org/html/2606.13884#S1.p2.1),[§II\-A](https://arxiv.org/html/2606.13884#S2.SS1.p1.1),[TABLE I](https://arxiv.org/html/2606.13884#S2.T1.3.3.2.1)\.
- \[21\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao\(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2210.03629)Cited by:[§I](https://arxiv.org/html/2606.13884#S1.p1.1),[§II\-A](https://arxiv.org/html/2606.13884#S2.SS1.p1.1)\.
- \[22\]T\. Yuan, Z\. He, L\. Dong, Y\. Wang, R\. Zhao, T\. Xia, L\. Xu, B\. Zhou, F\. Li, Z\. Zhang, R\. Wang, and G\. Liu\(2024\)R\-judge: benchmarking safety risk awareness for llm agents\.Findings of the Association for Computational Linguistics: EMNLP 2024\.External Links:[Link](https://arxiv.org/abs/2401.10019)Cited by:[§II\-C](https://arxiv.org/html/2606.13884#S2.SS3.p1.1),[TABLE I](https://arxiv.org/html/2606.13884#S2.T1.3.5.4.1)\.