Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

arXiv cs.AI Papers

Summary

This paper formalizes trust calibration for agentic tool use as a preference learning problem, using Gaussian processes and Bayesian optimization to decide when an AI agent's actions should be autonomous or require human approval.

arXiv:2605.19151v1 Announce Type: new

Cached at: 05/20/26, 08:28 AM

# Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use
Source: [https://arxiv.org/html/2605.19151](https://arxiv.org/html/2605.19151)
(March 4, 2026)

###### Abstract

We formalize trust calibration for agentic tool use (deciding when an automated agent's proposed action may execute autonomously versus require human approval) as a preference-learning problem. A policy gateway maintains a Gaussian-process posterior over a latent human risk-tolerance function, observed through a probit likelihood on binary approve/deny feedback, and escalates to the human exactly where the approval outcome is most uncertain. We show this is structurally an instance of Preferential Bayesian Optimization, inheriting its inference machinery (approximate Gaussian-process classification) and its sample-efficiency argument (uncertainty-targeted querying), while differing in objective: classifying an action space into allow/block/ask regions rather than optimizing a design.

## 1 Background and Related Work

Deciding how much autonomy to delegate to an automated system is a classical human-in-the-loop control problem; the technical core here, recovering a human's latent acceptability function from sparse binary feedback, is *preference learning*. Chu and Ghahramani [6] introduced Gaussian-process preference learning, placing a Gaussian-process (GP) prior over a latent utility and linking observed human choices to it through a probit likelihood. We adopt exactly this construction, specialized to unary approve/deny feedback. The same latent-utility-with-probit model underlies preference-driven sequential decision making: González et al. [8] formalized *Preferential Bayesian Optimization* (PBO), embedding GP preference learning in the query loop of Bayesian optimization [3, 16]. Our policy gateway is structurally an instance of this framework; beyond using unary rather than pairwise feedback, it differs only in objective (classifying an action space rather than optimizing a design), as Table 1 makes precise.

The inferential machinery is classical GP classification: a GP prior, a non-Gaussian probit likelihood, and an analytically intractable posterior approximated by the Laplace method or Expectation Propagation [10], developed comprehensively by Rasmussen and Williams [13]. Treating the human query budget as a scarce resource makes the *ask* region an *acquisition* rule in the sense of active learning [15] and Bayesian optimization [16]: interruptions are spent where the expected information about the allow/block boundary is greatest. Finally, drifting risk tolerance is a non-stationarity problem; we model it with a time-decaying kernel component in the spirit of non-stationary covariance functions [12] and, for abrupt shifts, Bayesian online changepoint detection [1].

This progressive view of autonomy has deep roots in the trust-in-automation literature: Lee and See [9] characterized appropriate reliance as the alignment of trust with actual system trustworthiness, and de Visser et al. [7] extended this to *longitudinal* trust calibration in human–robot teams, precisely the dynamic our time-decaying kernel (§6) is meant to capture. The concern has resurfaced sharply for large-language-model agents, where graduated autonomy has emerged as an explicit deployment axis alongside raw capability [11]. Recent work catalogs the risks of increasingly agentic systems [5], argues for visibility and oversight mechanisms over deployed agents [4], and proposes governance practices in which a human retains approval authority over consequential actions [17]. These accounts are largely qualitative and taxonomic: they argue *why* graduated autonomy matters and *what* should be governed, but leave the escalation policy itself as a fixed, hand-specified tier. Our contribution is the missing mechanism: a learning rule that lets the auto-approve/escalate boundary adapt from human feedback rather than being set by hand.

## 2 Setup

At each decision point $t = 1, 2, \ldots$, the policy gateway observes a proposed agent action $a_t \in \mathcal{A}$ and an execution context $c_t \in \mathcal{C}$, where:

$$a_t = (\texttt{tool\_name},\; \texttt{args},\; \texttt{target\_resource}), \tag{1}$$

$$c_t = (\texttt{repo\_state},\; \texttt{task\_desc},\; \texttt{session\_history}). \tag{2}$$

A human supervisor provides binary feedback $y_t \in \{0, 1\}$ (0 = deny, 1 = approve). We write $x_t := (a_t, c_t) \in \mathcal{X} = \mathcal{A} \times \mathcal{C}$ for the joint input.
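One way to carry the setup into code is a set of immutable records, one per decision point. This is a minimal sketch: the class names and the string and tuple encodings of each field are our assumptions, not part of the formalization. The time index `t` is stored alongside $x_t$ because the temporal kernel of §6 needs it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Action:
    """a_t = (tool_name, args, target_resource), as in Eq. (1)."""
    tool_name: str
    args: Tuple[str, ...]
    target_resource: str


@dataclass(frozen=True)
class Context:
    """c_t = (repo_state, task_desc, session_history), as in Eq. (2).

    Encoding each field as a string or tuple of strings is an illustrative choice.
    """
    repo_state: str
    task_desc: str
    session_history: Tuple[str, ...] = ()


@dataclass(frozen=True)
class DecisionPoint:
    """Joint input x_t = (a_t, c_t) tagged with its time index t.

    y is the supervisor's feedback: 1 = approve, 0 = deny, None = not queried.
    """
    t: int
    action: Action
    context: Context
    y: Optional[int] = None
```

Freezing the records makes them hashable, so they can serve directly as dictionary keys or cache entries for kernel evaluations.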

## 3 Latent Risk Tolerance

###### Definition 1 (Risk tolerance function).

There exists a latent function $f \colon \mathcal{X} \to \mathbb{R}$ encoding the human's risk tolerance, such that approval probability follows a probit observation model:

$$\Pr(y = 1 \mid x) = \Phi\bigl(f(x)\bigr), \tag{3}$$

where $\Phi$ is the standard normal CDF.
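The probit observation model of Eq. (3) takes only a few lines. The sketch below computes $\Phi$ with the standard library's error function and simulates supervisor feedback as $y \sim \mathrm{Bernoulli}\bigl(\Phi(f(x))\bigr)$; the helper names are illustrative.

```python
import math
import random


def std_normal_cdf(z: float) -> float:
    """Phi(z), the standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def approval_probability(f_x: float) -> float:
    """Eq. (3): Pr(y = 1 | x) = Phi(f(x))."""
    return std_normal_cdf(f_x)


def sample_feedback(f_x: float, rng: random.Random) -> int:
    """Draw y ~ Bernoulli(Phi(f(x))): 1 = approve, 0 = deny."""
    return 1 if rng.random() < approval_probability(f_x) else 0
```

A latent value of $f(x) = 0$ means the supervisor is indifferent (approval probability 0.5); large positive values mean near-certain approval.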

## 4 Gaussian Process Prior and Posterior

Place a GP prior over $f$ [13]:

$$f \sim \mathcal{GP}\bigl(\mu_0,\; k(x, x')\bigr). \tag{4}$$

The kernel $k$ decomposes over the input structure. A natural choice is a product kernel:

$$k(x, x') = k_{\text{tool}}(a, a') \cdot k_{\text{ctx}}(c, c') \cdot k_{\text{time}}(t, t'), \tag{5}$$

where each input $x = (a, c)$ is carried together with its time index $t$ (and likewise $x' = (a', c')$ with $t'$), and:

- $k_{\text{tool}}$ encodes similarity between actions (e.g., shared tool name, overlapping argument patterns, same reversibility class),
- $k_{\text{ctx}}$ captures context similarity (same repository, file type, task category),
- $k_{\text{time}}$ handles non-stationarity (see §6).
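A concrete instance of Eq. (5) shows why each factor stays a valid covariance. The sketch below is one hypothetical choice, not the paper's kernel: actions are summarized by a tool name and a vector of numeric risk attributes, contexts by three categorical fields named after the bullets above, and all lengthscales and weights are placeholder values.

```python
import math
from typing import Mapping, NamedTuple, Tuple


class Input(NamedTuple):
    """Hypothetical featurization of x = (a, c) plus its time index t."""
    tool_name: str
    risk: Tuple[float, ...]   # e.g. (reversibility, sensitivity, blast radius)
    ctx: Mapping[str, str]    # categorical context fields
    t: int


CTX_FIELDS = ("repo", "file_type", "task")  # hypothetical context fields


def k_tool(x1: Input, x2: Input, lengthscale: float = 1.0, name_weight: float = 1.0) -> float:
    """RBF on risk attributes, scaled up when tool names match; k_tool(a, a) = 1."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(x1.risk, x2.risk))
    rbf = math.exp(-0.5 * sq_dist / lengthscale**2)
    same_name = 1.0 if x1.tool_name == x2.tool_name else 0.0
    # (1 + w * [same name]) is a sum of PSD kernels; multiplying by an RBF keeps it PSD.
    return (1.0 + name_weight * same_name) / (1.0 + name_weight) * rbf


def k_ctx(x1: Input, x2: Input) -> float:
    """Fraction of context fields that agree (an average of PSD indicator kernels)."""
    return sum(x1.ctx[f] == x2.ctx[f] for f in CTX_FIELDS) / len(CTX_FIELDS)


def k_time(x1: Input, x2: Input, lam: float = 200.0) -> float:
    """Eq. (9): exponential decay in the time separation |t - t'|."""
    return math.exp(-abs(x1.t - x2.t) / lam)


def k(x1: Input, x2: Input) -> float:
    """Eq. (5): product kernel; a product of PSD kernels is itself PSD."""
    return k_tool(x1, x2) * k_ctx(x1, x2) * k_time(x1, x2)
```

Each factor equals 1 on identical inputs, so $k(x, x) = 1$ and the prior variance of $f$ is set entirely by that normalization; a real deployment would learn the lengthscales and weights from data.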

After observing $\mathcal{D}_N = \{(x_t, y_t)\}_{t=1}^{N}$, the posterior is:

$$p(f \mid \mathcal{D}_N) \;\propto\; \mathcal{GP}(\mu_0, k) \cdot \prod_{t=1}^{N} \Phi\bigl(f(x_t)\bigr)^{y_t} \bigl(1 - \Phi\bigl(f(x_t)\bigr)\bigr)^{1 - y_t}. \tag{6}$$

This posterior is analytically intractable because the likelihood is non-Gaussian. Approximate inference proceeds via the Laplace approximation [13] or Expectation Propagation [10], the same machinery used in PBO.
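The Laplace route fits in a few dozen lines. This is a minimal NumPy/SciPy sketch of the standard Newton iteration for GP probit classification in Rasmussen and Williams [13] (their Algorithms 3.1 and 3.2), adapted to a constant prior mean $\mu_0$ and to labels in $\{0, 1\}$. It is not the authors' experiment code, and the function names are illustrative.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.stats import norm


def _probit_terms(f, s):
    """Gradient and negative Hessian W of log Phi(s * f), elementwise."""
    z = s * f
    r = np.exp(norm.logpdf(z) - norm.logcdf(z))  # phi(z) / Phi(z), computed in log space
    grad = s * r
    W = r**2 + z * r  # strictly positive: the probit likelihood is log-concave
    return grad, W


def _chol_B(K, sW):
    """Lower Cholesky factor of B = I + W^(1/2) K W^(1/2), which is well conditioned."""
    n = K.shape[0]
    return cholesky(np.eye(n) + sW[:, None] * K * sW[None, :], lower=True)


def laplace_fit(K, y, mean=0.0, max_iter=100, tol=1e-8):
    """Posterior mode of Eq. (6) by Newton's method.

    K: (N, N) prior covariance; y: labels in {0, 1}; mean: constant prior mean mu_0.
    """
    s = 2.0 * np.asarray(y, dtype=float) - 1.0  # map {0, 1} -> {-1, +1}
    f = np.full(len(s), float(mean))
    for _ in range(max_iter):
        grad, W = _probit_terms(f, s)
        sW = np.sqrt(W)
        L = _chol_B(K, sW)
        b = W * (f - mean) + grad
        c = solve_triangular(L, sW * (K @ b), lower=True)
        a = b - sW * solve_triangular(L.T, c, lower=False)
        f_new = mean + K @ a
        done = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if done:
            break
    grad, W = _probit_terms(f, s)
    sW = np.sqrt(W)
    return f, grad, sW, _chol_B(K, sW)


def laplace_predict(k_star, k_star_star, grad, sW, L, mean=0.0):
    """Approximate p_hat(x*) of Eq. (7) for a single test point.

    k_star: (N,) covariances k(x_t, x*); k_star_star: prior variance k(x*, x*).
    """
    f_mean = mean + k_star @ grad                 # predictive mean of f(x*)
    v = solve_triangular(L, sW * k_star, lower=True)
    f_var = k_star_star - v @ v                   # predictive variance of f(x*)
    p_hat = norm.cdf(f_mean / np.sqrt(1.0 + f_var))
    return float(p_hat), float(f_mean), float(f_var)
```

Computing the ratio $\phi(z)/\Phi(z)$ through `logpdf - logcdf` avoids dividing by an underflowed $\Phi(z)$ for strongly negative $z$. For a test point far from all data, `k_star` is near zero and the prediction reverts to the prior, $\hat p \approx \Phi\bigl(\mu_0/\sqrt{1 + k(x_*, x_*)}\bigr)$.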

## 5 The Policy Gateway Decision Rule

Given the posterior predictive approval probability at a new point $x_*$,

$$\hat{p}(x_*) := \mathbb{E}_{f \mid \mathcal{D}_N}\bigl[\Phi\bigl(f(x_*)\bigr)\bigr], \tag{7}$$

the gateway applies a three-tier decision:

$$
\text{decision}(x_*) =
\begin{cases}
\textsc{allow} & \text{if } \hat{p}(x_*) > \tau_{\text{high}},\\[3pt]
\textsc{block} & \text{if } \hat{p}(x_*) < \tau_{\text{low}},\\[3pt]
\textsc{ask} & \text{otherwise}.
\end{cases}
\tag{8}
$$

The *ask* region $[\tau_{\text{low}}, \tau_{\text{high}}]$ plays the role of an *acquisition function* [16, 15]: the system queries the human where the model is least certain about the approval outcome, with the aim of maximizing the expected value of information per human interruption.
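When the approximate posterior for $f(x_*)$ is Gaussian with mean $\mu$ and variance $\sigma^2$ (as under the Laplace or EP approximation), Eq. (7) has the closed form $\hat p = \Phi\bigl(\mu/\sqrt{1+\sigma^2}\bigr)$, a standard Gaussian-probit identity. A minimal sketch of Eqs. (7) and (8) follows; the threshold values in the usage lines are hypothetical.

```python
import math


def predictive_approval(mu: float, var: float) -> float:
    """Eq. (7) for f(x*) ~ N(mu, var): E[Phi(f)] = Phi(mu / sqrt(1 + var))."""
    z = mu / math.sqrt(1.0 + var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gateway_decision(p_hat: float, tau_low: float, tau_high: float) -> str:
    """Three-tier rule of Eq. (8)."""
    if not 0.0 <= tau_low <= tau_high <= 1.0:
        raise ValueError("need 0 <= tau_low <= tau_high <= 1")
    if p_hat > tau_high:
        return "allow"
    if p_hat < tau_low:
        return "block"
    return "ask"


# Same posterior mean, different uncertainty (hypothetical thresholds 0.2 / 0.9):
print(gateway_decision(predictive_approval(1.5, 0.1), 0.2, 0.9))  # allow
print(gateway_decision(predictive_approval(1.5, 4.0), 0.2, 0.9))  # ask
```

Extra posterior variance pulls $\hat p$ toward 0.5, which is how poorly understood actions end up in the *ask* band even when their posterior mean favors approval.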

## 6 Non-Stationarity

Human risk tolerance drifts: early in a project the supervisor is cautious; as trust accumulates for familiar patterns, they become more permissive. We model this via a time-decayed kernel component:

$$k_{\text{time}}(t, t') = \exp\Bigl(-\frac{|t - t'|}{\lambda}\Bigr), \tag{9}$$

where $\lambda > 0$ is a lengthscale controlling the forgetting rate: two observations $\lambda$ steps apart have correlation $e^{-1} \approx 0.37$. When predicting at the current time, this gives recent approve/deny signals more weight, a principled analog of the intuition that “agents earn trust over time.”

For computational efficiency, a similar effect can be achieved with a sliding window of the most recent $W$ observations, or with online changepoint detection when the supervisor's behavior shifts abruptly (e.g., on moving to a new codebase).
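The two forgetting mechanisms can be contrasted in a short sketch: Eq. (9) decays influence smoothly, while a sliding window cuts it off sharply after $W$ observations. The class and function names are hypothetical, and the window simply stores $(x, y)$ pairs for a later refit.

```python
import math
from collections import deque


def k_time(t: float, t_prime: float, lam: float) -> float:
    """Eq. (9): correlation decays exponentially with the time gap |t - t'|."""
    return math.exp(-abs(t - t_prime) / lam)


def half_life(lam: float) -> float:
    """Time gap at which k_time falls to 1/2: lam * ln 2."""
    return lam * math.log(2.0)


class SlidingWindow:
    """Keep only the W most recent (x, y) pairs: a hard-cutoff alternative to Eq. (9).

    Refitting a Laplace GP on the window costs O(W^3) instead of O(N^3).
    """

    def __init__(self, W: int):
        self._buf = deque(maxlen=W)  # the deque drops its oldest item once full

    def add(self, x, y: int) -> None:
        self._buf.append((x, y))

    def data(self):
        return list(self._buf)
```

The window bounds the cost of each refit but forgets abruptly, whereas the kernel forgets gradually at a rate set by $\lambda$ (half-life $\lambda \ln 2$).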

## 7 Correlated Generalization

A key advantage over a naïve contextual bandit (which treats each $(a, c)$ independently) is *correlated generalization* through the kernel. Concretely:

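- Approving `write_file` to `/workspace/src/` transfers evidence to `write_file` to `/workspace/test/`, since $k_{\text{tool}}$ and $k_{\text{ctx}}$ assign the two high similarity.
- Denying `execute_sql` with a `DROP` argument propagates caution to `execute_sql` with `TRUNCATE`, without the human having to deny each variant individually.
- A new tool with no interaction history, and little kernel similarity to previously observed actions, falls back to the prior $\mu_0$, which maps to *ask*, the fail-safe default, provided $\mu_0$ is chosen so that $\Phi\bigl(\mu_0/\sqrt{1 + k(x, x)}\bigr)$ lies in $[\tau_{\text{low}}, \tau_{\text{high}}]$ (e.g., $\mu_0 = 0$).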
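The fail-safe default for unseen tools is a property of the prior, not of the decision rule, and can be checked directly. With no relevant data, $f(x_*) \sim \mathcal{N}\bigl(\mu_0, k(x_*, x_*)\bigr)$; the sketch below tests whether the resulting $\hat p$ lands in the *ask* band. The threshold values in the usage note are hypothetical.

```python
import math


def std_normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prior_predictive_approval(mu0: float, prior_var: float) -> float:
    """With no relevant data: p_hat = Phi(mu0 / sqrt(1 + k(x*, x*)))."""
    return std_normal_cdf(mu0 / math.sqrt(1.0 + prior_var))


def falls_back_to_ask(mu0: float, prior_var: float, tau_low: float, tau_high: float) -> bool:
    """True if an unseen, dissimilar action lands in the ask band of Eq. (8)."""
    p = prior_predictive_approval(mu0, prior_var)
    return tau_low <= p <= tau_high
```

With $\mu_0 = 0$ the prior predictive is exactly 0.5, so any band that straddles 0.5 escalates unseen tools. An optimistic prior such as $\mu_0 = 2$ with unit prior variance gives $\hat p \approx 0.92$, which would silently *allow* them under $\tau_{\text{high}} = 0.9$.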

## 8 Connection to PBO

The mapping between the trust calibration problem and Preferential Bayesian Optimization [8] is summarized in Table 1.

| Component | PBO (Optimization) | Trust Calibration (Policy) |
|---|---|---|
| Input space $\mathcal{X}$ | Design parameters | (action, context) pairs |
| Latent function $f$ | Objective to maximize | Risk tolerance to learn |
| Human feedback | Pairwise preference $x_i \succ x_j$ | Unary approve/deny |
| Observation model | $\Phi\bigl(f(x_i) - f(x_j)\bigr)$ | $\Phi\bigl(f(x)\bigr)$ |
| Acquisition | Next query to evaluate | Next action to escalate |
| Goal | Find $x^* = \arg\max f$ | Learn the allow/block boundary |

Table 1: Structural correspondence between PBO and trust calibration.

The “preferential” aspect is literal: the human is expressing preferences over what the agent should be allowed to do. The same mathematical machinery (GP priors, probit likelihoods, approximate posterior inference) transfers directly. The main differences are the feedback, which is unary rather than pairwise, and the objective: PBO seeks to *optimize*, while trust calibration seeks to *classify* the action space into allow/block/ask regions with minimal human queries.
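The two observation-model columns of Table 1 can be written side by side. This is a minimal sketch; the function names are illustrative.

```python
import math


def std_normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pbo_pairwise_likelihood(f_i: float, f_j: float) -> float:
    """PBO column: Pr(x_i preferred to x_j) = Phi(f(x_i) - f(x_j))."""
    return std_normal_cdf(f_i - f_j)


def unary_likelihood(f_x: float, y: int) -> float:
    """Trust-calibration column: Phi(f(x))^y * (1 - Phi(f(x)))^(1 - y), y in {0, 1}."""
    p = std_normal_cdf(f_x)
    return p if y == 1 else 1.0 - p
```

Because $\Phi(f(x)) = \Phi(f(x) - 0)$, one reading (ours, not the paper's) is that each approve/deny answer is a pairwise comparison between executing $x$ and a fixed reference option whose latent value is pinned at 0, such as not executing at all.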

## 9 Datasets and Evaluation

Empirically grounding the gateway requires data pairing proposed agent actions with human approve/deny judgments. The closest public resource is R-Judge [18], which supplies multi-turn agent interaction records annotated by humans with binary safe/unsafe labels across a range of risk scenarios; it is a natural source for a cold-start prior $\mu_0$ and for calibrating $k_{\text{tool}}$ and $k_{\text{ctx}}$. Broader agent-safety benchmarks such as Agent-SafetyBench [19] and ToolEmu [14] widen the action and context coverage, but their risk labels are produced by automated judges rather than per-action human approval, so they are better suited to stress-testing the learned boundary than to fitting $f$ itself.

A structural gap remains: no public dataset captures the *longitudinal, per-supervisor* signal that the non-stationarity model (§6) assumes. Existing corpora provide single-shot, aggregated annotations and do not track an individual supervisor's risk tolerance drifting over the course of a project. Validating the time-decaying kernel therefore requires either a controlled user study or a simulation with deliberately drifting synthetic annotators, with R-Judge serving as the static initializer. We treat this as an inherent limitation of currently available data rather than a deficiency of the model.

## 10 Simulation Study

We accordingly exercise the formulation in a controlled simulation whose ground-truth oracle instantiates Definition 1 and the drift of §6. This follows the standard evaluation protocol for Preferential Bayesian Optimization, in which the latent preference function is synthetic by construction so that recovery, calibration, and query efficiency can be measured against a known target. The action space comprises 18 agent tools with interpretable decision-time risk attributes (reversibility, base sensitivity, blast radius, destructive-argument flag), 8 target-resource sensitivity tiers, and 7 task contexts. The oracle latent is a static action-acceptability term plus a saturating accumulated-trust term (§6) with an abrupt changepoint at $t = 750$, plus a three-way safety veto (irreversible *and* sensitive *and* low-trust); approvals are drawn as $y \sim \mathrm{Bernoulli}\bigl(\Phi(f^*)\bigr)$. The gateway is the self-contained Laplace GP-probit of Rasmussen and Williams [13] with the product kernel of §4. The kernel observes the action and context risk attributes but never the time-varying veto or the drift, which it must recover from feedback through $k_{\text{time}}$. We process a stream of $N = 1500$ decision points over 6 seeds, prequentially (every decision is scored before the label for that step is revealed), in three phases: learn $[0, 560)$; validation $[560, 1050)$, where $(\tau_{\text{low}}, \tau_{\text{high}})$ is tuned once under a tightened false-allow cap; and a frozen-policy test $[1050, 1500)$, in which the thresholds stay fixed but the model keeps adapting online. Reproducible code, the full report, and all figures are provided under `experiment/`.
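The prequential protocol can be sketched as a loop that scores each decision before the model sees that step's label. This is an illustration under stated assumptions, not the `experiment/` code: the `OnlineGateway` interface is hypothetical, the supervisor's label is assumed to reach the model only when the gateway asks, and accuracy is measured here against the sampled label $y$ (the paper's exact scoring target is not specified in this section).

```python
from typing import Any, Iterable, List, Protocol, Tuple


class OnlineGateway(Protocol):
    """Hypothetical interface for any online approval model (e.g. a Laplace GP-probit)."""

    def predict(self, x: Any) -> float: ...  # returns p_hat(x)

    def update(self, x: Any, y: int) -> None: ...


Record = Tuple[int, float, str, int]  # (t, p_hat, decision, true label)


def run_prequential(model: OnlineGateway, stream: Iterable[Tuple[Any, int]],
                    tau_low: float, tau_high: float) -> List[Record]:
    """Score every decision before its label is used; learn only from asked items."""
    log: List[Record] = []
    for t, (x, y) in enumerate(stream):
        p_hat = model.predict(x)  # uses labels from earlier steps only
        if p_hat > tau_high:
            decision = "allow"
        elif p_hat < tau_low:
            decision = "block"
        else:
            decision = "ask"
        log.append((t, p_hat, decision, y))
        if decision == "ask":
            model.update(x, y)  # assumption: the label arrives only when the human is asked
    return log


def summarize(log: List[Record]) -> Tuple[float, float, int]:
    """(automated fraction, accuracy of automated decisions vs. y, number of human queries)."""
    auto = [(d, y) for _, _, d, y in log if d != "ask"]
    n_queries = len(log) - len(auto)
    frac_auto = len(auto) / len(log) if log else 0.0
    correct = sum((d == "allow") == (y == 1) for d, y in auto)
    accuracy = correct / len(auto) if auto else float("nan")
    return frac_auto, accuracy, n_queries
```

Phase boundaries such as learn/validation/test can then be applied by slicing the returned log by `t`.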

### Inference and correlated generalization hold.

Table 2 reports the validation phase (a fair measurement of what was learned, with the changepoint inside it) and the post-changepoint test phase, against a no-correlation baseline that learns each tool independently. On validation, the GP gateway auto-decides 68% of actions at 97.3% accuracy with a 2.4% false-allow rate, and it reaches 99.7% accuracy post-changepoint; the independent baseline auto-decides almost everything but at only 69–71% accuracy, indicating that the kernel, not the three-tier rule, carries the quality. Correlated generalization is decisive: on a held-out `write_file` to a test directory that was *never* queried, the GP recovers the correct decision 98.7% ± 2.9 of the time purely by kernel extrapolation from similar tools and targets, against 66.7% ± 47.1 for the independent learner (chance: 50%); Figure 1 (right) visualizes the gap. The *ask* band narrows and the auto-approve rate rises as the posterior concentrates; both then react sharply at the changepoint and recover (Figure 1, left), and the posterior for a fixed probe action tracks both the gradual drift and the abrupt reset (Figure 2, right). Calibration is directionally correct but underconfident at the kernel-far tail (validation ECE 0.19 against the true $\Phi(f^*)$), which we report rather than tune away.
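The validation ECE above compares predicted approval probabilities with the oracle's true $\Phi(f^*)$ rather than with sampled labels. The binning scheme is not stated in this section, so the following is a minimal sketch assuming equal-width bins (10 by default).

```python
import numpy as np


def expected_calibration_error(p_pred, target, n_bins: int = 10) -> float:
    """Equal-width ECE: sum over bins of (bin weight) * |mean prediction - mean target|.

    `target` may be 0/1 labels or, in a simulation, the oracle's true
    probabilities Phi(f*); averaging within a bin works the same for both.
    """
    p_pred = np.asarray(p_pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Bin index in {0, ..., n_bins - 1}; p = 1.0 goes into the top bin.
    bins = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            ece += in_bin.mean() * abs(p_pred[in_bin].mean() - target[in_bin].mean())
    return float(ece)
```

An underconfident model, for example one predicting 0.6 where the true approval probability is 0.9, contributes the full 0.3 gap in that bin, weighted by the fraction of decisions that fall there.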

Table 2: Simulation results, mean ± std over 6 seeds. **Bold** marks the decisive cells: the GP gateway is far more accurate, safer, and better calibrated than a no-correlation contextual baseline. The baseline's larger automated fraction is not a virtue (left unbolded): it automates more only by auto-deciding cases it gets wrong.

![Figure 1, left panel](https://arxiv.org/html/2605.19151v1/x1.png)

![Figure 1, right panel](https://arxiv.org/html/2605.19151v1/x2.png)

Figure 1: Left: the rolling policy mix. The *allow* share rises as the posterior concentrates (the *ask* band narrows), then collapses at the $t = 750$ trust changepoint and recovers, exactly the Section 5 and Section 6 dynamics. Right: correlated generalization (Section 7) to an action-context combination that was never queried.

### Human-burden reduction.

Against the status quo of always escalating (one human query per action), the gateway spends 508 ± 51 queries over the scored phases versus 940, a ∼1.8× reduction in interruptions, while auto-deciding most actions accurately and safely (Figure 2, left; the full-stream cumulative trajectory shows a larger gap because the cold-start learn phase escalates heavily by design). This is the burden reduction promised in Section 1, and it, rather than a comparison against random querying, is the defensible headline.
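The always-escalate baseline of 940 queries and the reduction factor follow directly from the phase boundaries stated earlier in this section. The ±51 spread is propagated below only as a rough one-standard-deviation range, not as a confidence interval.

$$
\underbrace{(1050 - 560)}_{\text{validation}} + \underbrace{(1500 - 1050)}_{\text{test}} = 490 + 450 = 940,
\qquad
\frac{940}{508} \approx 1.85,
\qquad
\frac{940}{508 + 51} \approx 1.68,
\quad
\frac{940}{508 - 51} \approx 2.06.
$$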

![Figure 2, left panel](https://arxiv.org/html/2605.19151v1/x3.png)

![Figure 2, right panel](https://arxiv.org/html/2605.19151v1/x4.png)

Figure 2: Left: cumulative human queries, gateway versus always-escalate (Section 1). Right: a fixed probe action's posterior approval probability tracks the drifting oracle and its abrupt reset (Section 6); the systematic gap is the Laplace-probit underconfidence noted in the text.

### An honest negative result on the acquisition rule.

The rule “query inside the *ask* band,” taken literally as a sample-efficiency claim, is *not* supported here: at a matched query budget its prequential boundary-decision accuracy is 76.5% ± 2.2, against 78.4% ± 4.2 for random querying. An ablation that switches the changepoint off and on (Table 3) shows the gap is non-positive in *both* regimes, including with a fully stationary target, so the deficit is not caused by non-stationarity. It is the familiar behavior of pure uncertainty sampling under class imbalance: a region the posterior has grown confident about leaves the *ask* band and is never re-probed, so its estimate is never refreshed; any error there persists, and under drift a silently reset tolerance goes undetected. The time-decayed kernel down-weights stale evidence but does not itself generate new probes. (The per-condition magnitude is small and seed-noisy; the robust finding is the consistently non-positive sign.) A recency-aware or information-theoretic acquisition criterion (an expected-information or BALD rule with a forgetting term) is the natural remedy and is left to future work. It does not affect the inference, generalization, or burden-reduction results above, which use the same band purely as an escalation rule.
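To make the contrast with the band rule concrete, the sketch below computes the standard BALD score for a Gaussian-probit posterior, using the closed-form approximation of Houlsby et al. (2011, “Bayesian Active Learning for Classification and Preference Learning”). It deliberately omits the forgetting term the text leaves to future work, and it is an illustration rather than a method evaluated in this paper.

```python
import math

# Constant from approximating h(Phi(x)) by exp(-x^2 / (pi * ln 2)), entropy in bits.
_C = math.sqrt(math.pi * math.log(2.0) / 2.0)


def std_normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binary_entropy_bits(p: float) -> float:
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def bald_probit(mu: float, var: float) -> float:
    """Approximate mutual information I[y; f(x*)] in bits for f(x*) ~ N(mu, var)."""
    total = binary_entropy_bits(std_normal_cdf(mu / math.sqrt(1.0 + var)))
    expected = _C / math.sqrt(var + _C**2) * math.exp(-mu**2 / (2.0 * (var + _C**2)))
    return max(0.0, total - expected)  # the approximation can dip slightly below zero
```

Two actions with $\mu = 0$ both have $\hat p = 0.5$ and sit in the middle of the *ask* band, yet BALD scores a near-certain coin flip ($\sigma^2 \to 0$) at about 0 bits and a genuinely unknown action ($\sigma^2 = 100$) at about 0.9 bits. The band rule cannot tell these apart; an information-based rule spends queries only on the second.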

Table 3: Acquisition ablation (5 seeds, matched budget, prequential boundary accuracy). The **bold** gap is the finding: it is non-positive in both regimes, so the *ask*-band rule does not beat random querying even when the target is stationary; the deficit is generic, not a changepoint artifact.

## 11 Concluding Remarks

We have argued that deciding when an agent may act autonomously is not a threshold to hand-tune but a latent human risk-tolerance function to learn, and that doing so is structurally an instance of Preferential Bayesian Optimization specialized to unary approve/deny feedback. The contribution is the mechanism the governance literature leaves implicit: a GP-probit policy gateway whose allow/escalate/block boundary adapts from feedback, generalizes across correlated actions through a structured kernel, and tracks a drifting supervisor through a time-decayed component.

The simulation study supports the parts of this story that concern *inference*: the gateway recovers a non-stationary boundary it never observes directly, transfers evidence to unqueried action-context combinations far better than an independent learner, tracks an abrupt changepoint, and cuts human interruptions by roughly 1.8× relative to escalating everything. It also disciplines one claim: the *ask* band is a sound *escalation* rule, but treating it as a sample-efficient *acquisition* rule does not hold under class imbalance, independently of non-stationarity. We report this rather than tune it away, and read it as locating the open problem precisely in the acquisition criterion rather than in the formulation. The remaining gap is empirical: no public dataset tracks a single supervisor's per-action decisions as their tolerance drifts, so a recency-aware acquisition rule and a longitudinal human study are the natural next steps. We see the value of the formulation less in the present numbers, which are simulated by necessity, than in turning a hand-set autonomy tier into an object that can be learned, audited, and questioned.

## References

- [1] R. P. Adams and D. J. C. MacKay (2007). Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742. https://arxiv.org/abs/0710.3742
- [2] M. Balandat, B. Karrer, D. R. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy (2020). BoTorch: a framework for efficient Monte-Carlo Bayesian optimization. In *Advances in Neural Information Processing Systems 33 (NeurIPS)*, Vol. 33, pp. 21524–21538. https://proceedings.neurips.cc/paper/2020/hash/f5b1b89d98b7286673128a5fb112cb9a-Abstract.html
- [3] E. Brochu, V. M. Cora, and N. de Freitas (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599. https://arxiv.org/abs/1012.2599
- [4] A. Chan, C. Ezell, M. Kaufmann, K. Wei, L. Hammond, H. Bradley, E. Bluemke, N. Rajkumar, D. Krueger, N. Kolt, L. Heim, and M. Anderljung (2024). Visibility into AI agents. In *Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT)*, pp. 958–973. https://doi.org/10.1145/3630106.3658948
- [5] A. Chan, R. Salganik, A. Markelius, C. Pang, N. Rajkumar, D. Krasheninnikov, L. Langosco, Z. He, Y. Duan, M. Carroll, M. Lin, A. Mayhew, K. Collins, M. Molamohammadi, J. Burden, W. Zhao, S. Rismani, K. Voudouris, U. Bhatt, A. Weller, D. Krueger, and T. Maharaj (2023). Harms from increasingly agentic algorithmic systems. In *Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (FAccT)*, pp. 651–666. https://doi.org/10.1145/3593013.3594033
- [6] W. Chu and Z. Ghahramani (2005). Preference learning with Gaussian processes. In *Proceedings of the 22nd International Conference on Machine Learning (ICML)*, pp. 137–144. https://doi.org/10.1145/1102351.1102369
- [7] E. J. de Visser, M. M. M. Peeters, M. F. Jung, S. Kohn, T. H. Shaw, R. Pak, and M. A. Neerincx (2020). Towards a theory of longitudinal trust calibration in human–robot teams. *International Journal of Social Robotics* 12(2), pp. 459–478. https://doi.org/10.1007/s12369-019-00596-x
- [8] J. González, Z. Dai, A. Damianou, and N. D. Lawrence (2017). Preferential Bayesian optimization. In *Proceedings of the 34th International Conference on Machine Learning (ICML)*, Proceedings of Machine Learning Research, Vol. 70, pp. 1282–1291. https://proceedings.mlr.press/v70/gonzalez17a.html
- [9] J. D. Lee and K. A. See (2004). Trust in automation: designing for appropriate reliance. *Human Factors* 46(1), pp. 50–80. https://doi.org/10.1518/hfes.46.1.50_30392
- [10] T. P. Minka (2001). Expectation propagation for approximate Bayesian inference. In *Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI)*, pp. 362–369. https://tminka.github.io/papers/ep/minka-ep-uai.pdf
- [11] M. R. Morris, J. Sohl-Dickstein, N. Fiedel, T. Warkentin, A. Dafoe, A. Faust, C. Farabet, and S. Legg (2024). Position: levels of AGI for operationalizing progress on the path to AGI. In *Proceedings of the 41st International Conference on Machine Learning (ICML)*, Proceedings of Machine Learning Research, Vol. 235, pp. 36308–36321. https://proceedings.mlr.press/v235/morris24b.html
- [12] C. J. Paciorek and M. J. Schervish (2003). Nonstationary covariance functions for Gaussian process regression. In *Advances in Neural Information Processing Systems 16 (NIPS)*. https://proceedings.neurips.cc/paper/2003/hash/326a8c055c0d04f5b06544665d8bb3ea-Abstract.html
- [13] C. E. Rasmussen and C. K. I. Williams (2006). *Gaussian Processes for Machine Learning*. MIT Press, Cambridge, MA. ISBN 978-0-262-18253-9. https://doi.org/10.7551/mitpress/3206.001.0001 (PDF: https://gaussianprocess.org/gpml/chapters/RW.pdf)
- [14] Y. Ruan, H. Dong, A. Wang, S. Pitis, Y. Zhou, J. Ba, Y. Dubois, C. J. Maddison, and T. Hashimoto (2024). Identifying the risks of LM agents with an LM-emulated sandbox. In *The Twelfth International Conference on Learning Representations (ICLR)*, Spotlight. https://openreview.net/forum?id=GEcwtMk1uA
- [15] B. Settles (2009). Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison. https://research.cs.wisc.edu/techreports/2009/TR1648.pdf
- [16] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas (2016). Taking the human out of the loop: a review of Bayesian optimization. *Proceedings of the IEEE* 104(1), pp. 148–175. https://doi.org/10.1109/JPROC.2015.2494218
- [17] Y. Shavit, S. Agarwal, M. Brundage, S. Adler, C. O'Keefe, R. Campbell, T. Lee, P. Mishkin, T. Eloundou, A. Hickey, K. Slama, L. Ahmad, P. McMillan, A. Beutel, A. Passos, and D. G. Robinson (2023). Practices for governing agentic AI systems. White paper, OpenAI. https://cdn.openai.com/papers/practices-for-governing-agentic-ai-systems.pdf
- [18] T. Yuan, Z. He, L. Dong, Y. Wang, R. Zhao, T. Xia, L. Xu, B. Zhou, F. Li, Z. Zhang, R. Wang, and G. Liu (2024). R-Judge: benchmarking safety risk awareness for LLM agents. In *Findings of the Association for Computational Linguistics: EMNLP 2024*, pp. 1467–1490. https://doi.org/10.18653/v1/2024.findings-emnlp.79
- [19] Z. Zhang, S. Cui, Y. Lu, J. Zhou, J. Yang, H. Wang, and M. Huang (2024). Agent-SafetyBench: evaluating the safety of LLM agents. arXiv preprint arXiv:2412.14470. https://arxiv.org/abs/2412.14470
