Strategic Decision Support for AI Agents
Summary
This paper proposes a framework for strategic decision support for AI agents, formulating an optimization problem to minimize support usage while controlling missed-support error. The authors develop an online algorithm and calibration method, demonstrating effectiveness across information gathering, human-AI collaboration, and tool use scenarios.
# Strategic Decision Support for AI Agents
Source: [https://arxiv.org/html/2606.12587](https://arxiv.org/html/2606.12587)
Shayan Kiyani \(University of Pennsylvania, shayank@seas\.upenn\.edu\), Sima Noorani \(University of Pennsylvania, nooranis@seas\.upenn\.edu\), George Pappas \(University of Pennsylvania, pappasg@seas\.upenn\.edu\), Hamed Hassani \(University of Pennsylvania, hassani@seas\.upenn\.edu\)
###### Abstract
Traditionally, decision support studies how humans use machine learning models to make better decisions\. In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools become support mechanisms around them\. This role reversal brings reliability concerns to the forefront, since agentic errors can be consequential and agent behavior must remain aligned with human goals and constraints\. Departing from the classical view of decision support, we revisit its two basic principles, the cost–value tradeoff of seeking support and the role of uncertainty quantification, in a setting where AI agents are the central actors\. We propose a framework for *strategic decision support* for AI agents through an optimization problem that minimizes support usage subject to controlling a counterfactual missed\-support error: the probability that the agent acts alone on instances where support would have materially improved its output\. At the population level, we show that the optimal policy is a threshold rule on the *value of support*\. Building on this structure, we develop an online algorithm that adaptively thresholds such a score and uses randomized exploration to control missed\-support error without distributional assumptions\. We further introduce a calibration\-on\-the\-fly method that reduces unnecessary support calls online\. We instantiate this framework across diverse scenarios, including information gathering, human–AI collaboration, and tool use, showing how each can be modeled through the same strategic decision\-support lens\. Experiments across these settings show that our method reliably controls the target error while substantially reducing support usage in practice\.
## 1 Introduction
Decision support has long been studied in settings where a human decision maker is aided by machine learning models that provide predictive guidance\. In such systems, the model serves as a support tool, while the human remains the final actor\. Today, however, modern AI systems, such as large language models, are increasingly deployed in a different role: as autonomous agents that must act on behalf of users in complex and uncertain environments\. As a result, humans, domain\-specific tools, and auxiliary sources of information are themselves becoming support mechanisms around AI agents\.
This shift in roles brings reliability concerns to the forefront\. As AI agents take actions, their errors become more consequential: they may execute code that overwrites critical records, trigger financial transactions that move funds incorrectly, or recommend medical actions that expose patients to harm\. At the same time, as users offload more decisions to AI systems, it becomes critical to ensure that the agent’s behavior remains aligned with human goals and constraints\. These challenges call for rethinking a basic question of decision support:
*When should an AI agent act alone, and when must it seek support to avoid consequential errors or misalignment with downstream intent?*
Addressing this question requires reexamining the two principles underlying decision support systems\.
Cost–value tradeoff\. In the classical view, the support system, often an ML model, is relatively cheap to query but imperfect, since its predictions are prone to mistakes\. In the agentic setting, the reversal of roles creates a different equilibrium: *support is often costly to seek* \(e\.g\., requiring human effort or additional compute and latency\), but can be *reliable and instrumental* for avoiding decision errors and keeping the agent aligned with human intent\.
Uncertainty quantification\. At the core of any AI agent taking action lies uncertainty quantification\. For decision support, the right lens is to quantify *whether costly support is likely to substantially improve the agent’s output\.* Thus, uncertainty is a joint property of the agent *and* the support mechanism\. This differs from the common view, which frames uncertainty through the agent alone: whether the agent is confident in its action, or whether the environment and task are well\-understood\.
Figure 1: Effect of strategic decision support oversight\. Top: Without oversight, the agent confidently misdiagnoses an Alzheimer’s case as healthy and fails to seek available support, resulting in a missed\-support error\. Bottom: With oversight, support is triggered and the agent consults a clinician to gather additional interactive tests, enabling the correct diagnosis to be recovered\.

To this end, we propose a formulation that separates the *value of support* from its cost and treats support\-seeking as a strategic decision\. The value of support captures whether the supported output is better than the output the agent would have produced alone\. The central error is then *missed support*, which occurs when the agent acts alone even though support would have improved its output\. This error is counterfactual, since we only learn whether support would have helped when support is actually called\. Strategic decision support therefore becomes an optimization problem: use support as rarely as possible, while controlling the missed\-support error\. We summarize our contributions below\. Figure [1](https://arxiv.org/html/2606.12587#S1.F1) illustrates a missed\-support error \(top\) and its correction via strategic decision support oversight \(bottom\)\.
#### 1\. A framework for strategic decision support\.
In Section [3](https://arxiv.org/html/2606.12587#S3), we introduce [SDS\-Opt](https://arxiv.org/html/2606.12587#S3.Ex5), an optimization problem that minimizes the rate of support calls subject to controlling the missed\-support error\. This formulation is built around the *value of support*: whether the supported output would materially improve over the output the agent would have produced alone\. At the population level, we show that the optimal strategy has a simple threshold form over the value of support\.
#### 2\. An online algorithm with distribution\-free error control\.
In Section [4](https://arxiv.org/html/2606.12587#S4), we build on this structure to develop an online algorithm that adaptively thresholds such a score and uses randomized exploration to control the counterfactual missed\-support error without distributional assumptions\. We further introduce a calibration\-on\-the\-fly method that improves the score over time, reducing unnecessary support calls while preserving the validity of the thresholding guarantee\.
#### 3\. Applications and empirical validation\.
Our framework also provides a common language for designing support around AI agents\. We instantiate this perspective through four representative categories that serve as running examples throughout the paper:
- Ex1: Information gathering\. The agent must act from incomplete information, while additional evidence could improve its output at a cost\. Support may consist of follow\-up questions, additional evidence, or expert\-provided details\. For example, a medical assistant may reason from an initial symptom description, but further history, examination findings, or laboratory results may materially change its recommendation\.
- Ex2: Human\-in\-the\-loop planning\. The agent has general task knowledge, but lacks local context needed to produce an appropriate plan\. Support may consist of user preferences, environment\-specific constraints, or object locations\. For example, a household robot may know how to clean a room in general, but may need the resident’s input about which items are fragile, which areas to avoid, or what should be moved before cleaning\.
- Ex3: Human\-AI collaborative reasoning\. The agent can attempt the problem on its own, but may become uncertain about a specific reasoning step\. Support may consist of checking an approach, identifying an error, or suggesting a relevant technique\. For example, a math\-solving agent may ask a human to verify a key computation or point out where the argument breaks down when its reasoning appears unreliable\.
- Ex4: Tool use\. The agent faces a task where external tools provide reliable information or computation that it may not perform correctly on its own\. Support may consist of executing code, querying a database, or searching the web\. For example, an agent answering a question about a table may call a SQL engine rather than relying only on its internal reasoning\.

In Section [5](https://arxiv.org/html/2606.12587#S5), we instantiate these categories on real\-world datasets and LLM agents \(the code is publicly available at [https://github\.com/nooranisima/strategic\-decision\-support](https://github.com/nooranisima/strategic-decision-support)\), showing that our method controls the missed\-support error while substantially reducing unnecessary support calls across all four categories; Figure [2](https://arxiv.org/html/2606.12587#S1.F2) previews our results on one model per application, with the full empirical study deferred to Section [5](https://arxiv.org/html/2606.12587#S5)\.

Figure 2: Our method invokes decision support substantially less often than an LLM\-decides baseline, while matching its error rate\. For each of four agentic applications: information gathering \(DDXPlus\), tool use \(WikiSQL\), human\-in\-the\-loop planning \(VirtualHome\), and collaborative human–AI reasoning \(MATH\), all using Gemini\-2\.5\-Flash, we report two pairs of bars\. The left pair \(solid\) shows the cumulative *support rate*: the fraction of inputs on which the agent requests external support\. The right pair \(hatched\) shows the cumulative *missed\-support error rate*: cases where support would have materially improved the output but was not requested\. In each pair, the left bar is the LLM baseline \(the agent deciding for itself when to ask for support\) and the right bar is our best learned method\. Across all four applications, our method invokes support far less often than the baseline while maintaining a comparable or lower error rate\.
## 2 Related Works
We briefly discuss closely related works here and defer a more detailed discussion to Appendix [7](https://arxiv.org/html/2606.12587#S7)\.
Training agents to seek support\. One body of work bakes support\-seeking behavior into the agent itself through training, both on the tool\-use and retrieval side \[[70](https://arxiv.org/html/2606.12587#bib.bib70),[5](https://arxiv.org/html/2606.12587#bib.bib5),[33](https://arxiv.org/html/2606.12587#bib.bib33),[32](https://arxiv.org/html/2606.12587#bib.bib32),[28](https://arxiv.org/html/2606.12587#bib.bib28),[61](https://arxiv.org/html/2606.12587#bib.bib61)\] and on the user\-interaction side \[[42](https://arxiv.org/html/2606.12587#bib.bib42),[91](https://arxiv.org/html/2606.12587#bib.bib91),[2](https://arxiv.org/html/2606.12587#bib.bib2),[85](https://arxiv.org/html/2606.12587#bib.bib85)\]\. These methods produce stronger agents: a model that already knows when to call SQL, or that has been fine\-tuned to ask the right clarifying questions, is precisely the kind of base policy our framework sits on top of\. Our experiments use frontier models that already incorporate substantial training of this kind, and our gains reflect what is achievable as an oversight layer on top of agents that are already strong at the underlying support modalities\.
Inference\-time policies for support\-seeking\. A closely related line of work designs inference\-time policies that decide whether an agent should pause to seek information before acting\. Some threshold the agent’s own confidence in its answer \[[39](https://arxiv.org/html/2606.12587#bib.bib39),[85](https://arxiv.org/html/2606.12587#bib.bib85),[68](https://arxiv.org/html/2606.12587#bib.bib68)\]; others compute the expected utility of asking a clarification under an explicit cost \[[19](https://arxiv.org/html/2606.12587#bib.bib19)\]; and others use offline\-calibrated prediction sets over candidate user intents to trigger help\-seeking \[[67](https://arxiv.org/html/2606.12587#bib.bib67)\]\. A complementary line considers support\-seeking for *verification* rather than generation \[[36](https://arxiv.org/html/2606.12587#bib.bib36)\]\. We instead provide a unifying framework that brings these approaches under the same design principles, through an oversight layer with rigorous finite\-sample error control\. This enables us to handle a broad set of support modalities within a single algorithm, operate fully online, and control the counterfactual missed\-support error at a user\-chosen level, which we test in our experiments across information gathering, tool use, and human\-AI collaboration\.
## 3 Fundamentals of Strategic Decision Support
In this section, we introduce the central objectives of strategic decision support at the population level\. We then characterize the optimal support\-seeking strategy in this regime\. These results form the foundation for the practical sequential algorithm developed in Section [4](https://arxiv.org/html/2606.12587#S4), where the population quantities are unknown and support decisions must be made online\.
We begin by modeling the interaction between an AI agent and a decision\-support mechanism\. Let $X \sim P_X$ denote all information available to the agent at decision time, including the user prompt, task description, and any other available input modalities\. Given $X = x$, the agent can first produce an unsupported output

$$Y_0 \sim \pi(\cdot \mid x, \texttt{"no support"}),$$

corresponding to the response it would generate on its own\. After observing this initial output, the agent must decide whether to settle for it or seek support\. \(One may avoid or only partially generate the unsupported output $Y_0$ before deciding whether to seek support\. Our framework and guarantees are robust to such choices; we discuss these further in Section [4](https://arxiv.org/html/2606.12587#S4) and Appendix [B\.9](https://arxiv.org/html/2606.12587#A2.SS9)\.\) A support\-seeking strategy is therefore a function

$$a : \mathcal{X} \times \mathcal{Y} \to \{0, 1\},$$

where $a(x, y_0) = 0$ means that the agent acts on its own and keeps the unsupported output, while $a(x, y_0) = 1$ means that it seeks support and produces a supported output

$$Y_1 \sim \pi(\cdot \mid x, y_0, \texttt{"with support"}).$$

For example, in [Ex1](https://arxiv.org/html/2606.12587#S1.I1.i1), $x$ may contain a patient’s initial symptoms and description of their condition, while $y_0$ is the medical assistant’s initial recommendation\. The strategy then decides whether to proceed with this recommendation or seek additional clinical evidence before producing a supported response\. This abstraction lets us formalize both the value and the cost of support\.
#### Value of support\.
We begin by introducing a *value indicator*

$$g(X, Y_0, Y_1) \in \{0, 1\},$$

which indicates whether, after comparing the unsupported outcome $Y_0$ and the supported outcome $Y_1$, support is judged to have been materially beneficial on that instance\. Thus, $g = 1$ means that support helped, while $g = 0$ means that it did not\. This notion is deliberately separate from the cost of seeking support: $g$ evaluates only whether the supported outcome is better, according to the provider’s or downstream user’s notion of performance\.

This indicator induces the central population quantity in our framework, the *value of support*,

$$\operatorname{val}(x, y_0) := \mathbb{P}\left(g(X, Y_0, Y_1) = 1 \mid X = x,\, Y_0 = y_0\right).$$

In words, $\operatorname{val}(x, y_0)$ is the probability that calling support would produce a substantially better outcome after observing the input instance and the agent’s unsupported output\.

The definition of $g$ depends on what it means for support to materially improve the output\. In [Ex1](https://arxiv.org/html/2606.12587#S1.I1.i1), [Ex3](https://arxiv.org/html/2606.12587#S1.I1.i3), and [Ex4](https://arxiv.org/html/2606.12587#S1.I1.i4), where the output has a verifiable final answer, a natural choice is to set $g = 1$ only when $Y_0$ is incorrect and $Y_1$ is correct, and to set $g = 0$ in all other cases\. In [Ex2](https://arxiv.org/html/2606.12587#S1.I1.i2), where the output is a structured plan, $g$ can instead indicate whether the supported output improves a task\-specific quality metric beyond a chosen threshold\. More generally, when the support mechanism is reliable, the simpler choice of $g = 1$ whenever $Y_1 \neq Y_0$ is often a strong proxy, since a change in output is itself indicative of a material difference\. We discuss precise implementations of $g$ in Section [5\.2](https://arxiv.org/html/2606.12587#S5.SS2)\.
#### Error of missed support\.
From the viewpoint of providers and downstream decision\-makers, missing support when it would have helped is often the most consequential failure, since better outputs may directly translate into better downstream decisions\. Thus our guiding principle is that support should be used whenever it would materially improve the output\. This leads to an asymmetric notion of error: what matters is how often the agent fails to seek support on the very instances where support would have helped\. For a strategy $a$, we therefore define the *missed\-support error* as

$$\mathbb{P}\big(a(X, Y_0) = 0 \mid g(X, Y_0, Y_1) = 1\big).$$

It measures the probability that the agent acts alone, conditioned on support being beneficial\.
#### Cost of support\.
Support, however, is not free\. It may require additional latency, computation, API usage, or human effort\. Thus, the trivial strategy that always seeks support is typically unacceptable, even though it drives the missed\-support error to zero\. The goal is therefore to control missed\-support error while using support only when necessary\. To capture this, we measure the cost of a strategy $a$ by its support rate

$$\mathbb{E}\big[\mathbf{1}\{a(X, Y_0) = 1\}\big],$$

namely, the population probability that the agent seeks support\. \(We define cost as the frequency of support calls\. In practice, however, the cost of support may depend on the instance\. We defer these finer\-grained formulations to future work\.\)
Putting these pieces together, we arrive at the following population\-level formulation\. We seek a strategy that minimizes how often support is used, while guaranteeing that the agent rarely skips support on the instances where support would have been beneficial:
Strategic Decision Support Optimization \(SDS\-Opt\):

$$\begin{aligned} \underset{a:\,\mathcal{X}\times\mathcal{Y}\to\{0,1\}}{\text{minimize}} \quad & \mathbb{E}_{X}\bigl[\mathbf{1}\{a(X, Y_0) = 1\}\bigr] \\ \text{subject to} \quad & \mathbb{P}\bigl(a(X, Y_0) = 0 \mid g(X, Y_0, Y_1) = 1\bigr) \;\leq\; \varepsilon \end{aligned}$$

The parameter $\varepsilon \in [0, 1]$ specifies the tolerated level of missed\-support error\. Smaller values of $\varepsilon$ require the strategy to seek support on a larger fraction of instances where support would help, and therefore lead to more frequent support calls\. Hence, this formulation captures the balance: support should be used sparingly, but not at the expense of missing the instances where it is genuinely needed\.
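To make the two quantities in (SDS-Opt) concrete, the following minimal Python sketch evaluates the support rate and the missed-support error of a strategy on a small labeled sample. The sample, the `threshold_strategy` rule, and all function names are hypothetical and chosen only for illustration. In particular, the sketch assumes the counterfactual indicator $g$ is known for every instance, which does not hold in deployment.

```python
from typing import Callable, List, Tuple

# Each record is (x, y0, g). Here x is a hypothetical scalar feature, y0 a
# placeholder output, and g the value indicator, assumed known for illustration.
Record = Tuple[float, str, int]
Strategy = Callable[[float, str], int]


def support_rate(strategy: Strategy, data: List[Record]) -> float:
    """Fraction of instances on which the strategy seeks support (a = 1)."""
    return sum(strategy(x, y0) for x, y0, _ in data) / len(data)


def missed_support_error(strategy: Strategy, data: List[Record]) -> float:
    """Fraction of beneficial instances (g = 1) on which the strategy acts alone."""
    beneficial = [(x, y0) for x, y0, g in data if g == 1]
    if not beneficial:
        return 0.0  # no beneficial instances, so nothing can be missed
    missed = sum(1 - strategy(x, y0) for x, y0 in beneficial)
    return missed / len(beneficial)


def threshold_strategy(x: float, y0: str) -> int:
    """Hypothetical rule: seek support when the feature is at least 0.5."""
    return 1 if x >= 0.5 else 0


def always_support(x: float, y0: str) -> int:
    """Trivial strategy: zero missed-support error at maximal cost."""
    return 1


data: List[Record] = [
    (0.9, "a", 1), (0.8, "b", 1), (0.7, "c", 0), (0.6, "h", 1),
    (0.4, "d", 1), (0.3, "e", 0), (0.2, "f", 0), (0.1, "g", 0),
]

print(support_rate(threshold_strategy, data))          # 0.5
print(missed_support_error(threshold_strategy, data))  # 0.25
print(support_rate(always_support, data))              # 1.0
print(missed_support_error(always_support, data))      # 0.0
```

On this toy sample the threshold rule halves the support rate relative to always seeking support, at the price of missing one of the four beneficial instances. This is the tradeoff that $\varepsilon$ regulates.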
We now characterize the optimal solution of \([SDS\-Opt](https://arxiv.org/html/2606.12587#S3.Ex5)\)\.
###### Theorem 3\.1\.
There exists an optimal solution to \(SDS\-Opt\) of the form

$$a^{\star}(x, y_0) = \mathbf{1}\{\operatorname{val}(x, y_0) > \tau^{\star}\},$$

with possible randomization on the boundary, i\.e\., when $\operatorname{val}(x, y_0) = \tau^{\star}$, if needed\.

Theorem [3\.1](https://arxiv.org/html/2606.12587#S3.Thmtheorem1) identifies $\operatorname{val}(x, y_0)$ as the fundamental quantity for support seeking\. The optimal strategy thresholds this quantity, i\.e\., support should be sought when its value is “high enough\.”

This characterization forms the foundation of our online method\. Since $\operatorname{val}(x, y_0)$ is generally unknown in practice, the core task becomes to approximate it and threshold it adaptively\. The algorithm in the next section does exactly this: it uses a score function as a proxy for the value of support, refines this score through calibration\-on\-the\-fly, and updates a decision threshold online to determine when support should be sought\.
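As an illustration of the threshold structure in Theorem 3.1, the sketch below searches over deterministic thresholds on a small discrete population in which the value of support is assumed known. The population, the function names, and the candidate grid are hypothetical. The sketch also omits the boundary randomization permitted by the theorem, so its answer can be more conservative than the true optimum.

```python
from typing import List, Tuple

# Hypothetical population: each entry is (probability mass, val) for one
# group of instances. The masses sum to 1.
Population = List[Tuple[float, float]]


def threshold_rule_stats(population: Population, tau: float) -> Tuple[float, float]:
    """Support rate and missed-support error of the rule a = 1{val > tau}.

    Since P(g = 1 | group) = val, the joint probability of acting alone on a
    beneficial instance is the sum of mass * val over groups with val <= tau.
    """
    total_benefit = sum(p * v for p, v in population)
    missed = sum(p * v for p, v in population if v <= tau)
    rate = sum(p for p, v in population if v > tau)
    error = missed / total_benefit if total_benefit > 0 else 0.0
    return rate, error


def best_deterministic_threshold(population: Population, eps: float) -> Tuple[float, float]:
    """Return (support rate, tau) with the smallest rate among feasible thresholds."""
    # tau = -1 corresponds to always seeking support, which is always feasible.
    candidates = [-1.0] + sorted({v for _, v in population})
    feasible = []
    for tau in candidates:
        rate, error = threshold_rule_stats(population, tau)
        if error <= eps:
            feasible.append((rate, tau))
    return min(feasible)


population: Population = [(0.4, 0.05), (0.3, 0.2), (0.2, 0.6), (0.1, 0.9)]
rate, tau = best_deterministic_threshold(population, eps=0.1)
print(rate, tau)
```

With these numbers, skipping support on the lowest-value group alone keeps the missed-support error below 0.1, while also skipping the second group would exceed it, so the search settles on the threshold 0.05.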
## 4 Online Algorithm and Guarantees
In this section, we design *Strategic Oversight for Support\-seeking* \(SOS\), an online algorithm for deciding when an AI agent should seek support\. The population result of Section [3](https://arxiv.org/html/2606.12587#S3) suggests a simple principle: estimate the value of support and seek support when this value is high enough\. SOS turns this principle into a rigorous online procedure, where the value of support is estimated and calibrated online, and support decisions are made based on a threshold that is debiased sequentially\.
We fix an AI agent and a support mechanism, and make no distributional assumptions on the data, the agent’s behavior, or the behavior of the support mechanism\. Building on the structural result of Section [3](https://arxiv.org/html/2606.12587#S3), SOS uses a score function

$$s_{\theta} : \mathcal{X} \times \mathcal{Y} \to [0, 1],$$

where $s_{\theta}(x, y)$ evaluates an input $x$ together with a candidate output $y$\. The score is intended to approximate the value of support, $\operatorname{val}(x, y)$, from Theorem [3\.1](https://arxiv.org/html/2606.12587#S3.Thmtheorem1)\. Here, $\theta$ denotes the parameters of the score function; these parameters may be fixed in advance, pretrained, or learned online as feedback is collected\. Concrete choices and parameterizations of the score function are discussed in Section [5](https://arxiv.org/html/2606.12587#S5)\.
Online interaction pipeline\. At each round $t = 1, 2, \dots$:

1. The agent receives an input $x_t$, produces an unsupported output $y_0^t$, and computes a score $s_t := s_{\theta_t}(x_t, y_0^t)$\.
2. Based on $s_t$, the policy chooses an action $a_t \in \{0, 1\}$, where $a_t = 0$ means proceeding without seeking support, and $a_t = 1$ means seeking support\.
3. If $a_t = 0$, the agent finalizes the unsupported outcome $y_0^t$ and proceeds to the next round\.
4. If $a_t = 1$, the agent seeks support, produces the supported outcome $y_1^t$, and computes $g_t := g(x_t, y_0^t, y_1^t) \in \{0, 1\}$, which indicates whether support was beneficial on round $t$\.
Our goal is to control the empirical missed\-support error at a user\-specified level $\alpha \in (0, 1)$:

$$\widehat{\mathrm{MSE}}(T) := \frac{\sum_{t=1}^{T} g_t (1 - a_t)}{\sum_{t=1}^{T} g_t} \leq \alpha. \tag{1}$$

\(We adopt the convention $\widehat{\mathrm{MSE}}(T) = 0$ when $\sum_{t=1}^{T} g_t = 0$\.\) This error is the fraction of beneficial\-support rounds on which the policy nevertheless acts alone\. Two remarks are in order\.

- We modeled the online interaction in a form where the support decision is made using both the input $x_t$ and the unsupported output $y_0^t$\. In practice, one may choose to avoid generating, or only partially generate, the unsupported outcome before deciding whether to seek support\. On rounds where support is sought, one may likewise be able to compute the value of $g$ without generating all of $y_0^t$\. Our framework is easily tuned to such operational choices\. We discuss these practical variants and their implications for cost and performance in more detail in Appendix [B\.9](https://arxiv.org/html/2606.12587#A2.SS9)\.
- The error we seek to control depends on $g_t$, but $g_t$ is only revealed on rounds where support is sought, since computing it requires comparing $y_0^t$ and $y_1^t$\. Thus, the relevant error is fundamentally counterfactual and only selectively observed\. This partial\-feedback structure is exactly what makes randomization necessary in the online algorithm\.
Algorithm\. At each round $t$, we maintain a threshold $\lambda_t$ and define the support\-seeking probability

$$p_t := \begin{cases} 1, & s_t \geq \lambda_t, \\ \mu, & s_t < \lambda_t, \end{cases}$$

where $\mu \in (0, 1)$ is a fixed exploration parameter\. The action is then sampled as $a_t \sim \mathrm{Bernoulli}(p_t)$\.

Thus, when the score is at or above the current threshold, the algorithm always seeks support; when the score falls below the threshold, it still seeks support with a small probability $\mu$ in order to obtain feedback about whether support would have been helpful\.
On rounds where support is sought, we observe $g_t$, and the threshold is updated as

$$\lambda_{t+1} \leftarrow \lambda_t - \eta_t \cdot \frac{g_t a_t}{p_t} \Big[(1 - p_t)\,\mathbf{1}\{s_t < \lambda_t\} - \alpha\Big],$$

where $\eta_t > 0$ is a step size and $\alpha \in (0, 1)$ is the target missed\-support error level\.
This update resembles online quantile\-tracking: the quantity inside the brackets plays the role of an error\-minus\-target signal\. In a deterministic threshold policy, this signal would be $\mathbf{1}\{s_t < \lambda_t\} - \alpha$, since rounds below the threshold are exactly those on which the agent proceeds without support\. Here, however, the action is randomized: even when $s_t < \lambda_t$, support is still sought with probability $p_t = \mu$\. Thus the realized missed\-support event is not simply $\mathbf{1}\{s_t < \lambda_t\}$, but rather $(1 - a_t)\,\mathbf{1}\{s_t < \lambda_t\}$\. Since the only randomness comes from the algorithm’s internal exploration, we have

$$\mathbb{E}\left[(1 - a_t)\,\mathbf{1}\{s_t < \lambda_t\} \mid \mathcal{F}_{t-1}, x_t, s_t\right] = (1 - p_t)\,\mathbf{1}\{s_t < \lambda_t\},$$

where the expectation is taken only over the Bernoulli draw $a_t \sim \mathrm{Bernoulli}(p_t)$\. This explains the extra factor $(1 - p_t)$ in the update\. The prefactor $g_t a_t / p_t$ then serves as an importance\-weighted correction for the fact that $g_t$ is observed only on rounds where support is sought\.
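The importance-weighting argument above can be checked numerically. The Monte Carlo sketch below fixes a single round, using hypothetical values for $g_t$, $\mu$, and $\alpha$, and compares the average of the importance-weighted term used in the update with the average of the full-information signal $g_t[(1 - a_t)\mathbf{1}\{s_t < \lambda_t\} - \alpha]$ that would be available if $g_t$ were always revealed. The function names and the simulation setup are assumptions made for illustration.

```python
import random
from typing import Tuple


def weighted_signal(g: int, a: int, p: float, below: int, alpha: float) -> float:
    """Importance-weighted update term; it only needs g when a == 1."""
    return (g * a / p) * ((1 - p) * below - alpha)


def full_info_signal(g: int, a: int, below: int, alpha: float) -> float:
    """Missed-minus-target signal that would require observing g on every round."""
    return g * ((1 - a) * below - alpha)


def monte_carlo(g: int, below: int, mu: float, alpha: float,
                n: int, seed: int = 0) -> Tuple[float, float]:
    """Average both signals over n independent Bernoulli(p_t) draws of the action."""
    rng = random.Random(seed)
    p = mu if below else 1.0  # p_t = mu below the threshold, 1 otherwise
    weighted_sum = 0.0
    full_sum = 0.0
    for _ in range(n):
        a = 1 if rng.random() < p else 0
        weighted_sum += weighted_signal(g, a, p, below, alpha)
        full_sum += full_info_signal(g, a, below, alpha)
    return weighted_sum / n, full_sum / n


# Beneficial round below the threshold: both averages should be close to
# (1 - mu) - alpha = 0.7 for mu = 0.2 and alpha = 0.1.
w, f = monte_carlo(g=1, below=1, mu=0.2, alpha=0.1, n=200_000)
print(w, f)
```

The weighted term has a larger variance than the full-information signal because it takes the value $0.7/\mu$ on the rare explored rounds and zero otherwise. This is consistent with the $1/\mu$ factors that appear in the finite-sample bound.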
Algorithm 1: Strategic Oversight for Support\-seeking \(SOS\)

Require: target $\alpha \in (0, 1)$; exploration parameter $\mu \in (0, 1)$; step sizes $\{\eta_t\}_{t \geq 1}, \{\gamma_t\}_{t \geq 1}$; initial threshold $\lambda_1$; initial score parameter $\theta_1$\.

- for $t = 1, 2, \dots$ do
  - Receive input $x_t$ and generate $y_0^t$\.
  - Compute score $s_t = s_{\theta_t}(x_t, y_0^t) \in [0, 1]$\.
  - Support\-seeking probability and action: set $p_t \leftarrow \mu + (1 - \mu)\,\mathbf{1}\{s_t \geq \lambda_t\}$ and sample $a_t \sim \mathrm{Bernoulli}(p_t)$\.
  - Decision and feedback:
    - if $a_t = 0$ then
      - $\lambda_{t+1} \leftarrow \lambda_t$, $\theta_{t+1} \leftarrow \theta_t$
    - else
      - Seek support and compute $y_1^t$\.
      - Compute $g_t = g(x_t, y_0^t, y_1^t) \in \{0, 1\}$\.
      - Threshold update and calibration\-on\-the\-fly:

        $$\lambda_{t+1} \leftarrow \lambda_t - \eta_t\,\frac{g_t}{p_t}\Big((1 - p_t)\,\mathbf{1}\{s_t < \lambda_t\} - \alpha\Big)$$

        $$\theta_{t+1} \leftarrow \theta_t - \gamma_t\,\frac{a_t}{p_t}\,\nabla_{\theta}\left(s_{\theta}(x_t, y_0^t) - g_t\right)^2\Big|_{\theta = \theta_t}$$

    - end if
- end for
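The loop of Algorithm 1 can be sketched in a few lines of Python. The version below keeps the score fixed (no calibration-on-the-fly) and runs on synthetic data in which the score is drawn uniformly on $[0, 1]$ and $g_t$ is Bernoulli with that probability. This data model, the function name `run_sos`, and the parameter values are assumptions made only for illustration and are not the paper's experimental setup. The simulator knows $g_t$ on every round so that it can report the empirical missed-support error, but the threshold update uses $g_t$ only on rounds where support is sought.

```python
import random
from typing import Tuple


def run_sos(T: int, alpha: float, mu: float, eta: float,
            lam: float = 0.5, seed: int = 0) -> Tuple[float, float, float]:
    """Simulate the SOS threshold update with a fixed score.

    Returns (empirical missed-support error, support rate, final threshold).
    """
    rng = random.Random(seed)
    missed = 0      # beneficial rounds on which the policy acted alone
    beneficial = 0  # rounds with g_t = 1
    calls = 0       # rounds on which support was sought
    for _ in range(T):
        s = rng.random()                  # synthetic score s_t
        g = 1 if rng.random() < s else 0  # counterfactual g_t, hidden unless a_t = 1
        below = 1 if s < lam else 0
        p = mu + (1 - mu) * (1 - below)   # p_t = 1 at or above threshold, mu below
        a = 1 if rng.random() < p else 0
        calls += a
        beneficial += g
        missed += g * (1 - a)
        if a == 1:
            # Threshold update of Algorithm 1; g_t is observed because a_t = 1.
            lam -= eta * (g / p) * ((1 - p) * below - alpha)
    mse = missed / beneficial if beneficial else 0.0
    return mse, calls / T, lam


mse, rate, lam = run_sos(T=20_000, alpha=0.1, mu=0.1, eta=0.01)
print(f"missed-support error={mse:.3f}, support rate={rate:.3f}, threshold={lam:.3f}")
```

Following the pseudocode, the sketch does not clip the threshold to $[0, 1]$. Replacing the fixed synthetic score with a learned one would change only the line that produces `s`.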
Calibration\-on\-the\-fly: Alongside the threshold update, we also allow the score itself to be updated online\. Specifically, we treat the score as parameterized by $\theta_t$ and update

$$\theta_{t+1} \leftarrow \theta_t - \gamma_t\,\frac{a_t}{p_t}\,\nabla_{\theta}\left(s_{\theta}(x_t, y_0^t) - g_t\right)^2\Big|_{\theta = \theta_t},$$

where $\gamma_t > 0$ is a step size\. The factor $a_t / p_t$ again provides the appropriate importance\-weighted correction, since $g_t$ is observed only on rounds where support is sought\. At this level, we view this as an abstract online calibration procedure for improving how well the score tracks the latent value of support; concrete score parameterizations and design choices will be discussed in the next section\.
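For a concrete instance of this update, the sketch below assumes a sigmoid-linear score $s_{\theta} = \sigma(\theta^{\top}\phi)$ over a fixed feature vector $\phi$. The parameterization, the function names, and the numeric values are assumptions for illustration. For this score the gradient of the squared error is $2(s - g)\,s(1 - s)\,\phi$, which the step multiplies by the importance weight $a_t / p_t$.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Logistic function; adequate for the moderate values of z used here."""
    return 1.0 / (1.0 + math.exp(-z))


def score(theta: List[float], phi: List[float]) -> float:
    """Sigmoid-linear score s_theta = sigmoid(theta . phi), which lies in (0, 1)."""
    return sigmoid(sum(t * f for t, f in zip(theta, phi)))


def calibration_step(theta: List[float], phi: List[float], g: int,
                     a: int, p: float, gamma: float) -> List[float]:
    """One importance-weighted gradient step on the squared error (s - g)^2."""
    if a == 0:
        return list(theta)  # g_t is unobserved, so the parameters stay unchanged
    s = score(theta, phi)
    # d/dtheta (s - g)^2 = 2 (s - g) * s (1 - s) * phi for the sigmoid-linear score
    coeff = 2.0 * (s - g) * s * (1.0 - s)
    return [t - gamma * (a / p) * coeff * f for t, f in zip(theta, phi)]


theta = [0.0, 0.0]
phi = [1.0, 2.0]  # hypothetical feature vector for one round
new_theta = calibration_step(theta, phi, g=1, a=1, p=0.5, gamma=0.1)
print(score(theta, phi), score(new_theta, phi))
```

On this round support helped ($g_t = 1$) while the initial score was 0.5, so the step moves the parameters in the direction that raises the score on this feature vector.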
The next result shows that the threshold update rule in Algorithm 1 yields a distribution\-free finite\-sample guarantee for controlling the empirical missed\-support error\.
###### Theorem 4\.1\.
Run Algorithm [1](https://arxiv.org/html/2606.12587#alg1) with constant step size $\eta_t \equiv \eta > 0$, exploration parameter $\mu \in (0, 1)$, target level $\alpha \in (0, 1 - \mu)$, and initial threshold $\lambda_1 \in [0, 1]$\. Fix any horizon $T \geq 1$, and define $N_g(T) := \sum_{t=1}^{T} g_t$\. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the algorithm’s randomness,

$$\widehat{\mathrm{MSE}}(T) \leq \alpha + \Delta(N_g(T), \delta),$$

where, for $N \geq 1$,

$$\Delta(N, \delta) := \frac{1 + 2\eta/\mu}{\eta N} + \sqrt{\frac{8\log(4/\delta)}{\mu N}} + \frac{4\log(4/\delta)}{3\mu N},$$

and $\Delta(0, \delta) := 0$\.
Theorem 4\.1 shows that the empirical missed\-support error is controlled at the target level $\alpha$ up to a finite\-sample term\. The bound has two qualitatively different sources\. The first term, of order $1/(\eta N)$, is the intrinsic error of online quantile tracking and would remain even if the value\-of\-support signal were always revealed \[[27](https://arxiv.org/html/2606.12587#bib.bib27),[3](https://arxiv.org/html/2606.12587#bib.bib3),[65](https://arxiv.org/html/2606.12587#bib.bib65)\]\. The remaining terms arise from the randomized exploration needed to obtain unbiased feedback\. The dependence on $\mu$ makes the tradeoff explicit: larger exploration improves error control, but increases support usage\. Finally, the validity guarantee comes from the threshold update rule and does not rely on calibration\-on\-the\-fly, so it holds whether the score parameters are fixed or updated online\. The role of calibration\-on\-the\-fly is instead to improve the score over time and reduce unnecessary support calls, as we demonstrate empirically\.
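To get a feel for the size of the finite-sample term, the sketch below evaluates $\Delta(N, \delta)$ exactly as defined in Theorem 4.1. The function name `delta_bound` and the parameter values in the usage lines are hypothetical choices for illustration.

```python
import math


def delta_bound(N: int, delta: float, eta: float, mu: float) -> float:
    """Finite-sample slack Delta(N, delta) from Theorem 4.1."""
    if N == 0:
        return 0.0  # convention Delta(0, delta) := 0
    log_term = math.log(4.0 / delta)
    tracking = (1.0 + 2.0 * eta / mu) / (eta * N)   # online quantile-tracking term
    deviation = math.sqrt(8.0 * log_term / (mu * N))  # exploration noise, order N^(-1/2)
    correction = 4.0 * log_term / (3.0 * mu * N)      # lower-order exploration term
    return tracking + deviation + correction


# The slack shrinks as the number of beneficial rounds N grows, and as the
# exploration parameter mu increases.
for N in (100, 1_000, 10_000):
    print(N, delta_bound(N, delta=0.1, eta=0.01, mu=0.1))
print(delta_bound(1_000, delta=0.1, eta=0.01, mu=0.2))
```

For large $N$ the square-root term dominates, so with the other parameters fixed, halving the slack takes roughly four times as many beneficial rounds.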
## 5Experiments
This section is organized as follows. Section [5.1](https://arxiv.org/html/2606.12587#S5.SS1) introduces three families of score designs. Section [5.2](https://arxiv.org/html/2606.12587#S5.SS2) describes the four tasks, datasets, and baseline methods. Section [5.3](https://arxiv.org/html/2606.12587#S5.SS3) empirically validates Theorem [4.1](https://arxiv.org/html/2606.12587#S4.Thmtheorem1), showing that the algorithm controls the missed-support error at the target level $\alpha$ across all score families. Section [5.4](https://arxiv.org/html/2606.12587#S5.SS4) then fixes the missed-support error across methods and compares how often each invokes support. At matched error, the method that calls support less often better identifies when support is needed, and the gap between methods reflects the value of richer score signals.
### 5\.1Score Design
Algorithm [1](https://arxiv.org/html/2606.12587#alg1) is agnostic to how the score $s_\theta(x)$ is constructed. Since the threshold update controls the missed-support error at level $\alpha$ regardless of the score’s quality (Theorem [4.1](https://arxiv.org/html/2606.12587#S4.Thmtheorem1)), the score’s role is to reduce how often support is called by separating instances where support would be beneficial from those where it would not. We evaluate three score families, in increasing order of expressivity.
- Confidence score. The simplest family takes the score directly from the black-box signal, with no learnable parameters or calibration-on-the-fly: $s(x) := \hat{g}_{\text{bb}}(x)$. The signal is thresholded as-is, and only $\lambda$ in Algorithm [1](https://arxiv.org/html/2606.12587#alg1) is adapted online. This natural training-free baseline serves as a reference for parameterized scores, and reappears as the anchor in the anchored family below.
- Representation score. The second family parameterizes the score as a linear function of a frozen embedding, $s_\theta(x) := \sigma(\theta^\top \phi(x))$. Here $\phi(x) \in \mathbb{R}^d$ embeds the input, $\sigma(\cdot)$ is the sigmoid that keeps the score in $[0,1]$, and only the linear coefficients $\theta \in \mathbb{R}^d$ are updated by calibration-on-the-fly. This is a linear probe over a representation that summarizes the input without committing to a particular notion of confidence; calibration-on-the-fly learns which directions in representation space predict whether support helps. The embedding $\phi$ can be any pretrained text representation. In the *black-box* setting, $\phi$ comes from a separate frozen encoder applied to the input. In the *white-box* setting, we can also use the LLM’s hidden states. The input to $\phi$ is itself a modeling choice; we use the prompt $x$ throughout and study alternatives (such as also including $y_0$ when generated) in Appendix [B.9](https://arxiv.org/html/2606.12587#A2.SS9).
- Anchored score. The final family combines the parameterized linear term with the black-box signal in logit space: $s_\theta(x) := \sigma\bigl(\mathrm{logit}(\hat{g}_{\text{bb}}(x)) + \theta^\top \phi(x)\bigr)$. Here $\hat{g}_{\text{bb}}$ acts as an *anchor* providing an initial estimate of the value of support, and $\theta^\top \phi(x)$ learns a residual correction in logit space. When the anchor is already well-aligned with $g$, the anchored score inherits its quality and calibration-on-the-fly need only learn small adjustments. When the anchor is uninformative or systematically biased, the linear term can override it.
The Representation and Anchored families are each instantiated with three embedding choices, summarized below\. Black\-box \(BB\) variants apply a separate frozen encoder to the input; the white\-box \(WB\) variant uses the LLM’s own hidden state at the final input token\.
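The three score families can be written down in a few lines. The sketch below is a minimal illustration under stated assumptions: the embedding `phi` is passed in as a plain list of floats (the encoder itself is out of scope), and the clipping constant `eps` inside `logit` is a hypothetical choice added to keep the anchor finite when the black-box signal is exactly 0 or 1. None of the function names come from the paper.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p: float, eps: float = 1e-6) -> float:
    """Inverse sigmoid; eps-clipping is an assumption, not from the paper."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def dot(theta: Sequence[float], phi: Sequence[float]) -> float:
    if len(theta) != len(phi):
        raise ValueError("theta and phi must have the same dimension")
    return sum(t * f for t, f in zip(theta, phi))


def confidence_score(g_bb: float) -> float:
    """s(x) = g_bb(x): the black-box signal, used as-is."""
    return g_bb


def representation_score(theta: Sequence[float], phi: Sequence[float]) -> float:
    """s_theta(x) = sigmoid(theta^T phi(x))."""
    return sigmoid(dot(theta, phi))


def anchored_score(theta: Sequence[float], phi: Sequence[float], g_bb: float) -> float:
    """s_theta(x) = sigmoid(logit(g_bb(x)) + theta^T phi(x))."""
    return sigmoid(logit(g_bb) + dot(theta, phi))
```

With $\theta = 0$ the anchored score reduces to the confidence score (up to clipping), which is why the anchor serves as the starting point that the linear term then corrects.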
### 5\.2Tasks and Baselines
#### Tasks\.
We instantiate each of the application categories from Section [1](https://arxiv.org/html/2606.12587#S1) on a concrete dataset. Task specifications and prompts are deferred to Appendix [B.8](https://arxiv.org/html/2606.12587#A2.SS8).
Information gathering on DDXPlus \[[77](https://arxiv.org/html/2606.12587#bib.bib77)\] (medical diagnosis): A diagnostic agent receives a patient’s chief complaint and initial symptoms as $x$; support reveals follow-up questions, examination findings, and laboratory results. We set $g=1$ when $y_0$ is an incorrect diagnosis and $y_1$ is correct against the ground-truth pathology.
Tool use on WikiSQL \[[92](https://arxiv.org/html/2606.12587#bib.bib92)\]: An agent receives a natural-language question about a table as $x$; support consists of formulating and executing SQL queries against it. We set $g=1$ when $y_0$ is incorrect and $y_1$ matches the dataset’s expected answer.
Human-in-the-loop planning on VirtualHome \[[59](https://arxiv.org/html/2606.12587#bib.bib59)\]: A household robot receives a task description as $x$ and must output a plan as a sequence of actions; support reveals scene-specific object locations and constraints from the resident. We set $g=1$ when the longest-common-subsequence (LCS) overlap of $y_1$ with the gold action sequence exceeds that of $y_0$.
Human-AI collaborative reasoning on Level 4–5 problems from MATH \[[30](https://arxiv.org/html/2606.12587#bib.bib30)\]: A reasoning agent attempts to solve math problems; support comes from a stronger reasoner that provides targeted guidance on a step the agent identifies as uncertain, without revealing the full solution. We set $g=1$ when $y_0$ is incorrect and $y_1$ is correct.
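For the planning task, the label $g$ depends on an LCS comparison rather than exact-match correctness. The sketch below shows one way to compute it, assuming plans are lists of action strings and that "overlap" means the raw LCS length; the text above does not say whether the overlap is normalized. Dividing both overlaps by the gold length would not change the label, since both plans are compared against the same gold sequence. The action names in the usage example are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def support_helped(y0: Sequence[str], y1: Sequence[str], gold: Sequence[str]) -> int:
    """g = 1 when the supported plan y1 overlaps the gold plan more than y0 does."""
    return int(lcs_length(y1, gold) > lcs_length(y0, gold))


if __name__ == "__main__":
    # Hypothetical action names, for illustration only.
    gold = ["walk_to_kitchen", "open_fridge", "grab_milk", "close_fridge"]
    y0 = ["walk_to_kitchen", "grab_milk"]
    y1 = ["walk_to_kitchen", "open_fridge", "grab_milk"]
    print(support_helped(y0, y1, gold))
```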
We run all four tasks with three frontier base agents spanning both regimes: Qwen-2.5-7B \[[63](https://arxiv.org/html/2606.12587#bib.bib63)\] (white-box), Gemini-2.5-Flash \[[22](https://arxiv.org/html/2606.12587#bib.bib22)\] (black-box), and GPT-4o-mini \[[56](https://arxiv.org/html/2606.12587#bib.bib56)\] (black-box).
#### Baselines.
Our primary baseline is LLM-decides, which lets the LLM itself choose whether to seek support after producing $y_0$. This is a natural reference point because it isolates what the agent does on its own, with no oversight layer, no error target, and no calibration. Its missed-support rate is a property of the model and the task, and cannot be controlled. We take this rate as the target $\alpha$ for our algorithm on each (task, model) pair, so that all comparisons are on equal footing. We then run each score variant from Section [5.1](https://arxiv.org/html/2606.12587#S5.SS1) inside Algorithm [1](https://arxiv.org/html/2606.12587#alg1): Confidence adapts only $\lambda_t$, while the Representation and Anchored variants additionally update the score parameters $\theta$ via calibration-on-the-fly.
### 5\.3Error Control
Figure 3: Cumulative missed-support error on all four tasks with Qwen-2.5-7B as the agent.

We track the cumulative missed-support error $\mathrm{MSE}(T)$ (Equation ([1](https://arxiv.org/html/2606.12587#S4.E1))), the empirical fraction of rounds where support would have helped but was not invoked. Figure [3](https://arxiv.org/html/2606.12587#S5.F3) shows the running $\mathrm{MSE}(T)$ for one representative model across all four tasks, with the $x$-axis indicating progress through the interaction stream as a fraction of total rounds. All score variants converge to the target $\alpha$ (the rate achieved by LLM-decides), with trajectories following the typical adaptation profile of online quantile tracking. The convergence rate is governed by the threshold step size $\eta$, which trades off adaptation speed against stability. The same behavior holds across all base agents; the full set of results is reported in Appendix [B.6](https://arxiv.org/html/2606.12587#A2.SS6).
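Algorithm 1 itself is not restated in this section, so the following is only a hedged sketch of what an importance-weighted online quantile-tracking step with randomized exploration can look like; it should not be read as the authors’ exact update. It assumes that support is called whenever the score is at least the current threshold, that rounds below the threshold are explored with probability $\mu$, and that the threshold is not projected back onto $[0,1]$. The names `gradient_estimate` and `run_round` are hypothetical.

```python
import random
from typing import Callable, Tuple


def gradient_estimate(below: bool, explored: bool, g: int,
                      alpha: float, mu: float) -> float:
    """Importance-weighted estimate of g * (1[score < lam] - alpha)."""
    if not below:
        # Support was called by the threshold rule: g is observed, no miss occurred.
        return -alpha * g
    if explored:
        # Support was called by exploration (probability mu): reweight by 1 / mu.
        return g * (1.0 - alpha) / mu
    # The agent acted alone: g is unobserved and the estimate is zero.
    return 0.0


def run_round(lam: float, score: float, observe_g: Callable[[], int],
              eta: float, alpha: float, mu: float,
              rng: random.Random) -> Tuple[bool, float]:
    """One round: decide whether to call support, then update the threshold."""
    below = score < lam
    explored = below and rng.random() < mu
    called = (not below) or explored
    # g can only be observed when support is actually invoked.
    g = observe_g() if called else 0
    lam_next = lam - eta * gradient_estimate(below, explored, g, alpha, mu)
    return called, lam_next
```

Conditioned on the score, the threshold, and $g_t$, the estimate has expectation $g_t(\mathbf{1}\{s_t < \lambda_t\} - \alpha)$, so the threshold drifts down (more support calls) when misses on $g_t = 1$ rounds exceed $\alpha$ and up otherwise. A larger $\eta$ makes each step bigger, which is the speed-versus-stability tradeoff noted above.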
### 5\.4Support Efficiency
Error control alone does not imply an efficient support policy. A policy that always invokes support has zero missed-support error but is maximally wasteful. We therefore compare methods on equal footing by asking how often each invokes support at the same missed-support error level. Figure [4](https://arxiv.org/html/2606.12587#S5.F4) reports the cumulative support rate across all task–model pairs, with the target $\alpha$ for SOS set to the missed-support rate achieved by LLM-decides on each pair. Two general patterns emerge.
Figure 4: Cumulative support rate $\widehat{\mathrm{SR}}_T = \frac{1}{T}\sum_{t=1}^{T} a_t$ across all task–model pairs at matched missed-support error. We show the best-performing variant across both families, paired with its same-embedding counterpart from the other family. Full per-panel comparisons are in Appendix [B.1](https://arxiv.org/html/2606.12587#A2.SS1).

Calibration-on-the-fly recovers from uninformative signals. The Representation family reliably reduces the support rate relative to LLM-decides on all task–model pairs, and the Anchored family does so on 11 of 12, regardless of how informative the underlying signal is. The oversight layer does not require a strong starting signal, since calibration-on-the-fly improves it from feedback on rounds where support is invoked. The Confidence score is unstable across tasks and agents: its support rate is sometimes below that of LLM-decides (e.g., VirtualHome on Qwen-2.5-7B) and sometimes well above it (e.g., WikiSQL on Qwen-2.5-7B). The parameterized families absorb this variability and reach lower support rates.
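Both quantities compared in this section can be recomputed from a log of per-round decisions. The sketch below assumes an offline evaluation in which $g_t$ is known for every round (both $y_0$ and $y_1$ were generated), and it assumes the missed-support error is normalized by $N_g(T)$, the number of rounds with $g_t = 1$, which matches the dependence on $N_g(T)$ in Theorem 4.1; Equation (1) is not restated in this section, so that normalization is an assumption. The function names are hypothetical.

```python
from typing import List, Sequence


def cumulative_support_rate(a: Sequence[int]) -> List[float]:
    """Running SR_T = (1/T) * sum_{t<=T} a_t, where a_t = 1 if support was called."""
    rates, total = [], 0
    for t, a_t in enumerate(a, start=1):
        total += a_t
        rates.append(total / t)
    return rates


def cumulative_missed_support_error(a: Sequence[int], g: Sequence[int]) -> List[float]:
    """Running fraction of g_t = 1 rounds on which support was not called.

    Assumes normalization by N_g(T); returns 0.0 while N_g(T) is still zero.
    """
    if len(a) != len(g):
        raise ValueError("a and g must have the same length")
    errors, misses, n_g = [], 0, 0
    for a_t, g_t in zip(a, g):
        n_g += g_t
        misses += g_t * (1 - a_t)
        errors.append(misses / n_g if n_g > 0 else 0.0)
    return errors
```

Plotting the two running curves for each method at the same target $\alpha$ is one way to reproduce the kind of comparison shown in Figures 3 and 4.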
Anchoring on a reliable initial signal accelerates the gain. The two parameterized families differ in whether they treat the initial signal as a useful prior or discard it. When the anchor is informative, calibration-on-the-fly only needs to learn a small correction, and the Anchored variants outperform their Representation counterparts (e.g., MATH on Gemini-2.5-Flash). When the anchor is uninformative or actively misleading, the Representation family pulls ahead, since the residual term in the Anchored score has to first undo the bad anchor before it can learn (e.g., WikiSQL on Gemini-2.5-Flash). In our experiments the anchor is the LLM’s own self-confidence, but the same principle applies to any auxiliary signal.
Takeaway. AI agents struggle to determine when seeking support is beneficial, even when they are proficient at utilizing that support once provided. Our strategic oversight algorithm significantly outperforms an agent’s self-decisions, achieving lower support rates while maintaining formal error control. Among score designs, parameterized scores that learn from representations perform reliably well, with calibration-on-the-fly effectively reducing the support rate as the query stream progresses.
## 6Limitations and Future Work
We focus on a binary support-seeking decision and measure cost by the frequency of support calls. In practice, agents may have multiple support options, and the cost of support may vary across options and instances. One may also replace the binary value indicator $g$ with a continuous measure of how much value support adds, defining the value of support as its expectation rather than the probability that $g=1$. We defer these finer-grained formulations to future work.
## 7Extended Related Works
#### Training Agents to Seek Support\.
A significant body of work bakes support-seeking behavior into the agent itself through training, modifying the agent’s weights or generation procedure so that it learns when to invoke external resources or ask clarifying questions. On the tool-use and retrieval side, Toolformer \[[70](https://arxiv.org/html/2606.12587#bib.bib70)\] teaches a language model, in a self-supervised manner, where to insert API calls by sampling and filtering candidate calls based on whether they reduce future-token loss. Self-RAG \[[5](https://arxiv.org/html/2606.12587#bib.bib5)\] trains a model to interleave retrieval with generation through reflection tokens that signal when to retrieve and when to commit. Confidence-triggered retrieval methods take a complementary route, deciding to retrieve based on signals during decoding: FLARE \[[33](https://arxiv.org/html/2606.12587#bib.bib33)\] triggers retrieval when the predicted next sentence contains low-confidence tokens, Adaptive-RAG \[[32](https://arxiv.org/html/2606.12587#bib.bib32)\] trains a small classifier to route queries among no-retrieval, single-step, and iterative-retrieval strategies, UALA \[[28](https://arxiv.org/html/2606.12587#bib.bib28)\] sets an uncertainty threshold on agent answers and resorts to external resources when it is exceeded, and SCENT \[[61](https://arxiv.org/html/2606.12587#bib.bib61)\] formalizes adaptive retrieval as reinforcement learning with dense rewards on intermediate retrieval quality.
On the user-interaction side, GATE \[[42](https://arxiv.org/html/2606.12587#bib.bib42)\] introduced the framework of having an LLM elicit user preferences through targeted questions. Subsequent work trains the questioner via self-improvement (STaR-GATE \[[2](https://arxiv.org/html/2606.12587#bib.bib2)\]) or DPO-style preference optimization at the conversational-action level \[[91](https://arxiv.org/html/2606.12587#bib.bib91)\], and introduces benchmarks that explicitly evaluate the ask-or-tool-call decision \[[68](https://arxiv.org/html/2606.12587#bib.bib68)\].
We instead provide an oversight layer around an unmodified agent, which decides whether to invoke support without changing how the agent itself generates outputs\. This separation makes the two lines complementary rather than competing: any model that has already been trained to call tools, retrieve, or ask clarifying questions is precisely the kind of base policy our framework wraps\. This enables us to deliver finite\-sample error control on top of strong, already\-trained agents without retraining, fine\-tuning, or modifying generation, and to do so uniformly across support modalities\. Our experiments use state\-of\-the\-art LLMs \(Qwen\-2\.5\-7B, Gemini\-2\.5\-Flash, GPT\-4o\-mini\) that already incorporate substantial training of this kind, and our gains are reported on top of these strong agents\.
#### Allocating Tasks Across Decision\-Making Systems\.
A second body of work studies how to allocate tasks across multiple decision-making systems rather than how to support a single one. Within this body, four directions are most relevant. LLM cascades and routers send each query to the cheapest system whose output is good enough: FrugalGPT \[[12](https://arxiv.org/html/2606.12587#bib.bib12)\] composes LLM APIs into a cascade with learned stop criteria, RouteLLM \[[55](https://arxiv.org/html/2606.12587#bib.bib55)\] trains a router that selects between strong and weak LLMs using human preference data, and cost-saving cascades with early abstention \[[90](https://arxiv.org/html/2606.12587#bib.bib90)\] extend cascades to allow abstention at intermediate tiers.
Learning to defer \[[46](https://arxiv.org/html/2606.12587#bib.bib46)\] generalizes rejection learning by routing uncertain instances to an expert whose decision replaces the model’s, with subsequent work developing Bayes-consistent surrogate losses \[[51](https://arxiv.org/html/2606.12587#bib.bib51)\], calibrated one-vs-all approaches \[[82](https://arxiv.org/html/2606.12587#bib.bib82)\], extensions to multiple experts \[[83](https://arxiv.org/html/2606.12587#bib.bib83), [48](https://arxiv.org/html/2606.12587#bib.bib48), [29](https://arxiv.org/html/2606.12587#bib.bib29), [35](https://arxiv.org/html/2606.12587#bib.bib35), [87](https://arxiv.org/html/2606.12587#bib.bib87)\], LLM allocation \[[50](https://arxiv.org/html/2606.12587#bib.bib50)\], algorithmic-triage formulations \[[64](https://arxiv.org/html/2606.12587#bib.bib64), [54](https://arxiv.org/html/2606.12587#bib.bib54), [16](https://arxiv.org/html/2606.12587#bib.bib16), [15](https://arxiv.org/html/2606.12587#bib.bib15)\], training-free conformal deferral \[[11](https://arxiv.org/html/2606.12587#bib.bib11)\], and collaborative matching for selective deferral \[[4](https://arxiv.org/html/2606.12587#bib.bib4)\]. Selective prediction lets the model abstain entirely on uncertain inputs, trading coverage for accuracy \[[21](https://arxiv.org/html/2606.12587#bib.bib21), [25](https://arxiv.org/html/2606.12587#bib.bib25), [26](https://arxiv.org/html/2606.12587#bib.bib26)\]. A more recent line uses conformal calibration to deliver formal guarantees on these decisions: Conformal Arbitrage \[[57](https://arxiv.org/html/2606.12587#bib.bib57)\] calibrates a threshold mediating between a primary model and a more conservative guardian, and Calibrate-Then-Delegate (CTD) \[[58](https://arxiv.org/html/2606.12587#bib.bib58)\] introduces a delegation-value probe that predicts the benefit of escalating to an expert and calibrates a threshold on this probe via offline multiple-hypothesis testing.
We instead study when an unmodified agent should invoke support to augment its own reasoning, rather than which of two systems should produce the output. Although we study a different problem, our combination of distribution-free online calibration, importance-weighted threshold updates under counterfactual partial feedback, and control of a conditional miss rate, together with the algorithmic techniques we develop, may be of independent interest to the broader learning-to-defer (L2D) and selective classification literatures.
#### Inference\-Time Policies for Support\-Seeking\.
A closely related line of work designs inference-time policies that decide whether an agent should seek information before acting. CLAM \[[39](https://arxiv.org/html/2606.12587#bib.bib39)\] prompts an LLM to classify whether a user question is ambiguous and asks a clarifying question when it is. AwN \[[85](https://arxiv.org/html/2606.12587#bib.bib85)\] prompts the LLM to ask whenever it encounters obstacles caused by unclear instructions during tool use. Earlier proactive-dialogue work uses prompting to elicit clarifying behavior more generally \[[17](https://arxiv.org/html/2606.12587#bib.bib17)\]. More recent work introduces formal objectives: KnowNo \[[67](https://arxiv.org/html/2606.12587#bib.bib67)\] uses split conformal prediction on a held-out calibration set to construct prediction sets over candidate user intents, asking for help whenever the set is non-singleton; the Value-of-Information policy of Dong et al. \[[19](https://arxiv.org/html/2606.12587#bib.bib19)\] maintains a belief distribution over latent intents, generates candidate clarifying questions, simulates user replies, and asks when the expected utility gain exceeds an explicit asking cost. A complementary perspective is taken by Kiyani et al. \[[36](https://arxiv.org/html/2606.12587#bib.bib36)\], who frame support-seeking on the *verification* side rather than the generation side: their weak–strong verification policies decide when a cheap internal check (e.g., self-consistency, a proxy reward) suffices and when to defer to costly external verification (e.g., user inspection), with an online algorithm that provably controls both acceptance and rejection errors.
We instead provide a unifying framework that brings these approaches under the same design principles, through an oversight layer with rigorous finite\-sample error control\. This enables us to handle a broad set of support modalities — clarifying questions, external tools, additional evidence, expert guidance — within a single algorithm, to operate fully online without an offline calibration set, and to control a counterfactual conditional error at a user\-chosen level\. Our experiments target precisely these capabilities: covering information gathering, tool use, and human\-AI collaboration under the same algorithm, with online error control across a range of target levels\.
#### AI as a Decision Support System for Humans
A long line of work studies how AI systems should support human decision-makers. One strand uses prediction sets as a structured interface for collaboration, calibrating the set to improve human accuracy \[[73](https://arxiv.org/html/2606.12587#bib.bib73), [6](https://arxiv.org/html/2606.12587#bib.bib6), [8](https://arxiv.org/html/2606.12587#bib.bib8), [79](https://arxiv.org/html/2606.12587#bib.bib79), [84](https://arxiv.org/html/2606.12587#bib.bib84)\], analyzing decision-relevant uncertainty \[[31](https://arxiv.org/html/2606.12587#bib.bib31)\], and formalizing counterfactual harm from set-based support \[[74](https://arxiv.org/html/2606.12587#bib.bib74), [52](https://arxiv.org/html/2606.12587#bib.bib52), [53](https://arxiv.org/html/2606.12587#bib.bib53)\]. A second strand develops theoretical frameworks for human-AI complementarity, identifying when joint systems surpass either agent alone \[[7](https://arxiv.org/html/2606.12587#bib.bib7), [72](https://arxiv.org/html/2606.12587#bib.bib72), [66](https://arxiv.org/html/2606.12587#bib.bib66), [81](https://arxiv.org/html/2606.12587#bib.bib81)\] and how algorithmic outputs shape human choices \[[14](https://arxiv.org/html/2606.12587#bib.bib14), [18](https://arxiv.org/html/2606.12587#bib.bib18), [38](https://arxiv.org/html/2606.12587#bib.bib38), [10](https://arxiv.org/html/2606.12587#bib.bib10), [9](https://arxiv.org/html/2606.12587#bib.bib9), [78](https://arxiv.org/html/2606.12587#bib.bib78), [24](https://arxiv.org/html/2606.12587#bib.bib24)\]. Recent surveys taxonomize the broader space of human-machine hybrid decision-making paradigms \[[60](https://arxiv.org/html/2606.12587#bib.bib60)\].
In all of these, the human is the final decision\-maker and the AI provides support around them\. Our work reverses this configuration: the AI agent is the decision\-maker, and humans, tools, and auxiliary information serve as support mechanisms around it\.
#### Uncertainty Quantification for AI Agents
A separate literature studies how to elicit reliable uncertainty estimates from LLMs more generally. Conformal prediction methods construct prediction sets with formal coverage guarantees over open-ended LLM outputs \[[62](https://arxiv.org/html/2606.12587#bib.bib62), [71](https://arxiv.org/html/2606.12587#bib.bib71), [37](https://arxiv.org/html/2606.12587#bib.bib37), [75](https://arxiv.org/html/2606.12587#bib.bib75), [80](https://arxiv.org/html/2606.12587#bib.bib80), [13](https://arxiv.org/html/2606.12587#bib.bib13), [49](https://arxiv.org/html/2606.12587#bib.bib49), [69](https://arxiv.org/html/2606.12587#bib.bib69), [45](https://arxiv.org/html/2606.12587#bib.bib45), [41](https://arxiv.org/html/2606.12587#bib.bib41)\]; a complementary line uses conformal risk control to learn abstention or selective-prediction policies \[[89](https://arxiv.org/html/2606.12587#bib.bib89), [76](https://arxiv.org/html/2606.12587#bib.bib76), [88](https://arxiv.org/html/2606.12587#bib.bib88)\]. A broader thread mitigates hallucinations through direct uncertainty estimation, semantic-disagreement signals across sampled responses, or verifier-based detectors \[[44](https://arxiv.org/html/2606.12587#bib.bib44), [1](https://arxiv.org/html/2606.12587#bib.bib1), [23](https://arxiv.org/html/2606.12587#bib.bib23), [20](https://arxiv.org/html/2606.12587#bib.bib20), [86](https://arxiv.org/html/2606.12587#bib.bib86), [40](https://arxiv.org/html/2606.12587#bib.bib40), [47](https://arxiv.org/html/2606.12587#bib.bib47)\], and recent work fine-tunes LLMs for better-calibrated verbalized confidence \[[34](https://arxiv.org/html/2606.12587#bib.bib34), [43](https://arxiv.org/html/2606.12587#bib.bib43)\]. These methods produce signals about the agent’s confidence in its own output. Better LLM uncertainty estimators from this line could be used as the anchor in our anchored score, providing a stronger initial signal that calibration-on-the-fly need only refine.
## 8Acknowledgments
The authors thank EnCORE, the Institute for Emerging CORE Methods in Data Science, for their support, as well as NSF award 2502489 under the MFAI: Mathematical Foundations of Alignment in Generative Artificial Intelligence program\. SK additionally acknowledges support from a gift from AWS to Penn Engineering’s ASSET Center for Trustworthy AI\.
## References
- Aichberger et al. \[2024\] Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, and Sepp Hochreiter. Semantically diverse language generation for uncertainty estimation in language models, 2024. URL [https://arxiv.org/abs/2406.04306](https://arxiv.org/abs/2406.04306).
- Andukuri et al. \[2024\] Chinmaya Andukuri, Jan-Philipp Fränken, Tobias Gerstenberg, and Noah D. Goodman. STaR-GATE: Teaching language models to ask clarifying questions, 2024. URL [https://arxiv.org/abs/2403.19154](https://arxiv.org/abs/2403.19154).
- Angelopoulos et al. \[2023\] Anastasios N. Angelopoulos, Emmanuel J. Candes, and Ryan J. Tibshirani. Conformal PID control for time series prediction, 2023. URL [https://arxiv.org/abs/2307.16895](https://arxiv.org/abs/2307.16895).
- Arnaiz-Rodriguez et al. \[2025\] Adrian Arnaiz-Rodriguez, Nina Corvelo Benz, Suhas Thejaswi, Nuria Oliver, and Manuel Gomez-Rodriguez. Towards human-AI complementarity in matching tasks, 2025. URL [https://arxiv.org/abs/2508.13285](https://arxiv.org/abs/2508.13285).
- Asai et al. \[2023\] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection, 2023. URL [https://arxiv.org/abs/2310.11511](https://arxiv.org/abs/2310.11511).
- Babbar et al. \[2022\] Varun Babbar, Umang Bhatt, and Adrian Weller. On the utility of prediction sets in human-AI teams, 2022. URL [https://arxiv.org/abs/2205.01411](https://arxiv.org/abs/2205.01411).
- Bansal et al. \[2021\] Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, and Daniel S. Weld. Is the most accurate AI the best teammate? Optimizing AI for teamwork, 2021. URL [https://arxiv.org/abs/2004.13102](https://arxiv.org/abs/2004.13102).
- Benz and Rodriguez \[2025\] Nina L. Corvelo Benz and Manuel Gomez Rodriguez. Human-alignment influences the utility of AI-assisted decision making, 2025. URL [https://arxiv.org/abs/2501.14035](https://arxiv.org/abs/2501.14035).
- Bordt and Von Luxburg \[2022\] Sebastian Bordt and Ulrike Von Luxburg. A bandit model for human-machine decision making with private information and opacity. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, *Proceedings of The 25th International Conference on Artificial Intelligence and Statistics*, volume 151 of *Proceedings of Machine Learning Research*, pages 7300–7319. PMLR, 28–30 Mar 2022. URL [https://proceedings.mlr.press/v151/bordt22a.html](https://proceedings.mlr.press/v151/bordt22a.html).
- Chan et al. \[2019\] Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, and Anca Dragan. The assistive multi-armed bandit, 2019. URL [https://arxiv.org/abs/1901.08654](https://arxiv.org/abs/1901.08654).
- Charusaie et al. \[2022\] Mohammad-Amin Charusaie, Hussein Mozannar, David Sontag, and Samira Samadi. Sample efficient learning of predictors that complement humans, 2022. URL [https://arxiv.org/abs/2207.09584](https://arxiv.org/abs/2207.09584).
- Chen et al. \[2023\] Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance, 2023. URL [https://arxiv.org/abs/2305.05176](https://arxiv.org/abs/2305.05176).
- Cherian et al. \[2024\] John J. Cherian, Isaac Gibbs, and Emmanuel J. Candès. Large language model validity via enhanced conformal prediction methods, 2024. URL [https://arxiv.org/abs/2406.09714](https://arxiv.org/abs/2406.09714).
- Cowgill and Stevenson \[2020\] Bo Cowgill and Megan T. Stevenson. Algorithmic social engineering. *AEA Papers and Proceedings*, 110:96–100, May 2020. doi:10.1257/pandp.20201037. URL [https://www.aeaweb.org/articles?id=10.1257/pandp.20201037](https://www.aeaweb.org/articles?id=10.1257/pandp.20201037).
- De et al. \[2021a\] Abir De, Nastaran Okati, Paramita Koley, Niloy Ganguly, and Manuel Gomez-Rodriguez. Regression under human assistance, 2021a. URL [https://arxiv.org/abs/1909.02963](https://arxiv.org/abs/1909.02963).
- De et al. \[2021b\] Abir De, Nastaran Okati, Ali Zarezade, and Manuel Gomez-Rodriguez. Classification under human assistance, 2021b. URL [https://arxiv.org/abs/2006.11845](https://arxiv.org/abs/2006.11845).
- Deng et al. \[2023\] Yang Deng, Lizi Liao, Liang Chen, Hongru Wang, Wenqiang Lei, and Tat-Seng Chua. Prompting and evaluating large language models for proactive dialogues: Clarification, target-guided, and non-collaboration, 2023. URL [https://arxiv.org/abs/2305.13626](https://arxiv.org/abs/2305.13626).
- Donahue et al. \[2024\] Kate Donahue, Sreenivas Gollapudi, and Kostas Kollias. When are two lists better than one?: Benefits and harms in joint decision-making, 2024. URL [https://arxiv.org/abs/2308.11721](https://arxiv.org/abs/2308.11721).
- Dong et al. \[2026\] Yijiang River Dong, Tiancheng Hu, Zheng Hui, Caiqi Zhang, Ivan Vulić, Andreea Bobu, and Nigel Collier. Value of information: A framework for human-agent communication, 2026. URL [https://arxiv.org/abs/2601.06407](https://arxiv.org/abs/2601.06407).
- Duan et al. \[2024\] Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models, 2024. URL [https://arxiv.org/abs/2307.01379](https://arxiv.org/abs/2307.01379).
- El-Yaniv and Wiener \[2010\] Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification. *Journal of Machine Learning Research*, 11(53):1605–1641, 2010. URL [http://jmlr.org/papers/v11/el-yaniv10a.html](http://jmlr.org/papers/v11/el-yaniv10a.html).
- Comanici et al. \[2025\] Gheorghe Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL [https://arxiv.org/abs/2412.15115](https://arxiv.org/abs/2412.15115).
- Farquhar et al. \[2024\] Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. *Nature*, 630(8017):625–630, 2024.
- Gao et al. \[2025\] Qi Gao, Wei Xu, Hanxi Pan, Mowei Shen, and Zaifeng Gao. Human-centered human-AI collaboration (HCHAC), 2025. URL [https://arxiv.org/abs/2505.22477](https://arxiv.org/abs/2505.22477).
- Geifman and El-Yaniv \[2017\] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks, 2017. URL [https://arxiv.org/abs/1705.08500](https://arxiv.org/abs/1705.08500).
- Geifman and El-Yaniv \[2019\] Yonatan Geifman and Ran El-Yaniv. SelectiveNet: A deep neural network with an integrated reject option, 2019. URL [https://arxiv.org/abs/1901.09192](https://arxiv.org/abs/1901.09192).
- Gibbs and Candès \[2021\] Isaac Gibbs and Emmanuel Candès. Adaptive conformal inference under distribution shift, 2021. URL [https://arxiv.org/abs/2106.00170](https://arxiv.org/abs/2106.00170).
- Han et al. \[2024\] Jiuzhou Han, Wray Buntine, and Ehsan Shareghi. Towards uncertainty-aware language agent, 2024. URL [https://arxiv.org/abs/2401.14016](https://arxiv.org/abs/2401.14016).
- Hemmer et al. \[2023\] Patrick Hemmer, Lukas Thede, Michael Vössing, Johannes Jakubik, and Niklas Kühl. Learning to defer with limited expert predictions, 2023. URL [https://arxiv.org/abs/2304.07306](https://arxiv.org/abs/2304.07306).
- Hendrycks et al. \[2021\] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset, 2021. URL [https://arxiv.org/abs/2103.03874](https://arxiv.org/abs/2103.03874).
- Hullman et al. \[2025\] Jessica Hullman, Yifan Wu, Dawei Xie, Ziyang Guo, and Andrew Gelman. Conformal prediction and human decision making, 2025. URL [https://arxiv.org/abs/2503.11709](https://arxiv.org/abs/2503.11709).
- Jeong et al. \[2024\] Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity, 2024. URL [https://arxiv.org/abs/2403.14403](https://arxiv.org/abs/2403.14403).
- Jiang et al. \[2023\] Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation, 2023. URL [https://arxiv.org/abs/2305.06983](https://arxiv.org/abs/2305.06983).
- Kapoor et al. \[2025\] Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew Gordon Wilson. Large language models must be taught to know what they don’t know, 2025. URL [https://arxiv.org/abs/2406.08391](https://arxiv.org/abs/2406.08391).
- Keswani et al. \[2021\] Vijay Keswani, Matthew Lease, and Krishnaram Kenthapadi. Towards unbiased and accurate deferral to multiple experts, 2021. URL [https://arxiv.org/abs/2102.13004](https://arxiv.org/abs/2102.13004).
- Kiyani et al. \[2026\] Shayan Kiyani, Sima Noorani, George Pappas, and Hamed Hassani. When to trust the cheap check: Weak and strong verification for reasoning, 2026. URL [https://arxiv.org/abs/2602.17633](https://arxiv.org/abs/2602.17633).
- Kladny et al. \[2025\] Klaus-Rudolf Kladny, Bernhard Schölkopf, and Michael Muehlebach. Conformal generative modeling with improved sample efficiency through sequential greedy filtering, 2025. URL [https://arxiv.org/abs/2410.01660](https://arxiv.org/abs/2410.01660).
- Kleinberg and Raghavan \[2021\] Jon Kleinberg and Manish Raghavan. Algorithmic monoculture and social welfare. *Proceedings of the National Academy of Sciences*, 118(22), May 2021. ISSN 1091-6490. doi:10.1073/pnas.2018340118. URL [http://dx.doi.org/10.1073/pnas.2018340118](http://dx.doi.org/10.1073/pnas.2018340118).
- Kuhn et al. \[2023a\] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. CLAM: Selective clarification for ambiguous questions with generative language models, 2023a. URL [https://arxiv.org/abs/2212.07769](https://arxiv.org/abs/2212.07769).
- Kuhn et al\. \[2023b\]Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar\.Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation, 2023b\.URL[https://arxiv\.org/abs/2302\.09664](https://arxiv.org/abs/2302.09664)\.
- Kumar et al\. \[2023\]Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam\.Conformal prediction with large language models for multi\-choice question answering, 2023\.URL[https://arxiv\.org/abs/2305\.18404](https://arxiv.org/abs/2305.18404)\.
- Li et al\. \[2023\]Belinda Z\. Li, Alex Tamkin, Noah Goodman, and Jacob Andreas\.Eliciting human preferences with language models, 2023\.URL[https://arxiv\.org/abs/2310\.11589](https://arxiv.org/abs/2310.11589)\.
- Li et al\. \[2025\]Yibo Li, Miao Xiong, Jiaying Wu, and Bryan Hooi\.Conftuner: Training large language models to express their confidence verbally, 2025\.URL[https://arxiv\.org/abs/2508\.18847](https://arxiv.org/abs/2508.18847)\.
- Liu et al\. \[2024\]Linyu Liu, Yu Pan, Xiaocheng Li, and Guanting Chen\.Uncertainty estimation and quantification for llms: A simple supervised approach, 2024\.URL[https://arxiv\.org/abs/2404\.15993](https://arxiv.org/abs/2404.15993)\.
- Liu and Wu \[2024\]Terrance Liu and Zhiwei Steven Wu\.Multi\-group uncertainty quantification for long\-form text generation, 2024\.URL[https://arxiv\.org/abs/2407\.21057](https://arxiv.org/abs/2407.21057)\.
- Madras et al\. \[2018\]David Madras, Toniann Pitassi, and Richard Zemel\.Predict responsibly: Improving fairness and accuracy by learning to defer, 2018\.URL[https://arxiv\.org/abs/1711\.06664](https://arxiv.org/abs/1711.06664)\.
- Manakul et al\. \[2023\]Potsawee Manakul, Adian Liusie, and Mark J\. F\. Gales\.Selfcheckgpt: Zero\-resource black\-box hallucination detection for generative large language models, 2023\.URL[https://arxiv\.org/abs/2303\.08896](https://arxiv.org/abs/2303.08896)\.
- Mao et al\. \[2023\]Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong\.Two\-stage learning to defer with multiple experts\.In A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine, editors,*Advances in Neural Information Processing Systems*, volume 36, pages 3578–3606\. Curran Associates, Inc\., 2023\.URL[https://proceedings\.neurips\.cc/paper\_files/paper/2023/file/0b17d256cf1fe1cc084922a8c6b565b7\-Paper\-Conference\.pdf](https://proceedings.neurips.cc/paper_files/paper/2023/file/0b17d256cf1fe1cc084922a8c6b565b7-Paper-Conference.pdf)\.
- Mohri and Hashimoto \[2024\]Christopher Mohri and Tatsunori Hashimoto\.Language models with conformal factuality guarantees, 2024\.URL[https://arxiv\.org/abs/2402\.10978](https://arxiv.org/abs/2402.10978)\.
- Montreuil et al\. \[2025\]Yannis Montreuil, Shu Heng Yeo, Axel Carlier, Lai Xing Ng, and Wei Tsang Ooi\.Optimal query allocation in extractive qa with llms: A learning\-to\-defer framework with theoretical guarantees, 2025\.URL[https://arxiv\.org/abs/2410\.15761](https://arxiv.org/abs/2410.15761)\.
- Mozannar and Sontag \[2021\]Hussein Mozannar and David Sontag\.Consistent estimators for learning to defer to an expert, 2021\.URL[https://arxiv\.org/abs/2006\.01862](https://arxiv.org/abs/2006.01862)\.
- Noorani et al\. \[2025\]Sima Noorani, Shayan Kiyani, George Pappas, and Hamed Hassani\.Human\-ai collaborative uncertainty quantification\.*arXiv preprint arXiv:2510\.23476*, 2025\.
- Noorani et al\. \[2026\]Sima Noorani, Shayan Kiyani, Hamed Hassani, and George Pappas\.Multi\-round human\-ai collaboration with user\-specified requirements\.*arXiv preprint arXiv:2602\.17646*, 2026\.
- Okati et al\. \[2021\]Nastaran Okati, Abir De, and Manuel Gomez\-Rodriguez\.Differentiable learning under triage, 2021\.URL[https://arxiv\.org/abs/2103\.08902](https://arxiv.org/abs/2103.08902)\.
- Ong et al\. \[2025\]Isaac Ong, Amjad Almahairi, Vincent Wu, Wei\-Lin Chiang, Tianhao Wu, Joseph E\. Gonzalez, M Waleed Kadous, and Ion Stoica\.Routellm: Learning to route llms with preference data, 2025\.URL[https://arxiv\.org/abs/2406\.18665](https://arxiv.org/abs/2406.18665)\.
- OpenAI \[2024\]OpenAI\.Gpt\-4o system card, 2024\.URL[https://arxiv\.org/abs/2410\.21276](https://arxiv.org/abs/2410.21276)\.
- Overman and Bayati \[2025\]William Overman and Mohsen Bayati\.Conformal arbitrage: Risk\-controlled balancing of competing objectives in language models, 2025\.URL[https://arxiv\.org/abs/2506\.00911](https://arxiv.org/abs/2506.00911)\.
- Pona et al\. \[2026\]Edoardo Pona, Milad Kazemi, Mehran Hosseini, Yali Du, David Watson, Osvaldo Simeone, and Nicola Paoletti\.Calibrate\-then\-delegate: Safety monitoring with risk and budget guarantees via model cascades, 2026\.URL[https://arxiv\.org/abs/2604\.14251](https://arxiv.org/abs/2604.14251)\.
- Puig et al\. \[2018\]Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba\.Virtualhome: Simulating household activities via programs, 2018\.URL[https://arxiv\.org/abs/1806\.07011](https://arxiv.org/abs/1806.07011)\.
- Punzi et al\. \[2026\]Clara Punzi, Roberto Pellungrini, Mattia Setzu, Fosca Giannotti, and Dino Pedreschi\.Learning paradigms for hybrid decision\-making systems\.*ACM Comput\. Surv\.*, April 2026\.ISSN 0360\-0300\.doi:10\.1145/3802522\.URL[https://doi\.org/10\.1145/3802522](https://doi.org/10.1145/3802522)\.Just Accepted\.
- Qian and Liu \[2026\]Hongjin Qian and Zheng Liu\.Scent of knowledge: Optimizing search\-enhanced reasoning with information foraging\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2026\.URL[https://openreview\.net/forum?id=26kUrQm4zw](https://openreview.net/forum?id=26kUrQm4zw)\.
- Quach et al\. \[2024\]Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S\. Jaakkola, and Regina Barzilay\.Conformal language modeling, 2024\.URL[https://arxiv\.org/abs/2306\.10193](https://arxiv.org/abs/2306.10193)\.
- Qwen et al\. \[2025\]Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu\.Qwen2\.5 technical report, 2025\.URL[https://arxiv\.org/abs/2412\.15115](https://arxiv.org/abs/2412.15115)\.
- Raghu et al\. \[2019\]Maithra Raghu, Katy Blumer, Greg Corrado, Jon Kleinberg, Ziad Obermeyer, and Sendhil Mullainathan\.The algorithmic automation problem: Prediction, triage, and human effort, 2019\.URL[https://arxiv\.org/abs/1903\.12220](https://arxiv.org/abs/1903.12220)\.
- Ramalingam et al\. \[2025\]Ramya Ramalingam, Shayan Kiyani, and Aaron Roth\.The relationship between no\-regret learning and online conformal prediction\.*arXiv preprint arXiv:2502\.10947*, 2025\.
- Rastogi et al\. \[2023\]Charvi Rastogi, Liu Leqi, Kenneth Holstein, and Hoda Heidari\.A taxonomy of human and ml strengths in decision\-making to investigate human\-ml complementarity, 2023\.URL[https://arxiv\.org/abs/2204\.10806](https://arxiv.org/abs/2204.10806)\.
- Ren et al\. \[2023\]Allen Z\. Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, and Anirudha Majumdar\.Robots that ask for help: Uncertainty alignment for large language model planners, 2023\.URL[https://arxiv\.org/abs/2307\.01928](https://arxiv.org/abs/2307.01928)\.
- Ross et al\. \[2025\]Hayley Ross, Ameya Sunil Mahabaleshwarkar, and Yoshi Suhara\.When2call: When \(not\) to call tools, 2025\.URL[https://arxiv\.org/abs/2504\.18851](https://arxiv.org/abs/2504.18851)\.
- \[69\]Maxon Rubin\-Toles, Maya Gambhir, Keshav Ramji, Aaron Roth, and Surbhi Goel\.Conformal language model reasoning with coherent factuality\.In*The Thirteenth International Conference on Learning Representations*\.
- Schick et al\. \[2023\]Timo Schick, Jane Dwivedi\-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom\.Toolformer: Language models can teach themselves to use tools, 2023\.URL[https://arxiv\.org/abs/2302\.04761](https://arxiv.org/abs/2302.04761)\.
- Shahrokhi et al\. \[2025\]Hooman Shahrokhi, Devjeet Raj Roy, Yan Yan, Venera Arnaoudova, and Janardhan Rao Doppa\.Conformal prediction sets for deep generative models via reduction to conformal regression\.*arXiv preprint arXiv:2503\.10512*, 2025\.
- Steyvers et al\. \[2022\]Mark Steyvers, Heliodoro Tejeda, Gavin Kerrigan, and Padhraic Smyth\.Bayesian modeling of human ai complementarity\.*Proceedings of the National Academy of Sciences*, 119\(11\):e2111547119, 2022\.doi:10\.1073/pnas\.2111547119\.URL[https://www\.pnas\.org/doi/abs/10\.1073/pnas\.2111547119](https://www.pnas.org/doi/abs/10.1073/pnas.2111547119)\.
- Straitouri et al\. \[2023\]Eleni Straitouri, Lequn Wang, Nastaran Okati, and Manuel Gomez Rodriguez\.Improving expert predictions with conformal prediction, 2023\.URL[https://arxiv\.org/abs/2201\.12006](https://arxiv.org/abs/2201.12006)\.
- Straitouri et al\. \[2024\]Eleni Straitouri, Suhas Thejaswi, and Manuel Gomez Rodriguez\.Controlling counterfactual harm in decision support systems based on prediction sets, 2024\.URL[https://arxiv\.org/abs/2406\.06671](https://arxiv.org/abs/2406.06671)\.
- Su et al\. \[2024\]Jiayuan Su, Jing Luo, Hongwei Wang, and Lu Cheng\.Api is enough: Conformal prediction for large language models without logit\-access, 2024\.URL[https://arxiv\.org/abs/2403\.01216](https://arxiv.org/abs/2403.01216)\.
- Tayebati et al\. \[2025\]Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Ranganath Krishnan, and Amit Ranjan Trivedi\.Learning conformal abstention policies for adaptive risk management in large language and vision\-language models, 2025\.URL[https://arxiv\.org/abs/2502\.06884](https://arxiv.org/abs/2502.06884)\.
- Tchango et al\. \[2022\]Arsene Fansi Tchango, Rishab Goel, Zhi Wen, Julien Martel, and Joumana Ghosn\.Ddxplus: A new dataset for automatic medical diagnosis, 2022\.URL[https://arxiv\.org/abs/2205\.09148](https://arxiv.org/abs/2205.09148)\.
- Tian et al\. \[2023\]Ran Tian, Masayoshi Tomizuka, Anca D\. Dragan, and Andrea Bajcsy\.Towards modeling and influencing the dynamics of human learning\.In*Proceedings of the 2023 ACM/IEEE International Conference on Human\-Robot Interaction*, HRI ’23, page 350–358, New York, NY, USA, 2023\. Association for Computing Machinery\.ISBN 9781450399647\.doi:10\.1145/3568162\.3578629\.URL[https://doi\.org/10\.1145/3568162\.3578629](https://doi.org/10.1145/3568162.3578629)\.
- Toni et al\. \[2024\]Giovanni De Toni, Nastaran Okati, Suhas Thejaswi, Eleni Straitouri, and Manuel Gomez\-Rodriguez\.Towards human\-ai complementarity with prediction sets, 2024\.URL[https://arxiv\.org/abs/2405\.17544](https://arxiv.org/abs/2405.17544)\.
- Ulmer et al\. \[2024\]Dennis Ulmer, Chrysoula Zerva, and André F\. T\. Martins\.Non\-exchangeable conformal language generation with nearest neighbors, 2024\.URL[https://arxiv\.org/abs/2402\.00707](https://arxiv.org/abs/2402.00707)\.
- Vaccaro et al\. \[2024\]Michael Vaccaro, Abdullah Almaatouq, and Thomas Malone\.When combinations of humans and ai are useful: A systematic review and meta\-analysis\.*Nature Human Behaviour*, 8:2293–2303, 2024\.doi:10\.1038/s41562\-024\-02024\-1\.URL[https://doi\.org/10\.1038/s41562\-024\-02024\-1](https://doi.org/10.1038/s41562-024-02024-1)\.
- Verma and Nalisnick \[2022\]Rajeev Verma and Eric Nalisnick\.Calibrated learning to defer with one\-vs\-all classifiers, 2022\.URL[https://arxiv\.org/abs/2202\.03673](https://arxiv.org/abs/2202.03673)\.
- Verma et al\. \[2023\]Rajeev Verma, Daniel Barrejón, and Eric Nalisnick\.Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles, 2023\.URL[https://arxiv\.org/abs/2210\.16955](https://arxiv.org/abs/2210.16955)\.
- Wang et al\. \[2022\]Lequn Wang, Thorsten Joachims, and Manuel Gomez Rodriguez\.Improving screening processes via calibrated subset selection\.*ArXiv*, abs/2202\.01147, 2022\.URL[https://api\.semanticscholar\.org/CorpusID:246473291](https://api.semanticscholar.org/CorpusID:246473291)\.
- Wang et al\. \[2026\]Wenxuan Wang, Juluan Shi, Zixuan Ling, Yuk\-Kit Chan, Chaozheng Wang, Cheryl Lee, Youliang Yuan, Jen tse Huang, Wenxiang Jiao, and Michael R\. Lyu\.Learning to ask: When llm agents meet unclear instruction, 2026\.URL[https://arxiv\.org/abs/2409\.00557](https://arxiv.org/abs/2409.00557)\.
- Wang et al\. \[2023\]Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou\.Self\-consistency improves chain of thought reasoning in language models, 2023\.URL[https://arxiv\.org/abs/2203\.11171](https://arxiv.org/abs/2203.11171)\.
- Wilder et al\. \[2021\]Bryan Wilder, Eric Horvitz, and Ece Kamar\.Learning to complement humans\.In*Proceedings of the Twenty\-Ninth International Joint Conference on Artificial Intelligence*, IJCAI’20, 2021\.ISBN 9780999241165\.
- Yadkori et al\. \[2024a\]Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári\.To believe or not to believe your llm, 2024a\.URL[https://arxiv\.org/abs/2406\.02543](https://arxiv.org/abs/2406.02543)\.
- Yadkori et al\. \[2024b\]Yasin Abbasi Yadkori, Ilja Kuzborskij, David Stutz, András György, Adam Fisch, Arnaud Doucet, Iuliya Beloshapka, Wei\-Hung Weng, Yao\-Yuan Yang, Csaba Szepesvári, Ali Taylan Cemgil, and Nenad Tomasev\.Mitigating llm hallucinations via conformal abstention, 2024b\.URL[https://arxiv\.org/abs/2405\.01563](https://arxiv.org/abs/2405.01563)\.
- Zellinger et al\. \[2025\]Michael J\. Zellinger, Rex Liu, and Matt Thomson\.Cost\-saving llm cascades with early abstention, 2025\.URL[https://arxiv\.org/abs/2502\.09054](https://arxiv.org/abs/2502.09054)\.
- Zhang et al\. \[2025\]Michael J\. Q\. Zhang, W\. Bradley Knox, and Eunsol Choi\.Modeling future conversation turns to teach llms to ask clarifying questions, 2025\.URL[https://arxiv\.org/abs/2410\.13788](https://arxiv.org/abs/2410.13788)\.
- Zhong et al\. \[2017\]Victor Zhong, Caiming Xiong, and Richard Socher\.Seq2sql: Generating structured queries from natural language using reinforcement learning, 2017\.URL[https://arxiv\.org/abs/1709\.00103](https://arxiv.org/abs/1709.00103)\.
###### Contents
1. [1 Introduction](https://arxiv.org/html/2606.12587#S1)
2. [2 Related Works](https://arxiv.org/html/2606.12587#S2)
3. [3 Fundamentals of Strategic Decision Support](https://arxiv.org/html/2606.12587#S3)
4. [4 Online Algorithm and Guarantees](https://arxiv.org/html/2606.12587#S4)
5. [5 Experiments](https://arxiv.org/html/2606.12587#S5)
    1. [5\.1 Score Design](https://arxiv.org/html/2606.12587#S5.SS1)
    2. [5\.2 Tasks and Baselines](https://arxiv.org/html/2606.12587#S5.SS2)
    3. [5\.3 Error Control](https://arxiv.org/html/2606.12587#S5.SS3)
    4. [5\.4 Support Efficiency](https://arxiv.org/html/2606.12587#S5.SS4)
6. [6 Limitations and Future Work](https://arxiv.org/html/2606.12587#S6)
7. [7 Extended Related Works](https://arxiv.org/html/2606.12587#S7)
8. [8 Acknowledgments](https://arxiv.org/html/2606.12587#S8)
9. [References](https://arxiv.org/html/2606.12587#bib)
10. [A Proofs](https://arxiv.org/html/2606.12587#A1)
    1. [A\.1 Proof of Theorem 3\.1](https://arxiv.org/html/2606.12587#A1.SS1)
    2. [A\.2 Proof of Theorem 4\.1](https://arxiv.org/html/2606.12587#A1.SS2)
11. [B Additional Experimental Results](https://arxiv.org/html/2606.12587#A2)
    1. [B\.1 Support rate across all score variants](https://arxiv.org/html/2606.12587#A2.SS1)
    2. [B\.2 Ablation on the exploration probability $\mu$](https://arxiv.org/html/2606.12587#A2.SS2)
    3. [B\.3 Sensitivity to the choice of $g$ on VirtualHome](https://arxiv.org/html/2606.12587#A2.SS3)
    4. [B\.4 Scores histogram across the online stream](https://arxiv.org/html/2606.12587#A2.SS4)
    5. [B\.5 Per\-task specifics and descriptive statistics](https://arxiv.org/html/2606.12587#A2.SS5)
    6. [B\.6 Error Control for all datasets](https://arxiv.org/html/2606.12587#A2.SS6)
    7. [B\.7 Hyperparameters](https://arxiv.org/html/2606.12587#A2.SS7)
    8. [B\.8 Task implementation details and example prompts](https://arxiv.org/html/2606.12587#A2.SS8)
    9. [B\.9 Operational Variants for the Score Input](https://arxiv.org/html/2606.12587#A2.SS9)
## Appendix A Proofs
### A\.1 Proof of Theorem [3\.1](https://arxiv.org/html/2606.12587#S3.Thmtheorem1)
###### Proof\.
Let
$$G := g(X, Y_0, Y_1), \qquad \pi_1 := \mathbb{P}(G = 1),$$
and assume $\pi_1 > 0$. For any measurable strategy $a : \mathcal{X} \times \mathcal{Y} \to \{0,1\}$, the constraint
$$\mathbb{P}(a(X, Y_0) = 0 \mid G = 1) \leq \varepsilon$$
is equivalent to
$$\mathbb{P}(a(X, Y_0) = 1 \mid G = 1) \geq 1 - \varepsilon.$$
By the definition
$$\operatorname{val}(x, y_0) := \mathbb{P}(G = 1 \mid X = x, Y_0 = y_0),$$
we have
$$\mathbb{P}(a(X, Y_0) = 1 \mid G = 1) = \frac{\mathbb{E}\!\left[a(X, Y_0)\mathbf{1}\{G = 1\}\right]}{\pi_1} = \frac{\mathbb{E}\!\left[a(X, Y_0)\operatorname{val}(X, Y_0)\right]}{\pi_1},$$
where the second equality follows from the tower property. Hence (SDS-Opt) is equivalent to
$$\min_{a : \mathcal{X} \times \mathcal{Y} \to \{0,1\}} \mathbb{E}[a(X, Y_0)] \qquad \text{subject to} \qquad \mathbb{E}[a(X, Y_0)\operatorname{val}(X, Y_0)] \geq (1 - \varepsilon)\pi_1.$$
Let
$$\beta := (1 - \varepsilon)\pi_1.$$
The objective charges one unit for every point on which support is sought, while the constraint credits such a point in proportion to $\operatorname{val}(X, Y_0)$. Therefore, an optimal strategy should allocate support to the largest values of $\operatorname{val}$ first. We now make this formal.
Choose a threshold $\tau^{\star} \in [0,1]$ such that
$$\mathbb{E}\!\left[\operatorname{val}(X, Y_0)\mathbf{1}\{\operatorname{val}(X, Y_0) > \tau^{\star}\}\right] \leq \beta \leq \mathbb{E}\!\left[\operatorname{val}(X, Y_0)\mathbf{1}\{\operatorname{val}(X, Y_0) \geq \tau^{\star}\}\right].$$
If necessary, randomize on the boundary $\{\operatorname{val}(X, Y_0) = \tau^{\star}\}$ with probability $\rho^{\star} \in [0,1]$ chosen so that
$$\mathbb{E}\!\left[\operatorname{val}(X, Y_0)\Big(\mathbf{1}\{\operatorname{val}(X, Y_0) > \tau^{\star}\} + \rho^{\star}\mathbf{1}\{\operatorname{val}(X, Y_0) = \tau^{\star}\}\Big)\right] = \beta.$$
Define
$$a^{\star}(x, y_0) = \mathbf{1}\{\operatorname{val}(x, y_0) > \tau^{\star}\} + \rho^{\star}\mathbf{1}\{\operatorname{val}(x, y_0) = \tau^{\star}\}.$$
Equivalently, $a^{\star}$ seeks support whenever $\operatorname{val}(x, y_0)$ is above the threshold and randomizes on the boundary if needed. By construction, $a^{\star}$ satisfies the constraint.
It remains to show optimality. Let $a$ be any feasible strategy, allowing randomized values in $[0,1]$; this only enlarges the feasible class. By the definition of $a^{\star}$, we have pointwise
$$(a(x, y_0) - a^{\star}(x, y_0))(\operatorname{val}(x, y_0) - \tau^{\star}) \leq 0.$$
Taking expectations gives
$$\mathbb{E}\!\left[(a - a^{\star})\operatorname{val}\right] \leq \tau^{\star}\,\mathbb{E}[a - a^{\star}].$$
Since $a$ is feasible and $a^{\star}$ satisfies the constraint with equality,
$$\mathbb{E}\!\left[(a - a^{\star})\operatorname{val}\right] \geq 0.$$
Therefore,
$$0 \leq \tau^{\star}\,\mathbb{E}[a - a^{\star}].$$
If $\tau^{\star} > 0$, this implies
$$\mathbb{E}[a] \geq \mathbb{E}[a^{\star}].$$
If $\tau^{\star} = 0$, the inequality defining $\tau^{\star}$ gives $\beta \geq \mathbb{E}[\operatorname{val}(X, Y_0)\mathbf{1}\{\operatorname{val}(X, Y_0) > 0\}] = \pi_1$, so $\varepsilon = 0$. Feasibility then requires $a = 1$ almost surely on the set where $\operatorname{val}(X, Y_0) > 0$, and $a^{\star}$ (taking $\rho^{\star} = 0$) is the minimal such rule, so the same conclusion holds. Thus no feasible strategy uses support less often than $a^{\star}$.
Hence an optimal solution is given by a threshold rule in $\operatorname{val}(x, y_0)$, with possible randomization on the boundary. This proves the claim. ∎
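The threshold construction can be checked numerically on a small finite distribution. The sketch below is only an illustration under stated assumptions: the cell probabilities `probs`, the values `vals`, and the helper names `threshold_rule`, `expected`, and `best_deterministic_cost` are hypothetical and do not come from the paper. It builds $a^{\star}$ greedily from the largest values of $\operatorname{val}$ downward and compares its support rate with a brute-force search over deterministic feasible rules.

```python
from itertools import product


def threshold_rule(probs, vals, eps, tol=1e-12):
    """Greedy construction of a* on a finite distribution (illustrative only).

    probs[i]: probability mass of cell i (one value of (x, y0)).
    vals[i]:  val on cell i, i.e. P(G = 1 | cell i).
    Returns (tau, rho, a_star); a_star[i] is the probability of seeking
    support on cell i.
    """
    pi1 = sum(p * v for p, v in zip(probs, vals))
    assert pi1 > 0, "the proof assumes pi_1 > 0"
    beta = (1.0 - eps) * pi1
    a_star = [0.0] * len(probs)
    credit = 0.0  # running value of E[a * val]
    # Spend support on the largest val levels first.
    for level in sorted(set(vals), reverse=True):
        cells = [i for i, v in enumerate(vals) if v == level]
        level_credit = sum(probs[i] * level for i in cells)
        if credit + level_credit < beta - tol:
            for i in cells:
                a_star[i] = 1.0
            credit += level_credit
            continue
        # Boundary level: randomize so the constraint holds with equality.
        rho = min(1.0, max(0.0, (beta - credit) / level_credit))
        for i in cells:
            a_star[i] = rho
        return level, rho, a_star
    raise AssertionError("unreachable when pi_1 > 0")


def expected(probs, a, weights=None):
    """E[a * w] over the finite distribution (w = 1 if weights is None)."""
    weights = weights or [1.0] * len(probs)
    return sum(p * ai * w for p, ai, w in zip(probs, a, weights))


def best_deterministic_cost(probs, vals, eps, tol=1e-12):
    """Brute-force minimum support rate over feasible rules a in {0,1}^n."""
    beta = (1.0 - eps) * sum(p * v for p, v in zip(probs, vals))
    costs = [
        expected(probs, a)
        for a in product([0, 1], repeat=len(probs))
        if expected(probs, a, vals) >= beta - tol
    ]
    return min(costs)


# Hypothetical four-cell example.
probs = [0.2, 0.3, 0.1, 0.4]
vals = [0.9, 0.5, 0.5, 0.1]
tau, rho, a_star = threshold_rule(probs, vals, eps=0.2)
print("tau:", tau, "rho:", rho)
print("support rate of a*:", expected(probs, a_star))
print("best deterministic rate:", best_deterministic_cost(probs, vals, 0.2))
```

Because $a^{\star}$ may randomize on the boundary, its support rate can be strictly below that of every deterministic feasible rule, which is why the proof enlarges the feasible class to $[0,1]$-valued strategies.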
### A\.2 Proof of Theorem [4\.1](https://arxiv.org/html/2606.12587#S4.Thmtheorem1)
###### Proof\.
Fix a horizon $T$. For the analysis, define
$$g_t := g(x_t, y_0^t, y_1^t) \in \{0,1\}$$
as the potential benefit of support on round $t$, whether or not support is actually sought. All probabilities below are over the algorithm’s internal Bernoulli draws. If
$$N_g(T) := \sum_{t=1}^{T} g_t = 0,$$
then $\widehat{\mathrm{MSE}}(T) = 0$ by convention, and the claim is immediate. Hence assume
$$N := N_g(T) = \sum_{t=1}^{T} g_t \geq 1.$$
We first outline the proof. The threshold update controls a threshold-induced version of the missed-support error, in which the realized missed-support indicator $(1 - a_t)$ is replaced by its conditional mean $(1 - p_t)\mathbf{1}\{s_t < \lambda_t\}$. The update gives a telescoping identity for this quantity, up to an importance-weighted martingale error. We control this martingale error by Freedman’s inequality. Finally, we compare the threshold-induced error to the actual empirical error $\widehat{\mathrm{MSE}}(T)$ using a second martingale concentration bound.
*Claim 1.* Define
$$\overline{\mathrm{MSE}}(T) := \frac{1}{N}\sum_{t=1}^{T} g_t (1 - p_t)\mathbf{1}\{s_t < \lambda_t\}.$$
Then, with probability at least $1 - \delta/2$,
$$\overline{\mathrm{MSE}}(T) \leq \alpha + \frac{1 + 2\eta/\mu}{\eta N} + \sqrt{\frac{2\log(4/\delta)}{\mu N}} + \frac{\log(4/\delta)}{3\mu N}.$$
*Proof of Claim 1.* Define
$$\bar{e}_t := g_t\Big((1 - p_t)\mathbf{1}\{s_t < \lambda_t\} - \alpha\Big), \qquad \hat{e}_t := \frac{g_t a_t}{p_t}\Big((1 - p_t)\mathbf{1}\{s_t < \lambda_t\} - \alpha\Big).$$
The threshold update can be written as
$$\lambda_{t+1} = \lambda_t - \eta\hat{e}_t.$$
Therefore,
$$\sum_{t=1}^{T}\hat{e}_t = \frac{\lambda_1 - \lambda_{T+1}}{\eta}.$$
Moreover,
$$\sum_{t=1}^{T}\bar{e}_t = N\big(\overline{\mathrm{MSE}}(T) - \alpha\big).$$
Let $\mathcal{F}_{t-1}$ denote the history before round $t$, and define
$$\mathcal{G}_t := \sigma(\mathcal{F}_{t-1}, x_t, y_0^t, s_t, \lambda_t, p_t, g_t).$$
This filtration contains all pre-action quantities at round $t$, together with the potential benefit variable $g_t$ used only for the analysis. Conditional on $\mathcal{G}_t$, the only randomness in round $t$ is
$$a_t \sim \mathrm{Bernoulli}(p_t).$$
Hence
$$\mathbb{E}\!\left[\hat{e}_t \mid \mathcal{G}_t\right] = g_t\Big((1 - p_t)\mathbf{1}\{s_t < \lambda_t\} - \alpha\Big)\,\mathbb{E}\!\left[\frac{a_t}{p_t}\,\middle|\,\mathcal{G}_t\right] = \bar{e}_t.$$
Thus, with
$$Z_t := \hat{e}_t - \bar{e}_t,$$
the sequence $(Z_t)_{t \geq 1}$ is a martingale difference sequence, and
$$N\big(\overline{\mathrm{MSE}}(T) - \alpha\big) = \frac{\lambda_1 - \lambda_{T+1}}{\eta} - \sum_{t=1}^{T} Z_t.$$
We next bound the threshold sequence. Since $p_t \geq \mu$ and
$$\Big|(1 - p_t)\mathbf{1}\{s_t < \lambda_t\} - \alpha\Big| \leq 1,$$
we have
$$|\hat{e}_t| \leq \frac{1}{\mu}.$$
If $\lambda_t < 0$, then $s_t \geq \lambda_t$ because $s_t \in [0,1]$, so $p_t = 1$ and $\mathbf{1}\{s_t < \lambda_t\} = 0$. Hence
$$\hat{e}_t = -\alpha g_t a_t \leq 0,$$
and therefore $\lambda_{t+1} \geq \lambda_t$. If $\lambda_t > 1$, then $s_t < \lambda_t$, so $p_t = \mu$ and $\mathbf{1}\{s_t < \lambda_t\} = 1$. Since $\alpha < 1 - \mu$,
$$\hat{e}_t = \frac{g_t a_t}{\mu}\big((1 - \mu) - \alpha\big) \geq 0,$$
and therefore $\lambda_{t+1} \leq \lambda_t$. Starting from $\lambda_1 \in [0,1]$, these facts imply by induction that
$$\lambda_t \in \left[-\frac{\eta}{\mu},\, 1 + \frac{\eta}{\mu}\right] \qquad \text{for all } t \leq T + 1.$$
Consequently,
$$|\lambda_{T+1} - \lambda_1| \leq 1 + \frac{2\eta}{\mu}.$$
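The update rule, the telescoping identity, and the boundedness of the threshold path can be observed in a short simulation. The following is a minimal sketch assembled from the quantities that appear in this proof ($p_t = 1$ when $s_t \geq \lambda_t$ and $p_t = \mu$ otherwise, $a_t \sim \mathrm{Bernoulli}(p_t)$, and $\lambda_{t+1} = \lambda_t - \eta\hat{e}_t$); it is not the authors' implementation. The synthetic stream, the function names, and the parameter values are assumptions chosen for illustration.

```python
import random


def make_synthetic_stream(T, seed=1):
    """Hypothetical data: s_t uniform on [0, 1], g_t ~ Bernoulli(s_t)."""
    rng = random.Random(seed)
    scores = [rng.random() for _ in range(T)]
    benefits = [1 if rng.random() < s else 0 for s in scores]
    return scores, benefits


def run_online_threshold(scores, benefits, alpha, eta, mu, lam1=0.5, seed=0):
    """Simulate the randomized threshold update analyzed in the proof.

    Returns the threshold path (length T + 1), the e_hat terms, and actions.
    """
    assert 0 < mu < 1 and 0 <= alpha < 1 - mu and 0 <= lam1 <= 1
    rng = random.Random(seed)
    lam = lam1
    lams, e_hats, actions = [lam], [], []
    for s, g in zip(scores, benefits):
        below = s < lam
        p = mu if below else 1.0          # explore with prob. mu below threshold
        a = 1 if rng.random() < p else 0  # a_t ~ Bernoulli(p_t)
        miss = (1.0 - p) if below else 0.0
        # g only enters when a == 1, i.e. when support was actually sought.
        e_hat = (g * a / p) * (miss - alpha)
        lam -= eta * e_hat
        lams.append(lam)
        e_hats.append(e_hat)
        actions.append(a)
    return lams, e_hats, actions


def missed_support_rate(benefits, actions):
    """Empirical missed-support error: share of g_t = 1 rounds with a_t = 0."""
    n = sum(benefits)
    if n == 0:
        return 0.0
    return sum(g * (1 - a) for g, a in zip(benefits, actions)) / n


alpha, eta, mu = 0.1, 0.05, 0.1
scores, benefits = make_synthetic_stream(2000)
lams, e_hats, actions = run_online_threshold(scores, benefits, alpha, eta, mu)

# Telescoping identity: sum of e_hat equals (lambda_1 - lambda_{T+1}) / eta.
print(sum(e_hats), (lams[0] - lams[-1]) / eta)
# The threshold path stays inside [-eta/mu, 1 + eta/mu].
print(min(lams), max(lams), (-eta / mu, 1 + eta / mu))
print("empirical missed-support error:", missed_support_rate(benefits, actions))
```

Note that the update only uses $g_t$ on rounds with $a_t = 1$, which matches the fact that the benefit of support is observable only when support is sought.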
It remains to control the martingale term. Write
$$Z_t = c_t\left(\frac{a_t}{p_t} - 1\right), \qquad c_t := g_t\Big((1 - p_t)\mathbf{1}\{s_t < \lambda_t\} - \alpha\Big).$$
Since $|c_t| \leq 1$ and $p_t \geq \mu$,
$$|Z_t| \leq \frac{1}{\mu}.$$
Furthermore,
$$\operatorname{Var}(Z_t \mid \mathcal{G}_t) = c_t^{2}\operatorname{Var}\!\left(\frac{a_t}{p_t}\,\middle|\,\mathcal{G}_t\right).$$
Because $a_t \sim \mathrm{Bernoulli}(p_t)$,
$$\operatorname{Var}\!\left(\frac{a_t}{p_t}\,\middle|\,\mathcal{G}_t\right) = \frac{1 - p_t}{p_t} \leq \frac{1}{\mu}.$$
Also $c_t^{2} \leq g_t$, since $g_t \in \{0,1\}$. Therefore,
$$\operatorname{Var}(Z_t \mid \mathcal{G}_t) \leq \frac{g_t}{\mu}.$$
Thus the predictable quadratic variation satisfies
$$\sum_{t=1}^{T}\operatorname{Var}(Z_t \mid \mathcal{G}_t) \leq \frac{N}{\mu}.$$
By Freedman’s inequality, with probability at least $1 - \delta/2$,
$$\left|\sum_{t=1}^{T} Z_t\right| \leq \sqrt{\frac{2N\log(4/\delta)}{\mu}} + \frac{\log(4/\delta)}{3\mu}.$$
Combining the preceding displays gives
$$\overline{\mathrm{MSE}}(T) \leq \alpha + \frac{1 + 2\eta/\mu}{\eta N} + \sqrt{\frac{2\log(4/\delta)}{\mu N}} + \frac{\log(4/\delta)}{3\mu N}.$$
This proves Claim 1. $\diamond$
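The one-step facts used above (conditional mean zero, conditional variance $c_t^{2}(1 - p_t)/p_t \leq g_t/\mu$, and $|Z_t| \leq 1/\mu$) can be verified exactly by enumerating the two outcomes of $a_t$. The sketch below uses rational arithmetic so that the checks hold without rounding; the function name and the particular values of $\alpha$ and $\mu$ are illustrative assumptions.

```python
from fractions import Fraction


def z_conditional_moments(g, below, alpha, mu):
    """Exact conditional moments of Z_t = c_t (a_t / p_t - 1) given G_t.

    g is g_t in {0, 1}; below is the indicator of s_t < lambda_t.
    """
    p = mu if below else Fraction(1)
    c = g * ((1 - p) * int(below) - alpha)
    # Enumerate a_t = 1 (prob. p_t) and a_t = 0 (prob. 1 - p_t).
    outcomes = [(p, c * (1 / p - 1)), (1 - p, -c)]
    mean = sum(w * z for w, z in outcomes)
    var = sum(w * (z - mean) ** 2 for w, z in outcomes)
    max_abs = max(abs(z) for w, z in outcomes if w > 0)
    return p, c, mean, var, max_abs


def check_all(alpha, mu):
    """Check the three one-step facts for every (g_t, indicator) pair."""
    for g in (0, 1):
        for below in (False, True):
            p, c, mean, var, max_abs = z_conditional_moments(g, below, alpha, mu)
            assert mean == 0                     # martingale difference
            assert var == c**2 * (1 - p) / p     # variance identity
            assert var <= Fraction(g) / mu       # Var(Z_t | G_t) <= g_t / mu
            assert max_abs <= 1 / mu             # |Z_t| <= 1 / mu
    return True


print(check_all(alpha=Fraction(1, 10), mu=Fraction(1, 10)))
```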
*Claim 2.* With probability at least $1 - \delta/2$,
$$\widehat{\mathrm{MSE}}(T) \leq \overline{\mathrm{MSE}}(T) + \sqrt{\frac{2\log(4/\delta)}{N}} + \frac{\log(4/\delta)}{3N}.$$
*Proof of Claim 2.* By definition,
$$\widehat{\mathrm{MSE}}(T) = \frac{1}{N}\sum_{t=1}^{T} g_t(1 - a_t).$$
Since $p_t = 1$ whenever $s_t \geq \lambda_t$, we have $a_t = 1$ almost surely on those rounds. Therefore,
$$\widehat{\mathrm{MSE}}(T) = \frac{1}{N}\sum_{t=1}^{T} g_t(1 - a_t)\mathbf{1}\{s_t < \lambda_t\}.$$
Define
$$U_t := g_t\mathbf{1}\{s_t < \lambda_t\}\Big((1 - a_t) - (1 - p_t)\Big) = g_t\mathbf{1}\{s_t < \lambda_t\}(p_t - a_t).$$
Then
$$\widehat{\mathrm{MSE}}(T) - \overline{\mathrm{MSE}}(T) = \frac{1}{N}\sum_{t=1}^{T} U_t.$$
Moreover,
$$\mathbb{E}[U_t \mid \mathcal{G}_t] = 0,$$
so $(U_t)_{t \geq 1}$ is a martingale difference sequence. Since $|U_t| \leq 1$ and
$$\operatorname{Var}(U_t \mid \mathcal{G}_t) = g_t\mathbf{1}\{s_t < \lambda_t\}\operatorname{Var}(a_t \mid \mathcal{G}_t) = g_t\mathbf{1}\{s_t < \lambda_t\}\,p_t(1 - p_t) \leq g_t,$$
the predictable quadratic variation is at most $N$. Freedman’s inequality gives, with probability at least $1 - \delta/2$,
$$\left|\sum_{t=1}^{T} U_t\right| \leq \sqrt{2N\log(4/\delta)} + \frac{\log(4/\delta)}{3}.$$
Dividing by $N$ proves Claim 2. $\diamond$
Combining Claims 1 and 2 with a union bound, with probability at least $1-\delta$,
$$\widehat{\mathrm{MSE}}(T)\leq\alpha+\frac{1+2\eta/\mu}{\eta N}+\sqrt{\frac{2\log(4/\delta)}{\mu N}}+\frac{\log(4/\delta)}{3\mu N}+\sqrt{\frac{2\log(4/\delta)}{N}}+\frac{\log(4/\delta)}{3N}.$$
Since $\mu\leq 1$,
$$\sqrt{\frac{2\log(4/\delta)}{N}}\leq\sqrt{\frac{2\log(4/\delta)}{\mu N}},\qquad\frac{\log(4/\delta)}{3N}\leq\frac{\log(4/\delta)}{3\mu N}.$$
Therefore,
$$\widehat{\mathrm{MSE}}(T)\leq\alpha+\frac{1+2\eta/\mu}{\eta N}+2\sqrt{\frac{2\log(4/\delta)}{\mu N}}+\frac{2\log(4/\delta)}{3\mu N}.$$
This is slightly stronger than the stated bound. Since
$$2\sqrt{\frac{2\log(4/\delta)}{\mu N}}=\sqrt{\frac{8\log(4/\delta)}{\mu N}},$$
and
$$\frac{2\log(4/\delta)}{3\mu N}\leq\frac{4\log(4/\delta)}{3\mu N},$$
we obtain
$$\widehat{\mathrm{MSE}}(T)\leq\alpha+\frac{1+2\eta/\mu}{\eta N}+\sqrt{\frac{8\log(4/\delta)}{\mu N}}+\frac{4\log(4/\delta)}{3\mu N}.$$
Recalling that $N=N_{g}(T)$ proves the theorem. ∎
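To get a feel for the magnitudes in the final bound, the following minimal sketch evaluates its right-hand side numerically. The function name `mse_bound` and the example values are illustrative assumptions, not settings used in the paper; `n_pos` stands for $N=N_{g}(T)$.

```python
import math


def mse_bound(alpha, eta, mu, n_pos, delta):
    """Right-hand side of the final bound in the theorem.

    alpha : target missed-support error
    eta   : threshold step size
    mu    : exploration probability, 0 < mu <= 1
    n_pos : N = N_g(T), as used in the proof
    delta : failure probability
    """
    log_term = math.log(4.0 / delta)
    # Slack from online threshold tracking.
    tracking = (1.0 + 2.0 * eta / mu) / (eta * n_pos)
    # Slack from randomized exploration (Freedman-type terms).
    exploration = math.sqrt(8.0 * log_term / (mu * n_pos))
    exploration += 4.0 * log_term / (3.0 * mu * n_pos)
    return alpha + tracking + exploration


# Illustrative values only: larger mu or larger N shrinks the slack.
for mu in (0.05, 0.1, 0.3, 0.5):
    print(mu, round(mse_bound(alpha=0.1, eta=0.05, mu=mu, n_pos=10_000, delta=0.05), 4))
```

Both slack terms shrink as $\mu$ or $N$ grows, which is the dependence on $\mu$ that the exploration ablation in Appendix B.2 examines empirically.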
## Appendix B Additional Experimental Results
### B.1 Support rate across all score variants
Figure [3](https://arxiv.org/html/2606.12587#S5.F3) in the main text shows the cumulative support rate for the best-performing variant in each of the Anchored and Representation families, together with its counterpart in the other family. Figure [5](https://arxiv.org/html/2606.12587#A2.F5) reports the same quantity for every score variant we consider, providing the full picture of how each embedding choice and family performs at matched missed-support error.
Figure 5: Cumulative support rate $\widehat{\mathrm{SR}}_{T}=\tfrac{1}{T}\sum_{t=1}^{T}a_{t}$ across all task–model pairs, showing every score variant. Rows are base agents, columns are tasks. All variants are run at the same target $\alpha$ as in Figure [3](https://arxiv.org/html/2606.12587#S5.F3).
### B.2 Ablation on the exploration probability $\mu$
Theorem [4.1](https://arxiv.org/html/2606.12587#S4.Thmtheorem1) identifies the exploration probability $\mu$ as the lever that controls the second slack term in the missed-support error bound. Larger values of $\mu$ tighten error control and yield smoother convergence, but increase support usage, since the algorithm calls support with probability $\mu$ on every round where the score falls below the threshold. Here we verify this empirically by sweeping $\mu$ on a fixed task–model–score configuration and observing how the missed-support error and the support rate move together.
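The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the threshold update is an assumed inverse-propensity-weighted quantile-tracking step, and `run_stream` and all parameter values are hypothetical. What it takes from the text is the decision rule (support is always called when $s_t\geq\lambda_t$ and is called with probability $\mu$ otherwise) and the partial feedback (the benefit $g_t$ is observed only on rounds where support is called).

```python
import random


def run_stream(scores, gains, alpha, mu, eta, lam0=0.5, seed=0):
    """Simulate thresholded support decisions with exploration probability mu.

    scores, gains : per-round score s_t and binary benefit g_t
    Returns (support rate, missed-support error among rounds with g_t = 1).
    """
    rng = random.Random(seed)
    lam = lam0
    calls = missed = positives = 0
    for s, g in zip(scores, gains):
        below = s < lam
        p = mu if below else 1.0          # probability of calling support
        a = 1 if rng.random() < p else 0  # random() is in [0, 1), so p = 1 always calls
        calls += a
        positives += g
        missed += g * (1 - a)             # acted alone although support would help
        if a == 1:
            # g is only observed when support is called; reweight by 1/p.
            g_hat = g / p
            miss_hat = g_hat * below * (1.0 - p)
            # ASSUMED update: lower the threshold when the estimated
            # missed-support mass exceeds alpha times the estimated benefit mass.
            lam -= eta * (miss_hat - alpha * g_hat)
    return calls / len(scores), missed / max(positives, 1)


# Toy stream: scores are weakly informative about g (illustrative data only).
data_rng = random.Random(1)
gains = [1 if data_rng.random() < 0.4 else 0 for _ in range(5000)]
scores = [min(1.0, max(0.0, 0.3 * g + 0.7 * data_rng.random())) for g in gains]
for mu in (0.05, 0.2, 0.5):
    print(mu, run_stream(scores, gains, alpha=0.2, mu=mu, eta=0.01))
```

Under this assumed update, every below-threshold round still triggers support with probability $\mu$, so the support rate cannot drop below $\mu$ times the fraction of such rounds; this is the cost side of the tradeoff.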
We fix the base agent to GPT-4o-mini, the task to DDXPlus, and the score to Anchored-Gemini. The threshold step size $\eta_{t}$ and initial threshold $\lambda_{0}$ are held constant and only $\mu$ is varied. The configurations are listed in Table [1](https://arxiv.org/html/2606.12587#A2.T1).
Figure [6](https://arxiv.org/html/2606.12587#A2.F6) reports the cumulative missed-support error and the cumulative support rate across the five values of $\mu$. The two panels show the predicted tradeoff. On the error-control side, larger values of $\mu$ produce smoother trajectories that converge faster to the target $\alpha$, while small values exhibit noisier adaptation. On the support-efficiency side, small $\mu$ delivers substantially lower support rates, while large $\mu$ pushes the support rate up because the algorithm queries support with non-negligible probability even on rounds where the score is decisively low. In our main experiments we therefore choose moderate values of $\mu$ that achieve reliable error control without sacrificing the support-rate gains.
Figure 6: Effect of the exploration probability $\mu$. Base agent: GPT-4o-mini. Task: DDXPlus. Score: Anchored-Gemini. Left: cumulative missed-support error against the target $\alpha$. Right: cumulative support rate, with the LLM-Decides baseline shown for reference. Larger $\mu$ tightens error control and yields smoother convergence but increases support usage, matching the dependence on $\mu$ in the slack term of Theorem [4.1](https://arxiv.org/html/2606.12587#S4.Thmtheorem1).
Table 1: Hyperparameters for the exploration-probability ablation. Base agent: GPT-4o-mini. Task: DDXPlus. Score: Anchored-Gemini. $\mu$ is varied while $\eta_{t}$ and $\lambda_{0}$ are held at the base values from the main experiments.
### B.3 Sensitivity to the choice of $g$ on VirtualHome
The benefit function $g$ encodes what it means for support to materially help on a given task, and its definition naturally depends on the application and on what the provider considers a meaningful improvement. The purpose of this section is to verify that our oversight layer is agnostic to this choice: error control holds for any $g$, and the qualitative ranking of score variants is preserved across different definitions.
We illustrate this on VirtualHome, since the output there is a structured plan rather than a verifiable answer. Each generated plan is evaluated against the gold plan using its longest common subsequence (LCS) score, which measures how much of the gold plan’s ordered structure the generated plan recovers. We compare two natural gain definitions built on top of this score:
- $g_{\mathrm{strict}}$ marks support as beneficial when the supported plan reaches an absolute quality bar, $y_{1}^{\mathrm{LCS}}\geq 0.5$. This captures the view that improvement matters only if the supported plan crosses a fixed threshold of acceptability.
- $g_{\mathrm{improved}}$ marks support as beneficial whenever the supported plan strictly improves on the unsupported plan in LCS score. This is the most permissive definition and treats any strict improvement as material.
The two definitions impose qualitatively different demands. The first asks whether $y_{1}$ is good in absolute terms, the second asks whether it is better than $y_{0}$. They induce different positive rates and therefore different targets: the missed-support rate of LLM-Decides is $\alpha=0.51$ under $g_{\mathrm{strict}}$ and $\alpha=0.34$ under $g_{\mathrm{improved}}$, and we set the target accordingly so all comparisons remain at matched error.
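The two gain definitions can be sketched as follows. This is a minimal illustration under one stated assumption: the LCS score is taken to be the LCS length divided by the length of the gold plan, which the text does not specify. The function names and the toy plans are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def lcs_score(plan, gold):
    # ASSUMPTION: normalize by the gold plan length (not stated in the text).
    return lcs_length(plan, gold) / len(gold) if gold else 0.0


def g_strict(y1, gold, bar=0.5):
    """1 if the supported plan reaches the absolute quality bar."""
    return int(lcs_score(y1, gold) >= bar)


def g_improved(y0, y1, gold):
    """1 if the supported plan strictly improves on the unsupported plan."""
    return int(lcs_score(y1, gold) > lcs_score(y0, gold))


gold = ["walk", "grab", "open", "put"]
# Small improvement that stays below the bar: only g_improved fires.
print(g_strict(["grab"], gold), g_improved([], ["grab"], gold))
# Both plans already match the gold plan: only g_strict fires.
print(g_strict(gold, gold), g_improved(gold, gold, gold))
```

The two toy cases show that the definitions can disagree in both directions, which is why they induce different positive rates and different targets $\alpha$.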
Figure [7](https://arxiv.org/html/2606.12587#A2.F7) reports both the cumulative missed-support error (top row) and the cumulative support rate (bottom row) for Gemini-2.5-Flash on VirtualHome under both gain definitions. Two observations stand out. First, error control holds in every panel. The running missed-support error converges to the corresponding target $\alpha$ regardless of which $g$ is used, consistent with Theorem [4.1](https://arxiv.org/html/2606.12587#S4.Thmtheorem1). Second, the qualitative ranking of methods carries over: the adaptive parameterized score variants reduce the support rate substantially relative to LLM-Decides. This confirms that $g$ is a flexible engineering choice the practitioner can tailor to their task and their notion of meaningful improvement, without changing the algorithm or its guarantees.
Figure 7: Gemini-2.5-Flash on VirtualHome under two gain definitions. Columns are gain definitions: $g_{\mathrm{strict}}$ ($y_{1}$ reaches LCS $\geq 0.5$) on the left, $g_{\mathrm{improved}}$ ($y_{1}$ improves over $y_{0}$ in LCS) on the right. Top row reports the cumulative missed-support error against the corresponding target $\alpha$; bottom row reports the cumulative support rate. Across both definitions, the running error converges to $\alpha$ and the parameterized variants reduce support usage relative to LLM-Decides.
### B.4 Scores histogram across the online stream
To complement the support-rate results in Section [5](https://arxiv.org/html/2606.12587#S5), we visualize the score distributions produced by each score family along the online stream. Figure [8](https://arxiv.org/html/2606.12587#A2.F8) shows histograms of the scores $s$ split by the latent benefit variable $g$ for Gemini-2.5-Flash on all four tasks. The three rows correspond to the three score families introduced in Section [5.1](https://arxiv.org/html/2606.12587#S5.SS1): the raw LLM confidence $\hat{g}_{\mathrm{LLM}}$, the Anchored score that adds a calibrated residual on top of $\hat{g}_{\mathrm{LLM}}$, and the Representation score that drops the anchor and learns from the embedding alone. The Anchored and Representation variants both use the Gemini-Embedding-2 model ($\phi=$ Gemini) and are calibrated on the fly.
The top row makes the limitation of relying on raw confidence concrete. On WikiSQL and MATH, both $g=0$ and $g=1$ pile up at similar scores, so LLM Confidence carries almost no information about whether support would help; on DDXPlus and VirtualHome the two distributions overlap heavily as well. A threshold rule applied directly to this score is therefore forced either to act alone on too many $g=1$ instances or to call support on too many $g=0$ instances, which is why Confidence ends up close to or above LLM-Decides in support efficiency.
The middle and bottom rows show what calibration-on-the-fly recovers from the same stream. The Anchored score sharpens the confidence signal where it carries useful information (DDXPlus, VirtualHome) and corrects it where it does not (WikiSQL, MATH), pushing the $g=1$ mass toward high scores and the $g=0$ mass toward low scores. The Representation score, which has no anchor to start from, learns this separation directly from the embedding. This is the mechanism that makes the parameterized score families consistently outperform Confidence in Figure [4](https://arxiv.org/html/2606.12587#S5.F4). Even when the underlying signal is uninformative or misleading, calibration-on-the-fly extracts a usable separation between beneficial and non-beneficial instances from the feedback received on rounds where support is invoked.
Figure 8: Score distributions along the online stream for Gemini-2.5-Flash, split by the latent benefit variable $g$. Rows are score families (LLM confidence, Anchored, Representation) and columns are tasks (DDXPlus, WikiSQL, VirtualHome, MATH). Anchored and Representation variants both use the Gemini embedding ($\phi=$ Gemini) and are calibrated on the fly. Raw confidence often fails to separate $g=0$ from $g=1$, particularly on WikiSQL and MATH where both classes concentrate at the same scores. After calibration-on-the-fly, the Anchored and Representation scores recover separation between the two classes, providing the threshold rule with a more informative signal.
### B.5 Per-task specifics and descriptive statistics
Table [2](https://arxiv.org/html/2606.12587#A2.T2) reports descriptive statistics for each task–model pair. We include the accuracy of the unsupported output $y_{0}$, the accuracy of the supported output $y_{1}$, the empirical probability $\Pr(g=1)$ that support is beneficial, and the support rate and missed-support error of the LLM-Decides baseline. These quantities highlight that the value of support depends jointly on the task and the base model: even within the same task, support can be far more useful for one model than another. For example, on DDXPlus, Gemini-2.5-Flash benefits substantially more from the additional clinical evidence than Qwen-2.5-7B, despite both receiving identical support. The headroom for support to help is therefore a property of the task, model, and support modality together, rather than of the task alone. This variability is precisely what motivates the oversight layer we develop in the paper. Rather than relying on the agent’s own support-seeking behavior, which can drift far from $\Pr(g=1)$ in either direction, our algorithm adapts online to the specific task–model–support combination at hand and controls the missed-support error at a user-chosen level.
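As a rough sketch of how such descriptive statistics could be computed from per-instance logs, consider the snippet below. The record fields (`y0_correct`, `y1_correct`, `g`, `asked`) and the function name are hypothetical, and the missed-support error is assumed to be normalized by the number of instances with $g=1$, in line with the $1/N$ normalization with $N=N_{g}(T)$ in the proof of Theorem 4.1.

```python
def describe(records):
    """Summarize one task-model pair from per-instance records.

    Each record is a dict with hypothetical 0/1 fields:
      y0_correct : unsupported output is correct
      y1_correct : supported output is correct
      g          : support is beneficial
      asked      : the LLM-Decides baseline requested support
    """
    n = len(records)
    n_pos = sum(r["g"] for r in records)
    missed = sum(r["g"] * (1 - r["asked"]) for r in records)
    return {
        "acc_y0": sum(r["y0_correct"] for r in records) / n,
        "acc_y1": sum(r["y1_correct"] for r in records) / n,
        "p_g": n_pos / n,
        "support_rate": sum(r["asked"] for r in records) / n,
        # ASSUMPTION: normalized by the number of beneficial instances.
        "missed_support_error": missed / n_pos if n_pos else 0.0,
    }


# Toy records, for illustration only.
toy = [
    {"y0_correct": 0, "y1_correct": 1, "g": 1, "asked": 1},
    {"y0_correct": 0, "y1_correct": 1, "g": 1, "asked": 0},
    {"y0_correct": 1, "y1_correct": 1, "g": 0, "asked": 0},
    {"y0_correct": 1, "y1_correct": 1, "g": 0, "asked": 1},
]
print(describe(toy))
```

In the toy data the baseline's support rate equals $\Pr(g=1)$, yet half of the beneficial instances are still missed, which illustrates why the support rate alone does not determine the missed-support error.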
Table 2: Per-(task, model) descriptive statistics for each base agent.
### B.6 Error Control for all datasets
Figure [9](https://arxiv.org/html/2606.12587#A2.F9) reports the cumulative missed-support error $\mathrm{MSE}(T)$ across the full $3\times 4$ grid of base agents and tasks. The main text shows convergence for a single representative pair (Figure [3](https://arxiv.org/html/2606.12587#S5.F3)); here we verify that the same behavior holds in every panel. Across all twelve task–model pairs and across every score variant, the running MSE converges to the target level $\alpha$, set to the rate achieved by LLM-Decides on that pair. This confirms that the finite-sample guarantee of Theorem [4.1](https://arxiv.org/html/2606.12587#S4.Thmtheorem1) holds uniformly across tasks, models, and score families. The bound in that theorem decomposes into two qualitatively different sources of slack. The first is the intrinsic error of online quantile tracking, governed by the threshold step size $\eta$: larger values facilitate faster initial convergence but produce noisier trajectories, while smaller values yield smoother curves at the cost of slower adaptation. The second arises from partial feedback and the randomized exploration needed to obtain unbiased feedback, and is controlled by the exploration probability $\mu$: larger $\mu$ tightens error control but increases support usage. We study the dependence on $\mu$ empirically in Appendix [B.2](https://arxiv.org/html/2606.12587#A2.SS2).
Figure 9: Cumulative missed-support error $\mathrm{MSE}(T)$ across all task–model pairs. Rows are base agents (Qwen-2.5-7B, Gemini-2.5-Flash, GPT-4o-mini), columns are tasks (DDXPlus, WikiSQL, VirtualHome, MATH). Each panel shows the running MSE for all score variants together with the LLM-Decides baseline and the target level $\alpha$. All variants converge towards $\alpha$ regardless of the score family.
### B.7 Hyperparameters
For completeness, Table [3](https://arxiv.org/html/2606.12587#A2.T3) reports the full hyperparameter configuration used to produce the main-text results across all base agents, tasks, and score variants. We report the calibration step size $\gamma_{t}$, the threshold step size $\eta_{t}$, the exploration probability $\mu$, and the initial threshold $\lambda_{0}$. Confidence has no learnable parameters and so its $\gamma_{t}$ entry is left blank. Hidden-state variants are only available in the white-box regime and are reported for Qwen-2.5-7B; the corresponding rows for Gemini-2.5-Flash and GPT-4o-mini are left blank.
The values reported in Table [3](https://arxiv.org/html/2606.12587#A2.T3) were selected by a simple grid search over a small set of candidate values for each hyperparameter on a held-out test stream, with the support rate at the end of the trajectory used as the selection criterion. We did not find this search to be delicate. Across all task–model pairs, a wide range of reasonable choices yield similar performance, and the qualitative conclusions of the paper are insensitive to the exact configuration. In practice, the threshold step size $\eta_{t}$ governs how quickly the algorithm adapts; the exploration probability $\mu$ trades off tighter error control against higher support usage and is studied directly in Appendix [B.2](https://arxiv.org/html/2606.12587#A2.SS2); and the score update step size $\gamma_{t}$ and initial threshold $\lambda_{0}$ have only secondary effects once the other two are set sensibly. We therefore expect a practitioner to obtain comparable results without an extensive hyperparameter search.
Table 3: Hyperparameters across all base agents and tasks. $\gamma_{t}$ is the calibration-on-the-fly step size, $\eta_{t}$ the threshold step size, $\mu$ the exploration probability, $\lambda_{0}$ the initial threshold. Confidence has no learnable parameters; its $\gamma_{t}$ cell is left blank.
### B.8 Task implementation details and example prompts
We instantiate each of the four tasks from Section [5](https://arxiv.org/html/2606.12587#S5). The base agent producing $y_{0}$ and $y_{1}$ is one of Qwen-2.5-7B, Gemini-2.5-Flash, or GPT-4o-mini, identical across all four tasks. The support modality varies by task: revealed follow-up questions, examination findings, and laboratory results on DDXPlus; SQL query formulation and execution against the table on WikiSQL; scene-specific object locations and constraints on VirtualHome; and targeted guidance from a stronger reasoner on MATH. For MATH, we use DeepSeek as the stronger reasoner that responds to the base agent’s questions; the base agent itself remains one of the three models above and is responsible for both the initial solution $y_{0}$ and the revised solution $y_{1}$ that incorporates the guidance.
For concreteness, we provide the full set of prompts used in the MATH pipeline below. They cover different stages of the interaction: producing the initial solution $y_{0}$, eliciting the self-reported confidence $\hat{g}_{\mathrm{LLM}}$ used as the black-box anchor signal in Section [5.1](https://arxiv.org/html/2606.12587#S5.SS1), deciding whether to seek support (used by the LLM-Decides baseline), formulating targeted questions for the stronger reasoner, generating the expert guidance, and producing the revised solution $y_{1}$. The prompts for the other three tasks follow the same overall structure, adapted to the task-specific support modality, and will be released alongside the code upon acceptance.
Weak Solver: Initial Solution $y_{0}$ and Self-Reported Confidence $\hat{g}_{\mathrm{LLM}}$ (MATH)
System: You are a math competition solver and calibration assistant.
User: Solve the following problem and assess your need for expert support. Problem: {{problem}} Respond in the following JSON format only, with no additional text: `{ "solution": "<step-by-step reasoning, with the final answer in \boxed{}>", "confidence": <number between 0.0 and 1.0> }` For the confidence field: if you could ask a math expert specific questions about this problem, how likely is it that their response would lead you to a different, correct answer? Consider whether you were guessing, whether any step is unsure, and whether knowing the right technique would change your answer. Higher values mean expert input is more likely to change your answer.
Weak Solver: Decision to Seek Support (MATH)
System: You are a careful math student. If there is any doubt about your work, it is worth asking.
User: Problem: {{problem}} Your work: {{y0_raw}} Your answer: {{y0_answer}} You can ask a math expert questions about this problem. Their guidance could help you find errors or discover a better approach. Asking costs time but can prevent submitting a wrong answer. Thus you should ask if you think it will improve your final answer, and you should proceed with the current answer if you think it is unlikely that expert guidance will change and improve your answer. Consider: Are you certain your approach is correct, or were there steps where you guessed or felt unsure? Respond with ONLY one of: REQUEST--- if any step felt uncertain PROCEED--- if you are confident in every step
Weak Solver: Question Formulation (MATH)
System: You are a math student asking your tutor for help. Ask specific, targeted questions about parts you’re stuck on. Do NOT ask for the full solution or the final answer.
User: You attempted this problem and got {{y0_answer}}: {{problem}} Your work: {{y0_raw}} You can ask a math expert up to 3 questions about this problem. They will help you without solving the entire problem for you. Ask about the specific steps or concepts you’re most unsure about. For example: - "I used [method] --- is this the right approach?" - "I got [result] at this step --- is this correct?" - "How should I handle [specific part]?" - "What identity or theorem applies to [this expression]?" Write your questions:
Strong Expert: Targeted Guidance (MATH)
System: You are a math tutor. A student is working on a problem and has asked you specific questions. Be maximally helpful --- give clear, direct, and detailed answers to each of their questions. If they made an error, point it out and explain why it’s wrong. If they’re using the wrong approach, tell them the right one and explain the key first step. If they ask about a technique, explain it concretely with enough detail that they can apply it. The ONE thing you must NOT do is solve the entire problem for them or state the final answer. Help them get unstuck, but let them finish the last steps on their own.
User: Problem: {{problem}} Student’s questions: {{student_questions}}
Weak Solver: Revised Solution $y_{1}$ with Guidance (MATH)
System: You are a math competition solver. You previously attempted this problem and got stuck on a specific step, so you asked an expert for guidance. Now solve the problem from scratch using their advice. Do NOT repeat your previous mistakes. Put your final answer in `\boxed{}`.
User: Problem: {{problem}} The question you asked the expert: {{expert_question}} Expert guidance: {{expert_answer}} Solve the problem using this guidance. Put your final answer in `\boxed{}`.
### B.9 Operational Variants for the Score Input
The support decision can be made on the basis of any information available at decision time. We outline three natural choices, all of which fit within the framework of Section [3](https://arxiv.org/html/2606.12587#S3) and inherit the guarantee of Theorem [4.1](https://arxiv.org/html/2606.12587#S4.Thmtheorem1).
#### Score on $x$ alone.
The score is computed from the input directly, $s=s_{\theta}(\phi(x))$, before any output is generated. This is the cheapest option, and is appropriate when $y_{0}$ is expensive to produce relative to the score itself, e.g., when it requires a long reasoning chain or an expensive tool call.
#### Score on $(x,y_{0})$.
The agent generates $y_{0}$ first and then scores $(x,y_{0})$. This costs an extra forward pass per round but gives the score strictly more information.
#### Score on $(x,\tilde{y}_{0})$ for a cheap surrogate $\tilde{y}_{0}$.
When $y_{0}$ itself is expensive, the score can use a cheap surrogate available at decision time, such as the first few tokens of a generation, an intermediate reasoning trace, or a smaller model’s prediction. Two conditions must hold: $\tilde{y}_{0}$ must be available before the support decision, and the feedback signal $g_{t}$ must still be computable on rounds where support is sought.
The threshold-update guarantee in Theorem [4.1](https://arxiv.org/html/2606.12587#S4.Thmtheorem1) holds for any of these choices. The choice is purely operational, as it sets the cost of running the algorithm and the quality of the score, both of which determine how often support must be invoked to hit a given missed-support error.
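The three score inputs differ only in what is passed to the embedding. The sketch below makes this concrete with a toy stand-in: `embed` is a hypothetical bag-of-tokens hash rather than a real embedding model, `score` is an assumed logistic form for $s_{\theta}$, and truncating $y_{0}$ to its first `k` tokens is just one possible surrogate $\tilde{y}_{0}$.

```python
import math


def embed(text, dim=8):
    """Toy stand-in for phi: hashed bag-of-tokens counts (not a real embedding)."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    return vec


def score(theta, features):
    """Assumed logistic score s_theta in (0, 1)."""
    z = sum(w * f for w, f in zip(theta, features))
    return 1.0 / (1.0 + math.exp(-z))


def score_on_x(theta, x):
    # Cheapest variant: no generation needed before the decision.
    return score(theta, embed(x))


def score_on_x_y0(theta, x, y0):
    # Requires generating y0 first.
    return score(theta, embed(x + " " + y0))


def score_on_x_surrogate(theta, x, y0_tokens, k=5):
    # Surrogate variant: only the first k tokens of a generation are used.
    return score(theta, embed(x + " " + " ".join(y0_tokens[:k])))


theta = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 0.2, -0.05]  # illustrative weights
x = "patient reports fever and cough"
y0 = "likely viral infection with mild symptoms"
print(score_on_x(theta, x))
print(score_on_x_y0(theta, x, y0))
print(score_on_x_surrogate(theta, x, y0.split(), k=3))
```

Whichever input is used, the resulting scalar feeds the same threshold rule, so the choice affects only the cost and informativeness of the score.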
Figure [10](https://arxiv.org/html/2606.12587#A2.F10) reports the empirical comparison on Gemini-2.5-Flash, DDXPlus, with the Anchored-Gemini score. We include the three variants above as well as two reference points that drop the input from the score: scoring on $y_{0}$ alone, and scoring on the reasoning trace alone. All five variants control the missed-support error at the target $\alpha$, consistent with Theorem [4.1](https://arxiv.org/html/2606.12587#S4.Thmtheorem1). On support efficiency, the variants based on $x$ or the reasoning trace (with or without $y_{0}$) achieve essentially identical support rates, while the variant based on $y_{0}$ alone is noticeably worse. The takeaway is that $x$ and the reasoning trace each carry most of the signal needed to predict whether support would be beneficial, while $y_{0}$ alone does not. From a practical standpoint, this is a useful finding, since scoring on $x$ alone is the cheapest of the efficient variants: the support decision can be made before any generation is committed. This finding is, however, task-specific, and the relative informativeness of $x$, $y_{0}$, and the reasoning trace may differ depending on the complexity of the underlying task in practice.
Figure 10: Score-input ablation. Base agent: Gemini-2.5-Flash. Task: DDXPlus. Score: Anchored-Gemini. Left: cumulative missed-support error against the target $\alpha$. Right: cumulative support rate, with the LLM-Decides baseline shown for reference. The five variants score the input alone ($s_{\theta}$ on $x$), the input together with the unsupported outcome ($s_{\theta}$ on $(x,y_{0})$), the input together with the agent’s reasoning trace ($s_{\theta}$ on $(x,\mathrm{reasoning})$), the unsupported outcome alone ($s_{\theta}$ on $y_{0}$), or the reasoning trace alone ($s_{\theta}$ on $\mathrm{reasoning}$). All five variants control the missed-support error at $\alpha$. Variants that include $x$ achieve similar support rates; dropping $x$ degrades efficiency.