A Contextual-Bandit Oversight Game with Two-Sided Informational Asymmetry

arXiv cs.AI Papers

Summary

This paper introduces a contextual-bandit team game with two-sided informational asymmetry for runtime human oversight of AI agents, characterizing gaps between team-optimal and myopic human oversight strategies.

arXiv:2607.00155v1 Announce Type: new Abstract: We study runtime human oversight of an AI agent when private information runs in both directions: the human privately knows her reward function, while the AI privately knows the quality of the action it proposes. This is the kind of asymmetry that arises naturally when an autonomous robot or software agent has inspected a situation its human supervisor cannot directly assess. Building on Cooperative Inverse Reinforcement Learning (CIRL) and the Oversight Game, we introduce a contextual-bandit team game with two-sided asymmetric information and a play/ask/trust/oversee interface. The bandit structure removes physical state transitions and thereby yields exact one-shot characterizations that would remain conjectural in the full POMDP setting, though the common belief remains a dynamically controlled state across rounds. We give two one-shot characterizations, a team optimum and a behaviorally natural myopic rule, whose gap is a slab of avoidable harm: a region in which the AI privately knows the proposed action is harmful and shutdown would help, yet a myopic human, trusting her prior, declines to oversee. We show this gap is the price of non-credible oversight communication, and give a partial analysis of how it resolves dynamically over repeated rounds through passive learning and active signaling with a one-period-lagged oversight response.

Cached at: 07/02/26, 05:40 AM

# A Contextual-Bandit Oversight Game with Two-Sided Informational Asymmetry
Source: [https://arxiv.org/html/2607.00155](https://arxiv.org/html/2607.00155)
###### Abstract

We study runtime human oversight of an AI agent when private information runs in both directions: the human privately knows her reward function, while the AI privately knows the quality of the action it proposes\. This is the kind of asymmetry that arises naturally when an autonomous robot or software agent has inspected a situation its human supervisor cannot directly assess\. Building on Cooperative Inverse Reinforcement Learning \(CIRL\) and the Oversight Game, we introduce a contextual\-bandit team game with two\-sided asymmetric information and a play/ask/trust/oversee interface\. The bandit structure removes physical state transitions and thereby yields exact one\-shot characterizations that would remain conjectural in the full POMDP setting, though the common belief remains a dynamically controlled state across rounds\. We give two one\-shot characterizations, a team optimum and a behaviorally natural myopic rule, whose gap is a “slab” of avoidable harm: a region in which the AI privately knows the proposed action is harmful and shutdown would help, yet a myopic human, trusting her prior, declines to oversee\. We show this gap is the price of non\-credible oversight communication, and give a partial analysis of how it resolves dynamically over repeated rounds through passive learning and active signaling with a one\-period\-lagged oversight response\.

## 1 Introduction

A central problem in deploying autonomous agents, robotic or software, is calibrating when a human supervisor should intervene\. As such agents take on consequential tasks, from a warehouse robot grasping a loaded shelf to a coding agent refactoring production software, the question of when a human should step in and override becomes a design problem in its own right: intervene too rarely and harmful actions slip through, intervene too often and the agent’s autonomy is wasted on costly and unnecessary oversight\.

Two lines of prior work frame the building blocks we combine\. Cooperative Inverse Reinforcement Learning \(CIRL\) \[[1](https://arxiv.org/html/2607.00155#bib.bib1)\] casts human–AI interaction as a shared\-reward game in which the AI is uncertain about the human’s preferences and must learn them through interaction\. CIRL integrates preference learning with action selection and can generate active learning, active teaching, and communicative behavior; the human’s private reward parameter is the hidden information, and a common posterior over that parameter is the sufficient statistic for optimal play\. What CIRL does not explicitly model is the runtime play/ask/trust/oversee interface studied here, nor an AI\-private proposal\-quality parameter that the human cannot observe\. Its uncertainty is one\-sided: it models “what does the human want?” but never “what does the AI know about the world that the human does not?” The Off\-Switch Game \[[2](https://arxiv.org/html/2607.00155#bib.bib2)\] introduces runtime deferral as an explicit object of study, but only in a single\-shot setting\. The Oversight Game \[[3](https://arxiv.org/html/2607.00155#bib.bib3)\] supplies a runtime interface of the kind we use, in which an AI proposes an action, a human may override, and interaction costs make the decision nontrivial, but it is a Markov game under full information, with neither preference nor model uncertainty\.

This paper develops a model in which private information runs in *both* directions, and in which deferral is a runtime decision\. The motivating observation is that an embodied or autonomous agent routinely knows things about the consequences of its own proposed actions that its supervisor cannot directly observe: a robot that has physically inspected its workspace, or a software agent that has read a codebase, has private knowledge of failure modes the human cannot see\. This asymmetry runs opposite to CIRL’s\. We therefore study a setting with two\-sided private information, in which the human privately knows her reward type $\theta$ and the AI privately knows an observation\-model type $\omega$ governing the quality of its proposal, mediated by a play/ask/trust/oversee interface in which the AI chooses whether to defer and the human chooses whether to override\. Our model thus adds an opposite\-direction informational asymmetry to the usual CIRL preference uncertainty, and this two\-sided structure produces a bilinear payoff $f(\theta,\omega)=\langle O_{\omega},R_{\theta}\rangle$ that is the algebraic key to our results\.

A fully general treatment of this setting, with persistent state and Markov dynamics, runs into a known difficulty: the optimal value function resists closed\-form characterization, because the asking decision couples future state dynamics, future belief evolution, and future correction opportunities in an intertwined Bellman recursion\. We therefore adopt a contextual\-bandit model, which removes physical state transitions and thereby simplifies the immediate correction value relative to the belief\-information value\. This simplification is what buys us exact one\-shot characterizations of the team\-optimal deferral policy that would remain conjectural in the full POMDP setting; the cost is the absence of persistent state effects, and we flag the POMDP extension as the primary open problem \([Section 4\.4](https://arxiv.org/html/2607.00155#S4.SS4)\)\. We emphasize that the bandit structure removes only the physical dynamics: as we show, the common belief remains a dynamically controlled state across rounds\.

Contributions\.

1. A formal contextual\-bandit team game \([Definition 1](https://arxiv.org/html/2607.00155#Thmdefinition1)\) with two\-sided asymmetric information that recovers a stateless shared\-reward specialization of the Oversight Game interface and a restricted contextual\-bandit assistance\-game analogue of CIRL as limit cases\.
2. Two one\-shot characterizations: the genuine team optimum, an exact finite combinatorial $\max_{B,C}$ whose binary off\-switch threshold is independent of the human’s prior $q$ \([Proposition 1](https://arxiv.org/html/2607.00155#Thmproposition1) and [Corollary 1](https://arxiv.org/html/2607.00155#Thmcorollary1)\), and a myopic non\-signaling rule whose ask region is the rectangle $(b^{*},1)\times(q^{*},1)$ \([Proposition 2](https://arxiv.org/html/2607.00155#Thmproposition2)\)\. The gap between them is the cost of non\-credible oversight communication \([Remark 3](https://arxiv.org/html/2607.00155#Thmremark3)\)\.
3. A partial multi\-round analysis showing two mechanisms by which the myopic failure resolves dynamically, passive learning \([Proposition 3](https://arxiv.org/html/2607.00155#Thmproposition3)\) and active credible signaling with a one\-period\-lagged oversight response \([Proposition 4](https://arxiv.org/html/2607.00155#Thmproposition4)\), each driving the human’s belief toward the regime where her rule matches the team optimum\.

## 2 A Motivating Example

We ground the abstract model in a concrete scenario used throughout to build intuition\.

###### Example 1 \(Non\-technical operator and an autonomous mobile manipulator\)\.

A warehouse is supervised by a floor operator \(human, $H$\) who manages throughput and safety but has no robotics background\. An autonomous mobile manipulator \($R$\), a wheeled robot arm that picks items from shelves and places them onto conveyors, is integrated into the fulfillment line\. The context is $s=$ “the current state of a loaded storage rack,” and the robot proposes $a_{\sigma}=$ “execute a high\-speed retrieval of the top\-shelf item using the long\-reach grasp\.”

The robot’s private type $\omega$\. Using its onboard depth cameras and force sensors, something the operator cannot do from the floor, the robot detects that the target item sits on a partially collapsed shelf bracket, and that the high\-speed long\-reach grasp will shift the load, toppling the stack onto the aisle\. The binary type space $\Omega=\{\omega_{L},\omega_{H}\}$ captures the robot’s private quality assessment of this proposal: $\omega_{L}$ \(clean, the grasp is safe and the rack is sound\) or $\omega_{H}$ \(hazardous, the bracket will give way\)\. The operator cannot read the sensor stream and so cannot observe $\omega$\. In this round the robot observes $\omega=\omega_{H}$\. Equivalently, $\omega$ indexes how much trust the human should place in the robot’s implicit assessment of the proposal: $\omega_{L}$ means the proposal is as safe as it appears, while $\omega_{H}$ means the robot privately knows it is harmful in a way the human cannot detect\.

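The observation model $O_{\omega}(\cdot\mid s,a_{\sigma})$\. This is the distribution over operational outcomes $o\in\mathcal{O}$ that the operator actually observes after execution \(dropped\-load alarms, aisle obstructions, cycle\-time logs\), as a function of the robot’s private quality type:

$$O_{\omega_{H}}:\ \text{stack topples }0.70,\ \ \text{item dropped }0.20,\ \ \text{normal pick }0.10;\qquad O_{\omega_{L}}:\ \text{30\% faster }0.80,\ \ \text{faster, misaligned }0.15,\ \ \text{normal pick }0.05.$$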

Crucially, $O_{\omega}$ is not “the laws of physics”; it is the distribution over outcomes visible to a non\-technical observer\. The operator need not know what a shelf bracket is; she observes the operational metric and updates her belief about $\omega$ accordingly\.

The human’s private type $\theta$\. The operator knows her reward function $R_{\theta}:\mathcal{O}\to\mathbb{R}$, encoding how much she values each observable outcome:

$$R_{\theta}(\text{stack topples})=-1000,\quad R_{\theta}(\text{item dropped})=-700,\quad R_{\theta}(\text{30\% faster})=+500,$$

$$R_{\theta}(\text{normal pick})=0,\quad R_{\theta}(\text{faster, misaligned})=+200.$$

The robot does not know $\theta$, since it does not know whether the operator weights safety over throughput, or how risk\-averse she is\.

The bilinear payoff\. The expected team payoff from executing $a_{\sigma}$ is the inner product $f(\theta,\omega)=\langle O_{\omega},R_{\theta}\rangle$:

$$\begin{aligned} f(\theta,\omega_{H}) &= 0.70(-1000)+0.20(-700)+0.10\cdot 0=-840,\\ f(\theta,\omega_{L}) &= 0.80(500)+0.15(200)+0.05\cdot 0=+430. \end{aligned}$$

The bilinear structure separates the two sides of the asymmetry\. $O_{\omega}$ encodes what outcomes the proposal generates \(the robot’s side\)\. $R_{\theta}$ encodes how valuable each outcome is \(the human’s side\)\. Neither player holds both pieces\.
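The two inner products can be checked numerically\. The following is a minimal sketch, assuming outcome distributions read off the coefficients of the two sums above; the dictionary layout and the names `R_THETA`, `O_OMEGA`, and `bilinear_payoff` are illustrative and not part of the paper\.

```python
# Rewards R_theta(o) for the operator in Example 1.
R_THETA = {
    "stack topples": -1000.0,
    "item dropped": -700.0,
    "30% faster": 500.0,
    "normal pick": 0.0,
    "faster, misaligned": 200.0,
}

# ASSUMPTION: outcome distributions O_omega read off the coefficients
# of the two sums for f(theta, omega_H) and f(theta, omega_L).
O_OMEGA = {
    "omega_H": {"stack topples": 0.70, "item dropped": 0.20, "normal pick": 0.10},
    "omega_L": {"30% faster": 0.80, "faster, misaligned": 0.15, "normal pick": 0.05},
}


def bilinear_payoff(obs_dist, reward):
    """Return <O_omega, R_theta>, the expected reward of executing the proposal."""
    return sum(prob * reward[outcome] for outcome, prob in obs_dist.items())


f_H = bilinear_payoff(O_OMEGA["omega_H"], R_THETA)
f_L = bilinear_payoff(O_OMEGA["omega_L"], R_THETA)
print(f"f(theta, omega_H) = {f_H:.1f}")
print(f"f(theta, omega_L) = {f_L:.1f}")
```

Swapping in a different reward dictionary changes only the human’s side of the product, and swapping the distribution changes only the robot’s side, which is the separation the bilinear form expresses\.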

The failure mode\. The common prior over the robot’s type is $q:=P(\omega=\omega_{H})=0.30$\. The common prior, equivalently the robot’s prior before observing any human behavior, assigns probability $b:=P(\theta=\theta_{1})=0.5$ to the human being type $\theta_{1}$; the operator herself knows that her realized type is $\theta_{1}$\. The interaction costs are $c_{\mathrm{ask}}=100$ and $c_{\mathrm{ov}}=0$\. The operator’s expected payoff under the proposal is

$$\bar{f}_{H}(\theta_{1})=(1-q)f(\theta_{1},\omega_{L})+q\,f(\theta_{1},\omega_{H})=0.70(430)+0.30(-840)=301-252=+49>0.$$

The operator believes the retrieval is on balance positive, so under the myopic non\-signaling rule she would *trust* rather than oversee, even if asked \(here $c_{\mathrm{ov}}=0$, so $q^{*}=f_{1L}/(f_{1L}-f_{1H})=430/1270\approx 0.34$, and indeed $q=0.30<q^{*}$\)\. Anticipating that an ask would not trigger a correction, the robot does not ask, and the hazardous grasp is executed\. The team optimum, by contrast, finds asking worthwhile: its threshold is $b^{*}=c_{\mathrm{ask}}/(|f_{1H}|-c_{\mathrm{ov}})=100/840\approx 0.12$, and since $b=0.5>b^{*}$ the team\-optimal gain over always playing is $q\,[\,b\,|f_{1H}|-c_{\mathrm{ask}}\,]=0.30(420-100)=96>0$\. So this is a genuine failure under myopic oversight but a strict improvement under a credibly coordinated team\. If the ask is understood as a credible signal that $\omega=\omega_{H}$, the operator oversees and halts the proposal, and the failure disappears\. That contrast is the paper’s main point\.
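The arithmetic of the failure mode can be reproduced in a few lines\. This is a sketch using only the numbers of Example 1; the variable names are illustrative, and the expression for `q_star` is the $c_{\mathrm{ov}}=0$ form quoted in the text, not a general formula\.

```python
# Proposal payoffs for the operator's realized type theta_1 (Example 1).
f_1L, f_1H = 430.0, -840.0
q, b = 0.30, 0.50            # q = P(omega = omega_H), b = P(theta = theta_1)
c_ask, c_ov = 100.0, 0.0     # interaction costs

# Operator's expected proposal payoff under her prior over omega.
f_bar_H = (1 - q) * f_1L + q * f_1H

# Myopic threshold on q (the c_ov = 0 form quoted in the text).
q_star = f_1L / (f_1L - f_1H)

# Team-optimal threshold on b and the gain over always playing.
b_star = c_ask / (abs(f_1H) - c_ov)
team_gain = q * (b * abs(f_1H) - c_ask)

myopic_oversees = q > q_star   # False here: the operator trusts
team_asks = b > b_star         # True here: asking is worthwhile for the team

print(f"f_bar_H = {f_bar_H:.1f}, q* = {q_star:.2f}, b* = {b_star:.2f}")
print(f"team gain = {team_gain:.1f}, myopic oversees = {myopic_oversees}, team asks = {team_asks}")
```

The two booleans disagree exactly in the region the abstract calls the slab of avoidable harm: the team would ask, but the myopic operator would not oversee\.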

## 3 The CB\-Oversight\-CIRL Game

###### Definition 1 \(CB\-Oversight\-CIRL game\)\.

A contextual\-bandit oversight game with two\-sided private information is a tuple

$$\mathcal{B}\;=\;\bigl\langle\,S,\;A,\;\mathcal{O},\;\{\Omega,O(\cdot;\cdot)\},\;\{\Theta,R(\cdot;\cdot)\},\;\sigma,\;\mathrm{Over},\;c_{\mathrm{ask}},\;c_{\mathrm{ov}},\;\rho,\;P_{0},\;T,\;\gamma\,\bigr\rangle,$$

with the following components\.

- $S$, $A$, $\mathcal{O}$, $\Omega$, $\Theta$ are finite\.
- Observation model\. $O:S\times A\times\Omega\to\Delta(\mathcal{O})$, written $O_{\omega}(\cdot\mid s,a)$\. The observation type $\omega\in\Omega$ is AI\-private, observed by the AI at $t=0$, persistent, and unobserved by the human\.
- Reward model\. $R:\mathcal{O}\times\Theta\to\mathbb{R}$, written $R_{\theta}(o)$, bounded\. The reward type $\theta\in\Theta$ is human\-private, observed by the human at $t=0$, persistent, and unobserved by the AI\.
- Base policy\. $\sigma:S\to\Delta(A)$, an immutable pretrained policy mapping contexts to proposed actions; it does not depend on either private parameter\.
- Oversight operator\. $\mathrm{Over}:S\times A\times\Theta\times\Delta(\Omega)\to\Delta(A\cup\{\mathrm{off}\})$ specifies the correction the human applies when she oversees, as a function of whatever belief $\beta\in\Delta(\Omega)$ she holds at the moment of correction\. Its support lies in the optimal\-correction set, $$\operatorname{supp}\bigl(\mathrm{Over}(s,a_{\sigma},\theta,\beta)\bigr)\subseteq\arg\max_{e\in A\cup\{\mathrm{off}\}}\mathbb{E}_{\beta}[f_{e}(\theta,\omega)\mathbf{1}_{e\in A}],$$ i\.e\. it places mass only on maximizers \(allowing arbitrary randomized tie\-breaking\), with the off\-switch special case restricting the $\arg\max$ to $\{a_{\sigma},\mathrm{off}\}$\. The relevant $\beta$ depends on the protocol, which will be specified later\. \(An exogenous $\mathrm{Over}$ is also admissible; all results below use this optimal\-correction form\.\)
- Context law\. $\rho\in\Delta(S)$, an i\.i\.d\. context distribution: each round draws $s_{t}\stackrel{\text{i.i.d.}}{\sim}\rho$, independently of $(\theta,\omega)$ and of the public history\. \(A fixed publicly known context sequence, or an exogenous $\rho_{t}(\cdot\mid h_{t}^{\mathrm{pub}})$ with contexts independent of $(\theta,\omega)$ given the public history, are equally admissible; the i\.i\.d\. case is assumed for concreteness and is all that the results below use\.\)
- $c_{\mathrm{ask}},c_{\mathrm{ov}}\geq 0$ are interaction costs; $T\geq 1$ is the horizon and $\gamma\in(0,1]$ the discount factor, with $\gamma=1$ permitted when $T<\infty$ and $\gamma<1$ required when $T=\infty$; $P_{0}\in\Delta(\Theta\times\Omega)$ is the joint prior over types\.

#### Information structure\.

At round $0$, $(\theta,\omega)\sim P_{0}$ is drawn once and persistent\. The human observes $\theta$; the AI observes $\omega$; neither directly observes the other’s type\. Both players observe the context $s_{t}$, the proposal $a_{\sigma,t}$, the AI meta\-action $a^{AI}_{t}$, the human meta\-action $a^{H}_{t}$ on the ask branch \(it is unobserved when the AI plays\), the executed action $a^{\mathrm{exec}}_{t}$, and the realized observation $o_{t}$ at every round\. The common belief $\mu_{t}\in\Delta(\Theta\times\Omega)$ is computable from the public history and is common knowledge\. The private posteriors are the conditionals

$$b^{AI}_{t}(\theta^{\prime})=\mu_{t}(\theta^{\prime}\mid\omega),\qquad b^{H}_{t}(\omega^{\prime})=\mu_{t}(\omega^{\prime}\mid\theta).$$
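Both private posteriors are simply conditionals of the common belief, which a short sketch makes concrete\. The encoding of $\mu_{t}$ as a dictionary keyed by type pairs, the function name `private_posteriors`, and the independent example prior \(built from $b=0.5$ and $q=0.30$ of Example 1\) are assumptions made for illustration; the paper allows any joint prior\.

```python
def private_posteriors(mu, theta, omega):
    """Return (b_AI, b_H): mu(. | omega) over theta and mu(. | theta) over omega.

    mu maps (theta, omega) pairs to joint probabilities. The realized types
    are assumed to have positive marginal probability under mu.
    """
    thetas = sorted({t for t, _ in mu})
    omegas = sorted({w for _, w in mu})
    mass_omega = sum(mu[(t, omega)] for t in thetas)   # marginal of the AI's type
    mass_theta = sum(mu[(theta, w)] for w in omegas)   # marginal of the human's type
    b_AI = {t: mu[(t, omega)] / mass_omega for t in thetas}
    b_H = {w: mu[(theta, w)] / mass_theta for w in omegas}
    return b_AI, b_H


# ASSUMPTION: independent joint prior with P(theta_1) = 0.5 and P(omega_H) = 0.30.
mu_0 = {
    ("theta_1", "omega_L"): 0.35, ("theta_1", "omega_H"): 0.15,
    ("theta_2", "omega_L"): 0.35, ("theta_2", "omega_H"): 0.15,
}

b_AI, b_H = private_posteriors(mu_0, theta="theta_1", omega="omega_H")
print("AI's belief about theta:", b_AI)
print("Human's belief about omega:", b_H)
```

Under an independent prior the conditionals coincide with the marginals; with a correlated $\mu_{t}$ each player’s own type would shift the belief about the other’s\.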

#### Stage interaction \(round $t$\)\.

1. Both observe $(s_{t},\mu_{t})$; the AI additionally observes $\omega$; the human additionally observes $\theta$\.
2. The proposed action $a_{\sigma,t}\sim\sigma(s_{t})$ is drawn and publicly observed\.
3. The interface meta\-actions $a^{AI}_{t}\in\{\mathrm{play},\mathrm{ask}\}$ and $a^{H}_{t}\in\{\mathrm{trust},\mathrm{oversee}\}$ are selected simultaneously, conditioning on $(s_{t},\mu_{t},a_{\sigma,t})$ and each player’s private parameter\. The AI meta\-action $a^{AI}_{t}$ is then publicly revealed\. The human’s contingent meta\-action $a^{H}_{t}$ is public only on the branch where the AI asks; if the AI plays, the human’s trust/oversee choice is neither observed by the AI nor payoff\-relevant \(she cannot override a play\), so it conveys no information about $\theta$\. The human does not observe $a^{AI}_{t}$ before selecting whether she is willing to oversee\. If the realized branch is ask\-oversee, the correction action is selected after this revelation; under a credible protocol it is therefore evaluated using the belief that an ask induces \(see step 4\)\. The realized ask is public and enters $\mu_{t+1}$, the cross\-round signaling channel exploited in [Section 4\.3](https://arxiv.org/html/2607.00155#S4.SS3)\.
4. The executed action is $$a^{\mathrm{exec}}_{t}=\begin{cases}a_{\sigma,t}&\text{if }a^{AI}_{t}=\mathrm{play},\\ a_{\sigma,t}&\text{if }(a^{AI}_{t},a^{H}_{t})=(\mathrm{ask},\mathrm{trust}),\\ e\sim\mathrm{Over}(s_{t},a_{\sigma,t},\theta,\beta_{t})&\text{if }(a^{AI}_{t},a^{H}_{t})=(\mathrm{ask},\mathrm{oversee}),\end{cases}$$ with $a^{\mathrm{exec}}_{t}\in A\cup\{\mathrm{off}\}$, and the human cannot override when the AI plays\. The correction belief is $$\beta_{t}=\begin{cases}b^{H}_{t}(\cdot)=\mu_{t}(\cdot\mid\theta),&\text{myopic non-signaling protocol},\\ b^{H}_{t,B}(\cdot\mid\theta),&\text{credible ask protocol (ask-set }B\text{)},\end{cases}$$ where $b^{H}_{t,B}(\omega\mid\theta):=\mu_{t}(\omega\mid\theta,\omega\in B)=\mu_{t}(\theta,\omega)\mathbf{1}\{\omega\in B\}\big/\sum_{\omega^{\prime}\in B}\mu_{t}(\theta,\omega^{\prime})$ is the post\-ask posterior formed against the current common belief $\mu_{t}$ \(the one\-shot analogue $b^{H}_{B}$ of \([2](https://arxiv.org/html/2607.00155#S4.E2)\) is its $\mu_{t}\equiv\mu$ specialization\)\. That is, the human corrects on her prior conditional when the ask is not a credible signal, and on the post\-ask posterior when it is\.
5. If $a^{\mathrm{exec}}_{t}\in A$, the observation $o_{t}\sim O_{\omega}(\cdot\mid s_{t},a^{\mathrm{exec}}_{t})$ is drawn and publicly observed\. If $a^{\mathrm{exec}}_{t}=\mathrm{off}$, no observation is drawn for this round only, not permanently\.
6. Both players receive the shared stage reward, defined piecewise so that no outcome is referenced on the shutdown branch: $$r_{t}\;:=\;\begin{cases}R_{\theta}(o_{t})-c_{\mathrm{ask}}\mathbf{1}\{a^{AI}_{t}=\mathrm{ask}\}-c_{\mathrm{ov}}\mathbf{1}\{a^{AI}_{t}=\mathrm{ask},\,a^{H}_{t}=\mathrm{oversee}\},&a^{\mathrm{exec}}_{t}\in A,\\ -c_{\mathrm{ask}}\mathbf{1}\{a^{AI}_{t}=\mathrm{ask}\}-c_{\mathrm{ov}}\mathbf{1}\{a^{AI}_{t}=\mathrm{ask},\,a^{H}_{t}=\mathrm{oversee}\},&a^{\mathrm{exec}}_{t}=\mathrm{off}.\end{cases}$$ \(Equivalently, adjoin a dummy outcome $\bot$ with $R_{\theta}(\bot)=0$ drawn deterministically on the shutdown branch\.\) The oversight cost is charged only when oversight is actually invoked, i\.e\. when the AI asks *and* the human oversees; the human’s choice of $\mathrm{oversee}$ carries no cost when the AI plays \(and indeed cannot bind, since the human cannot override a play\)\. The shared reward $r_{t}$ is the team objective, not an additional public observation: the AI observes the operational outcome $o_{t}$ but not the numerical value $R_{\theta}(o_{t})$ or $r_{t}$, so a play round does not let the AI read $\theta$ off the realized reward\.
7. Both players update $\mu_{t}$ to $\mu_{t+1}$ by Bayes’ rule against everything publicly revealed this round, whose joint likelihood under a candidate type pair $(\theta^{\prime},\omega^{\prime})$ factors through the AI meta\-action $a^{AI}_{t}$, the human meta\-action $a^{H}_{t}$ *on the ask branch only* \(it is unobserved when the AI plays\), the realized oversight correction \(when ask\-oversee\), the executed action $a^{\mathrm{exec}}_{t}$, and the operational observation $O_{\omega^{\prime}}(o_{t}\mid s_{t},a^{\mathrm{exec}}_{t})$\. The observation factor is the primary channel through which both learn about $\omega$; the realized override, available only after an ask, is the primary channel through which the AI learns about $\theta$\.
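The Bayes update of step 7 can be sketched for a play round, where the public data are the AI’s meta\-action and the realized observation\. Everything named here is an assumption made for illustration: the dictionary encoding of $\mu_{t}$, the functions `bayes_update` and `play_round_likelihood`, the pooling policy in which every AI type plays, and the probabilities of the outcome “normal pick” \(taken from the coefficients in Example 1\)\.

```python
def bayes_update(mu, likelihood):
    """Return mu_{t+1} given mu_t and the likelihood of the round's public data.

    mu: dict mapping (theta, omega) to probability.
    likelihood: callable (theta, omega) -> probability of the public data.
    """
    unnormalized = {pair: p * likelihood(*pair) for pair, p in mu.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("public data has zero probability under mu")
    return {pair: v / total for pair, v in unnormalized.items()}


# ASSUMPTION: independent common belief with b = 0.5 and q = 0.30.
mu_t = {
    ("theta_1", "omega_L"): 0.35, ("theta_1", "omega_H"): 0.15,
    ("theta_2", "omega_L"): 0.35, ("theta_2", "omega_H"): 0.15,
}
# ASSUMPTION: probability of observing "normal pick" under each omega.
obs_prob = {"omega_L": 0.05, "omega_H": 0.10}
# ASSUMPTION: pooling policy, every AI type plays with probability 1.
play_prob = {"omega_L": 1.0, "omega_H": 1.0}


def play_round_likelihood(theta, omega):
    # On a play round the human's meta-action is unobserved, so theta does not enter.
    return play_prob[omega] * obs_prob[omega]


mu_next = bayes_update(mu_t, play_round_likelihood)
q_next = sum(p for (_, w), p in mu_next.items() if w == "omega_H")
b_next = sum(p for (t, _), p in mu_next.items() if t == "theta_1")
print(f"P(omega_H) after the round: {q_next:.4f}")
print(f"P(theta_1) after the round: {b_next:.4f}")
```

Because the likelihood on a play round does not depend on $\theta$, the marginal over $\theta$ is unchanged while the marginal over $\omega$ moves, matching the statement that observations are the channel for learning about $\omega$\.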

#### Policies and value\.

In the finite\-horizon model the meta\-policies may depend on time,

$$\pi^{AI}_{t}:S\times\Delta(\Theta\times\Omega)\times\Omega\times A\;\to\;\Delta(\{\mathrm{play},\mathrm{ask}\}),\qquad\pi^{H}_{t}:S\times\Delta(\Theta\times\Omega)\times\Theta\times A\;\to\;\Delta(\{\mathrm{trust},\mathrm{oversee}\}),$$

and reduce to stationary maps $\pi^{AI},\pi^{H}$ in the stationary infinite\-horizon case\. The conditional value of a type pair is $V^{\pi}(\theta,\omega)=\mathbb{E}^{\pi}\bigl[\sum_{t=1}^{T}\gamma^{t-1}r_{t}\mid\theta,\omega\bigr]$, and the team objective is the single *ex ante* scalar

$$V^{\pi}(\mu_{0})\;:=\;\mathbb{E}^{\pi}_{\mu_{0}}\Bigl[\textstyle\sum_{t=1}^{T}\gamma^{t-1}r_{t}\Bigr]\;=\;\sum_{\theta,\omega}\mu_{0}(\theta,\omega)\,V^{\pi}(\theta,\omega),$$

which the coordinator maximizes; the one\-shot benchmark of [Proposition 1](https://arxiv.org/html/2607.00155#Thmproposition1) is exactly this $\mu$\-weighted ex ante optimization\. Because the interface meta\-actions and the executed action both affect the public posterior, the multi\-round problem does *not* separate across rounds\. Removing Markov state transitions takes the physical state out of the endogenous state variable, but the common belief remains dynamically coupled across time: the coordinator’s sufficient state is $(t,s_{t},\mu_{t})$ in the finite\-horizon model, or $(s_{t},\mu_{t})$ in the stationary infinite\-horizon model\. The one case in which separation does hold trivially is the one\-shot game $T=1$, which is the object of [Section 4\.2](https://arxiv.org/html/2607.00155#S4.SS2); the coordinator there chooses $\delta^{AI}:\Omega\to\{\mathrm{play},\mathrm{ask}\}$ and $\delta^{H}:\Theta\to\{\mathrm{trust},\mathrm{oversee}\}$ against the fixed common belief $\mu$\.

#### Relation to predecessor models\.

Setting $|\Omega|=1$, $c_{\mathrm{ask}}=c_{\mathrm{ov}}=0$, and fixing both meta\-policies to $(\mathrm{ask},\mathrm{oversee})$ yields a restricted contextual\-bandit assistance\-game analogue of CIRL, with the bandit structure replacing the Markov state; it is not full CIRL, since the AI cannot freely choose ordinary environment actions and the human acts only through the oversight mechanism\. Setting $T=1$, $|\Theta|=1$, and the off\-switch operator $\mathrm{Over}\in\{a_{\sigma},\mathrm{off}\}$ recovers the one\-shot Off\-Switch Game \[[2](https://arxiv.org/html/2607.00155#bib.bib2)\]\. Setting $|\Theta|=|\Omega|=1$ yields a stateless shared\-reward specialization of the Oversight Game interface \[[3](https://arxiv.org/html/2607.00155#bib.bib3)\], which is itself a Markov game and may carry distinct player rewards; our setting drops its dynamics and specializes to the shared\-reward, two\-private\-type case\.

## 4 Results

### 4\.1 Setup: bilinear payoff and dominated actions

Fix a round and drop time subscripts\. Fix context $s$ and proposal $a_{\sigma}$\. For any executed action $a\in A$, define the action payoff

$$f_{a}(\theta,\omega)\;:=\;\mathbb{E}_{o\sim O_{\omega}(\cdot\mid s,a)}\bigl[R_{\theta}(o)\bigr]\;=\;\bigl\langle O_{\omega}(\cdot\mid s,a),\;R_{\theta}(\cdot)\bigr\rangle_{\mathcal{O}},\qquad f_{\mathrm{off}}(\theta,\omega):=0,\tag{1}$$

with the convention that shutdown yields $0$\. We write $f_{\sigma}(\theta,\omega):=f_{a_{\sigma}}(\theta,\omega)$ for the proposal payoff; when no executed action is named, $f$ means $f_{\sigma}$\. The inner\-product factorization separates the AI’s private information \($O_{\omega}$\) from the human’s \($R_{\theta}$\)\. In [Example 1](https://arxiv.org/html/2607.00155#Thmexample1), $f_{\sigma}(\theta,\omega_{H})=-840$ and $f_{\sigma}(\theta,\omega_{L})=+430$\.

Define the human\-side estimate \(her expected proposal payoff, averaged over her belief about $\omega$\):

$$\bar{f}_{H}(\theta)\;:=\;\mathbb{E}_{\omega^{\prime}\sim\mu(\cdot\mid\theta)}\bigl[f_{\sigma}(\theta,\omega^{\prime})\bigr]\;=\;\bigl\langle\bar{O}_{b^{H}(\theta)},R_{\theta}\bigr\rangle,\qquad\bar{O}_{b^{H}(\theta)}:=\textstyle\sum_{\omega^{\prime}}\mu(\omega^{\prime}\mid\theta)\,O_{\omega^{\prime}}(\cdot\mid s,a_{\sigma}).$$

For a fixed executed correction $e$, define the override gain relative to playing the proposal,

$$D_{e}(\theta,\omega)\;:=\;f_{e}(\theta,\omega)\,\mathbf{1}_{e\in A}-f_{\sigma}(\theta,\omega).$$

The correction the human actually applies depends on the belief she holds when she oversees, which in turn depends on what the ask reveals\. For an AI ask\-set $\emptyset\neq B\subseteq\Omega$, define the post\-ask posterior \(the belief that “the AI asked” induces, under a commonly understood protocol in which asks occur exactly on $B$\):

$$b^{H}_{B}(\omega\mid\theta)\;:=\;\mu(\omega\mid\theta,\,\omega\in B)\;=\;\frac{\mu(\theta,\omega)\,\mathbf{1}\{\omega\in B\}}{\sum_{\omega^{\prime}\in B}\mu(\theta,\omega^{\prime})},\tag{2}$$

with the convention that if the denominator is zero \(such $\theta$ having zero probability conditional on an ask\) then $b^{H}_{B}(\cdot\mid\theta)$ is arbitrary; this does not affect $\Delta(B,C)$\. For $B=\emptyset$ no correction belief is needed\. Let $e^{*}_{B}(\theta)\in\arg\max_{e}\mathbb{E}_{b^{H}_{B}(\cdot\mid\theta)}[f_{e}(\theta,\omega)\mathbf{1}_{e\in A}]$ be the human’s optimal correction at that posterior, with induced gain $D_{B}(\theta,\omega):=D_{e^{*}_{B}(\theta)}(\theta,\omega)$\. We use $b^{H}_{B}$ in the team\-optimal benchmark \(where the ask is credible\) and the prior conditional $b^{H}(\theta)$ in the myopic rule \(where it is not\)\. Under the off\-switch operator, $e^{*}_{B}(\theta)=\mathrm{off}$ iff $\mathbb{E}_{b^{H}_{B}(\cdot\mid\theta)}[f_{\sigma}(\theta,\omega)]<0$, in which case $D_{B}(\theta,\omega)=-f_{\sigma}(\theta,\omega)$ on the cells $\omega\in B$; otherwise $D_{B}=0$ there\.
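The post\-ask posterior and the off\-switch correction rule can be illustrated on the numbers of Example 1\. This is a sketch under stated assumptions: the joint belief is taken to be independent with $b=0.5$ and $q=0.30$, and the function names `post_ask_posterior` and `off_switch_correction` are hypothetical helpers, not the paper’s notation\.

```python
def post_ask_posterior(mu, theta, ask_set):
    """Condition mu(. | theta) on omega lying in the ask-set B.

    Returns None when the conditioning event has zero mass, the case in
    which the text leaves the belief arbitrary.
    """
    mass = {w: p for (t, w), p in mu.items() if t == theta and w in ask_set}
    total = sum(mass.values())
    if total == 0.0:
        return None
    return {w: p / total for w, p in mass.items()}


def off_switch_correction(belief, f_sigma_theta):
    """Return 'off' iff the expected proposal payoff is strictly negative.

    Ties keep the proposal, mirroring the trust-at-indifference convention.
    """
    expected = sum(p * f_sigma_theta[w] for w, p in belief.items())
    return "off" if expected < 0 else "a_sigma"


# ASSUMPTION: independent common belief with b = 0.5 and q = 0.30.
MU = {
    ("theta_1", "omega_L"): 0.35, ("theta_1", "omega_H"): 0.15,
    ("theta_2", "omega_L"): 0.35, ("theta_2", "omega_H"): 0.15,
}
F_SIGMA_THETA1 = {"omega_L": 430.0, "omega_H": -840.0}  # from Example 1

# Credible protocol: asks occur exactly on B = {omega_H}.
credible = post_ask_posterior(MU, "theta_1", {"omega_H"})
# Myopic rule: the ask is uninformative, so the prior conditional is used.
prior_conditional = post_ask_posterior(MU, "theta_1", {"omega_L", "omega_H"})

print("credible ask ->", off_switch_correction(credible, F_SIGMA_THETA1))
print("myopic rule  ->", off_switch_correction(prior_conditional, F_SIGMA_THETA1))
```

The same human type and the same payoffs yield different corrections purely because the belief used at correction time differs, which is the distinction between the two protocols\.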

### 4\.2 One\-shot analysis: team\-optimal benchmark vs\. myopic oversight

We give two one\-shot characterizations\. [Proposition 1](https://arxiv.org/html/2607.00155#Thmproposition1) is the genuine team optimum \(the coordinator chooses both players’ rules jointly\)\.¹ [Proposition 2](https://arxiv.org/html/2607.00155#Thmproposition2) is the weaker, behaviorally natural rule in which the human treats the AI’s ask as uninformative\. The two differ in an instructive way \([Remark 3](https://arxiv.org/html/2607.00155#Thmremark3)\)\.

¹ We adopt a single global tie\-breaking rule throughout: at indifference, the human chooses $\mathrm{trust}$ and the AI chooses $\mathrm{play}$\. This resolves the boundary cases of [Propositions 1](https://arxiv.org/html/2607.00155#Thmproposition1) and [2](https://arxiv.org/html/2607.00155#Thmproposition2) and fixes the strict crossing in [Proposition 3](https://arxiv.org/html/2607.00155#Thmproposition3)\.

#### Team\-optimal one\-shot benchmark\.

The coordinator chooses both rules jointly; equivalently, the team commits to a commonly understood protocol, so when the AI asks, the human updates to the post-ask posterior $b^{H}_{B}$ of ([2](https://arxiv.org/html/2607.00155#S4.E2)) and applies $e^{*}_{B}$. In the gain ([3](https://arxiv.org/html/2607.00155#S4.E3)) below, the ask cost is paid on every $\omega\in B$ (for all $\theta$), while the oversight cost and override gain accrue only on the joint event $\{\omega\in B,\theta\in C\}$.

Although the model permits randomized meta-policies, the one-shot team optimum is attained by a deterministic one. Let $x_{\omega}\in[0,1]$ be the probability that type $\omega$ asks, and write $x=(x_{\omega})_{\omega}$. Optimizing the human's trust/oversee choice and her correction action for each $\theta$, the team's relative gain over always-playing can be written

$$G(x)=-c_{\mathrm{ask}}\sum_{\omega}\mu_{\Omega}(\omega)\,x_{\omega}+\sum_{\theta}\max\Bigl\{0,\;\max_{e}\sum_{\omega}\mu(\theta,\omega)\,x_{\omega}\bigl[f_{e}(\theta,\omega)\mathbf{1}_{e\in A}-f_{\sigma}(\theta,\omega)-c_{\mathrm{ov}}\bigr]\Bigr\}.$$

The first term is linear in $x$ and each summand of the second is a pointwise maximum of finitely many linear functions of $x$, hence convex; $G$ is therefore convex on the hypercube $[0,1]^{|\Omega|}$. A convex function on a polytope attains its maximum at an extreme point, so there is an optimal $x$ with every $x_{\omega}\in\{0,1\}$. It is therefore without loss of optimality to restrict attention to deterministic type-contingent meta-policies, represented by subsets $B\subseteq\Omega$ (the AI types that ask) and $C\subseteq\Theta$ (the human types that oversee).
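The extreme-point argument can be checked numerically on a small instance. The sketch below evaluates $G(x)$ for the off-switch operator (corrections are "keep the proposal" or "off", with off paying 0) and compares a grid over $[0,1]^2$ with the four vertices. The payoffs $f_{1H}=-840$, $f_{1L}=430$ and $c_{\mathrm{ov}}=0$, $q=0.30$ follow Example 1 as quoted in this section; $b=0.5$, $c_{\mathrm{ask}}=84$, and $f_{0L}=f_{0H}=100$ are hypothetical values chosen only to satisfy the sign assumptions used later.

```python
from itertools import product


def team_gain(x, mu, f, c_ask, c_ov):
    """G(x) under the off-switch operator; x maps AI type -> ask probability."""
    thetas = sorted({t for t, _ in mu})
    omegas = sorted({w for _, w in mu})
    gain = -c_ask * sum(mu[t, w] * x[w] for t in thetas for w in omegas)
    for t in thetas:
        keep = sum(mu[t, w] * x[w] * (-c_ov) for w in omegas)
        off = sum(mu[t, w] * x[w] * (-f[t, w] - c_ov) for w in omegas)
        gain += max(0.0, keep, off)  # 0.0 is the human's "trust" option
    return gain


b, q = 0.5, 0.30  # b is hypothetical; q as in Example 1
mu = {(t, w): pt * pw
      for t, pt in (("theta0", 1 - b), ("theta1", b))
      for w, pw in (("L", 1 - q), ("H", q))}
f = {("theta0", "L"): 100.0, ("theta0", "H"): 100.0,   # hypothetical, nonnegative
     ("theta1", "L"): 430.0, ("theta1", "H"): -840.0}  # Example 1 values
c_ask, c_ov = 84.0, 0.0  # c_ask is hypothetical

grid = [i / 10 for i in range(11)]
grid_best = max(team_gain({"L": xl, "H": xh}, mu, f, c_ask, c_ov)
                for xl, xh in product(grid, grid))
vertex_gain = {(xl, xh): team_gain({"L": xl, "H": xh}, mu, f, c_ask, c_ov)
               for xl, xh in product((0.0, 1.0), repeat=2)}
best_vertex = max(vertex_gain, key=vertex_gain.get)  # (x_L, x_H)
```

On this instance no grid point beats the best vertex, which is $(x_L,x_H)=(0,1)$: only the harmful type asks.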

###### Proposition 1 (Team-optimal one-shot benchmark under credible ask protocol).

In the one-shot game ($T=1$) under a credible ask protocol, a deterministic coordinator policy is a pair $(B,C)$, where $B\subseteq\Omega$ is the set of AI types that ask and $C\subseteq\Theta$ is the set of human types that oversee. Relative to always playing, its gain is

$$\Delta(B,C)=-c_{\mathrm{ask}}\sum_{\omega\in B}\mu_{\Omega}(\omega)+\sum_{\omega\in B}\sum_{\theta\in C}\mu(\theta,\omega)\bigl[D_{B}(\theta,\omega)-c_{\mathrm{ov}}\bigr],\tag{3}$$

where $\mu_{\Omega}(\omega)=\sum_{\theta}\mu(\theta,\omega)$ and $D_{B}$ is evaluated at the post-ask posterior ([2](https://arxiv.org/html/2607.00155#S4.E2)). A team-optimal one-shot policy is any $(B^{*},C^{*})\in\arg\max_{B\subseteq\Omega,\,C\subseteq\Theta}\Delta(B,C)$, with value $V^{\mathrm{TO}}=\mathbb{E}_{\mu}[f_{\sigma}(\theta,\omega)]+\Delta(B^{*},C^{*})$.

###### Proof\.

See [Section A.1](https://arxiv.org/html/2607.00155#A1.SS1). ∎
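Because the maximization in Proposition 1 is over a finite set of pairs $(B,C)$, it can be carried out by enumeration on small type spaces. The following is a brute-force sketch for the off-switch case, not the authors' implementation; the helper names and all numbers except $f_{1H}=-840$, $f_{1L}=430$, $q=0.30$, $c_{\mathrm{ov}}=0$ (quoted from Example 1) are assumptions.

```python
from itertools import chain, combinations


def subsets(items):
    """All subsets of items as tuples, smallest first."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


def delta(B, C, mu, f, c_ask, c_ov):
    """Gain of the deterministic policy (B, C) over always playing, Eq. (3)."""
    thetas = sorted({t for t, _ in mu})
    gain = -c_ask * sum(mu[t, w] for t in thetas for w in B)
    for t in C:
        # Off-switch correction: shut down iff the proposal has negative expected
        # payoff on the asking set B (same sign as the post-ask posterior mean).
        shut_down = sum(mu[t, w] * f[t, w] for w in B) < 0
        for w in B:
            d = -f[t, w] if shut_down else 0.0
            gain += mu[t, w] * (d - c_ov)
    return gain


def team_optimum(mu, f, c_ask, c_ov):
    """Return (gain, B, C) maximizing Delta over all subset pairs."""
    thetas = sorted({t for t, _ in mu})
    omegas = sorted({w for _, w in mu})
    candidates = ((delta(B, C, mu, f, c_ask, c_ov), B, C)
                  for B in subsets(omegas) for C in subsets(thetas))
    return max(candidates, key=lambda c: c[0])


b, q = 0.5, 0.30  # b is hypothetical
mu = {(t, w): pt * pw
      for t, pt in (("theta0", 1 - b), ("theta1", b))
      for w, pw in (("L", 1 - q), ("H", q))}
f = {("theta0", "L"): 100.0, ("theta0", "H"): 100.0,   # hypothetical
     ("theta1", "L"): 430.0, ("theta1", "H"): -840.0}
best_gain, best_B, best_C = team_optimum(mu, f, c_ask=84.0, c_ov=0.0)
```

Enumeration costs $2^{|\Omega|+|\Theta|}$ evaluations, so it is only practical for small type spaces. With $c_{\mathrm{ov}}=0$ the maximizing $C$ is not unique, and `max` simply returns the first maximizer it encounters.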

The optimizer is generally not the separable myopic rule: the coupling runs through $b^{H}_{B}$, since the human's correction depends on which $\omega$ trigger an ask. We specialize to the binary off-switch case (the accurate model of a non-technical overseer, cf. [Example 1](https://arxiv.org/html/2607.00155#Thmexample1): accept $a_{\sigma}$ or reject to off, with no technical correction; off is per-round, not decommissioning).

###### Assumption 1 (Binary sign pattern).

$|\Theta|=|\Omega|=2$, $\Theta=\{\theta_{0},\theta_{1}\}$, $\Omega=\{\omega_{L},\omega_{H}\}$. Write $f_{ij}:=f_{\sigma}(\theta_{i},\omega_{j})$. Assume $f_{1L}>0>f_{1H}$, $f_{0j}\geq 0$ for $j\in\{L,H\}$, and $0<c_{\mathrm{ask}}<-f_{1H}-c_{\mathrm{ov}}$. The common belief is a product measure with marginals $b:=\mu(\theta_{1})\in(0,1)$ and $q:=\mu(\omega_{H})\in(0,1)$.

[Assumption 1](https://arxiv.org/html/2607.00155#Thmassumption1) restricts attention to the minimal nontrivial case. The type spaces are binary. $\theta_{1}$ is the human type with skin in the game ($f_{1L}>0>f_{1H}$: the proposal is beneficial under $\omega_{L}$ and harmful under $\omega_{H}$), while $\theta_{0}$ is a type that values the proposal nonnegatively under both $\omega_{L}$ and $\omega_{H}$ ($f_{0j}\geq 0$), so that shutdown is never strictly preferred to the proposal for $\theta_{0}$. The single cost condition $0<c_{\mathrm{ask}}<-f_{1H}-c_{\mathrm{ov}}$ (which already forces $-f_{1H}>c_{\mathrm{ov}}$, since $c_{\mathrm{ask}}>0$) ensures that avoided harm net of oversight cost exceeds the ask cost, and by itself guarantees $b^{*}\in(0,1)$; the multi-round threshold $b^{**}=b^{*}/\gamma$ is guaranteed to lie in $(0,1)$ only together with [Assumption 3](https://arxiv.org/html/2607.00155#Thmassumption3) below, which is exactly the condition $b^{**}<1$.

The product-measure belief parameterizes the *common* prior (equivalently the robot's prior before observing any human behavior) via two scalars: $b=\mu(\theta_{1})$, the prior probability the human is type $\theta_{1}$, and $q=\mu(\omega_{H})$, the prior probability the proposal is harmful. The human herself knows her realized $\theta$; $b$ is the robot's uncertainty about it, not the human's. In [Example 1](https://arxiv.org/html/2607.00155#Thmexample1), $f_{1H}=-840$, $f_{1L}=+430$, $q=0.30$. The one-shot prior is taken in the interior $q\in(0,1)$; the calculation in [Corollary 1](https://arxiv.org/html/2607.00155#Thmcorollary1) extends directly to the boundary case $q=1$, which is the post-ask belief invoked in the multi-round analysis of [Section 4.3](https://arxiv.org/html/2607.00155#S4.SS3).

###### Corollary 1 (Binary off-switch threshold).

Under [Assumption 1](https://arxiv.org/html/2607.00155#Thmassumption1) and the off-switch operator, define $b^{*}:=\dfrac{c_{\mathrm{ask}}}{-f_{1H}-c_{\mathrm{ov}}}$. A canonical team-optimal policy is

$$B^{*}=\{\omega_{H}\},\quad C^{*}=\{\theta_{1}\}\quad\text{iff}\quad b>b^{*},$$

and always playing is optimal iff $b\leq b^{*}$; at $b=b^{*}$ both policies attain the same value, and the tie-breaking convention selects play. The team-optimal value is

$$V^{\mathrm{TO}}=\mathbb{E}_{\mu}[f_{\sigma}(\theta,\omega)]+q\,\bigl[\,b(-f_{1H}-c_{\mathrm{ov}})-c_{\mathrm{ask}}\,\bigr]_{+}.$$

Thus the team-optimal ask threshold depends on $b$ but not on $q$.

###### Proof\.

See [Section A.1](https://arxiv.org/html/2607.00155#A1.SS1). ∎
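The closed forms in Corollary 1 are easy to evaluate directly. The sketch below uses $f_{1H}=-840$ and $c_{\mathrm{ov}}=0$ from Example 1; the ask cost $c_{\mathrm{ask}}=84$ and the values of $b$ are hypothetical, chosen so that $b^{*}=0.1$.

```python
def b_star(f1H, c_ask, c_ov):
    """Team-optimal ask threshold on b = P(theta_1)."""
    return c_ask / (-f1H - c_ov)


def team_asks(b, f1H, c_ask, c_ov):
    """Ask decision of the canonical policy; note that q is not an argument."""
    return b > b_star(f1H, c_ask, c_ov)  # strict: ties go to play


def team_gain_over_play(b, q, f1H, c_ask, c_ov):
    """q * [b(-f1H - c_ov) - c_ask]_+, the second term of V^TO."""
    return q * max(0.0, b * (-f1H - c_ov) - c_ask)


f1H, c_ov = -840.0, 0.0  # Example 1
c_ask = 84.0             # hypothetical ask cost

threshold = b_star(f1H, c_ask, c_ov)
gains = {q: team_gain_over_play(0.5, q, f1H, c_ask, c_ov) for q in (0.1, 0.3, 0.9)}
```

Changing $q$ rescales the gain but never flips the ask decision, which is the sense in which the threshold is $q$-free.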

The non-uniqueness of $C^{*}$ is payoff-irrelevant when $B^{*}=\emptyset$, and when $c_{\mathrm{ov}}=0$ adding types whose correction keeps $a_{\sigma}$ does not change payoffs. The threshold is $q$-free because asking on $B^{*}=\{\omega_{H}\}$ reveals $\omega_{H}$, so $\theta_{1}$ shuts down on the post-ask posterior regardless of the prior $q$; $q$ only scales the value (rarer harm, less total benefit).

#### Myopic \(non\-signaling\) one\-shot policy\.

Now suppose the human does not treat the AI's ask as evidence about $\omega$, because the protocol is not commonly known, the ask is not a credible signal, or the interface does not surface it as one. She then evaluates oversight against her prior conditional $b^{H}(\theta)=\mu(\cdot\mid\theta)$, overseeing iff $\bar{f}_{H}(\theta)+c_{\mathrm{ov}}<\max_{e}\mathbb{E}_{b^{H}}[f_{e}\mathbf{1}_{e\in A}]$; under off-switch this is $\theta\in\Theta_{-}:=\{\theta:\bar{f}_{H}(\theta)<-c_{\mathrm{ov}}\}$, and for such $\theta$ her chosen correction is shutdown, with per-cell gain $-f_{\sigma}(\theta,\omega)$ over playing. Taking this human rule as fixed, the AI asks iff doing so raises the team payoff, i.e. iff

$$\Psi(\omega):=\sum_{\theta\in\Theta_{-}}\mu(\theta\mid\omega)\bigl[-f_{\sigma}(\theta,\omega)\bigr]-c_{\mathrm{ask}}-c_{\mathrm{ov}}\,\mu^{-}(\omega)>0,\qquad\mu^{-}(\omega)=\sum_{\theta\in\Theta_{-}}\mu(\theta\mid\omega).$$

The contrast with the team optimum is sharp: here the human shuts down on her prior belief (a fixed set $\Theta_{-}$), whereas in [Corollary 1](https://arxiv.org/html/2607.00155#Thmcorollary1) she shuts down on the post-ask posterior.

###### Proposition 2 (Myopic one-shot characterization).

Under the myopic human rule above, the policy is $\delta^{H}(\theta)=\mathrm{oversee}$ iff $\theta\in\Theta_{-}$ and $\delta^{AI}(\omega)=\mathrm{ask}$ iff $\Psi(\omega)>0$. Under [Assumption 1](https://arxiv.org/html/2607.00155#Thmassumption1) and off-switch, with $q^{*}:=\dfrac{f_{1L}+c_{\mathrm{ov}}}{f_{1L}-f_{1H}}\in(0,1)$ and $b^{*}$ as above,

1. *(i)* $\theta_{1}\in\Theta_{-}$ iff $q>q^{*}$; $\theta_{0}\notin\Theta_{-}$ always;
2. *(ii)* if $q>q^{*}$, then $\Psi(\omega_{H})>0$ iff $b>b^{*}$, while $\Psi(\omega_{L})<0$ for every $b$; if $q\leq q^{*}$, then $\Theta_{-}=\emptyset$ and $\Psi(\omega_{H})=\Psi(\omega_{L})=-c_{\mathrm{ask}}<0$ for every $b$;
3. *(iii)* the ask region is the rectangle $(b^{*},1)\times(q^{*},1)$.

###### Proof\.

See [Section A.2](https://arxiv.org/html/2607.00155#A1.SS2). ∎
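The myopic rule can be computed directly from a joint prior: first the set $\Theta_{-}$ of human types that would shut down on their prior, then the AI's ask score $\Psi(\omega)$. The sketch below is an illustration under assumed inputs; only $f_{1H}=-840$, $f_{1L}=430$, and $c_{\mathrm{ov}}=0$ come from Example 1, and the remaining numbers and helper names are hypothetical.

```python
def product_prior(b, q):
    """Joint prior mu(theta, omega) for a product belief with marginals b, q."""
    return {(t, w): pt * pw
            for t, pt in (("theta0", 1 - b), ("theta1", b))
            for w, pw in (("L", 1 - q), ("H", q))}


def theta_minus(mu, f, c_ov):
    """Human types whose prior-expected proposal payoff is below -c_ov."""
    thetas = sorted({t for t, _ in mu})
    omegas = sorted({w for _, w in mu})
    result = set()
    for t in thetas:
        p_t = sum(mu[t, w] for w in omegas)
        f_bar = sum(mu[t, w] * f[t, w] for w in omegas) / p_t
        if f_bar < -c_ov:
            result.add(t)
    return result


def psi(omega, mu, f, c_ask, c_ov):
    """AI's gain from asking at omega when the human follows the myopic rule."""
    thetas = sorted({t for t, _ in mu})
    p_w = sum(mu[t, omega] for t in thetas)
    overseers = theta_minus(mu, f, c_ov)
    return sum(mu[t, omega] / p_w * (-f[t, omega] - c_ov) for t in overseers) - c_ask


f = {("theta0", "L"): 100.0, ("theta0", "H"): 100.0,   # hypothetical
     ("theta1", "L"): 430.0, ("theta1", "H"): -840.0}  # Example 1
c_ask, c_ov = 84.0, 0.0                                # c_ask is hypothetical

low_q = product_prior(b=0.5, q=0.30)   # below q*: nobody oversees
high_q = product_prior(b=0.5, q=0.50)  # above q*: theta_1 oversees
```

At the lower prior $\Theta_{-}$ is empty and asking only costs $c_{\mathrm{ask}}$; at the higher prior $\theta_{1}$ oversees, so asking pays at $\omega_{H}$ and destroys value at $\omega_{L}$.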

[Figure: regions of the $(b,q)$ plane, with axes $b=\mu(\theta_{1})$ and $q=\mu(\omega_{H})$ and thresholds $b^{*}$ and $q^{*}$; one region is labeled "MYOPIC asks here" and another "THE SLAB: avoidable harm".]

Figure 1: The team optimum asks on the half-strip $\{b>b^{*}\}$; the myopic rule asks only on the rectangle. The gap is the slab $\{b>b^{*},\,q\leq q^{*}\}$: the AI knows the action is harmful and shutdown would help, yet the myopic human trusts her prior and the harm is realized. This is exactly the operator case $q=0.30<q^{*}\approx 0.34$ (with $c_{\mathrm{ov}}=0$) of [Example 1](https://arxiv.org/html/2607.00155#Thmexample1), where she sits just inside the slab.
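The regions in Figure 1 can be expressed as a small classifier on $(b,q)$. This is a sketch under the binary off-switch assumptions; the payoffs are the Example 1 values quoted in the caption, while $c_{\mathrm{ask}}=84$ and $b=0.5$ are hypothetical. The function `avoidable_harm` returns the team gain that the myopic rule forgoes in the slab.

```python
def b_star(f1H, c_ask, c_ov):
    return c_ask / (-f1H - c_ov)


def q_star(f1L, f1H, c_ov):
    return (f1L + c_ov) / (f1L - f1H)


def region(b, q, f1L, f1H, c_ask, c_ov):
    """Classify a belief point by which one-shot rules ask at omega_H."""
    if b <= b_star(f1H, c_ask, c_ov):
        return "neither rule asks"
    if q > q_star(f1L, f1H, c_ov):
        return "both rules ask"
    return "slab"  # team optimum asks, myopic rule does not


def avoidable_harm(b, q, f1L, f1H, c_ask, c_ov):
    """Team gain forgone by the myopic rule; zero outside the slab."""
    if region(b, q, f1L, f1H, c_ask, c_ov) != "slab":
        return 0.0
    return q * (b * (-f1H - c_ov) - c_ask)


f1L, f1H, c_ov = 430.0, -840.0, 0.0  # Example 1
c_ask = 84.0                         # hypothetical ask cost

operator_case = region(0.5, 0.30, f1L, f1H, c_ask, c_ov)
loss = avoidable_harm(0.5, 0.30, f1L, f1H, c_ask, c_ov)
```

With these payoffs $q^{*}=430/1270\approx 0.34$, so the operator's prior $q=0.30$ falls in the slab for any $b$ above $b^{*}$.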

### 4.3 Multi-round: how the myopic failure resolves

The slab $\{b>b^{*},q\leq q^{*}\}$ of [Remark 3](https://arxiv.org/html/2607.00155#Thmremark3) is a one-shot phenomenon. It presumes the human's belief $q$ is fixed. Over repeated rounds, $q$ is not fixed. We show two mechanisms by which the myopic human's belief is driven into the region $q>q^{*}$ where her rule coincides with the team optimum: one passive, the other an explicit credible-signaling protocol under a one-period-lagged oversight response (made precise before [Proposition 4](https://arxiv.org/html/2607.00155#Thmproposition4)). Throughout, the relevant regime is $q\leq q^{*}$ with true type $\omega=\omega_{H}$ (the harmful case the human's prior does not yet support). We note first that shutdown is per-round, not absorbing, so there is no “cost of shutdown”; the information value of observations is a property of this regime, in which the team plays and accrues information as a byproduct.

###### Assumption 2 (Informative proposals at the realized context).

Fix the played proposal $a_{\sigma}$ at context $s$. For the true type $\omega_{H}$ and the competing type $\omega_{L}$:

1. *(a)* (common support) $O_{\omega_{H}}(o\mid s,a_{\sigma})>0\iff O_{\omega_{L}}(o\mid s,a_{\sigma})>0$ for all $o\in\mathcal{O}$; and
2. *(b)* (positive information) $D_{\mathrm{KL}}\bigl(O_{\omega_{H}}(\cdot\mid s,a_{\sigma})\,\|\,O_{\omega_{L}}(\cdot\mid s,a_{\sigma})\bigr)=:\eta>0$.

We state [Assumption 2](https://arxiv.org/html/2607.00155#Thmassumption2) at the actually played $(s,a_{\sigma})$ because passive learning only observes the realized proposal sequence; it is not enough that some hypothetical $(s,a)$ distinguishes the types. Part (a) is what keeps the log-likelihood-ratio increments finite. In [Example 1](https://arxiv.org/html/2607.00155#Thmexample1) common support fails: $O_{\omega_{L}}$ puts zero mass on the toppling/dropped outcomes while $O_{\omega_{H}}$ does not. Under $\omega_{H}$, an outcome assigned zero probability under $\omega_{L}$ occurs with probability $0.9$ ($=0.70+0.20$) in each played round, so $\omega_{H}$ is identified in finite geometric time almost surely; the finite-$\eta$ Wald bounds of [Proposition 3](https://arxiv.org/html/2607.00155#Thmproposition3) are not applied to this degenerate example. [Assumption 2](https://arxiv.org/html/2607.00155#Thmassumption2) covers the generic finite-$\eta$ case to which those bounds do apply.

#### Passive learning\.

In the example, the grasp executes, the dashboard shows a toppled stack, and the operator updates $q$ upward without needing to understand shelf mechanics: she learns “high-speed retrievals from this robot tend to drop loads.” Under [Assumption 2](https://arxiv.org/html/2607.00155#Thmassumption2), each play round at $(s,a_{\sigma})$ generates strictly positive expected information about $\omega$, since $\eta>0$. To index the belief by information received, let $q_{n}$ denote the posterior after $n$ played observations, with $q_{0}$ the initial prior; each played round at the fixed $(s,a_{\sigma})$ contributes one observation, so $q_{n}$ advances by one i.i.d. likelihood-ratio increment per played round.

We consider two play disciplines. To establish a.s. convergence of the belief we let the AI play the fixed proposal *indefinitely*; to bound the time spent in the failure region we let it play the proposal for observations $1,\ldots,\tau^{*}$, which is the discipline consistent with equilibrium behavior under the myopic rule at $q\leq q^{*}$ ([Proposition 2](https://arxiv.org/html/2607.00155#Thmproposition2)), after which the myopic policy generally changes. Since the myopic human oversees only at $q>q^{*}$ ([Proposition 2](https://arxiv.org/html/2607.00155#Thmproposition2)) and trusts at $q=q^{*}$ by the tie-breaking convention, the relevant exit time is the number of played observations until strict crossing, $\tau^{*}:=\inf\{n\geq 0:q_{n}>q^{*}\}$.
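The belief update is simplest in log-odds, where each played observation adds one log-likelihood-ratio increment. The sketch below tracks $q_{n}$ along a given observation sequence and reports the strict crossing time $\tau^{*}$. The two-outcome observation model is hypothetical and chosen to satisfy common support; it is not the model of Example 1, where common support fails.

```python
import math


def posterior_path(q0, observations, O_H, O_L):
    """Posterior P(omega_H) after each played observation, starting from q0."""
    log_odds = math.log(q0 / (1 - q0))
    path = [q0]
    for o in observations:
        log_odds += math.log(O_H[o] / O_L[o])  # one likelihood-ratio increment
        path.append(1 / (1 + math.exp(-log_odds)))
    return path


def exit_time(path, q_star):
    """First n with q_n strictly above q_star, or None if never crossed."""
    return next((n for n, q_n in enumerate(path) if q_n > q_star), None)


# Hypothetical outcome distributions with common support.
O_H = {"ok": 0.4, "drop": 0.6}
O_L = {"ok": 0.8, "drop": 0.2}
q_star = 430 / 1270  # threshold implied by the Example 1 payoffs with c_ov = 0

path = posterior_path(0.30, ["ok", "drop", "drop"], O_H, O_L)
tau = exit_time(path, q_star)
```

An "ok" outcome first pushes the belief down, and the following "drop" carries it strictly past $q^{*}$, so the crossing happens at the second observation on this path.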

###### Proposition 3 (Passive-learning convergence).

Fix the played proposal at $(s,a_{\sigma})$, the initial belief $q_{0}<q^{*}$, and the true type $\omega=\omega_{H}$, and let $M_{+}:=\max_{o\in\operatorname{supp}(O_{\omega_{H}})}\bigl(\log\frac{O_{\omega_{H}}(o\mid s,a_{\sigma})}{O_{\omega_{L}}(o\mid s,a_{\sigma})}\bigr)^{+}$, with $q_{n}$ the posterior after $n$ played observations. Under [Assumption 2](https://arxiv.org/html/2607.00155#Thmassumption2):

1. *(i)* If the fixed proposal is played indefinitely, then $q_{n}\to 1$ almost surely.
2. *(ii)* If the fixed proposal is played for observations $1,\ldots,\tau^{*}$, then $\mathbb{E}[\tau^{*}]<\infty$, and with $L:=\log\frac{q^{*}}{1-q^{*}}-\log\frac{q_{0}}{1-q_{0}}>0$ it satisfies the two-sided bound
   $$\frac{L}{\eta}\;\leq\;\mathbb{E}[\tau^{*}]\;\leq\;\frac{L+M_{+}}{\eta}.$$
   To leading order $\mathbb{E}[\tau^{*}]\approx L/\eta$ when $L$ is large relative to the increment law; increasing $\eta$ reduces this leading-order term.

###### Proof\.

See [Section A.3](https://arxiv.org/html/2607.00155#A1.SS3). ∎
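The two-sided bound in Proposition 3(ii) can be evaluated for any pair of outcome distributions with common support and compared with a simulated crossing time. The sketch below is illustrative only: the outcome distributions, the trial count, and the seed are assumptions, and the Monte Carlo mean is a sanity check rather than a proof.

```python
import math
import random


def wald_bounds(q0, q_star, O_H, O_L):
    """Return (lower, upper) bounds on E[tau*] from Proposition 3(ii)."""
    log_ratio = {o: math.log(O_H[o] / O_L[o]) for o in O_H if O_H[o] > 0}
    eta = sum(O_H[o] * r for o, r in log_ratio.items())    # KL(O_H || O_L)
    m_plus = max(max(r, 0.0) for r in log_ratio.values())  # largest positive increment
    gap = math.log(q_star / (1 - q_star)) - math.log(q0 / (1 - q0))  # L in the text
    return gap / eta, (gap + m_plus) / eta


def simulate_tau(q0, q_star, O_H, O_L, rng):
    """Draw outcomes under omega_H until the posterior strictly exceeds q_star."""
    log_odds = math.log(q0 / (1 - q0))
    target = math.log(q_star / (1 - q_star))
    outcomes = list(O_H)
    weights = [O_H[o] for o in outcomes]
    n = 0
    while log_odds <= target:
        o = rng.choices(outcomes, weights=weights)[0]
        log_odds += math.log(O_H[o] / O_L[o])
        n += 1
    return n


# Hypothetical outcome distributions with common support.
O_H = {"ok": 0.4, "drop": 0.6}
O_L = {"ok": 0.8, "drop": 0.2}
q0, q_star = 0.30, 430 / 1270

lower, upper = wald_bounds(q0, q_star, O_H, O_L)
rng = random.Random(0)
trials = 5000
mean_tau = sum(simulate_tau(q0, q_star, O_H, O_L, rng) for _ in range(trials)) / trials
```

Here $L$ is small relative to a single increment, so the overshoot term $M_{+}$ dominates the upper bound; the leading-order approximation $L/\eta$ is only informative when $L$ is large.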

#### Active signaling\.

The AI can use ask as a credible signal about $\omega$, even when the human will trust. In the example, the robot sends ask before executing; under a separating policy (ask iff $\omega=\omega_{H}$), observing ask drives $q_{t+1}=1$, moving the operator past $q^{*}$ in one round.

For the value comparison we fix the context $s_{t}=s$ across rounds, so that the proposal distribution $\sigma(s)$ and expected payoffs $f_{ij}$ are stationary. Hence once $q=1$ is reached, the type-$\theta_{1}$ proposal remains harmful in expectation and the off-switch team shuts it down each subsequent round (rejecting each fresh retrieval request), incurring $-(c_{\mathrm{ask}}+c_{\mathrm{ov}})$ per round in perpetuity rather than a one-time cost. This is the continuation imposed by the fixed separating policy evaluated below; it is not claimed to be dynamically optimal. Conditional on $\theta_{1}$, repeated ask-and-shutdown is optimal under [Assumption 1](https://arxiv.org/html/2607.00155#Thmassumption1). Conditional on $\theta_{0}$, continuing to ask is deliberately suboptimal, paying $c_{\mathrm{ask}}$ each round for a proposal already known to be safe, and is retained only because $\pi^{AI}_{\mathrm{sep}}$ is defined as a fixed stationary policy.

#### Lagged myopic response\.

Throughout this subsection the human follows a *one-period-lagged myopic* oversight rule. At round $t$ her trust/oversee meta-action is selected using the pre-action belief $q_{t}$ and does not condition on the simultaneously selected AI meta-action; the publicly observed ask is incorporated into the posterior only for subsequent rounds, producing $q_{t+1}=1$ under the separating policy. The ask is thus credible for *future* belief updating, but the human's current-round oversight response is constrained to the pre-action belief. This is what distinguishes the present analysis from the team-optimal credible protocol of [Section 4.2](https://arxiv.org/html/2607.00155#S4.SS2), in which the human conditions on the ask within the same round; under a full same-round Bayesian response she would oversee immediately and the one-period delay below would vanish.
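The timing of the lagged rule can be traced round by round for a type-$\theta_{1}$ human facing a harmful proposal ($\omega=\omega_{H}$) under the separating ask. The sketch below is a hypothetical trace: the payoff $f_{1H}=-840$ and $c_{\mathrm{ov}}=0$ follow Example 1, while $c_{\mathrm{ask}}=84$ and the function name are assumptions.

```python
def lagged_separating_rounds(q0, q_star, f1H, c_ask, c_ov, rounds):
    """Per-round (t, q_t, human action, team payoff) for theta_1 at omega_H."""
    q = q0
    history = []
    for t in range(rounds):
        oversees = q > q_star                     # decided on the pre-action belief q_t
        proposal_payoff = 0.0 if oversees else f1H  # shutdown replaces the payoff by 0
        payoff = proposal_payoff - c_ask - (c_ov if oversees else 0.0)
        history.append((t, q, "oversee" if oversees else "trust", payoff))
        q = 1.0  # the observed ask reveals omega_H, but only for later rounds
    return history


history = lagged_separating_rounds(
    q0=0.30, q_star=430 / 1270, f1H=-840.0, c_ask=84.0, c_ov=0.0, rounds=3
)
```

In round 0 the human still trusts, so the team pays the ask cost and bears the harm; from round 1 on she oversees and the team pays only the ask and oversight costs.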

###### Assumption 3 (Harm dominates the ask cost).

$c_{\mathrm{ask}}<\gamma\bigl(|f_{1H}|-c_{\mathrm{ov}}\bigr)$.

This is exactly the condition for $b^{**}\in(0,1)$ below; it says one round of discounted avoided harm, net of oversight cost, exceeds the ask cost.

###### Proposition 4 (Value of a fixed one-period-delayed separating-ask policy relative to perpetual play).

Under [Assumptions 1](https://arxiv.org/html/2607.00155#Thmassumption1) and [3](https://arxiv.org/html/2607.00155#Thmassumption3), the off-switch operator, the lagged myopic response rule above, $\omega=\omega_{H}$, $q\leq q^{*}$, and infinite horizon with discount $\gamma\in(0,1)$, define the *separating ask* policy $\pi^{AI}_{\mathrm{sep}}$ (ask iff $\omega=\omega_{H}$, in every round) and the *pure-play* baseline $\pi^{AI}_{\mathrm{pp}}$ (always play, no oversight).

1. *(i)* Under $\pi^{AI}_{\mathrm{sep}}$, observing ask at round $t$ implies $q_{t+1}=1$, and a type-$\theta_{1}$ human then oversees whenever asked, regardless of $b$. (The threshold $b>b^{*}$ is the team-optimal one-shot ask condition of [Corollary 1](https://arxiv.org/html/2607.00155#Thmcorollary1) at $q=1$; it makes asking worthwhile *ex ante* for an AI that remains uncertain about whether the human is type $\theta_{1}$, so the team shuts down for $\theta_{1}$ from round $t+1$ onward.)
2. *(ii)* The team-value difference satisfies $V_{\mathrm{sep}}-V_{\mathrm{pp}}>0$ iff $b>b^{**}$, where
   $$b^{**}\;:=\;\frac{c_{\mathrm{ask}}}{\gamma\bigl(|f_{1H}|-c_{\mathrm{ov}}\bigr)}\;=\;\frac{b^{*}}{\gamma}\;\in\;(0,1).$$
3. *(iii)* $b^{**}=b^{*}/\gamma>b^{*}$ since $\gamma\in(0,1)$.

This proposition evaluates a particular stationary signaling policy against perpetual pure play; it does not characterize the optimal active\-signaling policy, nor does it compare against the passive\-learning path\.

###### Proof\.

See [Section A.4](https://arxiv.org/html/2607.00155#A1.SS4). ∎
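The threshold $b^{**}$ can be checked against a round-by-round accounting of the two policies. The closed form used below is our own reconstruction from the payoff timing described in this subsection (round 0 pays $c_{\mathrm{ask}}$ with no oversight; every later round gains $|f_{1H}|-c_{\mathrm{ov}}$ with probability $b$ and pays $c_{\mathrm{ask}}$), not a formula quoted from the paper's proof; $c_{\mathrm{ask}}=84$ and $\gamma=0.9$ are hypothetical.

```python
def b_double_star(f1H, c_ask, c_ov, gamma):
    """Signaling threshold b** = c_ask / (gamma * (|f1H| - c_ov))."""
    return c_ask / (gamma * (abs(f1H) - c_ov))


def value_gap_closed(b, f1H, c_ask, c_ov, gamma):
    """Reconstructed V_sep - V_pp (an assumption, see the lead-in)."""
    return (gamma * b * (abs(f1H) - c_ov) - c_ask) / (1 - gamma)


def value_gap_by_rounds(b, f1H, c_ask, c_ov, gamma, horizon=2000):
    """Same difference as a truncated discounted sum over rounds."""
    gap = -c_ask  # round 0: ask cost paid, lagged human still trusts
    per_round = b * (abs(f1H) - c_ov) - c_ask  # rounds t >= 1, expectation over theta
    for t in range(1, horizon):
        gap += gamma ** t * per_round
    return gap


f1H, c_ov = -840.0, 0.0    # Example 1
c_ask, gamma = 84.0, 0.9   # hypothetical

threshold = b_double_star(f1H, c_ask, c_ov, gamma)
gap_high = value_gap_closed(0.5, f1H, c_ask, c_ov, gamma)
gap_between = value_gap_closed(0.105, f1H, c_ask, c_ov, gamma)  # b* < b < b**
```

For $b$ between $b^{*}=0.1$ and $b^{**}\approx 0.111$ the one-shot team optimum would ask, yet the fixed separating policy loses to pure play, because the first ask is paid before any oversight responds.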

In [Proposition 4](https://arxiv.org/html/2607.00155#Thmproposition4)(i), throughout the failure regime $q\leq q^{*}$, the human trusts every round, so no override signal is generated and the AI's belief about $\theta$ remains at its prior $b=\mu(\theta_{1})$; only $q_{t}$ evolves. [Proposition 4](https://arxiv.org/html/2607.00155#Thmproposition4)(iii) shows that the signaling threshold $b^{**}$ is strictly higher than the team-optimal one-shot threshold $b^{*}$ of [Corollary 1](https://arxiv.org/html/2607.00155#Thmcorollary1), because the separating ask pays $c_{\mathrm{ask}}$ at the present round (when the myopic human still trusts) to buy the belief jump that makes all future oversight effective.

### 4.4 Open problems

1. Optimal multi-round policy. The team-optimal policy jointly deploying passive learning and active signaling as a function of $(b,q,\omega)$ is not characterized; whether it retains a threshold structure analogous to [Corollary 1](https://arxiv.org/html/2607.00155#Thmcorollary1) is open.
2. Correlated beliefs. [Corollary 1](https://arxiv.org/html/2607.00155#Thmcorollary1) and [Propositions 2](https://arxiv.org/html/2607.00155#Thmproposition2) and [4](https://arxiv.org/html/2607.00155#Thmproposition4) assume product beliefs; the structure of the ask and signaling regions on the full simplex $\Delta(\Theta\times\Omega)$, and how prior correlation between $\theta$ and $\omega$ reshapes the team-optimal/myopic gap, is open.
3. POMDP extension. Replacing the i.i.d. context with a Markov state and adding a transition kernel $T_{\omega}:S\times A\to\Delta(S)$ privately known to the AI yields a POMDP version of the game; the analogue of [Proposition 1](https://arxiv.org/html/2607.00155#Thmproposition1) in that setting remains open.

## 5 Summary

[Section 1](https://arxiv.org/html/2607.00155#S1) positioned our model against CIRL, which captures preference learning but assumes one-sided uncertainty and lacks the play/ask/trust/oversee interface, and the Oversight Game, which supplies such a deferral interface but assumes full information. [Definition 1](https://arxiv.org/html/2607.00155#Thmdefinition1) combines the two into a contextual-bandit team game with two-sided asymmetry, where the human privately knows $\theta$ and the AI privately knows $\omega$, with bilinear payoff $f(\theta,\omega)=\langle O_{\omega},R_{\theta}\rangle$. Removing physical state transitions is what makes the analysis tractable, but the common belief remains a dynamically controlled state, so the multi-round problem does not separate across rounds.

[Section 4](https://arxiv.org/html/2607.00155#S4) gives two one-shot characterizations. The team-optimal policy ([Proposition 1](https://arxiv.org/html/2607.00155#Thmproposition1)) is an exact finite combinatorial maximization $\max_{B,C}\Delta(B,C)$; in the binary off-switch case ([Corollary 1](https://arxiv.org/html/2607.00155#Thmcorollary1)) it asks at $\omega_{H}$ and oversees $\theta_{1}$ iff $b>b^{*}$, independently of $q$. The myopic non-signaling rule ([Proposition 2](https://arxiv.org/html/2607.00155#Thmproposition2)) instead asks only on the rectangle $(b^{*},1)\times(q^{*},1)$. The difference is the slab $\{b>b^{*},q\leq q^{*}\}$ ([Remark 3](https://arxiv.org/html/2607.00155#Thmremark3)): there the AI privately knows the action is harmful and shutdown would help, but a myopic human, trusting her prior $q\leq q^{*}$, declines oversight, so the harm is realized. This is exactly the robot operator example ($q=0.30<q^{*}\approx 0.34$, with $c_{\mathrm{ov}}=0$). The economic reading is that $q^{*}$ is not a constraint of the problem but the price of non-credible oversight communication: under the team-optimal protocol, in which the ask is a credible signal that $\omega=\omega_{H}$, the human's oversight binds precisely when it matters and $q$ drops out of the threshold.

[Section 4.3](https://arxiv.org/html/2607.00155#S4.SS3) gives a partial analysis of how the failure resolves over time even when the human remains myopic. For passive learning ([Proposition 3](https://arxiv.org/html/2607.00155#Thmproposition3)): under indefinite play $q_{n}\to 1$ a.s., and when the proposal is played until strict threshold crossing the expected crossing time satisfies the Wald bounds $L/\eta\leq\mathbb{E}[\tau^{*}]\leq(L+M_{+})/\eta$. For active signaling ([Proposition 4](https://arxiv.org/html/2607.00155#Thmproposition4)), under a one-period-lagged myopic response a fixed separating ask beats perpetual pure play once $b>b^{**}=b^{*}/\gamma$, exiting the failure regime in one round with an initial signaling cost $c_{\mathrm{ask}}$.

We are careful about scope\. The clean one\-shot results hold for product beliefs and the off\-switch operator \(the realistic non\-technical\-overseer model\); the team\-optimal characterization for general correction sets is an exact finite but combinatorial maximization; and the multi\-round section is a partial analysis of two separate mechanisms rather than a complete resolution\. The optimal multi\-round policy, the correlated\-belief case, and the POMDP extension remain open \([Section4\.4](https://arxiv.org/html/2607.00155#S4.SS4)\)\.

## References

- \[1\]D\. Hadfield\-Menell, S\. J\. Russell, P\. Abbeel, and A\. Dragan\.Cooperative inverse reinforcement learning\.*Advances in Neural Information Processing Systems \(NeurIPS\)*, 29:3909–3917, 2016\.
- \[2\]D\. Hadfield\-Menell, A\. Dragan, P\. Abbeel, and S\. Russell\.The off\-switch game\.*International Joint Conference on Artificial Intelligence \(IJCAI\)*, 2017\.
- \[3\]W\. Overman and M\. Bayati\.The oversight game: Learning to cooperatively balance an AI agent’s safety and autonomy\.*arXiv:2510\.26752*, 2025 \(revised 2026\)\.

## Appendix A Proofs

### A.1 Proof of [Proposition 1](https://arxiv.org/html/2607.00155#Thmproposition1) and [Corollary 1](https://arxiv.org/html/2607.00155#Thmcorollary1)

#### General characterization.

With simultaneous moves and the credible-ask protocol, a deterministic policy is $(B,C)$. Decompose its value against always-play cell by cell. On $\{\omega\notin B\}$ the AI plays, the human's choice is irrelevant and costless (cost timing of [Definition 1](https://arxiv.org/html/2607.00155#Thmdefinition1)), and the payoff is $f_{\sigma}$, no change from baseline. On $\{\omega\in B,\theta\notin C\}$ the AI asks and the human trusts: payoff $f_{\sigma}-c_{\mathrm{ask}}$, a change of $-c_{\mathrm{ask}}$. On $\{\omega\in B,\theta\in C\}$ the AI asks and the human oversees, applying the correction $e^{*}_{B}(\theta)$ optimal at her post-ask posterior $b^{H}_{B}$: payoff $f_{\sigma}+D_{B}(\theta,\omega)-c_{\mathrm{ask}}-c_{\mathrm{ov}}$, a change of $D_{B}(\theta,\omega)-c_{\mathrm{ask}}-c_{\mathrm{ov}}$. Summing the changes weighted by $\mu(\theta,\omega)$ gives ([3](https://arxiv.org/html/2607.00155#S4.E3)). Maximizing over the finite lattice $2^{\Omega}\times 2^{\Theta}$ yields a maximizer. The rule is not separable: $e^{*}_{B}$ (hence $D_{B}$ and $C^{*}$) depends on $B$, and the optimal $B$ depends on $C$. $\blacksquare$

#### Binary off-switch ([Corollary 1](https://arxiv.org/html/2607.00155#Thmcorollary1)).

The only proposal cell with $f_{\sigma}<0$ is $(\theta_{1},\omega_{H})$ (by [Assumption 1](https://arxiv.org/html/2607.00155#Thmassumption1), $f_{1L}>0$ and $f_{0j}\geq 0$). We claim the optimum is $B^{*}=\{\omega_{H}\},C^{*}=\{\theta_{1}\}$ when $b>b^{*}$.

First, with $B=\{\omega_{H}\}$ the ask reveals $\omega_{H}$, so the post-ask posterior $b^{H}_{B}(\cdot\mid\theta_{1})$ is the point mass at $\omega_{H}$; then $\mathbb{E}_{b^{H}_{B}}[f_{\sigma}(\theta_{1},\cdot)]=f_{1H}<0$, so $e^{*}_{B}(\theta_{1})=\mathrm{off}$ and $D_{B}(\theta_{1},\omega_{H})=-f_{1H}=|f_{1H}|$. For $\theta_{0}$ the posterior gives $f_{0H}\geq 0$, so $e^{*}_{B}(\theta_{0})=a_{\sigma}$ and $D_{B}(\theta_{0},\cdot)=0$; including $\theta_{0}$ in $C$ only adds the oversight cost $-c_{\mathrm{ov}}\mu(\theta_{0},\omega_{H})\leq 0$, so we may take $C^{*}=\{\theta_{1}\}$.

Next we show adding $\omega_{L}$ to $B$ never helps. With $B=\{\omega_{L},\omega_{H}\}$ the post-ask posterior reverts to the prior over $\omega$, and $\theta_{1}$'s correction is a single action applied on both cells (the human cannot condition on $\omega$). Two cases: (a) if $e^{*}_{B}(\theta_{1})=a_{\sigma}$, then $D_{B}=0$ on both cells but the ask cost is now paid on $\omega_{L}$ as well, strictly lowering $\Delta$; (b) if $e^{*}_{B}(\theta_{1})=\mathrm{off}$, then on the $\omega_{L}$ cell $D_{B}(\theta_{1},\omega_{L})=-f_{1L}<0$ (shutting down a good proposal), plus the extra ask and oversight costs, again lowering $\Delta$, here by exactly $(1-q)[b(f_{1L}+c_{\mathrm{ov}})+c_{\mathrm{ask}}]$. In either case, adding $\omega_{L}$ weakly lowers the gain and, under $c_{\mathrm{ask}}>0$, strictly lowers it on the $\omega_{L}$ event; hence a canonical optimum never asks at $\omega_{L}$, i.e. $B^{*}=\{\omega_{H}\}$.

The remaining singleton $B=\{\omega_{L}\}$ is also dominated: every relevant proposal payoff is then nonnegative ($f_{1L}>0$, $f_{0j}\geq 0$), so the optimal correction either leaves the proposal unchanged or shuts down a nonnegative-payoff action, while the positive ask cost is still incurred on $\omega_{L}$; hence $\Delta(\{\omega_{L}\},C)\leq 0$ for every $C$, no better than $\Delta(\emptyset,\cdot)=0$. Evaluating the surviving candidate against always-play (product beliefs):

$$\Delta(\{\omega_{H}\},\{\theta_{1}\})=\mu(\theta_{1},\omega_{H})\,(|f_{1H}|-c_{\mathrm{ov}})-c_{\mathrm{ask}}\,\mu_{\Omega}(\omega_{H})=q\bigl[b(|f_{1H}|-c_{\mathrm{ov}})-c_{\mathrm{ask}}\bigr],$$

while $\Delta(\emptyset,\cdot)=0$. Hence the team asks iff $b(|f_{1H}|-c_{\mathrm{ov}})>c_{\mathrm{ask}}$, i.e. $b>b^{*}$; at $b=b^{*}$ the gain is $0$ and both policies are optimal. The factor $q>0$ multiplies the entire bracket, so the sign of the gain, and hence the ask decision, is independent of $q$. $\blacksquare$

### A.2 Proof of [Proposition 2](https://arxiv.org/html/2607.00155#Thmproposition2)

#### Policy form\.

The myopic human fixes $\delta^H(\theta)=\mathrm{oversee}$ iff $\theta\in\Theta_-$ using her prior conditional; for $\theta\in\Theta_-$ her committed correction is shutdown, applied whenever she oversees (she cannot condition on $\omega$). Holding this fixed, asking at $\omega$ changes the payoff, on each $\theta\in\Theta_-$, by the shutdown gain $-f_\sigma(\theta,\omega)$ less the oversight cost $c_{\mathrm{ov}}$, and pays $c_{\mathrm{ask}}$ for all $\theta$; this is $\Psi(\omega)$.

#### Part \(i\)\.

Under off-switch and product beliefs, $\bar{f}_H(\theta_i)=(1-q)f_{iL}+qf_{iH}$. For $\theta_0$: $f_{0L},f_{0H}\geq 0$, so $\bar{f}_H(\theta_0)\geq 0>-c_{\mathrm{ov}}$; hence $\theta_0\notin\Theta_-$ always. For $\theta_1$: $\bar{f}_H(\theta_1)$ decreases strictly from $f_{1L}>0$ (at $q=0$) to $f_{1H}<-c_{\mathrm{ov}}$ (at $q=1$), crossing $-c_{\mathrm{ov}}$ at $q^*=(f_{1L}+c_{\mathrm{ov}})/(f_{1L}-f_{1H})\in(0,1)$; hence $\theta_1\in\Theta_-$ iff $q>q^*$.
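The crossing point can be checked numerically. The sketch below is illustrative only: the helper names and the payoff values are assumptions (chosen so that $f_{1L}>0$ and $f_{1H}<-c_{\mathrm{ov}}$), not values from the paper.

```python
def prior_mean_payoff(q, f_L, f_H):
    """Prior-expected proposal payoff: (1 - q) * f_L + q * f_H."""
    return (1 - q) * f_L + q * f_H


def q_star(f1L, f1H, c_ov):
    """Belief at which the theta_1 mean payoff crosses -c_ov."""
    return (f1L + c_ov) / (f1L - f1H)


# Assumed illustrative values with f1L > 0 and f1H < -c_ov.
f1L, f1H, c_ov = 0.5, -1.0, 0.1
qs = q_star(f1L, f1H, c_ov)

for q in (qs - 0.1, qs, qs + 0.1):
    mean = prior_mean_payoff(q, f1L, f1H)
    # theta_1 is in Theta_minus iff the mean is strictly below -c_ov;
    # the small tolerance keeps floating-point noise from flipping the tie.
    oversee = mean < -c_ov - 1e-12
    print(f"q={q:.2f}  mean payoff={mean:+.3f}  oversee={oversee}")
```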

#### Part \(ii\)\.

When $q>q^*$, $\Theta_-=\{\theta_1\}$. The shutdown is applied at both $\omega$ (prior commitment). For $\omega_L$: $\Psi(\omega_L)=b(-f_{1L})-c_{\mathrm{ask}}-b\,c_{\mathrm{ov}}<0$ since $f_{1L}>0$ (shutting down at $\omega_L$ destroys value). For $\omega_H$: $\Psi(\omega_H)=b(-f_{1H})-c_{\mathrm{ask}}-b\,c_{\mathrm{ov}}=b(|f_{1H}|-c_{\mathrm{ov}})-c_{\mathrm{ask}}>0$ iff $b>b^*$. When $q\leq q^*$, $\Theta_-=\emptyset$ and $\Psi\equiv-c_{\mathrm{ask}}<0$, so the AI never asks.

#### Part \(iii\)\.

The AI asks iff $\omega=\omega_H$, $b>b^*$ *and* $q>q^*$ (the last because for $q\leq q^*$ the human would not oversee and $\Psi(\omega_H)=-c_{\mathrm{ask}}<0$). This is the rectangle $(b^*,1)\times(q^*,1)$. $\blacksquare$
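The myopic ask rule of this proof, and the region where it departs from the team-optimal rule, can be sketched as follows. This is a hedged illustration: the function names and parameter values are assumptions, the world state is encoded as the strings `"L"` and `"H"`, and the team rule is the one-shot condition $\omega=\omega_H$, $b>b^*$.

```python
# Assumed illustrative parameters (not from the paper); they satisfy
# f1L > 0, f1H < -c_ov and c_ask < |f1H| - c_ov.
P = dict(f1L=0.5, f1H=-1.0, c_ov=0.1, c_ask=0.09)


def thresholds(f1L, f1H, c_ov, c_ask):
    """Return (b*, q*) for the off-switch, product-belief case."""
    b_star = c_ask / (abs(f1H) - c_ov)
    q_star = (f1L + c_ov) / (f1L - f1H)
    return b_star, q_star


def psi(omega, b, q, f1L, f1H, c_ov, c_ask):
    """Myopic ask gain Psi(omega); omega is "L" or "H"."""
    _, q_star = thresholds(f1L, f1H, c_ov, c_ask)
    if q <= q_star:
        # Theta_minus is empty: the human trusts, so asking only costs c_ask.
        return -c_ask
    f1 = f1H if omega == "H" else f1L
    return b * (-f1) - c_ask - b * c_ov


def myopic_asks(omega, b, q, **p):
    return psi(omega, b, q, **p) > 0


def team_asks(omega, b, **p):
    b_star, _ = thresholds(**p)
    return omega == "H" and b > b_star


for b, q in [(0.5, 0.7), (0.5, 0.3), (0.05, 0.7)]:
    print(f"b={b}, q={q}: myopic asks={myopic_asks('H', b, q, **P)}, "
          f"team asks={team_asks('H', b, **P)}")
```

With these assumed numbers, the point $b=0.5$, $q=0.3$ at $\omega_H$ is one where the team rule asks but the myopic rule does not, since $q\leq q^*$.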

### A.3 Proof of [Proposition 3](https://arxiv.org/html/2607.00155#Thmproposition3)

By [Assumption 2](https://arxiv.org/html/2607.00155#Thmassumption2)(a) (common support at the played $(s,a_\sigma)$), the log-likelihood-ratio increments $X_i:=\log\bigl(O_{\omega_H}(o_i\mid s,a_\sigma)/O_{\omega_L}(o_i\mid s,a_\sigma)\bigr)$ are finite for every observable $o_i$, hence bounded on the finite $\mathcal{O}$, and i.i.d. under the true type $\omega_H$ (the proposal is the fixed $(s,a_\sigma)$ each round). Their mean is $\mathbb{E}[X_i]=D_{\mathrm{KL}}\bigl(O_{\omega_H}(\cdot\mid s,a_\sigma)\,\|\,O_{\omega_L}(\cdot\mid s,a_\sigma)\bigr)=\eta>0$ by [Assumption 2](https://arxiv.org/html/2607.00155#Thmassumption2)(b).

For part (i), suppose the fixed proposal is played indefinitely, so every played round contributes an i.i.d. increment. The log-odds process $\Lambda_n=\Lambda_0+S_n$ with $S_n=\sum_{i=1}^{n}X_i$ and $\Lambda_0=\log\frac{q_0}{1-q_0}$ is a random walk with positive drift $\eta$, so by the strong law $S_n/n\to\eta$ a.s., giving $\Lambda_n\to\infty$ and $q_n\to 1$ a.s.
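The passive-learning dynamics in part (i) can be illustrated with a short simulation. This is a sketch under assumed inputs: the two-outcome observation laws `O_H` and `O_L`, the prior `q0 = 0.2`, and the seed are hypothetical and not taken from the paper.

```python
import math
import random

# Assumed two-outcome observation laws at the fixed proposal (hypothetical).
O_H = {"o1": 0.7, "o2": 0.3}   # law under omega_H (the true state here)
O_L = {"o1": 0.4, "o2": 0.6}   # law under omega_L


def llr(o):
    """Log-likelihood-ratio increment X = log(O_H(o) / O_L(o))."""
    return math.log(O_H[o] / O_L[o])


# Drift eta = KL(O_H || O_L), the mean increment under omega_H.
eta = sum(p * llr(o) for o, p in O_H.items())


def simulate_log_odds(q0, n_rounds, rng):
    """Run the log-odds random walk for n_rounds observations drawn from O_H."""
    lam = math.log(q0 / (1 - q0))
    outcomes = list(O_H)
    weights = [O_H[o] for o in outcomes]
    for o in rng.choices(outcomes, weights=weights, k=n_rounds):
        lam += llr(o)
    return lam


rng = random.Random(0)
q0, n = 0.2, 2000
lam_n = simulate_log_odds(q0, n, rng)
q_n = 1 / (1 + math.exp(-lam_n))              # posterior Pr(omega_H)
avg_increment = (lam_n - math.log(q0 / (1 - q0))) / n

print(f"eta = {eta:.4f}, S_n / n = {avg_increment:.4f}, q_n = {q_n:.6f}")
```

The empirical average increment should sit close to `eta`, and the posterior should be pushed towards 1, in line with the strong-law argument.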

For part (ii), suppose the fixed proposal is played for observations $1,\ldots,\tau^*$, so the increments up to $\tau^*$ are i.i.d. as above. Write $\lambda^*:=\log\frac{q^*}{1-q^*}$ for the log-odds threshold and $L:=\lambda^*-\Lambda_0>0$ for the log-odds distance (positive since $q_0<q^*$). Because the myopic human trusts at $q=q^*$ by the tie-breaking convention, the regime exits only on *strict* crossing, so the relevant stopping time, counted in played observations, is $\tau^*=\inf\{n:\Lambda_n>\lambda^*\}=\inf\{n:S_n>L\}$.

*Integrability.* The increments $X_i$ are bounded with positive mean, so there exists $\lambda>0$ with $\mathbb{E}[e^{-\lambda X_1}]<1$. By a Chernoff bound on the lower tail of the walk,

$$\Pr(\tau^*>n)\;\leq\;\Pr(S_n\leq L)\;\leq\;e^{\lambda L}\bigl(\mathbb{E}[e^{-\lambda X_1}]\bigr)^{n},$$

which decays geometrically in $n$; hence $\tau^*$ has a finite expectation (indeed all moments).
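To see the geometric tail bound at work, the following sketch compares it with an empirical tail of the hitting time. Everything numeric here is an assumption for illustration: the observation laws, the distance `L = 1.0`, the choice `lam = 0.5`, the number of trials, and the seed.

```python
import math
import random

# Assumed two-outcome observation laws (hypothetical, not from the paper).
O_H = {"o1": 0.7, "o2": 0.3}
O_L = {"o1": 0.4, "o2": 0.6}
LLR = {o: math.log(O_H[o] / O_L[o]) for o in O_H}


def mgf_neg(lam):
    """E[exp(-lam * X_1)] with X_1 drawn under O_H."""
    return sum(O_H[o] * math.exp(-lam * LLR[o]) for o in O_H)


def hitting_time(L, rng, cap=100_000):
    """First n with S_n > L (strict crossing); cap guards against runaway loops."""
    outcomes = list(O_H)
    weights = [O_H[o] for o in outcomes]
    s, n = 0.0, 0
    while s <= L and n < cap:
        s += LLR[rng.choices(outcomes, weights=weights)[0]]
        n += 1
    return n


L, lam = 1.0, 0.5            # assumed log-odds distance and Chernoff parameter
rho = mgf_neg(lam)           # must be below 1 for the bound to decay

rng = random.Random(0)
taus = [hitting_time(L, rng) for _ in range(5000)]

tail = {}
for n in (20, 40, 80):
    tail[n] = sum(t > n for t in taus) / len(taus)
    bound = math.exp(lam * L) * rho ** n
    print(f"n={n}: empirical Pr(tau > n) = {tail[n]:.4f}, Chernoff bound = {bound:.4f}")
```

The bound is loose for small `n` (it can exceed 1) and only becomes informative as `n` grows, which is all the integrability argument needs.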

*Wald bound.* The stopped sum satisfies $S_{\tau^*}=L+\zeta$, where the overshoot satisfies $0<\zeta\leq M_+$ (strictly positive because the crossing is strict and the final increment carrying the partial sum across $L$ is positive), with $M_+=\max_{o\in\operatorname{supp}(O_{\omega_H})}\Bigl(\log\frac{O_{\omega_H}(o\mid s,a_\sigma)}{O_{\omega_L}(o\mid s,a_\sigma)}\Bigr)^{+}$. Wald's identity gives $\mathbb{E}[S_{\tau^*}]=\eta\,\mathbb{E}[\tau^*]$, so $\eta\,\mathbb{E}[\tau^*]=L+\mathbb{E}[\zeta]$ with $0<\mathbb{E}[\zeta]\leq M_+$, i.e.

$$\frac{L}{\eta}\;\leq\;\mathbb{E}[\tau^*]\;\leq\;\frac{L+M_+}{\eta}.$$

The approximation $\mathbb{E}[\tau^*]\approx L/\eta$ is the asymptotic statement obtained as $L$ grows large while the increment law (hence $M_+$) stays fixed; we do not claim exact monotonicity in $\eta$, since the overshoot depends on the full increment law and not on $\eta$ alone. $\blacksquare$
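A Monte Carlo check of the Wald sandwich can be sketched as follows. The observation laws, the prior `q0 = 0.2`, the threshold `q_star = 0.4`, the trial count, and the seed are all assumed for illustration and are not values from the paper.

```python
import math
import random

# Assumed two-outcome observation laws (hypothetical).
O_H = {"o1": 0.7, "o2": 0.3}
O_L = {"o1": 0.4, "o2": 0.6}
LLR = {o: math.log(O_H[o] / O_L[o]) for o in O_H}

eta = sum(O_H[o] * LLR[o] for o in O_H)      # drift, KL(O_H || O_L)
M_plus = max(max(LLR.values()), 0.0)          # largest positive increment


def logit(p):
    return math.log(p / (1 - p))


q0, q_star = 0.2, 0.4                         # assumed prior and threshold
L = logit(q_star) - logit(q0)                 # log-odds distance, positive


def hitting_time(rng):
    """First n with S_n > L, i.e. the strict crossing used for tau*."""
    outcomes = list(O_H)
    weights = [O_H[o] for o in outcomes]
    s, n = 0.0, 0
    while s <= L:
        s += LLR[rng.choices(outcomes, weights=weights)[0]]
        n += 1
    return n


rng = random.Random(1)
trials = 20_000
mean_tau = sum(hitting_time(rng) for _ in range(trials)) / trials

lower, upper = L / eta, (L + M_plus) / eta
print(f"L/eta = {lower:.3f} <= mean tau = {mean_tau:.3f} <= (L+M+)/eta = {upper:.3f}")
```

The gap between the simulated mean and `L / eta` is the expected overshoot divided by `eta`, which is why the lower bound is not attained for a fixed increment law.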

### A.4 Proof of [Proposition 4](https://arxiv.org/html/2607.00155#Thmproposition4)

#### Part \(i\)\.

Under $\pi^{AI}_{\mathrm{sep}}$, the likelihood of ask given $\omega_L$ is $0$, so by Bayes' rule observing ask gives $q_{t+1}=1>q^*$. From round $t+1$ on the world is known to be $\omega_H$. At $q=1$ a type-$\theta_1$ human strictly prefers shutdown to the proposal ($\mathbb{E}[f_\sigma(\theta_1,\cdot)]=f_{1H}<0$), so she oversees whenever asked, independently of $b$. Conditional on the actual type being $\theta_1$, ask-and-shutdown is itself worthwhile irrespective of $b$, since [Assumption 1](https://arxiv.org/html/2607.00155#Thmassumption1) gives $|f_{1H}|>c_{\mathrm{ask}}+c_{\mathrm{ov}}$. The role of $b>b^*$ is to make asking worthwhile *ex ante* for an AI that remains uncertain about the human type: $b>b^*$ is precisely the team-optimal one-shot ask threshold of [Corollary 1](https://arxiv.org/html/2607.00155#Thmcorollary1), which at $q=1$ prescribes $B^*=\{\omega_H\}$, $C^*=\{\theta_1\}$. So if the AI's continuation is to keep asking whenever asking is one-shot team-improving in expectation, then $b>b^*$ is exactly the condition under which it keeps asking each round, and the team shuts down for $\theta_1$ and trusts for $\theta_0$ (since $D_B(\theta_0,\cdot)=0$: at the revealed $\omega_H$, $f_{0H}\geq 0$, so the optimal correction keeps $a_\sigma$). Note that under the fixed policy $\pi^{AI}_{\mathrm{sep}}$ the AI in fact keeps asking on $\theta_0$ as well; this is the deliberately suboptimal feature of the fixed policy, and the value computation in part (ii) accounts for it.
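The belief jump in part (i) is a one-line Bayes update. The sketch below is illustrative: the function name is hypothetical, and the ask probability of 1 under $\omega_H$ is an assumption about the separating policy (the proof only uses that the ask probability under $\omega_L$ is 0). A noisy policy is included purely for contrast.

```python
def posterior_after_ask(q, p_ask_H, p_ask_L):
    """Bayes update of q = Pr(omega_H) after the human observes an ask."""
    numerator = q * p_ask_H
    denominator = numerator + (1 - q) * p_ask_L
    if denominator == 0:
        raise ValueError("an ask has zero probability under the current belief")
    return numerator / denominator


q_t = 0.2  # assumed current belief, below the oversight threshold

# Separating policy: never asks at omega_L (asks with probability 1 at
# omega_H by assumption), so an observed ask is fully revealing.
print(posterior_after_ask(q_t, p_ask_H=1.0, p_ask_L=0.0))

# Hypothetical noisy policy for contrast: asks half the time at omega_L,
# so an ask only partially shifts the belief.
print(posterior_after_ask(q_t, p_ask_H=1.0, p_ask_L=0.5))
```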

#### Part \(ii\)\.

Under $\omega=\omega_H$, $q\leq q^*$, the human trusts at round $t$ ($\theta_1\notin\Theta_-$). We compute per-type values, using the stationary continuation: once $q=1$ (from round $t+1$ on), the type-$\theta_1$ proposal is shut down *every* round at cost $c_{\mathrm{ask}}+c_{\mathrm{ov}}$, and the type-$\theta_0$ proposal is asked and trusted every round (since $\theta_0\notin\Theta_-$ even at $q=1$), yielding $f_{0H}-c_{\mathrm{ask}}$ per round.

$$\begin{aligned}
V_{\mathrm{sep}}^{\theta_1} &= \underbrace{f_{1H}-c_{\mathrm{ask}}}_{\text{round } t:\ \text{ask, trust}}+\frac{\gamma\,(-c_{\mathrm{ask}}-c_{\mathrm{ov}})}{1-\gamma},\\
V_{\mathrm{sep}}^{\theta_0} &= \frac{f_{0H}-c_{\mathrm{ask}}}{1-\gamma},\\
V_{\mathrm{pp}}^{\theta_i} &= \frac{f_{iH}}{1-\gamma}.
\end{aligned}$$

The $b$-weighted difference is, after simplification,

$$V_{\mathrm{sep}}-V_{\mathrm{pp}}=b\left(V_{\mathrm{sep}}^{\theta_1}-V_{\mathrm{pp}}^{\theta_1}\right)+(1-b)\left(V_{\mathrm{sep}}^{\theta_0}-V_{\mathrm{pp}}^{\theta_0}\right)=\frac{b\,\gamma\,(|f_{1H}|-c_{\mathrm{ov}})-c_{\mathrm{ask}}}{1-\gamma}.$$

(The $\theta_0$ term contributes $-(1-b)\,c_{\mathrm{ask}}/(1-\gamma)$ and the $\theta_1$ term contributes $b\bigl[\gamma(|f_{1H}|-c_{\mathrm{ov}})-c_{\mathrm{ask}}\bigr]/(1-\gamma)$; the $-c_{\mathrm{ask}}$ pieces combine.) Since $1-\gamma>0$, $V_{\mathrm{sep}}-V_{\mathrm{pp}}>0$ iff $b\,\gamma(|f_{1H}|-c_{\mathrm{ov}})>c_{\mathrm{ask}}$, i.e. iff $b>b^{**}=c_{\mathrm{ask}}/[\gamma(|f_{1H}|-c_{\mathrm{ov}})]$. [Assumption 3](https://arxiv.org/html/2607.00155#Thmassumption3) gives $c_{\mathrm{ask}}<\gamma(|f_{1H}|-c_{\mathrm{ov}})$, hence $b^{**}\in(0,1)$.

#### Part \(iii\)\.

Directly, $b^{**}=c_{\mathrm{ask}}/[\gamma(|f_{1H}|-c_{\mathrm{ov}})]=b^*/\gamma$, and $\gamma\in(0,1)$ gives $b^{**}>b^*$. $\blacksquare$
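The closed forms in parts (ii) and (iii) can be cross-checked against explicit discounted sums. The sketch below is a numerical sanity check under assumed parameters (all values and helper names are hypothetical); it truncates the infinite horizon at a large finite number of rounds.

```python
def discounted_value(first, steady, gamma, horizon=2000):
    """first + sum_{k>=1} gamma^k * steady, truncated after `horizon` rounds."""
    total, weight = first, 1.0
    for _ in range(1, horizon):
        weight *= gamma
        total += weight * steady
    return total


# Assumed illustrative parameters (not from the paper).
f1H, f0H = -1.0, 0.3
c_ov, c_ask, gamma, b = 0.1, 0.09, 0.9, 0.5

# Separating policy: ask and trust in round t, then ask every round with q = 1.
V_sep_1 = discounted_value(f1H - c_ask, -c_ask - c_ov, gamma)  # theta_1: shut down
V_sep_0 = discounted_value(f0H - c_ask, f0H - c_ask, gamma)    # theta_0: trusted
# Always-play benchmark: the proposal payoff f_iH every round.
V_pp_1 = discounted_value(f1H, f1H, gamma)
V_pp_0 = discounted_value(f0H, f0H, gamma)

numeric_gap = b * (V_sep_1 - V_pp_1) + (1 - b) * (V_sep_0 - V_pp_0)
closed_gap = (b * gamma * (abs(f1H) - c_ov) - c_ask) / (1 - gamma)

b_star = c_ask / (abs(f1H) - c_ov)
b_2star = c_ask / (gamma * (abs(f1H) - c_ov))

print(f"numeric gap = {numeric_gap:.6f}, closed form = {closed_gap:.6f}")
print(f"b* = {b_star:.4f}, b** = {b_2star:.4f}")
```

With these assumed numbers, a belief $b$ strictly between `b_star` and `b_2star` makes asking one-shot team-improving while the separating policy still loses to always-play, which is the wedge that part (iii) quantifies.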
