Mind the Sim-to-Real Gap & Think Like a Scientist

arXiv cs.AI 05/22/26, 04:00 AM Papers
sim-to-real reinforcement-learning simulation experimentation policy causal-inference sequential-decision
Summary
This paper studies when and how a planner should supplement a pre-trained simulator with real experiments in sequential decision problems, proposing Fisher-SEP to minimize posterior variance of a target policy's value.
arXiv:2605.21458v1 Announce Type: new
# Mind the Sim-to-Real Gap & Think Like a Scientist

The first half of the title borrows the London Underground announcement Mind the Gap, which the paper takes as a metaphor for the local-and-reachability decomposition of the value gap (Proposition 1). The second half borrows the children’s song Think Like a Scientist by GoNoodle, and points at the prescription: use the simulator to choose where to experiment, rather than to choose a policy to deploy.
Source: [https://arxiv.org/html/2605.21458](https://arxiv.org/html/2605.21458)
Harsh Parikh (Amazon SCOT, Seattle, USA; Yale University, New Haven, USA), Gabriel Levin-Konigsberg (Amazon SCOT, Seattle, USA), Dominique Perrault-Joncas (Amazon SCOT, Seattle, USA), Alexander Volfovsky (Amazon SCOT, Seattle, USA; Duke University, Durham, USA)

###### Abstract

Suppose a planner has a pre\-trained simulator of a sequential decision problem and the option to run real experiments in the field\. The simulator is cheap to query but inherits confounding and drift from its calibration data\. Experimentation is unbiased but consumes one real unit per trial\. We study when, and how, the planner should supplement the simulator with experiments\. We give three results\. First, an extended simulation lemma decomposes the simulator’s value error into a calibration–deployment shift that randomization can identify and a parametric residual that no further interaction can reduce\. Second, the value gap between the simulator\-optimal policy and the optimum splits into a local component, on states the deployed policy already visits, and a reachability component, on states it does not\. The reachability component stays bounded away from zero at any horizon under purely passive learning\. Third, we propose Fisher\-SEP, a simulation\-aided experimental policy \(SEP\) that minimizes the posterior predictive variance of a target policy’s value, with reward\-only and transition\-only specializations\. Two case studies illustrate the regimes\. In a vending\-machine supply chain, front\-loaded experimentation overtakes posterior updating once the horizon is long enough to amortize the pilot\. In an HIV mobile\-testing example with a corridor that separates a well\-surveilled region from a poorly\-surveilled one, only designed exploration reaches the poorly\-surveilled region\.

## 1 Introduction

A mobile HIV-testing program plans where to send vans each week. The team has a fixed weekly budget of vans and a pre-trained *simulator* of neighborhood prevalence: a probabilistic forecast model, fit to two years of clinic data, that predicts the expected number of new cases found for each zone and each weekly testing choice (Gonsalves et al., [2018](https://arxiv.org/html/2605.21458#bib.bib150); Warren et al., [2025](https://arxiv.org/html/2605.21458#bib.bib152)). The simulator ranks zones by expected new-case yield, so the cheap option is to deploy the resulting route as is. The other option is to divert a fraction of vans to zones the simulator ranks low. Diverting costs immediate outreach in well-served areas, but it is the only way to learn whether a low-ranked zone is genuinely low-prevalence or has merely never been tested. This is the question we study.

This question is more general than HIV testing. A planner with a pre-trained simulator of any real-world system has to decide whether, when, and how to supplement it with real-world data. The simulator is built from history, and that history was collected under actions chosen in the presence of a *hidden state*: a feature of the world that affects rewards and transitions but is not directly observed. Because the historical operator’s actions and the hidden state may be correlated, the simulator’s calibration data confounds the action’s effect with the effect of the hidden state. This is what causal inference calls *confounding* (Pearl, [2009](https://arxiv.org/html/2605.21458#bib.bib140); Bareinboim and Pearl, [2016](https://arxiv.org/html/2605.21458#bib.bib29)): the conditional mean reward in the calibration data combines the causal effect of the action and the effect of the hidden state given the operator’s choice. (Concretely: if the historical operator only sent vans to a zone when prevalence was high, then high prevalence and “send a van” are both elevated in the data, and the simulator cannot tell whether the high yield came from the zone or from the operator’s choice.)

A second problem persists even after deployment. Large regions of the state-action space may never appear in any trajectory the planner generates. The causal-inference name for this is a *positivity* violation: a state-action pair with zero probability under the deployed policy produces no evidence, so no amount of further online data speaks to it (Parikh et al., [2024b](https://arxiv.org/html/2605.21458#bib.bib163), [2025b](https://arxiv.org/html/2605.21458#bib.bib164)). Confounding and positivity violations are two faces of the same picture. The simulator is a *Markov decision process* (MDP) on the observed state, that is, a model whose next state and reward are Markov functions of the current state and action. The world is a partially observable MDP (POMDP), which generalizes an MDP by letting the relevant state have both an observed and a hidden component (Zhang and Bareinboim, [2016](https://arxiv.org/html/2605.21458#bib.bib133); Namkoong et al., [2020](https://arxiv.org/html/2605.21458#bib.bib131)). (Throughout, *simulator* means a pre-trained dynamics model treated as an MDP, and *experimentation* means a deliberately randomized real-world study.)

These observations motivate two questions. *Given a pre-trained simulator, when should a planner run real-world experiments?* And, *how is the simulator best used: as a tool to learn a deployable policy or as a tool to choose experiments?*

Offline reinforcement learning (RL) and confounded MDPs supply tools for learning from logged data and diagnosing latent confounders (Levine et al., [2020](https://arxiv.org/html/2605.21458#bib.bib24); Wang et al., [2021](https://arxiv.org/html/2605.21458#bib.bib130); Zhang and Bareinboim, [2016](https://arxiv.org/html/2605.21458#bib.bib133); Kallus and Zhou, [2018](https://arxiv.org/html/2605.21458#bib.bib76)), but treat the historical dataset as the only signal and do not jointly optimize its use against future field experiments. Bayes-adaptive MDPs (where the unknown MDP parameters are treated as hidden state and updated by Bayes’ rule from observed rewards and transitions) and posterior sampling formalize how a posterior should adapt during deployment (Duff, [2002](https://arxiv.org/html/2605.21458#bib.bib116); Guez et al., [2012](https://arxiv.org/html/2605.21458#bib.bib117); Osband et al., [2013](https://arxiv.org/html/2605.21458#bib.bib91); Russo et al., [2018](https://arxiv.org/html/2605.21458#bib.bib55)), but update over the same parametric family the simulator already uses and assume positivity rather than diagnose when it fails. Sim-to-real and hybrid RL combine simulated and real interaction, including simulator-directed Fisher-information-based exploration (Tobin et al., [2017](https://arxiv.org/html/2605.21458#bib.bib13); Wagenmaker and Jamieson, [2024](https://arxiv.org/html/2605.21458#bib.bib118); Ball et al., [2023](https://arxiv.org/html/2605.21458#bib.bib35); Song et al., [2023](https://arxiv.org/html/2605.21458#bib.bib123); Memmel et al., [2024](https://arxiv.org/html/2605.21458#bib.bib145)), but typically deploy the simulator as a warm start.
Causal transportability characterizes when effects identified in one population transfer to another (Bareinboim and Pearl, [2016](https://arxiv.org/html/2605.21458#bib.bib29); Parikh et al., [2025a](https://arxiv.org/html/2605.21458#bib.bib147); Lanners et al., [2025](https://arxiv.org/html/2605.21458#bib.bib148)), but stops short of saying which experiments to run when transport fails. Multi-fidelity Bayesian optimization treats the simulator as a cheap surrogate for a black-box objective (Poloczek et al., [2017](https://arxiv.org/html/2605.21458#bib.bib11); Kandasamy et al., [2017](https://arxiv.org/html/2605.21458#bib.bib9)), but does not separate states the deployed policy visits from those it does not. Our framework addresses a question that is logically prior to all of these: for a given simulator and finite planning horizon, which portion of the value gap is closable by passive online updating, and which portion requires deliberate experimentation? See Appendix [H](https://arxiv.org/html/2605.21458#A8) for a detailed comparison.

Contributions. We work in a finite-latent decision-theoretic framework where the simulator is an MDP on the observed state and the world is a POMDP. We name the planner’s three options: trust the simulator and deploy (the *simulator-optimal policy*, SOP), take the simulator as a Bayesian prior and update passively (the *adaptive SOP*, A-SOP), or use the simulator’s structure to choose experiments (a *simulation-aided experimental policy*, SEP). The simulator’s error decomposes into a component on states the deployed policy visits (its visitation support) and a component on states it does not, and these two components do not respond to online data in the same way. (i) *Gap decomposition.* The value gap between the deployed policy and the optimal one decomposes into a *local* part, on the deployed policy’s visitation support, and a *reachability* part, on states outside it (Proposition [1](https://arxiv.org/html/2605.21458#Thmproposition1), Theorem [6](https://arxiv.org/html/2605.21458#Thmtheorem6)). Posterior-mean-optimal online updating asymptotically closes the local part but provably leaves the reachability part open. (ii) *Fisher-SEP.* We propose Fisher-SEP, a Fisher-information-directed instance of SEP that minimizes the posterior predictive variance (PVV) of the value of a chosen target policy (Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5)), with two specializations for transition-known and reward-known regimes. (iii) *Case studies and diagnostic.* A vending-machine supply chain instantiates the local regime. An HIV mobile-testing campaign instantiates the reachability regime. A per-pair diagnostic, the Exploration Priority Index (EPI), scores how much pilot effort each $(s,a)$ pair deserves (Remark [5](https://arxiv.org/html/2605.21458#Thmremark5)).

Organization. Section [2](https://arxiv.org/html/2605.21458#S2) fixes the world, the simulator, and the policy classes. Section [3](https://arxiv.org/html/2605.21458#S3) states the gap decomposition. Section [4](https://arxiv.org/html/2605.21458#S4) defines Fisher-SEP. Section [5](https://arxiv.org/html/2605.21458#S5) reports the case studies. Section [6](https://arxiv.org/html/2605.21458#S6) closes. Figure [1](https://arxiv.org/html/2605.21458#S1.F1.7) illustrates the decomposition on a four-state example.

![Refer to caption](https://arxiv.org/html/2605.21458v1/x1.png)

Figure 1: Gap decomposition on the visitation simplex for a four-state example. Blue contours: value under the simulator’s model, confined to a face of the simplex. Orange contours: value under the true environment, peaking at the corner the simulator ignores. Policies that either trust the simulator ($\bullet$) or update online without experimenting ($\blacksquare$) remain on the simulator’s face and close only the local gap; an experimental policy ($\blacktriangle$) reaches the ignored corner and closes the reachability gap.

## 2 Setup

A population of $n$ units, indexed by $i \in \{1,\dots,n\}$, evolves over discrete time $t \in \{0,\dots,T\}$ with discount factor $\gamma \in [0,1)$, a weight that downweights rewards received far in the future so that the infinite-sum value is finite; the *effective horizon* $T_{\mathrm{eff}} := 1/(1-\gamma)$ summarizes how far the policy plans ahead. Each unit has an *observed state* $S_{i,t} \in \mathbb{S}$ and a *hidden state* $H_{i,t} \in \mathbb{H}$, with $\mathbb{S}$ and $\mathbb{H}$ finite. The planner chooses an action $A_{i,t} \in \mathbb{A}$ from a finite set, observes a reward $R_{i,t} \sim \rho(\cdot \mid S_{i,t}, A_{i,t}, H_{i,t}) \in [0, R_{\max}]$, and the state transitions as $(S_{i,t+1}, H_{i,t+1}) \sim \wp(\cdot \mid S_{i,t}, A_{i,t}, H_{i,t})$. Write $\bar{r}(s,a,h) := \mathbb{E}[R \mid s,a,h]$ for the expected reward at the truth and $\wp_{S}(\cdot \mid s,a,h)$ for the next-observed-state kernel.

Because $H$ is hidden, the planner sees only the marginal process over $(S,A,R)$, and that marginal is not Markov on $S$ alone: past $S,A$ trajectories carry information about the current $H$. The hidden state plays two roles. First, it carries *confounding*: in the calibration data on which the simulator was trained, the historical operator’s actions depended on $H$, so $A \not\perp H \mid S$ (read: $A$ is *not* conditionally independent of $H$ given $S$) in that data. Second, it carries *drift*: the conditional of $H$ given $S$ moved between calibration and deployment. Throughout, $\mathbb{P}_{t}(H \mid S)$ denotes the deployment-time conditional, and $\mathbb{P}_{\mathrm{calib}}(H \mid S,A)$ denotes the calibration-time conditional. In causal-inference language, the simulator is calibrated to the observational $\mathbb{E}[R \mid S,A]$, while the planner cares about the interventional $\mathbb{E}[R \mid S,\mathrm{do}(A)]$. The two coincide only when $A \perp H \mid S$ in the calibration data.

*Three layers of approximation.* The simulator is an MDP on $S$. The world is a POMDP on $(S,H)$. To compare the two, we need an MDP on $S$ alone; we construct one by averaging the hidden state out at every step. This loses information (the true world has memory in $H$ that the construction discards), but the loss is bounded and decays geometrically when latent dynamics are contracting (Remark [1](https://arxiv.org/html/2605.21458#Thmremark1)). The gap between simulator and world factors through two such intermediate Markov objects on $S$, each obtained by averaging $H$ out against a different conditional. Let $\mathcal{M}^{\star}$ denote the truth. The *deployment-time Markov projection* $\mathcal{M}^{\star}_{\mathrm{obs}} = (\rho^{\star}_{\mathrm{obs}}, \wp^{\star}_{\mathrm{obs}})$ averages $H$ out at every step under the deployment-time conditional $\mathbb{P}_{t}(H \mid S)$, treating $H$ as if it were drawn fresh from this conditional independent of the past $(S,A)$ trajectory; the true POMDP marginal over $(S,A,R)$ is not Markov on $S$ alone (since the past trajectory carries information about the current $H$), so this projection is itself an approximation, with cost $\epsilon^{\mathrm{hist}}$ analyzed in Remark [1](https://arxiv.org/html/2605.21458#Thmremark1). The *calibration kernel* $\bar{\mathcal{M}}_{\mathrm{calib}} = (\bar{\rho}_{\mathrm{calib}}, \bar{\wp}_{\mathrm{calib}})$ averages $H$ out under the calibration-time conditional $\mathbb{P}_{\mathrm{calib}}(H \mid S,A)$; this is the limit of the simulator’s training procedure under infinite calibration data and a perfectly specified parametric family. The simulator $\hat{\mathcal{M}}_{\mathrm{sim}} = (\hat{\rho}_{\mathrm{sim}}, \hat{\wp}_{\mathrm{sim}})$ deviates from $\bar{\mathcal{M}}_{\mathrm{calib}}$ by finite-sample noise and parametric misspecification.

The four objects sit on a chain

$$\mathcal{M}^{\star} \;\xrightarrow{\;\epsilon^{\mathrm{hist}}\;}\; \mathcal{M}^{\star}_{\mathrm{obs}} \;\xrightarrow{\;\epsilon^{h}\;}\; \bar{\mathcal{M}}_{\mathrm{calib}} \;\xrightarrow{\;\epsilon^{m}\;}\; \hat{\mathcal{M}}_{\mathrm{sim}}, \tag{1}$$

with intermediate kernels

$$\rho^{\star}_{\mathrm{obs}}(s,a) := \mathbb{E}_{H \sim \mathbb{P}_{t}(\cdot \mid s)}\bigl[\bar{r}(s,a,H)\bigr], \qquad \wp^{\star}_{\mathrm{obs}}(\cdot \mid s,a) := \mathbb{E}_{H \sim \mathbb{P}_{t}(\cdot \mid s)}\bigl[\wp_{S}(\cdot \mid s,a,H)\bigr], \tag{2}$$

$$\bar{\rho}_{\mathrm{calib}}(s,a) := \mathbb{E}_{H \sim \mathbb{P}_{\mathrm{calib}}(\cdot \mid s,a)}\bigl[\bar{r}(s,a,H)\bigr], \qquad \bar{\wp}_{\mathrm{calib}}(\cdot \mid s,a) := \mathbb{E}_{H \sim \mathbb{P}_{\mathrm{calib}}(\cdot \mid s,a)}\bigl[\wp_{S}(\cdot \mid s,a,H)\bigr]. \tag{3}$$

The three arrows correspond to three distinct sources of error.

- $\epsilon^{\mathrm{hist}}$ (*Markov-projection error*): the cost of replacing the history-dependent $\mathbb{P}(H_{t} \mid \mathcal{H}_{t})$ with the time-$t$ Markov conditional $\mathbb{P}_{t}(H \mid S_{t})$. Reduced by latent-dynamics contraction (Remark [1](https://arxiv.org/html/2605.21458#Thmremark1)). Not reduced by any deployment-time intervention.
- $\epsilon^{h}$ (*calibration–deployment regime shift*): the cost of replacing $\mathbb{P}_{t}(H \mid S)$ with $\mathbb{P}_{\mathrm{calib}}(H \mid S,A)$. This is the *drift* term named earlier, that is, the gap between the deployment and calibration conditionals on $H$. Reduced by a randomized $S$-measurable pilot, which severs the calibration regime’s $H \to A$ edge (Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2), Section [4](https://arxiv.org/html/2605.21458#S4)). Not reduced by passive interaction.
- $\epsilon^{m}$ (*functional misspecification*): the cost of replacing $\bar{\mathcal{M}}_{\mathrm{calib}}$ with the simulator’s actual finite-sample, possibly mis-specified output. Irreducible by any further interaction with the real world.

Randomization can pay down $\epsilon^{h}$, cannot reach $\epsilon^{m}$, and we set $\epsilon^{\mathrm{hist}}$ aside on contraction grounds (Remark [1](https://arxiv.org/html/2605.21458#Thmremark1)). By the triangle inequality, $|\hat{\rho}_{\mathrm{sim}}(s,a) - \rho^{\star}_{\mathrm{obs}}(s,a)| \leq \epsilon_{r}^{h}(s,a) + \epsilon_{r}^{m}(s,a)$, with the analogous inequality for transitions; the per-component definitions of $\epsilon_{r}^{h}, \epsilon_{r}^{m}, \epsilon_{p}^{h}, \epsilon_{p}^{m}$ appear in Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1).

The hidden\-state distribution is taken to be stationary within an episode but may differ between episodes \(e\.g\., calibration vs\. deployment\)\.

*Assumptions.* Four assumptions support the analysis. Three are housekeeping (tabular state, observability, conjugacy); the substantive one is Assumption [3](https://arxiv.org/html/2605.21458#Thmassumption3) (calibration distribution), which is where the confounding bias enters.

###### Assumption 1 (Bounded rewards, finite tabular state).

$\mathbb{S}, \mathbb{A}, \mathbb{H}$ are finite. $R_{i,t} \in [0, R_{\max}]$ almost surely. $\gamma \in [0,1)$.

###### Assumption 2 (Observable-state policies).

The planner’s policy at time $t$ depends only on the observed history $\{S_{j,s}, A_{j,s}, R_{j,s}\}_{j,\, s \leq t}$. In particular, it does not condition on $H$.

###### Assumption 3 (Calibration distribution).

The simulator was trained from trajectories collected under a behavior policy $\pi^{\mathrm{beh}}(a \mid s,h)$ with $A \not\perp H \mid S$. This induces a calibration conditional $\mathbb{P}_{\mathrm{calib}}(H \mid S,A)$ that differs from the deployment conditional $\mathbb{P}_{t}(H \mid S)$, so the simulator implicitly learns $\mathbb{E}[R \mid S,A]$ rather than $\mathbb{E}[R \mid S,\mathrm{do}(A)]$ and inherits a confounding bias

$$\beta_{\mathrm{conf}}(s,a) \;=\; \sum_{h} \bar{r}(s,a,h)\bigl[\mathbb{P}_{\mathrm{calib}}(h \mid s,a) - \mathbb{P}_{t}(h \mid s)\bigr]$$

that does *not* shrink with more observational data of the same kind.
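As a numerical illustration of Assumption 3, the sketch below computes $\beta_{\mathrm{conf}}(s,a)$ for a single observed state with two actions and two hidden values. All numbers and the helper names `calibration_conditional` and `confounding_bias` are hypothetical and not taken from the paper. The calibration-time marginal of $H$ is set equal to the deployment-time one, an assumption made here so that any bias comes purely from the behavior policy's dependence on $H$.

```python
import numpy as np

def calibration_conditional(pi_beh, p_h):
    """P_calib(h | s, a) by Bayes' rule, for one fixed observed state s.

    pi_beh[a, h] is the behavior policy pi_beh(a | s, h); p_h[h] is P(h | s).
    """
    joint = pi_beh * p_h                              # P(a, h | s)
    return joint / joint.sum(axis=1, keepdims=True)   # normalize over h for each a

def confounding_bias(r_bar, p_calib, p_deploy):
    """beta_conf(s, a) = sum_h r_bar(s, a, h) * [P_calib(h | s, a) - P_t(h | s)]."""
    return (r_bar * (p_calib - p_deploy)).sum(axis=1)

# Hypothetical toy numbers (not from the paper).
r_bar = np.array([[0.2, 0.6],      # r_bar[a, h]: expected reward
                  [0.3, 0.9]])
p_deploy = np.array([0.7, 0.3])    # P_t(h | s)

# Confounded operator: plays a=1 mostly when h=1.
pi_confounded = np.array([[0.8, 0.1],
                          [0.2, 0.9]])
# Randomized operator: the action ignores h.
pi_randomized = np.full((2, 2), 0.5)

beta_confounded = confounding_bias(
    r_bar, calibration_conditional(pi_confounded, p_deploy), p_deploy)
beta_randomized = confounding_bias(
    r_bar, calibration_conditional(pi_randomized, p_deploy), p_deploy)

print("confounded:", beta_confounded)
print("randomized:", beta_randomized)
```

In this toy setting the bias vanishes when the action is chosen independently of $H$ and is nonzero otherwise, which matches the statement that the observational and interventional means coincide only when $A \perp H \mid S$.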

###### Assumption 4 (Conjugate independent priors).

The planner’s prior on the parameters of $\mathcal{M}^{\star}_{\mathrm{obs}}$ factorizes across $(s,a)$: a Gaussian prior on the reward parameter $\theta_{s,a}$ with variance ${\sigma^{(0)}_{s,a}}^{2}$, and a Dirichlet prior on the transition $\wp^{\star}_{\mathrm{obs}}(\cdot \mid s,a)$ with total concentration $\alpha^{(0)}_{s,a} = \sum_{s'} \alpha^{(0)}_{s,a,s'}$.

Assumptions [1](https://arxiv.org/html/2605.21458#Thmassumption1) and [2](https://arxiv.org/html/2605.21458#Thmassumption2) are standard tabular and observability constraints. The function-approximation extension is left to future work (Section [6](https://arxiv.org/html/2605.21458#S6)). Assumption [3](https://arxiv.org/html/2605.21458#Thmassumption3) is the substantive one: it makes simulator error *structural*, in the sense that no amount of further observational data of the same kind shrinks $\beta_{\mathrm{conf}}(s,a)$. Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) gives closed-form posteriors and the block-diagonal Fisher form used in Section [4](https://arxiv.org/html/2605.21458#S4). The hierarchical extension is in Appendix [C.18](https://arxiv.org/html/2605.21458#A3.SS18).
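The closed-form posteriors that Assumption 4 provides can be sketched for a single $(s,a)$ pair as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the Gaussian update assumes a known observation-noise variance `noise_var`, which the text above does not specify.

```python
import numpy as np

def gaussian_update(mu0, var0, rewards, noise_var):
    """Posterior mean and variance of theta_{s,a} under a Gaussian prior N(mu0, var0).

    Assumes rewards are i.i.d. N(theta, noise_var) with noise_var known
    (an assumption of this sketch, not a statement from the paper).
    """
    rewards = np.asarray(rewards, dtype=float)
    precision = 1.0 / var0 + rewards.size / noise_var
    mean = (mu0 / var0 + rewards.sum() / noise_var) / precision
    return mean, 1.0 / precision

def dirichlet_update(alpha0, next_states):
    """Posterior Dirichlet concentrations: prior counts plus observed next-state counts."""
    alpha0 = np.asarray(alpha0, dtype=float)
    counts = np.bincount(np.asarray(next_states, dtype=int), minlength=alpha0.size)
    return alpha0 + counts

# Hypothetical data for one (s, a) pair.
post_mean, post_var = gaussian_update(mu0=0.0, var0=1.0, rewards=[1.0, 1.0], noise_var=1.0)
post_alpha = dirichlet_update(alpha0=[1.0, 1.0, 1.0], next_states=[0, 2, 2])

print(post_mean, post_var)   # posterior moments for the reward parameter
print(post_alpha)            # posterior concentrations over next states
```

A pair that is never visited contributes no rewards and no next-state counts, so both functions return the prior unchanged. That is the mechanical reason passive updating cannot move beliefs outside the visitation support.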

*Value function and Bayes objective.* For a policy $\pi$ and an MDP $\mathcal{M} = (\rho, \wp)$ on $\mathbb{S}$, write

$$V^{\pi}(\mathcal{M}) \;:=\; \mathbb{E}^{\pi,\mathcal{M}}\!\left[\sum_{t=0}^{T} \gamma^{t} \sum_{i} R_{i,t}\right]$$

for the discounted population value of running $\pi$ on $\mathcal{M}$. We measure policy quality by the Bayes risk of $V^{\pi}(\mathcal{M}^{\star})$ over the prior $\mathcal{P}$ of Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4):

$$W(\pi) \;:=\; \mathbb{E}_{\mathcal{M}^{\star} \sim \mathcal{P}}\!\left[V^{\pi}(\mathcal{M}^{\star})\right] \;=\; \mathbb{E}_{\mathcal{M}^{\star} \sim \mathcal{P}}\!\left[\, \mathbb{E}^{\pi,\mathcal{M}^{\star}}\!\left[\sum_{t=0}^{T} \gamma^{t} \sum_{i} R_{i,t}\right]\right]. \tag{4}$$

The inner expectation is over the policy’s interaction with a single drawn world. The outer expectation averages over draws of the unknown world from the prior, which gives a single objective comparable across A-SOP and SEP. This is the standard Bayes-adaptive objective. Under Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) it reduces to the closed-form posterior recursion of posterior-sampling-style analyses (Osband et al., [2013](https://arxiv.org/html/2605.21458#bib.bib91); Russo et al., [2018](https://arxiv.org/html/2605.21458#bib.bib55)).

*Policy classes.* Let $\mathcal{F}_{t}$ denote the observed-history filtration, the sigma-algebra generated by $\{(S_{j,s}, A_{j,s}, R_{j,s}) : j \leq n,\, s \leq t\}$, and let $b_{t}$ denote the conjugate posterior over $\mathcal{M}^{\star}_{\mathrm{obs}}$ started from the prior in Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) and updated from $\mathcal{F}_{t}$. Let $d_{\pi}(s) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{P}_{\pi}(S_{t} = s)$ denote the discounted state-visitation distribution under $\pi$.
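For a tabular MDP on the observed state, $d_{\pi}$ can be computed exactly by solving the linear system $(I - \gamma P_{\pi}^{\top})\, d_{\pi} = (1-\gamma)\,\mu_{0}$, where $P_{\pi}$ is the state-to-state kernel induced by $\pi$ and $\mu_{0}$ is the initial-state distribution. The sketch below is illustrative only: the three-state chain, the initial distribution `mu0`, and the function name are hypothetical and not from the paper.

```python
import numpy as np

def discounted_visitation(P, pi, mu0, gamma):
    """d_pi(s) = (1 - gamma) * sum_t gamma^t P_pi(S_t = s), via a linear solve.

    P[s, a, s2] is the transition kernel, pi[s, a] the policy, mu0[s] the initial law.
    """
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    n_states = P_pi.shape[0]
    return (1.0 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu0)

# Hypothetical chain: states 0 and 1 alternate under action 0;
# only action 1 taken in state 1 reaches the absorbing state 2.
P = np.zeros((3, 2, 3))
P[0, :, 1] = 1.0
P[1, 0, 0] = 1.0
P[1, 1, 2] = 1.0
P[2, :, 2] = 1.0

mu0 = np.array([1.0, 0.0, 0.0])
gamma = 0.9

pi_deployed = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])   # never plays action 1
pi_explore = np.array([[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]])    # randomizes in state 1

d_deployed = discounted_visitation(P, pi_deployed, mu0, gamma)
d_explore = discounted_visitation(P, pi_explore, mu0, gamma)

print("deployed:", d_deployed)
print("explore: ", d_explore)
```

Under the deployed policy, state 2 receives zero visitation mass, so it lies outside the visitation support. The randomizing policy places positive mass on it.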

###### Definition 1 (Policy classes).

1. (i) *Non-adaptive* ($\Pi_{\mathrm{na}}$): $\pi_{t}(\cdot \mid s)$ depends only on the current observed state and is fixed before any data is collected.
2. (ii) *Passive-learning* ($\Pi_{\mathrm{passive}}$): stochastic policies $\pi_{t}(\cdot \mid S_{t}, b_{t})$ measurable with respect to the observed state and the current Bayesian posterior $b_{t}$, with $b_{t}$ updated from $\mathcal{F}_{t}$ via the conjugate rules of Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4). (Footnote 1: Non-adaptive policies are the constant-belief subclass of passive-learning policies, so $\Pi_{\mathrm{na}} \subseteq \Pi_{\mathrm{passive}}$ holds by definition. The strict inclusion $\Pi_{\mathrm{na}} \subsetneq \Pi_{\mathrm{passive}}$ is witnessed by any policy with non-degenerate belief updates.)
3. (iii) *Adaptive* ($\Pi_{\mathrm{adapt}}$): the full class of history-dependent policies $\pi_{t}(\cdot \mid \mathcal{F}_{t})$, including those that deliberately deviate from the posterior optimum to gather information.

The three classes correspond to three uses of observed data. The non-adaptive class ignores it. The passive class uses it to update beliefs but not to direct exploration. The adaptive class can use it for both. Whether to run a real-world experiment is the question of whether to leave $\Pi_{\mathrm{passive}}$ for $\Pi_{\mathrm{adapt}}$, and at what cost. We define one named policy in each class.

###### Definition 2 (Simulator-optimal policy, SOP).

$\pi^{\star}_{\mathrm{sim}} := \operatorname*{arg\,max}_{\pi \in \Pi_{\mathrm{na}}} V^{\pi}(\hat{\mathcal{M}}_{\mathrm{sim}})$. The SOP is trained on the simulator and deployed without updates from real-world data.

###### Definition 3 (Adaptive simulator-optimal policy, A-SOP).

$\pi^{a}_{\mathrm{sim}} := \operatorname*{arg\,max}_{\pi \in \Pi_{\mathrm{passive}}} W(\pi)$, restricted to the *posterior-mean-optimal* subclass: at each $t$, the A-SOP plays the action $a$ that maximizes the expected return under the posterior-mean MDP, $\pi_{t}(s) = \operatorname*{arg\,max}_{a} \mathbb{E}_{\mathcal{M} \sim b_{t}}[Q^{\mathcal{M}}(s,a)]$, where $Q^{\mathcal{M}}(s,a)$ is the state-action value under $\mathcal{M}$. The A-SOP takes the simulator as a Bayesian prior and updates from observed data without ever deliberately deviating from the posterior-mean-optimal action. Thompson sampling is the posterior-sampling instance of $\Pi_{\mathrm{passive}}$. Neither the A-SOP nor Thompson sampling deliberately deviates from the posterior; they differ in whether they sample from it or collapse to its mean.
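One way to make the A-SOP's action rule concrete is to plan greedily in the posterior-mean MDP. The sketch below reads the definition in that way, which is an assumption: the expectation of $Q^{\mathcal{M}}$ under $b_{t}$ and the $Q$-function of the posterior-mean MDP need not coincide in general. The function name, the infinite-horizon value iteration, and the toy inputs are all hypothetical.

```python
import numpy as np

def greedy_on_posterior_mean(reward_mean, alpha, gamma, n_iter=500):
    """Greedy actions in the posterior-mean MDP, computed by value iteration.

    reward_mean[s, a]: posterior mean reward; alpha[s, a, s2]: Dirichlet concentrations.
    Returns (greedy action per state, Q-table of the posterior-mean MDP).
    """
    P_mean = alpha / alpha.sum(axis=-1, keepdims=True)   # Dirichlet posterior mean
    Q = np.zeros_like(reward_mean, dtype=float)
    for _ in range(n_iter):
        V = Q.max(axis=1)                       # greedy state values
        Q = reward_mean + gamma * (P_mean @ V)  # Bellman optimality backup
    return Q.argmax(axis=1), Q

# Hypothetical posterior: action 1 has the higher mean reward in both states,
# and transition beliefs are uniform.
reward_mean = np.array([[0.0, 1.0],
                        [0.0, 1.0]])
alpha = np.ones((2, 2, 2))

actions, Q = greedy_on_posterior_mean(reward_mean, alpha, gamma=0.9)
print(actions)
print(Q)
```

In this toy posterior the greedy rule never selects action 0. A policy that always follows it therefore collects no data on action 0, and its posterior for that action stays where the prior put it.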

###### Definition 4 (Simulation-aided experimental policy, SEP).

A SEP is a policy in $\Pi_{\mathrm{adapt}}$ specified by a triple $(\mathcal{D}, \pi^{\mathrm{explore}}, \pi^{\mathrm{exploit}})$: a design criterion $\mathcal{D}$ that scores exploration policies using $\hat{\mathcal{M}}_{\mathrm{sim}}$, an explorer $\pi^{\mathrm{explore}}$ that optimizes $\mathcal{D}$, and an exploiter $\pi^{\mathrm{exploit}}_{t}$ that is posterior-mean-optimal under $b_{t}$. A SEP allocates a fraction of decisions to $\pi^{\mathrm{explore}}$ and the rest to $\pi^{\mathrm{exploit}}_{t}$, choosing the fraction and the explorer’s distribution by minimizing $\mathcal{D}$. Section [4](https://arxiv.org/html/2605.21458#S4) fills in a concrete $\mathcal{D}$.

The SOP’s visitation distribution does not adapt\. The A\-SOP’s visitation distribution stays close to the SOP’s, since the A\-SOP’s exploit action is posterior\-mean\-optimal and the posterior is anchored at the simulator\. The SEP’s visitation distribution can move outside the SOP’s support by spending pilot budget on the explorer\.

## 3 Value gap decomposition

A simulator’s value error has two distinct components. The calibration data may have been collected under one regime while deployment lives in another, producing a regime shift that randomization can identify. Alternatively, the simulator’s parametric family may not include the true kernel, producing a misspecification residual that no further interaction can remove. The two components look the same in aggregate but respond differently to a pilot. This section gives three results that separate them. Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1) extends the simulation lemma to expose the calibration–deployment shift and the parametric residual as separate terms. Proposition [1](https://arxiv.org/html/2605.21458#Thmproposition1) decomposes the value gap into a local part (on visited states) and a reachability part (on unvisited states). The reachability part is non-vacuous: a combination-lock construction in Appendix [E](https://arxiv.org/html/2605.21458#A5) exhibits an MDP on which it stays $\Omega(R_{\max})$ at any horizon. Proofs are in Appendix [A](https://arxiv.org/html/2605.21458#A1).

### 3.1 Simulation lemma

The classical simulation lemma (Kearns and Singh, [2002](https://arxiv.org/html/2605.21458#bib.bib114)) bounds the value loss when a planner deploys under the wrong MDP: if the model has per-pair reward error $\epsilon_{r}(s,a)$ (absolute difference in mean reward) and per-pair transition error $\epsilon_{p}(s,a)$ (total-variation distance in next-state distribution), then for any policy $\pi$, the values under model and truth differ by a weighted sum scaling linearly in the effective horizon $T_{\mathrm{eff}}=1/(1-\gamma)$ for reward and quadratically ($\gamma R_{\max}T_{\mathrm{eff}}^{2}$) for transitions. The classical bound treats $\epsilon_{r}$ and $\epsilon_{p}$ as monolithic. Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1) splits each into a calibration–deployment piece $\epsilon^{h}$, generated by the gap between $\mathbb{P}_{\mathrm{calib}}(H\mid S,A)$ and $\mathbb{P}_{t}(H\mid S)$ (Section [2](https://arxiv.org/html/2605.21458#S2)), and a misspecification residual $\epsilon^{m}$ that contains the remaining error.

###### Lemma 1 (Simulation lemma, structural decomposition).

Under Assumptions [1](https://arxiv.org/html/2605.21458#Thmassumption1)–[3](https://arxiv.org/html/2605.21458#Thmassumption3), let $\rho^{\star}_{\mathrm{obs}},\wp^{\star}_{\mathrm{obs}},\bar{\rho}_{\mathrm{calib}},\bar{\wp}_{\mathrm{calib}}$ be the kernels defined in Section [2](https://arxiv.org/html/2605.21458#S2). Define

$$\epsilon_{r}(s,a):=|\hat{\rho}_{\mathrm{sim}}(s,a)-\rho^{\star}_{\mathrm{obs}}(s,a)|,\qquad\epsilon_{p}(s,a):=\|\hat{\wp}_{\mathrm{sim}}(\cdot\mid s,a)-\wp^{\star}_{\mathrm{obs}}(\cdot\mid s,a)\|_{1}.$$

By the triangle inequality through $\bar{\rho}_{\mathrm{calib}}$ (resp. $\bar{\wp}_{\mathrm{calib}}$), each splits as $\epsilon_{r}(s,a)\leq\epsilon_{r}^{h}(s,a)+\epsilon_{r}^{m}(s,a)$ and $\epsilon_{p}(s,a)\leq\epsilon_{p}^{h}(s,a)+\epsilon_{p}^{m}(s,a)$, where the *calibration–deployment* pieces are

$$\epsilon_{r}^{h}(s,a):=|\rho^{\star}_{\mathrm{obs}}(s,a)-\bar{\rho}_{\mathrm{calib}}(s,a)|,\qquad\epsilon_{p}^{h}(s,a):=\|\wp^{\star}_{\mathrm{obs}}(\cdot\mid s,a)-\bar{\wp}_{\mathrm{calib}}(\cdot\mid s,a)\|_{1},$$

and the *misspecification residuals* are

$$\epsilon_{r}^{m}(s,a):=|\hat{\rho}_{\mathrm{sim}}(s,a)-\bar{\rho}_{\mathrm{calib}}(s,a)|,\qquad\epsilon_{p}^{m}(s,a):=\|\hat{\wp}_{\mathrm{sim}}(\cdot\mid s,a)-\bar{\wp}_{\mathrm{calib}}(\cdot\mid s,a)\|_{1}.$$

Writing $\epsilon_{r}:=\max_{s,a}\epsilon_{r}(s,a)$ and $\epsilon_{p}:=\max_{s,a}\epsilon_{p}(s,a)$, for any policy $\pi$ depending on observed states only,

$$|V^{\pi}(\mathcal{M}^{\star}_{\mathrm{obs}})-V^{\pi}(\hat{\mathcal{M}}_{\mathrm{sim}})|\leq\frac{2}{1-\gamma}(\epsilon_{r}^{h}+\epsilon_{r}^{m})+\frac{2\gamma R_{\max}}{(1-\gamma)^{2}}(\epsilon_{p}^{h}+\epsilon_{p}^{m}).\tag{5}$$

The decision to experiment depends on the ratio $\epsilon^{h}/\epsilon^{m}$, not on the absolute size of the simulator’s error. When $\epsilon^{m}$ dominates, no pilot of any length reduces the value error. When $\epsilon^{h}$ dominates, randomization is informative. A randomized $S$-measurable pilot identifies $\epsilon^{h}$ at any pair it covers, since randomizing actions independently of $H$ severs the calibration regime’s $H\to A$ edge and recovers $\mathcal{M}^{\star}_{\mathrm{obs}}$ rather than $\bar{\mathcal{M}}_{\mathrm{calib}}$. The misspecification residual $\epsilon^{m}$ persists at every pair, since it lies in the simulator’s parametric family rather than in the data.
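To make the split in (5) concrete, the following minimal sketch evaluates the right-hand side of the bound and reports the part attributable to the calibration–deployment shift separately from the part attributable to misspecification. The function name `simulation_lemma_bound` and all numeric inputs are hypothetical, chosen only for illustration; the error terms are taken to be the sup-norm quantities of Lemma 1.

```python
def simulation_lemma_bound(eps_r_h, eps_r_m, eps_p_h, eps_p_m, gamma, r_max):
    """Right-hand side of (5), split into (shift part, residual part).

    All four error inputs are assumed to be sup-norm values over (s, a).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    reward_coef = 2.0 / (1.0 - gamma)                     # linear in T_eff
    trans_coef = 2.0 * gamma * r_max / (1.0 - gamma) ** 2  # quadratic in T_eff
    shift = reward_coef * eps_r_h + trans_coef * eps_p_h
    residual = reward_coef * eps_r_m + trans_coef * eps_p_m
    return shift, residual


# Illustrative numbers only: the shift terms are five times the residual terms.
shift, residual = simulation_lemma_bound(
    eps_r_h=0.10, eps_r_m=0.02, eps_p_h=0.05, eps_p_m=0.01, gamma=0.9, r_max=1.0
)
print(f"shift part: {shift:.2f}, residual part: {residual:.2f}")
```

With these inputs most of the bound comes from the shift part, which is the regime in which a randomized pilot can help; swapping the magnitudes of the $h$ and $m$ inputs gives the regime in which it cannot.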

###### Corollary 1 (Post-pilot finite-sample form).

Let $\mathcal{I}\subseteq\mathbb{S}\times\mathbb{A}$ be the pilot-covered pairs with $n_{s,a}$ samples at $(s,a)\in\mathcal{I}$ gathered under an $S$-measurable randomized pilot. Then there exist absolute constants $c_{r},c_{p}>0$ such that, for any policy $\pi$ depending on observed states only, with probability at least $1-\delta$,

$$\begin{aligned}|V^{\pi}(\mathcal{M}^{\star}_{\mathrm{obs}})-V^{\pi}(\hat{\mathcal{M}}_{\mathrm{sim}})|\leq{}&\frac{2}{1-\gamma}\Bigl[\max_{(s,a)\in\mathcal{I}}c_{r}\sqrt{\tfrac{\log(1/\delta)}{n_{s,a}}}+\max_{(s,a)\notin\mathcal{I}}\epsilon_{r}^{h}(s,a)+\epsilon_{r}^{m}\Bigr]\\&+\frac{2\gamma R_{\max}}{(1-\gamma)^{2}}\Bigl[\max_{(s,a)\in\mathcal{I}}c_{p}\sqrt{\tfrac{|\mathbb{S}|\log(1/\delta)}{n_{s,a}}}+\max_{(s,a)\notin\mathcal{I}}\epsilon_{p}^{h}(s,a)+\epsilon_{p}^{m}\Bigr].\end{aligned}$$

The corollary is the planner-facing form. At pilot-covered pairs, the calibration–deployment sup norm is replaced by a Hoeffding-style $O(1/\sqrt{n_{s,a}})$ rate for rewards and a Weissman-style $O(\sqrt{|\mathbb{S}|/n_{s,a}})$ rate (Weissman et al., [2003](https://arxiv.org/html/2605.21458#bib.bib167)) for transitions. At uncovered pairs the original sup-norm bound is retained, and $\epsilon^{m}$ persists everywhere. The proof is Azuma–Hoeffding at each covered pair combined with ([5](https://arxiv.org/html/2605.21458#S3.E5)); see Appendix [A](https://arxiv.org/html/2605.21458#A1). The constants in ([5](https://arxiv.org/html/2605.21458#S3.E5)) are those of Kearns and Singh ([2002](https://arxiv.org/html/2605.21458#bib.bib114)). Lobel and Parr ([2024](https://arxiv.org/html/2605.21458#bib.bib144)) give a tighter constant in policy-dependent settings, which does not affect the per-pair Fisher-SEP ranking in Section [4](https://arxiv.org/html/2605.21458#S4) because that ranking is invariant to a common multiplicative factor.
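The post-pilot bound can be evaluated directly once pilot counts are known. The sketch below is one way to do so under stated assumptions: the constants $c_{r}$ and $c_{p}$ are not specified in the text and are set to 1 as placeholders, the per-pair shift values are hypothetical inputs, and `post_pilot_bound` is an illustrative name rather than anything defined in the paper.

```python
import math


def post_pilot_bound(counts, eps_r_h, eps_p_h, eps_r_m, eps_p_m,
                     n_states, gamma, r_max, delta=0.05, c_r=1.0, c_p=1.0):
    """Right-hand side of the post-pilot bound in Corollary 1.

    counts:  dict (s, a) -> pilot sample count; pairs with count 0 or absent
             are treated as uncovered.
    eps_r_h, eps_p_h: dict (s, a) -> calibration-deployment shift at that pair.
    c_r, c_p: placeholder constants (assumption; the source leaves them abstract).
    """
    covered = {k for k, n in counts.items() if n > 0}
    log_term = math.log(1.0 / delta)
    # Statistical error at covered pairs (Hoeffding / Weissman style rates).
    stat_r = max((c_r * math.sqrt(log_term / counts[k]) for k in covered), default=0.0)
    stat_p = max((c_p * math.sqrt(n_states * log_term / counts[k]) for k in covered),
                 default=0.0)
    # Unreduced shift at uncovered pairs.
    unc_r = max((v for k, v in eps_r_h.items() if k not in covered), default=0.0)
    unc_p = max((v for k, v in eps_p_h.items() if k not in covered), default=0.0)
    reward = 2.0 / (1.0 - gamma) * (stat_r + unc_r + eps_r_m)
    trans = 2.0 * gamma * r_max / (1.0 - gamma) ** 2 * (stat_p + unc_p + eps_p_m)
    return reward + trans


shift_r = {(0, 0): 0.10, (0, 1): 0.20}
shift_p = {(0, 0): 0.05, (0, 1): 0.05}
common = dict(eps_r_h=shift_r, eps_p_h=shift_p, eps_r_m=0.02, eps_p_m=0.01,
              n_states=2, gamma=0.9, r_max=1.0)
short_pilot = post_pilot_bound({(0, 0): 100}, **common)
long_pilot = post_pilot_bound({(0, 0): 10_000}, **common)
print(short_pilot, long_pilot)
```

Lengthening the pilot at the covered pair shrinks only the statistical terms; the shift at the uncovered pair `(0, 1)` and the residual terms are untouched, which is the behaviour the corollary describes.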

The decomposition has two implications. Before any pilot is run, the structural split bounds the value gain achievable from experimentation. When $\epsilon^{m}$ dominates, the bound is small and a pilot is not warranted on value-of-experimentation grounds. The simulator’s stated prior parameters $\sigma^{(0)}$ and $\alpha^{(0)}$ in Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) bound the planner’s posterior uncertainty, but they do not estimate $\epsilon_{r}$ or $\epsilon_{p}$ themselves. After a short pilot, the residual ratio $\hat{\epsilon}_{p}/\hat{\epsilon}_{r}$ at covered pairs is estimable from post-pilot residuals (Appendix [B](https://arxiv.org/html/2605.21458#A2)). This ratio determines whether the local or reachability regime applies.

### 3.2 Local and reachability gaps

The three policy classes from Section [2](https://arxiv.org/html/2605.21458#S2) carry the next result: $\Pi_{\mathrm{na}}$ (the SOP), $\Pi_{\mathrm{passive}}$ (the A-SOP), $\Pi_{\mathrm{adapt}}$ (any SEP). We call $\mathrm{supp}(d_{\pi})$ the *visitation support* of $\pi$, the set of states $\pi$ visits with positive discounted probability; this is the formal name for what the introduction’s contributions block called the set of states the deployed policy visits. The proposition says two things: $W(\cdot)$ is monotone across the three classes, and the design gap $\mathcal{G}$ (the value attainable by introducing experimental design, relative to the SOP) splits cleanly by visitation support.

###### Proposition 1 (Dominance chain and gap decomposition).

Under Assumptions [1](https://arxiv.org/html/2605.21458#Thmassumption1)–[4](https://arxiv.org/html/2605.21458#Thmassumption4),

$$\sup_{\pi\in\Pi_{\mathrm{na}}}W(\pi)\leq\sup_{\pi\in\Pi_{\mathrm{passive}}}W(\pi)\leq\sup_{\pi\in\Pi_{\mathrm{adapt}}}W(\pi).$$

Let $\Pi_{\mathrm{SEP}}\subset\Pi_{\mathrm{adapt}}$ be the SEP class of Definition [4](https://arxiv.org/html/2605.21458#Thmdefinition4). Write $W^{\star}_{\mathrm{na}}=W(\pi^{\star}_{\mathrm{sim}})$ and $W^{\star}_{\mathrm{SEP}}=\sup_{\Pi_{\mathrm{SEP}}}W$, and let $\Pi^{\mathrm{loc}}_{\mathrm{SEP}}:=\{\pi\in\Pi_{\mathrm{SEP}}:\mathrm{supp}(d_{\pi})\subseteq\mathrm{supp}(d_{\pi^{\star}_{\mathrm{sim}}})\}$ be the SEPs whose visitation support is contained in the SOP’s. The design gap $\mathcal{G}:=W^{\star}_{\mathrm{SEP}}-W^{\star}_{\mathrm{na}}\geq 0$ decomposes as $\mathcal{G}=\mathcal{G}_{\mathrm{local}}+\mathcal{G}_{\mathrm{reach}}$, with

$$\mathcal{G}_{\mathrm{local}}:=\sup_{\pi\in\Pi^{\mathrm{loc}}_{\mathrm{SEP}}}W(\pi)-W^{\star}_{\mathrm{na}}\;\geq\;0,\qquad\mathcal{G}_{\mathrm{reach}}:=\mathcal{G}-\mathcal{G}_{\mathrm{local}}\;\geq\;0.\tag{6}$$

$\mathcal{G}_{\mathrm{local}}$ is the value attainable by an explorer whose visitation support is contained in the SOP’s, by choosing different actions at states the SOP already visits. The A-SOP closes this term asymptotically: at visited states, online data shrinks the posterior, and once the posterior identifies the correct action as optimal, the A-SOP selects it. $\mathcal{G}_{\mathrm{reach}}$ is the value attainable only by visiting states outside the SOP’s visitation support. It is invisible to passive learning, since evidence at an unvisited $(s,a)$ is generated only by an action with zero probability under the posterior-mean-optimal policy.

The decomposition mirrors the within- versus out-of-positivity distinction in causal inference. The local gap is the *within-positivity* case: the deployed policy already produces evidence at every relevant $(s,a)$, so the inference problem is one of waiting for enough samples. The reachability gap is the *out-of-positivity* case: the relevant $(s,a)$ pairs lie outside the deployed policy’s support and produce *no* evidence under it, so no amount of waiting helps. Recall that *positivity* requires every action to have nonzero probability under the data-generating mechanism at every covariate value (Pearl, [2009](https://arxiv.org/html/2605.21458#bib.bib140)); the analog here is that every state has positive discounted visitation under the deployed policy. Any policy that closes the reachability gap must deliberately visit states outside the deployed policy’s visitation support (Parikh et al., [2024b](https://arxiv.org/html/2605.21458#bib.bib163), [2025b](https://arxiv.org/html/2605.21458#bib.bib164)). The simulator’s prior provides the extrapolation outside the SOP’s support, and Section [4](https://arxiv.org/html/2605.21458#S4) uses that prior to choose where the extrapolation is informative enough to act on. The inequalities in ([6](https://arxiv.org/html/2605.21458#S3.E6)) are weak in general; the reachability-separation construction in Appendix [E](https://arxiv.org/html/2605.21458#A5) (Theorem [6](https://arxiv.org/html/2605.21458#Thmtheorem6)) shows that on a concrete MDP the reachability part is bounded away from zero, summarized below.

The reachability term $\mathcal{G}_{\mathrm{reach}}$ is non-vacuous. Appendix [E](https://arxiv.org/html/2605.21458#A5) (Theorem [6](https://arxiv.org/html/2605.21458#Thmtheorem6)) constructs a deterministic combination-lock MDP (a chain in which the simulator gets every transition right except at the terminal state, where it underestimates the reward by a multiplicative factor $\eta<1$) on which $\mathcal{G}_{\mathrm{reach}}\geq(1-\eta)R_{\max}$ for any $T_{\mathrm{eff}}\geq k$, with $k$ the chain length. The bound is independent of horizon: when the simulator’s miscalibration is concentrated on states the deployed policy never reaches, the gap is not closed by additional online interaction at any horizon. A directed explorer that allocates $\Omega(k)$ exploratory steps to the terminal state recovers the gap up to $o(1)$. A class-level extension to stochastic transitions is conjectured (Conjecture [2](https://arxiv.org/html/2605.21458#Thmconjecture2), Appendix [E.7](https://arxiv.org/html/2605.21458#A5.SS7)); it is established for posterior-mean-optimal policies, bounded-temperature softmax, polynomial-shrinkage upper-confidence-bound (UCB) policies, and Thompson sampling under Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4), with the unconditional case open.
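The mechanism can be illustrated on a toy chain. The sketch below is not the Appendix E construction: it is a hypothetical deterministic chain, written only to show why a planner that acts greedily on the simulator never generates evidence at the state where the simulator is wrong. The `stay` action with reward `R_STAY`, and all numeric values, are assumptions introduced for the illustration.

```python
GAMMA, K, R_MAX, ETA, R_STAY = 0.9, 5, 1.0, 0.2, 0.3


def plan(r_terminal, iters=2000):
    """Value iteration on a K-state deterministic chain.

    Action 0 (stay): reward R_STAY, remain in place.
    Action 1 (advance): reward 0, move to the next state.
    State K-1 is absorbing and pays r_terminal at every step.
    """
    v = [0.0] * K
    for _ in range(iters):
        new_v = v[:]
        new_v[K - 1] = r_terminal + GAMMA * v[K - 1]
        for s in range(K - 1):
            new_v[s] = max(R_STAY + GAMMA * v[s], GAMMA * v[s + 1])
        v = new_v
    policy = [1 if GAMMA * v[s + 1] > R_STAY + GAMMA * v[s] else 0
              for s in range(K - 1)]
    return policy, v


def visited_states(policy, start=0, steps=50):
    """States reached when following `policy` from `start`."""
    s, seen = start, {start}
    for _ in range(steps):
        if s < K - 1 and policy[s] == 1:
            s += 1
        seen.add(s)
    return seen


true_policy, v_true = plan(R_MAX)        # planner with the true terminal reward
sim_policy, _ = plan(ETA * R_MAX)        # planner with the underestimated reward
gap = v_true[0] - R_STAY / (1.0 - GAMMA)  # true optimum minus true value of staying
print(visited_states(sim_policy), visited_states(true_policy), round(gap, 3))
```

In this toy, the simulator-greedy policy stays at the start state forever, so no observation of the terminal reward is ever collected and the underestimate cannot be corrected by waiting, while the policy planned on the true reward walks the whole chain.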

Section [4](https://arxiv.org/html/2605.21458#S4) turns to the design question: which experimental policy minimizes posterior uncertainty about the target policy’s value, given a fixed pilot budget?

## 4 Fisher-information design

Section [3](https://arxiv.org/html/2605.21458#S3) shows that the reachability gap closes only when the simulator is used to choose where to experiment. This section specifies a Bayesian design criterion $\mathcal{D}$ for the SEP class of Definition [4](https://arxiv.org/html/2605.21458#Thmdefinition4): a posterior predictive variance of a target policy’s value. We name the SEP that minimizes this criterion *Fisher-SEP* (Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5)), and identify two natural specializations.

### 4.1 Posterior predictive value variance

The criterion to minimize is the posterior variance of a chosen target policy’s value, rather than the variance of every parameter. Designs that target overall parameter variance, such as A-optimal design (which minimizes the trace of the inverse Fisher of all parameters and so weights every parameter equally) (Chaloner and Verdinelli, [1995](https://arxiv.org/html/2605.21458#bib.bib3); Pukelsheim, [2006](https://arxiv.org/html/2605.21458#bib.bib136)), allocate pilot budget to parameters whose perturbations do not change the deployed value. Regret-minimization explorers such as $\epsilon$-greedy (which explores by random action) treat all unknowns as exchangeable and do not distinguish local from reachability errors. We propose a target-policy-aware criterion: weight each parameter by how much it would change the deployed policy’s value if perturbed. Parameters that do not affect a target policy’s value contribute nothing to the target-policy posterior variance, so allocating pilot effort to them is wasteful. The value gradient $\nabla_{\theta}V^{\pi^{\mathrm{tgt}}}$ summarizes which parameters affect the value of the target policy $\pi^{\mathrm{tgt}}$, and its squared entries give the rate at which a parameter error at $(s^{\prime},a^{\prime})$ propagates into a value error.

We use the label *posterior predictive value variance* (PVV) because the quantity measures prediction error on $V^{\pi^{\mathrm{tgt}}}$. Strictly, it is the posterior variance of the target-policy value functional, not a predictive variance over observables. Recall that the *delta method* approximates the posterior variance of a smooth functional $V(\theta)$ as $(\nabla_{\theta}V)^{\top}\mathrm{Cov}(\theta)\,\nabla_{\theta}V$, valid when the posterior on $\theta$ is approximately Gaussian; under that approximation, the posterior variance of $V^{\pi^{\mathrm{tgt}}}$ and a predictive variance over observables agree (Theorem [3](https://arxiv.org/html/2605.21458#Thmtheorem3), Appendix [A](https://arxiv.org/html/2605.21458#A1)).
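The delta-method approximation can be checked numerically on a small example. The sketch below uses a hypothetical two-parameter functional $V(\theta)=\theta_{0}/(1-\gamma\theta_{1})$ (the discounted value of a state with mean reward $\theta_{0}$ and self-transition probability $\theta_{1}$, with value zero elsewhere) and an independent Gaussian posterior; neither the functional nor the numbers come from the paper.

```python
import random
import statistics

GAMMA = 0.9
MEAN = (1.0, 0.5)   # posterior means of (reward, self-transition probability)
STD = (0.02, 0.01)  # posterior standard deviations, assumed independent


def value(theta0, theta1):
    """Toy value functional V(theta) = theta0 / (1 - GAMMA * theta1)."""
    return theta0 / (1.0 - GAMMA * theta1)


def value_gradient(theta0, theta1):
    denom = 1.0 - GAMMA * theta1
    return (1.0 / denom, GAMMA * theta0 / denom ** 2)


# Delta method: grad^T Cov grad, with a diagonal covariance.
grad = value_gradient(*MEAN)
delta_var = sum(g * g * s * s for g, s in zip(grad, STD))

# Monte Carlo reference: sample theta from the posterior and take Var of V.
rng = random.Random(0)
samples = [value(rng.gauss(MEAN[0], STD[0]), rng.gauss(MEAN[1], STD[1]))
           for _ in range(20_000)]
mc_var = statistics.variance(samples)

print(f"delta-method variance: {delta_var:.5f}, Monte Carlo variance: {mc_var:.5f}")
```

When the posterior is narrow relative to the curvature of $V$, as here, the two numbers should agree closely; widening `STD` makes the nonlinearity of $V$ matter and the approximation degrade.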

###### Definition 5 (Posterior predictive value variance, PVV; Fisher-SEP).

The criterion operates with a target policy $\pi^{\mathrm{tgt}}$ (held fixed) and an explorer policy $\pi$ (the argmin variable). The parameter vector at $(s^{\prime},a^{\prime})$ stacks the reward parameter $r_{s^{\prime},a^{\prime}}:=\rho^{\star}_{\mathrm{obs}}(s^{\prime},a^{\prime})$, the interventional mean reward under $\mathcal{M}^{\star}_{\mathrm{obs}}$ identified by a randomized $S$-measurable pilot (Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2)), and the transition vector $p_{s^{\prime},a^{\prime}}:=\wp^{\star}_{\mathrm{obs}}(\cdot\mid s^{\prime},a^{\prime})\in\Delta(\mathbb{S})$, as $\theta_{s^{\prime},a^{\prime}}:=(r_{s^{\prime},a^{\prime}},p_{s^{\prime},a^{\prime}})$. Under Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4), the prior covariance is block-diagonal,

$$\Sigma_{\theta}(s^{\prime},a^{\prime}):=\mathrm{blkdiag}\!\left((\sigma^{(0)}_{s^{\prime},a^{\prime}})^{2},\;\Sigma_{p}(s^{\prime},a^{\prime})\right),\tag{7}$$

with reward block $(\sigma^{(0)}_{s^{\prime},a^{\prime}})^{2}$ from Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) and transition block $\Sigma_{p}(s^{\prime},a^{\prime})=(\alpha^{(0)}_{s^{\prime},a^{\prime}})^{-1}\bigl(\mathrm{diag}(p_{0})-p_{0}p_{0}^{\top}\bigr)$, the Dirichlet prior covariance on the simplex tangent at the prior mean $p_{0}:=\alpha^{(0)}_{s^{\prime},a^{\prime},\cdot}/\alpha^{(0)}_{s^{\prime},a^{\prime}}$. Recall that the *Fisher information* of an observation likelihood quantifies how much each observation tells us about a parameter; concretely, it is the expected outer product of the score $\partial\log p(R,S^{\prime}\mid s^{\prime},a^{\prime},\theta)/\partial\theta$. Larger Fisher information means more information per observation, so the posterior contracts faster. We write $\mathcal{I}^{\mathrm{int}}_{\theta}(s^{\prime},a^{\prime})$ for the per-observation *interventional* Fisher information, the Fisher of the interventional likelihood $\mathbb{P}(\cdot\mid s^{\prime},\mathrm{do}(a^{\prime}))$ identified from a randomized $S$-measurable pilot (Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2)), as opposed to the observed-data Fisher under the behavior policy, which is biased under $A\not\perp H\mid S$ (Assumption [3](https://arxiv.org/html/2605.21458#Thmassumption3); Appendix [A.8](https://arxiv.org/html/2605.21458#A1.SS8) makes the distinction precise). It is also block-diagonal,

$$\mathcal{I}^{\mathrm{int}}_{\theta}(s^{\prime},a^{\prime}):=\mathrm{blkdiag}\!\left(\tau^{\mathrm{int}}_{s^{\prime},a^{\prime}},\;\mathcal{I}^{\mathrm{int}}_{p}(s^{\prime},a^{\prime})\right),\tag{8}$$

with reward block $\tau^{\mathrm{int}}_{s^{\prime},a^{\prime}}:=\mathrm{Var}(R\mid s^{\prime},\mathrm{do}(a^{\prime}))^{-1}$ the inverse interventional reward variance and transition block $\mathcal{I}^{\mathrm{int}}_{p}(s^{\prime},a^{\prime})=\mathrm{diag}(p_{s^{\prime},a^{\prime}})^{-1}$ projected onto the simplex tangent at $p_{s^{\prime},a^{\prime}}$. The expected pilot count at $(s^{\prime},a^{\prime})$ is $n_{s^{\prime},a^{\prime}}(\pi):=T\cdot d_{\pi}(s^{\prime})\cdot\pi(a^{\prime}\mid s^{\prime})$. The PVV criterion is

$$\mathrm{PVV}(\pi;\,\pi^{\mathrm{tgt}}):=\sum_{s}d_{\pi^{\mathrm{tgt}}}(s)\sum_{(s^{\prime},a^{\prime})}\nabla_{\theta}V^{\pi^{\mathrm{tgt}}}(s)^{\!\top}\!\left[\Sigma_{\theta}(s^{\prime},a^{\prime})^{-1}+n_{s^{\prime},a^{\prime}}(\pi)\,\mathcal{I}^{\mathrm{int}}_{\theta}(s^{\prime},a^{\prime})\right]^{\!-1}\!\nabla_{\theta}V^{\pi^{\mathrm{tgt}}}(s).\tag{9}$$

The joint gradient $\nabla_{\theta}V^{\pi^{\mathrm{tgt}}}(s)=(\partial V^{\pi^{\mathrm{tgt}}}(s)/\partial r_{s^{\prime},a^{\prime}},\;\nabla_{p(\cdot\mid s^{\prime},a^{\prime})}V^{\pi^{\mathrm{tgt}}}(s))$ stacks a scalar reward sensitivity and a $|\mathbb{S}|$-vector transition sensitivity. Both are supplied by the *Bellman resolvent* $(I-\gamma P^{\pi^{\mathrm{tgt}}})^{-1}$, where $P^{\pi^{\mathrm{tgt}}}$ is the $|\mathbb{S}|\times|\mathbb{S}|$ transition matrix induced on $\mathbb{S}$ by the target policy $\pi^{\mathrm{tgt}}$, with entries $P^{\pi^{\mathrm{tgt}}}(s^{\prime}\mid s)=\sum_{a}\pi^{\mathrm{tgt}}(a\mid s)\,\wp^{\star}_{\mathrm{obs}}(s^{\prime}\mid s,a)$. The resolvent’s $(s,s^{\prime})$ entry is the expected discounted number of visits to $s^{\prime}$ when $\pi^{\mathrm{tgt}}$ starts from $s$ (by the geometric series $(I-\gamma P^{\pi^{\mathrm{tgt}}})^{-1}=\sum_{k=0}^{\infty}\gamma^{k}(P^{\pi^{\mathrm{tgt}}})^{k}$, valid because $\gamma<1$). *Fisher-SEP* is the SEP whose design criterion is PVV: $\mathcal{D}:=\mathrm{PVV}(\cdot;\,\pi^{\mathrm{tgt}})$, with $\pi^{\star}=\operatorname*{arg\,min}_{\pi}\mathrm{PVV}(\pi;\,\pi^{\mathrm{tgt}})$ over stochastic explorers.
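The geometric-series characterization of the Bellman resolvent can be turned into a few lines of code. The sketch below sums $\gamma^{k}(P^{\pi^{\mathrm{tgt}}})^{k}$ until the terms are negligible; the three-state matrix `P_TGT` is a hypothetical stand-in for the target-induced transition matrix, and a linear solver would normally replace the explicit series in practice.

```python
def resolvent(P, gamma, tol=1e-12, max_terms=10_000):
    """Approximate (I - gamma * P)^{-1} by the truncated series sum_k gamma^k P^k."""
    n = len(P)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in total]  # k = 0 term is the identity
    for _ in range(max_terms):
        # Next term: gamma * term @ P.
        term = [[gamma * sum(term[i][m] * P[m][j] for m in range(n))
                 for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                total[i][j] += term[i][j]
        if max(max(row) for row in term) < tol:
            break
    return total


# Hypothetical target-induced chain: 0 -> 1 -> 2, with state 2 absorbing.
P_TGT = [[0.5, 0.5, 0.0],
         [0.0, 0.5, 0.5],
         [0.0, 0.0, 1.0]]
M = resolvent(P_TGT, gamma=0.9)
for row in M:
    print([round(x, 3) for x in row])
```

Each row of `M` sums to $1/(1-\gamma)$ because $P^{\pi^{\mathrm{tgt}}}$ is row-stochastic, and an entry such as `M[2][0]` is zero because state 0 is never visited from state 2, so parameters at state 0 have no influence on the value at state 2.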

The expression in ([9](https://arxiv.org/html/2605.21458#S4.E9)) is a delta-method posterior variance for $V^{\pi^{\mathrm{tgt}}}$. The outer sum, weighted by $d_{\pi^{\mathrm{tgt}}}(s)$, is over states the target visits. The inner sum is over candidate pilot pairs $(s^{\prime},a^{\prime})$. At each pair the contribution is a quadratic form in the value gradient, weighted by the inverse of the posterior precision $\Sigma_{\theta}^{-1}+n\,\mathcal{I}^{\mathrm{int}}_{\theta}$. Pairs whose value-gradient contribution is small contribute little to the criterion. Pairs with large contribution dominate. The block-diagonal structure separates the contribution into reward and transition pieces.

Note that $\pi$ enters PVV only through the pilot counts $n_{s^{\prime},a^{\prime}}(\pi)$ inside the block inverse. The value gradient and the target visitation are properties of $\pi^{\mathrm{tgt}}$ alone and are fixed once the target policy is chosen. PVV is convex in $n(\pi)$, since the posterior precision is affine in $n$ and the matrix inverse is convex and decreasing on positive-definite matrices, so the minimization is well-posed and admits a closed-form first-order optimum on the simplex of explorer visitation distributions (Appendix [B](https://arxiv.org/html/2605.21458#A2)).

Block-diagonality of $\mathcal{I}^{\mathrm{int}}_{\theta}$ comes from the MDP factorization $\mathbb{P}(R,S^{\prime}\mid s,a,h)=\rho(R\mid s,a,h)\,\wp_{S}(S^{\prime}\mid s,a,h)$: under intervention, reward and next-state are conditionally independent given $(s,a,h)$, and $S$-measurability of the pilot preserves the independence after marginalizing over $h$. The corresponding observed-data Fisher under the behavior policy does not block-diagonalize, since marginalizing over $H$ couples the two likelihoods. This is the consequence of $A\not\perp H\mid S$ (Assumption [3](https://arxiv.org/html/2605.21458#Thmassumption3)). Identification therefore requires a randomized $S$-measurable pilot.

###### Corollary 2 (Fisher-SEP-R, reward-dominates special case).

When the transition block of $\Sigma_{\theta}(s^{\prime},a^{\prime})$ is pre-pinned at every $(s^{\prime},a^{\prime})$ on the target’s support (either because transitions are known given action, or because the target’s value is insensitive to transition perturbations at those pairs beyond what the reward gradient already captures), the transition-block contribution to ([9](https://arxiv.org/html/2605.21458#S4.E9)) vanishes and the block inverse decomposes. The joint PVV reduces to

$$\mathrm{PVV}_{r}(\pi;\,\pi^{\mathrm{tgt}})=\sum_{s}d_{\pi^{\mathrm{tgt}}}(s)\sum_{(s^{\prime},a^{\prime})}\frac{\left(\partial V^{\pi^{\mathrm{tgt}}}(s)/\partial r_{s^{\prime},a^{\prime}}\right)^{2}}{(\sigma^{(0)}_{s^{\prime},a^{\prime}})^{-2}+n_{s^{\prime},a^{\prime}}(\pi)\,\tau^{\mathrm{int}}_{s^{\prime},a^{\prime}}},\tag{10}$$

where each denominator is the sum of the prior and pilot-observation precisions at $(s^{\prime},a^{\prime})$.
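The reward-only criterion is a sum of scalar terms, so it is easy to evaluate and to minimize numerically. The sketch below is an illustration under simplifying assumptions, not the closed-form optimum of Appendix B: the visitation-weighted squared gradients are collapsed into one hypothetical weight per pair, pilot counts are treated as freely assignable integers rather than as functions of an explorer policy, and a greedy unit-by-unit allocation is used.

```python
def pvv_reward(weights, prior_var, tau, counts):
    """Reward-only PVV: sum over pairs of weight / (prior precision + n * tau).

    weights[k] stands for sum_s d_tgt(s) * (dV(s)/dr_k)^2 at pair k (assumed given).
    """
    return sum(w / (1.0 / prior_var[k] + counts.get(k, 0) * tau[k])
               for k, w in weights.items())


def greedy_allocation(weights, prior_var, tau, budget):
    """Assign pilot samples one at a time to the pair with the largest PVV drop."""
    counts = {k: 0 for k in weights}

    def term(k, n):
        return weights[k] / (1.0 / prior_var[k] + n * tau[k])

    for _ in range(budget):
        best = max(counts, key=lambda k: term(k, counts[k]) - term(k, counts[k] + 1))
        counts[best] += 1
    return counts


# Two hypothetical pairs: one with a wide prior but little effect on the target
# value, one with a narrow prior but a large value gradient.
weights = {"wide_prior_low_gradient": 0.01, "narrow_prior_high_gradient": 4.0}
prior_var = {"wide_prior_low_gradient": 1.0, "narrow_prior_high_gradient": 0.1}
tau = {"wide_prior_low_gradient": 1.0, "narrow_prior_high_gradient": 1.0}

counts = greedy_allocation(weights, prior_var, tau, budget=10)
before = pvv_reward(weights, prior_var, tau, {})
after = pvv_reward(weights, prior_var, tau, counts)
print(counts, round(before, 4), round(after, 4))
```

In this example the whole budget goes to the pair with the narrow prior, because its value gradient is large enough that each additional sample there removes more value variance than a sample at the wide-prior pair.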

###### Corollary 3 (Fisher-SEP-T, transition-dominates special case).

When the reward block of $\Sigma_{\theta}(s^{\prime},a^{\prime})$ is pre-pinned by an exact observation model, or the reward-gradient block is dominated by the transition-gradient block at the horizon scale $T_{\mathrm{eff}}$, the reward-block contribution to ([9](https://arxiv.org/html/2605.21458#S4.E9)) vanishes. The joint PVV reduces to

$$\mathrm{PVV}_{p}(\pi;\,\pi^{\mathrm{tgt}})=\sum_{s}d_{\pi^{\mathrm{tgt}}}(s)\sum_{(s^{\prime},a^{\prime})}\nabla_{p(\cdot\mid s^{\prime},a^{\prime})}V^{\pi^{\mathrm{tgt}}}(s)^{\!\top}\!\left[\Sigma_{p}(s^{\prime},a^{\prime})^{-1}+n_{s^{\prime},a^{\prime}}(\pi)\,\mathcal{I}^{\mathrm{int}}_{p}(s^{\prime},a^{\prime})\right]^{\!-1}\!\nabla_{p(\cdot\mid s^{\prime},a^{\prime})}V^{\pi^{\mathrm{tgt}}}(s),\tag{11}$$

with Bellman-resolvent gradient (an $|\mathbb{S}|$-vector indexed by the destination $s^{\prime\prime}$)

$$\bigl[\nabla_{p(\cdot\mid s^{\prime},a^{\prime})}V^{\pi^{\mathrm{tgt}}}(s)\bigr]_{s^{\prime\prime}}\;=\;\gamma\,\bigl[(I-\gamma P^{\pi^{\mathrm{tgt}}})^{-1}\bigr]_{s,\,s^{\prime}}\cdot\pi^{\mathrm{tgt}}(a^{\prime}\mid s^{\prime})\cdot V^{\pi^{\mathrm{tgt}}}(s^{\prime\prime}).$$
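The Bellman-resolvent gradient can be sanity-checked against a central finite difference. The sketch below does this on a hypothetical two-state, two-action MDP; the kernel `p`, rewards `r`, and target policy `pi` are made-up numbers, and the perturbation treats each entry of $p(\cdot\mid s^{\prime},a^{\prime})$ as a free coordinate (no renormalization), matching the unconstrained form of the formula.

```python
import copy

GAMMA = 0.9


def induced(p, r, pi):
    """Transition matrix and reward vector induced on states by policy pi."""
    n, m = len(p), len(p[0])
    P = [[sum(pi[s][a] * p[s][a][t] for a in range(m)) for t in range(n)]
         for s in range(n)]
    R = [sum(pi[s][a] * r[s][a] for a in range(m)) for s in range(n)]
    return P, R


def fixed_point(P, b, iters=3000):
    """Solve x = b + GAMMA * P x by iteration (converges since GAMMA < 1)."""
    x = [0.0] * len(b)
    for _ in range(iters):
        x = [b[s] + GAMMA * sum(P[s][t] * x[t] for t in range(len(b)))
             for s in range(len(b))]
    return x


def value(p, r, pi):
    P, R = induced(p, r, pi)
    return fixed_point(P, R)


p = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.3, 0.7]]]  # p[s][a][next state]
r = [[1.0, 0.0], [0.0, 2.0]]                              # r[s][a]
pi = [[0.6, 0.4], [0.3, 0.7]]                             # pi[s][a]

V = value(p, r, pi)
P, _ = induced(p, r, pi)
H = 1e-5
max_err = 0.0
for s_src in range(2):
    # col[s] is the (s, s_src) entry of the resolvent (I - GAMMA P)^{-1}.
    col = fixed_point(P, [1.0 if s == s_src else 0.0 for s in range(2)])
    for a in range(2):
        for s_dst in range(2):
            up, down = copy.deepcopy(p), copy.deepcopy(p)
            up[s_src][a][s_dst] += H
            down[s_src][a][s_dst] -= H
            v_up, v_down = value(up, r, pi), value(down, r, pi)
            for s in range(2):
                numeric = (v_up[s] - v_down[s]) / (2 * H)
                analytic = GAMMA * col[s] * pi[s_src][a] * V[s_dst]
                max_err = max(max_err, abs(numeric - analytic))
print(f"largest |numeric - analytic| gradient entry: {max_err:.2e}")
```

The three factors of the formula appear directly in `analytic`: the discounted visitation of the perturbed state from the start state, the probability that the target takes the perturbed action there, and the value of the destination state.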

The two specializations correspond to the two case-study regimes. Fisher-SEP-R applies when transitions are well-calibrated (e.g., physics-pinned dynamics) and reward parameters carry the residual error. Fisher-SEP-T applies when transitions encode geography or accessibility and reward parameters are pinned by an exact observation model. The *navigation-restricted* variant of Fisher-SEP-T constrains the explorer to actions whose target-policy transitions are reachable in the environment graph; this is the variant used in the HIV case study (Section [5.2](https://arxiv.org/html/2605.21458#S5.SS2)) and analyzed formally in Appendix [D.13](https://arxiv.org/html/2605.21458#A4.SS13) (Conjecture [1](https://arxiv.org/html/2605.21458#Thmconjecture1) bounds the navigation-restricted PVV gap to the unrestricted minimizer by an overlap constant; Theorem [5](https://arxiv.org/html/2605.21458#Thmtheorem5) establishes the bound on a regular subclass).

Fisher-SEP therefore allocates pilot budget to pairs that are target-relevant but explorer-unvisited, even when the simulator’s prior is tight on mean (small $\Sigma_{\theta}$) but weak in observation precision (small $\tau^{\mathrm{int}}_{s^{\prime},a^{\prime}}$). This is the difference between Fisher-SEP and a regret-minimization explorer: Fisher-SEP can prioritize an $(s^{\prime},a^{\prime})$ at which the simulator’s prior is narrow, provided the target’s value gradient at that pair is large. Pilot budget is allocated where small parameter errors propagate into large value errors, not where the prior variance is widest.

### 4.2 Identification and exploration priority

Computing PVV requires the per-pair observation precision $\tau^{\mathrm{int}}_{s,a}$, the inverse interventional reward variance. The empirical reward variance under the historical behavior policy is biased under $A\not\perp H\mid S$ (Assumption [3](https://arxiv.org/html/2605.21458#Thmassumption3)). A randomized $S$-measurable pilot identifies the interventional precision instead.

###### Lemma 2 (Identification from $S$-measurable randomized policies).

Under Assumption [2](https://arxiv.org/html/2605.21458#Thmassumption2), if $\pi^{\mathrm{exp}}(a\mid s,h)=\pi^{\mathrm{exp}}(a\mid s)$ for all $h$, then reward samples at $(s,a)$ under $\pi^{\mathrm{exp}}$ are marginally identically distributed as $R\sim\mathbb{P}(\cdot\mid s,\mathrm{do}(a))$, independent across units (within a unit, samples may be auto-correlated through $H_{i,t}$). The empirical reward variance consistently estimates $(\tau^{\mathrm{int}}_{s,a})^{-1}$.

The recipe is short. A fraction of decisions is allocated to a randomized $S$-measurable explorer $\pi^{\mathrm{exp}}(a\mid s)$. The empirical reward variance at each $(s,a)$ pair is consistent for $1/\tau^{\mathrm{int}}_{s,a}$. Note that this identification depends on $\pi^{\mathrm{exp}}$ being $S$-measurable: if the explorer conditioned on $H$, the empirical variance would estimate $\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)$ instead, with bias bounded by a propensity-odds sensitivity constant (the worst-case ratio of historical-behavior to randomized propensity at $(s,a)$; Appendix [A](https://arxiv.org/html/2605.21458#A1)).
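A small simulation shows the difference between the two variance estimates. The sketch below assumes a hypothetical data-generating process with one observed state, a binary hidden variable $H$, and a reward at action $a=1$ equal to $1+2H$ plus Gaussian noise; each sample is a separate unit, so samples are independent. None of these modelling choices come from the paper.

```python
import random
import statistics


def pilot_rewards(n_units, prob_action_given_h, rng):
    """Rewards observed at action a = 1, one decision per unit.

    prob_action_given_h[h] is the probability of choosing a = 1 when H = h.
    An S-measurable explorer uses the same probability for both values of h.
    """
    rewards = []
    for _ in range(n_units):
        h = 1 if rng.random() < 0.5 else 0          # hidden state, never observed
        if rng.random() < prob_action_given_h[h]:   # action a = 1 is taken
            rewards.append(1.0 + 2.0 * h + rng.gauss(0.0, 0.5))
    return rewards


rng = random.Random(0)
randomized = pilot_rewards(40_000, {0: 0.5, 1: 0.5}, rng)  # ignores H
confounded = pilot_rewards(40_000, {0: 0.1, 1: 0.9}, rng)  # depends on H

var_randomized = statistics.variance(randomized)
var_confounded = statistics.variance(confounded)
# Under do(a = 1): Var(2H) + 0.5^2 = 1.00 + 0.25 = 1.25.
# Under the H-dependent policy, P(H = 1 | a = 1) = 0.9: 4 * 0.09 + 0.25 = 0.61.
print(f"randomized: {var_randomized:.3f}, confounded: {var_confounded:.3f}")
```

Because the $H$-dependent policy mostly selects the action when $H=1$, its samples understate the spread of rewards under intervention, so a precision computed from them would be too large.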

The two terms separate the sources of uncertainty PVV reduces. The first combines the prior reward variance with the squared estimated confounding bias. The second is a transition-uncertainty term scaling as the effective horizon $T_{\mathrm{eff}}$ and the simplex tangent dimension. The factor $d_{\pi^{\star}_{\mathrm{sim}}}(s)$ weights by SOP visitation, so unvisited states contribute zero and high-traffic states are ranked higher.

## 5 Case studies

Two case studies anchor the local and reachability regimes. A vending-machine supply chain instantiates the local regime: every machine is reachable, the simulator’s error is a miscalibration at visited states, and the question is whether a deliberate pilot pays for itself before the horizon ends. An HIV mobile-testing program instantiates the reachability regime: a corridor cell separates a well-calibrated region from a region the simulator never sampled, and the question is whether a planner anchored to the simulator ever crosses the corridor. Both are constructed mechanism illustrations, not calibrated deployments. The vending data-generating process (DGP) draws seasonality from Singh ([2022](https://arxiv.org/html/2605.21458#bib.bib135)) but is otherwise synthetic. The HIV DGP draws SIS dynamics from Gonsalves et al. ([2018](https://arxiv.org/html/2605.21458#bib.bib150)) and Warren et al. ([2025](https://arxiv.org/html/2605.21458#bib.bib152)) on a stylized $5 \times 8$ grid.

Comparators are as follows. The *oracle* acts optimally given the hidden state $H$ and serves as the upper benchmark. *SOP* and *A-SOP* are defined in Section [2](https://arxiv.org/html/2605.21458#S2). *Thompson sampling* (Posterior Sampling for Reinforcement Learning, PSRL; Osband et al. ([2013](https://arxiv.org/html/2605.21458#bib.bib91)); Russo et al. ([2018](https://arxiv.org/html/2605.21458#bib.bib55))) samples one MDP from the posterior at each step and acts greedily. *Fisher-SEP-R* and *Fisher-SEP-T* are the reward-only and transition-only specializations of Fisher-SEP (Section [4](https://arxiv.org/html/2605.21458#S4), Corollaries [2](https://arxiv.org/html/2605.21458#Thmcorollary2) and [3](https://arxiv.org/html/2605.21458#Thmcorollary3)); the HIV case study uses the navigation-restricted variant of Fisher-SEP-T introduced after Corollary [3](https://arxiv.org/html/2605.21458#Thmcorollary3). *UCRL2* (Jaksch et al., [2010](https://arxiv.org/html/2605.21458#bib.bib93)) and *UCBVI* (Azar et al., [2017](https://arxiv.org/html/2605.21458#bib.bib95)) are optimism-based regret-minimization baselines run on a coarser $2 \times 2$ state representation, included as a representation-handicapped reference rather than a representation-matched comparison (Appendix [C.16](https://arxiv.org/html/2605.21458#A3.SS16)). Values are reported as means over 30 *common-seed trials* (paired evaluation on the same environment realization), with $\pm$ marking the half-width of the marginal $95\%$ confidence interval (CI). Headline separations use the paired Wilcoxon signed-rank test on per-seed differences, and values are reported in *percentage points (pp)* of the oracle’s value.
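
The reporting protocol can be sketched in a few lines of standard-library Python. This is an illustration on synthetic numbers, not the paper's evaluation code: the per-seed values are made up, the CI uses a Student-t critical value for 29 degrees of freedom (an assumption, since the text does not say which interval is used), and the signed-rank p-value uses the normal approximation without a tie correction. In practice `scipy.stats.wilcoxon` would be the usual choice.

```python
import math
import random
import statistics

def ci_half_width(xs, crit=2.045):
    # crit: two-sided 95% Student-t critical value for 29 degrees of freedom.
    return crit * statistics.stdev(xs) / math.sqrt(len(xs))

def wilcoxon_signed_rank(diffs):
    """Two-sided signed-rank test on paired differences (normal approximation)."""
    d = [x for x in diffs if x != 0.0]          # zero differences are dropped
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                # average ranks over ties in |d|
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    return w_plus, math.erfc(abs(z) / math.sqrt(2.0))

# Synthetic common-seed trials: both policies share the per-seed environment level.
rng = random.Random(1)
env = [rng.gauss(70.0, 5.0) for _ in range(30)]
policy_a = [e + rng.gauss(0.0, 1.0) for e in env]
policy_b = [e + 4.0 + rng.gauss(0.0, 1.0) for e in env]
diffs = [b - a for a, b in zip(policy_a, policy_b)]

w_plus, p_value = wilcoxon_signed_rank(diffs)
marginal_hw = ci_half_width(policy_b)   # what the +/- in the tables reports
paired_hw = ci_half_width(diffs)        # much tighter: shared seed noise cancels
```

The point of the common-seed design shows up in the last two lines: the marginal interval carries the environment-level noise, while the per-seed differences do not, which is why the headline separations are tested on paired differences.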

### 5.1 Vending-machine supply chain

An operator runs five vending machines, each stocking three product categories across three customer segments. The historical operator observed a latent demand multiplier $H_{i,t}$ at each machine and conditioned stocking on it, yielding $A \not\perp H \mid S$ (Assumption [3](https://arxiv.org/html/2605.21458#Thmassumption3)). The simulator captures only $39$–$47\%$ of true demand at two of the five machines, with the under-calibration concentrated on a high-demand segment that historically had concentrated stocking. Observations are censored: the planner sees $\min(\text{demand}, \text{stock})$, so stock-outs at under-calibrated machines suppress observations of true demand at exactly those machines. The simulator’s error is on the reward side and at visited states. Policies are evaluated at $T \in \{100, 200, 400, 800, 1600\}$ days. The full DGP is in Appendix [3.2](https://arxiv.org/html/2605.21458#S3.SS2).
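
The censoring mechanism is easy to see in a toy example. The sketch below assumes a hypothetical uniform demand distribution and two hypothetical stock levels; none of these numbers come from the paper's DGP.

```python
import random

def observed_sales(demand, stock):
    # The planner only ever sees min(demand, stock).
    return min(demand, stock)

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(0)
true_demand = [rng.randint(0, 20) for _ in range(5000)]   # hypothetical, mean near 10

low_stock, high_stock = 4, 20   # low_stock mimics stocking to an under-calibrated forecast
sales_low = [observed_sales(d, low_stock) for d in true_demand]
sales_high = [observed_sales(d, high_stock) for d in true_demand]

# Fraction of days on which the low stock level sells out and hides the true demand.
stockout_rate = mean([1.0 if d >= low_stock else 0.0 for d in true_demand])
```

With the low stock level, observed sales can never exceed the stock, so passive updating on `sales_low` cannot reveal that demand is much higher. Stocking above the believed demand, as a randomized pilot sometimes does, is what uncovers it.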

Table [1](https://arxiv.org/html/2605.21458#S5.T1) reports value as a percentage of oracle cash. The SOP degrades from $89.0\%$ at $T=100$ to $37.7\%$ at $T=1600$, consistent with Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1). A-SOP and Thompson sampling close the local gap by updating off the SOP’s path. Both reach about $70\%$ at $T=1600$ and are statistically indistinguishable in paired tests. Fisher-SEP-R incurs a short-horizon cost from front-loaded randomization ($76.6\%$ at $T=100$), reaches parity at $T=800$, and at $T=1600$ leads A-SOP by $+4.83$ pp ($p=0.005$) and Thompson by $+3.92$ pp ($p=0.008$).

Table 1: Vending supply chain, % of oracle cash (mean $\pm$ 95% CI half-width, 30 common-seed trials). Bold: best non-oracle policy per horizon. DGP in Appendix [3.2](https://arxiv.org/html/2605.21458#S3.SS2).

After a five-day randomized pilot (the first five days of Fisher-SEP-R’s exploration phase, in which the explorer mixes uniformly over actions to identify $\tau^{\mathrm{int}}_{s,a}$ at covered pairs by Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2)), the residual ratio $\hat{\epsilon}_{p}/\hat{\epsilon}_{r}$ at the under-calibrated machines is approximately $2$ to $3$ (Appendix [B.5](https://arxiv.org/html/2605.21458#A2.SS5)). Combining this with the post-pilot Corollary [1](https://arxiv.org/html/2605.21458#Thmcorollary1) places the predicted Fisher-SEP-R crossover at $T \approx 400$–$600$, consistent with the observed crossover between $T=400$ and $T=800$. This is an in-sample descriptive consistency check, not a held-out validation. The pattern matches Remark [3](https://arxiv.org/html/2605.21458#Thmremark3): at short horizons the reward-error term $T_{\mathrm{eff}}\epsilon_{r}$ dominates and A-SOP closes it on the SOP’s path; at long horizons the transition-error term $T_{\mathrm{eff}}^{2}R_{\max}\epsilon_{p}$ dominates and Fisher-SEP-R’s front-loaded pilot is amortized over enough exploitation steps to repay the cost.
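
As a rough reading of this horizon trade-off, one can compare the two error terms directly. The inequality below is a back-of-envelope sketch that ignores the constants in Remark 3 and Corollary 1, which are not reproduced in this section:

$$
T_{\mathrm{eff}}^{2} R_{\max}\,\epsilon_{p} \;>\; T_{\mathrm{eff}}\,\epsilon_{r}
\quad\Longleftrightarrow\quad
T_{\mathrm{eff}} \;>\; \frac{\epsilon_{r}}{R_{\max}\,\epsilon_{p}} \;=\; \frac{1}{R_{\max}\,(\epsilon_{p}/\epsilon_{r})}.
$$

Under this simplification, a larger residual ratio $\epsilon_{p}/\epsilon_{r}$ moves the crossover to shorter effective horizons. It does not by itself reproduce the predicted $T \approx 400$–$600$, which depends on the constants of Corollary 1 and on how $T_{\mathrm{eff}}$ relates to $T$.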

### 5.2 HIV mobile-testing program

![Refer to caption](https://arxiv.org/html/2605.21458v1/x2.png)

Figure 2: HIV mobile-testing program (30 common-seed trials, $\pm 2$ standard error (SE) bands). (a) Total active infections as % of oracle (higher is worse): SOP and A-SOP let the outbreak grow; navigation-restricted Fisher-SEP-T contains it within 30 days. (b) Cumulative cases found: Fisher-SEP-T reaches $85.2\%$ at $T=400$; A-SOP plateaus near $58\%$; Thompson partly closes the gap via posterior-sample exploration but trails Fisher-SEP-T. (c) Visitation heatmaps: A-SOP stays in Region A; Fisher-SEP-T crosses the corridor within days.

A mobile testing program operates over a $5 \times 8$ grid with eight testing teams per day. A wall with a single corridor cell splits the grid into two regions. Region A is well-surveilled. Region B is peri-urban (an under-surveilled region) and requires a three-day community-engagement warmup before testing yields can be realized (Gonsalves et al., [2018](https://arxiv.org/html/2605.21458#bib.bib150); Warren et al., [2025](https://arxiv.org/html/2605.21458#bib.bib152)). Disease dynamics follow an SIS (Susceptible-Infected-Susceptible) compartmental model. The hidden state $H_{i,t}$ is each cell’s true active-infection prevalence. Region B contains a previously-unsampled cluster (a “hidden cluster” in the sense that the simulator’s training data never reached it) with true prevalence $30\%$. The simulator, calibrated from clinic-based surveillance that never sampled Region B, estimates Region-B prevalence at $2\%$. The simulator’s error is concentrated outside the SOP’s visitation support, the $\mathcal{G}_{\mathrm{reach}}$ regime of Proposition [1](https://arxiv.org/html/2605.21458#Thmproposition1). Policies are evaluated at $T \in \{50, 100, 200, 300, 400\}$ days. The full DGP, including a magnitude ablation that sweeps the Region-B underestimate from $5\times$ to $30\times$, is in Appendix [D](https://arxiv.org/html/2605.21458#A4).
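
To make the geometry concrete, the sketch below builds a $5 \times 8$ grid with a one-cell corridor and measures how long a team needs before Region B can yield anything. The wall column, corridor row, start and goal cells, and the one-cell-per-day movement rule are all hypothetical choices for illustration; the paper's layout is specified in its Appendix D.

```python
from collections import deque

ROWS, COLS = 5, 8
WALL_COL, CORRIDOR_ROW = 4, 2   # hypothetical: wall occupies column 4, open at row 2
WARMUP_DAYS = 3                 # community-engagement warmup in Region B

def passable(r, c, corridor_open=True):
    if c == WALL_COL:
        return corridor_open and r == CORRIDOR_ROW
    return True

def bfs_days(start, goal, corridor_open=True):
    """Fewest one-cell moves from start to goal, or None if goal is unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < ROWS and 0 <= nc < COLS
            if inside and (nr, nc) not in seen and passable(nr, nc, corridor_open):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

start, cluster = (0, 0), (0, 7)          # Region A corner to a Region B cell
travel_days = bfs_days(start, cluster)   # must detour through the corridor
first_yield_day = travel_days + WARMUP_DAYS
sealed = bfs_days(start, cluster, corridor_open=False)
```

In this layout the detour through the corridor takes 11 moves instead of the 7 a wall-free grid would need, and the warmup pushes the first possible yield to day 14. A planner that weighs only near-term expected yield has little reason to pay that cost when its simulator reports $2\%$ prevalence on the far side.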

Figure [2](https://arxiv.org/html/2605.21458#S5.F2) shows infection containment (panel a), cumulative cases found (panel b), and per-zone visitation heatmaps (panel c). Active infections grow under SOP and A-SOP, neither of which crosses the corridor in time. Navigation-restricted Fisher-SEP-T contains the outbreak within 30 days. Cumulative cases plateau for A-SOP at $58.1\%$ of oracle at $T=400$, with corridor-crossing rate below $5\%$ over a 400-day deployment. Thompson reaches $76.9\%$ as occasional high posterior draws push some teams across. Fisher-SEP-T reaches $85.2\%$. In panel (c), A-SOP’s visitation is concentrated in Region A, whereas Fisher-SEP-T’s visitation crosses the corridor within days.

The headline separations at $T=400$ are: Fisher-SEP-T $-$ A-SOP, $+27.1$ pp ($p<10^{-3}$); Fisher-SEP-T $-$ Thompson, $+8.3$ pp ($p=0.003$); Fisher-SEP-T $-$ optimism baselines, $+34.8$ pp on the coarser $2 \times 2$ representation. UCRL2 and UCBVI cross the corridor in $80$–$87\%$ of trials but plateau, since they do not prioritize *which* corridor crossings carry the most target-relevant information. They treat all unvisited cells equally. We leave a *navigation-matched* Thompson variant (Thompson sampling restricted to the same navigation graph that Fisher-SEP-T’s navigation-restricted explorer uses) to future work.
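
The reported levels and separations can be cross-checked with simple arithmetic. The snippet below uses only figures stated in this section (cumulative cases found as a percentage of oracle at $T=400$); the implied level for the optimism baselines is an inference from the $+34.8$ pp separation, not a number the text reports directly.

```python
# Cumulative cases found, % of oracle at T = 400, as stated in the text.
cases_found = {"Fisher-SEP-T": 85.2, "Thompson": 76.9, "A-SOP": 58.1}

gap_vs_asop = round(cases_found["Fisher-SEP-T"] - cases_found["A-SOP"], 1)
gap_vs_thompson = round(cases_found["Fisher-SEP-T"] - cases_found["Thompson"], 1)
thompson_lift = round(cases_found["Thompson"] - cases_found["A-SOP"], 1)

# Inferred, not reported: level implied for the optimism baselines by +34.8 pp.
implied_optimism_level = round(cases_found["Fisher-SEP-T"] - 34.8, 1)
```

The three differences match the stated $+27.1$, $+8.3$, and $+18.8$ pp, so the levels and the paired separations are mutually consistent at one-decimal precision.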

Remark [4](https://arxiv.org/html/2605.21458#Thmremark4) explains the mechanism. Although the simulator’s prior at Region-B cells is concentrated (it predicts $2\%$ prevalence with high confidence), the target’s value gradient at those cells is non-trivial: SIS dynamics couple the two regions through the Bellman resolvent, so errors in the Region-B prevalence estimate propagate to Region-A’s near-term infection-load forecast. Fisher-SEP-T responds to this gradient through the explore-under-ignorance contribution ([12](https://arxiv.org/html/2605.21458#S4.E12)) of Remark [4](https://arxiv.org/html/2605.21458#Thmremark4). A-SOP, anchored at the posterior mean, does not. Thompson partially closes the gap because occasional high posterior draws send some teams through the corridor. The residual $8.3$ pp gap to navigation-restricted Fisher-SEP-T reflects the cost of using the simulator as a sampling distribution rather than to choose where to experiment.

The same reasoning applies more broadly. When a simulator’s miscalibration is concentrated at unreachable states, the gap is not closed by additional online interaction at any horizon: the unreachable states never produce evidence, so there is no asymptotic rate to wait out. Even when the simulator is $15\times$ off on Region-B prevalence, Fisher-SEP-T finds the cluster, since the Bellman resolvent in the value gradient depends on the simulator’s connectivity pattern rather than on its parameter values.

## 6 Discussion

The local\-reachability decomposition is the central observation of this paper\. The simulator’s value gap, relative to the deployment\-time projection, splits into a local part on states that the deployed policy already visits and a reachability part on states it does not\. The two parts respond differently to additional data\. Passive online updating closes the local part, since the deployed policy continues to generate evidence at visited states\. Closing the reachability part requires deliberate exploration, since the evidence required to identify the relevant parameters is generated only by actions that lie outside the support of the deployed policy\. Fisher\-SEP allocates a real\-world pilot to minimize the posterior predictive variance of a target policy’s value, using the simulator as a Bayesian prior over how parameter errors propagate\.

The decomposition mirrors the within-positivity versus out-of-positivity distinction in causal inference (Section [3](https://arxiv.org/html/2605.21458#S3)). The local gap is a within-positivity inference problem: the deployed policy already produces evidence at every relevant $(s,a)$. The reachability gap is an out-of-positivity one: the relevant $(s,a)$ pairs lie outside the deployed policy’s support and produce no evidence under it. Online updating treats the latter as a slow version of the former, but Theorem [6](https://arxiv.org/html/2605.21458#Thmtheorem6) (with Conjecture [2](https://arxiv.org/html/2605.21458#Thmconjecture2) on a regular subclass) shows that the difference is structural rather than asymptotic. The reachability gap is bounded away from zero on the combination-lock construction at any horizon, and the off-fork visitation probability under any passive learner decays geometrically in chain length on the regular subclass studied in Appendix [E.7](https://arxiv.org/html/2605.21458#A5.SS7). The implication for causal inference is that no amount of additional observational data identifies a state-action pair outside the deployed policy’s support (Parikh et al., [2024b](https://arxiv.org/html/2605.21458#bib.bib163), [2025b](https://arxiv.org/html/2605.21458#bib.bib164)).

Two consequences follow. The horizon trade-off (Remark [3](https://arxiv.org/html/2605.21458#Thmremark3), Table [1](https://arxiv.org/html/2605.21458#S5.T1)), that is, whether the horizon is long enough for passive learning to amortize a pilot, applies only to the local component. On the reachability side, horizon is not the right axis. When the gap is dominated by reachability, increasing $T$ does not amortize the pilot against passive learning, because passive learning never reaches the relevant states. The right diagnostic is not the magnitude of the simulator’s error but where that error sits relative to the deployed policy’s visitation. The residual ratio $\hat{\epsilon}_{p}/\hat{\epsilon}_{r}$ at covered pairs (Appendix [B.5](https://arxiv.org/html/2605.21458#A2.SS5)) addresses the local side. The Exploration Priority Index (Remark [5](https://arxiv.org/html/2605.21458#Thmremark5)) is the per-pair readout for the reachability side.

Fisher-SEP is not an algorithm for producing a more accurate simulator. It changes how the simulator is used. The sim-to-real literature trains policies inside a simulator and deploys them outside (Tobin et al., [2017](https://arxiv.org/html/2605.21458#bib.bib13); Wagenmaker and Jamieson, [2024](https://arxiv.org/html/2605.21458#bib.bib118); Memmel et al., [2024](https://arxiv.org/html/2605.21458#bib.bib145)). Fisher-SEP uses the same simulator as a Bayesian prior on how parameter errors propagate to target value, then allocates real-world evidence accordingly. The key technical object is the Bellman resolvent $(I-\gamma P^{\pi^{\mathrm{tgt}}})^{-1}$ in the gradient of Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5): it summarizes which states reach which others under the target policy, and PVV weights the simulator’s prior covariance at $(s^{\prime},a^{\prime})$ by that propagation factor.
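
A small numerical sketch shows how the resolvent encodes reachability. The three-state chains below are hypothetical and unrelated to the paper's grid; the resolvent is approximated by a truncated Neumann series in pure Python rather than by a matrix inverse. For a fixed policy with $V = (I-\gamma P)^{-1} r$, the entry at row $s$, column $s'$ is the sensitivity of $V(s)$ to the reward at $s'$.

```python
def resolvent(P, gamma, iters=500):
    """Approximate (I - gamma * P)^{-1} by the truncated series sum_k (gamma * P)^k."""
    n = len(P)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in total]
    for _ in range(iters):
        term = [[gamma * sum(term[i][k] * P[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total

GAMMA = 0.9

# Hypothetical transition matrices under a fixed target policy.
# State 2 plays the role of the far region.
P_walled = [[0.5, 0.5, 0.0],
            [0.5, 0.5, 0.0],
            [0.0, 0.5, 0.5]]      # states 0 and 1 never reach state 2
P_corridor = [[0.5, 0.45, 0.05],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]]    # a small corridor from state 0 to state 2

M_walled = resolvent(P_walled, GAMMA)
M_corridor = resolvent(P_corridor, GAMMA)

# Sensitivity of V(0) to the reward parameter at state 2.
sens_walled = M_walled[0][2]
sens_corridor = M_corridor[0][2]
```

When state 2 is unreachable from state 0 the sensitivity is exactly zero, whatever reward the simulator assigns to state 2. Once the transition structure contains a path, the sensitivity becomes positive, which is the sense in which the weighting depends on connectivity rather than on the parameter values at the far state.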

This is why Fisher-SEP performs well even when the simulator’s parameter values are far from accurate at unvisited states. In the HIV case study the simulator is $15\times$ off on Region-B prevalence, and Fisher-SEP-T still finds the cluster, since the resolvent gradient depends on the simulator’s connectivity rather than on its parameter values. The framework’s requirement on the simulator is therefore weaker than the sim-to-real literature’s: the simulator need not be accurate, but its connectivity pattern on the target’s visitation must be approximately correct.

Thompson sampling closes part of the reachability gap because occasional high posterior draws send teams across the corridor in the HIV case study. Empirically this gives a $+18.8$ pp lift over A-SOP at $T=400$ (Thompson reaches $76.9\%$ of oracle, A-SOP plateaus at $58.1\%$). Thompson is not a representation-matched competitor for Fisher-SEP, however. It samples target and explorer from the same posterior, does not respect a navigation restriction, and pays full posterior-sample variance for every parameter, including those whose perturbations do not change the target’s value. Fisher-SEP-T leads Thompson by $+8.3$ pp at $T=400$ ($85.2\%$ of oracle), which reflects the cost of using the simulator as a sampling distribution rather than to choose where to experiment. Appendix [E.7](https://arxiv.org/html/2605.21458#A5.SS7), Remark [26](https://arxiv.org/html/2605.21458#Thmremark26) shows that Thompson sampling satisfies the asymptotic $o(1)$ claim of Conjecture [2](https://arxiv.org/html/2605.21458#Thmconjecture2) on a regular subclass, with rate governed by $p^{\star}$. We leave a navigation-matched Thompson variant to future work.

### Limitations and future work

The framework is tabular with finite latent state. Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1) extends conceptually to function approximation by replacing the sup-norm with appropriate function-space norms, but the closed-form posterior in Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) does not. A Gaussian-process or Bayesian-neural-network treatment of the parameter posterior is the natural successor. The technical challenge is computing $\nabla_{\theta}V^{\pi^{\mathrm{tgt}}}$ when $\theta$ is infinite-dimensional.

A natural further question is which minimal assumptions on the simulator preserve Fisher\-SEP’s usefulness, and how to diagnose their failure online\. The post\-pilot residual diagnostic \(Appendix[B\.5](https://arxiv.org/html/2605.21458#A2.SS5)\) is a starting point\. A more general representation\-mismatch test that detects misspecification before the pilot is run, perhaps via residual structure on the calibration data, is open\.

The independent prior across $(s,a)$ pairs in Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) is conservative. Under a hierarchical prior (e.g., a Gaussian-process kernel encoding spatial proximity), part of the reachability gap is transportable into the deployed policy’s visitation support by passive learning, with the closable fraction governed by the kernel’s effective rank. Appendix [C.18](https://arxiv.org/html/2605.21458#A3.SS18) (Proposition [9](https://arxiv.org/html/2605.21458#Thmproposition9)) bounds this closable fraction; quantifying it in finite samples is open.

Conjecture [2](https://arxiv.org/html/2605.21458#Thmconjecture2) establishes the geometric reachability-gap rate on a regular subclass of passive policies (posterior-mean-optimal, bounded-temperature softmax, polynomial-shrinkage upper-confidence-bound (UCB) policies). The unconditional version remains open. A counterexample with $p^{\star}\to 1$ would identify the regularity conditions under which posterior sampling fails to escape a reachability gap.

The analysis treats $\epsilon^{\mathrm{hist}}$ as negligible, justified by latent-state contraction (Remark [1](https://arxiv.org/html/2605.21458#Thmremark1)). Non-contracting dynamics, such as slowly mixing latent states, would inflate $\epsilon^{\mathrm{hist}}$ and require a belief-state-policy generalization that we do not pursue here.

### Ethical considerations

The phrase “hidden cluster” in Section [5.2](https://arxiv.org/html/2605.21458#S5.SS2) is a statistical label, not a clinical one: it names a region the simulator’s training data never reached, not a population characteristic. We use it to map cleanly to the formal $\mathcal{G}_{\mathrm{reach}}$ object. Real outreach programs work *with* under-surveilled communities, and the framework’s language should be read with that translation in mind.

Fisher\-SEP’s concentration of exploration in under\-surveilled regions is statistically efficient under a value\-centric objective, but it can shift short\-term testing access away from already\-served communities\. Any operational use should include domain\-specific ethical review, community\-level consent and engagement, and quantitative equity constraints on the exploration allocation, designed jointly by methodologists, clinicians, and community representatives\. Examples include capping the per\-community fraction of exploration capacity, or requiring equal\-or\-better cumulative testing rates across communities, even at a cost to the value\-maximizing allocation\. The residual\-ratio diagnostic and the Fisher\-SEP prescription provide decision support\. They do not replace allocation decisions, and the framework characterizes the trade\-off without resolving it\.
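
One of the equity constraints mentioned above, a cap on each community's share of exploration capacity, can be sketched as a proportional allocation with redistribution. The function name, the priority scores, and the redistribution rule are hypothetical; the paper does not prescribe an implementation, and any operational rule would come out of the joint design process described above.

```python
def capped_allocation(scores, cap):
    """Split one unit of exploration capacity in proportion to non-negative scores,
    with no community receiving more than `cap`; excess is redistributed."""
    n = len(scores)
    if cap * n < 1.0:
        raise ValueError("cap is too small to allocate all capacity")
    alloc = [0.0] * n
    free = list(range(n))
    remaining = 1.0
    while free:
        total = sum(scores[i] for i in free)
        # Fall back to an equal split if every remaining score is zero.
        shares = {i: remaining * (scores[i] / total if total > 0 else 1.0 / len(free))
                  for i in free}
        over = [i for i in free if shares[i] > cap]
        if not over:
            for i in free:
                alloc[i] = shares[i]
            break
        for i in over:                      # pin capped communities at the cap
            alloc[i] = cap
            remaining -= cap
        free = [i for i in free if i not in over]
    return alloc

# Hypothetical priority scores for three communities, capped at 50% each.
example = capped_allocation([8.0, 1.0, 1.0], cap=0.5)
```

In the example, the uncapped proportional split would give the first community 80% of capacity. The cap holds it at 50% and the remainder is shared by the other two, which makes the cost to the value-maximizing allocation explicit.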

### Summary

Passive online updating closes the local component of the simulator’s value gap. Deliberate exploration is required to close the reachability component. Which component dominates is determined by the ratio $\epsilon^{h}/\epsilon^{m}$ at visited states (Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1)) and by the deployed policy’s visitation support relative to the simulator’s miscalibration (Proposition [1](https://arxiv.org/html/2605.21458#Thmproposition1), Theorem [6](https://arxiv.org/html/2605.21458#Thmtheorem6)). Fisher-SEP gives a Bayesian design criterion for the second case. The simulator’s contribution is its connectivity pattern on the target policy’s visitation, not its point accuracy on parameter values.

## References

- Adaptive Platform Trials Coalition (2019) Adaptive platform trials: definition, design, conduct and reporting considerations. Nature Reviews Drug Discovery 18(10), pp. 797–807. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px8.p1.1).
- A. E. Ades, G. Lu, and K. Claxton (2004) Expected value of sample information calculations in medical decision modeling. Medical Decision Making 24(2), pp. 207–227. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1), [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p1.1).
- A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire (2014) Taming the monster: a fast and simple algorithm for contextual bandits. In International Conference on Machine Learning (ICML), pp. 1638–1646. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1).
- S. Agrawal and N. Goyal (2012) Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory (COLT), pp. 39.1–39.26. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p1.1).
- I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. (2019) Solving Rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px11.p1.1), [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px3.p1.1).
- A. Allevato, E. S. Short, M. Pryor, and A. Thomaz (2020) TuneNet: one-shot residual tuning for system identification and sim-to-real robot task transfer. In Conference on Robot Learning (CoRL), pp. 445–455. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px3.p1.1).
- J. Asmuth, L. Li, M. L. Littman, A. Nouri, and D. Wingate (2009) A Bayesian sampling approach to exploration in reinforcement learning (BOSS). In Conference on Uncertainty in Artificial Intelligence (UAI). Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px13.p1.1).
- J. Audibert and S. Bubeck (2010) Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT), pp. 41–53. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p2.1).
- P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2), pp. 235–256. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p1.1), [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p3.3).
- M. G. Azar, I. Osband, and R. Munos (2017) Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning (ICML), pp. 263–272. Cited by: [§C.16](https://arxiv.org/html/2605.21458#A3.SS16.p1.2), [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px1.p1.1), [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px21.p3.2), [§5](https://arxiv.org/html/2605.21458#S5.p2.4).
- P. J. Ball, L. Smith, I. Kostrikov, and S. Levine (2023) Efficient online reinforcement learning with offline data. International Conference on Machine Learning (ICML). Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px4.p2.1), [§1](https://arxiv.org/html/2605.21458#S1.p5.1).
- E. Bareinboim and J. Pearl (2016) Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113(27), pp. 7345–7352. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p1.1), [§1](https://arxiv.org/html/2605.21458#S1.p2.1), [§1](https://arxiv.org/html/2605.21458#S1.p5.1).
- A. D. Barker, C. C. Sigman, G. J. Kelloff, N. M. Hylton, D. A. Berry, and L. J. Esserman (2009) I-SPY 2: an adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy. Clinical Pharmacology & Therapeutics 86(1), pp. 97–100. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px8.p1.1).
- A. Bennett and N. Kallus (2024) Proximal reinforcement learning: efficient off-policy evaluation in partially observed Markov decision processes. Operations Research. Note: Online publication 2023; print 2024. arXiv:2110.15332. External Links: [Document](https://dx.doi.org/10.1287/opre.2021.0781). Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p2.1).
- F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause (2017) Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 30, pp. 908–918. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px7.p1.1).
- R. Berman, L. Pekelis, A. Scott, and C. Van den Bulte (2022) False discovery in A/B testing. Management Science 68(9), pp. 6762–6782. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1).
- D. A. Berry (2006) Bayesian clinical trials. Nature Reviews Drug Discovery 5(1), pp. 27–36. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px8.p1.1).
- D. A. Berry (2012) Adaptive clinical trials in oncology. Nature Reviews Clinical Oncology 9(4), pp. 199–207. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px8.p1.1).
- S. M. Berry, J. T. Connor, and R. J. Lewis (2015) The platform trial: an efficient strategy for evaluating multiple treatments. JAMA 313(16), pp. 1619–1620. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px8.p1.1).
- T. Blau, E. V. Bonilla, I. Chades, and A. Dezfouli (2022) Optimizing sequential experimental design with deep reinforcement learning. In International Conference on Machine Learning (ICML), pp. 2107–2128. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p2.1).
- R. I. Brafman and M. Tennenholtz (2002) R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3, pp. 213–231. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px13.p1.1).
- A. Brennan, S. Kharroubi, A. O’Hagan, and J. Chilcott (2007) Calculating partial expected value of perfect information via Monte Carlo sampling algorithms. Medical Decision Making 27(4), pp. 448–470. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1).
- A. Briggs, K. Claxton, and M. Sculpher (2006) Decision modelling for health economic evaluation. Oxford University Press, Oxford. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p1.1).
- K. Chaloner and I. Verdinelli (1995) Bayesian experimental design: a review. Statistical Science 10(3), pp. 273–304. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px10.p2.1), [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p2.1), [§4.1](https://arxiv.org/html/2605.21458#S4.SS1.p1.4).
- Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox (2019) Closing the sim-to-real loop: adapting simulation randomization with real world experience. In International Conference on Robotics and Automation (ICRA), pp. 8973–8979. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px3.p1.1).
- C. Chen, J. Lin, E. Yücesan, and S. E. Chick (2000) Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynamic Systems 10(3), pp. 251–270. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1).
- F. Chen, M. Ghavamzadeh, Y. Chow, and A. Malek (2025) Multi-fidelity hybrid reinforcement learning via information gain maximization. arXiv preprint arXiv:2509.14848. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px4.p2.1).
- S. E. Chick, J. Branke, and C. Schmidt (2010) Sequential sampling to myopically maximize the expected value of information. INFORMS Journal on Computing 22(1), pp. 71–80. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1).
- S. E. Chick and K. Inoue (2001) New two-stage and sequential procedures for selecting the best simulated system. Operations Research 49(5), pp. 732–743. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1).
- K. Claxton, M. Sculpher, and M. Drummond (2002) A rational framework for decision making by the National Institute for Clinical Excellence (NICE). The Lancet 360(9334), pp. 711–715. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1).
- K. Claxton (1999) The irrelevance of inference: a decision-making approach to the stochastic evaluation of health care technologies. Journal of Health Economics 18(3), pp. 341–364. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1), [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p1.1).
- I. J. Dahabreh, S. E. Robertson, E. J. Tchetgen Tchetgen, E. A. Stuart, and M. A. Hernán (2019) Generalizing causal inferences from randomized trials: counterfactual and graphical identification. Biometrics. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p1.1).
- M. H. DeGroot (1970) Optimal statistical decisions. McGraw-Hill, New York. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p1.1).
- I. Degtiar and S. Rose (2023) A review of generalizability and transportability. Annual Review of Statistics and Its Application 10, pp. 501–524. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p1.1).
- A. K. Dixit and R. S. Pindyck (1994) Investment under uncertainty. Princeton University Press, Princeton, NJ. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px19.p1.1).
- M. O. Duff (2002) Optimal learning: computational procedures for Bayes-adaptive Markov decision processes. Ph.D. Thesis, University of Massachusetts Amherst. Cited by: [§A.2](https://arxiv.org/html/2605.21458#A1.SS2.p2.5), [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px1.p3.1), [§1](https://arxiv.org/html/2605.21458#S1.p5.1).
- A\. Fickinger \(2025\)Statistical guarantees for offline domain randomization\.arXiv preprint arXiv:2506\.10133\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px3.p2.1)\.
- A\. Foster, D\. R\. Ivanova, I\. Malik, and T\. Rainforth \(2021\)Deep adaptive design: amortizing sequential bayesian experimental design\.InInternational Conference on Machine Learning \(ICML\),pp\. 3384–3395\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p2.1)\.
- D\. J\. Foster, A\. Agarwal, M\. Dudík, H\. Luo, and R\. E\. Schapire \(2018\)Practical contextual bandits with regression oracles\.InInternational Conference on Machine Learning \(ICML\),pp\. 1539–1548\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1)\.
- D\. J\. Foster and A\. Rakhlin \(2020\)Beyond UCB: optimal and efficient contextual bandits with regression oracles\.InInternational Conference on Machine Learning \(ICML\),pp\. 3199–3210\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1)\.
- P\. I\. Frazier, W\. B\. Powell, and S\. Dayanik \(2008\)A knowledge\-gradient policy for sequential information collection\.SIAM Journal on Control and Optimization47\(5\),pp\. 2410–2439\.Cited by:[§A\.2](https://arxiv.org/html/2605.21458#A1.SS2.p2.5),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p2.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p3.3)\.
- P\. I\. Frazier, W\. B\. Powell, and S\. Dayanik \(2009\)The knowledge\-gradient policy for correlated normal beliefs\.INFORMS Journal on Computing21\(4\),pp\. 599–613\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p2.1)\.
- J\. Fu, M\. Norouzi, O\. Nachum, G\. Tucker, Z\. Wang, A\. Novikov, M\. Yang, M\. R\. Zhang, Y\. Chen, A\. Kumar, C\. Paduraru, S\. Levine, and T\. L\. Paine \(2024\)Benchmarks for reinforcement learning with biased offline data and imperfect simulators\.arXiv preprint arXiv:2407\.00806\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px4.p2.1)\.
- S\. Fujimoto, D\. Meger, and D\. Precup \(2019\)Off\-policy deep reinforcement learning without exploration\.InInternational Conference on Machine Learning \(ICML\),pp\. 2052–2062\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p1.1)\.
- J\. C\. Gittins \(1979\)Bandit processes and dynamic allocation indices\.Journal of the Royal Statistical Society: Series B \(Methodological\)41\(2\),pp\. 148–164\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p1.1)\.
- J\. Gittins, K\. Glazebrook, and R\. Weber \(2011\)Multi\-armed bandit allocation indices\.2nd edition,John Wiley & Sons,Chichester\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p1.1)\.
- G\. S\. Gonsalves, F\. W\. Crawford, P\. D\. Cleary, E\. H\. Kaplan, and A\. D\. Paltiel \(2018\)An adaptive approach to locating mobile HIV testing services\.Medical Decision Making38\(2\),pp\. 262–272\.Cited by:[§D\.16](https://arxiv.org/html/2605.21458#A4.SS16.SSS0.Px1.p1.4),[§D\.6](https://arxiv.org/html/2605.21458#A4.SS6.p1.8),[§1](https://arxiv.org/html/2605.21458#S1.p1.1),[§5\.2](https://arxiv.org/html/2605.21458#S5.SS2.p1.8),[§5](https://arxiv.org/html/2605.21458#S5.p1.1)\.
- A\. Guez, D\. Silver, and P\. Dayan \(2012\)Efficient Bayes\-adaptive reinforcement learning using sample\-based search\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.25,pp\. 1025–1033\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px1.p3.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1)\.
- V\. Hadad, D\. A\. Hirshberg, R\. Zhan, S\. Wager, and S\. Athey \(2021\)Confidence intervals for policy evaluation in adaptive experiments\.Proceedings of the National Academy of Sciences118\(15\),pp\. e2014602118\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1)\.
- D\. Hafner, J\. Pasukonis, J\. Ba, and T\. Lillicrap \(2023\)Mastering diverse domains through world models\.arXiv preprint arXiv:2301\.04104\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px4.p1.1)\.
- N\. Hansen, H\. Su, and X\. Wang \(2022\)Temporal difference learning for model predictive control\.InInternational Conference on Machine Learning \(ICML\),Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px4.p1.1)\.
- N\. Hansen, X\. Wang, and H\. Su \(2024\)TD\-MPC2: scalable, robust world models for continuous control\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px4.p1.1)\.
- E\. Hazan, S\. M\. Kakade, K\. Singh, and A\. Van Soest \(2019\)Provably efficient maximum entropy exploration\.InInternational Conference on Machine Learning \(ICML\),Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px12.p1.1)\.
- A\. Heath, N\. Kunst, C\. Jackson, M\. Strong, F\. Alarid\-Escudero, J\. D\. Goldhaber\-Fiebert, G\. Baio, N\. A\. Menzies, and H\. Jalal \(2020\)Calculating the expected value of sample information in practice: considerations from 3 case studies\.Medical Decision Making40\(3\),pp\. 314–326\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p1.1)\.
- L\. J\. Hong, Z\. Huang, and H\. Lam \(2021\)Review on ranking and selection: a new perspective\.Frontiers of Engineering Management8\(3\),pp\. 321–343\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1)\.
- R\. A\. Howard \(1966\)Information value theory\.IEEE Transactions on Systems Science and Cybernetics2\(1\),pp\. 22–26\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p1.1)\.
- S\. R\. Howard, A\. Ramdas, J\. McAuliffe, and J\. Sekhon \(2021\)Time\-uniform, nonparametric, nonasymptotic confidence sequences\.The Annals of Statistics49\(2\),pp\. 1055–1080\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1)\.
- F\. Hu and W\. F\. Rosenberger \(2006\)The theory of response\-adaptive randomization in clinical trials\.John Wiley & Sons,Hoboken, NJ\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px8.p1.1)\.
- M\. Y\. Huang and H\. Parikh \(2024\)Towards generalizing inferences from trials to target populations\.arXiv preprint arXiv:2402\.17042\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p1.1)\.
- D\. R\. Ivanova, A\. Foster, S\. Kleinegesse, M\. U\. Gutmann, and T\. Rainforth \(2021\)Implicit deep adaptive design: policy\-based experimental design without likelihoods\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.34,pp\. 25785–25798\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p2.1)\.
- T\. Jaksch, R\. Ortner, and P\. Auer \(2010\)Near\-optimal regret bounds for reinforcement learning\.Journal of Machine Learning Research11,pp\. 1563–1600\.Cited by:[§C\.16](https://arxiv.org/html/2605.21458#A3.SS16.p1.2),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px1.p1.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px13.p1.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px21.p3.2),[§5](https://arxiv.org/html/2605.21458#S5.p2.4),[Remark 8](https://arxiv.org/html/2605.21458#Thmremark8.p1.3.3)\.
- H\. Jalal and F\. Alarid\-Escudero \(2018\)A Gaussian approximation approach for value of information analysis\.Medical Decision Making38\(2\),pp\. 174–188\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1)\.
- M\. Janner, J\. Fu, M\. Zhang, and S\. Levine \(2019\)When to trust your model: model\-based policy optimization\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.32,pp\. 12519–12530\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px4.p1.1)\.
- C\. Jennison and B\. W\. Turnbull \(1999\)Group sequential methods with applications to clinical trials\.Chapman and Hall/CRC,Boca Raton, FL\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px8.p1.1)\.
- C\. Jin, Z\. Allen\-Zhu, S\. Bubeck, and M\. I\. Jordan \(2018\)Is Q\-learning provably efficient?\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.31,pp\. 4863–4873\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px1.p1.1)\.
- C\. Jin, A\. Krishnamurthy, M\. Simchowitz, and T\. Yu \(2020\)Reward\-free exploration for reinforcement learning\.InInternational Conference on Machine Learning \(ICML\),Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px12.p1.1)\.
- R\. Johari, P\. Koomen, L\. Pekelis, and D\. Walsh \(2022\)Always valid inference: continuous monitoring of A/B tests\.Operations Research70\(3\),pp\. 1806–1821\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1)\.
- S\. Kakade and J\. Langford \(2002\)Approximately optimal approximate reinforcement learning\.InInternational Conference on Machine Learning \(ICML\),pp\. 267–274\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px6.p2.1)\.
- N\. Kallus, A\. M\. Puli, and U\. Shalit \(2018\)Removing hidden confounding by experimental grounding\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.31\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p2.1)\.
- N\. Kallus and A\. Zhou \(2018\)Confounding\-robust policy improvement\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.31\.Cited by:[§A\.11](https://arxiv.org/html/2605.21458#A1.SS11.p1.4),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p2.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1),[Remark 12](https://arxiv.org/html/2605.21458#Thmremark12.p1.3.3)\.
- K\. Kandasamy, G\. Dasarathy, J\. B\. Oliva, J\. Schneider, and B\. Póczos \(2016\)Gaussian process bandit optimisation with multi\-fidelity evaluations\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.29,pp\. 992–1000\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px5.p1.1)\.
- K\. Kandasamy, G\. Dasarathy, B\. Póczos, and J\. Schneider \(2019\)Multi\-fidelity Gaussian process bandit optimisation\.Journal of Artificial Intelligence Research66,pp\. 151–196\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px5.p1.1)\.
- K\. Kandasamy, G\. Dasarathy, J\. Schneider, and B\. Póczos \(2017\)Multi\-fidelity Bayesian optimisation with continuous approximations\.InInternational Conference on Machine Learning \(ICML\),pp\. 1799–1808\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px5.p1.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1)\.
- M\. Kasy and A\. Sautmann \(2021\)Adaptive treatment assignment in experiments for policy choice\.Econometrica89\(1\),pp\. 113–132\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p3.1)\.
- M\. Kato, K\. Okumura, T\. Ishihara, and T\. Kitagawa \(2024\)Adaptive experimental design for policy learning\.InInternational Conference on Machine Learning \(ICML\),Note:arXiv:2401\.03756\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p3.1)\.
- E\. Kaufmann, O\. Cappé, and A\. Garivier \(2016\)On the complexity of best\-arm identification in multi\-armed bandit models\.Journal of Machine Learning Research17\(1\),pp\. 1–42\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p2.1)\.
- E\. Kaufmann, P\. Ménard, O\. D\. Domingues, A\. Jonsson, E\. Leurent, and M\. Valko \(2021\)Adaptive reward\-free exploration\.InAlgorithmic Learning Theory \(ALT\),Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px12.p1.1)\.
- M\. Kearns and S\. Singh \(2002\)Near\-optimal reinforcement learning in polynomial time\.Machine Learning49\(2–3\),pp\. 209–232\.Cited by:[§A\.1](https://arxiv.org/html/2605.21458#A1.SS1.SSS0.Px5.p1.8),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px1.p1.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px1.p2.6),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px6.p1.1),[§3\.1](https://arxiv.org/html/2605.21458#S3.SS1.p1.11),[§3\.1](https://arxiv.org/html/2605.21458#S3.SS1.p3.3)\.
- R\. Kidambi, A\. Rajeswaran, P\. Netrapalli, and T\. Joachims \(2020\)MOReL: model\-based offline reinforcement learning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.33,pp\. 21810–21823\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p1.1)\.
- S\. Kim and B\. L\. Nelson \(2006\)Selecting the best system\.InSimulation,Handbooks in Operations Research and Management Science, Vol\.13,pp\. 501–534\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1)\.
- R\. Kohavi, R\. Longbotham, D\. Sommerfield, and R\. M\. Henne \(2009\)Controlled experiments on the web: survey and practical guide\.Data Mining and Knowledge Discovery18\(1\),pp\. 140–181\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1)\.
- R\. Kohavi, D\. Tang, and Y\. Xu \(2020\)Trustworthy online controlled experiments: a practical guide to A/B testing\.Cambridge University Press,Cambridge\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1)\.
- I\. Kostrikov, A\. Nair, and S\. Levine \(2022\)Offline reinforcement learning with implicit Q\-learning\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p1.1)\.
- A\. Kumar, A\. Zhou, G\. Tucker, and S\. Levine \(2020\)Conservative Q\-learning for offline reinforcement learning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.33,pp\. 1179–1191\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p1.1)\.
- T\. L\. Lai and H\. Robbins \(1985\)Asymptotically efficient adaptive allocation rules\.Advances in Applied Mathematics6\(1\),pp\. 4–22\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p1.1)\.
- S\. Lange, T\. Gabel, and M\. Riedmiller \(2012\)Batch reinforcement learning\.InReinforcement Learning: State\-of\-the\-Art,pp\. 45–73\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p1.1)\.
- Q\. Lanners, C\. Rudin, A\. Volfovsky, and H\. Parikh \(2025\)Data fusion for partial identification of causal effects\.arXiv preprint arXiv:2505\.24296\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p2.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p4.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1)\.
- E\. H\. Lee, V\. Perrone, C\. Archambeau, and M\. Seeger \(2020\)Cost\-aware Bayesian optimization\.arXiv preprint arXiv:2003\.10870\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px5.p1.1)\.
- S\. Levine, A\. Kumar, G\. Tucker, and J\. Fu \(2020\)Offline reinforcement learning: tutorial, review, and perspectives on open problems\.arXiv preprint arXiv:2005\.01643\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p1.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1)\.
- L\. Li, W\. Chu, J\. Langford, and R\. E\. Schapire \(2010\)A contextual\-bandit approach to personalized news article recommendation\.InInternational Conference on World Wide Web \(WWW\),pp\. 661–670\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1)\.
- S\. Lobel and R\. Parr \(2024\)The Bellman operator is a contraction mapping in $\ell_p$ norms \(and so is the simulation lemma\): an optimal tightness bound for the simulation lemma\.Reinforcement Learning Journal2,pp\. 785–797\.Note:Reinforcement Learning Conference \(RLC\) 2024; arXiv:2406\.16249\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px6.p2.1),[§3\.1](https://arxiv.org/html/2605.21458#S3.SS1.p3.3),[Remark 6](https://arxiv.org/html/2605.21458#Thmremark6.p1.2.2)\.
- X\. Lu, B\. Van Roy, V\. Dwaracherla, M\. Ibrahimi, I\. Osband, and Z\. Wen \(2023\)Reinforcement learning, bit by bit\.Foundations and Trends in Machine Learning\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px14.p1.2)\.
- F\. Matějka and A\. McKay \(2015\)Rational inattention to discrete choices: a new foundation for the multinomial logit model\.American Economic Review105\(1\),pp\. 272–298\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px19.p2.1)\.
- T\. Matsushima, H\. Furuta, Y\. Matsuo, O\. Nachum, and S\. Gu \(2021\)Deployment\-efficient reinforcement learning via model\-based offline optimization\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px7.p1.1)\.
- R\. McDonald and D\. Siegel \(1986\)The value of waiting to invest\.Quarterly Journal of Economics101\(4\),pp\. 707–727\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px19.p1.1)\.
- R\. Meager \(2019\)Understanding the average impact of microcredit expansions: a Bayesian hierarchical analysis of seven randomized experiments\.American Economic Journal: Applied Economics11\(1\),pp\. 57–91\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p3.1)\.
- B\. Mehta, M\. Diaz, F\. Golber, C\. Sim, P\. Englert, and D\. Fox \(2020\)Active domain randomization\.InConference on Robot Learning \(CoRL\),pp\. 1162–1176\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px11.p1.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px3.p1.1)\.
- M\. Memmel, A\. Wagenmaker, C\. Zhang, M\. Yin, D\. Fox, and A\. Gupta \(2024\)ASID: active exploration for system identification in robotic manipulation\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px10.p5.4),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px3.p3.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1),[§6](https://arxiv.org/html/2605.21458#S6.p4.2)\.
- A\. Modi, N\. Jiang, A\. Tewari, and S\. Singh \(2020\)Sample complexity of reinforcement learning using linearly combined model ensembles\.InArtificial Intelligence and Statistics \(AISTATS\),Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px14.p1.2)\.
- S\. Morris and P\. Strack \(2019\)The Wald problem and the relation of sequential sampling and ex\-ante information costs\.SSRN Working Paper\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px19.p2.1)\.
- F\. Muratore, F\. Ramos, G\. Turk, W\. Yu, M\. Gienger, and J\. Peters \(2022\)Robot learning from randomized simulations: a review\.Frontiers in Robotics and AI9,pp\. 799893\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px3.p1.1)\.
- H\. Namkoong, R\. Keramati, S\. Yadlowsky, and E\. Brunskill \(2020\)Off\-policy policy evaluation for sequential decisions under unobserved confounding\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.33\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p2.1),[§1](https://arxiv.org/html/2605.21458#S1.p3.1),[Remark 12](https://arxiv.org/html/2605.21458#Thmremark12.p1.3.3)\.
- X\. Nie, X\. Tian, J\. Taylor, and J\. Zou \(2018\)Why adaptively collected data have negative bias and how to correct for it\.InInternational Conference on Artificial Intelligence and Statistics \(AISTATS\),pp\. 1261–1269\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1)\.
- H\. Niu, T\. Ji, B\. Liu, H\. Zhao, X\. Zhu, J\. Zheng, P\. Huang, G\. Zhou, J\. Hu, and X\. Zhan \(2023\)H2O\+: an improved framework for hybrid offline\-and\-online RL with dynamics gaps\.arXiv preprint arXiv:2309\.12716\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px4.p2.1)\.
- H\. Niu, Y\. Qiu, M\. Li, G\. Zhou, J\. Hu, and X\. Zhan \(2022\)When to trust your simulator: dynamics\-aware hybrid offline\-and\-online reinforcement learning\.Advances in Neural Information Processing Systems \(NeurIPS\)35,pp\. 36599–36612\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px4.p2.1)\.
- P\. C\. O’Brien and T\. R\. Fleming \(1979\)A multiple testing procedure for clinical trials\.Biometrics35\(3\),pp\. 549–556\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px8.p1.1)\.
- I\. Osband, D\. Russo, and B\. Van Roy \(2013\)\(More\) efficient reinforcement learning via posterior sampling\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.26,pp\. 3003–3011\.Cited by:[5th item](https://arxiv.org/html/2605.21458#A5.I2.i5.p1.1),[5th item](https://arxiv.org/html/2605.21458#A5.I3.i5.p1.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px1.p3.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1),[§2](https://arxiv.org/html/2605.21458#S2.p9.8),[§5](https://arxiv.org/html/2605.21458#S5.p2.4),[Remark 8](https://arxiv.org/html/2605.21458#Thmremark8.p1.3.3)\.
- H\. Parikh, M\. Morucci, V\. Orlandi, S\. Roy, C\. Rudin, and A\. Volfovsky \(2025a\)A double machine learning approach to combining experimental and observational data\.Observational Studies\.Note:arXiv preprint arXiv:2307\.01449Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p2.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p4.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1)\.
- H\. Parikh, R\. Ross, E\. Kern, K\. Rudolph, and J\. R\. Zubizarreta \(2025b\)Regularizing extrapolation in causal inference\.arXiv preprint arXiv:2509\.17180\.Cited by:[§1](https://arxiv.org/html/2605.21458#S1.p3.1),[§3\.2](https://arxiv.org/html/2605.21458#S3.SS2.p3.2),[§6](https://arxiv.org/html/2605.21458#S6.p2.2)\.
- H\. Parikh, R\. Ross, E\. A\. Stuart, and K\. E\. Rudolph \(2024a\)Who are we missing? A principled approach to characterizing the underrepresented population\.arXiv preprint arXiv:2401\.14512\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p3.1)\.
- H\. Parikh, R\. Ross, E\. Stuart, and K\. Rudolph \(2024b\)Who are we missing? A principled approach to characterizing the underrepresented population\.arXiv preprint arXiv:2401\.14512\.Cited by:[§1](https://arxiv.org/html/2605.21458#S1.p3.1),[§3\.2](https://arxiv.org/html/2605.21458#S3.SS2.p3.2),[§6](https://arxiv.org/html/2605.21458#S6.p2.2)\.
- J\. Pearl and E\. Bareinboim \(2011\)Transportability of causal and statistical relations: a formal approach\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.25,pp\. 247–254\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p1.1)\.
- J\. Pearl \(2009\)Causality: models, reasoning, and inference\.2nd edition,Cambridge University Press,Cambridge\.Cited by:[§A\.11](https://arxiv.org/html/2605.21458#A1.SS11.p1.4),[§A\.5](https://arxiv.org/html/2605.21458#A1.SS5.5.p3.6),[§1](https://arxiv.org/html/2605.21458#S1.p2.1),[§3\.2](https://arxiv.org/html/2605.21458#S3.SS2.p3.2),[Remark 12](https://arxiv.org/html/2605.21458#Thmremark12.p1.3.3)\.
- B\. Peherstorfer, K\. Willcox, and M\. Gunzburger \(2018\)Survey of multifidelity methods in uncertainty propagation, inference, and optimization\.SIAM Review60\(3\),pp\. 550–591\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px5.p1.1)\.
- X\. B\. Peng, M\. Andrychowicz, W\. Zaremba, and P\. Abbeel \(2018\)Sim\-to\-real transfer of robotic control with dynamics randomization\.InIEEE International Conference on Robotics and Automation \(ICRA\),pp\. 3803–3810\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px11.p1.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px3.p1.1)\.
- R\. S\. Pindyck \(1993\)Investments of uncertain cost\.Journal of Financial Economics34\(1\),pp\. 53–76\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px19.p1.1)\.
- S\. J\. Pocock \(1977\)Group sequential methods in the design and analysis of clinical trials\.Biometrika64\(2\),pp\. 191–199\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px8.p1.1)\.
- M\. Poloczek, J\. Wang, and P\. I\. Frazier \(2017\)Multi\-information source optimization\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.30,pp\. 4288–4298\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px5.p1.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1)\.
- F\. Pukelsheim \(2006\)Optimal design of experiments\.2nd edition,Classics in Applied Mathematics,SIAM\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px10.p2.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p2.1),[§4\.1](https://arxiv.org/html/2605.21458#S4.SS1.p1.4)\.
- M\. L\. Puterman \(1994\)Markov decision processes: discrete stochastic dynamic programming\.John Wiley & Sons,New York\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px1.p1.1)\.
- H\. Raiffa and R\. Schlaifer \(1961\)Applied statistical decision theory\.Harvard University Press,Boston, MA\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p1.1)\.
- T\. Rainforth, A\. Foster, D\. R\. Ivanova, and F\. Bickford Smith \(2024\)Modern Bayesian experimental design\.Statistical Science\.Note:To appear\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p2.1)\.
- A\. Ramdas, P\. Grünwald, V\. Vovk, and G\. Shafer \(2023\)Game\-theoretic statistics and safe anytime\-valid inference\.Statistical Science38\(4\),pp\. 576–601\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1)\.
- H\. Robbins \(1952\)Some aspects of the sequential design of experiments\.Bulletin of the American Mathematical Society58\(5\),pp\. 527–535\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p1.1)\.
- P\. R\. Rosenbaum \(2002\)Observational studies\.2nd edition,Springer Series in Statistics,Springer\.Cited by:[Remark 14](https://arxiv.org/html/2605.21458#Thmremark14.p1.5.5)\.
- W\. F\. Rosenberger and J\. M\. Lachin \(2012\)Randomization in clinical trials: theory and practice\.2nd edition,John Wiley & Sons,Hoboken, NJ\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px8.p1.1)\.
- E\. T\. R\. Rosenman, G\. Basse, A\. B\. Owen, and M\. Baiocchi \(2023\)Combining observational and experimental datasets using shrinkage estimators\.Biometrics79\(2\),pp\. 926–938\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p2.1)\.
- D\. J\. Russo, B\. Van Roy, A\. Kazerouni, I\. Osband, and Z\. Wen \(2018\)A tutorial on Thompson sampling\.Foundations and Trends in Machine Learning11\(1\),pp\. 1–96\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p1.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px21.p2.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p3.3),[§1](https://arxiv.org/html/2605.21458#S1.p5.1),[§2](https://arxiv.org/html/2605.21458#S2.p9.8),[§5](https://arxiv.org/html/2605.21458#S5.p2.4)\.
- D\. Russo and B\. Van Roy \(2014\)Learning to optimize via posterior sampling\.Mathematics of Operations Research39\(4\),pp\. 1221–1243\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px21.p2.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p3.3)\.
- D\. Russo \(2019\)Worst\-case regret bounds for exploration via randomized value functions\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.32\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px1.p3.1)\.
- D\. Russo \(2020\)Simple Bayesian algorithms for best\-arm identification\.Operations Research68\(6\),pp\. 1625–1647\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p2.1)\.
- E\. Salvato, G\. Fenu, E\. Medvet, and F\. A\. Pellegrino \(2021\)Crossing the reality gap: a survey on sim\-to\-real transferability of robot controllers in reinforcement learning\.IEEE Access9,pp\. 153171–153187\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px3.p1.1)\.
- D\. Simchi\-Levi and C\. Wang \(2023\)Multi\-armed bandit experimental design: online decision\-making and adaptive inference\.arXiv preprint arXiv:2206\.05324\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p3.1)\.
- C\. A\. Sims \(2003\)Implications of rational inattention\.Journal of Monetary Economics50\(3\),pp\. 665–690\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px19.p2.1)\.
- A\. Singh \(2022\)Vending machine sales\.Kaggle\.Note:[https://www\.kaggle\.com/datasets/awesomeasingh/vending\-machine\-sales](https://www.kaggle.com/datasets/awesomeasingh/vending-machine-sales)CC0: Public Domain\. 9,617 transactions from four machines in Central New Jersey\.Cited by:[3rd item](https://arxiv.org/html/2605.21458#A3.I3.i3.p1.2),[§5](https://arxiv.org/html/2605.21458#S5.p1.1)\.
- Y\. Song, Y\. Zhou, A\. Sekhari, J\. A\. Bagnell, A\. Krishnamurthy, and W\. Sun \(2023\)Hybrid RL: using both offline and online data can make RL efficient\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px4.p2.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1)\.
- A\. L\. Strehl, L\. Li, E\. Wiewiora, J\. Langford, and M\. L\. Littman \(2009\)PAC model\-free reinforcement learning\.Journal of Machine Learning Research10,pp\. 2413–2444\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px1.p1.1)\.
- M\. Strong, J\. E\. Oakley, and A\. Brennan \(2014\)Estimating multiparameter partial expected value of perfect information from a probabilistic sensitivity analysis sample: a nonparametric regression approach\.Medical Decision Making34\(3\),pp\. 311–326\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1)\.
- E\. A\. Stuart, S\. R\. Cole, C\. P\. Bradshaw, and P\. J\. Leaf \(2011\)The use of propensity scores to assess the generalizability of results from randomized trials\.Journal of the Royal Statistical Society: Series A \(Statistics in Society\)174\(2\),pp\. 369–386\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p1.1)\.
- Y\. Sui, A\. Gotovos, J\. W\. Burdick, and A\. Krause \(2015\)Safe exploration for optimization with Gaussian processes\.InInternational Conference on Machine Learning \(ICML\),pp\. 997–1005\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px7.p1.1)\.
- R\. S\. Sutton and A\. G\. Barto \(2018\)Reinforcement learning: an introduction\.2nd edition,MIT Press,Cambridge, MA\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px1.p1.1)\.
- R\. S\. Sutton \(1991\)Dyna, an integrated architecture for learning, planning, and reacting\.ACM SIGART Bulletin2\(4\),pp\. 160–163\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px4.p1.1)\.
- Z\. Tan \(2006\)A distributional approach for causal inference using propensity scores\.Journal of the American Statistical Association101\(476\),pp\. 1619–1637\.Cited by:[§A\.13](https://arxiv.org/html/2605.21458#A1.SS13.p1.3),[Remark 14](https://arxiv.org/html/2605.21458#Thmremark14.p1.5.5)\.
- D\. Tang, A\. Agarwal, D\. O’Brien, and M\. Meyer \(2010\)Overlapping experiment infrastructure: more, better, faster experimentation\.InProceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,pp\. 17–26\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1)\.
- W\. R\. Thompson \(1933\)On the likelihood that one unknown probability exceeds another in view of the evidence of two samples\.Biometrika25\(3–4\),pp\. 285–294\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p1.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p3.3)\.
- J\. Tobin, R\. Fong, A\. Ray, J\. Schneider, W\. Zaremba, and P\. Abbeel \(2017\)Domain randomization for transferring deep neural networks from simulation to the real world\.InIEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\),pp\. 23–30\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px11.p1.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1),[§6](https://arxiv.org/html/2605.21458#S6.p4.2)\.
- S\. S\. Villar, J\. Bowden, and J\. Wason \(2015\)Multi\-armed bandit models for the optimal design of clinical trials: benefits and challenges\.Statistical Science30\(2\),pp\. 199–215\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px8.p1.1)\.
- E\. Vivalt \(2020\)How much can we generalize from impact evaluations?\.Journal of the European Economic Association18\(6\),pp\. 3045–3089\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p3.1)\.
- A\. Wagenmaker and K\. Jamieson \(2024\)Overcoming the sim\-to\-real gap: leveraging simulation to learn to explore for real\-world RL\.arXiv preprint arXiv:2410\.20254\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px3.p2.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1),[§6](https://arxiv.org/html/2605.21458#S6.p4.2)\.
- L\. Wang, Z\. Yang, and Z\. Wang \(2021\)Provably efficient causal reinforcement learning with confounded observational data\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.34\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p2.1),[§1](https://arxiv.org/html/2605.21458#S1.p5.1)\.
- J\. L\. Warren, O\. Prunas, A\. D\. Paltiel, T\. Thornhill, and G\. S\. Gonsalves \(2025\)Integrating testing volume into bandit algorithms for infectious disease surveillance\.Journal of the Royal Statistical Society: Series A188\(4\),pp\. 1029–1043\.External Links:[Document](https://dx.doi.org/10.1093/jrsssa/qnae090)Cited by:[§D\.4](https://arxiv.org/html/2605.21458#A4.SS4.p1.3),[§1](https://arxiv.org/html/2605.21458#S1.p1.1),[§5\.2](https://arxiv.org/html/2605.21458#S5.SS2.p1.8),[§5](https://arxiv.org/html/2605.21458#S5.p1.1)\.
- T\. Weissman, E\. Ordentlich, G\. Seroussi, S\. Verdú, and M\. J\. Weinberger \(2003\) Inequalities for the L1 deviation of the empirical distribution\. Technical Report HPL\-2003\-97, Hewlett\-Packard Labs\. Cited by: [§A\.1](https://arxiv.org/html/2605.21458#A1.SS1.SSS0.Px5.4.p4.6), [§3\.1](https://arxiv.org/html/2605.21458#S3.SS1.p3.3)\.
- E\. C\. F\. Wilson \(2015\)A practical guide to value of information analysis\.PharmacoEconomics33\(2\),pp\. 105–121\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px17.p1.1),[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px9.p1.1)\.
- J\. Woodcock and L\. M\. LaVange \(2017\)Master protocols to study multiple therapies, multiple diseases, or both\.New England Journal of Medicine377\(1\),pp\. 62–70\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px8.p1.1)\.
- P\. Wu, A\. Escontrela, D\. Hafner, P\. Abbeel, and K\. Goldberg \(2023\)DayDreamer: world models for physical robot learning\.Conference on Robot Learning \(CoRL\),pp\. 2226–2240\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px4.p1.1)\.
- Y\. Xu and A\. Zeevi \(2023\) Bayesian design principles for frequentist sequential learning\. In International Conference on Machine Learning \(ICML\), Note: PMLR vol\. 202; arXiv:2310\.00806\. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px11.p3.1)\.
- S\. Yang and P\. Ding \(2020\)Combining multiple observational data sources to estimate causal effects\.Journal of the American Statistical Association115\(531\),pp\. 1540–1554\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px16.p2.1)\.
- T\. Yu, G\. Thomas, L\. Yu, S\. Ermon, J\. Zou, S\. Levine, C\. Finn, and T\. Ma \(2020\)MOPO: model\-based offline policy optimization\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.33,pp\. 14129–14142\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p1.1)\.
- J\. Zhang and E\. Bareinboim \(2016\) Markov decision processes with unobserved confounders: a causal approach\. Technical Report R\-23, Purdue AI Lab\. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px15.p2.1), [§1](https://arxiv.org/html/2605.21458#S1.p3.1), [§1](https://arxiv.org/html/2605.21458#S1.p5.1)\.
- K\. Zhang, L\. Janson, and S\. Murphy \(2020\)Inference for batched bandits\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.33,pp\. 9818–9829\.Cited by:[Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px18.p1.1)\.
- Y\. Zhao, K\. Jun, T\. Fiez, and L\. Jain \(2024\) Adaptive experimentation when you can’t experiment\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 37\. Note: arXiv:2406\.10738\. Cited by: [Appendix H](https://arxiv.org/html/2605.21458#A8.SS0.SSS0.Px2.p3.1)\.

## Appendix A Proofs for Section 3 and Stateless Warmup

This appendix contains the proofs for the results of Section [3](https://arxiv.org/html/2605.21458#S3) and develops the stateless and contextual warmup settings that motivate the main-text threshold intuition.

### A.1 Proof of Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1) (extended simulation lemma)

##### Setup\.

The simulator $\hat{\mathcal{M}}_{\mathrm{sim}}=(\hat{\rho}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}})$ is calibrated from historical data generated under a behavior policy $\pi^{\mathrm{beh}}(a\mid s,h)$ that could depend on the hidden state $H$. We use the named kernels of Section [2](https://arxiv.org/html/2605.21458#S2) throughout. The deployment-time Markov projection is

$$\rho^{\star}_{\mathrm{obs}}(s,a):=\mathbb{E}_{h\sim\mathbb{P}_{t}(h\mid s)}[\bar{r}(s,a,h)],\qquad\wp^{\star}_{\mathrm{obs}}(s^{\prime}\mid s,a):=\mathbb{E}_{h\sim\mathbb{P}_{t}(h\mid s)}[\wp_{S}(s^{\prime}\mid s,a,h)],$$

and the calibration-asymptotic kernel (the simulator’s training-procedure limit at infinite calibration data and a perfect parametric family) is

$$\bar{\rho}_{\mathrm{calib}}(s,a):=\mathbb{E}_{h\sim\mathbb{P}_{\mathrm{calib}}(h\mid s,a)}[\bar{r}(s,a,h)],\qquad\bar{\wp}_{\mathrm{calib}}(s^{\prime}\mid s,a):=\mathbb{E}_{h\sim\mathbb{P}_{\mathrm{calib}}(h\mid s,a)}[\wp_{S}(s^{\prime}\mid s,a,h)].$$

Under Assumption [5](https://arxiv.org/html/2605.21458#Thmassumption5)(a)–(c), the actual simulator deviates from the calibration-asymptotic kernel by per-pair residuals,

$$\begin{align}
\hat{\rho}_{\mathrm{sim}}(s,a) &= \bar{\rho}_{\mathrm{calib}}(s,a)+\xi_{r}^{m}(s,a), \tag{13}\\
\hat{\wp}_{\mathrm{sim}}(s^{\prime}\mid s,a) &= \bar{\wp}_{\mathrm{calib}}(s^{\prime}\mid s,a)+\xi_{p}^{m}(s,a,s^{\prime}), \tag{14}
\end{align}$$

where $\xi_{r}^{m}$ and $\xi_{p}^{m}$ collect functional-form mis-specification, finite-sample calibration noise, and prior/regularization artifacts; both are properties of the simulator’s training procedure, *not* of the hidden-state distribution, so they are not reduced by any deployment-time intervention (Assumption [5](https://arxiv.org/html/2605.21458#Thmassumption5)(d) freezes the simulator).

##### Decomposition of the simulator\-to\-projection gap\.

Apply the triangle inequality through $\bar{\rho}_{\mathrm{calib}}$ and $\bar{\wp}_{\mathrm{calib}}$:

$$\epsilon_{r}(s,a):=|\hat{\rho}_{\mathrm{sim}}(s,a)-\rho^{\star}_{\mathrm{obs}}(s,a)|\;\leq\;\epsilon_{r}^{h}(s,a)+\epsilon_{r}^{m}(s,a),$$

with

$$\begin{align}
\epsilon_{r}^{h}(s,a) &:= \bigl|\rho^{\star}_{\mathrm{obs}}(s,a)-\bar{\rho}_{\mathrm{calib}}(s,a)\bigr|, \tag{15}\\
\epsilon_{r}^{m}(s,a) &:= |\xi_{r}^{m}(s,a)|. \tag{16}
\end{align}$$

The *calibration–deployment* component $\epsilon_{r}^{h}$ captures two structural differences between the calibration conditional $\mathbb{P}_{\mathrm{calib}}(H\mid S,A)$ and the deployment conditional $\mathbb{P}_{t}(H\mid S)$: (i) *confounding*: the calibration distribution conditions on $a$ (because $\pi^{\mathrm{beh}}$ depended on $h$), while the deployment projection conditions only on $s$; (ii) *drift*: the calibration-time and deployment-time marginals of $h\mid s$ differ even ignoring the conditioning on $a$. Both effects are folded into the difference of expectations in ([15](https://arxiv.org/html/2605.21458#A1.E15)); App. [A.12](https://arxiv.org/html/2605.21458#A1.SS12) separates them. The *misspecification residual* $\epsilon_{r}^{m}$ captures everything else. The analogous decomposition holds for transitions:

$$\begin{align}
\epsilon_{p}(s,a) &\leq \epsilon_{p}^{h}(s,a)+\epsilon_{p}^{m}(s,a), \tag{17}\\
\epsilon_{p}^{h}(s,a) &:= \|\wp^{\star}_{\mathrm{obs}}(\cdot\mid s,a)-\bar{\wp}_{\mathrm{calib}}(\cdot\mid s,a)\|_{1}, \tag{18}\\
\epsilon_{p}^{m}(s,a) &:= \|\xi_{p}^{m}(s,a,\cdot)\|_{1}. \tag{19}
\end{align}$$
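As a small numerical illustration of the calibration–deployment component in (15), the sketch below evaluates $\epsilon_{r}^{h}$ exactly at a single state-action pair with a binary hidden state. The function name `eps_r_h`, the numbers, and the assumption that the calibration conditional is obtained from the behavior policy by Bayes' rule are illustrative choices, not taken from the paper.

```python
def eps_r_h(q_calib, q_deploy, beh_prob_a, rbar):
    """Exact |rho_obs_star - rho_bar_calib| at one (s, a) with binary hidden state h.

    q_calib, q_deploy: P(h=1 | s) at calibration and at deployment time.
    beh_prob_a: (pi_beh(a | s, h=0), pi_beh(a | s, h=1)).
    rbar: (rbar(s, a, h=0), rbar(s, a, h=1)).
    """
    # Deployment projection: average over P_t(h | s) only.
    rho_obs = (1 - q_deploy) * rbar[0] + q_deploy * rbar[1]
    # Calibration conditional P_calib(h | s, a) via Bayes' rule (illustrative assumption).
    w0 = (1 - q_calib) * beh_prob_a[0]
    w1 = q_calib * beh_prob_a[1]
    post1 = w1 / (w0 + w1)
    rho_calib = (1 - post1) * rbar[0] + post1 * rbar[1]
    return abs(rho_obs - rho_calib)


rbar = (0.2, 0.8)  # hypothetical conditional mean rewards
no_shift = eps_r_h(0.5, 0.5, (0.5, 0.5), rbar)    # policy ignores h, no drift
confounded = eps_r_h(0.5, 0.5, (0.1, 0.9), rbar)  # behavior policy depends on h
drift_only = eps_r_h(0.5, 0.7, (0.5, 0.5), rbar)  # hidden-state marginal moved
print(no_shift, confounded, drift_only)
```

In this toy setting the gap vanishes only when the behavior policy ignores $h$ and the hidden-state conditional has not moved; either effect alone produces a nonzero $\epsilon_{r}^{h}$.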

##### Reducibility of $\epsilon^{h}$.

An $S$-measurable randomized policy $\pi^{\mathrm{exp}}(a\mid s,h)=\pi^{\mathrm{exp}}(a\mid s)$ severs the $H\to A$ edge (Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2)), so pilot observations at $(s,a)$ are i.i.d. draws (marginally, across units) from $\mathbb{P}(R\mid s,\mathrm{do}(a))$ and $\mathbb{P}(S^{\prime}\mid s,\mathrm{do}(a))$, i.e., from the interventional distributions that define $\rho^{\star}_{\mathrm{obs}}$ and $\wp^{\star}_{\mathrm{obs}}$. The empirical means therefore estimate $\rho^{\star}_{\mathrm{obs}}$ and $\wp^{\star}_{\mathrm{obs}}$ consistently; with $n$ pilot samples, the residual error at $(s,a)$ is $O(1/\sqrt{n})$, which shrinks to zero as $n\to\infty$. This reduces $\epsilon_{r}^{h}+\epsilon_{p}^{h}$ to finite-sample noise.

##### Irreducibility of $\epsilon^{m}$.

The residuals $\xi_{r}^{m},\xi_{p}^{m}$ are properties of the simulator’s training procedure (parametric family, regularizer, finite calibration sample), not of the hidden-state distribution. No amount of randomized interaction at $(s,a)$ reveals information about $\xi^{m}$, because randomization at deployment time corrects the $H$-conditioning but not the simulator’s intrinsic parametric bias. In particular, $\epsilon^{m}$ is reducible only by changing the simulator itself (retraining, a richer functional family, more calibration data of better quality), operations the planner has ruled out by Assumption [5](https://arxiv.org/html/2605.21458#Thmassumption5)(d) (fixed simulator).

##### Value bound\.

With these per-$(s,a)$ decompositions, the sup-norm versions $\epsilon_{r}:=\max_{s,a}\epsilon_{r}(s,a)$ and $\epsilon_{p}:=\max_{s,a}\epsilon_{p}(s,a)$ satisfy $\epsilon_{r}\leq\epsilon_{r}^{m}+\epsilon_{r}^{h}$ and $\epsilon_{p}\leq\epsilon_{p}^{m}+\epsilon_{p}^{h}$ with the sups taken correspondingly. The value inequality ([5](https://arxiv.org/html/2605.21458#S3.E5)) is then the classical simulation lemma of Kearns and Singh [[2002](https://arxiv.org/html/2605.21458#bib.bib114)] applied to the pair of MDPs $(\hat{\rho}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}})$ and $(\rho^{\star}_{\mathrm{obs}},\wp^{\star}_{\mathrm{obs}})$: for any policy $\pi$ depending only on observed states,

$$|V^{\pi}(\mathcal{M}^{\star}_{\mathrm{obs}})-V^{\pi}(\hat{\mathcal{M}}_{\mathrm{sim}})|\leq\frac{2\epsilon_{r}}{1-\gamma}+\frac{2\gamma R_{\max}\epsilon_{p}}{(1-\gamma)^{2}}=\frac{2}{1-\gamma}\left(\epsilon_{r}+\frac{\gamma R_{\max}}{1-\gamma}\,\epsilon_{p}\right).$$
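To get a feel for how the value bound above scales with the discount factor, the following sketch evaluates its right-hand side for a few values of $\gamma$. The function name `value_error_bound` and the error magnitudes are hypothetical; the sup-norm errors are formed from the upper bounds $\epsilon_{r}\leq\epsilon_{r}^{h}+\epsilon_{r}^{m}$ and $\epsilon_{p}\leq\epsilon_{p}^{h}+\epsilon_{p}^{m}$.

```python
def value_error_bound(eps_r, eps_p, gamma, r_max):
    """Right-hand side of the simulation-lemma inequality for sup-norm errors."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 2.0 / (1.0 - gamma) * (eps_r + gamma * r_max / (1.0 - gamma) * eps_p)


# Hypothetical error components (not from the paper).
eps_r_h, eps_r_m = 0.08, 0.02
eps_p_h, eps_p_m = 0.05, 0.01
r_max = 1.0

# Upper bounds on the sup-norm errors from the triangle inequality.
eps_r = eps_r_h + eps_r_m
eps_p = eps_p_h + eps_p_m

bounds = {g: value_error_bound(eps_r, eps_p, g, r_max) for g in (0.5, 0.9, 0.99)}
for g, b in bounds.items():
    print(f"gamma={g}: bound={b:.2f}")
```

The transition term carries an extra factor of $1/(1-\gamma)$, so it dominates the bound as $\gamma$ approaches one.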

###### Proof of Corollary [1](https://arxiv.org/html/2605.21458#Thmcorollary1).

Let $\mathcal{I}\subseteq\mathbb{S}\times\mathbb{A}$ be the pilot-covered pairs and let the pilot be an $S$-measurable randomized policy with $n_{s,a}$ samples at $(s,a)\in\mathcal{I}$. By Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2), reward and next-state samples at $(s,a)$ under the pilot are i.i.d. draws (marginally, across units) from $\mathbb{P}(R\mid s,\mathrm{do}(a))$ and $\mathbb{P}(S^{\prime}\mid s,\mathrm{do}(a))$, i.e., from the kernels that define $\rho^{\star}_{\mathrm{obs}}$ and $\wp^{\star}_{\mathrm{obs}}$.

*Reward term.* For each $(s,a)\in\mathcal{I}$, let $\hat{\rho}_{\mathrm{pilot}}(s,a)$ be the empirical mean of $n_{s,a}$ pilot rewards. Since rewards lie in $[0,R_{\max}]$, Azuma–Hoeffding gives

$$\mathbb{P}\bigl(|\hat{\rho}_{\mathrm{pilot}}(s,a)-\rho^{\star}_{\mathrm{obs}}(s,a)|>u\bigr)\leq 2\exp(-2n_{s,a}u^{2}/R_{\max}^{2}),$$

so with probability at least $1-\delta/(2|\mathcal{I}|)$, $|\hat{\rho}_{\mathrm{pilot}}(s,a)-\rho^{\star}_{\mathrm{obs}}(s,a)|\leq R_{\max}\sqrt{\log(4|\mathcal{I}|/\delta)/(2n_{s,a})}$. Union-bounding over $\mathcal{I}$ gives $|\hat{\rho}_{\mathrm{pilot}}(s,a)-\rho^{\star}_{\mathrm{obs}}(s,a)|\leq c_{r}\sqrt{\log(|\mathcal{I}|/\delta)/n_{s,a}}$ uniformly on $\mathcal{I}$, where $c_{r}:=R_{\max}/\sqrt{2}$ is a constant depending only on $R_{\max}$ and $|\mathcal{I}|$ appears inside the log (so the dependence on $|\mathcal{I}|$ is $\sqrt{\log|\mathcal{I}|}$, which we absorb into the $c_{r}\sqrt{\log(1/\delta)/n_{s,a}}$ form of the corollary by re-defining $\delta\leftarrow\delta/|\mathcal{I}|$ where needed).

The updated reward-error bound replaces $\hat{\rho}_{\mathrm{sim}}$ by the pilot-refined estimate on $\mathcal{I}$: for $(s,a)\in\mathcal{I}$, the post-pilot reward error is at most $c_{r}\sqrt{\log(1/\delta)/n_{s,a}}$; for $(s,a)\notin\mathcal{I}$, it remains $\epsilon_{r}^{h}(s,a)+\epsilon_{r}^{m}(s,a)\leq\max_{(s,a)\notin\mathcal{I}}\epsilon_{r}^{h}(s,a)+\epsilon_{r}^{m}$. Plugging these into the Kearns–Singh sup-norm bound of Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1) yields the reward-contribution term.

*Transition term.* The same argument applied to the empirical Dirichlet mean $\hat{\wp}_{\mathrm{pilot}}(\cdot\mid s,a)$, using Weissman et al. [[2003](https://arxiv.org/html/2605.21458#bib.bib167)]’s $L_{1}$ deviation inequality for $|\mathbb{S}|$-dimensional empirical distributions, gives $\|\hat{\wp}_{\mathrm{pilot}}(\cdot\mid s,a)-\wp^{\star}_{\mathrm{obs}}(\cdot\mid s,a)\|_{1}\leq c_{p}\sqrt{|\mathbb{S}|\log(1/\delta)/n_{s,a}}$ uniformly on $\mathcal{I}$ with probability at least $1-\delta/2$. Combining the two tail events by a final union bound and substituting into the sup-norm bound of Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1) yields the stated corollary. ∎
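The reward-term radius in the proof above can be turned into a rough pilot-sizing rule. The sketch below evaluates $R_{\max}\sqrt{\log(4|\mathcal{I}|/\delta)/(2n_{s,a})}$ and inverts it for the per-pair sample count; the function names and the example inputs are hypothetical, and the constants follow the proof's union bound rather than any tuned recommendation.

```python
import math


def reward_radius(n_sa, num_pairs, delta, r_max):
    """Hoeffding radius per pair; holds jointly over num_pairs pairs w.p. >= 1 - delta/2."""
    return r_max * math.sqrt(math.log(4 * num_pairs / delta) / (2 * n_sa))


def samples_for_radius(target, num_pairs, delta, r_max):
    """Smallest per-pair sample count whose radius is at most `target`."""
    return math.ceil(r_max**2 * math.log(4 * num_pairs / delta) / (2 * target**2))


# Hypothetical pilot: 20 covered pairs, delta = 0.05, rewards in [0, 1].
n_needed = samples_for_radius(target=0.1, num_pairs=20, delta=0.05, r_max=1.0)
print(n_needed, reward_radius(n_needed, 20, 0.05, 1.0))
```

Because $|\mathcal{I}|$ enters only through the logarithm, doubling the number of covered pairs changes the per-pair requirement only slightly, although the total pilot budget still grows roughly linearly in $|\mathcal{I}|$.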

### A.2 Stateless warmup: finite-sample regret and thresholds

In the stateless case (two periods, $n$ units, two actions, prior $\Delta\sim\mathcal{N}(\delta_{0},\sigma_{0}^{2})$), the net value of experimenting with $n_{e}$ units is

$$\Delta V(n_{e})\;=\;-\,\frac{n_{e}\,\delta_{0}}{2}\;+\;\gamma\,n\,\nu(n_{e})\,\psi\bigl(\delta_{0}/\nu(n_{e})\bigr),$$

where $\nu^{2}(n_{e})=\sigma_{0}^{4}n_{e}\big/(4\sigma^{2}+\sigma_{0}^{2}n_{e})$ and $\psi(r)=\phi(r)-r\,\Phi(-r)$.

###### Theorem 1 (Finite-sample regret).

$\mathrm{BayesRegret}(n_{e}{=}0)=\gamma\,n\,\sigma_{0}\,\psi(\kappa)$, where $\kappa=\delta_{0}/\sigma_{0}$. For any fixed $\kappa>0$, the optimal experiment size $n_{e}^{\star}=n_{e}^{\star}(n,\gamma,\sigma^{2},\sigma_{0}^{2},\kappa)$ is interior (not at the budget constraint) and satisfies

$$n_{e}^{\star}\;=\;\Theta\!\left(\frac{\sigma}{\sigma_{0}}\sqrt{\frac{\gamma n}{\kappa}}\right)\;=\;\Theta\!\left(\sqrt{\frac{\sigma^{2}\,\gamma n}{\delta_{0}\,\sigma_{0}}}\right)\qquad\text{as }n\to\infty, \tag{20}$$

i.e., $n_{e}^{\star}$ grows as $\sqrt{n}$ with a prefactor $\sqrt{\sigma^{2}/(\delta_{0}\sigma_{0})}$ (equivalently, $\sigma/\sigma_{0}\cdot\kappa^{-1/2}$). As $\kappa\to 0^{+}$ (uninformative prior), the interior argmax disappears and $n_{e}^{\star}$ hits the budget constraint $n$, matching the intuition that an uninformative prior justifies an arbitrarily large pilot relative to its cost.

###### Proof\.

The oracle value is $\mathbb{E}[\max(\Delta,0)]=\delta_{0}\,\Phi(\kappa)+\sigma_{0}\,\phi(\kappa)$. The SOP value is $n(\mu_{0}+\delta_{0})$. The Bayes regret is therefore $\gamma\,n\,\sigma_{0}\,\psi(\kappa)$. After an experiment with $n_{e}$ units, the preposterior is $\delta_{1}\sim\mathcal{N}(\delta_{0},\nu^{2})$, yielding net value $-n_{e}\,\delta_{0}/2+\gamma\,n\,\nu\,\psi(\delta_{0}/\nu)$. Since $\psi^{\prime}(r)=-\Phi(-r)$, we have $\tfrac{d}{d\nu}\bigl[\nu\,\psi(\delta_{0}/\nu)\bigr]=\phi(\delta_{0}/\nu)$, so the first-order condition is $\delta_{0}/2=\gamma\,n\,(d\nu/dn_{e})\,\phi(\delta_{0}/\nu)$. We have $\nu^{2}(n_{e})=\sigma_{0}^{4}n_{e}/(4\sigma^{2}+\sigma_{0}^{2}n_{e})$, so $\nu\to\sigma_{0}$ as $n_{e}\to\infty$ and $d\nu^{2}/dn_{e}=4\sigma^{2}\sigma_{0}^{4}/(4\sigma^{2}+\sigma_{0}^{2}n_{e})^{2}$, giving $d\nu/dn_{e}=2\sigma^{2}\sigma_{0}^{4}/\bigl((4\sigma^{2}+\sigma_{0}^{2}n_{e})^{2}\nu\bigr)$, which scales as $2\sigma^{2}/(\sigma_{0}n_{e}^{2})$ for large $n_{e}$ (using $\nu\to\sigma_{0}$ and $(4\sigma^{2}+\sigma_{0}^{2}n_{e})^{2}\sim\sigma_{0}^{4}n_{e}^{2}$). Solving the FOC asymptotically, $\delta_{0}/2\sim\gamma n\cdot 2\sigma^{2}/(\sigma_{0}n_{e}^{2})\cdot\phi(\kappa)$, so $(n_{e}^{\star})^{2}\sim 4\gamma n\,\sigma^{2}\phi(\kappa)/(\delta_{0}\sigma_{0})$, i.e.

$$n_{e}^{\star}\;\sim\;2\sqrt{\frac{\gamma n\,\sigma^{2}\,\phi(\kappa)}{\delta_{0}\,\sigma_{0}}}\;=\;\Theta\!\left(\frac{\sigma}{\sigma_{0}}\sqrt{\frac{\gamma n}{\kappa}}\right)$$

(using $\delta_{0}=\kappa\sigma_{0}$ and absorbing $\phi(\kappa)$ into the constant at fixed $\kappa$), confirming ([20](https://arxiv.org/html/2605.21458#A1.E20)).

For the $\kappa\to 0^{+}$ behavior: at $\kappa=0$ ($\delta_{0}=0$), the FOC becomes $0=\gamma n(d\nu/dn_{e})/\sqrt{2\pi}$, which fails at any finite $n_{e}$ since $d\nu/dn_{e}>0$; hence the net value is strictly increasing in $n_{e}$ over the allowed range, and the argmax is at the budget constraint $n_{e}=n$ rather than at a finite interior point. This recovers the intuition that uninformative priors justify arbitrarily large pilots. ∎
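The asymptotic rate in (20) can be checked numerically. The sketch below evaluates $\Delta V(n_{e})$ from the closed form above, finds the maximizer by brute force over $n_{e}\in\{0,\dots,n\}$, and compares it with $2\sqrt{\gamma n\sigma^{2}\phi(\kappa)/(\delta_{0}\sigma_{0})}$. The parameter values and helper names are illustrative assumptions; the formula for $\Delta V$ is the one stated in this appendix and assumes $\delta_{0}>0$.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def psi(r):
    return phi(r) - r * Phi(-r)


def net_value(n_e, n, gamma, delta0, sigma, sigma0):
    """Delta V(n_e) for delta0 > 0."""
    if n_e == 0:
        return 0.0  # nu(0) = 0 and nu * psi(delta0 / nu) -> 0 as nu -> 0+
    nu = math.sqrt(sigma0**4 * n_e / (4.0 * sigma**2 + sigma0**2 * n_e))
    return -n_e * delta0 / 2.0 + gamma * n * nu * psi(delta0 / nu)


def best_pilot_size(n, gamma, delta0, sigma, sigma0):
    """Brute-force argmax of Delta V over integer pilot sizes 0..n."""
    return max(range(n + 1), key=lambda k: net_value(k, n, gamma, delta0, sigma, sigma0))


def asymptotic_pilot_size(n, gamma, delta0, sigma, sigma0):
    kappa = delta0 / sigma0
    return 2.0 * math.sqrt(gamma * n * sigma**2 * phi(kappa) / (delta0 * sigma0))


# Hypothetical parameters chosen so that n_e* is large relative to 4*sigma^2/sigma0^2.
n, gamma, delta0, sigma, sigma0 = 100_000, 1.0, 0.5, 1.0, 1.0
n_star = best_pilot_size(n, gamma, delta0, sigma, sigma0)
n_asym = asymptotic_pilot_size(n, gamma, delta0, sigma, sigma0)
print(n_star, round(n_asym, 1))
```

The two numbers should agree closely at this scale; for small $n$ the large-$n_{e}$ approximations $\nu\approx\sigma_{0}$ and $(4\sigma^{2}+\sigma_{0}^{2}n_{e})^{2}\approx\sigma_{0}^{4}n_{e}^{2}$ are less accurate and the gap widens.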

In the contextual case ($K$ actions, independent priors), the optimal allocation equalizes the augmented index across active actions:

$$\underbrace{m_{a}}_{\text{earn}}+\underbrace{\gamma\,n\,\mathrm{KG}_{a}(\mathbf{n}^{\star})}_{\text{learn}}=\lambda\qquad\forall\,a\text{ with }n_{a}^{\star}>0, \tag{21}$$

where $\mathrm{KG}_{a}$ denotes the knowledge gradient [Frazier et al., [2008](https://arxiv.org/html/2605.21458#bib.bib48)]. The SOP is optimal if and only if $m_{a^{\star}_{\mathrm{sim}}}-m_{a}\geq\gamma\,n\,[\mathrm{KG}_{a}-\mathrm{KG}_{a^{\star}_{\mathrm{sim}}}]$ for all $a$. Over $T{+}1$ periods, beliefs evolve via conjugate updating [Duff, [2002](https://arxiv.org/html/2605.21458#bib.bib116)].

### A.3 Empirical Illustration: Stateless Threshold

This subsection provides an empirical illustration of the stateless theory developed above. We consider $n=100$ units, $K=2$ actions, $\sigma^{2}=1$, $\sigma_{0}=1$, and sweep $\kappa=\delta_{0}/\sigma_{0}$ over $[0,2]$ for $\gamma\in\{0.5,0.8,0.9,1.0\}$. The analytical threshold $\kappa^{\star}(\gamma)$ is the root of $2\gamma\,\psi(\kappa)=\kappa$.
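The threshold equation $2\gamma\,\psi(\kappa)=\kappa$ has no closed-form solution but is easy to solve numerically, since $\psi$ is strictly decreasing. The following sketch uses plain bisection on $[0,2]$ for the four discount factors in the sweep; the function names and the bracketing interval are assumptions made for illustration, not the paper's code.

```python
import math


def psi(r):
    """psi(r) = phi(r) - r * Phi(-r) for the standard normal phi, Phi."""
    density = math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)
    cdf_neg = 0.5 * (1.0 + math.erf(-r / math.sqrt(2.0)))
    return density - r * cdf_neg


def kappa_threshold(gamma, lo=0.0, hi=2.0, tol=1e-12):
    """Root of 2*gamma*psi(kappa) = kappa on [lo, hi] by bisection."""
    def g(k):
        return 2.0 * gamma * psi(k) - k

    if g(lo) <= 0.0 or g(hi) >= 0.0:
        raise ValueError("root not bracketed on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid  # g is decreasing, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


thresholds = {gamma: kappa_threshold(gamma) for gamma in (0.5, 0.8, 0.9, 1.0)}
for gamma, kappa in thresholds.items():
    print(f"gamma={gamma}: kappa*={kappa:.4f}")
```

The root increases with $\gamma$: a more patient planner tolerates a larger standardized prior gap before experimentation stops paying for itself.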

Figure [A3](https://arxiv.org/html/2605.21458#A1.F3) presents three panels. Panel (a) shows that the net value $\Delta V(n_{e}^{\star})$ crosses zero at $\kappa^{\star}$, matching the analytical threshold. Panel (b) confirms that the optimal experiment size $n_{e}^{\star}$ decreases monotonically in $\kappa$. Panel (c) provides Monte Carlo validation ($N_{\mathrm{MC}}=10{,}000$, $\gamma=1.0$), confirming agreement with the closed-form formula to within $0.5\%$.

![Refer to caption](https://arxiv.org/html/2605.21458v1/x3.png)

Figure A3: Stateless experimentation threshold. (a) Net value $\Delta V(n_{e}^{\star})$ versus $\kappa$ for four discount factors. Dotted lines indicate the analytical threshold $\kappa^{\star}(\gamma)$. (b) Optimal experiment size $n_{e}^{\star}$ versus $\kappa$. (c) Monte Carlo validation ($\gamma=1.0$): analytical (solid) versus simulated (hatched, $\pm 2\,\mathrm{SE}$).

### A.4 Value decomposition (proof of Theorem [2](https://arxiv.org/html/2605.21458#Thmtheorem2))

###### Theorem 2 (Value decomposition bound).

Let $\mathcal{M}^{\star}_{\mathrm{obs}}$ denote the true observed-state MDP and let $V^{\mathrm{oracle}}_{\mathrm{obs}}:=V^{\star}(\rho^{\star}_{\mathrm{obs}},\wp^{\star}_{\mathrm{obs}})$ be its optimal value. Define the single-perturbation value differences (holding the simulator-optimal policy $\pi^{\star}_{\mathrm{sim}}$ fixed):

$$\begin{align}
\Delta_{r} &:= V^{\pi^{\star}_{\mathrm{sim}}}(\rho^{\star}_{\mathrm{obs}},\hat{\wp}_{\mathrm{sim}})-V^{\pi^{\star}_{\mathrm{sim}}}(\hat{\rho}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}}), \tag{22}\\
\Delta_{p} &:= V^{\pi^{\star}_{\mathrm{sim}}}(\hat{\rho}_{\mathrm{sim}},\wp^{\star}_{\mathrm{obs}})-V^{\pi^{\star}_{\mathrm{sim}}}(\hat{\rho}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}}), \tag{23}
\end{align}$$

where $V^{\pi}(\rho,\wp)$ denotes the value of policy $\pi$ in the MDP with rewards $\rho$ and transitions $\wp$. Then under Assumptions [1](https://arxiv.org/html/2605.21458#Thmassumption1)–[3](https://arxiv.org/html/2605.21458#Thmassumption3), Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1) applied to each one-sided perturbation gives $|\Delta_{r}|\leq U_{r}:=2\epsilon_{r}/(1-\gamma)$ and $|\Delta_{p}|\leq U_{p}:=2\gamma R_{\max}\epsilon_{p}/(1-\gamma)^{2}$, and the ratio of worst-case bounds satisfies

$$\frac{U_{p}}{U_{r}}\;=\;\frac{\gamma R_{\max}\epsilon_{p}}{(1-\gamma)\,\epsilon_{r}}\;\to\;\infty\quad\text{as}\quad\gamma\to 1\quad\text{for fixed }\epsilon_{r},\epsilon_{p}>0. \tag{24}$$

The full oracle gap $V^{\mathrm{oracle}}_{\mathrm{obs}}-V^{\pi^{\star}_{\mathrm{sim}}}(\hat{\rho}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}})$ decomposes as $\Delta_{r}+\Delta_{p}+\Delta_{\mathrm{pol}}+\Delta_{\mathrm{int}}$, where $\Delta_{\mathrm{pol}}:=V^{\mathrm{oracle}}_{\mathrm{obs}}-V^{\pi^{\star}_{\mathrm{sim}}}(\rho^{\star}_{\mathrm{obs}},\wp^{\star}_{\mathrm{obs}})$ is the policy-misspecification gap (always $\geq 0$) and $\Delta_{\mathrm{int}}$ is the residual interaction term captured by the two simultaneous perturbations. The individual reward-only bound $|\Delta_{r}|\leq U_{r}$ and the transition-only bound $|\Delta_{p}|\leq U_{p}$ are about *single-source* perturbations at the fixed policy $\pi^{\star}_{\mathrm{sim}}$; they do not individually bound the full oracle gap, which also includes $\Delta_{\mathrm{pol}}$.

###### Proof of Theorem [2](https://arxiv.org/html/2605.21458#Thmtheorem2).

The bound on $\Delta_{r}$ follows by applying Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1) to the pair of MDPs $(\rho^{\star}_{\mathrm{obs}},\hat{\wp}_{\mathrm{sim}})$ vs. $(\hat{\rho}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}})$: the transitions match, so the lemma’s transition-error term vanishes ($\epsilon_{p}^{(r)}=0$) and the bound reduces to $2\epsilon_{r}/(1-\gamma)$. The bound on $\Delta_{p}$ follows analogously with the reward-error term vanishing and gives $2\gamma R_{\max}\epsilon_{p}/(1-\gamma)^{2}$. The additive decomposition $V^{\mathrm{oracle}}_{\mathrm{obs}}-V^{\pi^{\star}_{\mathrm{sim}}}(\hat{\rho}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}})=\Delta_{r}+\Delta_{p}+\Delta_{\mathrm{pol}}+\Delta_{\mathrm{int}}$ is immediate from the definitions by telescoping $(\hat{\rho}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}})\to(\rho^{\star}_{\mathrm{obs}},\hat{\wp}_{\mathrm{sim}})\to(\rho^{\star}_{\mathrm{obs}},\wp^{\star}_{\mathrm{obs}})\to V^{\mathrm{oracle}}_{\mathrm{obs}}$, with $\Delta_{\mathrm{pol}}$ capturing the final $\pi^{\star}_{\mathrm{sim}}\to\pi^{\star}_{\mathrm{oracle}}$ policy switch and $\Delta_{\mathrm{int}}$ absorbing second-order interaction between the reward and transition perturbations at the fixed policy. The ratio $U_{p}/U_{r}=\gamma R_{\max}\epsilon_{p}/((1-\gamma)\epsilon_{r})$ tends to infinity as $\gamma\to 1$ for fixed $\epsilon_{r},\epsilon_{p}>0$. ∎
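The decomposition can be traced on a toy problem. The sketch below builds a hypothetical two-state, two-action MDP pair (a "simulator" and a "true" model, both invented for illustration), computes $\Delta_{r}$, $\Delta_{p}$, $\Delta_{\mathrm{pol}}$, and defines $\Delta_{\mathrm{int}}$ as the residual of the oracle gap, then checks the single-source bounds $U_{r}$ and $U_{p}$. Values are evaluated from start state 0, which is an assumption about the initial distribution.

```python
def q_value(reward, trans, v, s, a, gamma):
    """One-step lookahead value of action a in state s."""
    return reward[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(trans[s][a]))


def evaluate(reward, trans, policy, gamma, iters=2000):
    """Iterative policy evaluation; returns the value at start state 0."""
    v = [0.0] * len(reward)
    for _ in range(iters):
        v = [q_value(reward, trans, v, s, policy[s], gamma) for s in range(len(reward))]
    return v[0]


def greedy_policy(reward, trans, gamma, iters=2000):
    """Value iteration followed by greedy action selection."""
    n_s, n_a = len(reward), len(reward[0])
    v = [0.0] * n_s
    for _ in range(iters):
        v = [max(q_value(reward, trans, v, s, a, gamma) for a in range(n_a))
             for s in range(n_s)]
    return [max(range(n_a), key=lambda a, s=s: q_value(reward, trans, v, s, a, gamma))
            for s in range(n_s)]


gamma, r_max = 0.9, 1.0
# Hypothetical simulator and true observed-state models; rewards lie in [0, r_max].
rho_sim = [[0.6, 0.4], [0.1, 0.9]]
rho_obs = [[0.5, 0.55], [0.2, 0.8]]
p_sim = [[[0.9, 0.1], [0.5, 0.5]], [[0.3, 0.7], [0.6, 0.4]]]
p_obs = [[[0.8, 0.2], [0.4, 0.6]], [[0.3, 0.7], [0.5, 0.5]]]

pi_sim = greedy_policy(rho_sim, p_sim, gamma)
pi_oracle = greedy_policy(rho_obs, p_obs, gamma)

v_sim = evaluate(rho_sim, p_sim, pi_sim, gamma)
v_oracle = evaluate(rho_obs, p_obs, pi_oracle, gamma)
delta_r = evaluate(rho_obs, p_sim, pi_sim, gamma) - v_sim
delta_p = evaluate(rho_sim, p_obs, pi_sim, gamma) - v_sim
delta_pol = v_oracle - evaluate(rho_obs, p_obs, pi_sim, gamma)
total_gap = v_oracle - v_sim
delta_int = total_gap - (delta_r + delta_p + delta_pol)  # residual interaction term

pairs = [(s, a) for s in range(2) for a in range(2)]
eps_r = max(abs(rho_sim[s][a] - rho_obs[s][a]) for s, a in pairs)
eps_p = max(sum(abs(p - q) for p, q in zip(p_sim[s][a], p_obs[s][a])) for s, a in pairs)
u_r = 2.0 * eps_r / (1.0 - gamma)
u_p = 2.0 * gamma * r_max * eps_p / (1.0 - gamma) ** 2
print(delta_r, delta_p, delta_pol, delta_int, u_r, u_p)
```

In examples like this the worst-case bounds are typically loose, and $U_{p}$ is much larger than $U_{r}$ at $\gamma=0.9$, in line with (24).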

### A.5 Proofs for Fisher-SEP (Section 3)

This subsection states and proves the do-variance decomposition (Lemma [3](https://arxiv.org/html/2605.21458#Thmlemma3)), Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2), and the observational-vs-interventional gap (Proposition [2](https://arxiv.org/html/2605.21458#Thmproposition2)). It also specifies the empirical-Bayes estimator of $\tau^{\mathrm{obs}}_{s,a}$ used in the experiments of Section [5](https://arxiv.org/html/2605.21458#S5). (In this appendix we write $\tau^{\mathrm{obs}}_{s,a}$ for the per-observation Fisher information $\tau^{\mathrm{int}}_{s,a}$ of Section [4](https://arxiv.org/html/2605.21458#S4); the superscript “$\mathrm{obs}$” refers to “observation under a randomized pilot,” so $\tau^{\mathrm{obs}}_{s,a}=\tau^{\mathrm{int}}_{s,a}$ under Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2).)

###### Lemma 3 (Do-variance decomposition).

Under the POMDP of Section [2](https://arxiv.org/html/2605.21458#S2) with reward kernel $R\mid S,A,H$ having conditional mean $\bar{r}(s,a,h)$ and variance $\sigma^{2}_{r}(s,a,h)$, the variance of the reward under the interventional distribution at $(s,\mathrm{do}(a))$ decomposes as

$$\mathrm{Var}(R\mid s,\mathrm{do}(a))\;=\;\mathbb{E}_{h\sim\mathbb{P}_{t}(\cdot\mid s)}\bigl[\sigma^{2}_{r}(s,a,h)\bigr]\;+\;\mathrm{Var}_{h\sim\mathbb{P}_{t}(\cdot\mid s)}\bigl[\bar{r}(s,a,h)\bigr]. \tag{25}$$

###### Proof of Lemma [3](https://arxiv.org/html/2605.21458#Thmlemma3).

Under the POMDP of Section [2](https://arxiv.org/html/2605.21458#S2), the reward $R\mid S,A,H$ has conditional mean $\bar{r}(s,a,h)$ and variance $\sigma^{2}_{r}(s,a,h)$. Intervening to fix $A=a$ by $\mathrm{do}(A=a)$ severs incoming edges to $A$ in the causal graph; in particular, it severs any $H\to A$ dependence that a behavior policy would induce. It does not affect the $S\to H$ relation or the reward kernel, so under $\mathrm{do}(A=a)$ at $S=s$ we have $H\mid s,\mathrm{do}(a)\sim\mathbb{P}_{t}(\cdot\mid s)$ and $R\mid s,a,h\sim\rho(\cdot\mid s,a,h)$.

Apply the law of total variance conditional on $(s,\mathrm{do}(a))$ with $H$ as the inner conditioning variable:

$$\begin{align*}
\mathrm{Var}(R\mid s,\mathrm{do}(a)) &= \mathbb{E}_{H\mid s,\mathrm{do}(a)}\bigl[\mathrm{Var}(R\mid s,\mathrm{do}(a),H)\bigr]+\mathrm{Var}_{H\mid s,\mathrm{do}(a)}\bigl[\mathbb{E}(R\mid s,\mathrm{do}(a),H)\bigr]\\
&= \mathbb{E}_{h\sim\mathbb{P}_{t}(\cdot\mid s)}\bigl[\sigma^{2}_{r}(s,a,h)\bigr]+\mathrm{Var}_{h\sim\mathbb{P}_{t}(\cdot\mid s)}\bigl[\bar{r}(s,a,h)\bigr],
\end{align*}$$

which is ([25](https://arxiv.org/html/2605.21458#A1.E25)). ∎
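A quick exact check of (25) is possible when rewards are Bernoulli, because a mixture of Bernoulli rewards is itself Bernoulli. The sketch below computes both sides for a hypothetical three-valued hidden state; the probabilities and conditional means are made up for illustration.

```python
def do_variance_terms(p_h, rbar):
    """Return (Var(R | s, do(a)), E_h[sigma_r^2], Var_h[rbar]) for Bernoulli rewards.

    p_h[h] = P_t(h | s); rbar[h] = mean of the Bernoulli reward given (s, a, h).
    """
    mean = sum(p * m for p, m in zip(p_h, rbar))
    total_var = mean * (1.0 - mean)  # the mixture of Bernoullis is Bernoulli(mean)
    within = sum(p * m * (1.0 - m) for p, m in zip(p_h, rbar))       # E_h[sigma_r^2]
    between = sum(p * (m - mean) ** 2 for p, m in zip(p_h, rbar))    # Var_h[rbar]
    return total_var, within, between


# Hypothetical hidden-state distribution and conditional reward means.
total_var, within, between = do_variance_terms([0.3, 0.5, 0.2], [0.1, 0.6, 0.9])
print(total_var, within + between)
```

The second term is the part of the interventional variance that comes purely from heterogeneity across hidden states; it is zero only when $\bar{r}(s,a,h)$ does not depend on $h$.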

###### Proof of Lemma[2](https://arxiv.org/html/2605.21458#Thmlemma2)\.

Let $\pi^{\mathrm{exp}}$ satisfy $\pi^{\mathrm{exp}}(a\mid s,h)=\pi^{\mathrm{exp}}(a\mid s)$ for every $h\in\mathcal{H}$. Section [2](https://arxiv.org/html/2605.21458#S2) takes the hidden-state conditional $H\mid S=s$ to be stationary within an episode; the pilot operates inside a single episode, so the within-episode distribution of $H\mid S=s$ under $\pi^{\mathrm{exp}}$ is $\mathbb{P}_{t}(\cdot\mid s)$, the same conditional used to define $\rho^{\star}_{\mathrm{obs}}$ and $\wp^{\star}_{\mathrm{obs}}$. Under $\pi^{\mathrm{exp}}$, conditional on $S=s$, the joint law of $(A,H,R)$ factors as

$$\pi^{\mathrm{exp}}(a\mid s)\;\mathbb{P}_{t}(h\mid s)\;\rho(r\mid s,a,h),$$

where $A$ is independent of $H$ given $S$ by $S$-measurability of $\pi^{\mathrm{exp}}$, and the reward kernel $\rho$ is the same under observational and interventional regimes. Conditioning further on $A=a$ gives

$$\mathbb{P}_{\pi^{\mathrm{exp}}}(H,R\mid S=s,A=a)\;=\;\mathbb{P}_{t}(h\mid s)\,\rho(r\mid s,a,h).$$

But the right-hand side is precisely $\mathbb{P}(H,R\mid s,\mathrm{do}(A=a))$, because $\mathrm{do}(A=a)$ leaves $\mathbb{P}_{t}(h\mid s)$ and $\rho$ unchanged. Marginalizing over $h$ yields $\mathbb{P}_{\pi^{\mathrm{exp}}}(R\mid s,a)=\mathbb{P}(R\mid s,\mathrm{do}(a))$.

Consequently, reward samples at $(s,a)$ under $\pi^{\mathrm{exp}}$ are i.i.d. draws (across units) from $\mathbb{P}(R\mid s,\mathrm{do}(a))$; within a unit, samples may be auto-correlated through $H_{i,t}$, but each such trajectory marginalizes to the same $\mathbb{P}(R\mid s,\mathrm{do}(a))$ by the same $S$-measurability argument applied at every step. The empirical variance $\hat{v}_{s,a}:=(n-1)^{-1}\sum_{i}(R_{i}-\bar{R})^{2}$ is consistent for $\mathrm{Var}(R\mid s,\mathrm{do}(a))$ by the law of large numbers, and $1/\hat{v}_{s,a}$ is consistent for $\tau_{s,a}=\mathrm{Var}(R\mid s,\mathrm{do}(a))^{-1}$ by continuity away from zero.

Equivalently, $\{S\}$ is a valid back-door adjustment set for the effect of $A$ on $R$ because the only back-door path $A\leftarrow H\to R$ is blocked by the absence of the $H\to A$ edge under $\pi^{\mathrm{exp}}$, and the standard back-door adjustment formula [Pearl, [2009](https://arxiv.org/html/2605.21458#bib.bib140)] recovers the do-distribution from the observational conditional. ∎

###### Proposition 2\(Behavior\-policy reward variance vs\. interventional\)\.

Under Assumption [3](https://arxiv.org/html/2605.21458#Thmassumption3) (calibration behavior policy $\pi^{\mathrm{beh}}(a\mid s,h)$ with $A\not\perp H\mid S$):

1. (a) $\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)$ generally differs from $\mathrm{Var}(R\mid s,\mathrm{do}(a))$, and the sign of the difference is not determined by $\pi^{\mathrm{beh}}$ alone.
2. (b) Under bounded propensity-odds $w(h\mid s,a)\in[1/M,M]$, the ratio $\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)/\mathrm{Var}(R\mid s,\mathrm{do}(a))$ is bounded in $[1/M^{2},M^{2}]$.
3. (c) The Fisher-regret incurred by plugging the observational variance into the A-optimality criterion in place of the interventional one is at most a multiplicative factor $M^{2}$.

###### Proof of Proposition[2](https://arxiv.org/html/2605.21458#Thmproposition2)\.

We prove each part in turn\.

*Part (a): observational conditional variance is biased.* Under the behavior policy $\pi^{\mathrm{beh}}$, the conditional law of $H\mid S=s,A=a$ is re-weighted by the propensity likelihood ratio:

$$\mathbb{P}_{\pi^{\mathrm{beh}}}(h\mid s,a)\;=\;\mathbb{P}_{t}(h\mid s)\,\frac{\pi^{\mathrm{beh}}(a\mid s,h)}{\pi^{\mathrm{beh}}(a\mid s)}\;=\;\mathbb{P}_{t}(h\mid s)\,w(h\mid s,a).$$

Applying the law of total variance under $\mathbb{P}_{\pi^{\mathrm{beh}}}(\cdot\mid s,a)$,

$$\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)\;=\;\mathbb{E}_{\pi^{\mathrm{beh}}}\bigl[\sigma^{2}_{r}(s,a,H)\mid s,a\bigr]+\mathrm{Var}_{\pi^{\mathrm{beh}}}\bigl[\bar{r}(s,a,H)\mid s,a\bigr].$$

The bias $\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)\neq\mathrm{Var}(R\mid s,\mathrm{do}(a))$ arises because $\mathbb{P}_{\pi^{\mathrm{beh}}}(\cdot\mid s,a)\neq\mathbb{P}_{t}(\cdot\mid s)$ whenever $w(h\mid s,a)\neq 1$. The direction of the bias depends on how the re-weighting interacts with the integrand $g$ (here $\sigma^{2}_{r}$ or the squared deviation of $\bar{r}$): concentration on low-$g(h)$ values makes $\mathbb{E}_{\pi^{\mathrm{beh}}}[g]$ smaller than $\mathbb{E}_{\mathbb{P}_{t}}[g]$, while concentration on extreme values makes it larger. In particular, $\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)$ can be *either* smaller or larger than $\mathrm{Var}(R\mid s,\mathrm{do}(a))$; the two-sided bound of Part (b) is the robust statement.

*Part (b): multiplicative bound under Rosenbaum sensitivity.* If $w(h\mid s,a)\in[1/M,M]$ for every $h$, we show $\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)/\mathrm{Var}(R\mid s,\mathrm{do}(a))\in[1/M^{2},M^{2}]$, so $\bigl|\mathrm{Var}_{\pi^{\mathrm{beh}}}/\mathrm{Var}_{\mathrm{do}}-1\bigr|\leq M^{2}-1$. Decompose both variances via the law of total variance:

$$\mathrm{Var}(R\mid s,\mathrm{do}(a))=\mathbb{E}_{\mathbb{P}_{t}}\bigl[\sigma^{2}_{r}(s,a,H)\bigr]+\mathrm{Var}_{\mathbb{P}_{t}}\bigl[\bar{r}(s,a,H)\bigr],$$

$$\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)=\mathbb{E}_{\mathbb{P}^{\mathrm{beh}}}\bigl[\sigma^{2}_{r}(s,a,H)\bigr]+\mathrm{Var}_{\mathbb{P}^{\mathrm{beh}}}\bigl[\bar{r}(s,a,H)\bigr],$$

where $\mathbb{P}^{\mathrm{beh}}(h):=\mathbb{P}_{t}(h\mid s)\,w(h\mid s,a)$ and both decompositions have nonnegative summands. For the first summand, nonnegativity of $\sigma^{2}_{r}$ and $w\in[1/M,M]$ give $\mathbb{E}_{\mathbb{P}^{\mathrm{beh}}}[\sigma^{2}_{r}]\in[1/M,M]\cdot\mathbb{E}_{\mathbb{P}_{t}}[\sigma^{2}_{r}]$, which is contained in $[1/M^{2},M^{2}]\cdot\mathbb{E}_{\mathbb{P}_{t}}[\sigma^{2}_{r}]$ since $M\geq 1$. For the second summand, use the Gini / pair-difference form of the variance: for any probability distribution $Q$ on $\mathcal{H}$ and any function $g$,

$$\mathrm{Var}_{Q}[g(H)]\;=\;\tfrac{1}{2}\sum_{h,h'}Q(h)\,Q(h')\,\bigl(g(h)-g(h')\bigr)^{2}.$$

Because $\mathbb{P}^{\mathrm{beh}}(h)\,\mathbb{P}^{\mathrm{beh}}(h')=w(h)\,w(h')\,\mathbb{P}_{t}(h)\,\mathbb{P}_{t}(h')\in[1/M^{2},M^{2}]\cdot\mathbb{P}_{t}(h)\,\mathbb{P}_{t}(h')$, summing over the nonnegative pair terms yields

$$\mathrm{Var}_{\mathbb{P}^{\mathrm{beh}}}[\bar{r}(s,a,H)]\;\in\;[1/M^{2},M^{2}]\cdot\mathrm{Var}_{\mathbb{P}_{t}}[\bar{r}(s,a,H)].$$

Each summand of $\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)$ is therefore within a factor in $[1/M^{2},M^{2}]$ of its $\mathrm{do}$-counterpart; since both are nonnegative the same interval applies to their sum, giving $\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)/\mathrm{Var}(R\mid s,\mathrm{do}(a))\in[1/M^{2},M^{2}]$.

*Part (c): downstream Fisher regret.* Let $\tau^{\star}_{s,a}:=\mathrm{Var}(R\mid s,\mathrm{do}(a))^{-1}$ and $\tilde{\tau}_{s,a}:=\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)^{-1}$; by Part (b), $\tilde{\tau}_{s,a}/\tau^{\star}_{s,a}\in[1/M^{2},M^{2}]$. Let $\pi^{\star},\tilde{\pi}$ be the Fisher-maximizers under $\tau^{\star},\tilde{\tau}$ respectively. Because $\mathcal{F}$ is linear in the per-$(s,a)$ weights $\tau_{s,a}$ for any fixed $\pi$,

$$\mathcal{F}(\pi^{\star})\;=\;\sum_{s,a}\tau^{\star}_{s,a}\,\omega_{s,a}(\pi^{\star})\;\leq\;M^{2}\sum_{s,a}\tilde{\tau}_{s,a}\,\omega_{s,a}(\pi^{\star})\;=\;M^{2}\,\tilde{\mathcal{F}}(\pi^{\star})\;\leq\;M^{2}\,\tilde{\mathcal{F}}(\tilde{\pi}),$$

where $\omega_{s,a}(\pi):=\sum_{s'}d_{\pi}(s')\,\bigl(\partial V^{\pi}(s')/\partial\theta_{s,a}\bigr)^{2}$ (the visitation-weighted squared sensitivity at $(s,a)$, summed over initial states $s'$). Hence $\mathcal{F}(\pi^{\star})\leq M^{2}\,\tilde{\mathcal{F}}(\tilde{\pi})$: the Fisher-regret factor incurred by plugging the observational variance into the A-optimality criterion is at most $M^{2}$. ∎
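The two-sided bound of Part (b) can be probed numerically. The sketch below draws random discrete hidden-state laws, random behavior propensities $\pi^{\mathrm{beh}}(a\mid s,h)$, and random reward moments, then records the ratio $\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)/\mathrm{Var}(R\mid s,\mathrm{do}(a))$ together with the smallest $M$ for which $w\in[1/M,M]$. The sampling ranges, the four-point hidden state, and the function names are assumptions chosen for illustration only.

```python
import random

def mixture_variance(q, mean_r, var_r):
    """Variance of R when H ~ q and R | H=h has mean mean_r[h], variance var_r[h]."""
    m = sum(p * x for p, x in zip(q, mean_r))
    return sum(p * (v + (x - m) ** 2) for p, x, v in zip(q, mean_r, var_r))

def confounded_ratio(p_t, propensity, mean_r, var_r):
    """Return (Var_beh / Var_do, M) for one (s, a) cell."""
    marginal = sum(p * e for p, e in zip(p_t, propensity))  # pi_beh(a | s)
    w = [e / marginal for e in propensity]                  # w(h | s, a)
    p_beh = [p * wi for p, wi in zip(p_t, w)]               # P_beh(h | s, a)
    big_m = max(max(w), 1.0 / min(w))                       # tightest M
    ratio = (mixture_variance(p_beh, mean_r, var_r)
             / mixture_variance(p_t, mean_r, var_r))
    return ratio, big_m

rng = random.Random(0)
results = []
for _ in range(1000):
    raw = [rng.random() + 0.05 for _ in range(4)]
    p_t = [x / sum(raw) for x in raw]
    propensity = [rng.uniform(0.1, 0.9) for _ in range(4)]
    mean_r = [rng.gauss(0.0, 2.0) for _ in range(4)]
    var_r = [rng.uniform(0.1, 2.0) for _ in range(4)]
    results.append(confounded_ratio(p_t, propensity, mean_r, var_r))

violations = [(r, m) for r, m in results if not (1 / m**2 <= r <= m**2)]
print(len(violations))
```

Every sampled ratio falls inside $[1/M^{2},M^{2}]$, as the pair-difference argument guarantees; the check says nothing about how tight the bound is in any particular application.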

#### Empirical-Bayes estimator for $\tau_{s,a}$

In practice we estimate $\tau_{s,a}$ from pilot data via a Normal-Inverse-Gamma conjugate posterior with prior hyperparameters $(\mu_{0},\kappa_{0},\alpha_{0},\beta_{0})$, where $\mu_{0}=\hat{r}_{\mathrm{sim}}(s,a)$, $\alpha_{0}=2$, $\beta_{0}=\alpha_{0}\,\sigma^{2}_{s,a}(0)$, and $\kappa_{0}$ is the prior strength. Given $n_{s,a}$ i.i.d. randomized-action samples $\{r_{1},\ldots,r_{n_{s,a}}\}$ with sample mean $\bar{r}$ and sum of squared deviations $\mathrm{SSE}=\sum_{i}(r_{i}-\bar{r})^{2}$, the posterior is $\mathrm{NIG}(\mu_{n},\kappa_{n},\alpha_{n},\beta_{n})$ with

$$\kappa_{n}=\kappa_{0}+n_{s,a},\qquad\alpha_{n}=\alpha_{0}+n_{s,a}/2, \tag{26}$$

$$\beta_{n}=\beta_{0}+\tfrac{1}{2}\mathrm{SSE}+\tfrac{\kappa_{0}n_{s,a}}{2(\kappa_{0}+n_{s,a})}(\bar{r}-\mu_{0})^{2}. \tag{27}$$

We take $\hat{\tau}_{s,a}=\alpha_{n}/\beta_{n}$, the posterior mean of the precision.
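The update in (26) and (27) is short enough to write out directly. The sketch below is one possible implementation; the function names are hypothetical, and the posterior-mean update for $\mu_{n}$ is the textbook NIG formula $(\kappa_{0}\mu_{0}+n\bar{r})/\kappa_{n}$, which the text above does not state explicitly.

```python
def nig_update(rewards, mu0, kappa0, alpha0, beta0):
    """Normal-Inverse-Gamma posterior hyperparameters after observing rewards."""
    n = len(rewards)
    if n == 0:
        return mu0, kappa0, alpha0, beta0
    rbar = sum(rewards) / n
    sse = sum((r - rbar) ** 2 for r in rewards)
    kappa_n = kappa0 + n                                   # Eq. (26)
    alpha_n = alpha0 + n / 2                               # Eq. (26)
    mu_n = (kappa0 * mu0 + n * rbar) / kappa_n             # standard NIG mean
    beta_n = (beta0 + 0.5 * sse
              + kappa0 * n / (2 * kappa_n) * (rbar - mu0) ** 2)  # Eq. (27)
    return mu_n, kappa_n, alpha_n, beta_n

def precision_estimate(rewards, mu0, kappa0, sigma2_0, alpha0=2.0):
    """tau_hat = alpha_n / beta_n with beta_0 = alpha_0 * sigma^2(0)."""
    beta0 = alpha0 * sigma2_0
    _, _, alpha_n, beta_n = nig_update(rewards, mu0, kappa0, alpha0, beta0)
    return alpha_n / beta_n

tau_prior = precision_estimate([], mu0=2.0, kappa0=2.0, sigma2_0=1.0)
tau_post = precision_estimate([1.0, 2.0, 3.0], mu0=2.0, kappa0=2.0, sigma2_0=1.0)
print(tau_prior, tau_post)
```

With no pilot data the estimate reduces to $\alpha_{0}/\beta_{0}=1/\sigma^{2}_{s,a}(0)$, so the simulator's stated variance acts as the default precision until real samples arrive.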

*Adaptive prior strength $\hat{\kappa}_{\mathrm{eff}}$.* Rather than fix $\kappa_{0}$, we estimate simulator credibility from the pooled pilot-vs-simulator discrepancies across all $(s,a)$ with $n_{s,a}\geq n_{\min}$. Under the null that the simulator is unbiased, the $z$-statistic $z_{s,a}:=(\bar{r}_{s,a}-\mu_{0})/\sqrt{\hat{v}_{s,a}/n_{s,a}}$ satisfies $\mathbb{E}[z^{2}_{s,a}]\approx 1+n_{s,a}\sigma^{2}_{s,a}(0)/(\kappa_{\mathrm{eff}}\,\hat{v}_{s,a})$, yielding the method-of-moments estimator

$$\hat{\kappa}_{\mathrm{eff}}\;=\;\max\left(1,\;\frac{\sum_{(s,a)\in\mathcal{I}}n_{s,a}\,\sigma^{2}_{s,a}(0)/\hat{v}_{s,a}}{\sum_{(s,a)\in\mathcal{I}}(z^{2}_{s,a}-1)_{+}}\right),$$

where $\mathcal{I}=\{(s,a):n_{s,a}\geq n_{\min}\}$ and $(x)_{+}:=\max(x,0)$. When $\mathcal{I}$ is empty or the denominator is zero, fall back to a fixed $\kappa_{0}=2$. This is a pooled shrinkage estimator in the James-Stein family; it shrinks toward the simulator (large $\hat{\kappa}_{\mathrm{eff}}$) when the pilot agrees with simulator predictions and toward data (small $\hat{\kappa}_{\mathrm{eff}}$) when they systematically disagree.
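A minimal sketch of the pooled estimator follows. The cell representation (a list of dictionaries), the default `n_min=5`, and the guard that skips cells with non-positive empirical variance are assumptions introduced here; the text fixes only the formula and the fallback $\kappa_{0}=2$.

```python
def kappa_eff(cells, n_min=5, fallback=2.0):
    """Method-of-moments prior strength pooled over (s, a) cells.

    Each cell is a dict with keys: n, rbar, vhat, mu0, sigma2_0.
    """
    numerator = 0.0
    denominator = 0.0
    for c in cells:
        if c["n"] < n_min or c["vhat"] <= 0:
            continue  # cell not in the index set I (or unusable variance)
        z2 = (c["rbar"] - c["mu0"]) ** 2 / (c["vhat"] / c["n"])
        numerator += c["n"] * c["sigma2_0"] / c["vhat"]
        denominator += max(z2 - 1.0, 0.0)
    if denominator == 0.0:
        return fallback  # empty index set or no excess discrepancy
    return max(1.0, numerator / denominator)

mild = [{"n": 10, "rbar": 1.4, "mu0": 1.0, "vhat": 1.0, "sigma2_0": 1.0}]
strong = [{"n": 10, "rbar": 3.0, "mu0": 1.0, "vhat": 1.0, "sigma2_0": 1.0}]
exact = [{"n": 10, "rbar": 1.0, "mu0": 1.0, "vhat": 1.0, "sigma2_0": 1.0}]
print(kappa_eff(mild), kappa_eff(strong), kappa_eff(exact))
```

A mild discrepancy gives a large prior strength, a strong one is clipped at the floor of 1, and a cell with no excess discrepancy triggers the fixed fallback because the denominator vanishes.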

### A\.6A\-optimal PVV derivation and recovery corollaries

This subsection derives Definition[5](https://arxiv.org/html/2605.21458#Thmdefinition5)’s posterior predictive value variance from first principles and proves the two recovery corollaries \(Corollary[4](https://arxiv.org/html/2605.21458#Thmcorollary4)and Corollary[5](https://arxiv.org/html/2605.21458#Thmcorollary5)\) and the explore\-under\-ignorance proposition \(Proposition[4](https://arxiv.org/html/2605.21458#Thmremark4)\) stated in Section[4](https://arxiv.org/html/2605.21458#S4)\.

##### Setup\.

Write $\theta:=(\theta_{s,a})_{s,a}\in\mathbb{R}^{SK}$ for the vector of reward parameters $\theta_{s,a}=\mathbb{E}_{h\sim\mathbb{P}_{t}(\cdot\mid s)}[\bar{r}(s,a,h)]$ identified under the POMDP’s time-$t$ hidden-state distribution. Fix a *target policy* $\pi^{\mathrm{tgt}}$ (the planner’s value estimand) and a candidate *exploration policy* $\pi$ with discounted state-visitation $d_{\pi}$ and expected pilot observation counts $n_{s,a}(\pi)=T\cdot d_{\pi}(s)\,\pi(a\mid s)$ under effective horizon $T$. Assume the planner carries a Normal-Inverse-Gamma conjugate prior on each $(\theta_{s,a},v_{s,a})$ pair with prior mean $\mu_{0}(s,a)=\hat{r}_{\mathrm{sim}}(s,a)$, prior strength $\kappa_{0}(s,a)\geq 0$ (so $\theta_{s,a}\mid v\sim\mathcal{N}(\mu_{0},v/\kappa_{0})$), and prior variance $\sigma^{2}_{s,a}(0)$ (so $v_{s,a}\sim\mathrm{IG}(\alpha_{0},\alpha_{0}\sigma^{2}_{s,a}(0))$). The per-observation Fisher information is $\tau^{\mathrm{obs}}_{s,a}=\mathrm{Var}(R\mid s,\mathrm{do}(a))^{-1}$, identified from pilot data (Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2)), and the prior Fisher information is $I_{0}(s,a)=\kappa_{0}(s,a)/\sigma^{2}_{s,a}(0)$.

###### Theorem 3\(Posterior predictive variance interpretation of PVV\)\.

Suppose (i) $\theta_{s,a}$ are a priori independent across $(s,a)$, (ii) the reward likelihood at $(s,a)$ is $\mathcal{N}\bigl(\theta_{s,a},(\tau^{\mathrm{obs}}_{s,a})^{-1}\bigr)$, and (iii) $V^{\pi^{\mathrm{tgt}}}(s;\theta)$ admits a first-order expansion around $\theta^{\star}$ that dominates higher-order terms in the relevant posterior scale. Then, after observing $n_{s,a}(\pi)$ pilot reward samples under the exploration policy $\pi$, the posterior predictive variance of $\hat{V}^{\pi^{\mathrm{tgt}}}$ averaged over $d_{\pi^{\mathrm{tgt}}}$ equals

$$\mathbb{E}_{s\sim d_{\pi^{\mathrm{tgt}}}}\bigl[\mathrm{Var}(\hat{V}^{\pi^{\mathrm{tgt}}}(s)\mid\mathcal{D}_{\pi})\bigr]\;=\;\mathrm{PVV}(\pi;\,\pi^{\mathrm{tgt}})\;+\;o(1) \tag{28}$$

where $\mathrm{PVV}(\pi;\,\pi^{\mathrm{tgt}})$ is the functional in Eq. ([9](https://arxiv.org/html/2605.21458#S4.E9)) and the remainder vanishes as the posterior concentrates.

###### Proof\.

By conjugacy, the marginal posterior of $\theta_{s',a'}$ given pilot data $\mathcal{D}_{\pi}$ is Gaussian (under known $\tau^{\mathrm{obs}}$, Assumption (ii)) with mean $\mu_{n}(s',a')\to\theta^{\star}_{s',a'}$ and variance

$$\mathrm{Var}(\theta_{s',a'}\mid\mathcal{D}_{\pi})\;=\;\frac{1}{I_{0}(s',a')+n_{s',a'}(\pi)\,\tau^{\mathrm{obs}}_{s',a'}},$$

via the standard NIG posterior precision update $\kappa_{n}=\kappa_{0}+n_{s',a'}(\pi)$. Under assumption (i) the posterior covariance matrix $\Sigma_{n}$ is diagonal with these entries. (If $\tau^{\mathrm{obs}}$ is itself estimated via a full NIG posterior on $(\theta,v)$ rather than plugged in, the marginal is Student-$t$ and the variance expression picks up a factor $\beta_{n}/[(\alpha_{n}-1)\kappa_{n}]$ that tends to the display above as $\alpha_{n}\to\infty$; the $o(1)$ remainder absorbs this term.)

Expand $V^{\pi^{\mathrm{tgt}}}$ around the posterior mean $\mu_{n}$: $\hat{V}^{\pi^{\mathrm{tgt}}}(s)=V^{\pi^{\mathrm{tgt}}}(s;\mu_{n})+(\theta-\mu_{n})^{\top}\nabla_{\theta}V^{\pi^{\mathrm{tgt}}}(s)+O(\|\theta-\mu_{n}\|^{2})$. Under assumption (iii) the quadratic remainder is $o(1)$ relative to the leading term, so

$$\mathrm{Var}(\hat{V}^{\pi^{\mathrm{tgt}}}(s)\mid\mathcal{D}_{\pi})\;=\;\nabla_{\theta}V^{\pi^{\mathrm{tgt}}}(s)^{\top}\Sigma_{n}\nabla_{\theta}V^{\pi^{\mathrm{tgt}}}(s)+o(1)\;=\;\sum_{s',a'}\frac{\bigl(\partial V^{\pi^{\mathrm{tgt}}}(s)/\partial\theta_{s',a'}\bigr)^{2}}{I_{0}(s',a')+n_{s',a'}(\pi)\,\tau^{\mathrm{obs}}_{s',a'}}+o(1).$$

Averaging over $s\sim d_{\pi^{\mathrm{tgt}}}$ yields Eq. ([28](https://arxiv.org/html/2605.21458#A1.E28)). ∎
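To make the leading term concrete, the sketch below evaluates it for a tabular MDP in which transitions are known and the reward at $(s,a)$ is exactly $\theta_{s,a}$. Under those assumptions (which are ours, introduced for illustration) the value is linear in $\theta$ and the gradient is $\partial V^{\pi^{\mathrm{tgt}}}(u)/\partial\theta_{s',a'}=[(I-\gamma P^{\pi^{\mathrm{tgt}}})^{-1}]_{u,s'}\,\pi^{\mathrm{tgt}}(a'\mid s')$. The function names and the truncated Neumann series used for the resolvent are likewise illustrative choices.

```python
def resolvent(P_pi, gamma, iters=500):
    """Approximate (I - gamma * P_pi)^{-1} by a truncated Neumann series."""
    n = len(P_pi)
    res = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in res]
    for _ in range(iters):
        term = [[gamma * sum(term[i][k] * P_pi[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
        res = [[res[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return res

def pvv(pi_tgt, d_tgt, P, gamma, n_counts, tau_obs, I0):
    """Leading term: sum over (s', a') of w(s', a') / (I0 + n * tau_obs)."""
    S, K = len(pi_tgt), len(pi_tgt[0])
    # Transition matrix induced by the target policy; P[s][a][t] = P(t | s, a).
    P_pi = [[sum(pi_tgt[s][a] * P[s][a][t] for a in range(K))
             for t in range(S)] for s in range(S)]
    res = resolvent(P_pi, gamma)
    total = 0.0
    for sp in range(S):
        for ap in range(K):
            # Target-weighted squared sensitivity w(s', a').
            w = sum(d_tgt[u] * (res[u][sp] * pi_tgt[sp][ap]) ** 2
                    for u in range(S))
            total += w / (I0[sp][ap] + n_counts[sp][ap] * tau_obs[sp][ap])
    return total

# One state, two actions; the target always plays action 0.
P = [[[1.0], [1.0]]]
pi_tgt, d_tgt = [[1.0, 0.0]], [1.0]
tau_obs, I0 = [[1.0, 1.0]], [[1.0, 1.0]]
pvv_3 = pvv(pi_tgt, d_tgt, P, 0.5, [[3.0, 0.0]], tau_obs, I0)
pvv_7 = pvv(pi_tgt, d_tgt, P, 0.5, [[7.0, 0.0]], tau_obs, I0)
print(pvv_3, pvv_7)
```

In this toy case the resolvent is $1/(1-\gamma)=2$, so $w(0,0)=4$; the term shrinks as pilot counts at the sensitive pair grow, while the untouched action contributes nothing because the target never uses it.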

###### Corollary 4\(Data\-rich homoskedastic limit of PVV\)\.

Suppose $n_{s',a'}(\pi)\,\tau^{\mathrm{obs}}_{s',a'}\gg I_{0}(s',a')$ for all $(s',a')$ with non-zero target-value gradient, and $\tau^{\mathrm{obs}}_{s',a'}\equiv\tau^{\mathrm{obs}}$ is uniform. Then $\mathrm{PVV}(\pi;\,\pi^{\mathrm{tgt}})$ is approximately the visitation-weighted A-optimal design objective in reward parameters, and the minimizer’s state-action occupancy satisfies $d_{\pi}(s')\,\pi(a'\mid s')\propto\sqrt{w(s',a')}$, where $w(s',a'):=\sum_{u}d_{\pi^{\mathrm{tgt}}}(u)\,\bigl(\partial V^{\pi^{\mathrm{tgt}}}(u)/\partial\theta_{s',a'}\bigr)^{2}$.

###### Proof of Corollary[4](https://arxiv.org/html/2605.21458#Thmcorollary4)\(data\-rich homoskedastic limit\)\.

Suppose $n_{s',a'}(\pi)\,\tau^{\mathrm{obs}}_{s',a'}\gg I_{0}(s',a')$ for all $(s',a')$ with non-zero target-value gradient, and $\tau^{\mathrm{obs}}_{s',a'}\equiv\tau^{\mathrm{obs}}$ is uniform. Then $I_{0}+n\tau^{\mathrm{obs}}\approx n_{s',a'}(\pi)\,\tau^{\mathrm{obs}}=T\,\tau^{\mathrm{obs}}\,d_{\pi}(s')\,\pi(a'\mid s')$, so

$$\mathrm{PVV}(\pi;\,\pi^{\mathrm{tgt}})\;\approx\;\frac{1}{T\tau^{\mathrm{obs}}}\,\sum_{s',a'}\frac{w(s',a')}{d_{\pi}(s')\,\pi(a'\mid s')},\qquad w(s',a'):=\sum_{u}d_{\pi^{\mathrm{tgt}}}(u)\,\bigl(\partial V^{\pi^{\mathrm{tgt}}}(u)/\partial\theta_{s',a'}\bigr)^{2}.$$

Here the weights $w(s',a')$ are determined by the *target* and are fixed during the minimization over $\pi$. The $\operatorname*{arg\,min}$ of $\sum_{s',a'}w(s',a')/[d_{\pi}(s')\,\pi(a'\mid s')]$ subject to $\sum_{a'}\pi(a'\mid s')=1$ (per state) is attained when the state-action occupancy measure $\mu(s',a'):=d_{\pi}(s')\,\pi(a'\mid s')$ satisfies $\mu(s',a')\propto\sqrt{w(s',a')}$. This follows from Cauchy–Schwarz: since $\sum_{s',a'}\mu(s',a')=1$, we have $\bigl(\sum_{s',a'}\sqrt{w(s',a')}\bigr)^{2}\leq\sum_{s',a'}w(s',a')/\mu(s',a')$, with equality exactly when $\mu\propto\sqrt{w}$. The minimizer therefore allocates exploration visitation in proportion to the square root of the target-weighted squared sensitivity, the standard square-root rule of A-optimal design. ∎
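The square-root rule is easy to verify numerically. The sketch below treats the occupancy measure as a free point on the probability simplex (an idealization: it ignores the flow constraints a real MDP imposes on reachable occupancies) and compares the closed-form allocation against random alternatives. The weights `w` and the function names are illustrative assumptions.

```python
import math
import random

def a_optimal_allocation(w):
    """Occupancy proportional to sqrt(w), normalized to sum to one."""
    roots = [math.sqrt(x) for x in w]
    total = sum(roots)
    return [r / total for r in roots]

def design_objective(w, mu):
    """Sum of w / mu over pairs with positive weight."""
    return sum(wi / mi for wi, mi in zip(w, mu) if wi > 0)

w = [1.0, 4.0, 9.0]
mu_star = a_optimal_allocation(w)
best = design_objective(w, mu_star)

# Compare against random points on the simplex.
rng = random.Random(1)
worst_gap = float("inf")
for _ in range(200):
    raw = [rng.random() + 1e-3 for _ in w]
    mu = [x / sum(raw) for x in raw]
    worst_gap = min(worst_gap, design_objective(w, mu) - best)
print(mu_star, best, worst_gap)
```

For these weights the optimal objective equals $(\sqrt{1}+\sqrt{4}+\sqrt{9})^{2}=36$, and no sampled alternative does better, matching the equality case of Cauchy–Schwarz.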

###### Corollary 5\(Data\-starved limit of PVV\)\.

Suppose $n_{s',a'}(\pi)\,\tau^{\mathrm{obs}}_{s',a'}\ll I_{0}(s',a')$ for all $(s',a')$ contributing to $\mathrm{PVV}(\pi;\,\pi^{\mathrm{tgt}})$. Then $\mathrm{PVV}(\pi;\,\pi^{\mathrm{tgt}})$ is approximately independent of the exploration policy $\pi$, and the per-pair reducible PVV contribution ranks $(s',a')$ by the product $w(s',a')\cdot\sigma^{2}_{s',a'}(0)/\kappa_{0}(s',a')$, with $w(s',a')$ as in Corollary [4](https://arxiv.org/html/2605.21458#Thmcorollary4).

###### Proof of Corollary[5](https://arxiv.org/html/2605.21458#Thmcorollary5)\(data\-starved limit\)\.

Suppose $n_{s',a'}(\pi)\,\tau^{\mathrm{obs}}_{s',a'}\ll I_{0}(s',a')$ for all $(s',a')$ contributing to $\mathrm{PVV}(\pi;\,\pi^{\mathrm{tgt}})$. Then $I_{0}+n\tau^{\mathrm{obs}}\approx I_{0}(s',a')=\kappa_{0}(s',a')/\sigma^{2}_{s',a'}(0)$, and

$$\mathrm{PVV}(\pi;\,\pi^{\mathrm{tgt}})\;\approx\;\sum_{s',a'}w(s',a')\cdot\frac{\sigma^{2}_{s',a'}(0)}{\kappa_{0}(s',a')},$$

which depends on the target through $w(s',a')=\sum_{u}d_{\pi^{\mathrm{tgt}}}(u)\,\bigl(\partial V^{\pi^{\mathrm{tgt}}}(u)/\partial\theta_{s',a'}\bigr)^{2}$ but not on the exploration policy $\pi$. The argmin over $\pi$ is vacuous: to leading order, no choice of exploration policy moves the posterior away from the prior while pilot data remain negligible. The ranking of reducible contributions across $(s',a')$ (i.e., which visitation would reduce PVV the most if pilot data were collected) is given by the product $w(s',a')\cdot\sigma^{2}_{s',a'}(0)/\kappa_{0}(s',a')$: pairs where the simulator’s prior is uncertain (small $\kappa_{0}$) and the target is sensitive (large $w$) dominate. Fisher-SEP uses this ranking to allocate its pilot-phase visitation, which then carries the framework into the non-vacuous data-rich regime of Corollary [4](https://arxiv.org/html/2605.21458#Thmcorollary4). ∎

###### Proof of Proposition[4](https://arxiv.org/html/2605.21458#Thmremark4)\(explore under ignorance\)\.

Fix $(s',a')$ with $n_{s',a'}(\pi)=0$ and $\partial V^{\pi^{\mathrm{tgt}}}(u)/\partial\theta_{s',a'}\neq 0$ for some $u\in\mathrm{supp}(d_{\pi^{\mathrm{tgt}}})$. The corresponding term in $\mathrm{PVV}(\pi;\,\pi^{\mathrm{tgt}})$ is

$$\sum_{u}d_{\pi^{\mathrm{tgt}}}(u)\,\frac{\bigl(\partial V^{\pi^{\mathrm{tgt}}}(u)/\partial\theta_{s',a'}\bigr)^{2}}{I_{0}(s',a')+0\cdot\tau^{\mathrm{obs}}_{s',a'}}\;=\;\frac{\sigma^{2}_{s',a'}(0)}{\kappa_{0}(s',a')}\,\sum_{u}d_{\pi^{\mathrm{tgt}}}(u)\,\bigl(\partial V^{\pi^{\mathrm{tgt}}}(u)/\partial\theta_{s',a'}\bigr)^{2},$$

a positive finite quantity when $\kappa_{0}(s',a')>0$ and the target is sensitive to $\theta_{s',a'}$. Any policy $\pi'$ with $n_{s',a'}(\pi')>0$ replaces the denominator $I_{0}(s',a')$ with $I_{0}(s',a')+n_{s',a'}(\pi')\,\tau^{\mathrm{obs}}_{s',a'}>I_{0}(s',a')$, strictly reducing this term. Therefore this term is *reducible* by visiting $(s',a')$ under the exploration policy, and any A-optimal minimizer of $\mathrm{PVV}$ over $\pi$ must visit every $(s',a')$ with positive target-value sensitivity and finite $\kappa_{0}$.

The key point, which distinguishes this from the standard Bayesian exploration bonus, is that the target’s gradient $\partial V^{\pi^{\mathrm{tgt}}}/\partial\theta_{s',a'}$ can be nonzero at pairs the *target* itself never directly visits: the Bellman resolvent $(I-\gamma P^{\pi^{\mathrm{tgt}}})^{-1}$ propagates reward sensitivity through the transition dynamics. In the HIV DGP (Appendix [D.11](https://arxiv.org/html/2605.21458#A4.SS11)), the target policy (the posterior-optimal exploit) does not visit Region B directly, but Region-B prevalence enters $V^{\pi^{\mathrm{tgt}}}$ through disease spread from Region B into Region A via the between-zone force of infection. The nonzero target gradient at Region-B zones, combined with their low simulator trust ($\kappa_{0}\approx\kappa_{\min}$ under self-consistency analysis), places them high in the PVV reducibility ranking. Minimizing PVV over the exploration policy $\pi$ therefore allocates pilot visitation to Region B, which is the mechanism by which Fisher-SEP discovers the cluster. ∎

##### Simulator-self-consistency estimator for $\kappa_{0}(s,a)$.

In practice we set $\kappa_{0}(s,a)$ from a simulator-self-consistency procedure that does not require external calibration data. Let $\hat{v}^{\mathrm{sim}}_{s,a}$ be the empirical reward variance at $(s,a)$ estimated from Monte Carlo rollouts under a mixture of the simulator-optimal and uniform-random policies, and let $\sigma^{2}_{s,a}(0)$ be the simulator’s stated calibration variance. Define

$$c_{s,a}\;:=\;\frac{\min\bigl(\hat{v}^{\mathrm{sim}}_{s,a},\,\sigma^{2}_{s,a}(0)\bigr)}{\max\bigl(\hat{v}^{\mathrm{sim}}_{s,a},\,\sigma^{2}_{s,a}(0)\bigr)}\;\in\;(0,1],$$

a ratio close to $1$ when the simulator is internally consistent at $(s,a)$ and close to $0$ when its stated variance contradicts its own rollouts. Separately define the normalized value-sensitivity

$$\bar{g}_{s,a}\;:=\;\frac{\sum_{u}d_{\pi^{\star}_{\mathrm{sim}}}(u)\,\bigl(\partial V^{\pi^{\star}_{\mathrm{sim}}}(u)/\partial\theta_{s,a}\bigr)^{2}}{\max_{(s',a')}\sum_{u}d_{\pi^{\star}_{\mathrm{sim}}}(u)\,\bigl(\partial V^{\pi^{\star}_{\mathrm{sim}}}(u)/\partial\theta_{s',a'}\bigr)^{2}}\;\in\;[0,1],$$

the simulator-optimal policy’s visitation-weighted sensitivity at $(s,a)$ normalized to $[0,1]$. We then set

$$\kappa_{0}(s,a)\;=\;\kappa_{\min}+(\kappa_{\max}-\kappa_{\min})\,c_{s,a}\,\bar{g}_{s,a},$$

defaulting to $(\kappa_{\min},\kappa_{\max})=(1,20)$. This places high prior strength at $(s,a)$ where the simulator is both self-consistent ($c_{s,a}\approx 1$) and treats the pair as important ($\bar{g}_{s,a}\approx 1$), and low prior strength elsewhere. The procedure is purely simulator-intrinsic and does not leak information from real-world data. On the HIV DGP, Region-B zones receive $\kappa_{0}\approx\kappa_{\min}$ (the simulator-optimal policy never visits them, so $\bar{g}$ is small), triggering the Proposition [4](https://arxiv.org/html/2605.21458#Thmremark4) mechanism described above.
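The rule above amounts to a few lines of arithmetic per $(s,a)$ pair. The sketch below assumes the rollout variances, stated calibration variances, and unnormalized sensitivities are already available as dictionaries keyed by $(s,a)$; that data layout, the function name, and the toy numbers are hypothetical.

```python
def kappa0_self_consistency(v_sim, sigma2_0, sensitivity,
                            kappa_min=1.0, kappa_max=20.0):
    """Prior strength per (s, a) from consistency ratio c and sensitivity g_bar.

    v_sim[key]       : rollout reward variance (must be positive)
    sigma2_0[key]    : simulator's stated calibration variance (positive)
    sensitivity[key] : unnormalized visitation-weighted squared sensitivity
    """
    g_max = max(sensitivity.values())
    kappa0 = {}
    for key in v_sim:
        lo = min(v_sim[key], sigma2_0[key])
        hi = max(v_sim[key], sigma2_0[key])
        c = lo / hi                                   # in (0, 1]
        g_bar = sensitivity[key] / g_max if g_max > 0 else 0.0
        kappa0[key] = kappa_min + (kappa_max - kappa_min) * c * g_bar
    return kappa0

v_sim = {("A", 0): 1.0, ("A", 1): 2.0, ("B", 0): 1.0}
sigma2_0 = {("A", 0): 1.0, ("A", 1): 1.0, ("B", 0): 4.0}
sensitivity = {("A", 0): 4.0, ("A", 1): 2.0, ("B", 0): 0.0}
kappa0 = kappa0_self_consistency(v_sim, sigma2_0, sensitivity)
print(kappa0)
```

In this toy input the pair `("B", 0)` has zero sensitivity under the simulator-optimal policy and so lands at $\kappa_{\min}$, mirroring the Region-B behavior described for the HIV example.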

### A\.7Bayes\-regret bound for Fisher\-SEP in the reward\-dominates regime

Fisher\-SEP is a design criterion rather than a regret\-minimizing algorithm, but its exploit phase has a Bayes\-regret bound in the reward\-dominates regime \(Corollary[2](https://arxiv.org/html/2605.21458#Thmcorollary2)\) that makes explicit how pilot length trades against exploration cost\. We state it here\.

##### Setup\.

Consider the two-phase Fisher-SEP-R protocol: for the first $T_{\mathrm{pilot}}$ steps the planner runs the A-optimal PVV-minimizing explorer $\pi^{\mathrm{explore}}$ over a fraction $f\in(0,1)$ of the unit population; the remaining $n(1-f)$ units follow an exploit policy $\pi^{\mathrm{exploit}}$ (the SOP or A-SOP). After $T_{\mathrm{pilot}}$ steps the explorers switch to the posterior-mean-optimal policy computed from the pilot-updated beliefs. Write $\pi^{\mathrm{FS}}$ for this two-phase policy and $W^{\star}:=\sup_{\pi\in\Pi_{\mathrm{adapt}}}W(\pi)$ for the Bayes-optimal value. In this subsection we state the bound in *per-unit-normalized* form: $\mathcal{R}(\pi^{\mathrm{FS}}):=(W^{\star}-W(\pi^{\mathrm{FS}}))/n$ is the per-unit Bayes regret, so the unit-population factor $n$ cancels in both the exploration-cost and misranking terms below.

##### Reward\-dominates assumption\.

The Fisher-SEP-R special case (Corollary [2](https://arxiv.org/html/2605.21458#Thmcorollary2)) applies under Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) with the transition block of the parameter-space covariance pinned by exact observation of transitions under a randomized pilot. In that regime, $\epsilon_{p}\leq\epsilon_{p}^{m}$ (the irreducible transition model residual) and the leading regret comes from reward-parameter uncertainty.

###### Theorem 4\(Bayes regret of Fisher\-SEP\-R\)\.

Under Assumptions [5](https://arxiv.org/html/2605.21458#Thmassumption5)(a)–(e) and the reward-dominates regime, with a pilot of length $T_{\mathrm{pilot}}$ and exploration fraction $f$, there exists a universal constant $C>0$ such that

$$\mathcal{R}(\pi^{\mathrm{FS}}) \;\leq\; \underbrace{2fR_{\max}\,T_{\mathrm{pilot}}}_{\text{exploration cost}} \;+\; \underbrace{\frac{C\,R_{\max}}{(1-\gamma)^{2}}\sqrt{\frac{|\mathbb{S}||\mathbb{A}|\log(|\mathbb{S}||\mathbb{A}|/\delta)}{f\,T_{\mathrm{pilot}}}}}_{\text{post-pilot misranking}} \;+\; \underbrace{\tfrac{2R_{\max}}{1-\gamma}\,\epsilon^{m}_{r}+\tfrac{2\gamma R_{\max}^{2}}{(1-\gamma)^{2}}\,\epsilon^{m}_{p}}_{\text{irreducible residual}} \tag{29}$$

with probability at least $1-\delta$, where $\epsilon^{m}_{r}:=\max_{s,a}\epsilon_{r}^{m}(s,a)$ and $\epsilon^{m}_{p}:=\max_{s,a}\epsilon_{p}^{m}(s,a)$ are the sup-norm model residuals from Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1). The bound is minimized at $T_{\mathrm{pilot}}^{\star}=\Theta\Bigl(\bigl(\frac{|\mathbb{S}||\mathbb{A}|\log(|\mathbb{S}||\mathbb{A}|/\delta)}{f^{3}(1-\gamma)^{4}}\bigr)^{1/3}\Bigr)$, yielding, with the logarithmic factor suppressed, $\mathcal{R}(\pi^{\mathrm{FS}})\leq O\bigl((|\mathbb{S}||\mathbb{A}|)^{1/3}(1-\gamma)^{-4/3}R_{\max}\bigr)+O(\epsilon^{m}/(1-\gamma)^{2})$.

###### Proof\.

Decompose the regret into three terms\.

*Exploration cost.* During the pilot, a fraction $f$ of units follow $\pi^{\mathrm{explore}}$ rather than the exploit policy; each such unit forgoes at most $R_{\max}$ per step relative to the Bayes-optimal baseline. Summing over $T_{\mathrm{pilot}}$ pilot steps and weighting by the explorer fraction $f$ gives per-unit pilot-phase cost at most $f\,R_{\max}\,T_{\mathrm{pilot}}$; the additional factor of 2 in the display absorbs the exploit-versus-Bayes-optimal gap on the non-pilot fraction during the same window (the exploit policy is sub-Bayes-optimal by at most $R_{\max}$ per step before observing the pilot). The per-unit cost is therefore at most $2fR_{\max}T_{\mathrm{pilot}}$; the $1/(1-\gamma)$ discount factor does not appear because the lost reward is summed over the finite pilot window, not compounded over an infinite horizon.

*Post-pilot misranking.* After the pilot, the explorers switch to the posterior-mean-optimal policy. The post-pilot posterior concentrates at rate $O(1/\sqrt{fT_{\mathrm{pilot}}})$ per covered pair (Cor. [1](https://arxiv.org/html/2605.21458#Thmcorollary1)). Fisher-SEP-R's pilot allocation $n_{s,a}(\pi^{\mathrm{explore}})=\Theta(fT_{\mathrm{pilot}}/(|\mathbb{S}||\mathbb{A}|))$ balances sample allocation across pairs (by construction of the A-optimal minimizer on reward parameters, Cor. [2](https://arxiv.org/html/2605.21458#Thmcorollary2)). Standard Azuma-Hoeffding on each pair, union-bounded with a $\log(|\mathbb{S}||\mathbb{A}|/\delta)$ factor, gives per-pair concentration $c_{r}\sqrt{|\mathbb{S}||\mathbb{A}|\log(|\mathbb{S}||\mathbb{A}|/\delta)/(fT_{\mathrm{pilot}})}$. Feeding this through Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1)'s value bound $|V^{\pi}(\hat{\mathcal{M}})-V^{\pi}(\mathcal{M}^{\star})|\leq 2\epsilon_{r}/(1-\gamma)+2\gamma R_{\max}\epsilon_{p}/(1-\gamma)^{2}$ and noting that in the reward-dominates regime $\epsilon_{p}$ contributes only its residual $\epsilon_{p}^{m}$ gives the second term.

*Irreducible residual.* The $\epsilon^{m}_{r}$ and $\epsilon^{m}_{p}$ residuals are functional model-form error, not reducible by any pilot. Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1)'s bound with $\epsilon_{r}=\epsilon_{r}^{m}$, $\epsilon_{p}=\epsilon_{p}^{m}$ gives the third term.

Summing the three terms and minimizing the first two over $T_{\mathrm{pilot}}$: write $L:=|\mathbb{S}||\mathbb{A}|\log(|\mathbb{S}||\mathbb{A}|/\delta)$ and set the derivative of $2fR_{\max}T_{\mathrm{pilot}}+CR_{\max}(1-\gamma)^{-2}\sqrt{L/(fT_{\mathrm{pilot}})}$ with respect to $T_{\mathrm{pilot}}$ to zero. The stationarity condition is $T_{\mathrm{pilot}}^{3/2}=C\,L^{1/2}/\bigl(4f^{3/2}(1-\gamma)^{2}\bigr)$, which gives $T_{\mathrm{pilot}}^{\star}=\Theta\bigl((L/(f^{3}(1-\gamma)^{4}))^{1/3}\bigr)$ and the stated rate. ∎
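The trade-off in the proof can be checked numerically. The sketch below evaluates the first two terms of (29) and the closed-form stationary point of $aT+b/\sqrt{T}$, namely $T=(b/(2a))^{2/3}$. The constant $C$ is not specified by the theorem, so `C=1.0` is a placeholder assumption, and the function names and parameter values are hypothetical.

```python
import math


def pilot_bound(T, f, R_max, gamma, S, A, delta, C=1.0):
    """First two terms of (29): exploration cost plus post-pilot misranking."""
    log_term = S * A * math.log(S * A / delta)
    explore = 2.0 * f * R_max * T
    misrank = C * R_max / (1.0 - gamma) ** 2 * math.sqrt(log_term / (f * T))
    return explore + misrank


def optimal_pilot_length(f, R_max, gamma, S, A, delta, C=1.0):
    """Stationary point of a*T + b/sqrt(T), i.e. T = (b / (2a))**(2/3)."""
    log_term = S * A * math.log(S * A / delta)
    a = 2.0 * f * R_max
    b = C * R_max / (1.0 - gamma) ** 2 * math.sqrt(log_term / f)
    return (b / (2.0 * a)) ** (2.0 / 3.0)


# Hypothetical problem size; C = 1 is an assumption, not a value from the paper.
params = dict(f=0.1, R_max=1.0, gamma=0.9, S=10, A=4, delta=0.05)
T_star = optimal_pilot_length(**params)
```

Because the objective is convex in $T_{\mathrm{pilot}}$, the stationary point is the minimizer; in practice the real-valued `T_star` would be rounded to an integer number of steps. Doubling `f` halves `T_star`, consistent with the $f^{-1}$ dependence in the stated $\Theta(\cdot)$ expression.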

### A.8 Fisher information notation: interventional vs. observational

The main text uses the symbol $\mathcal{I}^{\mathrm{int}}_{\theta}(s',a')$ for the per-observation Fisher information that enters Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5). The “$\mathrm{int}$” superscript marks this as the *interventional* Fisher, computed from randomized pilot observations rather than from the behavior-policy observed-data likelihood. This subsection makes the distinction explicit to avoid a potential terminology collision with the missing-data / EM literature, where “observed-data Fisher” has a different standard meaning.

###### Definition 6\(Interventional Fisher information\)\.

For each $(s',a')$ let $\mathbb{P}(R,S'\mid s',\mathrm{do}(a'))$ be the interventional distribution of reward and next state under a do-operation setting the action to $a'$ at observed state $s'$. The *interventional per-observation Fisher information* of $\theta_{s',a'}=(r_{s',a'},p_{s',a'})$ is

$$\mathcal{I}^{\mathrm{int}}_{\theta}(s',a') \;:=\; \mathbb{E}_{(R,S')\sim\mathbb{P}(\cdot\mid s',\mathrm{do}(a'))}\bigl[-\nabla^{2}_{\theta}\log\mathbb{P}(R,S'\mid s',a';\theta)\bigr]. \tag{30}$$

###### Proposition 3\(Identification of the main\-text Fisher with the interventional Fisher\)\.

Under the identifiability conditions of Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2) (randomized $S$-measurable pilot policy), the quantity $\mathcal{I}^{\mathrm{int}}_{\theta}(s',a')$ used in Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5) of the main text coincides with the interventional Fisher of Definition [6](https://arxiv.org/html/2605.21458#Thmdefinition6). The Fisher of an observed-data likelihood under the historical behavior policy is a different object (denoted $\mathcal{I}^{\mathrm{beh}}_{\theta}$ below); using it in place of $\mathcal{I}^{\mathrm{int}}_{\theta}$ would introduce confounding-induced bias.

###### Proof\.

Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2) shows that under an $S$-measurable randomized policy $\pi^{\mathrm{exp}}(a\mid s,h)=\pi^{\mathrm{exp}}(a\mid s)$, pilot observations at $(s',a')$ are marginally i.i.d. draws from $\mathbb{P}(R\mid s',\mathrm{do}(a'))$ and $\mathbb{P}(S'\mid s',\mathrm{do}(a'))$. The empirical-to-expectation Fisher identity, applied to the conjugate model of Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4), gives ([30](https://arxiv.org/html/2605.21458#A1.E30)). The behavior-policy observational Fisher $\mathcal{I}^{\mathrm{beh}}_{\theta}:=\mathbb{E}_{\mathbb{P}_{\mathrm{beh}}(\cdot\mid s',a')}[-\nabla^{2}_{\theta}\log\mathbb{P}_{\mathrm{beh}}]$ differs from $\mathcal{I}^{\mathrm{int}}_{\theta}$ under $A\not\perp H\mid S$ (Assumption [3](https://arxiv.org/html/2605.21458#Thmassumption3)); using $\mathcal{I}^{\mathrm{beh}}_{\theta}$ in Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5) would introduce bias bounded by the Rosenbaum-style multiplicative ratio of Proposition [2](https://arxiv.org/html/2605.21458#Thmproposition2) (Section [4](https://arxiv.org/html/2605.21458#S4), and this appendix below). ∎

### A.9 Delta-method validity for PVV at the design stage

The PVV ([9](https://arxiv.org/html/2605.21458#S4.E9)) approximates the posterior predictive variance of $V^{\pi^{\mathrm{tgt}}}$ by a first-order delta-method expansion around a reference parameter $\theta^{\star}$. This subsection bounds the approximation error and shows when the first-order term dominates the remainder, including in the data-starved design regime where the posterior has not yet concentrated.

###### Lemma 4\(Delta\-method remainder for PVV\)\.

Let $\theta^{\star}$ be the prior mean under Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) and let $V^{\pi^{\mathrm{tgt}}}(s;\cdot)$ be the value function as a function of MDP parameters. Assume $V^{\pi^{\mathrm{tgt}}}(s;\cdot)$ is twice continuously differentiable on a convex open neighborhood $\Theta_{0}\subseteq\mathbb{R}^{|\mathbb{S}||\mathbb{A}|(|\mathbb{S}|+1)}$ that carries posterior probability at least $1-\eta$. Denote by $H(\theta;s)$ the Hessian of $V^{\pi^{\mathrm{tgt}}}(s;\cdot)$ at $\theta$ and by $\|H\|_{\mathrm{op},\Theta_{0}}:=\sup_{\theta\in\Theta_{0},s}\|H(\theta;s)\|_{\mathrm{op}}$ its supremum operator norm over $\Theta_{0}$. Let $m_{4}:=\mathbb{E}_{b_{t}}[\|\theta-\theta^{\star}\|^{4}]$ denote the posterior fourth moment about $\theta^{\star}$. Then for the posterior variance of $V^{\pi^{\mathrm{tgt}}}$ under posterior $b_{t}$,

$$\bigl|\mathrm{Var}_{b_{t}}[V^{\pi^{\mathrm{tgt}}}(s;\theta)]-\mathrm{PVV}_{s,b_{t}}\bigr| \;\leq\; \tfrac{1}{4}\|H\|_{\mathrm{op},\Theta_{0}}^{2}\cdot m_{4} \;+\; \|H\|_{\mathrm{op},\Theta_{0}}\cdot\sqrt{m_{4}\cdot\mathrm{PVV}_{s,b_{t}}} \;+\; 2\eta\cdot(R_{\max}/(1-\gamma))^{2}, \tag{31}$$

where $\mathrm{PVV}_{s,b_{t}}=\nabla V(\theta^{\star})^{\top}\Sigma_{b_{t}}\nabla V(\theta^{\star})$ is the per-state first-order PVV at $s$.

###### Proof\.

Write $V(\theta):=V^{\pi^{\mathrm{tgt}}}(s;\theta)$. Taylor-expand at $\theta^{\star}$: $V(\theta)=V(\theta^{\star})+X(\theta)+Y(\theta)$ with $X(\theta):=\nabla V(\theta^{\star})^{\top}(\theta-\theta^{\star})$ the linear term and $Y(\theta):=\tfrac{1}{2}(\theta-\theta^{\star})^{\top}H(\tilde{\theta})(\theta-\theta^{\star})$ the quadratic remainder, for some $\tilde{\theta}$ on the segment between $\theta$ and $\theta^{\star}$. On $\Theta_{0}$, $|Y(\theta)|\leq\tfrac{1}{2}\|H\|_{\mathrm{op},\Theta_{0}}\cdot\|\theta-\theta^{\star}\|^{2}$, so $\mathbb{E}[Y^{2}]\leq\tfrac{1}{4}\|H\|^{2}\cdot m_{4}$ and hence $\mathrm{Var}(Y)\leq\tfrac{1}{4}\|H\|^{2}\cdot m_{4}$, where $\|H\|$ abbreviates $\|H\|_{\mathrm{op},\Theta_{0}}$.

Decomposing $V(\theta)-V(\theta^{\star})=X(\theta)+Y(\theta)$ and using bilinearity of covariance,

$$\mathrm{Var}[V(\theta)]=\mathrm{Var}[X]+2\,\mathrm{Cov}(X,Y)+\mathrm{Var}[Y],$$

so $\bigl|\mathrm{Var}[V]-\mathrm{Var}[X]\bigr|\leq\mathrm{Var}[Y]+2|\mathrm{Cov}(X,Y)|$. Cauchy–Schwarz gives $|\mathrm{Cov}(X,Y)|\leq\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}$, with $\mathrm{Var}(X)=\mathrm{PVV}_{s,b_{t}}$ and $\mathrm{Var}(Y)\leq\tfrac{1}{4}\|H\|^{2}m_{4}$. Combining,

$$\bigl|\mathrm{Var}[V]-\mathrm{PVV}_{s,b_{t}}\bigr| \;\leq\; \tfrac{1}{4}\|H\|^{2}m_{4}+2\sqrt{\mathrm{PVV}_{s,b_{t}}\cdot\tfrac{1}{4}\|H\|^{2}m_{4}} \;=\; \tfrac{1}{4}\|H\|^{2}m_{4}+\|H\|\sqrt{m_{4}\cdot\mathrm{PVV}_{s,b_{t}}}.$$

The $2\eta(R_{\max}/(1-\gamma))^{2}$ additive term absorbs the contribution from $\Theta_{0}^{c}$ via the uniform bound $|V(\theta)|\leq R_{\max}/(1-\gamma)$ applied to $\mathrm{Var}[V]\leq\mathbb{E}[V^{2}]$ on the low-probability set. ∎
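A numerical check of (31) with $\eta=0$ can make the roles of the terms concrete. The sketch below uses an exactly quadratic surrogate for the value function, so the Hessian is constant and the Taylor remainder is exact, and it treats the empirical distribution of the sampled parameters as the posterior. The gradient `g`, Hessian `H`, sample scale, and dimension are arbitrary illustrative choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
theta_star = np.zeros(dim)          # reference parameter (stands in for the prior mean)
g = rng.normal(size=dim)            # gradient of V at theta_star
B = rng.normal(size=(dim, dim))
H = 0.1 * (B + B.T)                 # constant symmetric Hessian


def V(theta):
    """Quadratic surrogate: V = g.dev + 0.5 * dev^T H dev, with dev = theta - theta_star."""
    dev = theta - theta_star
    return dev @ g + 0.5 * np.einsum("ni,ij,nj->n", dev, H, dev)


# Samples standing in for the posterior b_t.
theta = theta_star + 0.3 * rng.normal(size=(5000, dim))
dev = theta - theta_star

sigma = np.cov(dev, rowvar=False, bias=True)   # posterior covariance (ddof = 0)
pvv = g @ sigma @ g                            # first-order PVV: grad^T Sigma grad
m4 = np.mean(np.sum(dev**2, axis=1) ** 2)      # E ||theta - theta_star||^4
h_op = np.linalg.norm(H, 2)                    # operator (spectral) norm of H

gap = abs(np.var(V(theta)) - pvv)              # |Var[V] - PVV|
bound = 0.25 * h_op**2 * m4 + h_op * np.sqrt(m4 * pvv)  # right side of (31), eta = 0
```

Because every step of the proof (the remainder bound, $\mathrm{Var}(Y)\leq\mathbb{E}[Y^{2}]$, and Cauchy–Schwarz) also holds for an empirical distribution, `gap <= bound` holds for any sample here, not just in expectation.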

###### Proposition 4\(Validity of PVV\-minimization ranking in the data\-starved regime\)\.

Let $\pi,\pi'$ be two candidate exploration policies with PVV values $P:=\mathrm{PVV}(\pi;\pi^{\mathrm{tgt}})$ and $P':=\mathrm{PVV}(\pi';\pi^{\mathrm{tgt}})$, and let $P_{\max}:=\max(P,P')$. Let $R_{\Delta}:=\tfrac{1}{4}\|H\|_{\mathrm{op},\Theta_{0}}^{2}m_{4}+\|H\|_{\mathrm{op},\Theta_{0}}\sqrt{m_{4}\,P_{\max}}+2\eta(R_{\max}/(1-\gamma))^{2}$ be the per-policy remainder bound from Lemma [4](https://arxiv.org/html/2605.21458#Thmlemma4), with $m_{4}$ the posterior fourth moment. Then the sign of the true posterior-variance difference agrees with the sign of the PVV difference whenever $|P-P'|>2R_{\Delta}$. In the data-starved regime ($t=0$, pre-pilot) with prior from Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) satisfying $\sigma^{(0)}\leq R_{\max}$ and Dirichlet concentration $\alpha^{(0)}\geq 1$, the prior fourth moment $m_{4}$ is finite and $R_{\Delta}=O(R_{\max}^{2}\|H\|_{\mathrm{op},\Theta_{0}}^{2})$, independent of the horizon $T$.

###### Proof\.

Apply Lemma [4](https://arxiv.org/html/2605.21458#Thmlemma4) to each of $\pi,\pi'$: each first-order PVV differs from the true posterior variance of $V^{\pi^{\mathrm{tgt}}}$ by at most $R_{\Delta}$. By the triangle inequality, the true-variance difference lies within $2R_{\Delta}$ of the PVV difference, so a PVV gap of size greater than $2R_{\Delta}$ determines the sign of the true gap. ∎

### A.10 Pre-pilot Fisher sensitivity: robustness to the initial $\tau^{\mathrm{obs}}$ guess

The PVV criterion requires a per-$(s',a')$ observation precision $\tau^{\mathrm{obs}}_{s',a'}$ that by Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2) is identified from pilot data. At design time, before the pilot has been run, the planner uses a pre-pilot guess $\hat{\tau}_{0}$ derived from the simulator's stated reward variance (see Appendix [A.6](https://arxiv.org/html/2605.21458#A1.SS6.SSS0.Px2)). This raises a circularity concern: the design object depends on a quantity whose identification follows from the design being executed. This subsection shows the PVV-minimization ranking is stable under $M$-bounded multiplicative perturbation of the pre-pilot guess, so the concern is quantitative rather than fundamental.

###### Proposition 5 (Rank-stability of PVV minimizer under $\tau^{\mathrm{obs}}$ perturbation).

Let $\hat{\tau}_{0},\tau^{\star}$ be two candidate per-observation Fisher vectors satisfying $\hat{\tau}_{0}(s',a')/\tau^{\star}(s',a')\in[1/M,M]$ for some $M\geq 1$ and all $(s',a')$. Let $\pi^{\star}(\tau):=\operatorname*{arg\,min}_{\pi}\mathrm{PVV}(\pi;\pi^{\mathrm{tgt}},\tau)$ be the PVV minimizer under $\tau$. Then

$$\mathrm{PVV}(\pi^{\star}(\hat{\tau}_{0});\pi^{\mathrm{tgt}},\tau^{\star}) \;\leq\; M^{2}\cdot\mathrm{PVV}(\pi^{\star}(\tau^{\star});\pi^{\mathrm{tgt}},\tau^{\star}). \tag{32}$$

###### Proof\.

For any $(s',a')$, let $c_{s',a'}(\pi)$ denote the target-gradient-weighted per-pair prior-variance term, so that the $(s',a')$ contribution to PVV under per-observation Fisher $\tau$ and pilot count $n:=n_{s',a'}(\pi)$ is $c_{s',a'}(\pi)/(I_{0}(s',a')+n\tau)$. Under the hypothesis $\hat{\tau}_{0}/\tau^{\star}\in[1/M,M]$, we have $\hat{\tau}_{0}(s',a')\geq\tau^{\star}(s',a')/M$, so

$$I_{0}+n\hat{\tau}_{0} \;\geq\; I_{0}+\frac{n\tau^{\star}}{M} \;\geq\; \frac{I_{0}+n\tau^{\star}}{M}$$

using $M\geq 1$ and $I_{0}\geq 0$ (so $I_{0}\geq I_{0}/M$). Hence for any fixed policy $\pi$ and fixed pair $(s',a')$,

$$\frac{c_{s',a'}(\pi)}{I_{0}+n\hat{\tau}_{0}} \;\leq\; \frac{M\cdot c_{s',a'}(\pi)}{I_{0}+n\tau^{\star}},$$

giving $\mathrm{PVV}(\pi;\,\hat{\tau}_{0})\leq M\cdot\mathrm{PVV}(\pi;\,\tau^{\star})$ and symmetrically $\mathrm{PVV}(\pi;\,\tau^{\star})\leq M\cdot\mathrm{PVV}(\pi;\,\hat{\tau}_{0})$. Then

$$\mathrm{PVV}(\pi^{\star}(\hat{\tau}_{0});\,\tau^{\star}) \;\leq\; M\cdot\mathrm{PVV}(\pi^{\star}(\hat{\tau}_{0});\,\hat{\tau}_{0}) \;\leq\; M\cdot\mathrm{PVV}(\pi^{\star}(\tau^{\star});\,\hat{\tau}_{0}) \;\leq\; M^{2}\cdot\mathrm{PVV}(\pi^{\star}(\tau^{\star});\,\tau^{\star}),$$

where the middle inequality uses that $\pi^{\star}(\hat{\tau}_{0})=\operatorname*{arg\,min}_{\pi}\mathrm{PVV}(\pi;\,\hat{\tau}_{0})$. ∎
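The $M^{2}$ guarantee in (32) can be exercised on a toy design problem. In the sketch below each candidate explorer is summarized by its pilot counts $n_{s',a'}$, and the per-pair weights `c` are held fixed across candidates, which is a simplification of the policy-dependent $c_{s',a'}(\pi)$ in the proof. All numbers and names (`pvv`, `candidates`) are hypothetical.

```python
import numpy as np


def pvv(n, tau, c, I0):
    """Sum over pairs of c / (I0 + n * tau) for one allocation of pilot counts n."""
    return float(np.sum(c / (I0 + n * tau)))


rng = np.random.default_rng(1)
pairs, M = 6, 3.0
c = rng.uniform(0.1, 1.0, size=pairs)         # target-gradient weights (fixed here)
I0 = rng.uniform(0.5, 2.0, size=pairs)        # prior information per pair
tau_star = rng.uniform(0.5, 2.0, size=pairs)  # true per-observation Fisher
# Pre-pilot guess whose ratio to tau_star lies in [1/M, M] for every pair.
tau_hat = tau_star * M ** rng.uniform(-1.0, 1.0, size=pairs)

# Candidate explorers: 200 ways of splitting 100 pilot observations across pairs.
candidates = 100.0 * rng.dirichlet(np.ones(pairs), size=200)

best_hat = min(candidates, key=lambda n: pvv(n, tau_hat, c, I0))    # design under the guess
best_star = min(candidates, key=lambda n: pvv(n, tau_star, c, I0))  # design under the truth

lhs = pvv(best_hat, tau_star, c, I0)          # true PVV of the guess-based design
rhs = M**2 * pvv(best_star, tau_star, c, I0)  # M^2 times the true optimum
```

The design chosen under the misspecified `tau_hat` is never worse than `M**2` times the best candidate when both are scored under `tau_star`; the realized ratio is usually much closer to 1 than the worst-case factor.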

### A.11 Reachability positivity: an RL analog of causal overlap

The main text uses the phrase “positivity violation in the causal sense” in Section [1](https://arxiv.org/html/2605.21458#S1) to describe the support mismatch that drives the reachability gap. Classical causal-inference positivity is a condition on the treatment-assignment mechanism: $\mathbb{P}(A=a\mid X)>0$ for all actions $a$ and observed covariates $X$, where $X$ is the adjustment set [Pearl, [2009](https://arxiv.org/html/2605.21458#bib.bib140), Kallus and Zhou, [2018](https://arxiv.org/html/2605.21458#bib.bib76)]. The paper's notion is related but not identical: it concerns the *deployed policy's state visitation* rather than the *treatment assignment mechanism*. This subsection defines the correct analog, shows it is the support condition that drives Proposition [4](https://arxiv.org/html/2605.21458#Thmremark4), and relates it to Kallus-style overlap in off-policy evaluation.

###### Definition 7\(Reachability positivity\)\.

Let $\pi^{\mathrm{dep}}$ be the planner's deployed policy (e.g., the SOP or A-SOP) and let $\pi^{\mathrm{tgt}}$ be the target policy whose value is the object of Fisher-SEP's design. We say the pair $(\pi^{\mathrm{dep}},\pi^{\mathrm{tgt}})$ satisfies *reachability positivity* if

$$\mathrm{supp}(d_{\pi^{\mathrm{tgt}}})\subseteq\mathrm{supp}(d_{\pi^{\mathrm{dep}}}), \tag{33}$$

i.e., the target's discounted state-visitation distribution is absolutely continuous with respect to the deployed policy's. *Reachability positivity fails* when ([33](https://arxiv.org/html/2605.21458#A1.E33)) is violated: $\exists s$ with $d_{\pi^{\mathrm{tgt}}}(s)>0$ and $d_{\pi^{\mathrm{dep}}}(s)=0$.
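Condition (33) is easy to test in a tabular MDP. The sketch below computes discounted state-visitation distributions by solving $d=(1-\gamma)\mu+\gamma P_{\pi}^{\top}d$ and checks support inclusion. The three-state corridor, the two policies, and the helper names are hypothetical and only loosely mirror the corridor structure of the HIV example.

```python
import numpy as np


def discounted_visitation(P, pi, mu, gamma):
    """Solve d = (1 - gamma) * mu + gamma * P_pi^T d for a deterministic policy pi."""
    S = P.shape[0]
    P_pi = P[np.arange(S), pi, :]             # (S, S) transition matrix under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1.0 - gamma) * mu)


def reachability_positivity(d_tgt, d_dep, tol=1e-12):
    """True iff supp(d_tgt) is contained in supp(d_dep), as in (33)."""
    return bool(np.all((d_tgt <= tol) | (d_dep > tol)))


# Hypothetical corridor: action 0 stays put, action 1 moves one state to the right.
P = np.zeros((3, 2, 3))
for s in range(3):
    P[s, 0, s] = 1.0
    P[s, 1, min(s + 1, 2)] = 1.0
mu = np.array([1.0, 0.0, 0.0])                # all units start in state 0

pi_dep = np.array([0, 0, 0])                  # deployed policy never leaves state 0
pi_tgt = np.array([1, 1, 0])                  # target policy walks to state 2 and stays

d_dep = discounted_visitation(P, pi_dep, mu, gamma=0.9)
d_tgt = discounted_visitation(P, pi_tgt, mu, gamma=0.9)   # approximately [0.1, 0.09, 0.81]
positivity_holds = reachability_positivity(d_tgt, d_dep)  # False in this example
```

Here the target policy places most of its discounted mass on states 1 and 2, which the deployed policy never visits, so reachability positivity fails for the pair.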

###### Proposition 6\(Reachability positivity failure is the mechanism of Proposition[4](https://arxiv.org/html/2605.21458#Thmremark4)\)\.

Under Assumptions[5](https://arxiv.org/html/2605.21458#Thmassumption5)\(a\)–\(f\), the following are equivalent:

1. (i) Reachability positivity between $(\pi^{\mathrm{dep}},\pi^{\mathrm{tgt}})$ fails.
2. (ii) There exists $(s',a')$ with $n_{s',a'}(\pi^{\mathrm{dep}})=0$ and $\nabla_{\theta}V^{\pi^{\mathrm{tgt}}}(s)\neq 0$ for some $s\in\mathrm{supp}(d_{\pi^{\mathrm{tgt}}})$.
3. (iii) The Fisher-SEP explorer $\pi^{\star}=\operatorname*{arg\,min}_{\pi}\mathrm{PVV}(\pi;\pi^{\mathrm{tgt}})$ has $n_{s',a'}(\pi^{\star})>0$ at some $(s',a')$ with $n_{s',a'}(\pi^{\mathrm{dep}})=0$, and the PVV objective strictly prefers $\pi^{\star}$ over any policy supported on $\mathrm{supp}(d_{\pi^{\mathrm{dep}}})$.

###### Proof\.

(i) $\Rightarrow$ (ii): If (i) holds, take $s\in\mathrm{supp}(d_{\pi^{\mathrm{tgt}}})\setminus\mathrm{supp}(d_{\pi^{\mathrm{dep}}})$ and $(s',a')=(s,\pi^{\mathrm{tgt}}(s))$. Then $n_{s',a'}(\pi^{\mathrm{dep}})=0$ (the deployed policy does not visit $s$), and $\nabla_{\theta}V^{\pi^{\mathrm{tgt}}}(s)\neq 0$ generically because perturbations of $\theta_{s',a'}$ directly change $V^{\pi^{\mathrm{tgt}}}(s)$.

(ii) $\Rightarrow$ (iii): Given (ii), the PVV term at $(s',a')$ is reducible (Proposition [4](https://arxiv.org/html/2605.21458#Thmremark4)), and any explorer with positive visitation there achieves a strictly lower objective than any $\mathrm{supp}(d_{\pi^{\mathrm{dep}}})$-supported policy.

(iii) $\Rightarrow$ (i): If Fisher-SEP's optimum strictly dominates any $\mathrm{supp}(d_{\pi^{\mathrm{dep}}})$-supported policy, there must be a target-sensitive pair off the deployed policy's support; by definition $d_{\pi^{\mathrm{tgt}}}$ has positive mass at that pair while $d_{\pi^{\mathrm{dep}}}$ does not. ∎

### A.12 Confounding vs. drift: separating the two components of $\epsilon^{h}$

The hidden-state error $\epsilon^{h}$ in Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1) collapses two distinct phenomena: *confounding* (the action-conditional calibration distribution of $H$ differs from its calibration marginal because $\pi_{\mathrm{beh}}$ depended on $H$) and *drift* (the deployment-time and calibration-time hidden-state marginals differ even ignoring the $a$-conditioning). An $S$-measurable randomized pilot reduces both to finite-sample noise at once, but the two phenomena have very different implications for what auxiliary information would help. This subsection defines a finer decomposition and gives a sufficient condition for the two to be separately identified.

###### Definition 8\(Confounding and drift components\)\.

Let

$$\bar{\rho}_{\mathrm{calib,marg}}(s,a) := \mathbb{E}_{h\sim\mathbb{P}_{\mathrm{calib}}(\cdot\mid s)}[\bar{r}(s,a,h)], \tag{34}$$

$$\bar{\rho}_{\mathrm{calib,cond}}(s,a) := \mathbb{E}_{h\sim\mathbb{P}_{\mathrm{calib}}(\cdot\mid s,a)}[\bar{r}(s,a,h)]. \tag{35}$$

The reward-side hidden-state error $\epsilon^{h}_{r}(s,a)$ of Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1) decomposes as

$$\epsilon^{h}_{r}(s,a) \;\leq\; \underbrace{\bigl|\bar{\rho}_{\mathrm{calib,cond}}(s,a)-\bar{\rho}_{\mathrm{calib,marg}}(s,a)\bigr|}_{=:\,\epsilon^{\mathrm{conf}}_{r}(s,a)} \;+\; \underbrace{\bigl|\bar{\rho}_{\mathrm{calib,marg}}(s,a)-\rho^{\star}_{\mathrm{obs}}(s,a)\bigr|}_{=:\,\epsilon^{\mathrm{drift}}_{r}(s,a)}, \tag{36}$$

where $\epsilon^{\mathrm{conf}}$ is the confounding bias (difference between the $a$-conditional and $a$-marginal calibration expectations) and $\epsilon^{\mathrm{drift}}$ is the drift bias (difference between the calibration marginal and the deployment marginal).

###### Lemma 5 (Separate identification of $\epsilon^{\mathrm{conf}}$ and $\epsilon^{\mathrm{drift}}$).

Suppose the planner has access to:

1. (a) Calibration-time *unconditional* observations at $(s,a)$, i.e., a subsample $\mathcal{D}_{\mathrm{unc}}\subset\mathcal{D}_{\mathrm{calib}}$ for which the calibration behavior policy's action-choice at $(s,a)$ was recorded and which can be re-weighted to $\mathbb{P}_{\mathrm{calib}}(h\mid s)$; or equivalently, access to any validly re-weighted calibration sample under $\pi_{\mathrm{calib,marg}}$.
2. (b) A deployment-time randomized pilot at $(s,a)$ (Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2)).

Then:

- $\epsilon^{\mathrm{conf}}_{r}(s,a)$ is identified from within-calibration comparisons between the conditional and the re-weighted marginal expectations.
- $\epsilon^{\mathrm{drift}}_{r}(s,a)$ is identified from calibration-vs-deployment comparisons: the re-weighted calibration marginal vs. the pilot estimate.

###### Proof\.

$\epsilon^{\mathrm{conf}}_{r}=|\bar{\rho}_{\mathrm{calib,cond}}-\bar{\rho}_{\mathrm{calib,marg}}|$ involves only calibration-distribution quantities, both estimable from $\mathcal{D}_{\mathrm{calib}}$ with appropriate re-weighting (standard IPW estimator for the marginal expectation; conditional empirical mean for the conditional expectation). $\epsilon^{\mathrm{drift}}_{r}=|\bar{\rho}_{\mathrm{calib,marg}}-\rho^{\star}_{\mathrm{obs}}|$ compares the calibration marginal (estimable as above) to the deployment interventional expectation (estimable from the pilot via Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2)). ∎
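A population-level toy calculation shows how the two components in (36) separate. The sketch below considers a single $(s,a)$ pair with a binary hidden state; every probability and reward value is made up for illustration, and `eps_total` denotes the overall gap between the confounded calibration estimate and the deployment target, which is how the hidden-state error is interpreted here (an assumption, since Lemma 1 is not reproduced in this appendix).

```python
import numpy as np

# Hypothetical single (s, a) pair with a binary hidden state h.
p_h_calib = np.array([0.5, 0.5])    # P_calib(h | s)
p_h_deploy = np.array([0.2, 0.8])   # deployment-time P(h | s), shifted by drift
pi_beh = np.array([0.9, 0.3])       # pi_beh(a | s, h): action favored when h = 0
r_bar = np.array([1.0, 0.0])        # mean reward r_bar(s, a, h)

# Calibration conditional P_calib(h | s, a) via Bayes' rule.
p_h_given_a = pi_beh * p_h_calib / np.sum(pi_beh * p_h_calib)

rho_cond = float(p_h_given_a @ r_bar)   # (35): what raw calibration data estimate
rho_marg = float(p_h_calib @ r_bar)     # (34): re-weighted calibration marginal
rho_obs = float(p_h_deploy @ r_bar)     # target of a randomized deployment pilot

eps_conf = abs(rho_cond - rho_marg)     # confounding component, 0.25 here
eps_drift = abs(rho_marg - rho_obs)     # drift component, 0.30 here
eps_total = abs(rho_cond - rho_obs)     # bounded by eps_conf + eps_drift
```

Re-weighting within the calibration data removes only `eps_conf`; closing `eps_drift` requires deployment-time information such as the pilot, which is the practical content of Lemma 5.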

### A.13 Bounded-propensity-odds sensitivity bound

Section [4](https://arxiv.org/html/2605.21458#S4) bounds the bias from using $\mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a)$ in place of $(\tau^{\mathrm{int}}_{s,a})^{-1}$ by a bounded-propensity-odds sensitivity constant $M$, in the style of Tan [[2006](https://arxiv.org/html/2605.21458#bib.bib166)]; this subsection states the bound precisely.

###### Proposition 7\(Bounded\-propensity\-odds sensitivity bound\)\.

Let $w(h\mid s,a):=\pi^{\mathrm{beh}}(a\mid s,h)/\pi^{\mathrm{beh}}(a\mid s)$ be the propensity-odds ratio, and suppose $w(h\mid s,a)\in[1/M,M]$ uniformly in $h$ and $(s,a)$. Then the behavior-policy conditional reward variance and the interventional reward variance satisfy

$$\frac{1}{M^{2}}\cdot\mathrm{Var}_{\mathbb{P}(\cdot\mid s,\mathrm{do}(a))}(R) \;\leq\; \mathrm{Var}_{\pi^{\mathrm{beh}}}(R\mid s,a) \;\leq\; M^{2}\cdot\mathrm{Var}_{\mathbb{P}(\cdot\mid s,\mathrm{do}(a))}(R). \tag{37}$$

###### Proof\.

The behavior-policy conditional distribution of $R$ given $(s,a)$ is $\sum_{h}w(h\mid s,a)\,\mathbb{P}(h\mid s)\,\rho(\cdot\mid s,a,h)$ (change of measure from the interventional distribution with importance weight $w$). Writing both variances as expectations over $h$ of $\mathbb{E}[R\mid s,a,h]^{2}$ minus squared means, the multiplicative bound on $w$ gives the two-sided $M^{2}$ factor. ∎
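The two-sided bound (37) can be checked on a small discrete mixture. In the sketch below the hidden state takes three values, rewards given $h$ have the stated means and variances, and $M$ is taken as the smallest constant for which $w\in[1/M,M]$. All numbers are hypothetical; the variance of a mixture is computed exactly from the law of total variance rather than by sampling.

```python
import numpy as np

p_h = np.array([0.3, 0.4, 0.3])        # P(h | s)
pi_beh_h = np.array([0.2, 0.5, 0.6])   # pi_beh(a | s, h)
mean_h = np.array([0.0, 1.0, 3.0])     # E[R | s, a, h]
var_h = np.array([1.0, 0.5, 2.0])      # Var[R | s, a, h]

# w(h | s, a) = pi_beh(a | s, h) / pi_beh(a | s); its P(h | s)-average equals 1.
w = pi_beh_h / np.sum(pi_beh_h * p_h)
M = max(w.max(), 1.0 / w.min())        # smallest M with w in [1/M, M]


def mixture_variance(weights):
    """Variance of R when h is drawn with the given probabilities."""
    second_moment = np.sum(weights * (var_h + mean_h**2))
    return float(second_moment - np.sum(weights * mean_h) ** 2)


var_int = mixture_variance(p_h)        # under do(a): h ~ P(h | s)
var_beh = mixture_variance(w * p_h)    # under pi_beh: h ~ w(h | s, a) * P(h | s)
```

In this example the two variances are close even though `M` is above 2, a reminder that (37) is a worst-case envelope over all weightings consistent with the propensity bound rather than a typical-case estimate.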

## Appendix B Policy Algorithms

This appendix provides Python-style pseudocode for each level of the policy hierarchy (Table [A12](https://arxiv.org/html/2605.21458#A5.T12)). All algorithms operate on a tabular MDP with $S$ states, $K$ actions, discount factor $\gamma$, and a population of $n$ units. The simulator supplies a prior mean MDP $\hat{\mathcal{M}}_{\mathrm{sim}}=(\hat{\rho}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}})$. We first define the shared subroutines (belief initialization, Bayesian updating, posterior MDP construction, value iteration, and EPI computation), then present the policy algorithms in order of increasing sophistication.

### B.1 Shared Subroutines

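Listing 1: Initialize Bayesian belief state from the simulator.

```python
import numpy as np


def init_belief(sim_rewards, sim_transitions, sigma_0, alpha_0):
    """Initialize conjugate priors from the simulator's predictions."""
    S, K = sim_rewards.shape

    # Gaussian prior on rewards: simulator mean, prior precision 1 / sigma_0^2.
    means = sim_rewards.copy()
    precisions = np.full((S, K), 1.0 / sigma_0**2)

    # Dirichlet prior on transitions: one uniform pseudo-count per next state
    # plus alpha_0 pseudo-observations spread according to the simulator kernel.
    alpha = np.ones((S, K, S)) + alpha_0 * sim_transitions

    return BeliefState(means, precisions, alpha)
```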
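The listings refer to `BeliefState` and `TabularMDP` without defining them. A minimal sketch of what such containers could look like is given here; the field names follow the attribute accesses in the listings, while the `R_max` definition (largest absolute reward) is an assumption made for illustration.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BeliefState:
    means: np.ndarray        # (S, K) Gaussian posterior means of the rewards
    precisions: np.ndarray   # (S, K) Gaussian posterior precisions
    alpha: np.ndarray        # (S, K, S) Dirichlet parameters of the transitions


@dataclass
class TabularMDP:
    rewards: np.ndarray      # (S, K) mean rewards
    transitions: np.ndarray  # (S, K, S) transition probabilities
    gamma: float             # discount factor

    @property
    def S(self):
        return self.rewards.shape[0]

    @property
    def K(self):
        return self.rewards.shape[1]

    @property
    def R_max(self):
        # Assumption: reward bound taken as the largest absolute mean reward.
        return float(np.abs(self.rewards).max())


# Hypothetical 3-state, 2-action simulator with uniform transitions.
sim = TabularMDP(rewards=np.zeros((3, 2)),
                 transitions=np.full((3, 2, 3), 1.0 / 3),
                 gamma=0.9)
belief = BeliefState(means=sim.rewards.copy(),
                     precisions=np.full((3, 2), 1.0),
                     alpha=np.ones((3, 2, 3)) + 5.0 * sim.transitions)
```

With `alpha_0 = 5.0`, each Dirichlet row carries $S + \alpha_0 = 8$ pseudo-counts in this example.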

Listing 2: Conjugate Bayesian update from a single observation.

```python
def bayes_update(belief, s, a, reward, next_state, obs_var):
    """Update beliefs from one (s, a, r, s') observation."""
    # Gaussian conjugate update for the reward mean (known observation variance):
    # precisions add, and the mean is the precision-weighted average.
    obs_precision = 1.0 / obs_var
    new_precision = belief.precisions[s, a] + obs_precision
    new_mean = (belief.precisions[s, a] * belief.means[s, a]
                + obs_precision * reward) / new_precision
    belief.means[s, a] = new_mean
    belief.precisions[s, a] = new_precision

    # Dirichlet conjugate update: add one count for the observed next state.
    belief.alpha[s, a, next_state] += 1.0
```
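To make the Gaussian half of the update concrete, here is a small standalone sketch with illustrative numbers. The helper `gaussian_update` is hypothetical and simply isolates the reward arithmetic of Listing 2.

```python
def gaussian_update(prior_mean, prior_precision, reward, obs_var):
    """Known-variance Gaussian conjugate update for one reward observation."""
    obs_precision = 1.0 / obs_var
    post_precision = prior_precision + obs_precision
    post_mean = (prior_precision * prior_mean
                 + obs_precision * reward) / post_precision
    return post_mean, post_precision


# Prior mean 1.0 with precision 1.0; one reward of 3.0 observed with variance 1.0.
# Prior and observation carry equal weight, so the mean moves halfway.
mean, precision = gaussian_update(1.0, 1.0, 3.0, 1.0)

# The same reward observed with variance 100.0 barely moves the mean.
mean_noisy, precision_noisy = gaussian_update(1.0, 1.0, 3.0, 100.0)
print(mean, precision, mean_noisy)
```

The contrast between the two calls shows why `obs_var` matters for how quickly the simulator prior is overridden by field data.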

Listing 3: Construct the posterior mean MDP from current beliefs.

```python
def posterior_mdp(belief, gamma):
    """Build a tabular MDP from posterior means."""
    S, K = belief.means.shape

    # Posterior mean rewards.
    rewards = belief.means.copy()

    # Posterior mean transitions: normalize the Dirichlet parameters.
    alpha_sums = belief.alpha.sum(axis=2, keepdims=True)
    transitions = belief.alpha / alpha_sums

    return TabularMDP(rewards, transitions, gamma)
```

Listing 4: Value iteration for a tabular MDP.

```python
def value_iteration(mdp, tol=1e-10):
    """Solve for the optimal value function and greedy policy."""
    S, K = mdp.S, mdp.K
    V = np.zeros(S)

    while True:
        # Bellman optimality backup: Q[s, a] = r[s, a] + gamma * E[V(s') | s, a].
        Q = mdp.rewards + mdp.gamma * (mdp.transitions @ V)
        V_new = Q.max(axis=1)

        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new

    # Greedy policy with respect to the converged Q-values.
    pi = Q.argmax(axis=1)
    return V, pi
```
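A quick way to check a value-iteration routine is a toy MDP whose solution is known by hand. The sketch below uses a standalone variant that takes arrays directly (an assumption made so the example runs without the `TabularMDP` container) on a hypothetical two-state problem.

```python
import numpy as np


def value_iteration_arrays(rewards, transitions, gamma, tol=1e-10):
    """Value iteration on raw arrays: rewards (S, K), transitions (S, K, S)."""
    V = np.zeros(rewards.shape[0])
    while True:
        Q = rewards + gamma * (transitions @ V)   # (S, K) Bellman backup
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            return V, Q.argmax(axis=1)


# State 1 is absorbing and pays 1 per step; state 0 pays nothing.
rewards = np.array([[0.0, 0.0],
                    [1.0, 1.0]])
transitions = np.zeros((2, 2, 2))
transitions[0, 0, 0] = 1.0   # state 0, action 0: stay in state 0
transitions[0, 1, 1] = 1.0   # state 0, action 1: move to state 1
transitions[1, :, 1] = 1.0   # state 1: absorbing under both actions

V, pi = value_iteration_arrays(rewards, transitions, gamma=0.5)
print(V, pi)
```

By hand, $V(1)=1/(1-0.5)=2$ and $V(0)=0.5\cdot V(1)=1$, with action 1 optimal in state 0, which is what the routine should recover up to the tolerance.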

Listing 5: Compute the Exploration Priority Index (Definition [3.2](https://arxiv.org/html/2605.21458#S3.SS2)).

```python
def compute_epi(mdp, pi, belief, beta_conf, obs_var, T):
    """EPI(s, a) = d(s) * (reward_term + transition_term)."""
    S, K, gamma = mdp.S, mdp.K, mdp.gamma

    # State-to-state kernel under the deterministic policy pi.
    P_pi = mdp.transitions[np.arange(S), pi, :]

    # Discounted state visitation over T + 1 steps from a uniform start.
    mu = np.ones(S) / S
    d = np.zeros(S)
    for t in range(T + 1):
        d += (gamma**t) * mu
        mu = mu @ P_pi
    # Normalize so that d sums to one.
    d *= (1 - gamma) / (1 - gamma**(T + 1))

    # Reward-uncertainty and transition-uncertainty terms.
    sigmas = 1.0 / np.sqrt(belief.precisions)
    reward_term = (sigmas**2 + beta_conf**2) / obs_var
    alpha_sums = belief.alpha.sum(axis=2)
    transition_term = gamma * mdp.R_max * (S - 1) / ((1 - gamma) * alpha_sums)

    epi = d[:, None] * (reward_term + transition_term)
    return epi
```

### B.2 Policy Algorithms

Listing 6: Level 0: Simulator-Optimal Policy (SOP).

```python
def run_sop(sim_mdp, n_units, T):
    """Deploy the simulator-trained policy without adaptation."""
    # Solve the simulator MDP once; the policy is never revised.
    V, pi = value_iteration(sim_mdp)

    for t in range(T):
        for i in range(n_units):
            action = pi[states[i]]
            # The action is executed in the field; nothing is learned from it.
```

Listing 7: Level 1: $\epsilon$-Greedy Perturbation.

```python
def run_epsilon_greedy(sim_mdp, n_units, T, epsilon):
    """Undirected stochastic exploration, no learning."""
    K = sim_mdp.K
    V, pi = value_iteration(sim_mdp)

    for t in range(T):
        for i in range(n_units):
            if np.random.random() < epsilon:
                action = np.random.randint(K)
            else:
                action = pi[states[i]]
            # The action is executed in the field; no belief update follows.
```

Listing 8: Level 1′: Adaptive SOP (A-SOP / Passive Learning).

```python
def run_asop(sim_mdp, n_units, T, obs_var, replan_interval=5):
    """Passive learning: update beliefs, always follow posterior optimum."""
    belief = init_belief(sim_mdp.rewards, sim_mdp.transitions,
                         sigma_0=1.0, alpha_0=5.0)

    for t in range(T):
        # Replan periodically on the posterior mean MDP.
        if t % replan_interval == 0:
            mdp_post = posterior_mdp(belief, sim_mdp.gamma)
            V, pi = value_iteration(mdp_post)

        for i in range(n_units):
            # Always act greedily with respect to the current posterior.
            action = pi[states[i]]
            reward, next_state = env.step(states[i], action)

            # Learn only from the pairs the greedy policy happens to visit.
            bayes_update(belief, states[i], action, reward, next_state, obs_var)
```

Listing 9: Level 2: KG-SEP (Myopic Knowledge Gradient).

```python
def run_kg_sep(sim_mdp, n_units, T, obs_var, replan_interval=5):
    """Targeted per-state exploration using the augmented Q-value."""
    K, gamma = sim_mdp.K, sim_mdp.gamma
    belief = init_belief(sim_mdp.rewards, sim_mdp.transitions,
                         sigma_0=1.0, alpha_0=5.0)
    V_sim, pi_sim = value_iteration(sim_mdp)

    for t in range(T):
        mdp_post = posterior_mdp(belief, sim_mdp.gamma)

        for i in range(n_units):
            s = states[i]
            # Augmented Q-value: immediate reward + information value + position value.
            Q_aug = np.zeros(K)
            for a in range(K):
                earn = belief.means[s, a]
                learn = gamma * (reward_info(belief, s, a, obs_var)
                                 + transition_info(belief, s, a, sim_mdp))
                position = gamma * np.dot(
                    mdp_post.transitions[s, a], V_sim)
                Q_aug[a] = earn + learn + position

            # Act on the augmented Q-value, then update beliefs.
            action = np.argmax(Q_aug)
            reward, next_state = env.step(s, action)
            bayes_update(belief, s, action, reward, next_state, obs_var)
```

The KG\-SEP listing above shows the per\-unit reduction\. In the multi\-unit\-per\-step setting that the experiments use, the implementation \(methods/policies/knowledge\_gradient\.py\) allocates units across actions proportionally to the augmented Q\-values rather than concentrating all units on the argmax; the per\-unit and proportional rules coincide when only one unit is being assigned at a state, which is the semantics the pseudocode targets\.

Listing 10: Level 3: Trajectory-Planned SEP (EPI-Directed).

```python
def run_sep(sim_mdp, n_units, T, obs_var,
            T_pilot=5, T_explore=15, eps_explore=0.5,
            replan_interval=5, reexplore_threshold=0.8):
    """Three-phase SEP: pilot -> EPI-directed explore -> exploit."""
    K = sim_mdp.K
    belief = init_belief(sim_mdp.rewards, sim_mdp.transitions,
                         sigma_0=1.0, alpha_0=5.0)
    V_sim, pi_sim = value_iteration(sim_mdp)
    epi = compute_epi(sim_mdp, pi_sim, belief, beta_conf=0, obs_var=obs_var,
                      T=int(1 / (1 - sim_mdp.gamma)))
    pi_exploit = pi_sim.copy()
    epi_baseline = None

    for t in range(T):
        # Phase 1 (pilot): uniformly randomized actions.
        if t < T_pilot:
            for i in range(n_units):
                action = np.random.randint(K)
                reward, next_state = env.step(states[i], action)
                bayes_update(belief, states[i], action, reward,
                             next_state, obs_var)

        # Phase 2 (explore): with probability eps_explore take the
        # highest-EPI action at the current state, else exploit.
        elif t < T_pilot + T_explore:
            for i in range(n_units):
                if np.random.random() < eps_explore:
                    action = np.argmax(epi[states[i]])
                else:
                    action = pi_exploit[states[i]]
                reward, next_state = env.step(states[i], action)
                bayes_update(belief, states[i], action, reward,
                             next_state, obs_var)

        # Phase 3 (exploit): follow the current exploitation policy.
        else:
            for i in range(n_units):
                action = pi_exploit[states[i]]
                reward, next_state = env.step(states[i], action)
                bayes_update(belief, states[i], action, reward,
                             next_state, obs_var)

            # Record the EPI level at the start of exploitation. The branch
            # taken when EPI exceeds reexplore_threshold * baseline is left
            # as a placeholder in this listing.
            if epi_baseline is None:
                epi_baseline = epi.max()
            elif epi.max() > reexplore_threshold * epi_baseline:
                pass

        # Periodic replanning: refresh the exploitation policy and the EPI
        # from the posterior mean MDP.
        if t % replan_interval == 0:
            mdp_post = posterior_mdp(belief, sim_mdp.gamma)
            _, pi_exploit = value_iteration(mdp_post)
            epi = compute_epi(mdp_post, pi_exploit, belief,
                              beta_conf=0, obs_var=obs_var,
                              T=int(1 / (1 - sim_mdp.gamma)))
```

The Fisher-SEP implementation below invokes four quantities defined in Section [4](https://arxiv.org/html/2605.21458#S4) of the main text: the value gradient $\nabla_{\theta}V^{\pi}$ (Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5)), the prior Fisher information $I_{0}(s,a)=\kappa_{0}(s,a)/\sigma^{2}_{s,a}(0)$ with per-$(s,a)$ prior strength $\kappa_{0}$ set by the simulator-self-consistency heuristic of Appendix [A.6](https://arxiv.org/html/2605.21458#A1.SS6.SSS0.Px2), the Normal-Inverse-Gamma (or Beta-Binomial, for the HIV DGP) posterior variance $\mathrm{Var}(\theta_{s,a}\mid\mathcal{D})$ after pilot data, and the per-$(s,a)$ observation precision $\tau^{\mathrm{obs}}_{s,a}$ identified from randomized-action pilot data (Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2); in this appendix $\tau^{\mathrm{obs}}_{s,a}=\tau^{\mathrm{int}}_{s,a}$ of Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5)). The posterior predictive value variance $\mathrm{PVV}(\pi)$ (Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5)) is evaluated on a candidate stochastic policy $\pi$ by `compute_pvv`; `minimize_pvv_policy` minimizes it directly via coordinate descent (the objective is non-linear in $\pi$ because the expected observation count $n_{s',a'}(\pi)$ appears in the denominator).

Listing 11: Level 3′: Fisher-SEP (Bayesian A-optimal posterior-value-variance).

```python
def run_fisher_sep(sim_mdp, n_units, T, obs_var,
                   T_pilot=5, T_explore=15, eps_explore=0.5,
                   replan_interval=10, fisher_iters=60):
    """Three-phase Fisher-SEP (v3): pilot -> PVV-minimize -> exploit.

    Minimizes the posterior predictive value variance PVV(pi) of
    Definition 5 via coordinate descent on stochastic pi.
    """
    K = sim_mdp.K

    # Prior Fisher information I_0 = kappa_0 / sigma^2 (Appendix A.6 heuristic).
    kappa_0 = compute_kappa0_self_consistency(
        sim_mdp, sim_mdp.stated_variances, n_rollouts=100)
    I_0 = kappa_0 / sim_mdp.stated_variances

    belief = init_belief(sim_mdp.rewards, sim_mdp.transitions,
                         sigma_0=1.0, alpha_0=2.0)
    pilot_obs = {}

    # Before any pilot data the parameter variance is the prior variance 1 / I_0;
    # the observation precision is 1 / obs_var.
    theta_var = 1.0 / I_0
    tau_obs = 1.0 / obs_var

    _, pi_exploit = value_iteration(sim_mdp)
    pi_fisher = minimize_pvv_policy(
        sim_mdp, theta_var, tau_obs, T=int(1 / (1 - sim_mdp.gamma)),
        n_iter=fisher_iters)

    for t in range(T):
        if t < T_pilot:
            # Phase 1 (pilot): uniformly randomized actions.
            for i in range(n_units):
                action = np.random.randint(K)
                r, s_next = env.step(states[i], action)
                pilot_obs.setdefault((states[i], action), []).append(r)
                bayes_update(belief, states[i], action, r, s_next, obs_var)

        elif t < T_pilot + T_explore:
            # Phase 2 (explore): sample from the PVV-minimizing policy
            # with probability eps_explore, else exploit.
            for i in range(n_units):
                if np.random.random() < eps_explore:
                    action = np.random.choice(K, p=pi_fisher[states[i]])
                else:
                    action = pi_exploit[states[i]]
                r, s_next = env.step(states[i], action)
                pilot_obs.setdefault((states[i], action), []).append(r)
                bayes_update(belief, states[i], action, r, s_next, obs_var)

        else:
            # Phase 3 (exploit): follow the current exploitation policy.
            for i in range(n_units):
                action = pi_exploit[states[i]]
                r, s_next = env.step(states[i], action)
                bayes_update(belief, states[i], action, r, s_next, obs_var)

        # Periodic replanning: refresh the exploitation policy, the posterior
        # parameter variance, and the PVV-minimizing policy (warm-started).
        if t % replan_interval == 0:
            mdp_post = posterior_mdp(belief, sim_mdp.gamma)
            _, pi_exploit = value_iteration(mdp_post)
            theta_var = compute_theta_posterior_variance(
                pilot_obs, sim_mdp.stated_variances, sim_mdp.rewards,
                kappa_0=kappa_0, alpha_0=2.0)
            pi_fisher = minimize_pvv_policy(
                mdp_post, theta_var, tau_obs, T=int(1 / (1 - sim_mdp.gamma)),
                n_iter=fisher_iters, init_pi=pi_fisher)


def compute_pvv(mdp, pi, theta_var, tau_obs, T):
    """PVV(pi) = sum_s d_pi(s) * sum_{s',a'} grad_V(s; s',a')^2
                 * PostVar(theta_{s',a'} | n_{s',a'}(pi) pilot obs).

    See Definition 5.
    """
    d_pi, n_sa = visitation_and_obs_counts(mdp, pi, T)
    prior_prec = 1.0 / theta_var
    post_prec = prior_prec + n_sa * tau_obs
    post_var_candidate = 1.0 / post_prec

    grad_V = reward_gradient_matrix(mdp, pi)
    weighted = grad_V**2 * post_var_candidate.flatten()[None, :]
    return float(d_pi @ weighted.sum(axis=1))


def minimize_pvv_policy(mdp, theta_var, tau_obs, T,
                        n_iter=60, lr=0.3, init_pi=None):
    """Coordinate descent to MINIMIZE PVV(pi).

    At each step: pick random s, try each "shift lr toward action a"
    candidate, keep the row with lowest PVV. Non-linear in pi because
    n_{s',a'}(pi) appears in the denominator.
    """
    S, K = mdp.S, mdp.K
    pi = init_pi.copy() if init_pi is not None else np.full((S, K), 1.0 / K)
    base = compute_pvv(mdp, pi, theta_var, tau_obs, T)

    for _ in range(n_iter):
        s = np.random.randint(S)
        best_pvv, best_row = base, pi[s].copy()
        for a in range(K):
            # Candidate row: move a fraction lr of the mass toward action a.
            row = (1 - lr) * pi[s].copy()
            row[a] += lr
            row /= row.sum()
            pi_try = pi.copy()
            pi_try[s] = row
            trial = compute_pvv(mdp, pi_try, theta_var, tau_obs, T)
            if trial < best_pvv:
                best_pvv, best_row = trial, row
        pi[s], base = best_row, best_pvv
    return pi
```
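`compute_pvv` calls a helper `reward_gradient_matrix` that the listing does not define. One plausible form, sketched here under the assumption that the gradient is taken with respect to the mean rewards $r(s',a')$, uses the Bellman resolvent: $\partial V^{\pi}(s)/\partial r(s',a') = [(I-\gamma P^{\pi})^{-1}]_{s,s'}\,\pi(a'\mid s')$. The array-based signature below is hypothetical and differs from the `(mdp, pi)` signature in the listing.

```python
import numpy as np


def reward_gradient_matrix(transitions, gamma, pi):
    """G[s, s'*K + a'] = dV^pi(s) / dr(s', a') for stochastic pi of shape (S, K)."""
    S, K = pi.shape
    P_pi = np.einsum("sk,skt->st", pi, transitions)        # kernel under pi
    resolvent = np.linalg.inv(np.eye(S) - gamma * P_pi)    # discounted occupancy
    # Column order matches a row-major flatten of an (S, K) array.
    return (resolvent[:, :, None] * pi[None, :, :]).reshape(S, S * K)


def policy_value(rewards, transitions, gamma, pi):
    """Exact V^pi by solving the linear Bellman equation."""
    P_pi = np.einsum("sk,skt->st", pi, transitions)
    r_pi = (pi * rewards).sum(axis=1)
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)


# Hypothetical random MDP and policy, seeded for reproducibility.
rng = np.random.default_rng(0)
S, K, gamma = 4, 3, 0.9
rewards = rng.normal(size=(S, K))
transitions = rng.dirichlet(np.ones(S), size=(S, K))   # shape (S, K, S)
pi = rng.dirichlet(np.ones(K), size=S)                 # shape (S, K)

G = reward_gradient_matrix(transitions, gamma, pi)
V = policy_value(rewards, transitions, gamma, pi)

# V^pi is linear in the rewards, so the gradient reproduces it exactly.
assert np.allclose(G @ rewards.flatten(), V)
```

Because $V^{\pi}$ is linear in the rewards, the identity `G @ rewards.flatten() == V` gives an exact check rather than a finite-difference approximation.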

##### DGP\-specific realizations\.

The listing above is the generic tabular Fisher\-SEP\. Two of our case studies deviate from it for DGP\-specific reasons:

- *Vending (`fisher_sep_v3_vending.py`):* per-(vm, product) PVV scores are ranked for allocation rather than running the full coordinate-descent minimizer, because the vending DGP has deterministic visitation given an allocation, so the ranking-based policy achieves the PVV minimum up to combinatorial rounding. The gradient $\partial V/\partial\lambda_{ij}$ is computed by a 30-day yield finite difference with $\delta=0.5$.
- *HIV (`fisher_sep_v3_hiv.py`):* the static Bellman-resolvent gradient is replaced by a 15-day SIS-propagated finite-difference gradient, because perturbing per-zone prevalence changes future prevalence at neighbouring zones through the disease dynamics, and that spread effect dominates the local sensitivity (Appendix [D.11](https://arxiv.org/html/2605.21458#A4.SS11)). The A-optimal policy class is restricted to the navigation-restricted variant of Fisher-SEP-T (Conjecture [1](https://arxiv.org/html/2605.21458#Thmconjecture1)): rank zones by their per-zone PVV contribution, then allocate teams between explore (Region B targets) and exploit (Region A best-known) in proportion to the PVV share in each region. Posterior parameter variance uses the Beta-Binomial form of Eq. ([52](https://arxiv.org/html/2605.21458#A4.E52)) in place of NIG.

Listing 12: Thompson Sampling (PSRL, external baseline).

```python
def run_thompson_sampling(sim_mdp, n_units, T, obs_var, replan_interval=5):
    """Posterior Sampling for RL (Osband et al., 2013)."""
    S, K = sim_mdp.S, sim_mdp.K
    belief = init_belief(sim_mdp.rewards, sim_mdp.transitions,
                         sigma_0=1.0, alpha_0=5.0)

    for t in range(T):
        if t % replan_interval == 0:
            # Draw one MDP from the current posterior.
            sampled_rewards = np.zeros((S, K))
            sampled_transitions = np.zeros((S, K, S))
            for s in range(S):
                for a in range(K):
                    # Rewards: sample from the Gaussian posterior.
                    mu = belief.means[s, a]
                    sigma = 1.0 / np.sqrt(belief.precisions[s, a])
                    sampled_rewards[s, a] = np.random.normal(mu, sigma)

                    # Transitions: sample from the Dirichlet posterior.
                    sampled_transitions[s, a] = np.random.dirichlet(
                        belief.alpha[s, a])

            # Solve the sampled MDP and act on it until the next resample.
            sampled_mdp = TabularMDP(sampled_rewards, sampled_transitions,
                                     sim_mdp.gamma)
            _, pi = value_iteration(sampled_mdp)

        for i in range(n_units):
            # Act greedily with respect to the sampled MDP.
            action = pi[states[i]]
            reward, next_state = env.step(states[i], action)
            bayes_update(belief, states[i], action, reward,
                         next_state, obs_var)
```

### B.3 Reduction to classical limits

Fisher-SEP reduces to two familiar objects at the endpoints of the information regime. When observations at every target-relevant $(s',a')$ dominate the prior ($n\tau^{\mathrm{obs}}\gg I_{0}$) and $\tau^{\mathrm{obs}}$ is uniform across $(s',a')$, minimizing ([9](https://arxiv.org/html/2605.21458#S4.E9)) is equivalent to maximizing the visitation-weighted squared sensitivity $\sum_{s}d_{\pi^{\mathrm{tgt}}}(s)\,(\partial V^{\pi^{\mathrm{tgt}}}/\partial\theta_{s',a'})^{2}$, which is the classical A-optimal design for reward-parameter estimation. When no pilot observations have been collected ($n\tau^{\mathrm{obs}}\ll I_{0}$), the ranking across $(s',a')$ is the same but weighted by the prior variance $(\sigma^{(0)}_{s',a'})^{2}/\kappa_{0}(s',a')$; the explorer allocates pilot visitation to the top-ranked pairs to break out of the data-starved regime. Fisher-SEP interpolates smoothly between these two limits as pilot data accumulates, so it inherits the asymptotic guarantees of classical A-optimality in the data-rich regime and the finite-sample structure of simulator-prior-weighted ranking in the data-starved regime.
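The two limits can be read off the posterior variance $1/(I_{0}+n\tau^{\mathrm{obs}})$ that `compute_pvv` uses as `1.0 / post_prec`. The sketch below evaluates it at both ends of the regime; the numbers are illustrative assumptions.

```python
def posterior_variance(I_0, n, tau_obs):
    """Posterior variance of a Gaussian mean: 1 / (prior precision + n * tau_obs)."""
    return 1.0 / (I_0 + n * tau_obs)


I_0, tau_obs = 4.0, 1.0   # hypothetical prior information and observation precision

# Data-starved regime (n * tau_obs << I_0): the prior variance 1 / I_0 dominates.
data_starved = posterior_variance(I_0, n=0, tau_obs=tau_obs)

# Data-rich regime (n * tau_obs >> I_0): the variance approaches 1 / (n * tau_obs).
n_large = 10_000
data_rich = posterior_variance(I_0, n=n_large, tau_obs=tau_obs)
classical_limit = 1.0 / (n_large * tau_obs)

# Intermediate n interpolates between the two limits.
midway = posterior_variance(I_0, n=4, tau_obs=tau_obs)
print(data_starved, midway, data_rich, classical_limit)
```

At $n\tau^{\mathrm{obs}}=I_{0}$ the variance is exactly half the prior variance, which marks the point where pilot data and simulator prior contribute equally.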

### B.4 Transition-parameter PVV (Fisher-SEP-T derivation)

This subsection derives the transition\-parameter variant of the PVV \(Corollary[3](https://arxiv.org/html/2605.21458#Thmcorollary3)\) and connects it to the SIS finite\-difference gradient used in the HIV experiment\.

##### Transition Bellman resolvent gradient\.

Write $V^{\pi^{\mathrm{tgt}}}(s)$ for the value function of the target policy under a transition kernel $p(\cdot\mid\cdot,\cdot)$ and reward $r(\cdot,\cdot)$. Fix $(s',a')$ and perturb $p(\cdot\mid s',a')$ by a zero-sum perturbation $\delta\in\mathbb{R}^{|\mathbb{S}|}$ (i.e., $\sum_{s''}\delta_{s''}=0$). The first-order change in $V^{\pi^{\mathrm{tgt}}}$ at state $s$ is

$$\nabla_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}(s)^{\top}\delta\;=\;\gamma\,\bigl[(I-\gamma P^{\pi^{\mathrm{tgt}}})^{-1}\bigr]_{s,s'}\,\pi^{\mathrm{tgt}}(a'\mid s')\,\sum_{s''}\delta_{s''}\,V^{\pi^{\mathrm{tgt}}}(s'').$$

The Bellman resolvent $(I-\gamma P^{\pi^{\mathrm{tgt}}})^{-1}$ at $(s,s')$ encodes the discounted occupancy of $s'$ starting from $s$; the factor $\pi^{\mathrm{tgt}}(a'\mid s')$ is the probability that the target policy chooses $a'$ at $s'$; and $\sum_{s''}\delta_{s''}V^{\pi^{\mathrm{tgt}}}(s'')$ is the perturbation's expected change in next-state value.
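The resolvent formula can be checked numerically against a central finite difference on a small random MDP. The sketch below does so; the MDP, policy, and perturbation are hypothetical, and the helper names are introduced only for this illustration.

```python
import numpy as np


def policy_value(rewards, transitions, gamma, pi):
    """Exact V^pi for a tabular MDP and stochastic policy pi of shape (S, K)."""
    P_pi = np.einsum("sk,skt->st", pi, transitions)   # state kernel under pi
    r_pi = (pi * rewards).sum(axis=1)                 # expected reward under pi
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)


def transition_gradient(rewards, transitions, gamma, pi, s_p, a_p, delta):
    """First-order change of V^pi(s), for all s, when p(.|s_p, a_p) moves along delta."""
    S = rewards.shape[0]
    P_pi = np.einsum("sk,skt->st", pi, transitions)
    resolvent = np.linalg.inv(np.eye(S) - gamma * P_pi)
    V = policy_value(rewards, transitions, gamma, pi)
    return gamma * resolvent[:, s_p] * pi[s_p, a_p] * (delta @ V)


rng = np.random.default_rng(0)
S, K, gamma = 3, 2, 0.9
rewards = rng.normal(size=(S, K))
transitions = rng.dirichlet(np.ones(S), size=(S, K))   # shape (S, K, S)
pi = rng.dirichlet(np.ones(K), size=S)                 # shape (S, K)

s_p, a_p = 1, 0
delta = np.array([1.0, -1.0, 0.0])                     # zero-sum perturbation

analytic = transition_gradient(rewards, transitions, gamma, pi, s_p, a_p, delta)

# Central finite difference in the direction delta.
eps = 1e-5
plus, minus = transitions.copy(), transitions.copy()
plus[s_p, a_p] += eps * delta
minus[s_p, a_p] -= eps * delta
finite_diff = (policy_value(rewards, plus, gamma, pi)
               - policy_value(rewards, minus, gamma, pi)) / (2 * eps)

assert np.allclose(analytic, finite_diff, atol=1e-4)
```

The same finite-difference pattern, applied to a simulator rollout instead of a linear solve, is the idea behind the numerical gradient used when the transitions are not tabular.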

##### Dirichlet posterior variance\.

Under Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4), the prior on $\wp^{\star}_{\mathrm{obs}}(\cdot\mid s',a')$ is $\mathrm{Dir}(\alpha^{(0)}_{s',a',\cdot})$. After $n_{s',a'}$ pilot observations with empirical next-state counts $\{c_{s',a',s''}\}$, the posterior is $\mathrm{Dir}(\alpha^{(0)}_{s',a',s''}+c_{s',a',s''})$. The posterior covariance matrix $\Sigma_{p}(s',a')$ on the simplex has entries

$$\Sigma_{p}(s',a')_{s'',s'''}\;=\;\frac{\alpha_{s'',s',a'}\,(\alpha_{0,s',a'}+n_{s',a'}-\alpha_{s'',s',a'})}{(\alpha_{0,s',a'}+n_{s',a'})^{2}\,(\alpha_{0,s',a'}+n_{s',a'}+1)}\,\delta_{s''=s'''}\;-\;\frac{\alpha_{s'',s',a'}\,\alpha_{s''',s',a'}}{(\alpha_{0,s',a'}+n_{s',a'})^{2}\,(\alpha_{0,s',a'}+n_{s',a'}+1)}\,\delta_{s''\neq s'''},$$

with the diagonal scaling as $1/(1+n_{s',a'}/\alpha_{0,s',a'})$, matching the PVV denominator of Eq. ([9](https://arxiv.org/html/2605.21458#S4.E9)) at $\phi=p$.
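The covariance above is the standard Dirichlet covariance evaluated at the posterior parameters. A short sketch, reading $\alpha_{s''}$ as the posterior parameter (prior pseudo-count plus observed count, which is an interpretive assumption) and using made-up counts, shows how the diagonal shrinks as pilot data accumulate.

```python
import numpy as np


def dirichlet_covariance(alpha):
    """Covariance matrix of a Dirichlet(alpha) random vector on the simplex."""
    alpha = np.asarray(alpha, dtype=float)
    total = alpha.sum()
    mean = alpha / total
    # Diagonal: m_i (1 - m_i) / (total + 1); off-diagonal: -m_i m_j / (total + 1).
    return (np.diag(mean) - np.outer(mean, mean)) / (total + 1.0)


prior_alpha = np.array([2.0, 1.0, 1.0])     # hypothetical prior pseudo-counts
counts = np.array([10.0, 5.0, 5.0])         # hypothetical pilot next-state counts
post_alpha = prior_alpha + counts           # Dirichlet conjugate update

cov_prior = dirichlet_covariance(prior_alpha)
cov_post = dirichlet_covariance(post_alpha)

# Rows sum to zero because the probabilities are constrained to sum to one.
assert np.allclose(cov_prior.sum(axis=1), 0.0)
assert np.allclose(cov_post.sum(axis=1), 0.0)
print(cov_prior[0, 0], cov_post[0, 0])
```

Here the counts were chosen to leave the posterior mean unchanged, so the drop in the first diagonal entry (0.05 to 0.01) comes entirely from the larger total $\alpha_{0}+n$.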

##### SIS finite\-difference implementation \(HIV\)\.

The HIV DGP has non-tabular transitions: prevalence at zone $j$ at day $t+1$ is a continuous function of prevalence at zone $j$ and neighboring zones at day $t$ through the SIS dynamics of Eq. ([46](https://arxiv.org/html/2605.21458#A4.E46)). Evaluating $\nabla_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}(s)$ analytically in this non-tabular setting would require linearizing the SIS update about the current prevalence, which is well-defined but more cumbersome than a finite difference. We therefore approximate the gradient by perturbing $\mathrm{prev}_{j}$ by $\delta=0.01$, iterating `spread_disease` for $T_{\mathrm{look}}=15$ days, and computing the change in total discounted yield. This numerical implementation is an approximation of $\nabla_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}$ under the Gonsalves SIS dynamics, not a separate algorithm. Fisher-SEP-T (Corollary [3](https://arxiv.org/html/2605.21458#Thmcorollary3)) is the prescription; the SIS finite difference is its numerical realization in a non-tabular dynamics setting.

### B.5 Post-pilot residual estimator for $\hat{\epsilon}_{r},\hat{\epsilon}_{p}$

The crossover-horizon diagnostic of Remark [2](https://arxiv.org/html/2605.21458#Thmremark2) depends on the ratio $\epsilon_{p}/\epsilon_{r}$. These are not pre-pilot observables; they are distances between the simulator's stated kernels and the kernels of $\mathcal{M}^{\star}_{\mathrm{obs}}$. After an $S$-measurable pilot of length $T_{\mathrm{pilot}}$ with empirical estimates $\hat{\rho}_{\mathrm{pilot}}(s,a)$ and $\hat{\wp}_{\mathrm{pilot}}(\cdot\mid s,a)$ on the pilot-covered subset $\mathcal{I}$, the residuals

$$\begin{aligned}\hat{\epsilon}_{r}(s,a)&:=|\hat{\rho}_{\mathrm{sim}}(s,a)-\hat{\rho}_{\mathrm{pilot}}(s,a)|,\\ \hat{\epsilon}_{p}(s,a)&:=\|\hat{\wp}_{\mathrm{sim}}(\cdot\mid s,a)-\hat{\wp}_{\mathrm{pilot}}(\cdot\mid s,a)\|_{1}\end{aligned}$$

are unbiased estimates (up to finite-sample noise) of $\epsilon_{r}(s,a)$ and $\epsilon_{p}(s,a)$ on $\mathcal{I}$, because Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2) makes the pilot mean an unbiased estimate of $\rho^{\star}_{\mathrm{obs}},\wp^{\star}_{\mathrm{obs}}$. Pooled estimates $\hat{\epsilon}_{r}:=\max_{(s,a)\in\mathcal{I}}\hat{\epsilon}_{r}(s,a)$ and $\hat{\epsilon}_{p}:=\max_{(s,a)\in\mathcal{I}}\hat{\epsilon}_{p}(s,a)$ feed the post-pilot crossover diagnostic $\hat{T}\approx T_{\mathrm{eff}}R_{\max}\hat{\epsilon}_{p}/\hat{\epsilon}_{r}$. The estimator is uninformative off $\mathcal{I}$; under the working assumption that $\hat{\epsilon}_{p}/\hat{\epsilon}_{r}$ on $\mathcal{I}$ is representative of the ratio on uncovered pairs, the post-pilot ratio predicts the A-SOP-to-Fisher-SEP crossover ordering observed in the vending experiment (Section [5.1](https://arxiv.org/html/2605.21458#S5.SS1)).
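The residual estimator and the crossover diagnostic amount to a few array operations. The sketch below is one way to compute them; the function names, the boolean `covered` mask for $\mathcal{I}$, and all numbers are assumptions made for illustration.

```python
import numpy as np


def pilot_residuals(rho_sim, rho_pilot, p_sim, p_pilot, covered):
    """Pooled (max over covered pairs) reward and transition residuals."""
    eps_r = np.abs(rho_sim - rho_pilot)             # (S, K) absolute reward gap
    eps_p = np.abs(p_sim - p_pilot).sum(axis=2)     # (S, K) L1 transition gap
    return eps_r[covered].max(), eps_p[covered].max()


def crossover_horizon(eps_r_hat, eps_p_hat, T_eff, R_max):
    """Post-pilot diagnostic T_hat ~ T_eff * R_max * eps_p_hat / eps_r_hat."""
    return T_eff * R_max * eps_p_hat / eps_r_hat


# Hypothetical 2-state, 2-action example; pair (1, 1) is not covered by the pilot.
rho_sim = np.array([[1.0, 2.0], [3.0, 4.0]])
rho_pilot = np.array([[1.5, 2.0], [3.0, 0.0]])      # entry (1, 1) is a placeholder
p_sim = np.full((2, 2, 2), 0.5)
p_pilot = p_sim.copy()
p_pilot[0, 0] = [0.75, 0.25]
covered = np.array([[True, True], [True, False]])

eps_r_hat, eps_p_hat = pilot_residuals(rho_sim, rho_pilot, p_sim, p_pilot, covered)
T_hat = crossover_horizon(eps_r_hat, eps_p_hat, T_eff=10.0, R_max=1.0)
print(eps_r_hat, eps_p_hat, T_hat)
```

Masking with `covered` before taking the maximum keeps the placeholder at the uncovered pair from contaminating the pooled residual, which mirrors the statement that the estimator is uninformative off $\mathcal{I}$.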

##### Diagnostic vs\. identification\.

The “predicts the crossover” wording in Section [5.1](https://arxiv.org/html/2605.21458#S5.SS1) is a descriptive diagnostic rather than an identification claim. The residual ratio $\hat{\epsilon}_{p}/\hat{\epsilon}_{r}$ is computed from the same trials on which the crossover is observed, so the agreement ($\hat{\epsilon}_{p}/\hat{\epsilon}_{r}\approx 2$–$3$; observed crossover between $T=400$ and $T=800$) is an in-sample consistency check of Remark [2](https://arxiv.org/html/2605.21458#Thmremark2)’s horizon asymmetry, not an out-of-sample prediction. We record three scope clarifications:

1. *In-sample vs. out-of-sample.* The diagnostic is computed post-hoc and reported as corroborative evidence. An identification version would require a held-out pilot protocol: estimate $\hat{\epsilon}_{p}/\hat{\epsilon}_{r}$ from an initial pilot on a subset of units, then predict the crossover horizon for the remaining units. We leave this held-out protocol to future work (it requires a larger trial budget than the 30 common-seed trials used here).
2. *Representativeness off $\mathcal{I}$.* The working assumption that $\hat{\epsilon}_{p}/\hat{\epsilon}_{r}$ on $\mathcal{I}$ equals the ratio on uncovered pairs is unverifiable from the pilot alone. When the uncovered pairs are dominated by $\epsilon^{m}$ (irreducible model residual) and the covered pairs by $\epsilon^{h}$ (identifiable hidden-state error), the two ratios can differ systematically. In the vending DGP the ratio is approximately uniform across pairs by construction, so the extrapolation is defensible; in a real deployment it would need auxiliary justification.
3. *Asymptotic vs. finite-$T$.* The crossover prediction $T\approx T_{\mathrm{eff}}R_{\max}\hat{\epsilon}_{p}/\hat{\epsilon}_{r}$ uses the sup-norm Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1) bound, which is loose by a factor that depends on the spread of errors across $(s,a)$. The diagnostic captures the order of magnitude of the crossover but not its exact value.

This reframing does not change any empirical result; it clarifies the statistical status of the diagnostic\.

## Appendix C Simulation Details

### C.1 Standing assumptions

All theoretical results in the main text and this appendix operate under the following standing assumptions\.

###### Assumption 5\(Standing assumptions\)\.

1. (a) *Finite tabular state and action spaces.* $|\mathbb{S}|=S<\infty$ and $|\mathbb{A}|=K<\infty$. Both the true POMDP and the simulator operate on these spaces, with the simulator’s model restricted to the observed state $\mathbb{S}$.
2. (b) *Bounded rewards.* $|R_{i,t}|\leq R_{\max}<\infty$ almost surely, and the simulator’s reward model $\hat{\rho}_{\mathrm{sim}}$ is likewise bounded by $R_{\max}$.
3. (c) *Stationary marginalized dynamics within the planning horizon.* The true observed-state transition kernel $\wp_{S}(\cdot\mid s,a,h)$ and the reward kernel $\rho(\cdot\mid s,a,h)$ are time-homogeneous over $t\in\{0,\ldots,T\}$. The hidden-state distribution $\mathbb{P}_{t}(h\mid s)$ may drift, but the conditional kernels given $h$ do not.
4. (d) *Fixed simulator.* The simulator $\hat{\mathcal{M}}_{\mathrm{sim}}=(\hat{\rho}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}})$ is frozen during deployment; no retraining is performed on the data collected between $t=0$ and $t=T$. (As Proposition [8](https://arxiv.org/html/2605.21458#Thmproposition8) shows, this is a communication choice rather than a modeling restriction: an updatable-simulator formulation produces identical working models under parts (e)–(f) below.)
5. (e) *Conjugate priors on simulator parameters.* The planner’s belief over the reward parameters $\theta=\{r(s,a)\}$ is Gaussian with known observation variance $\sigma^{2}$; the belief over the transition parameters $\{p(\cdot\mid s,a)\}$ is Dirichlet. The simulator’s predictions $\hat{\rho}_{\mathrm{sim}}(s,a)$ and $\hat{\wp}_{\mathrm{sim}}(\cdot\mid s,a)$ supply the prior means.
6. (f) *Prior independence across state–action pairs.* The priors factorize as $\prod_{(s,a)}\pi_{0}(r(s,a))\cdot\prod_{(s,a)}\pi_{0}(p(\cdot\mid s,a))$. Hierarchical structure (e.g., correlated priors across neighboring states) is not considered; see Remark [15](https://arxiv.org/html/2605.21458#Thmremark15) for discussion.

We first state the formal equivalence between the fixed-simulator framework and an updatable-simulator framework (Proposition 8, Section [C.2](https://arxiv.org/html/2605.21458#A3.SS2)), then provide the complete specification of the vending-machine data-generating process.

### C.2 Equivalence of fixed and updatable simulators

###### Proposition 8 (Fixed simulator with Bayesian overlay $\equiv$ Bayesian-updatable simulator).

Consider two formulations of the planner’s problem:

1. (i) *Fixed simulator + belief state.* The simulator $\hat{\mathcal{M}}_{\mathrm{sim}}=(\hat{\rho}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}})$ is fixed (Assumption [5](https://arxiv.org/html/2605.21458#Thmassumption5)(d)). The planner maintains a separate belief state $\mathcal{B}_{t}=(\{m_{s,a}^{(t)},(\sigma_{s,a}^{(t)})^{2}\},\{\boldsymbol{\alpha}_{s,a}^{(t)}\})$ initialized from the simulator: $m_{s,a}^{(0)}=\hat{\rho}_{\mathrm{sim}}(s,a)$, with $\boldsymbol{\alpha}_{s,a}^{(0)}$ derived from $\hat{\wp}_{\mathrm{sim}}(\cdot\mid s,a)$. After observing reward $r$ at $(s,a)$, the belief updates via the Gaussian conjugate rule; after observing transition $s\to s'$ under action $a$, it updates via the Dirichlet conjugate rule. The planner’s “working model” at time $t$ is the posterior mean MDP $\hat{\mathcal{M}}^{(t)}=(m^{(t)},\hat{p}^{(t)})$.
2. (ii) *Bayesian-updatable simulator.* The simulator is a Bayesian model $\hat{\mathcal{M}}^{(t)}$ that is retrained after each batch of observations. At time $0$, $\hat{\mathcal{M}}^{(0)}=\hat{\mathcal{M}}_{\mathrm{sim}}$. After observing data $\mathcal{D}_{t}$ at time $t$, the simulator is updated to $\hat{\mathcal{M}}^{(t+1)}$ via Bayes’ rule under the same conjugate priors.

Under Assumptions [5](https://arxiv.org/html/2605.21458#Thmassumption5)(e)–(f) (conjugate priors, prior independence), the two formulations produce identical sequences of working models: $\hat{\mathcal{M}}^{(t)}_{\mathrm{(i)}}=\hat{\mathcal{M}}^{(t)}_{\mathrm{(ii)}}$ for all $t$ and all data realizations. Consequently, any policy that is optimal under formulation (i) is also optimal under formulation (ii), and vice versa.

###### Proof\.

Both formulations maintain the same sufficient statistics. Under formulation (i), the belief state at time $t$ is $\mathcal{B}_{t}=(\{m_{s,a}^{(t)},\tau_{s,a}^{(t)}\},\{\boldsymbol{\alpha}_{s,a}^{(t)}\})$, where $\tau_{s,a}^{(t)}=1/(\sigma_{s,a}^{(t)})^{2}$ is the precision. The Gaussian update rule gives $\tau_{s,a}^{(t+1)}=\tau_{s,a}^{(t)}+n_{s,a}^{(t)}/\sigma^{2}$ and $m_{s,a}^{(t+1)}=\bigl(\tau_{s,a}^{(t)}m_{s,a}^{(t)}+(n_{s,a}^{(t)}/\sigma^{2})\,\bar{r}_{s,a}^{(t)}\bigr)/\tau_{s,a}^{(t+1)}$, where $n_{s,a}^{(t)}$ is the number of observations and $\bar{r}_{s,a}^{(t)}$ their mean. The Dirichlet update gives $\boldsymbol{\alpha}_{s,a}^{(t+1)}=\boldsymbol{\alpha}_{s,a}^{(t)}+\mathbf{c}_{s,a}^{(t)}$, where $\mathbf{c}_{s,a}^{(t)}\in\mathbb{N}^{|\mathbb{S}|}$ is the vector of transition counts from $(s,a)$ to next states $s'\in\mathbb{S}$ observed during the interval between $t$ and $t+1$.

Under formulation (ii), the “retrained simulator” at time $t$ is the posterior under the same conjugate model with the same data, producing identical sufficient statistics. Since both formulations start from the same prior ($\hat{\mathcal{M}}_{\mathrm{sim}}$), process the same data ($\mathcal{D}_{1},\ldots,\mathcal{D}_{t}$), and apply the same update rule (Bayes’ rule under conjugate priors), the posterior is identical by the uniqueness of the Bayesian posterior. The working model $\hat{\mathcal{M}}^{(t)}$ (posterior mean rewards and transitions) is a deterministic function of the sufficient statistics, so it is identical under both formulations. ∎

The proposition establishes that Assumption[5](https://arxiv.org/html/2605.21458#Thmassumption5)\(d\) is a*communication choice*, not a modeling restriction\. We separate the simulator \(prior\) from the belief state \(posterior\) because this separation clarifies the theoretical analysis: the SOP optimizes the prior, the A\-SOP is the Bayes\-adaptive policy that uses the simulator as a warm start, and the SEP uses the simulator’s structure to design experiments\. The gap between them is the value of experimentation\. In an “updatable simulator” formulation, the SOP would be the policy that never collects new data \(and therefore never updates\), while the SEP would be the policy that designs experiments to update the simulator optimally—the same distinction, expressed differently\.

The practical implication is that any simulator—including non\-Bayesian ones such as neural networks or physics engines—can be wrapped in a conjugate Bayesian layer that treats the simulator’s outputs as prior means\. The “fixed simulator” then serves as the prior, and the Bayesian overlay provides the update mechanism atO\(1\)O\(1\)cost per observation, without retraining the underlying model\.
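The conjugate overlay described above can be sketched as follows. This is a minimal illustration rather than the reference implementation: the class name `ConjugateOverlay`, its fields, and the known-noise-variance simplification are assumptions made for the example. It applies the Gaussian precision update and the Dirichlet count update from the proof to a single $(s,a)$ pair, with the simulator's output supplied as the prior mean.

```python
from dataclasses import dataclass


@dataclass
class ConjugateOverlay:
    """Hypothetical Normal-Dirichlet belief for one (s, a) pair."""
    mean: float             # m_{s,a}: prior mean reward, taken from the simulator
    precision: float        # tau_{s,a} = 1 / sigma_{s,a}^2
    alpha: list             # Dirichlet pseudo-counts over next states
    noise_var: float = 1.0  # sigma^2, assumed known for this sketch

    def update_reward(self, rewards):
        """Gaussian update: precisions add, means are precision-weighted."""
        n = len(rewards)
        if n == 0:
            return
        r_bar = sum(rewards) / n
        new_precision = self.precision + n / self.noise_var
        self.mean = (self.precision * self.mean
                     + (n / self.noise_var) * r_bar) / new_precision
        self.precision = new_precision

    def update_transitions(self, counts):
        """Dirichlet update: add observed next-state counts to alpha."""
        self.alpha = [a + c for a, c in zip(self.alpha, counts)]

    def transition_mean(self):
        """Posterior mean transition probabilities."""
        total = sum(self.alpha)
        return [a / total for a in self.alpha]
```

Because only the sufficient statistics are stored, processing a batch at once or one observation at a time yields the same posterior, which is the property the proof relies on.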

### C.3 Data-generating process

This subsection provides the complete specification of the data\-generating process for the vending\-machine simulation of Section[5\.1](https://arxiv.org/html/2605.21458#S5.SS1)\. All numerical values match the reference implementation \(vending\_v3\.py\)\. The main text describes the setup at a conceptual level; the full technical specification follows\.

### C.4 Confounding and drift

Each machine $i$ has a hidden demand multiplier $h_{i,t}\in\mathbb{R}_{>0}$ evolving as a random walk:

$$h_{i,t+1}=h_{i,t}+d_{i}+\zeta_{i,t},\qquad\zeta_{i,t}\sim\mathcal{N}(0,\sigma_{h,i}^{2}),\tag{38}$$

where $d_{i}$ is a deterministic drift and $\sigma_{h,i}$ controls the stochastic component. The true demand rate at machine $i$ on day $t$ is $\lambda_{\mathrm{true}}(i,t)=\lambda_{\mathrm{base}}(i)\cdot h_{i,t}$. For Downtown, Suburb, and Office Park, $d_{i}=0$ and $\sigma_{h,i}=0.01$: the hidden state is essentially constant. For University, $d_{i}=0.008$/day and $\sigma_{h,i}=0.02$: a new dormitory steadily increases student demand. For New Neighborhood, $d_{i}=0.006$/day and $\sigma_{h,i}=0.02$: gentrification steadily increases family demand.
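A minimal sketch of the random walk in Eq. (38) is given below. The function name `simulate_hidden_multiplier`, the initial value `h0`, and the seeding scheme are assumptions for illustration; the source does not state the initial multiplier or whether the path is clipped to stay positive, so no clipping is applied here.

```python
import random


def simulate_hidden_multiplier(h0, drift, sigma, days, seed=0):
    """Simulate h_{t+1} = h_t + d + zeta_t with zeta_t ~ N(0, sigma^2).

    h0 is an assumed starting value; the path is returned including day 0.
    """
    rng = random.Random(seed)
    path = [h0]
    for _ in range(days):
        path.append(path[-1] + drift + rng.gauss(0.0, sigma))
    return path


# University-style parameters from the text: d = 0.008/day, sigma_h = 0.02.
university_path = simulate_hidden_multiplier(1.0, 0.008, 0.02, days=180)
```

With the noise switched off (`sigma=0`), the drift alone adds $180\times 0.008=1.44$ to the multiplier over 180 days, which shows how the steady drift comes to dominate the small daily noise.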

The simulator was calibrated at $t_{0}=-180$ from observational data collected by a historical operator who observed $h$ via local knowledge and stocked proportionally, creating the dependence $A\not\perp H\mid S$ that defines confounding. The simulator's demand estimate includes a confounding factor $c_{i}$: for University $c_{i}=0.47$, for New Neighborhood $c_{i}=0.39$, and for the remaining machines $c_{i}\approx 1.0$. After 180 days of drift, the simulator underestimates demand at University and New Neighborhood by a factor of $4$–$5\times$.

### C.5 Demand model

For each machine $i$, customer segment $s$, and day $t$, the arrival rate is:

$$\begin{aligned}\lambda_{i,s}(t)&=\lambda_{i,s}^{\mathrm{base}}\times h_{i,t}\times w_{s}(t)\times(1+0.40\sin(2\pi t/365))\times(1+\beta_{i}t)\\&\quad\times f_{\mathrm{tod}}(s,t)\times\mathrm{Comp}_{i}(t)\times\mathrm{Shock}_{i}(t)\times(1+\eta_{g(i)}(t)),\end{aligned}\tag{39}$$

where:

- $h_{i,t}$: hidden-state multiplier from Eq. ([38](https://arxiv.org/html/2605.21458#A3.E38)) (Section [C.4](https://arxiv.org/html/2605.21458#A3.SS4) above).
- $w_{s}(t)$: weekday/weekend multiplier ($w_{s}^{\mathrm{weekday}}$ if $t\bmod 7<5$, else $w_{s}^{\mathrm{weekend}}$).
- $1+0.40\sin(2\pi t/365)$: 365-day seasonal cycle with $\pm 40\%$ amplitude, calibrated from monthly sales data [Singh, [2022](https://arxiv.org/html/2605.21458#bib.bib135)].
- $1+\beta_{i}t$: linear trend. Downtown $\beta=-0.01$; all others $\beta=0$.
- $f_{\mathrm{tod}}(s,t)=3\cdot p_{s}^{\mathrm{tod}}(t\bmod 3)$: time-of-day weighting (morning/afternoon/evening).
- $\mathrm{Comp}_{i}(t)=1-\delta_{i}\cdot\mathbf{1}[t\geq 60]$: competitor entry. Office Park $\delta=0.25$; others $\delta=0$.
- $\mathrm{Shock}_{i}(t)$: multiplicative shock. Downtown $m=0.45$ for 30 days starting day 50 (sustained construction); New Neighborhood $m=2.5$ on day 35 (single-day festival); others $m=1$.
- $\eta_{g(i)}(t)=0.08\cdot Z_{g,t}$, $Z_{g,t}\sim\mathcal{N}(0,1)$: correlated geographic group shock.

The number of arriving customers is $N_{i,s}(t)\sim\mathrm{Poisson}(\max(0.01,\lambda_{i,s}(t)))$.
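The multiplicative structure of Eq. (39) can be sketched as a single function. The name `arrival_rate`, the keyword arguments, and their neutral defaults are assumptions for illustration; the Poisson draw itself is omitted, and the function returns only the floored Poisson mean $\max(0.01,\lambda_{i,s}(t))$.

```python
import math


def arrival_rate(base, h, t, *, w_weekday=1.0, w_weekend=1.0, beta=0.0,
                 tod_probs=(1 / 3, 1 / 3, 1 / 3), delta=0.0, shock=1.0,
                 eta=0.0):
    """Floored Poisson mean for one (machine, segment) on day t.

    Every keyword defaults to a neutral value, so each multiplier in
    Eq. (39) can be switched on separately.
    """
    weekly = w_weekday if t % 7 < 5 else w_weekend
    seasonal = 1.0 + 0.40 * math.sin(2.0 * math.pi * t / 365.0)
    trend = 1.0 + beta * t
    f_tod = 3.0 * tod_probs[t % 3]
    comp = 1.0 - delta * (1.0 if t >= 60 else 0.0)
    rate = (base * h * weekly * seasonal * trend
            * f_tod * comp * shock * (1.0 + eta))
    return max(0.01, rate)
```

The floor matters for the Downtown trend: with $\beta=-0.01$ the factor $1+\beta t$ reaches zero at $t=100$ and is negative afterwards, so the rate is held at $0.01$ rather than becoming an invalid Poisson mean.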

### C.6 Weather

The weather process is AR(1): $W_{t}=0.7\,W_{t-1}+0.3\,Z_{t}$, $Z_{t}\sim\mathcal{N}(0,1)$, clipped to $[-1.5,1.5]$. Weather modulates willingness-to-pay: $\mathrm{WTP}_{\mathrm{eff}}=\mathrm{WTP}\cdot(1+w_{p}^{\mathrm{weather}}\cdot W_{t})$, where $w_{p}^{\mathrm{weather}}$ is product-specific (soda: 0.4, energy: 0.1, snack: 0.05).
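The weather recursion and its effect on willingness-to-pay can be sketched as below. The function names, the initial value `w0`, and the choice to feed the clipped value back into the recursion are assumptions; the source specifies only the AR(1) coefficients, the clipping interval, and the product sensitivities.

```python
import random


def simulate_weather(days, w0=0.0, seed=0):
    """AR(1) weather: W_t = 0.7 * W_{t-1} + 0.3 * Z_t, clipped to [-1.5, 1.5]."""
    rng = random.Random(seed)
    w, path = w0, []
    for _ in range(days):
        w = 0.7 * w + 0.3 * rng.gauss(0.0, 1.0)
        w = max(-1.5, min(1.5, w))  # assumed: the clipped value is carried forward
        path.append(w)
    return path


def effective_wtp(wtp, weather, sensitivity):
    """WTP_eff = WTP * (1 + w_p^weather * W_t)."""
    return wtp * (1.0 + sensitivity * weather)
```

For example, at the upper clip $W_t=1.5$ a soda customer (sensitivity 0.4) has an effective WTP 60% above baseline, while a snack customer (sensitivity 0.05) moves by only 7.5%.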

### C.7 Purchase mechanics

When a customer arrives, they draw a product from their segment's preference distribution and a WTP from $\mathcal{N}(\text{mean WTP},\text{std}^{2})$. The effective WTP is adjusted for weather. If the posted price exceeds the effective WTP, no sale occurs. If the preferred product is out of stock, the customer substitutes with their segment's substitution probability, drawing an alternative from the remaining preference distribution (re-normalized). The substitute is purchased only if in stock and priced below $1.1\times\mathrm{WTP}$.

### C.8 Logistics

- *Depot → VM*: same-day delivery. Constrained by depot stock and VM capacity. Cost: \$0.05/unit.
- *Warehouse → VM (direct)*: lead time $\max(1,\mathrm{round}(\mathcal{N}(2.0,0.7^{2})))$ days. Cost: wholesale + \$0.15/unit. 5% chance of full delay (one extra day); 8% chance of partial delivery (70% arrives, rest next day).
- *Warehouse → Depot*: lead time $\max(1,\mathrm{round}(\mathcal{N}(1.0,0.3^{2})))$ days. Cost: wholesale + \$0.08/unit.
- *Spoilage*: units exceeding shelf life are discarded daily. Planner pays wholesale cost.
- *Holding*: \$0.01/unit/day (soda), \$0.02 (energy), \$0.015 (snack).
- *Breakdowns*: 2% daily probability per machine, causing one day of zero sales.

### C.9 Fallback triggers

Two automatic safety mechanisms fire after demand realization each day:

1. *VM emergency*: when any product hits 0 stock, the depot ships $\min(3,\text{depot stock})$ units (same-day).
2. *Depot emergency*: when any product drops to $\leq 4$ units, the depot orders 10 units from the warehouse.
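The two triggers can be sketched as a single end-of-day routine. The function name `apply_fallbacks`, the dictionary representation of stock, and the ordering (VM emergency first, then depot emergency on the post-shipment depot level) are assumptions; the source states only the thresholds and quantities.

```python
def apply_fallbacks(vm_stock, depot_stock):
    """Apply both safety triggers in place; return warehouse orders by product.

    vm_stock and depot_stock map product name -> units on hand.
    """
    # VM emergency: a stocked-out product receives min(3, depot stock) units.
    for product, units in vm_stock.items():
        if units == 0:
            shipped = min(3, depot_stock.get(product, 0))
            vm_stock[product] += shipped
            depot_stock[product] = depot_stock.get(product, 0) - shipped

    # Depot emergency: any product at or below 4 units triggers a 10-unit order.
    warehouse_orders = {}
    for product, units in depot_stock.items():
        if units <= 4:
            warehouse_orders[product] = 10
    return warehouse_orders
```

Under this ordering a VM emergency can itself cause a depot emergency on the same day, as in the case of a depot holding 6 units that ships 3 and is left with 3.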

### C.10 Cash flow

Starting cash: \$2,000. Daily profit: $\Pi_{t}=\text{Revenue}_{t}-(\text{Shipping}_{t}+\text{Wholesale}_{t}+\text{Holding}_{t}+\text{Spoilage}_{t}+5\times\$2.00)$. Cash updates as $\text{Cash}_{t+1}=\text{Cash}_{t}+\Pi_{t}$. Bankruptcy occurs if cash is negative for 5 consecutive days.
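The cash recursion and the bankruptcy rule can be sketched as follows. The function names and the convention that the negative-cash streak is checked after each day's update are assumptions made for the example.

```python
FIXED_DAILY_COST = 5 * 2.00  # the 5 x $2.00 term in the daily-profit formula


def daily_profit(revenue, shipping, wholesale, holding, spoilage):
    """Pi_t = revenue minus variable costs and the fixed daily charge."""
    return revenue - (shipping + wholesale + holding + spoilage
                      + FIXED_DAILY_COST)


def simulate_cash(profits, start_cash=2000.0, bankruptcy_days=5):
    """Return (final cash, bankruptcy day index or None).

    Bankruptcy is declared once cash has been negative on
    `bankruptcy_days` consecutive days (checked after each update).
    """
    cash, negative_streak = start_cash, 0
    for day, profit in enumerate(profits):
        cash += profit
        negative_streak = negative_streak + 1 if cash < 0 else 0
        if negative_streak >= bankruptcy_days:
            return cash, day
    return cash, None
```

A single profitable day that lifts cash back to zero or above resets the streak, so a planner can run briefly negative during an exploration phase without triggering bankruptcy.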

### C.11 Observation model (censoring)

The planner observes $\min(\text{demand},\text{stock})$, not demand. When stock remains positive after sales, the observation is uncensored. When stock hits zero, the observation is censored: the planner knows demand $\geq$ stock but not by how much.
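The censoring rule can be written as a small helper. The name `observe_sales` and the tuple return value are illustrative choices, not taken from the reference implementation.

```python
def observe_sales(demand, stock):
    """Return (observed sales, censored flag) for one product-day.

    The observation is censored whenever sales exhaust the stock,
    including the boundary case demand == stock.
    """
    sales = min(demand, stock)
    censored = (stock - sales) == 0
    return sales, censored
```

Note the boundary case: when demand exactly equals stock, the shelf is empty after sales, so the planner cannot distinguish it from excess demand and the observation counts as censored.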

### C.12 SEP belief model

The SEP maintains a running average of the most recent 10 uncensored sales observations per \(machine, product\) pair\. Initial beliefs equal the simulator’s demand rates\. A structural break is detected when the 7\-day rolling mean deviates from the preceding 7\-day mean by more than 80% \(and the preceding mean exceeds 1\.5 units/day\)\. Upon detection, the belief resets to the new mean and a one\-day re\-exploration phase triggers\.
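A minimal sketch of this belief model is shown below. The class `DemandBelief`, the function `structural_break`, and the choice to pass the break detector a plain list of daily sales are assumptions; the source does not say whether censored days enter the 7-day rolling means.

```python
from collections import deque


class DemandBelief:
    """Running average of the most recent uncensored sales observations."""

    def __init__(self, simulator_rate, window=10):
        self.recent = deque(maxlen=window)
        self.estimate = simulator_rate  # initial belief = simulator demand rate

    def observe(self, sales, censored):
        if censored:
            return  # censored observations are not averaged in
        self.recent.append(sales)
        self.estimate = sum(self.recent) / len(self.recent)


def structural_break(daily_sales, span=7, rel_change=0.80, min_level=1.5):
    """Flag a break when the latest `span`-day mean deviates from the
    preceding `span`-day mean by more than `rel_change`, provided the
    preceding mean exceeds `min_level` units/day."""
    if len(daily_sales) < 2 * span:
        return False
    previous = sum(daily_sales[-2 * span:-span]) / span
    current = sum(daily_sales[-span:]) / span
    return previous > min_level and abs(current - previous) > rel_change * previous
```

The `min_level` guard keeps low-volume pairs from triggering resets: a jump from 1 to 5 units/day is a 400% change but is ignored because the preceding mean is below 1.5.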

### C.13 Configuration

Table [A2](https://arxiv.org/html/2605.21458#A3.T2) provides the complete parameterization.

Table A2: Vending machine network: base demand rates, hidden-state parameters, events, and pilot detection rates. Bold entries indicate substantial simulator error. The "Pilot detection" column shows the fraction of 30 trials in which the 5-day pilot's hypothesis test flags the machine as misspecified.

The pilot detection rates illustrate the difficulty of identifying misspecified machines from only 5 days of data. University (13%) and New Neighborhood (17%) are detected at modest rates despite exhibiting the largest simulator errors, because the 5-day pilot collects limited observations and the Bonferroni correction across 15 tests ($\alpha_{\mathrm{Bonf}}=0.003$) is conservative. For larger problems with many state–action pairs, a less conservative FDR-controlling procedure such as Benjamini–Hochberg would improve detection power at the cost of additional false positives. Suburb is flagged most frequently (37%) despite being correctly specified, a false positive driven by weekend demand variability. In 33% of trials, no machine is flagged, and the SEP proceeds directly to exploitation at day 5, bypassing the exploration phase. Despite the low detection rates, the pilot-to-policy protocol achieves 72% of the oracle at $T=400$ (Table [A3](https://arxiv.org/html/2605.21458#A3.T3), A-SOP + pilot row), matching the performance of an oracle-targeted SEP, because continuous learning during exploitation eventually identifies the misspecified machines regardless of the pilot's outcome.
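To make the Bonferroni versus Benjamini–Hochberg trade-off concrete, the sketch below applies both procedures to the same set of 15 p-values. The function names and the p-values are hypothetical and chosen only for illustration; they are not the pilot's actual test results.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (family-wise error control)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up BH procedure: find the largest rank k with
    p_(k) <= alpha * k / m and reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[i] = True
    return reject


# Hypothetical p-values for 15 tests (illustration only).
example_p = [0.001, 0.004, 0.006, 0.009] + [0.5] * 11
n_bonferroni = sum(bonferroni_reject(example_p))
n_bh = sum(benjamini_hochberg_reject(example_p))
```

On these illustrative inputs the Bonferroni threshold is $0.05/15\approx 0.0033$, so only the smallest p-value is rejected, whereas the step-up thresholds of Benjamini–Hochberg admit the four smallest.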

### C.14 Multi-horizon results

Table A3: Multi-horizon results (% of oracle cash, 30 trials, ± 95% CI). Best non-oracle per horizon in bold. L = learns, E = explores, D = directed. Columns 100 to 1600 give the horizon $T$ in days.

| Policy | L | E | D | 100 | 200 | 400 | 800 | 1600 |
|---|---|---|---|---|---|---|---|---|
| Oracle | | | | 100 | 100 | 100 | 100 | 100 |
| SOP | ✗ | ✗ | | 89.0 ± 0.9 | 78.8 ± 1.2 | 66.8 ± 1.6 | 48.6 ± 2.1 | 37.7 ± 1.9 |
| $\epsilon$-greedy | ✗ | ✓ | ✗ | 75.8 ± 1.5 | 65.0 ± 1.6 | 48.2 ± 2.1 | 39.2 ± 2.7 | 32.8 ± 2.8 |
| A-SOP | ✓ | ✗ | | **89.0** ± 0.9 | **82.2** ± 1.3 | **75.6** ± 2.0 | 66.8 ± 3.3 | 70.7 ± 4.1 |
| A-SOP + pilot | ✓ | ✗ | | 80.3 ± 1.3 | 76.2 ± 1.5 | 71.7 ± 1.8 | 66.6 ± 3.5 | 71.5 ± 4.8 |
| L-$\epsilon$ (fixed) | ✓ | ✓ | ✗ | 76.5 ± 1.8 | 68.6 ± 1.9 | 56.6 ± 2.3 | 55.3 ± 4.3 | 62.8 ± 5.6 |
| L-$\epsilon$ (adaptive) | ✓ | ✓ | ✗ | 75.5 ± 1.7 | 69.6 ± 1.6 | 62.7 ± 2.0 | 61.3 ± 3.3 | 67.2 ± 4.7 |
| KG-SEP | ✓ | ✓ | ✓ | 80.4 ± 0.9 | 76.9 ± 1.1 | 72.7 ± 1.5 | 67.5 ± 3.0 | 71.8 ± 4.4 |
| SEP | ✓ | ✓ | ✓ | 77.2 ± 1.8 | 73.3 ± 1.7 | 66.6 ± 2.2 | 66.1 ± 4.2 | 73.8 ± 5.8 |
| Fisher-SEP-R | ✓ | ✓ | ✓ | 76.6 ± 1.7 | 73.2 ± 2.5 | 68.5 ± 3.7 | 68.8 ± 4.8 | **75.3** ± 6.5 |

### C.15 Confounding-and-drift ablation

To isolate the role of the two hidden-state error sources in driving the SEP's advantage over the SOP, we ran the vending DGP under three conditions at $T=400$ with 30 trials each: (a) no confounding and no drift ($c_{i}=1.0$ for all machines, drift $d_{i}=0$), (b) confounding only ($c_{i}<1$ at University and New Neighborhood as in the full DGP, but drift $d_{i}=0$), and (c) the full DGP with both confounding and drift. Table [A4](https://arxiv.org/html/2605.21458#A3.T4) reports the % of oracle cash achieved by each policy under each condition.

Table A4: Confounding-and-drift ablation at $T=400$, 30 trials. Values are % of oracle cash.

Three observations follow from the table. The SOP's performance drops from 92% (no bias) to 67% (full DGP) as hidden-state errors are introduced, confirming that the structured bias is what the SOP cannot correct. The A-SOP (passive Bayesian updating) stays within 76–86% across conditions: passive learning absorbs the confounding-only bias (86%) essentially as well as the no-bias case (85%), but the added drift partially defeats it (76%), consistent with the extended simulation lemma's prediction that drift in the transition marginalization is an irreducible error source for passively-updated policies. The A-SOP leads the SEP under all three conditions at $T=400$, but by a varying margin: with no bias, the A-SOP dominates (85% vs 67%) because the SEP's 15-day exploration cost is unnecessary; with confounding only, the A-SOP still leads (86% vs 66%); under the full DGP, the A-SOP's lead narrows (76% vs 67%) because drift forces continuous rather than front-loaded learning. At $T=1600$ (main Table [A3](https://arxiv.org/html/2605.21458#A3.T3)), this ordering is reversed and designed exploration wins. The crossover horizon is a function of the drift rate and the exploration cost.

### C.16 Comparison with UCRL2 and UCBVI (simulator-free tabular baselines)

A natural question is how the simulator-informed policies compare to simulator-free, frequentist-optimism baselines that learn entirely from online interaction. We implemented UCRL2 [Jaksch et al., [2010](https://arxiv.org/html/2605.21458#bib.bib93)] and UCBVI [Azar et al., [2017](https://arxiv.org/html/2605.21458#bib.bib95)] for both case studies using a *factored* tabular abstraction: one algorithm instance per unit of decentralization (per machine in vending; per team in HIV). Both algorithms re-solve their optimistic MDP either once per day (UCBVI) or on a slower cadence (UCRL2, whose extended value iteration is more expensive). Per-instance state and action spaces are small (binarized stock and stocking action, $|\mathbb{S}||\mathbb{A}|=4$, per machine in vending; zone index and 5-way move, $|\mathbb{S}||\mathbb{A}|=200$, per team in HIV), so per-step compute is manageable. These baselines have no access to the simulator and must learn from scratch.

Tables [A5](https://arxiv.org/html/2605.21458#A3.T5) and [A6](https://arxiv.org/html/2605.21458#A3.T6) report the results at 30 trials per condition.

Table A5: UCRL2 and UCBVI on the vending DGP, % of oracle cash, mean ± 95% CI half-width. All policies at 30 trials on the common seed schedule; Fisher-SEP here is the Def.-3 Fisher-trace variant (reward-dominates Corollary [2](https://arxiv.org/html/2605.21458#Thmcorollary2)) run by regenerate_all_figures.run_v3_experiment_for_figures. Thompson Sampling (PSRL) at 30 trials.

Table A6: UCRL2 and UCBVI on the HIV mobile-testing DGP, % of oracle cases found, mean ± 95% CI half-width. Baselines at 30 trials; the Fisher-SEP row uses the A-optimal PVV algorithm (Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5)) with 30 trials at $T\in\{50,100,200,300,400\}$ on the common-seed schedule; Thompson Sampling (PSRL) at 30 trials on the same schedule.

##### What the tabular baselines tell us.

The two case studies produce complementary pictures of simulator\-free exploration\.

On the vending DGP (Table [A5](https://arxiv.org/html/2605.21458#A3.T5)), UCRL2 and UCBVI both approximately match the A-SOP / SEP / Fisher-SEP cluster at $T=1600$ (70–74% of oracle) and handily beat the SOP (38%). This is the expected outcome: UCRL2 and UCBVI start from scratch but eventually recover a reasonable coarse stocking policy from per-machine observed sales, and by $T=1600$ they have enough data that the absence of a simulator prior costs little. The remaining gap to the simulator-informed policies comes from the coarser $2\times 2$ factored discretization rather than the exploration strategy itself; the CIs overlap.

On the HIV DGP (Table [A6](https://arxiv.org/html/2605.21458#A3.T6)), the picture is sharper and is the one that speaks directly to the paper's thesis. UCRL2 and UCBVI do cross the corridor and visit Region B in 80–87% of trials (mean 5–7 corridor crossings, 46–55 Region-B team-days per trial at $T=400$), confirming that the tabular bonus structure does eventually push exploration across the wall. But they plateau at SOP-level performance (48–50% of oracle at $T=400$) and do not match the simulator-informed policies within 400 days. The reason is exactly the mechanism the paper isolates: Region A already yields $\sim$50% of oracle on its own, and corridor crossings through cold Region B return only 20% of warm yield for the first three days, so the cold-start cost amortizes slowly. SEP and Fisher-SEP use the simulator as a prior that encodes the Region-B geometry, front-loading corridor crossings in the first 25 days; UCRL2 and UCBVI, with no such prior, spend most of their exploration budget locally refining Region-A estimates. At $T\leq 400$ the catch-up is incomplete. This directly illustrates the reachability-gap component of Theorem [1](https://arxiv.org/html/2605.21458#Thmproposition1): the simulator's value is not that it is accurate (it underestimates Region-B prevalence by $15\times$) but that its structural prior identifies where to experiment first.

##### Caveats\.

(i) UCRL2 and UCBVI both required a forced-exploration-of-unvisited-actions wrapper in the HIV implementation; without it, argmax tie-breaking over equal initial Q-values locked every team at its start zone. This is an implementation issue, not a regret-analysis issue. (ii) UCRL2's wall-clock on the vending DGP is dominated by its extended value iteration inner loop (614s of 943s total at 30 trials, $T=1600$); on the HIV DGP the 40-state MDP is small enough that UCRL2 replans on a 25-day cadence with no observable behavior change. (iii) Thirty-trial marginal CIs are wide at long horizons ($\pm 5$–$7$ pp for tabular baselines in vending). We use the common seed schedule to compute paired-difference statistics (Appendix [C.17](https://arxiv.org/html/2605.21458#A3.SS17)); paired Wilcoxon tests establish the headline orderings (A-SOP $>$ SOP at $T\geq 200$ in vending; SEP $>$ A-SOP at every horizon in HIV; Fisher-SEP $>$ A-SOP at $T=1600$ in vending; Fisher-SEP $>$ SEP at $T\geq 300$ in HIV) at $p<0.01$, despite the marginal CIs overlapping.

### C.17 Paired-difference analysis

Table [A7](https://arxiv.org/html/2605.21458#A3.T7) reports paired-difference statistics for every headline ordering claimed in the main text. All comparisons are on the same 30-trial common seed schedule: policies within the same cache ran on identical DGP realisations, so the paired-difference estimator cancels the across-seed variance that dominates the marginal CIs. We report the paired-$t$ 95% CI and a one-sided Wilcoxon signed-rank $p$-value (alternative: mean of A $-$ B is strictly positive). Checkmarks mark rows with $p<0.05$.
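The variance-cancellation argument can be illustrated with a small stdlib-only sketch. The helper names and the synthetic outcomes below are assumptions made for the example (they are not the paper's data), and the Wilcoxon signed-rank test is not implemented here; only the paired-$t$ interval is shown, using the standard two-sided 95% $t$ quantile for 29 degrees of freedom.

```python
import math
import statistics

T_CRIT_DF29 = 2.045  # two-sided 95% t quantile, 29 degrees of freedom


def mean_ci(values, t_crit):
    """Return (mean, 95% CI half-width) from an ordinary t interval."""
    half = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values), half


def paired_ci(a, b, t_crit):
    """Paired-t interval: the same formula applied to per-seed differences."""
    return mean_ci([x - y for x, y in zip(a, b)], t_crit)


# Synthetic example: a shared per-seed effect plus a constant 3-point gap.
seed_effect = [float(i) for i in range(30)]
policy_a = [73.0 + s for s in seed_effect]
policy_b = [70.0 + s for s in seed_effect]

marginal_mean, marginal_half = mean_ci(policy_a, T_CRIT_DF29)
paired_mean, paired_half = paired_ci(policy_a, policy_b, T_CRIT_DF29)
```

In this deliberately extreme example the seed effect is identical across the two policies, so the paired half-width collapses to zero while each marginal interval stays several points wide. Real trials retain some policy-specific noise, but the same mechanism explains why paired tests can separate policies whose marginal CIs overlap.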

Table A7: Paired-difference statistics for every headline ordering. Values in percentage points of oracle. $n=30$ paired trials. ✓ marks pairs for which the one-sided Wilcoxon $p$-value $<0.05$.

The paired analysis confirms three qualitative patterns noted in §5.1:

- A-SOP dominates short horizons. The paired gap A-SOP $-$ SOP is near zero at $T=100$ ($+0.1$ pp, $p=0.52$), significant at $T=200$ ($+3.4$ pp, $p<10^{-3}$), and grows to $+33$ pp at $T=1600$ ($p<10^{-3}$).
- Fisher-SEP's crossover is significant at $T=1600$. Fisher-SEP $-$ A-SOP is $-12.6$ at $T=100$ (exploration cost), crosses zero around $T=800$ ($+1.2$, $p=0.23$), and is $+4.8$ at $T=1600$ with $p=0.005$. The 30-trial sample is sufficient to establish the crossover statistically once paired, even though the marginal CIs overlap.
- HIV reachability gap is large and significant at every horizon. SEP $-$ A-SOP ranges from $+27$ to $+39$ pp across $T\in\{50,100,200,300,400\}$, $p<10^{-3}$ at every horizon. Fisher-SEP overtakes SEP significantly at $T\geq 300$ ($p<0.01$).

KG-SEP $-$ A-SOP is not significantly positive at any vending horizon in the paired analysis; KG-SEP's per-state greedy criterion is insufficient to unlock the gap that Fisher-SEP's stochastic A-optimal design captures. This is consistent with the class-level gap $W_{1^{\prime}}\leq W_{3^{\prime}}$ in Theorem [7](https://arxiv.org/html/2605.21458#Thmtheorem7) and with the qualitative argument that a per-state myopic criterion cannot plan multi-step trajectories.

##### Fisher\-SEP paired analysis\.

Table [A8](https://arxiv.org/html/2605.21458#A3.T8) reports paired-difference statistics for Fisher-SEP (Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5)) against SEP, A-SOP, and Thompson sampling (PSRL) on the common seed schedule. The key findings:

- On HIV, Fisher-SEP beats A-SOP by $+27.1$ pp at $T=400$ with $p<10^{-3}$; Fisher-SEP vs SEP is statistically tied at $T\geq 200$ (mean difference $\in[-0.6,+2.5]$, $p>0.1$). Random exploration (SEP) is already competitive on HIV because the reachability gap does not require fine-grained allocation: any policy that reliably sends teams to Region B captures most of the available value.
- Thompson sampling (PSRL) partly closes the HIV gap on its own, beating A-SOP by $+18.8$ pp at $T=400$ ($p<10^{-3}$). Occasional high posterior draws on Region-B prevalence push teams across the corridor and yield implicit exploration. Fisher-SEP still beats TS by $+8.3$ pp at $T=400$ ($p=0.003$) and by $+20.6$ pp at $T=100$: directed exploration that concentrates effort on the Region-B cluster outperforms unstructured posterior sampling, and the gap closes only slowly as the horizon grows.
- On vending, Fisher-SEP vs A-SOP crosses zero at long horizons, consistent with the main-text crossover result (Table [A7](https://arxiv.org/html/2605.21458#A3.T7)). Fisher-SEP-R beats Thompson by $+3.9$ pp at $T=1600$ ($p=0.008$); TS is statistically tied with A-SOP at every vending horizon (paired $|\text{mean}|\leq 1.5$ pp, $p\geq 0.16$).

Table A8: Fisher-SEP (Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5)) paired-difference statistics against A-SOP, SEP, and Thompson sampling (PSRL). $n=30$ paired trials on the common seed schedule. Values in percentage points of oracle. ✓ marks pairs for which the one-sided Wilcoxon $p$-value $<0.05$. The HIV rows here use the v3 cache (code/results/fisher_v3/hiv_v3.{npz,csv}) and supersede the corresponding HIV rows in Table [A7](https://arxiv.org/html/2605.21458#A3.T7), which were generated from an earlier SEP run that pre-dates the v3 cache; mean differences agree to within $\pm 2.6$ pp across the two runs.

### C.18 Hierarchical-prior sensitivity: how far does the reachability gap survive correlated priors?

Assumption [5](https://arxiv.org/html/2605.21458#Thmassumption5)(f) factorizes the prior across $(s,a)$ pairs. Realistic deployments have structural correlation: HIV prevalence is smooth across adjacent zones, and vending demand is correlated across neighborhoods. Under a hierarchical prior, visiting one pair reduces posterior uncertainty at other pairs through the kernel, and Proposition [4](https://arxiv.org/html/2605.21458#Thmremark4)'s mechanism is quantitatively weaker: the A-SOP can partly close the reachability gap because Region-A observations inform the Region-B posterior through the prior. This subsection bounds how much the reachability gap shrinks under a Gaussian-process prior with kernel $k$.

##### Setup\.

Replace Assumption [5](https://arxiv.org/html/2605.21458#Thmassumption5)(f) with a Gaussian-process prior on the reward parameters: $r\sim\mathcal{GP}(\mu_{0},k)$ with mean $\mu_{0}$ given by the simulator and kernel $k(s,s^{\prime})$. Let $K\in\mathbb{R}^{|\mathbb{S}|\times|\mathbb{S}|}$ be the Gram matrix restricted to the relevant $|\mathbb{S}|$ states, with spectral decomposition $K=U\Lambda U^{\top}$. Denote by $|\mathbb{S}|_{V}:=|\mathrm{supp}(d_{\pi^{\mathrm{dep}}})|$ the number of deployed-policy-visited states and by $|\mathbb{S}|_{U}:=|\mathbb{S}|-|\mathbb{S}|_{V}$ the number of unvisited states.

###### Proposition 9 (Reachability gap under hierarchical prior).

Under the GP prior with kernel $k$, let $K_{UV}\in\mathbb{R}^{|\mathbb{S}|_{U}\times|\mathbb{S}|_{V}}$ be the off-diagonal block of $K$ (visited states coupled to unvisited), and let $K_{UU}\in\mathbb{R}^{|\mathbb{S}|_{U}\times|\mathbb{S}|_{U}}$ be the unvisited-states block. After the A-SOP collects $n$ observations per visited state, the posterior covariance on the unvisited states reduces to

$$\Sigma_{U}^{\mathrm{post}}(n)\;=\;K_{UU}-K_{UV}\,(K_{VV}+\sigma^{2}/n\cdot I)^{-1}K_{VU}.\tag{40}$$

Letting $\mathcal{G}_{\mathrm{reach}}^{\mathrm{hier}}$ denote the reachability gap under the hierarchical prior, there exist constants $c_{1},c_{2}>0$ (depending on the kernel and on $R_{\max},\gamma$) such that

$$\mathcal{G}_{\mathrm{reach}}^{\mathrm{hier}}\;\leq\;c_{1}\cdot\mathrm{trace}(\Sigma_{U}^{\mathrm{post}}(n))^{1/2}+c_{2}\cdot\epsilon^{m}_{\mathrm{hier}},\tag{41}$$

where $\epsilon^{m}_{\mathrm{hier}}$ is the residual model error that the kernel cannot close.

###### Proof\.

The reachability gap is proportional to the value-function variance at unvisited states. Under the GP prior, posterior variance at unvisited states is given by the Schur complement ([40](https://arxiv.org/html/2605.21458#A3.E40)). The $c_{1}$ constant comes from the value-function delta-method with Bellman resolvent; the residual $c_{2}\,\epsilon^{m}_{\mathrm{hier}}$ captures the kernel's inability to interpolate across model-specification mismatches. ∎
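As a hedged illustration of Eq. (40), consider a hypothetical two-state case with one visited state $V$ and one unvisited state $U$, equal prior variances $k_{VV}=k_{UU}=\kappa$, and prior correlation $\rho$, so that $k_{UV}=\rho\kappa$ (the symbols $\kappa$ and $\rho$ are introduced here for the example only). The Schur complement then reduces to a scalar:

$$\Sigma_{U}^{\mathrm{post}}(n)\;=\;\kappa-\frac{\rho^{2}\kappa^{2}}{\kappa+\sigma^{2}/n}\;\xrightarrow[\;n\to\infty\;]{}\;\kappa\,(1-\rho^{2}).$$

With $\rho=0$, which corresponds to the factorized prior of Assumption 5(f), observations at $V$ leave the variance at $U$ unchanged at $\kappa$. With $0<|\rho|<1$ the variance at $U$ falls but stays strictly positive however many observations are collected at $V$, which is why the first term of the bound (41) shrinks under a correlated prior without vanishing.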

### C.19 Per-trial conditional analysis: does the Fisher-SEP-R gap depend on pilot detection?

The unconditional Fisher-SEP-R $-$ A-SOP gap on vending at $T=1600$ is $+4.58$ pp (paired Wilcoxon $p=0.007$; Table [A7](https://arxiv.org/html/2605.21458#A3.T7)). A natural question is whether this effect is driven by those trials in which the 5-day pilot's hypothesis test (Bonferroni at $\alpha=0.05/15$) plus the "add top-3 Fisher-sensitive VMs" heuristic tags at least one of the two truly-misspecified machines: University (VM 3) or New Neighborhood (VM 4). We stratify the 30-trial common-seed cache by whether Fisher-SEP-R's post-pilot "uncertain" set contains either of those two VMs, and re-run the paired test on each stratum.

Table A9: Per-trial conditional analysis of Fisher-SEP-R $-$ A-SOP at $T=1600$, stratified by whether the 5-day pilot plus Fisher-augmentation flagged at least one of {University, New Neighborhood} into the post-pilot exploration set. Paired-Wilcoxon $p$-value is one-sided (alternative: Fisher-SEP-R mean $>$ A-SOP mean). $n=30$ common-seed trials, $\mathrm{SEED}=42+\mathrm{trial}\times 100$. Values in % of oracle cash at $T=1600$. Marginal CI half-widths are 95% (ordinary $t$) within each stratum; paired-difference CI half-width is 95% paired-$t$. Cache: code/results/vending_conditional.{npz,csv}.

Three observations follow from the table. The flagged stratum (25 of 30 trials; $83\%$) carries essentially all of the unconditional effect: the within-stratum paired gap is $+5.45$ pp with $p=0.004$, slightly larger than the $+4.58$ pp pooled result, because the unflagged stratum's near-zero contribution dilutes the pooled mean. The unflagged stratum (5 of 30 trials; $17\%$) shows essentially no effect: paired difference $+0.25$ pp with a 95% CI of $[-11.75,+12.24]$ and $p=0.50$. With only 5 trials the CI is too wide to rule out a practically meaningful gap, but the point estimate is consistent with the obvious mechanism: when the pilot does not tag either misspecified machine, Fisher-SEP-R's post-pilot exploration budget is spent on well-calibrated machines and the effective learning rate over U/NN converges to the A-SOP's passive rate. The A-SOP's own within-stratum performance moves in the expected direction: $69.6\%$ when the pilot flags U or NN (these trials happen to be seeds on which the simulator's mis-specification is slightly more exploitable by any learning policy) versus $75.7\%$ on the unflagged subset.
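The dilution claim can be checked directly, since the pooled mean paired difference is the trial-weighted average of the two stratum means reported above:

$$\frac{25\times 5.45+5\times 0.25}{30}\;=\;\frac{137.5}{30}\;\approx\;4.58\ \text{pp}.$$

The flagged stratum contributes $136.25$ of the $137.5$ total, so roughly 99% of the pooled effect comes from trials in which at least one misspecified machine was tagged.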

The conditional analysis shows that the unconditional $+4.58$ pp effect is concentrated in the $\approx 83\%$ of trials where the pilot protocol's detection succeeds. The remaining $\approx 17\%$ of trials show no measurable gap. This is the expected behavior of a pilot-gated directed-exploration policy: when the gate fails, the policy reduces to a minor perturbation of A-SOP. For a deployment-scale analysis, the operational implication is that the gate's false-negative rate (pilot failing to tag a truly-misspecified machine) upper-bounds the headroom gained over passive A-SOP. A less conservative FDR-controlling multiple-testing procedure would raise the detection probability on the unflagged stratum at the cost of additional false positives on calibrated machines; we do not pursue this tuning here.

## Appendix D HIV Mobile Testing: Data-Generating Process

This appendix provides the complete specification of the data-generating process for the HIV mobile testing experiment of Section [5.2](https://arxiv.org/html/2605.21458#S5.SS2). All numerical values match the reference implementation (`hiv_testing.py` and `exp_hiv_testing.py`).

### D.1 Geography and grid structure

The environment is a $5\times 8$ grid ($\texttt{rows}=5$, $\texttt{cols}=8$) comprising $|\mathcal{S}|=40$ zones indexed $j=r\cdot\texttt{cols}+c$ for row $r\in\{0,\ldots,4\}$ and column $c\in\{0,\ldots,7\}$. A wall runs along column $\texttt{mid\_col}=\lfloor 8/2\rfloor=4$, dividing the grid into two regions:

- Region A (columns $0$–$3$): well-surveilled urban zones with clinic-based testing infrastructure.
- Region B (columns $4$–$7$): hard-to-reach peri-urban areas requiring mobile outreach.

The wall blocks all movement between regions except through a single corridor cell at $(\texttt{mid\_row},\texttt{mid\_col})=(2,4)$. Formally, a move from $(r_1,c_1)$ to $(r_2,c_2)$ is blocked if $(c_1<4)\neq(c_2<4)$ unless $r_1=r_2=2$. Each zone $j$ is adjacent to its four cardinal neighbors (up, down, left, right) subject to grid boundaries and the wall constraint; teams may also stay in place.
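The movement rule can be sketched as follows. This is a minimal illustration written from the description in this subsection, not an excerpt from the reference implementation; the function names `is_blocked`, `valid_moves`, and `zone_index` are hypothetical.

```python
ROWS, COLS = 5, 8
MID_ROW, MID_COL = 2, COLS // 2  # corridor cell (2, 4)


def is_blocked(r1, c1, r2, c2):
    """True when a move crosses the wall outside the corridor row."""
    crosses_wall = (c1 < MID_COL) != (c2 < MID_COL)
    return crosses_wall and not (r1 == r2 == MID_ROW)


def valid_moves(r, c):
    """Cells reachable in one day: stay in place or take an unblocked cardinal step."""
    moves = [(r, c)]
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r2, c2 = r + dr, c + dc
        in_grid = 0 <= r2 < ROWS and 0 <= c2 < COLS
        if in_grid and not is_blocked(r, c, r2, c2):
            moves.append((r2, c2))
    return moves


def zone_index(r, c):
    """Flat zone index j = r * cols + c."""
    return r * COLS + c
```

Under this sketch, `(2, 3)` is the only Region A cell from which a team can step into Region B, which is what makes the corridor a bottleneck for exploration.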

### D.2 Population model

Each zone $j$ has a fixed population $N_j$:

$$N_j=\begin{cases}500&\text{if }j\in\text{Region A}\;(c_j<4),\\ 300&\text{if }j\in\text{Region B}\;(c_j\geq 4).\end{cases}\tag{42}$$

Region B's smaller population reflects lower population density in peri-urban areas.

### D.3 Disease dynamics (SIS with treatment-reduced transmission)

Prevalence evolves according to a discrete-time Susceptible–Infected–Susceptible (SIS) model in which the treatment factor $\tau=0.90$ in Eq. ([43](https://arxiv.org/html/2605.21458#A4.E43)) acts as the recovery channel: diagnosed-and-treated individuals contribute $(1-\tau)$ as much to the infectious pool, which under continued treatment renders them effectively susceptible to re-acquisition under the SIS interpretation. Let $p_j(t)\in[0.001,0.80]$ denote the prevalence (fraction infected) at zone $j$ on day $t$, and let $D_j(t)$ denote the cumulative number of diagnosed individuals at zone $j$ by day $t$ (capped at $N_j$). The effective number of infectious individuals at zone $j$ accounts for the treatment effect:

$$I_j^{\mathrm{eff}}(t)=p_j(t)\cdot\max\bigl(0,\;N_j-D_j(t)\bigr)+D_j(t)\cdot(1-\tau),\tag{43}$$

where $\tau=0.90$ is the treatment reduction factor (diagnosed individuals transmit at 10% of the untreated rate). The first term counts undiagnosed infected individuals (prevalence times undiagnosed population); the second counts diagnosed individuals whose residual infectiousness is reduced by treatment.

The force of infection at zone $j$ has two components:

$$\mathrm{FoI}_j^{(w)}(t)=\beta_w\cdot\frac{I_j^{\mathrm{eff}}(t)}{N_j},\tag{44}$$

$$\mathrm{FoI}_j^{(b)}(t)=\sum_{k\in\mathrm{nb}(j)}\beta_b\cdot\frac{I_k^{\mathrm{eff}}(t)}{N_k},\tag{45}$$

where $\beta_w=0.002$ is the within-zone transmission rate, $\beta_b=0.0005$ is the between-zone transmission rate, and $\mathrm{nb}(j)$ denotes the set of zones adjacent to $j$ (respecting the wall).

Prevalence updates as:

$$p_j(t+1)=\mathrm{clip}\Bigl[\;p_j(t)+\bigl(1-p_j(t)\bigr)\cdot\bigl(\mathrm{FoI}_j^{(w)}(t)+\mathrm{FoI}_j^{(b)}(t)\bigr),\;0.001,\;0.80\;\Bigr].\tag{46}$$

The clip operation prevents prevalence from falling below $0.001$ (background rate) or exceeding $0.80$.
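One day of the dynamics in Eqs. (43)–(46) can be written as a single vectorized update. The sketch below is an illustration under stated assumptions, not the reference `spread_disease` routine: the function name `sis_step` and the representation of $\mathrm{nb}(j)$ as a 0/1 adjacency matrix are choices made here.

```python
import numpy as np

TAU, BETA_W, BETA_B = 0.90, 0.002, 0.0005
P_MIN, P_MAX = 0.001, 0.80


def sis_step(p, D, N, adjacency):
    """Advance prevalence by one day.

    p: prevalence per zone; D: cumulative diagnosed; N: population;
    adjacency: 0/1 matrix with adjacency[j, k] = 1 when k is in nb(j).
    """
    i_eff = p * np.maximum(0.0, N - D) + D * (1.0 - TAU)   # Eq. (43)
    frac = i_eff / N
    foi_within = BETA_W * frac                              # Eq. (44)
    foi_between = BETA_B * (adjacency @ frac)               # Eq. (45)
    p_next = p + (1.0 - p) * (foi_within + foi_between)     # Eq. (46)
    return np.clip(p_next, P_MIN, P_MAX)


# Toy usage: an epicenter-like zone next to a low-prevalence zone.
p0 = np.array([0.30, 0.04])
N0 = np.array([300.0, 300.0])
adj = np.array([[0.0, 1.0], [1.0, 0.0]])
p1 = sis_step(p0, np.zeros(2), N0, adj)
```

In the toy usage both prevalences rise slightly in one day, and the low-prevalence zone's increase is driven mostly by the between-zone term from its high-prevalence neighbor.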

### D.4 Treatment effect

When a testing team diagnoses an individual as HIV-positive, that individual enters treatment. Diagnosed individuals remain in the population but transmit at a reduced rate: their contribution to the force of infection is scaled by $(1-\tau)=0.10$, as shown in Eq. ([43](https://arxiv.org/html/2605.21458#A4.E43)). This reflects the well-documented effect of antiretroviral therapy on viral suppression and onward transmission [Warren et al., [2025](https://arxiv.org/html/2605.21458#bib.bib152)]. The cumulative diagnosed count $D_j(t)$ is non-decreasing and capped at $N_j$.

### D.5 Testing model

Each of the $n=8$ mobile testing teams can test at one zone per day. Let $m_j(t)$ denote the number of teams present at zone $j$ on day $t$. For each team at zone $j$, the number of tests administered is drawn from a Poisson distribution, floored at one:

$$n_{\mathrm{tests}}\sim\max\bigl(1,\;\mathrm{Poisson}(\mu_{\mathrm{tests}})\bigr),\qquad\mu_{\mathrm{tests}}=8.\tag{47}$$

Each test yields a positive result with probability equal to the effective prevalence at zone $j$, adjusted for warmup status:

$$n_{\mathrm{pos}}\sim\mathrm{Binomial}\bigl(n_{\mathrm{tests}},\;p_j(t)\cdot\phi_j(t)\bigr),\tag{48}$$

where $\phi_j(t)$ is the yield multiplier defined in Section [D.6](https://arxiv.org/html/2605.21458#A4.SS6).

### D.6 Warmup mechanism

Region B zones start "cold": testing teams must establish community rapport before achieving full testing yield. Each zone $j$ maintains a warmup counter $w_j(t)\in[0,w_{\max}]$ with $w_{\max}=3$ days. All Region A zones are initialized as warm ($w_j(0)=3$); all Region B zones start cold ($w_j(0)=0$). On each day, if at least one team is present at zone $j$, the counter increments: $w_j(t+1)=\min(w_j(t)+1,w_{\max})$. The yield multiplier is:

$$\phi_j(t)=\begin{cases}1.0&\text{if }w_j(t)\geq w_{\max}\quad(\text{warm}),\\ 0.20&\text{otherwise}\quad(\text{cold}).\end{cases}\tag{49}$$

This reflects the community engagement costs documented in mobile HIV testing programs: cold outreach in unfamiliar areas achieves only 20% of the testing yield achievable in established sites [Gonsalves et al., [2018](https://arxiv.org/html/2605.21458#bib.bib150)].
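The testing and warmup rules of Eqs. (47)–(49) combine into a short per-team sampling routine. The sketch below is illustrative only; the function names are hypothetical, and leaving the counter unchanged on days with no team present is an assumption, since the text specifies only the increment.

```python
import numpy as np

MU_TESTS, W_MAX, COLD_YIELD = 8, 3, 0.20


def yield_multiplier(w):
    """Eq. (49): full yield once the counter reaches W_MAX, else 20%."""
    return 1.0 if w >= W_MAX else COLD_YIELD


def run_team_day(p_j, w_j, rng):
    """Draw one team's tests and positives at a zone, Eqs. (47)-(48)."""
    n_tests = max(1, int(rng.poisson(MU_TESTS)))
    n_pos = int(rng.binomial(n_tests, p_j * yield_multiplier(w_j)))
    return n_tests, n_pos


def advance_warmup(w_j, teams_present):
    """Increment the counter when at least one team is present.

    Assumption: the counter is left unchanged on days with no team.
    """
    return min(w_j + 1, W_MAX) if teams_present > 0 else w_j


rng = np.random.default_rng(0)
n_tests, n_pos = run_team_day(p_j=0.30, w_j=0, rng=rng)  # cold epicenter visit
```

At a cold zone with true prevalence $0.30$, the per-test positive probability in this sketch is $0.30\times 0.20=0.06$ until the counter reaches $w_{\max}$.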

### D.7 Confounding mechanism

The simulator's prevalence estimates are calibrated from clinic-based surveillance data, which systematically underrepresents hard-to-reach populations in Region B. For each zone $j$, the true initial prevalence $p_j(0)$ and the simulator's estimate $\hat{p}_j^{\mathrm{sim}}$ are drawn as:

$$p_j(0)=\begin{cases}0.05+\epsilon_j,\quad\epsilon_j\sim\mathcal{N}(0,0.005^{2})&\text{if }c_j<4\;\text{(Region A)},\\ 0.04+\epsilon_j,\quad\epsilon_j\sim\mathcal{N}(0,0.005^{2})&\text{if }c_j\geq 4\;\text{(Region B, non-cluster)},\end{cases}\tag{50}$$

$$\hat{p}_j^{\mathrm{sim}}=\begin{cases}0.05+\eta_j,\quad\eta_j\sim\mathcal{N}(0,0.003^{2})&\text{if }c_j<4\;\text{(Region A)},\\ 0.02+\eta_j,\quad\eta_j\sim\mathcal{N}(0,0.002^{2})&\text{if }c_j\geq 4\;\text{(Region B)}.\end{cases}\tag{51}$$

Both are clipped to $[0.001,0.50]$. The key asymmetry is that the simulator estimates Region B prevalence at $\hat{p}^{\mathrm{sim}}\approx 0.02$, while the true non-cluster prevalence is $p\approx 0.04$, and the cluster prevalence is far higher (see below). This underestimation arises because clinic-based surveillance captures patients who self-present, missing the hard-to-reach population that mobile outreach is designed to serve.

### D.8 Cluster specification

A hidden disease cluster is planted in the bottom-right corner of Region B. The epicenter is at zone $(r,c)=(\texttt{rows}-1,\texttt{cols}-1)=(4,7)$ with true initial prevalence $p_{\mathrm{epi}}=0.30$. All valid neighbors of the epicenter within Region B (i.e., zones $(r',c')$ with $|r'-4|\leq 1$, $|c'-7|\leq 1$, $0\leq r'<5$, and $c'\geq 4$) have initial prevalence $p_{\mathrm{nb}}=0.18$. The simulator's estimate for these zones remains $\hat{p}^{\mathrm{sim}}\approx 0.02$, so the cluster is invisible to the SOP.
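The initial maps of Sections D.7 and D.8 can be generated as below. This is a sketch under stated assumptions: the function name `initial_prevalence` is hypothetical, and writing the cluster values after clipping is a choice made here (the order does not matter because both cluster values lie inside the clip range).

```python
import numpy as np

ROWS, COLS, MID_COL = 5, 8, 4


def initial_prevalence(rng):
    """Return (true, simulator) prevalence maps of shape (ROWS, COLS)."""
    region_b = np.broadcast_to(np.arange(COLS) >= MID_COL, (ROWS, COLS))
    # Eq. (50): true non-cluster prevalence, 0.05 in Region A and 0.04 in Region B.
    p_true = np.where(region_b, 0.04, 0.05) + rng.normal(0.0, 0.005, (ROWS, COLS))
    # Eq. (51): simulator estimate, 0.05 in Region A and 0.02 in Region B.
    sd_sim = np.where(region_b, 0.002, 0.003)
    p_sim = np.where(region_b, 0.02, 0.05) + rng.normal(0.0, 1.0, (ROWS, COLS)) * sd_sim
    p_true = np.clip(p_true, 0.001, 0.50)
    p_sim = np.clip(p_sim, 0.001, 0.50)
    # Cluster: epicenter at (4, 7); neighbors within one row and one column, inside Region B.
    er, ec = ROWS - 1, COLS - 1
    for r in range(max(0, er - 1), min(ROWS, er + 2)):
        for c in range(max(MID_COL, ec - 1), min(COLS, ec + 2)):
            p_true[r, c] = 0.30 if (r, c) == (er, ec) else 0.18
    return p_true, p_sim


p_true, p_sim = initial_prevalence(np.random.default_rng(0))
```

With the neighbor condition as stated, the cluster consists of the epicenter $(4,7)$ and the three zones $(3,6)$, $(3,7)$, and $(4,6)$; the simulator map is untouched by the cluster loop.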

If the cluster remains undetected, SIS dynamics (Eq. ([46](https://arxiv.org/html/2605.21458#A4.E46))) cause prevalence to grow: the high within-zone force of infection at the epicenter spills over to neighboring zones via the between-zone transmission term, gradually expanding the outbreak. Early detection and treatment (via the SEP's corridor-crossing exploration) contain this growth by reducing the effective infectious population through the treatment factor $\tau=0.90$.

### D.9 Belief model

Adaptive policies maintain a Beta-distributed belief over each zone's prevalence. The prior is initialized from the simulator's estimates with strength parameter $\kappa=10$:

$$\alpha_j^{(0)}=\hat{p}_j^{\mathrm{sim}}\cdot\kappa+1,\qquad\beta_j^{(0)}=(1-\hat{p}_j^{\mathrm{sim}})\cdot\kappa+1.\tag{52}$$

After observing $n_{\mathrm{tests}}$ tests with $n_{\mathrm{pos}}$ positives at zone $j$, the belief updates via the conjugate rule: $\alpha_j\leftarrow\alpha_j+n_{\mathrm{pos}}$, $\beta_j\leftarrow\beta_j+(n_{\mathrm{tests}}-n_{\mathrm{pos}})$. The posterior mean $\hat{p}_j=\alpha_j/(\alpha_j+\beta_j)$ serves as the prevalence estimate for replanning.
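The prior of Eq. (52) and the conjugate update amount to a few lines of arithmetic. The sketch below uses hypothetical function names and is not taken from the reference implementation.

```python
KAPPA = 10.0


def init_belief(p_sim):
    """Eq. (52): Beta prior built from the simulator estimate."""
    return p_sim * KAPPA + 1.0, (1.0 - p_sim) * KAPPA + 1.0


def update_belief(alpha, beta, n_tests, n_pos):
    """Conjugate Beta update after n_tests tests with n_pos positives."""
    return alpha + n_pos, beta + (n_tests - n_pos)


def posterior_mean(alpha, beta):
    """Prevalence estimate used for replanning."""
    return alpha / (alpha + beta)


# A Region B zone at the nominal simulator estimate, then one visit with 2 of 8 positive.
alpha, beta = init_belief(0.02)
alpha, beta = update_belief(alpha, beta, n_tests=8, n_pos=2)
estimate = posterior_mean(alpha, beta)
```

Because of the $+1$ offsets, the prior mean under Eq. (52) is $(\hat{p}^{\mathrm{sim}}\kappa+1)/(\kappa+2)$, which for $\hat{p}^{\mathrm{sim}}=0.02$ evaluates to $0.10$; the example visit moves the estimate to $3.2/20=0.16$.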

### D.10 Policy specifications

Six policies are evaluated, plus the oracle:

- Oracle: Knows the true prevalence map. Sends 4 teams through the corridor to the cluster zones and 4 teams to the highest-prevalence Region A zones.
- SOP: Assigns all 8 teams to the highest-prevalence zones according to the simulator's estimates $\hat{p}^{\mathrm{sim}}$. Since $\hat{p}^{\mathrm{sim}}$ is higher in Region A than Region B, all teams remain in Region A.
- $\epsilon$-greedy ($\epsilon=0.15$): Follows SOP targets with probability $1-\epsilon$; with probability $\epsilon$, moves to a uniformly random valid neighbor. Does not update beliefs.
- A-SOP (replan every 10 days): Maintains a Beta belief (Eq. ([52](https://arxiv.org/html/2605.21458#A4.E52))), updates from test observations, and replans targets to the highest posterior-mean zones every 10 days. Does not deliberately explore.
- Thompson Sampling (replan every 5 days): Samples prevalences from the Beta posterior and navigates to the zones with highest sampled prevalence. Posterior sampling occasionally draws high values for Region B, providing implicit exploration.
- SEP ($n_{\mathrm{explore}}=3$, $T_{\mathrm{explore}}=25$ days, replan every 10 days): Dedicates 3 of 8 teams to exploration for the first 25 days. Explorers navigate through the corridor to the epicenter zone $(4,7)$; the remaining 5 teams follow SOP targets in Region A. After day 25, all teams switch to posterior-optimal targets.
- Fisher-SEP ($T_{\mathrm{pilot}}=3$ days, $T_{\mathrm{explore}}=25$ days, replan every 5 days): Three-phase protocol. During the 3-day pilot, all teams perform a random walk to gather initial data. During the 25-day exploration phase, the A-optimal PVV criterion of Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5) determines the number of explore teams and their targets. The HIV realization uses the disease-dynamics value gradient (15-day SIS lookahead of cumulative yield with respect to per-zone prevalence) in place of the static Bellman-resolvent gradient of Eq. ([56](https://arxiv.org/html/2605.21458#A4.E56)), because the static simulator MDP inherits the simulator's Region-A-concentrated value structure and the corridor geometry requires a gradient that captures contagion spread through Region B (Appendix [D.11](https://arxiv.org/html/2605.21458#A4.SS11)). Posterior parameter variance $\mathrm{Var}(\theta_{s,a})$ is taken from the Beta-Binomial posterior (Eq. ([52](https://arxiv.org/html/2605.21458#A4.E52))). After exploration, all teams follow posterior-optimal targets.

##### $S$-measurability of the HIV random-walk pilot.

Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2) requires the pilot policy $\pi^{\mathrm{exp}}$ to be $S$-measurable: $\pi^{\mathrm{exp}}(a\mid s,h)=\pi^{\mathrm{exp}}(a\mid s)$ for all hidden states $h$. On the HIV grid with 8 teams, the action is a joint assignment $a=(a_1,\ldots,a_8)\in\mathbb{A}^{8}$; the random walk takes independent uniform-random moves per team, conditioned only on each team's current cell (an observed quantity). Corridor constraints are deterministic functions of the observed position ($(2,4)$ is the only cell that connects Region A to Region B; other cells have full 4-neighbor valid-move sets), so the per-team action distribution is $S$-measurable in the 8-tuple of observed cells. The joint random walk is therefore $S$-measurable as a product measure, and Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2) applies: pilot observations at cell $s$ are marginally distributed as $\mathbb{P}(R\mid s,\mathrm{do}(a))$ under the interventional distribution. The identification argument is unaffected by the fact that teams share a global action space; what matters is that each team's action at its own cell does not depend on latent $H$ beyond what its observed position $s$ already encodes.

### D.11 Numerical implementation of Fisher-SEP-T under the Gonsalves SIS dynamics

This subsection describes the numerical implementation of Fisher-SEP-T (Corollary [3](https://arxiv.org/html/2605.21458#Thmcorollary3)) under the Gonsalves SIS dynamics. The implementation is not a separate algorithm; it realizes the transition block of Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5) (the reward block is pinned by the Beta observation model, cf. Corollary [3](https://arxiv.org/html/2605.21458#Thmcorollary3)) with the DGP's non-tabular disease dynamics. Three DGP-specific choices follow from properties of the POMDP, not from tuning.

##### Disease-dynamics value gradient.

The static Bellman-resolvent gradient of Eq. ([56](https://arxiv.org/html/2605.21458#A4.E56)) treats transitions as the simulator's deterministic valid-move indicators. This is insufficient for HIV because perturbing prevalence at a Region-B cluster zone changes future prevalence at neighboring zones through the SIS dynamics of Eq. ([46](https://arxiv.org/html/2605.21458#A4.E46)), and that spread effect is the dominant contribution to $\partial V/\partial\theta_j$ at cluster neighbors. We therefore use a 15-day SIS-propagated finite-difference gradient: perturb $\mathrm{prev}_j$ by $\delta=0.01$, iterate `spread_disease` for $T_{\mathrm{look}}=15$ days, and compute the change in total discounted yield $\sum_{t=0}^{T_{\mathrm{look}}}\gamma^{t}\sum_{j}\mathrm{prev}_{j,t}\cdot n_{\mathrm{tests}}$. This substitutes the simulator's own dynamics for the Bellman-resolvent gradient; the equivalence at the first-order level is derived in App. [D.12](https://arxiv.org/html/2605.21458#A4.SS12).
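The finite-difference recipe can be sketched generically. In the code below, `spread_fn` is a stand-in for `spread_disease`, whose actual signature is not given in the text; the discount `gamma=0.95`, the constant `n_tests=8.0`, and the division of the yield change by $\delta$ are assumptions made for illustration.

```python
import numpy as np


def fd_yield_gradient(prev, spread_fn, j, delta=0.01, t_look=15, gamma=0.95, n_tests=8.0):
    """Finite-difference sensitivity of discounted yield to prevalence at zone j.

    spread_fn maps a prevalence vector to the next day's prevalence vector
    (a hypothetical interface standing in for the simulator's dynamics).
    """
    def discounted_yield(p0):
        p, total = np.array(p0, dtype=float), 0.0
        for t in range(t_look + 1):            # t = 0, ..., T_look as in the text
            total += (gamma ** t) * n_tests * p.sum()
            p = spread_fn(p)
        return total

    bumped = np.array(prev, dtype=float)
    bumped[j] += delta                          # perturb prevalence at zone j only
    return (discounted_yield(bumped) - discounted_yield(prev)) / delta


# Sanity check with dynamics that never change prevalence.
g = fd_yield_gradient(np.array([0.1, 0.2]), lambda p: p, j=0, gamma=0.5, t_look=2)
```

With static dynamics the sensitivity reduces to $n_{\mathrm{tests}}\sum_{t}\gamma^{t}$, which is $8\times 1.75=14$ in the sanity check; any additional sensitivity under real dynamics comes from the perturbation spreading to other zones.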

##### Beta-Binomial posterior parameter variance.

The pilot reward at a zone is $n_{\mathrm{pos}}\sim\mathrm{Binomial}(n_{\mathrm{tests}},p_j\phi_j)$, and the belief model of Eq. ([52](https://arxiv.org/html/2605.21458#A4.E52)) gives a Beta posterior for $p_j$ whose variance is $\alpha\beta/[(\alpha+\beta)^{2}(\alpha+\beta+1)]$. We use this Beta posterior variance directly as $\mathrm{Var}(\theta_j)$ in $\mathrm{PVV}(\pi)$, rather than constructing a Gaussian NIG posterior on the Gaussianized reward. The two are equivalent in the limit of large tests-per-team; the Beta form is exact.
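As a worked instance of the variance formula, consider a Region B zone at the prior of Eq. (52), taking the nominal simulator estimate $\hat{p}^{\mathrm{sim}}=0.02$ (the noise term $\eta_j$ is ignored here for illustration) and $\kappa=10$, so that $\alpha=1.2$ and $\beta=10.8$:

$$\mathrm{Var}(\theta_j)=\frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}=\frac{1.2\times 10.8}{12^{2}\times 13}=\frac{12.96}{1872}\approx 0.0069.$$

Equivalently, with prior mean $m=\alpha/(\alpha+\beta)=0.10$, the variance is $m(1-m)/(\alpha+\beta+1)=0.09/13$.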

##### Target-based navigation: the navigation-restricted Fisher-SEP-T instance.

Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5) minimizes $\mathrm{PVV}$ over stochastic policies. For HIV, a per-step stochastic sampler cannot reliably traverse the single corridor within $T_{\mathrm{explore}}=25$ days because the expected number of "move right" actions needed to reach Region B exceeds what any non-degenerate stochastic policy achieves. We therefore run *navigation-restricted Fisher-SEP-T* (Conjecture [1](https://arxiv.org/html/2605.21458#Thmconjecture1)): the A-optimal argmin is restricted to the deterministic-navigation-plus-exploit class $\Pi_{\mathrm{nav}}$, which ranks zones by their per-zone PVV contribution and sends teams to the top-ranked targets along shortest admissible paths.

Let $\mathcal{R}\subset\mathbb{S}\times\mathbb{A}$ denote the set of $(s',a')$ pairs reachable by the unrestricted minimizer $\pi^{\star}_{\mathrm{PVV}}$ within $T_{\mathrm{explore}}$ across the corridor, and $\mathcal{R}_{\mathrm{nav}}$ the corresponding set under $\pi^{\star}_{\mathrm{nav}}$. The approximation error from navigation restriction is bounded (on the corridor-crossing-failure event) by the PVV-weighted prior-variance sum over pairs in $\mathcal{R}\setminus\mathcal{R}_{\mathrm{nav}}$ (pairs the unrestricted minimizer reaches but the navigation-restricted policy does not):

$$C_{\mathrm{tail}}\;:=\;\sum_{(s',a')\in\mathcal{R}\setminus\mathcal{R}_{\mathrm{nav}}}\sum_{s}d_{\pi^{\mathrm{tgt}}}(s)\,\nabla_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}(s)^{\top}\,\Sigma_{p}(s',a')\,\nabla_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}(s).\tag{53}$$

Under the strong-bottleneck condition (every path from Region A to Region B in $\mathcal{R}$ passes through a single corridor cell $\mathcal{B}$), $\mathcal{R}\setminus\mathcal{R}_{\mathrm{nav}}$ is precisely the set of beyond-corridor pairs that $\pi^{\star}_{\mathrm{PVV}}$ reaches (via the corridor) but $\pi^{\star}_{\mathrm{nav}}$ misses when it fails to cross. Concretely, the HIV grid has bottleneck set $\mathcal{B}=\{(2,4)\}$; the unrestricted minimizer's per-step policy mixes actions with probability bounded by $\pi(a\mid s)\leq 1$, so crossing the corridor in $T_{\mathrm{explore}}=25$ days requires probability mass on the exact sequence of moves through $(2,4)$, which is $O(\pi_{\min}^{D})$ where $D$ is the minimum hitting distance.

##### Numerical value of the path-overlap constant $\kappa$ on the HIV grid.

We compute $\kappa$ directly from the HIV grid geometry. The grid is $5\times 8$ with one corridor cell $(2,4)$ separating Region A (left) from Region B (right). Eight teams operate per day. Once a team crosses the corridor, Region B is reachable via shortest admissible paths; the unrestricted PVV minimizer $\pi^{\star}_{\mathrm{PVV}}$ and the navigation-restricted $\pi^{\star}_{\mathrm{nav}}$ agree on their post-corridor allocation to cluster cells, because the cluster cells' PVV weights dominate all other Region B pairs. The overlap constant $\kappa:=\min_{(s',a')\in\mathcal{R}\cap\mathcal{R}_{\mathrm{nav}}}n_{s',a'}(\pi^{\star}_{\mathrm{nav}})/n_{s',a'}(\pi^{\star}_{\mathrm{PVV}})$ is computed by simulating both policies for 30 common-seed trials and taking the empirical ratio over shared pairs:

$$\kappa_{\mathrm{HIV}}\;=\;0.92\pm 0.04\quad\text{(30 common-seed trials)}.\tag{54}$$

The gap $\kappa_{\mathrm{HIV}}^{-1}-1\approx 0.087$ is small but non-zero; propagating it through Theorem [5](https://arxiv.org/html/2605.21458#Thmtheorem5), the PVV-excess bound becomes $(1-q_{\mathrm{nav}})C_{\mathrm{tail}}+0.087\,C_{\mathrm{overlap}}$ rather than $(1-q_{\mathrm{nav}})C_{\mathrm{tail}}$ alone. On the HIV DGP $q_{\mathrm{nav}}\approx 1$ (the navigation policy crosses the corridor in 99% of trials), so the first term is $\approx 0$; the second term is $\approx 0.087\,C_{\mathrm{overlap}}$, which is non-trivial but bounded. The conjecture's original claim $\mathrm{PVV}_{p}(\pi^{\star}_{\mathrm{nav}})-\mathrm{PVV}_{p}(\pi^{\star}_{\mathrm{PVV}})\leq(1-q_{\mathrm{nav}})\cdot C_{\mathrm{tail}}$ therefore holds up to an additional $0.087\,C_{\mathrm{overlap}}$ correction on HIV, which does not alter the empirical conclusion.

In the corridor-dominated regime, $\Pi_{\mathrm{nav}}$ achieves $q_{\mathrm{nav}}\to 1$ (the navigation policy's own crossing probability) and the gap $(1-q_{\mathrm{nav}})\cdot C_{\mathrm{tail}}$ in Conjecture [1](https://arxiv.org/html/2605.21458#Thmconjecture1) collapses onto the beyond-corridor tail, whose PVV contribution is bounded by the Dirichlet prior variance at $n_{s',a'}(\pi^{\star}_{\mathrm{nav}})\geq 1$. The allocation between Region A (exploit) and Region B (explore) teams is proportional to the PVV share in each region, mirroring the structure of the SEP policy. This is an explicit approximation: navigation-restricted Fisher-SEP-T is the named variant run on HIV; unrestricted Fisher-SEP-T (Corollary [3](https://arxiv.org/html/2605.21458#Thmcorollary3)) is the reference minimum. The bound itself remains a conjecture (Section [6](https://arxiv.org/html/2605.21458#S6)): $C_{\mathrm{tail}}$ in ([53](https://arxiv.org/html/2605.21458#A4.E53)) is computable, but a formal proof that $(1-q_{\mathrm{nav}})\cdot C_{\mathrm{tail}}+(\kappa^{-1}-1)C_{\mathrm{overlap}}$ dominates the PVV excess of $\pi^{\star}_{\mathrm{nav}}$ over $\pi^{\star}_{\mathrm{PVV}}$ requires controlling the cross-contribution from pairs in $\mathcal{R}$ that $\pi^{\star}_{\mathrm{nav}}$ visits along a different path, which we leave to future work.

##### Empirical result.

With these three choices, Fisher-SEP reaches $85.3\pm 4.8$% of oracle at $T=400$ on HIV (Table [A6](https://arxiv.org/html/2605.21458#A3.T6)), $+30$ pp above A-SOP and statistically tied with SEP (Table [A8](https://arxiv.org/html/2605.21458#A3.T8)). Proposition [4](https://arxiv.org/html/2605.21458#Thmremark4) explains the mechanism: Region-B cluster zones have non-zero value gradient but are unvisited under the simulator-optimal policy (low $\kappa_0$), so they contribute a dominant reducible term $(\partial V/\partial\theta)^{2}\sigma^{2}(0)/\kappa_0$ to $\mathrm{PVV}$. Minimizing PVV drives visitation there.

Full results are cached at `code/results/fisher_v3/hiv_v3.{npz,csv}`.

### D.12 Yield-sensitivity navigation explorer: derivation as a truncated Bellman resolvent

The previous subsection describes a 15-day SIS-propagated finite-difference gradient, which is *not* literally the Bellman-resolvent gradient $\nabla_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}(s)=\gamma[(I-\gamma P^{\pi^{\mathrm{tgt}}})^{-1}]_{s,:}\,\mathrm{diag}(\pi^{\mathrm{tgt}}(a'\mid\cdot))\,(V^{\pi^{\mathrm{tgt}}})^{\top}$ of Corollary [3](https://arxiv.org/html/2605.21458#Thmcorollary3). This subsection (i) names the implemented variant the *yield-sensitivity navigation explorer* (YSNE), (ii) derives it as a $T_{\mathrm{look}}$-truncated Neumann series of the full resolvent applied to a yield-weighted observable, and (iii) bounds the truncation error as a function of $T_{\mathrm{look}}$, the spread-matrix spectral radius, and the discount factor. The variant used on HIV is therefore a computable approximation to the Bellman-resolvent gradient with an explicit error budget, not an unrelated sensitivity object.

##### Neumann expansion of the resolvent.

For any discount $\gamma<1$ and substochastic $P^{\pi^{\mathrm{tgt}}}$, the Bellman resolvent admits the absolutely convergent Neumann series

$$(I-\gamma P^{\pi^{\mathrm{tgt}}})^{-1}\;=\;\sum_{t=0}^{\infty}\gamma^{t}\bigl(P^{\pi^{\mathrm{tgt}}}\bigr)^{t}.\tag{55}$$

The transition-block gradient of $V^{\pi^{\mathrm{tgt}}}(s)$ with respect to the kernel entry $p(s''\mid s',a')$ (an $|\mathbb{S}|$-vector indexed by the destination $s''$) is obtained by differentiating the Bellman equation:

$$\bigl[\nabla_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}(s)\bigr]_{s''}\;=\;\gamma\sum_{t=0}^{\infty}\gamma^{t}\bigl[\bigl(P^{\pi^{\mathrm{tgt}}}\bigr)^{t}\bigr]_{s,s'}\cdot\pi^{\mathrm{tgt}}(a'\mid s')\cdot V^{\pi^{\mathrm{tgt}}}(s'').\tag{56}$$

Equivalently, as a column vector, $\nabla_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}(s)=\gamma\,[(I-\gamma P^{\pi^{\mathrm{tgt}}})^{-1}]_{s,s'}\,\pi^{\mathrm{tgt}}(a'\mid s')\cdot V^{\pi^{\mathrm{tgt}}}$. Truncating ([55](https://arxiv.org/html/2605.21458#A4.E55)) at $t=T_{\mathrm{look}}-1$ and replacing $P^{\pi^{\mathrm{tgt}}}$ with the SIS one-step prevalence-spread operator $L$ (the Jacobian of `spread_disease` around the simulator's nominal prevalence map) gives the YSNE gradient (an $|\mathbb{S}|$-vector indexed by $s''$)

$$\bigl[\widehat{\nabla}^{\mathrm{YSNE}}_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}(s)\bigr]_{s''}\;:=\;\sum_{t=0}^{T_{\mathrm{look}}-1}\gamma^{t}\bigl[L^{t}\bigr]_{s,s'}\cdot\pi^{\mathrm{tgt}}(a'\mid s')\cdot y_{s''},\tag{57}$$

where $y_j=\mathrm{prev}_j\cdot n_{\mathrm{tests}}$ is the per-zone yield observable (cumulative discounted cases found over the lookahead). The finite-difference recipe of the previous subsection (perturb $\mathrm{prev}_j$ by $\delta$, iterate `spread_disease` for $T_{\mathrm{look}}$ days, and read off the change in $\sum_{t}\gamma^{t}y_t$) is a numerical realization of ([57](https://arxiv.org/html/2605.21458#A4.E57)). The substitution of $(L,y)$ for $(P^{\pi^{\mathrm{tgt}}},V^{\pi^{\mathrm{tgt}}})$ is exact at the first-order level because, under the SIS model, the change in expected future yield with respect to a prevalence perturbation at zone $j$ is precisely $\sum_{t=0}^{T_{\mathrm{look}}-1}\gamma^{t}[L^{t}]_{s,j}\,y$.

##### Truncation error bound.

The YSNE gradient differs from the full resolvent gradient in ([56](https://arxiv.org/html/2605.21458#A4.E56)) by the tail of the Neumann series. Denote by $\rho(L)\in[0,1]$ the spectral radius of $L$ (for the SIS dynamics with between-zone coupling matrix $M$ and within-zone decay rate $\mu$, $\rho(L)=\max_{\lambda\in\sigma(M)}(1-\mu)+\beta\lambda$ under Gonsalves' parameterization).

###### Lemma 6 (YSNE truncation error vs. full Neumann under $L$).

Let $\widetilde{\nabla}_{\mathrm{full}}$ denote the *full Neumann gradient under the spread operator $L$*,

$$\bigl[\widetilde{\nabla}_{\mathrm{full}}\bigr]_{s''}\;:=\;\sum_{t=0}^{\infty}\gamma^{t}[L^{t}]_{s,s'}\cdot\pi^{\mathrm{tgt}}(a'\mid s')\cdot y_{s''},$$

and let $\widehat{\nabla}^{\mathrm{YSNE}}$ denote the YSNE gradient ([57](https://arxiv.org/html/2605.21458#A4.E57)). Assume $\gamma\rho(L)<1$. Then

$$\bigl\|\widehat{\nabla}^{\mathrm{YSNE}}_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}(s)-\widetilde{\nabla}_{\mathrm{full};\,p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}(s)\bigr\|_{\infty}\;\leq\;\frac{(\gamma\rho(L))^{T_{\mathrm{look}}}}{1-\gamma\rho(L)}\cdot\pi^{\mathrm{tgt}}(a'\mid s')\cdot\|y\|_{\infty}.\tag{58}$$

In particular, the YSNE truncation error decays geometrically in $T_{\mathrm{look}}$ with rate $\gamma\rho(L)$.

The bound does*not*cover the substitution error between∇~full\\widetilde\{\\nabla\}\_\{\\mathrm\{full\}\}\(usingLL\) and∇\\nabla\(usingPπtgtP^\{\\pi^\{\\mathrm\{tgt\}\}\}, from \([56](https://arxiv.org/html/2605.21458#A4.E56)\)\): the YSNE takes the first\-order disease\-dynamics expansion as a proxy for the Bellman\-resolvent propagation\. Equality between the two holds only when perturbing prevalence has the same first\-order propagation to yield that it has toVπtgtV^\{\\pi^\{\\mathrm\{tgt\}\}\}, which the yield observableyywas constructed to satisfy at the first\-order level \(Section[D\.11](https://arxiv.org/html/2605.21458#A4.SS11), Disease\-dynamics value gradient paragraph\)\. We do not provide an analytic bound on the residual substitution error; empirically, on the HIV DGP the YSNE\-based PVV\-minimization ranking matches the ranking that would be obtained by Monte\-Carlo estimation of the full∇Vπtgt\\nabla V^\{\\pi^\{\\mathrm\{tgt\}\}\}\(cf\. the paired\-test results of App\.[F](https://arxiv.org/html/2605.21458#A6)\)\.

###### Proof\.

The residual is the tail $\sum_{t=T_{\mathrm{look}}}^{\infty}\gamma^{t}[L^{t}]_{s,s'}\pi^{\mathrm{tgt}}(a'\mid s')y^{\top}$. Bounding $[L^{t}]_{s,s'}\leq\rho(L)^{t}$ (since $L$ has non-negative entries bounded in $\ell_{\infty}$-operator norm by its spectral radius under Gonsalves' parameterization, and $y$ is non-negative), and summing the geometric tail starting at $t=T_{\mathrm{look}}$, namely $\sum_{t\geq T_{\mathrm{look}}}(\gamma\rho(L))^{t}=(\gamma\rho(L))^{T_{\mathrm{look}}}/(1-\gamma\rho(L))$, gives ([58](https://arxiv.org/html/2605.21458#A4.E58)). ∎
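The geometric decay in (58) can be tabulated directly. The sketch below evaluates the right-hand side of (58) and compares it against a brute-force sum of the geometric tail. The function names are hypothetical, and the value used for $\rho(L)$ is an illustrative assumption rather than a quantity reported in the paper; only $\gamma=0.95$ is taken from the HIV configuration.

```python
def ysne_truncation_bound(gamma, rho, t_look, pi_tgt=1.0, y_sup=1.0):
    """Right-hand side of (58): (gamma*rho)^T_look / (1 - gamma*rho) * pi_tgt * ||y||_inf."""
    rate = gamma * rho
    if not 0.0 <= rate < 1.0:
        raise ValueError("the bound requires 0 <= gamma * rho(L) < 1")
    return rate ** t_look / (1.0 - rate) * pi_tgt * y_sup


def geometric_tail(rate, t_look, n_terms=10_000):
    """Brute-force (truncated) sum of rate^t for t >= t_look, for comparison only."""
    return sum(rate ** t for t in range(t_look, t_look + n_terms))


if __name__ == "__main__":
    gamma = 0.95  # discount factor from the HIV configuration
    rho = 0.8     # hypothetical spectral radius, chosen only for illustration
    for t_look in (5, 10, 15):
        print(t_look, ysne_truncation_bound(gamma, rho, t_look))
```

Under these assumed values the bound shrinks by a factor of $(\gamma\rho)^{5}$ for every five extra look-ahead steps, which is the geometric rate stated after (58).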

### D.13 A regime-restricted result toward Conjecture [1](https://arxiv.org/html/2605.21458#Thmconjecture1)

We first state the conjecture whose partial proof this subsection develops\.

###### Conjecture 1 (Navigation-restricted Fisher-SEP-T).

Let $\Pi_{\mathrm{nav}}\subset\Pi_{\mathrm{adapt}}$ be the deterministic-navigation-plus-exploit class that ranks $(s',a')$ by per-pair PVV contribution and navigates to top-ranked pairs along shortest admissible paths. On an MDP with bottleneck set $\mathcal{B}\subset\mathbb{S}$, let $\pi^{\star}_{\mathrm{PVV}}$ be the unrestricted minimizer of the transition-parameter PVV ([11](https://arxiv.org/html/2605.21458#S4.E11)) (Corollary [3](https://arxiv.org/html/2605.21458#Thmcorollary3)), $\pi^{\star}_{\mathrm{nav}}$ its restriction to $\Pi_{\mathrm{nav}}$, and $q_{\mathrm{nav}}:=\mathbb{P}_{\pi^{\star}_{\mathrm{nav}}}[\text{cross }\mathcal{B}\text{ within }T_{\mathrm{explore}}]$ its crossing probability. We conjecture

$$\mathrm{PVV}_{p}(\pi^{\star}_{\mathrm{nav}})-\mathrm{PVV}_{p}(\pi^{\star}_{\mathrm{PVV}})\;\leq\;(1-q_{\mathrm{nav}})\cdot C_{\mathrm{tail}}+(\kappa^{-1}-1)\cdot C_{\mathrm{overlap}},$$

where $C_{\mathrm{tail}},C_{\mathrm{overlap}}$ are PVV-weighted prior-variance sums (defined below) and $\kappa\in(0,1]$ is the path-overlap constant. The bound vanishes as $(q_{\mathrm{nav}},\kappa)\to(1,1)$.

The navigation restriction of Fisher-SEP-T trades some expressivity for computability. Conjecture [1](https://arxiv.org/html/2605.21458#Thmconjecture1) stipulates the above bound. We now prove it in the *strong-bottleneck regime*, the regime in which the HIV case study operates, and show why the general bound (without the regime restriction) remains open.

##### Definitions\.

Let $\mathcal{R}\subset\mathbb{S}\times\mathbb{A}$ be the set of pairs reachable by the unrestricted PVV minimizer $\pi^{\star}_{\mathrm{PVV}}$ within $T_{\mathrm{explore}}$; let $\mathcal{R}_{\mathrm{nav}}\subset\mathbb{S}\times\mathbb{A}$ be the set of pairs reachable by $\pi^{\star}_{\mathrm{nav}}$ within $T_{\mathrm{explore}}$. The corridor cell is $\mathcal{B}\subset\mathbb{S}$ (a single cell on the HIV grid). We say the chain is in the *strong-bottleneck regime* if the following two conditions hold:

1. (i) *Bottleneck structure.* Every path from a Region-A state to a Region-B state in $\mathcal{R}$ passes through $\mathcal{B}$. The bottleneck set is $\mathcal{B}=\{(2,4)\}$ on the HIV grid.
2. (ii) *Path overlap.* For every $(s',a')\in\mathcal{R}\cap\mathcal{R}_{\mathrm{nav}}$, the expected pilot observation count $n_{s',a'}(\pi^{\star}_{\mathrm{nav}})$ is within a constant factor $\kappa\in(0,1]$ of $n_{s',a'}(\pi^{\star}_{\mathrm{PVV}})$: $n_{s',a'}(\pi^{\star}_{\mathrm{nav}})\geq\kappa\cdot n_{s',a'}(\pi^{\star}_{\mathrm{PVV}})$.

Condition (i) holds by construction of the HIV grid with the single corridor. Condition (ii) holds with $\kappa\approx 1$ because both policies send their exploration teams through the corridor and then fan out into Region-B along shortest admissible paths; $\kappa$ is bounded away from zero by the constant number of admissible paths through Region-B.

###### Theorem 5 (Conjecture [1](https://arxiv.org/html/2605.21458#Thmconjecture1) in the strong-bottleneck regime).

In the strong-bottleneck regime with path-overlap constant $\kappa$,

$$\mathrm{PVV}_{p}(\pi^{\star}_{\mathrm{nav}})-\mathrm{PVV}_{p}(\pi^{\star}_{\mathrm{PVV}})\;\leq\;(1-q_{\mathrm{nav}})\cdot C_{\mathrm{tail}}+\bigl(\kappa^{-1}-1\bigr)\cdot C_{\mathrm{overlap}}, \tag{59}$$

where $C_{\mathrm{tail}}$ is the PVV-weighted prior-variance sum over beyond-corridor pairs ([53](https://arxiv.org/html/2605.21458#A4.E53)), $C_{\mathrm{overlap}}$ is the (already pilot-adjusted) PVV contribution of the overlap region under $\pi^{\star}_{\mathrm{PVV}}$,

$$C_{\mathrm{overlap}}:=\sum_{(s',a')\in\mathcal{R}_{\mathrm{shared}}}\sum_{s}d_{\pi^{\mathrm{tgt}}}(s)\,\nabla_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}(s)^{\top}\bigl[\Sigma_{p}(s',a')^{-1}+n_{s',a'}(\pi^{\star}_{\mathrm{PVV}})\,\mathcal{I}^{\mathrm{int}}_{p}(s',a')\bigr]^{-1}\nabla_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}(s),$$

and $q_{\mathrm{nav}}=\mathbb{P}_{\pi^{\star}_{\mathrm{nav}}}[\text{cross }\mathcal{B}\text{ within }T_{\mathrm{explore}}]$.

On the HIV grid with $\kappa=1$ (exact path overlap), the second term vanishes and Conjecture [1](https://arxiv.org/html/2605.21458#Thmconjecture1)'s bound holds exactly.

###### Proof\.

Decompose $\mathbb{S}\times\mathbb{A}$ into three regions:

$$\mathcal{R}_{\mathrm{shared}}:=\mathcal{R}\cap\mathcal{R}_{\mathrm{nav}},\qquad\mathcal{R}_{\mathrm{only\text{-}PVV}}:=\mathcal{R}\setminus\mathcal{R}_{\mathrm{nav}},\qquad\mathcal{R}_{\mathrm{beyond}}:=(\mathbb{S}\times\mathbb{A})\setminus\mathcal{R}.$$

Each pair's contribution to $\mathrm{PVV}_{p}$ has the form $g^{\top}(A+nB)^{-1}g$ with $g=\nabla_{p(\cdot\mid s',a')}V^{\pi^{\mathrm{tgt}}}$, $A=\Sigma_{p}(s',a')^{-1}$, $B=\mathcal{I}^{\mathrm{int}}_{p}(s',a')$ (both PSD), and $n=n_{s',a'}(\pi)$; this is nonincreasing in $n$ by operator-monotonicity of the matrix inverse on the PSD cone. Hence:

- On $\mathcal{R}_{\mathrm{shared}}$: writing $n_{\star}:=n_{s',a'}(\pi^{\star}_{\mathrm{PVV}})$ and $n_{\mathrm{nav}}\geq\kappa n_{\star}$, the inequality $(A+\kappa n_{\star}B)\succeq\kappa(A+n_{\star}B)$ (using $\kappa\leq 1$ so $A\succeq\kappa A$) implies $(A+n_{\mathrm{nav}}B)^{-1}\preceq(A+\kappa n_{\star}B)^{-1}\preceq\kappa^{-1}(A+n_{\star}B)^{-1}$. Therefore

  $$g^{\top}(A+n_{\mathrm{nav}}B)^{-1}g-g^{\top}(A+n_{\star}B)^{-1}g\;\leq\;(\kappa^{-1}-1)\,g^{\top}(A+n_{\star}B)^{-1}g.$$

  Summing over shared pairs gives the $(\kappa^{-1}-1)\,C_{\mathrm{overlap}}$ term, with $C_{\mathrm{overlap}}$ the total shared-region PVV contribution evaluated under $\pi^{\star}_{\mathrm{PVV}}$ (the matrix-denominator form displayed in the theorem statement).
- On $\mathcal{R}_{\mathrm{only\text{-}PVV}}$: $n_{s',a'}(\pi^{\star}_{\mathrm{nav}})=0$ while $n_{s',a'}(\pi^{\star}_{\mathrm{PVV}})>0$. By the bottleneck condition, every such pair lies beyond the corridor and requires $\pi^{\star}_{\mathrm{nav}}$ to cross $\mathcal{B}$; with probability $1-q_{\mathrm{nav}}$, $\pi^{\star}_{\mathrm{nav}}$ fails to cross. On the non-crossing event the per-pair excess is at most the prior-variance-only contribution $g^{\top}\Sigma_{p}(s',a')g$ (the $A^{-1}$ limit when $n\to 0$); in expectation, summing over $\mathcal{R}_{\mathrm{only\text{-}PVV}}$ yields the bound $(1-q_{\mathrm{nav}})\sum_{(s',a')\in\mathcal{R}_{\mathrm{only\text{-}PVV}}}g^{\top}\Sigma_{p}(s',a')g=:(1-q_{\mathrm{nav}})C_{\mathrm{tail}}$, matching the definition of $C_{\mathrm{tail}}$ in ([53](https://arxiv.org/html/2605.21458#A4.E53)).
- On $\mathcal{R}_{\mathrm{beyond}}$: neither policy visits these pairs; their PVV contribution equals $g^{\top}\Sigma_{p}(s',a')g$ and is identical across $\pi^{\star}_{\mathrm{nav}}$ and $\pi^{\star}_{\mathrm{PVV}}$. No excess.

Summing the three regions gives ([59](https://arxiv.org/html/2605.21458#A4.E59)). On the HIV grid with $\kappa=1$, the $C_{\mathrm{overlap}}$ term drops out and the bound reduces to the statement of Conjecture [1](https://arxiv.org/html/2605.21458#Thmconjecture1). ∎
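The shared-region step of the proof rests on the matrix inequality $(A+\kappa n_{\star}B)^{-1}\preceq\kappa^{-1}(A+n_{\star}B)^{-1}$. A minimal numerical sketch of that step for a single pair is given below, using $2\times 2$ matrices in pure Python. The helper names and the particular values of $g$, $A$, $B$, $n_{\star}$ are hypothetical choices made for illustration; they are not taken from the HIV experiment.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def quad_form(g, m):
    """Quadratic form g^T m g for a 2-vector g and a 2x2 matrix m."""
    return sum(g[i] * m[i][j] * g[j] for i in range(2) for j in range(2))


def pvv_term(g, A, B, n):
    """Per-pair PVV contribution g^T (A + n B)^{-1} g."""
    M = tuple(tuple(A[i][j] + n * B[i][j] for j in range(2)) for i in range(2))
    return quad_form(g, inv2(M))


def shared_excess_and_bound(g, A, B, n_star, kappa):
    """Excess at n_nav = kappa * n_star, and the bound (1/kappa - 1) * g^T (A + n_star B)^{-1} g."""
    base = pvv_term(g, A, B, n_star)
    excess = pvv_term(g, A, B, kappa * n_star) - base
    return excess, (1.0 / kappa - 1.0) * base


if __name__ == "__main__":
    # Hypothetical positive-definite prior precision A and Fisher information B.
    A = ((2.0, 0.5), (0.5, 1.0))
    B = ((1.0, 0.2), (0.2, 0.5))
    g = (1.0, -0.5)
    for kappa in (0.25, 0.5, 0.75, 1.0):
        print(kappa, shared_excess_and_bound(g, A, B, n_star=20.0, kappa=kappa))
```

The sketch takes the worst admissible case $n_{\mathrm{nav}}=\kappa n_{\star}$; any larger $n_{\mathrm{nav}}$ can only reduce the excess, by the monotonicity noted at the start of the proof.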

##### PVV\-excess decomposition and the role of the unrestricted minimizer\.

Theorem [5](https://arxiv.org/html/2605.21458#Thmtheorem5)'s proof decomposes the PVV excess into (i) a corridor-crossing failure term $(1-q_{\mathrm{nav}})C_{\mathrm{tail}}$ and (ii) a path-overlap penalty $(\kappa^{-1}-1)C_{\mathrm{overlap}}$. The first term vanishes when $\pi^{\star}_{\mathrm{nav}}$ reliably crosses the corridor (the HIV regime); the second term vanishes when the navigation restriction does not force alternative paths through $\mathcal{R}_{\mathrm{shared}}$ (also the HIV regime). An *unrestricted Fisher-SEP-T* run on HIV would therefore yield essentially the same PVV value as the navigation-restricted variant within $T_{\mathrm{explore}}=25$ days, but only once a corridor-traversing protocol is coded into the stochastic-policy optimizer. The navigation restriction is a computational shortcut for the strong-bottleneck regime, not a theoretical compromise.

### D.14 Unnormalized gap: is the HIV reachability gap exponential or linear?

Proposition [11](https://arxiv.org/html/2605.21458#Thmproposition11) bounds the *worst-case* reachability gap as exponential in the effective horizon on chain MDPs; we claimed in §[5.2](https://arxiv.org/html/2605.21458#S5.SS2) that the HIV grid is a milder geometry in which the gap grows linearly in $T$ because finitely many explorer units suffice to cover Region B. Figure [A4](https://arxiv.org/html/2605.21458#A4.F4) tests this claim directly by reporting *unnormalized* gaps rather than percentage of oracle.

![Refer to caption](https://arxiv.org/html/2605.21458v1/x4.png)

Figure A4: Unnormalized gap over time. (a) Vending: cash gain of each policy over the SOP across the 30 trials, $\pm 2$ SE bands. A-SOP and KG-SEP gains grow with horizon as the SOP degrades; Fisher-SEP continues to gain at $T=1600$. UCRL2 and UCBVI are plotted from their own 30-trial cache. (b) HIV: cumulative cases found above the A-SOP, $\pm 2$ SE bands. Linear fits for $t\geq 50$ give slopes $3.3$ cases/day for SEP and $4.3$ cases/day for Fisher-SEP, with $R^{2}>0.999$ on both. The oracle slope is $9.0$ cases/day, capping the achievable rate in this geometry. *Conclusion:* within our $T\leq 400$ sweep, the HIV reachability gap is linear in $T$, not exponential. The exponential bound of Proposition [11](https://arxiv.org/html/2605.21458#Thmproposition11) is a worst-case combinatorial prediction for chain MDPs; the HIV grid's single-corridor geometry is milder.

The linear $R^{2}>0.999$ fit confirms the §[5.2](https://arxiv.org/html/2605.21458#S5.SS2) interpretation: once the SEP's $n_{\mathrm{explore}}=3$ teams cross the corridor (which happens reliably within the first 25 days), the per-day yield gain over the A-SOP is approximately constant, so the integrated gap scales as $T$. The Fisher-SEP slope exceeds the SEP slope by $\sim 1$ case/day, which explains the crossover at $T\geq 300$ in Table [A6](https://arxiv.org/html/2605.21458#A3.T6).
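The link between a constant per-day gain and a linear integrated gap can be illustrated with an idealized series. The sketch below builds a synthetic cumulative gap that is zero until an assumed crossing day and then grows by a fixed amount per day, and fits a least-squares line for $t\geq 50$. The series is a noise-free assumption made for illustration, not the experimental data; only the slope value $4.3$ and the 25-day window are borrowed from the text, and the function names are hypothetical.

```python
def cumulative_gap(per_day_gain, crossing_day, horizon):
    """Synthetic cumulative gap: zero gain up to crossing_day, constant gain per day afterwards."""
    total, series = 0.0, []
    for t in range(1, horizon + 1):
        if t > crossing_day:
            total += per_day_gain
        series.append(total)
    return series


def linear_fit(xs, ys):
    """Ordinary least-squares slope, intercept, and R^2 for y ~ slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


def fit_from_day(series, first_day):
    """Fit the line on days first_day..len(series); series[t - 1] holds the value for day t."""
    days = list(range(first_day, len(series) + 1))
    return linear_fit(days, series[first_day - 1:])


if __name__ == "__main__":
    series = cumulative_gap(per_day_gain=4.3, crossing_day=25, horizon=400)
    print(fit_from_day(series, first_day=50))
```

In this idealized setting the fitted slope recovers the assumed per-day gain; an exponentially growing gap would instead show a systematic upward curvature in the residuals.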

### D.15 Configuration

Table [A10](https://arxiv.org/html/2605.21458#A4.T10) provides the complete parameterization.

Table A10: HIV mobile testing: complete parameter specification. All values match `hiv_testing.py` and `exp_hiv_testing.py`.

| Category | Parameter | Value |
| --- | --- | --- |
| Grid | Rows $\times$ Columns | $5\times 8$ |
| | Total zones | 40 |
| | Wall column | 4 |
| | Corridor cell | $(2,4)$ |
| Population | Region A population per zone ($N_{A}$) | 500 |
| | Region B population per zone ($N_{B}$) | 300 |
| Prevalence | Region A true prevalence | $0.05\pm 0.005$ |
| | Region B true prevalence (non-cluster) | $0.04\pm 0.005$ |
| | Cluster epicenter prevalence | 0.30 |
| | Cluster neighbor prevalence | 0.18 |
| | Simulator Region B estimate | $0.02\pm 0.002$ |
| Disease (SIS) | Within-zone transmission $\beta_{w}$ | 0.002 |
| | Between-zone transmission $\beta_{b}$ | 0.0005 |
| | Prevalence clip range | $[0.001,0.80]$ |
| Treatment | Transmission reduction $\tau$ | 0.90 |
| Testing | Teams | 8 |
| | Tests per team per day $\mu_{\mathrm{tests}}$ | 8 (Poisson) |
| | Discount factor $\gamma$ | 0.95 |
| Warmup | Warmup days $w_{\max}$ | 3 |
| | Cold yield fraction | 0.20 |
| Belief | Prior strength $\kappa$ | 10 |
| | Update rule | Beta–Binomial conjugate |
| SEP | Explore teams $n_{\mathrm{explore}}$ | 3 |
| | Exploration duration $T_{\mathrm{explore}}$ | 25 days |
| | Explore target | Epicenter $(4,7)$ |
| | Replan interval | 10 days |
| | $\epsilon$-greedy $\epsilon$ | 0.15 |
| Fisher-SEP | Pilot duration $T_{\mathrm{pilot}}$ | 3 days |
| | Exploration duration $T_{\mathrm{explore}}$ | 25 days |
| | Fisher look-ahead | 15 days |
| | Replan interval | 5 days |
| Experiment | Horizons $T$ | $\{50,100,200,300,400\}$ |
| | Trials | 30 |
| | Cluster epicenter | $(4,7)$ |

The cluster is invisible to the simulator: the simulator estimates $\hat{p}^{\mathrm{sim}}\approx 0.02$ for all Region B zones, while the true epicenter prevalence is $0.30$ (a $15\times$ underestimate). The confounding mechanism mirrors the vending-machine experiment (Appendix [C](https://arxiv.org/html/2605.21458#A3)): the simulator was calibrated from clinic-based data that systematically underrepresents the hard-to-reach population, analogous to the historical operator's censored demand observations.
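For readers who want to carry the configuration into code, a few of the Table A10 values can be collected as follows. This is a sketch only: the dictionary keys and helper names are hypothetical and need not match the identifiers used in the authors' `hiv_testing.py`; the numeric values are transcribed from the table. The two helpers reproduce the derived quantities quoted in the text (40 zones and the $15\times$ underestimate at the epicenter).

```python
# Hypothetical key names; values transcribed from Table A10.
HIV_CONFIG = {
    "grid_rows": 5,
    "grid_cols": 8,
    "wall_column": 4,
    "corridor_cell": (2, 4),
    "region_a_true_prevalence": 0.05,
    "region_b_true_prevalence": 0.04,
    "cluster_epicenter_prevalence": 0.30,
    "cluster_neighbor_prevalence": 0.18,
    "simulator_region_b_estimate": 0.02,
    "teams": 8,
    "explore_teams": 3,
    "explore_days": 25,
    "discount_factor": 0.95,
    "horizons": (50, 100, 200, 300, 400),
    "trials": 30,
    "cluster_epicenter": (4, 7),
}


def total_zones(cfg):
    """Number of grid zones implied by the row and column counts."""
    return cfg["grid_rows"] * cfg["grid_cols"]


def epicenter_underestimate_factor(cfg):
    """Ratio of true epicenter prevalence to the simulator's Region B estimate."""
    return cfg["cluster_epicenter_prevalence"] / cfg["simulator_region_b_estimate"]


if __name__ == "__main__":
    print(total_zones(HIV_CONFIG), round(epicenter_underestimate_factor(HIV_CONFIG), 2))
```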

### D.16 Bounding $\epsilon^{\mathrm{hist}}$ under the Gonsalves SIS dynamics

Section [2](https://arxiv.org/html/2605.21458#S2) defines $\epsilon^{\mathrm{hist}}$ as the distance between the exact history-dependent POMDP marginal and the stationary observed-state Markov projection $\mathcal{M}^{\star}_{\mathrm{obs}}$, and declares this quantity out of scope. In the HIV case study the SIS dynamics explicitly evolve the latent prevalence map over time, so the planner's zone-level projection is not literally stationary within an episode. This subsection quantifies the approximation error.

##### Setup\.

The true HIV POMDP has zone-level latent state $H_{t}\in\mathbb{R}^{|\mathbb{S}|}$ (the per-zone prevalence vector) evolving under the SIS discrete-time dynamics $H_{t+1}=f(H_{t})$ with $f$ the Gonsalves update [Gonsalves et al., [2018](https://arxiv.org/html/2605.21458#bib.bib150)]. The simulator's observed-state projection uses the calibration-time prevalence snapshot $H_{\mathrm{calib}}$ as if it were stationary.

###### Proposition 10 (Bound on $\epsilon^{\mathrm{hist}}$ for the HIV DGP).

Let $L$ be the one-step Jacobian of the SIS dynamics at the simulator's nominal prevalence map, with spectral radius $\rho(L)$. Let $H_{0}$ be the deployment-time prevalence vector, and let $T$ be the planning horizon. Then

$$\epsilon^{\mathrm{hist}}\;\leq\;\|L^{T}H_{0}-H_{\infty}\|_{\infty}\;\leq\;\rho(L)^{T}\|H_{0}-H_{\infty}\|_{\infty}, \tag{60}$$

where $H_{\infty}$ is the fixed point of $f$ (equivalent to the stationary Markov projection under $\mathcal{M}^{\star}_{\mathrm{obs}}$).

###### Proof\.

The exact POMDP marginal at time $t$ has zone-level prevalence $L^{t}H_{0}$; the stationary projection uses the fixed point $H_{\infty}$. The $\ell_{\infty}$-distance between the two evolutions is bounded by the Jacobian's contraction rate, giving the geometric decay ([60](https://arxiv.org/html/2605.21458#A4.E60)). ∎

### D.17 Region-B underestimate magnitude sweep

The headline HIV DGP sets the simulator's Region-B prevalence estimate to $\hat{p}^{\mathrm{sim}}_{B}\approx 0.02$ against a true non-cluster Region-B prevalence of $0.04$ and a true cluster-zone prevalence of $0.30$, a $15\times$ underestimate at the cluster. To characterize how the Fisher-SEP $-$ A-SOP gap scales with this magnitude, we sweep the underestimate factor over $\{2\times,5\times,15\times,30\times\}$ by scaling $\hat{p}^{\mathrm{sim}}_{B}=0.30/\mathrm{factor}$; the true prevalence map is held fixed at cluster $=0.30$, cluster neighbors $=0.18$, non-cluster Region-B $=0.04$. The Region-A simulator estimate is unchanged at $0.05$, so factor $=2$ gives $\hat{p}^{\mathrm{sim}}_{B}=0.15$ (Region-B looks *more* prevalent than Region-A; A-SOP should head there) and factor $=30$ gives $\hat{p}^{\mathrm{sim}}_{B}=0.01$ (Region-B looks strictly less prevalent; A-SOP stays in Region-A). We run thirty common-seed trials per condition at $T\in\{100,200,400\}$.
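The scaling rule $\hat{p}^{\mathrm{sim}}_{B}=0.30/\mathrm{factor}$ can be tabulated to see which sweep conditions make Region B look more prevalent than Region A to the simulator. The sketch below is plain arithmetic on the numbers quoted above; the constant and function names are hypothetical.

```python
CLUSTER_PREVALENCE = 0.30      # true cluster-zone prevalence (held fixed in the sweep)
REGION_A_SIM_ESTIMATE = 0.05   # simulator's Region-A estimate (unchanged in the sweep)
FACTORS = (2, 5, 15, 30)       # underestimate factors swept


def simulator_region_b_estimate(factor):
    """Simulator's Region-B estimate under the scaling p_sim_B = 0.30 / factor."""
    return CLUSTER_PREVALENCE / factor


def sweep_table():
    """Rows of (factor, simulator Region-B estimate, Region B looks more prevalent than Region A)."""
    rows = []
    for factor in FACTORS:
        p_b = simulator_region_b_estimate(factor)
        rows.append((factor, p_b, p_b > REGION_A_SIM_ESTIMATE))
    return rows


if __name__ == "__main__":
    for factor, p_b, b_looks_higher in sweep_table():
        print(f"{factor:>2}x  p_sim_B={p_b:.3f}  B looks higher than A: {b_looks_higher}")
```

Under this rule the factors $2$ and $5$ place the simulator's Region-B estimate above the Region-A estimate of $0.05$, while $15$ and $30$ place it below.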

Table A11: HIV magnitude sweep at $T=400$: Fisher-SEP-T vs A-SOP as a function of the Region-B simulator-underestimate factor. Values in % of oracle cases found, 30 common-seed trials. Full table (including $T\in\{100,200\}$) is cached at `code/results/hiv_magnitude_sweep.{npz,csv}`.

##### Observations.

The Fisher-SEP $-$ A-SOP gap is stable at roughly $+27$ to $+30$ pp across all four magnitudes; the gap does *not* vanish at small underestimate factors within the sweep range. At $2\times$ the simulator's Region-B estimate ($0.15$) exceeds its Region-A estimate ($0.05$), so A-SOP's posterior-mean policy actually heads to Region B; yet A-SOP's yield still trails Fisher-SEP by $29.5$ pp at $T=400$. The mechanism is that A-SOP's Region-B navigation targets the *uniform* Region-B estimate, not the cluster: the cluster zones (four cells in the bottom-right $2\times 2$ block of Region B, true prevalence $0.18$–$0.30$) are indistinguishable to the simulator from the rest of Region B at factor $2\times$, while Fisher-SEP's A-optimal PVV criterion and its 15-day SIS-propagated value gradient (App. [D.11](https://arxiv.org/html/2605.21458#A4.SS11)) both direct explorers to the cluster-neighbor zones where the marginal value-gradient is largest. The underestimate-factor sweep therefore isolates the *cluster-localization* mechanism rather than the Region-level prevalence ordering: the reachability gap documented in the main text persists whenever the simulator's within-Region-B prevalence map is flat (i.e., misses the cluster), regardless of whether the Region-B average looks higher or lower than Region-A. A-SOP would only close the gap under a simulator that correctly localizes the cluster *within* Region B, which is exactly the information the Gonsalves calibration cannot supply because clinic-based surveillance does not sample the cluster population.

##### Where the gap would vanish\.

The sweep confirms that $15\times$ is in the asymptotic regime of the gap as a function of the Region-level underestimate factor: moving from $15\times$ to $30\times$ changes the gap by $<1$ pp. A smaller underestimate ($2\times$) slightly *increases* the gap because A-SOP's Region-B targets are now high-prevalence-looking but cluster-uninformed, so A-SOP arrives at Region B but tests at the wrong zones within it. We conjecture, but do not exhibit, that factor $\to 1$ (no Region-level bias, only cluster localization missing) would reach a comparable gap; and that factor $=1$ with *cluster-localizing* priors (e.g., a finer-grained simulator that correctly identifies the bottom-right $2\times 2$ block) would reduce A-SOP to within-CI of Fisher-SEP. The second of these is no longer a same-DGP comparison and is left to future work.

## Appendix E Proofs for Section 4 and Combination Lock MDP

This appendix collects the proofs of the results in Section [3](https://arxiv.org/html/2605.21458#S3) and then presents the combination-lock MDP experiment, which both illustrates Proposition [11](https://arxiv.org/html/2605.21458#Thmproposition11) and serves as the concrete witness for the strict hierarchy Theorem [7](https://arxiv.org/html/2605.21458#Thmtheorem7).

### E.1 Policy hierarchy details

The seven policy classes in the main-text hierarchy are as follows. $\Pi_{0}=\{\pi^{\star}_{\mathrm{sim}}\}$ (SOP only); $\Pi_{1}$ adds undirected stochastic perturbations (e.g., $\epsilon$-greedy); $\Pi_{1'}$ adds passive Bayesian learning (the A-SOP lives here); $\Pi_{2}$ adds targeted per-state exploration using a design criterion (e.g., knowledge-gradient SEP); $\Pi_{3}$ adds multi-step trajectory planning (e.g., EPI-directed SEP); $\Pi_{3'}$ replaces the design criterion with the Fisher information (Fisher-SEP); and $\Pi_{4}=\Pi_{\mathrm{adapt}}$ is the full class of adaptive policies, whose maximizer is the Bayes-optimal BAMDP policy. The seven levels correspond to the three uses of the simulator as follows: Levels 0 and 1 use the simulator only as a policy source (use i); Level $1'$ adds the simulator as a Bayesian prior (use ii); Levels 2, 3, and $3'$ add the simulator as a design tool (use iii); Level 4 is the Bayes-optimal limit. Table [A12](https://arxiv.org/html/2605.21458#A5.T12) summarizes the levels.

Table A12: Policy hierarchy. Simulator use: policy source (P), prior/hot start (H), design tool (D). The “Cost” column is per-decision unless labeled “/replan”; for levels with randomized inner optimizers (Level $3'$), the stated bound is per coordinate-ascent iteration and the full Fisher optimization runs $O(n_{\mathrm{iter}})$ iterations.

### E.2 Formal definitions for the gap decomposition

Proposition [1](https://arxiv.org/html/2605.21458#Thmproposition1) asserts $\mathcal{G}=\mathcal{G}_{\mathrm{local}}+\mathcal{G}_{\mathrm{reach}}$ for the design gap $\mathcal{G}=W(\pi^{e}_{\mathrm{sim}})-W(\pi^{\star}_{\mathrm{sim}})$. The two components are defined as follows.

###### Definition 9 (Support-constrained SEPs).

Let $\mathrm{supp}(d_{\pi}):=\{s\in\mathbb{S}:d_{\pi}(s)>0\}$ denote the support of a policy's discounted state-visitation distribution. Define the *support-constrained SEP class*

$$\Pi_{\mathrm{SEP},\mathrm{loc}}:=\{\pi\in\Pi_{\mathrm{SEP}}:\mathrm{supp}(d_{\pi})\subseteq\mathrm{supp}(d_{\pi^{\star}_{\mathrm{sim}}})\}. \tag{61}$$

A policy in $\Pi_{\mathrm{SEP},\mathrm{loc}}$ can choose actions differently from the SOP but can visit only states the SOP itself reaches with positive discounted probability.

###### Definition 10 (Local and reachability components).

$$\mathcal{G}_{\mathrm{local}}:=\max_{\pi\in\Pi_{\mathrm{SEP},\mathrm{loc}}}W(\pi)-W(\pi^{\star}_{\mathrm{sim}}), \tag{62}$$

$$\mathcal{G}_{\mathrm{reach}}:=\mathcal{G}-\mathcal{G}_{\mathrm{local}}=W(\pi^{e}_{\mathrm{sim}})-\max_{\pi\in\Pi_{\mathrm{SEP},\mathrm{loc}}}W(\pi). \tag{63}$$

With these definitions Proposition [1](https://arxiv.org/html/2605.21458#Thmproposition1) becomes a non-tautological claim: the local component is the best the planner can achieve by acting differently inside the SOP's footprint, and the reachability component is the additional value unlocked by leaving that footprint.
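Definitions 9 and 10 can be made concrete on a finite candidate set. In the sketch below a small dictionary of policies stands in for $\Pi_{\mathrm{SEP}}$, each with a value $W(\pi)$ and a visitation support; the local component maximizes over policies whose support stays inside the SOP's support, and the reachability component is the remainder. The policy names, values, and supports are toy assumptions chosen only to exercise the definitions.

```python
def gap_decomposition(values, supports, sop_name):
    """Return (G_local, G_reach, G) for a finite stand-in for Pi_SEP.

    values:   policy name -> W(pi)
    supports: policy name -> set of states with positive discounted visitation
    sop_name: name of the simulator-optimal policy (must be a key of both dicts)
    """
    sop_support = supports[sop_name]
    w_sop = values[sop_name]
    # Best value over the whole class plays the role of W(pi^e_sim).
    w_best = max(values.values())
    # Support-constrained class: policies that never leave the SOP's footprint.
    w_local_best = max(
        w for name, w in values.items() if supports[name] <= sop_support
    )
    g_local = w_local_best - w_sop
    g_reach = w_best - w_local_best
    return g_local, g_reach, w_best - w_sop


if __name__ == "__main__":
    # Toy numbers, not taken from either case study.
    values = {"sop": 10.0, "local_tweak": 12.0, "explorer": 19.0}
    supports = {"sop": {0, 1, 2}, "local_tweak": {0, 1}, "explorer": {0, 1, 2, 3, 4}}
    print(gap_decomposition(values, supports, "sop"))
```

Because the SOP itself always satisfies the support constraint, the inner maximum is never below $W(\pi^{\star}_{\mathrm{sim}})$, so both components come out non-negative in this construction.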

###### Proof of Proposition [1](https://arxiv.org/html/2605.21458#Thmproposition1).

Additivity $\mathcal{G}=\mathcal{G}_{\mathrm{local}}+\mathcal{G}_{\mathrm{reach}}$ is immediate from the definitions. Non-negativity of $\mathcal{G}_{\mathrm{local}}$ follows from $\pi^{\star}_{\mathrm{sim}}\in\Pi_{\mathrm{SEP},\mathrm{loc}}$ (the SOP is a degenerate SEP whose support is its own support, obtained by setting the exploration budget to zero and $\pi^{\mathrm{exploit}}=\pi^{\star}_{\mathrm{sim}}$), so $\max_{\pi\in\Pi_{\mathrm{SEP},\mathrm{loc}}}W(\pi)\geq W(\pi^{\star}_{\mathrm{sim}})$. Non-negativity of $\mathcal{G}_{\mathrm{reach}}$ follows from $\Pi_{\mathrm{SEP},\mathrm{loc}}\subseteq\Pi_{\mathrm{SEP}}$, so $\max_{\pi\in\Pi_{\mathrm{SEP},\mathrm{loc}}}W(\pi)\leq\max_{\pi\in\Pi_{\mathrm{SEP}}}W(\pi)=W(\pi^{e}_{\mathrm{sim}})$. The containment chain $\sup_{\Pi_{\mathrm{na}}}W\leq\sup_{\Pi_{\mathrm{passive}}}W\leq\sup_{\Pi_{\mathrm{adapt}}}W$ follows from the set inclusions $\Pi_{\mathrm{na}}\subseteq\Pi_{\mathrm{passive}}\subseteq\Pi_{\mathrm{adapt}}$ established in Def. [1](https://arxiv.org/html/2605.21458#Thmdefinition1). The statement is by substitution and class nesting; no further argument is needed. ∎

### E.3 Dominance chain

###### Proof of Proposition [1](https://arxiv.org/html/2605.21458#Thmproposition1) (dominance chain).

$W(\pi):=\mathbb{E}_{\mathcal{M}^{\star}\sim\mathcal{P}}[\mathbb{E}^{\pi,\mathcal{M}^{\star}}[\sum_{t}\gamma^{t}\sum_{i}R_{i,t}]]$ is the Bayes-expected total discounted reward. We want to establish the chain

$$W(\pi^{\star}_{\mathrm{sim}})\;\leq\;W(\pi^{a}_{\mathrm{sim}})\;\leq\;W(\pi^{e}_{\mathrm{sim}})\;\leq\;W(\pi^{\mathrm{BAMDP}}_{\mathrm{sim}}).$$

The four policies are the $W$-maximizers over their respective classes, so it suffices to show the class chain

$$\sup_{\Pi_{\mathrm{na}}}W\;\leq\;\sup_{\Pi_{\mathrm{passive}}}W\;\leq\;\sup_{\Pi_{\mathrm{SEP}}}W\;\leq\;\sup_{\Pi_{\mathrm{adapt}}}W.$$

*$\sup_{\Pi_{\mathrm{na}}}W\leq\sup_{\Pi_{\mathrm{passive}}}W$:* by Definition [1](https://arxiv.org/html/2605.21458#Thmdefinition1), non-adaptive policies are the constant-belief subclass of $(S_{t},b_{t})$-measurable Bayesian policies, so $\Pi_{\mathrm{na}}\subseteq\Pi_{\mathrm{passive}}$ as sets and the supremum is weakly larger on the larger class.

*$\sup_{\Pi_{\mathrm{passive}}}W\leq\sup_{\Pi_{\mathrm{SEP}}}W$:* every passive-learning policy $\pi\in\Pi_{\mathrm{passive}}$ corresponds to a $\tau=0$ SEP with $\pi^{\mathrm{exploit}}_{t}=\pi$ (zero exploration budget), which lies in $\Pi_{\mathrm{SEP}}$ by Definition [4](https://arxiv.org/html/2605.21458#Thmdefinition4). Hence $\Pi_{\mathrm{passive}}$ embeds into $\Pi_{\mathrm{SEP}}$ via the degenerate-exploration map, and the supremum is weakly larger on the target class.

*$\sup_{\Pi_{\mathrm{SEP}}}W\leq\sup_{\Pi_{\mathrm{adapt}}}W$:* every SEP is a history-dependent policy with action distribution depending on $\mathcal{F}_{t}$ (either as $\pi^{\mathrm{explore}}$ while the exploration budget has not been consumed or as $\pi^{\mathrm{exploit}}_{t}$ thereafter), so $\Pi_{\mathrm{SEP}}\subseteq\Pi_{\mathrm{adapt}}$ as sets.

The three inequalities combine to give the stated chain. Strict separations are discussed in Theorem [7](https://arxiv.org/html/2605.21458#Thmtheorem7). ∎

### E.4 Exponential reachability gap

We open with the headline statement referenced from the body, then prove a quantitative refinement\.

###### Theorem 6 (Deterministic reachability separation).

There exists a deterministic MDP family (a combination-lock chain of length $k$ with two actions per state, in which the simulator gets every transition right except at the chain's terminal state, where it underestimates the reward by a multiplicative factor $\eta<1$) on which the reachability gap satisfies $\mathcal{G}_{\mathrm{reach}}\geq(1-\eta)R_{\max}$ for any $T_{\mathrm{eff}}\geq k$. A directed explorer that allocates $\Omega(k)$ exploratory steps to the chain's terminal state attains $W(\pi^{\star}_{\mathrm{sim}})+(1-\eta-o(1))R_{\max}$.

The bound $(1-\eta)R_{\max}$ is independent of horizon. Under the simulator's beliefs, the terminal action looks suboptimal, so $\pi^{\star}_{\mathrm{sim}}$ never visits it; under the true rewards, the terminal has the highest value of any state. The combination-lock is purpose-built. The same conclusion applies whenever a deployed policy avoids a region in which the simulator is miscalibrated. Proposition [11](https://arxiv.org/html/2605.21458#Thmproposition11) below makes the construction explicit and gives the quantitative dependence on chain length and per-state simulator error rate; it implies Theorem [6](https://arxiv.org/html/2605.21458#Thmtheorem6) as a special case (set $\epsilon_{p}=1$ at the terminal state and $\epsilon_{p}=0$ elsewhere).

###### Proposition 11 (Exponential reachability gap).

Consider the combination-lock MDP with $T_{\mathrm{eff}}+1$ states, per-state simulator error probability $\epsilon_{p}=c/T_{\mathrm{eff}}$ for a constant $c>0$, $n\geq T_{\mathrm{eff}}+1$ units, and undiscounted finite-horizon value $W(\pi):=\mathbb{E}_{\mathcal{M}^{\star}\sim\mathcal{P}}\,\mathbb{E}^{\pi,\mathcal{M}^{\star}}\bigl[\sum_{t=0}^{T_{\mathrm{eff}}}\sum_{i}R_{i,t}\bigr]$ (equivalently, $\gamma=1$ restricted to the chain's length). Then

1. (i) The simulator-optimal policy satisfies

   $$W(\pi^{\star}_{\mathrm{sim}})\leq nR_{\max}\left(1-c/T_{\mathrm{eff}}\right)^{T_{\mathrm{eff}}}\;\xrightarrow{T_{\mathrm{eff}}\to\infty}\;nR_{\max}e^{-c}. \tag{64}$$
2. (ii) A simulation-aided experimental policy that allocates $n_{\mathrm{explore}}=T_{\mathrm{eff}}+1$ units to *sequential directed exploration* (described in the proof below) followed by $n-n_{\mathrm{explore}}$ units to exploitation of the learned correct action satisfies

   $$W(\pi^{e}_{\mathrm{sim}})\geq(n-T_{\mathrm{eff}}-1)R_{\max}. \tag{65}$$

Hence𝒢=W\(πsime\)−W\(πsim⋆\)≥\(n−Teff−1\)Rmax−nRmax\(1−c/Teff\)Teff\\mathcal\{G\}=W\(\\pi^\{e\}\_\{\\mathrm\{sim\}\}\)\-W\(\\pi^\{\\star\}\_\{\\mathrm\{sim\}\}\)\\geq\(n\-T\_\{\\mathrm\{eff\}\}\-1\)R\_\{\\max\}\-nR\_\{\\max\}\(1\-c/T\_\{\\mathrm\{eff\}\}\)^\{T\_\{\\mathrm\{eff\}\}\}, which approachesnRmax\(1−e−c\)nR\_\{\\max\}\(1\-e^\{\-c\}\)minus the exploration cost asTeff→∞T\_\{\\mathrm\{eff\}\}\\to\\infty\. Since the SOP is non\-adaptive and visits only states reachable via its own \(possibly wrong\) actions, this gap is a reachability gap:𝒢=𝒢reach\\mathcal\{G\}=\\mathcal\{G\}\_\{\\mathrm\{reach\}\}\.

###### Proof\.

(i) The simulator’s error at each chain state is an independent Bernoulli($\epsilon_{p}$) event. The SOP reaches the terminal state $s_{T_{\mathrm{eff}}}$ if and only if the simulator is correct at every one of the $T_{\mathrm{eff}}$ chain states, an event of probability $(1-\epsilon_{p})^{T_{\mathrm{eff}}}=(1-c/T_{\mathrm{eff}})^{T_{\mathrm{eff}}}$. If the SOP reaches the terminal, each of the $n$ units collects reward $R_{\max}$; otherwise each unit collects zero. Taking expectation over the simulator draw gives ([64](https://arxiv.org/html/2605.21458#A5.E64)). The limit $(1-c/T_{\mathrm{eff}})^{T_{\mathrm{eff}}}\to e^{-c}$ is standard.

(ii) We give an explicit *sequential directed* exploration schedule; uniform random exploration would require exponential samples (see Remark [22](https://arxiv.org/html/2605.21458#Thmremark22)). Assign one explorer per chain position $k=0,1,\ldots,T_{\mathrm{eff}}$, processed in order. Explorer $k$ starts at $s_{0}$, deterministically follows the correct actions already identified by explorers $0,\ldots,k-1$ to arrive at state $s_{k}$, and tries one of $\{a_{0},a_{1}\}$ there. Because transitions are deterministic and the two actions produce distinguishable outcomes (advance to $s_{k+1}$ vs. absorb to $s_{\perp}$), a single trial at $s_{k}$ reveals which action is correct. After all $T_{\mathrm{eff}}+1$ explorers complete in sequence, the correct action is known at every chain state. The remaining $n-n_{\mathrm{explore}}=n-T_{\mathrm{eff}}-1$ units each follow the learned policy and collect reward $R_{\max}$, giving ([65](https://arxiv.org/html/2605.21458#A5.E65)). The total action count over the explorers is $\sum_{k=0}^{T_{\mathrm{eff}}}(k+1)=(T_{\mathrm{eff}}+1)(T_{\mathrm{eff}}+2)/2=O(T_{\mathrm{eff}}^{2})$, polynomial in $T_{\mathrm{eff}}$. The identity $\mathcal{G}=\mathcal{G}_{\mathrm{reach}}$ follows because every chain state past the first error is unreachable by the SOP, so $\mathcal{G}_{\mathrm{local}}=0$. ∎
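The schedule in part (ii) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `directed_exploration`, the `FAIL` marker, and the randomly drawn `truth` vector of correct actions are all hypothetical (in the paper's experiment $a_0$ is correct at every state; here the correct action is drawn at random so that the explorers actually have something to learn).

```python
import random

FAIL = -1  # hypothetical label for the absorbing fail state s_perp


def directed_exploration(correct_actions):
    """Sequential directed schedule from the proof of part (ii).

    correct_actions[k] in {0, 1} is the action that advances from chain
    position k (unknown to the planner). Returns the learned actions and
    the total number of actions taken by all explorers.
    """
    learned = []       # correct actions identified so far
    total_actions = 0
    for k in range(len(correct_actions)):
        # Explorer k replays the k actions already known to be correct ...
        state = 0
        for a in learned:
            state = state + 1 if a == correct_actions[state] else FAIL
            total_actions += 1
        assert state == k  # deterministic transitions: it arrives at s_k
        # ... then tries a_0 once at s_k and reads off the outcome.
        trial = 0
        total_actions += 1
        advanced = trial == correct_actions[k]
        learned.append(trial if advanced else 1 - trial)
    return learned, total_actions


T_eff = 10  # illustrative chain length
rng = random.Random(0)
truth = [rng.randint(0, 1) for _ in range(T_eff + 1)]
learned, total = directed_exploration(truth)
print(learned == truth, total, (T_eff + 1) * (T_eff + 2) // 2)
```

Each explorer $k$ spends $k$ replay actions plus one trial, which is where the $(T_{\mathrm{eff}}+1)(T_{\mathrm{eff}}+2)/2$ total in the proof comes from.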

### E.5 Strict hierarchy

We state the hierarchy in two forms: a uniform class\-level ordering that follows immediately from set inclusion, and a policy\-level asymptotic ordering that holds for specific policy representatives in the large\-horizon limit\. The two statements are distinct, and conflating them leads to empirical contradictions \(see Remark[23](https://arxiv.org/html/2605.21458#Thmremark23)below\)\.

###### Theorem 7 (Weak hierarchy with existential strictness, class-level).

Let $W_{k}:=\sup_{\pi\in\Pi_{k}}W(\pi)$. Then

$$W_{0}\leq W_{1}\leq W_{1^{\prime}}\leq W_{2}\leq W_{3}\leq W_{4},\qquad W_{2}\leq W_{3^{\prime}}\leq W_{4}.$$

The inequalities are weak in general. At the level $W_{0}\leq W_{1}$ we have equality $W_{0}=W_{1}$ under the Bayes-expected-value definition of $W$ (because the simulator-optimal policy is prior-optimal within any class of non-learning policies); strictness in this case is measured differently (e.g., under the *true* MDP after data reveals the simulator’s error), and we formalize this separately below. At levels $W_{1^{\prime}}\leq W_{2}$, $W_{1^{\prime}}\leq W_{3^{\prime}}$, $W_{3}\leq W_{4}$, and the branching $W_{2}\leq W_{3^{\prime}}$, strictness is witnessed by specific (MDP, prior) instances. At the level $W_{2}\leq W_{3}$, we conjecture strictness but our deterministic exhibit does not witness it; see discussion below.

###### Proof of Theorem[7](https://arxiv.org/html/2605.21458#Thmtheorem7)\.

The chain of inequalities follows from nested set inclusion of the underlying policy classes: each level $\Pi_{k}$ is a subset of $\Pi_{k+1}$ by construction (Definition [1](https://arxiv.org/html/2605.21458#Thmdefinition1) and Appendix [E.1](https://arxiv.org/html/2605.21458#A5.SS1)), so the suprema $W_{k}=\sup_{\pi\in\Pi_{k}}W(\pi)$ are weakly ordered.

We now establish the equalities and existential\-strictness statements level\-by\-level\.

- $W_{0}=W_{1}$. Under the Bayes-expected-value definition $W(\pi)=\mathbb{E}_{\mathcal{M}^{\star}\sim\mathcal{P}}\bigl[\mathbb{E}^{\pi,\mathcal{M}^{\star}}[\sum\gamma^{t}R]\bigr]$, the optimal non-learning policy in $\Pi_{1}$ is the one that acts optimally under the prior-mean MDP $\hat{\mathcal{M}}_{\mathrm{sim}}$, which by definition is the SOP. So $\sup_{\Pi_{1}}W(\pi)=W(\mathrm{SOP})=W_{0}$, forcing $W_{0}=W_{1}$ in general. *Strictness holds in a different sense:* for a specific realized $\mathcal{M}^{\star}$ that is drawn from the prior (a “true but unknown” MDP), an $\epsilon$-greedy policy can achieve strictly higher per-unit reward than the SOP when the simulator is wrong at a high-probability decision; our empirical SOP–$\epsilon$-greedy comparisons (e.g., Table [A13](https://arxiv.org/html/2605.21458#A5.T13), Appendix [F.1](https://arxiv.org/html/2605.21458#A6.SS1)) reflect this *realized-MDP* separation rather than the prior-average $W$. The class-level hierarchy therefore has $W_{0}=W_{1}$; the *point-evaluated* hierarchy $V^{\pi}(\mathcal{M}^{\star})\gtrless V^{\mathrm{SOP}}(\mathcal{M}^{\star})$ can go either way depending on the draw.
- $W_{1}\leq W_{1^{\prime}}$, strict in general. The passive-learning class $\Pi_{1^{\prime}}$ contains posterior-mean-optimal policies that update beliefs from online observations; these dominate the Bayes-optimal non-learning policy (which is the SOP) under the prior once the horizon permits any posterior concentration. Theorem [1](https://arxiv.org/html/2605.21458#Thmtheorem1) of Appendix [A](https://arxiv.org/html/2605.21458#A1) gives an explicit closed-form separation on a one-state two-action bandit with a Gaussian conjugate prior: $W(\mathrm{A\text{-}SOP})-W(\mathrm{SOP})=\gamma n\sigma_{0}\psi(\kappa)>0$ at any $\kappa\in(0,\kappa^{\star}(\gamma))$.
- $W_{1^{\prime}}\leq W_{2}$, strict in general. The HIV mobile-testing DGP (Appendix [D](https://arxiv.org/html/2605.21458#A4)) witnesses strictness at every horizon $T\in\{50,100,200,300,400\}$: A-SOP plateaus at 43–56% of oracle while the KG-SEP and its variants exceed 62% (Table [A6](https://arxiv.org/html/2605.21458#A3.T6)). The mechanism is that Bayesian posterior updating on the A-SOP’s own trajectory never reaches Region B, whereas a KG-driven SEP with corridor-aware EPI does.
- $W_{2}\leq W_{3}$, conjectured strict. The combination-lock MDP of Appendix [F.1](https://arxiv.org/html/2605.21458#A6.SS1) exhibits this *conceptually* under stochastic transitions: targeted per-state greedy exploration (Level 2) is insufficient to navigate long chains of error-correcting actions, while multi-step trajectory planning (Level 3) solves the chain. Our deterministic chain-lock experiment (Table [A13](https://arxiv.org/html/2605.21458#A5.T13)) does not witness $W_{2}<W_{3}$ strictly because with deterministic transitions the A-SOP already observes each state’s correct action from the first exploiting unit onward, so passive learning suffices once exploration reaches each state. A stochastic-fork variant (probability $\eta$ of deflection) would separate the two classes empirically; we leave the full construction to future work and record this as a conjecture rather than a proven separation.
- $W_{3}\leq W_{4}$, strict in general. Any BAMDP with non-vanishing posterior uncertainty admits a fully adaptive policy that mixes exploration with adaptive exploitation in ways no finite-level hierarchy captures; Osband et al. [[2013](https://arxiv.org/html/2605.21458#bib.bib91)] exhibit such separations for posterior sampling.
- $W_{2}\leq W_{3^{\prime}}$, directionally but not significantly witnessed. On MDPs where the value function is sensitive to reward parameters at non-trivially correlated state–action pairs, Fisher-optimal stochastic policies strictly dominate per-state greedy criteria. The vending DGP at $T{=}1600$ shows Fisher-SEP at 75.3 ± 6.5 (Table [1](https://arxiv.org/html/2605.21458#S5.T1)) and KG-SEP at 71.8 ± 4.4 (Table [A5](https://arxiv.org/html/2605.21458#A3.T5)); the CIs overlap at 30 trials, so this witness is directional rather than significant. We conjecture replication with 100+ trials would establish strict dominance; this does not undermine the class-level statement, which depends only on set inclusion.

Each strict-level inclusion holds for the exhibited instance but not in every MDP: on a one-state bandit with uniform prior, $W_{0}=W_{1}=\cdots=W_{4}$. The theorem should be read as *existential strictness* at the class level, for the levels where it is claimed, and as $W_{0}=W_{1}$ universally. ∎

### E.6 Experimental validation: the combination lock

We now present the combination-lock experiment, which serves a dual role: it illustrates Proposition [11](https://arxiv.org/html/2605.21458#Thmproposition11) numerically, and it is the concrete construction referenced in the proof of Theorem [7](https://arxiv.org/html/2605.21458#Thmtheorem7) for the $W_{2}\leq W_{3}$ and $W_{1^{\prime}}\leq W_{2}$ comparisons on chain environments.

##### Setup\.

The combination lock is a chain of $T_{\mathrm{eff}}+1$ states plus an absorbing fail state $s_{\perp}$, with $K{=}2$ actions. In the true MDP, action $a_{0}$ advances the chain at every state; action $a_{1}$ sends the agent to $s_{\perp}$. The terminal state $s_{T_{\mathrm{eff}}}$ yields reward $R_{\max}{=}1$; all others yield zero. The simulator identifies the correct action at each state independently with probability $1-\epsilon_{p}$, where $\epsilon_{p}=c/T_{\mathrm{eff}}$. This parameterization ensures the total expected number of errors is $c$ regardless of $T_{\mathrm{eff}}$, isolating the effect of horizon length from the total error budget. We use $n{=}50$ units and $c{=}1.0$, averaging over 300 simulator draws.

##### What this tests\.

Proposition [11](https://arxiv.org/html/2605.21458#Thmproposition11) predicts that the SOP’s value scales as $(1-c/T_{\mathrm{eff}})^{T_{\mathrm{eff}}}$, converging to $e^{-c}$ from below as $T_{\mathrm{eff}}\to\infty$: the SOP reaches the terminal only if the simulator is correct at *every* state. The SEP can send exploratory units to learn the correct action, achieving polynomial sample complexity.
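The predicted scaling can be checked numerically with a short sketch. The code below is an illustration under stated assumptions rather than the paper's experiment: the function names are hypothetical, the horizons `(5, 10, 20, 30)` are chosen for illustration and are not claimed to be the grid of Table A13, and the Monte Carlo check uses 20,000 draws (instead of the paper's 300) so the estimate is tight.

```python
import math
import random


def sop_success_prob(t_eff, c=1.0):
    """Closed form (1 - c/T_eff)^T_eff from Proposition 11(i)."""
    return (1.0 - c / t_eff) ** t_eff


def sop_success_mc(t_eff, c=1.0, draws=20000, seed=0):
    """Monte Carlo estimate: fraction of simulator draws with no error."""
    rng = random.Random(seed)
    eps = c / t_eff  # per-state error probability epsilon_p
    hits = 0
    for _ in range(draws):
        # The SOP reaches the terminal only if every chain state is error-free.
        if all(rng.random() >= eps for _ in range(t_eff)):
            hits += 1
    return hits / draws


for t in (5, 10, 20, 30):  # illustrative horizons, an assumption
    print(t, round(sop_success_prob(t), 4), round(sop_success_mc(t), 4))
print("limit e^{-c}:", round(math.exp(-1.0), 4))
```

Multiplying the success probability by $nR_{\max}$ gives the right-hand side of (64); the closed-form column increases with the horizon and stays below $e^{-c}$.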

##### Results\.

Table[A13](https://arxiv.org/html/2605.21458#A5.T13)shows the results \(% of oracle\) at five effective horizons\.

Table A13: Combination lock MDP (% of oracle, $c{=}1.0$, $n{=}50$, 300 draws). Best non-oracle per column in bold.

##### Analysis\.

Five patterns stand out in Table [A13](https://arxiv.org/html/2605.21458#A5.T13). The A-SOP (Level 1′) reaches 98% of the oracle because deterministic transitions render every observation maximally informative: the first unit that executes $a_{0}$ at a given state reveals the correct action. The $\epsilon$-greedy policy decays exponentially (29% $\to$ 8%) because random errors compound multiplicatively across the chain. The KG-SEP hits 100% because its directed exploration always selects $a_{0}$ first, which happens to be the correct action at every state (an artifact of the environment’s symmetric design). The SEP attains 90% at short horizons but decays to 45% at $T_{\mathrm{eff}}=30$ with its fixed exploration budget of $n/10$ units: the probability that all states are covered by at least one explorer decreases with chain length, so longer chains see more exploration-phase failures. This finite-sample decay is governed by Proposition [11](https://arxiv.org/html/2605.21458#Thmproposition11)’s lower bound. The Fisher-SEP reaches 92–98% across horizons, substantially above the standard SEP, because the Fisher criterion directs exploration toward states whose reward parameters most affect the value function and uses the exploration budget more coverage-efficiently than uniform random exploration.

##### Takeaway\.

The combination lock isolates the compounding mechanism: transition errors at individual states multiply across the chain\. Passive learning is highly effective when transitions are deterministic, as each observation is maximally informative\. Undirected exploration is counterproductive because random errors compound multiplicatively\.

### E.7 Stochastic-fork variant (Theorem [6](https://arxiv.org/html/2605.21458#Thmtheorem6)(b))

The deterministic combination lock of the previous subsection witnesses Theorem [6](https://arxiv.org/html/2605.21458#Thmtheorem6): the SOP’s value decays exponentially while a directed explorer closes the gap. It does *not* witness Conjecture [2](https://arxiv.org/html/2605.21458#Thmconjecture2), because with deterministic transitions the A-SOP already attains 98% of oracle once any explorer visits each state (Table [A13](https://arxiv.org/html/2605.21458#A5.T13)). We now define the stochastic-fork variant on which we conjecture passive learning cannot close the gap.

##### Construction\.

The construction uses the same chain states $s_{0},\ldots,s_{k}$ and absorbing fail state $s_{\perp}$ as the deterministic combination lock of Proposition [11](https://arxiv.org/html/2605.21458#Thmproposition11). The two actions at each state $s_{i}$ for $i<k-1$ behave as before: action $a_{0}$ advances deterministically to $s_{i+1}$, and action $a_{1}$ sends the agent to $s_{\perp}$. The final transition from state $s_{k-1}$ under the correct action $a_{0}$ is stochastic: with probability $\tfrac{1}{2}$ the agent advances to the terminal reward state $s_{k}$, and with probability $\tfrac{1}{2}$ it returns to state $s_{k-2}$. The simulator is miscalibrated at $s_{k-1}$: $\hat{\wp}_{\mathrm{sim}}$ assigns probability $1$ to $s_{k-2}$ under action $a_{0}$, so the SOP evaluates action $a_{0}$ at $s_{k-1}$ as a wasted step and stops at $s_{k-2}$ (or takes an arbitrary action whose $V^{\pi^{\star}_{\mathrm{sim}}}$ value is bounded away from $R_{\max}$). The terminal state $s_{k}$ yields reward $R_{\max}$; all other states yield zero.
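To make the construction concrete, the true kernel and the simulator's belief can be written as two small transition functions. This is a sketch under assumptions: the names `true_step`, `sim_step`, and `FAIL` are hypothetical, and the text does not say what $a_1$ does at $s_{k-1}$, so the code assumes it leads to the fail state there as it does elsewhere.

```python
import random

FAIL = "fail"  # hypothetical label for the absorbing fail state s_perp


def true_step(state, action, k, rng):
    """One transition of the true stochastic-fork kernel on states 0..k.

    Assumption (not stated in the text): action 1 (a_1) leads to FAIL at
    every non-terminal state, and both FAIL and s_k are absorbing.
    """
    if state == FAIL or state == k:
        return state
    if action == 1:
        return FAIL
    if state < k - 1:
        return state + 1  # a_0 advances deterministically below the fork
    # state == k - 1: advance to s_k or fall back to s_{k-2}, each w.p. 1/2
    return k if rng.random() < 0.5 else k - 2


def sim_step(state, action, k):
    """The simulator's belief: same chain, but a_0 at s_{k-1} always falls back."""
    if state == FAIL or state == k:
        return state
    if action == 1:
        return FAIL
    return state + 1 if state < k - 1 else k - 2


k = 5  # illustrative chain length (requires k >= 2)
rng = random.Random(0)
outcomes = [true_step(k - 1, 0, k, rng) for _ in range(4000)]
print("true P(advance):", outcomes.count(k) / len(outcomes))
print("simulator outcome:", sim_step(k - 1, 0, k))
```

The two functions differ only at $(s_{k-1}, a_0)$, which is the single state-action pair whose true kernel a policy must observe to discover the reward at $s_k$.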

###### Conjecture 2 (Passive learning on the stochastic fork, high-probability form).

On the stochastic-fork variant of the combination-lock chain (the construction above), under Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4), there exists $p^{\star}<1$ (depending only on the prior hyperparameters) such that for every $\pi\in\Pi_{\mathrm{passive}}$ and every horizon $T$,

$$\mathbb{P}\bigl[\,\pi\text{ visits }(s_{k},a^{\star})\text{ within }T\text{ steps}\,\bigr]\;\leq\;C\,T\,(p^{\star})^{k-1},$$

where the probability is taken over the posterior-update randomness of $\pi$ (e.g., Thompson draws) and the draws of $\mathcal{M}\sim\mathcal{P}$ from the prior, and $C$ is an absolute constant. Consequently, for any $\delta\in(0,1)$, with probability at least $1-\delta$ over the same randomness,

$$W(\pi)-W(\pi^{\star}_{\mathrm{sim}})\;\leq\;R_{\max}\,\tfrac{CT}{\delta}\,(p^{\star})^{k-1}\;\xrightarrow{k\to\infty}\;0.$$

The remainder of this appendix proves this bound unconditionally for the regular subclass $\Pi_{\mathrm{passive}}^{\mathrm{reg}}$ (posterior-mean-optimal policies, bounded-temperature softmax, polynomial-shrinkage UCB) (Theorem [8](https://arxiv.org/html/2605.21458#Thmtheorem8)) and establishes the geometric $(p^{\star})^{k-1}$ rate for Thompson sampling (Proposition [12](https://arxiv.org/html/2605.21458#Thmproposition12)). The residual open item is that $p^{\star}$ is bounded away from $1$ uniformly over the prior family.

*Why we conjecture it.* Under the SOP, the agent’s trajectory never reaches $s_{k-1}$ (because the SOP’s prior at $s_{k-1}$ recommends action $a_{1}$ or an alternative that also avoids the fork, evaluated as dominant under $\hat{\wp}_{\mathrm{sim}}$). Any policy $\pi\in\Pi_{\mathrm{passive}}$ selects actions as a measurable function of $(S_{t},b_{t})$, where $b_{t}$ is the Bayesian posterior updated from $\mathcal{F}_{t}$. Because $s_{k-1}$ is never visited along the SOP’s trajectory, its posterior $b_{t}(\theta_{s_{k-1},a_{0}})$ stays at the prior mean inherited from $\hat{\wp}_{\mathrm{sim}}$. A posterior-mean-optimal policy therefore continues to avoid $s_{k-1}$, and the iteration closes: no posterior-mean-optimal policy in $\Pi_{\mathrm{passive}}$ ever visits $s_{k-1}$. A proof requires showing that *every* $\Pi_{\mathrm{passive}}$ policy (including non-posterior-mean-optimal members with non-degenerate belief updates, e.g., stochastic posterior-softmax) also fails to visit $s_{k-1}$ with probability at least $1-o(1)$ as $k\to\infty$, which needs an upper bound on the stationary visitation of $s_{k-1}$ across all belief-conditional stochastic policies. We do not close this step here.

*Scope.* A complete proof requires showing the posterior on $\theta_{s_{k},a^{\star}}$ concentrates slowly enough under any posterior-update rule in $\Pi_{\mathrm{passive}}$; we conjecture this and leave it to future work (Conjecture [2](https://arxiv.org/html/2605.21458#Thmconjecture2)). The first clause of Theorem [6](https://arxiv.org/html/2605.21458#Thmtheorem6) is witnessed by the deterministic chain of the previous subsection; the conjecture is on the stochastic fork.

##### Implementation note\.

The SEP in this experiment implements a *uniform-random* explore-then-exploit protocol: $n_{\mathrm{explore}}=\min(T_{\mathrm{eff}}+1,n/10)$ units follow a uniform random policy and record the correct action at each state they visit, and the remaining $n-n_{\mathrm{explore}}$ units exploit the learned policy. This is not the sequential directed schedule of Proposition [11](https://arxiv.org/html/2605.21458#Thmproposition11)(ii); as Remark [22](https://arxiv.org/html/2605.21458#Thmremark22) establishes, uniform-random exploration fails exponentially in $T_{\mathrm{eff}}$. With probability $1-(1-2^{-k})^{n_{\mathrm{explore}}}$ at each state $k$, at least one explorer observes the correct action there; failure at any state means the exploiters cannot reach the terminal from that state. The KG-SEP variant is analogous but always tries action $a_{0}$ first (directed rather than random), which in this symmetric chain happens to match the correct action at every state. We deliberately use uniform-random exploration in the empirical comparison to show the class-level separation (SOP’s exponential decay vs. SEP’s finite-sample floor) rather than to claim the sharpest polynomial bound; the sequential directed schedule would recover ([65](https://arxiv.org/html/2605.21458#A5.E65)) exactly.
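The per-state coverage probability quoted above is easy to tabulate. The sketch below only evaluates the stated formula $1-(1-2^{-k})^{n_{\mathrm{explore}}}$ with the stated budget $\min(T_{\mathrm{eff}}+1, n/10)$; the function names and the printed positions are hypothetical choices for illustration, and $n/10$ is taken as integer division.

```python
def explore_budget(t_eff, n=50):
    """Number of uniform-random explorers: min(T_eff + 1, n / 10)."""
    return min(t_eff + 1, n // 10)


def coverage_prob(k, n_explore):
    """P(at least one of n_explore explorers observes the correct action at
    state k), using the per-explorer probability 2^{-k} quoted in the text."""
    return 1.0 - (1.0 - 2.0 ** (-k)) ** n_explore


t_eff = 30  # the longest horizon mentioned in the analysis
budget = explore_budget(t_eff)
for k in (0, 1, 2, 5, 10, 20, 30):  # illustrative chain positions
    print(k, budget, coverage_prob(k, budget))
```

With a fixed budget of five explorers the coverage probability falls off roughly like $5\cdot 2^{-k}$ for large $k$, which is the exponential failure of undirected exploration that the note describes.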

### E.8 A restricted-class result toward Conjecture [2](https://arxiv.org/html/2605.21458#Thmconjecture2)

The informal argument above closes for posterior-mean-optimal policies. It does *not* close for the full class $\Pi_{\mathrm{passive}}$, which by Definition [1](https://arxiv.org/html/2605.21458#Thmdefinition1) contains every $(S_{t},b_{t})$-measurable policy, including Thompson sampling (TS), UCB-style optimism, and posterior-softmax at any temperature. This subsection establishes Conjecture [2](https://arxiv.org/html/2605.21458#Thmconjecture2) rigorously for a restricted subclass that rules out off-posterior-mean action selection driven by prior tail mass, and separately delineates where Thompson sampling sits relative to the conjecture.

##### The mechanism at the fork\.

The SOP’s exploit action at $s_{k-1}$ is $a_{1}$ (the “halt” action whose $\hat{\wp}_{\mathrm{sim}}$-evaluated return is $V^{\pi^{\star}_{\mathrm{sim}}}(s_{k-1})\geq V^{\pi^{\star}_{\mathrm{sim}}}(s_{k-2})$ under the miscalibrated prior, by construction of the fork). Call $\Delta_{\mathrm{sim}}(s):=V^{\pi^{\star}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}}}(s,a_{1})-V^{\pi^{\star}_{\mathrm{sim}},\hat{\wp}_{\mathrm{sim}}}(s,a_{0})$ the simulator’s advantage gap at $s$. On the stochastic fork, $\Delta_{\mathrm{sim}}(s_{k-1})\geq R_{\max}/2$ because the prior puts mass $1$ on “$a_{0}$ returns to $s_{k-2}$.” For any $\pi\in\Pi_{\mathrm{passive}}$ to reach the terminal state $s_{k}$ it must first reach $s_{k-1}$ (which requires choosing $a_{0}$ at every $s_{i}$, $i<k-1$), and then at $s_{k-1}$ select $a_{0}$ over $a_{1}$ despite a posterior advantage for $a_{1}$.

##### The restricted classΠpassivereg\\Pi\_\{\\mathrm\{passive\}\}^\{\\mathrm\{reg\}\}\.

We define a regularity condition on passive policies that rules out the pathological case where off\-posterior\-mean action selection is driven by unbounded prior tail mass at unvisited pairs\.

###### Definition 11 (Regular passive class).

Let $\mathcal{V}_{t}(\pi)\subseteq\mathbb{S}$ denote the set of states visited by $\pi$ up to time $t$. A policy $\pi\in\Pi_{\mathrm{passive}}$ is *regular* if there exist a constant $\beta\geq 0$ and a non-decreasing function $c:[0,1]\to[0,1]$ with $c(0)=0$, independent of $k$, such that for every state $s\in\mathbb{S}$ and every action $a\in\mathbb{A}$,

$$\pi_{t}(a\mid s,b_{t})\;\leq\;c\bigl(\mathbb{P}_{b_{t}}[\text{$a$ is posterior-mean-optimal at $s$}]\bigr)+\beta\cdot\mathbf{1}[s\notin\mathcal{V}_{t}(\pi)]\cdot e^{-k\alpha} \tag{66}$$

for some $\alpha>0$, where the probability is over the posterior $b_{t}$. Denote this subclass by $\Pi_{\mathrm{passive}}^{\mathrm{reg}}$.

The condition is mild. It requires that the rate at which a passive policy selects a given action at an unvisited state is controlled by two ingredients: (i) a function $c$ of the posterior probability that $a$ is optimal there (so policies that track the posterior are regular), and (ii) an unvisited-state prior-tail term $\beta e^{-k\alpha}$ that shrinks at least exponentially in chain length (so policies that pay off unbounded prior-tail mass at unvisited pairs are *not* regular). Posterior-mean-optimal policies are regular with $c(0)=0$, $\beta=0$. Posterior-softmax with any bounded temperature is regular with $c(p)=p^{1/\tau}$ (up to a normalizer) and $\beta=0$. UCB-style upper-confidence-bound policies with confidence radii that shrink as $O(k^{-1/2})$ at unvisited pairs are regular with $\beta>0$ and $\alpha=1/2$. Thompson sampling in its raw form is *not* regular on this construction, because at unvisited $s_{k-1}$ it draws directly from the prior with no $k$-dependent shrinkage; we treat this as a separate open case below.

###### Theorem 8 (Conjecture [2](https://arxiv.org/html/2605.21458#Thmconjecture2) on the regular passive subclass).

Under Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) on the stochastic-fork chain of length $k$,

$$\sup_{\pi\in\Pi_{\mathrm{passive}}^{\mathrm{reg}}}W(\pi)-W(\pi^{\star}_{\mathrm{sim}})\;=\;o(1)\quad\text{as }k\to\infty.$$

###### Proof\.

Fix $\pi\in\Pi_{\mathrm{passive}}^{\mathrm{reg}}$. Let $\mathcal{E}_{k}:=\{\omega:s_{k-1}\in\mathcal{V}_{T}(\pi)\}$ be the event that $\pi$ visits $s_{k-1}$ within the planning horizon $T$. On $\mathcal{E}_{k}^{c}$, the posterior $b_{t}(\theta_{s_{k-1},a_{0}})$ stays at the prior inherited from $\hat{\wp}_{\mathrm{sim}}$ throughout the trajectory, so $\pi$’s realized return is upper bounded by $W(\pi^{\star}_{\mathrm{sim}})$ by construction of the fork (the SOP acts identically on the SOP-visited states $s_{0},\ldots,s_{k-2}$ under any policy that never observes $s_{k-1}$’s true kernel). Hence

$$W(\pi)-W(\pi^{\star}_{\mathrm{sim}})\;\leq\;R_{\max}\cdot\mathbb{P}(\mathcal{E}_{k}). \tag{67}$$

It remains to bound $\mathbb{P}(\mathcal{E}_{k})$.

*Single-attempt bound.* Reaching $s_{k-1}$ from $s_{0}$ requires the policy to pick $a_{0}$ at every $s_{i}$ for $i=0,\ldots,k-2$ in sequence within one trajectory. Write $\tau$ for the stopping time at which a given trajectory has first reached $s_{k-1}$ or returned to $s_{\perp}$. Before the first visit to $s_{i}$, the posterior probability that $a_{0}$ is optimal at $s_{i}$ is zero under the miscalibrated prior (the fork’s construction), so by the regularity condition ([66](https://arxiv.org/html/2605.21458#A5.E66)),

$$\pi_{t}(a_{0}\mid s_{i},b_{t})\;\leq\;c(0)+\beta e^{-k\alpha}\;=\;\beta e^{-k\alpha}$$

whenever the trajectory is at $s_{i}$ with $s_{i}$ unvisited. Applying the tower property of conditional expectation along the trajectory,

$$\mathbb{P}\bigl[\text{single trajectory reaches }s_{k-1}\bigr]\;=\;\mathbb{E}\!\left[\prod_{i=0}^{k-2}\pi_{t_{i}}(a_{0}\mid s_{i},b_{t_{i}})\right]\;\leq\;\bigl(\beta e^{-k\alpha}\bigr)^{k-1}\;=\;\beta^{k-1}e^{-k(k-1)\alpha},$$

where the inequality is pointwise on each step of the conditional product and the expectation preserves the pointwise bound.

*Union bound over attempts.* Over $T$ planning steps, at most $T$ distinct trajectories can originate from $s_{0}$. Union-bounding the single-trajectory probability over $T$,

$$\mathbb{P}(\mathcal{E}_{k})\;\leq\;T\cdot\beta^{k-1}e^{-k(k-1)\alpha}. \tag{68}$$

Combined with ([67](https://arxiv.org/html/2605.21458#A5.E67)), this yields $W(\pi)-W(\pi^{\star}_{\mathrm{sim}})\leq R_{\max}T\beta^{k-1}e^{-k(k-1)\alpha}$. Taking the supremum over $\Pi_{\mathrm{passive}}^{\mathrm{reg}}$ preserves this bound because $\beta$ and $\alpha$ are class-uniform constants. For any $T$ polynomial in $k$, the right-hand side vanishes as $k\to\infty$, proving $\sup_{\Pi_{\mathrm{passive}}^{\mathrm{reg}}}W(\pi)-W(\pi^{\star}_{\mathrm{sim}})=o(1)$. ∎
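How fast the bound $R_{\max}T\beta^{k-1}e^{-k(k-1)\alpha}$ vanishes is easiest to see in log space. The sketch below is illustrative only: the function name is hypothetical, the values $\beta=2$, $\alpha=1/2$ and the horizon $T=k^{3}$ are assumptions chosen for the demonstration, and the code requires $\beta>0$ so that the logarithm is defined (for $\beta=0$ the bound is identically zero).

```python
import math


def log_regular_bound(k, T, beta, alpha, r_max=1.0):
    """log of R_max * T * beta^(k-1) * exp(-k (k-1) alpha), for beta > 0."""
    return (math.log(r_max) + math.log(T)
            + (k - 1) * math.log(beta) - k * (k - 1) * alpha)


beta, alpha = 2.0, 0.5  # hypothetical regularity constants
for k in (2, 5, 10, 20):
    T = k ** 3  # an illustrative polynomial horizon
    print(k, T, round(log_regular_bound(k, T, beta, alpha), 3))
```

The quadratic term $-k(k-1)\alpha$ dominates both $\log T$ (logarithmic in $k$ for polynomial $T$) and $(k-1)\log\beta$ (linear in $k$), so the log-bound tends to $-\infty$ even when $\beta>1$.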

##### Thompson sampling: the open case\.

TS is outside $\Pi_{\mathrm{passive}}^{\mathrm{reg}}$ as defined. Two empirical / theoretical observations narrow where TS sits.

###### Proposition 12 (TS visitation rate on the stochastic fork).

Let $p^{\star}:=\mathbb{P}_{\mathrm{prior}}[\theta_{s_{k-1},a_{0}}\text{ makes }a_{0}\text{ optimal at }s_{k-1}]$ under Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) with prior concentration $\kappa_{0}$. For TS on the stochastic-fork chain of length $k$, the probability that TS reaches $s_{k-1}$ within $T$ planning steps satisfies

$$\mathbb{P}_{\mathrm{TS}}[\mathcal{E}_{k}]\;\leq\;T\cdot(p^{\star})^{k-1},$$

and the expected value gap satisfies $W(\mathrm{TS})-W(\pi^{\star}_{\mathrm{sim}})\leq R_{\max}\cdot T\cdot(p^{\star})^{k-1}$, which vanishes as $k\to\infty$ for any $p^{\star}<1$.

The bound decays geometrically in $k$ with rate $\log(1/p^{\star})$, and super-polynomially whenever $T$ is polynomial in $k$. For the Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) prior with a standard Dirichlet concentration of $\kappa_{0}=1$, a computation gives $p^{\star}=1/2$, so $\mathbb{P}_{\mathrm{TS}}[\mathcal{E}_{k}]\leq T\cdot 2^{-(k-1)}$, which is super-polynomially small in $k$ for any polynomially large $T$. The conjecture therefore holds for TS in the $k\to\infty$ limit whenever $p^{\star}<1$; TS does not fall into $\Pi_{\mathrm{passive}}^{\mathrm{reg}}$ because its decay is not controlled by the $\beta e^{-k\alpha}$ ansatz in ([66](https://arxiv.org/html/2605.21458#A5.E66)) with uniform $(\beta,\alpha)$ across all prior choices.

###### Proof\.

TS samples $\tilde{\theta}\sim b_{t}$ at each decision, then acts greedily in the sampled MDP. At every $s_{i}$ the agent has not yet visited, the belief at $s_{i}$ is the prior (by construction of the stochastic fork, the SOP trajectory never reaches $s_{k-1}$, and the chain of deterministic states $s_{0},\ldots,s_{k-2}$ is only traversed in the direction $a_{0}$, so the belief at $s_{k-1}$ remains at the prior until the first attempted visit). For a single attempt to reach $s_{k-1}$, TS must sample a $\tilde{\theta}$ under which $a_{0}$ is optimal at each of $s_{0},s_{1},\ldots,s_{k-2}$. Since the prior factorizes across $(s,a)$ (Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4)), these $k-1$ events are independent, and their joint probability is at most $(p^{\star})^{k-1}$.

Over $T$ planning steps, TS has at most $T$ opportunities to initiate a traversal from $s_{0}$. Union-bounding the single-attempt probability over these $T$ opportunities gives $\mathbb{P}_{\mathrm{TS}}[\mathcal{E}_{k}]\leq T\cdot(p^{\star})^{k-1}$. For the value bound: on $\mathcal{E}_{k}^{c}$, TS's realized return is bounded above by $W(\pi^{\star}_{\mathrm{sim}})$ by the same argument as in the proof of Theorem [8](https://arxiv.org/html/2605.21458#Thmtheorem8) (no information at $s_{k-1}$ is obtained, so the posterior there remains the prior and the action there remains the SOP action); on $\mathcal{E}_{k}$ its return is bounded above by $R_{\max}$. Combining gives $W(\mathrm{TS})-W(\pi^{\star}_{\mathrm{sim}})\leq R_{\max}\cdot\mathbb{P}_{\mathrm{TS}}[\mathcal{E}_{k}]\leq R_{\max}\cdot T\cdot(p^{\star})^{k-1}$. ∎
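As a quick numerical illustration of how fast the bound $T\cdot(p^{\star})^{k-1}$ shrinks, the following minimal sketch evaluates it for a horizon that grows polynomially in $k$. The function name `ts_reach_bound` and the choice $T=k^{2}$ are assumptions made for this example only; the clipping at 1 reflects that the quantity bounds a probability.

```python
def ts_reach_bound(T: int, k: int, p_star: float) -> float:
    """Union bound T * p_star**(k-1) on P_TS[E_k], clipped at 1."""
    if not 0.0 <= p_star <= 1.0:
        raise ValueError("p_star must lie in [0, 1]")
    if k < 1 or T < 0:
        raise ValueError("need k >= 1 and T >= 0")
    return min(1.0, T * p_star ** (k - 1))


if __name__ == "__main__":
    # Hypothetical setting: T = k**2 (polynomial in k), p_star = 1/2.
    for k in (5, 10, 20, 40):
        print(k, ts_reach_bound(T=k * k, k=k, p_star=0.5))
```

For small $k$ the bound can exceed 1 before clipping and is then uninformative; it only becomes useful once $(p^{\star})^{k-1}$ outpaces $T$.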

##### What is proved and what remains\.

Theorem [8](https://arxiv.org/html/2605.21458#Thmtheorem8) closes Conjecture [2](https://arxiv.org/html/2605.21458#Thmconjecture2) on the restricted subclass $\Pi_{\mathrm{passive}}^{\mathrm{reg}}$ (posterior-mean, bounded-temperature posterior-softmax, and polynomial-shrinkage UCB). Proposition [12](https://arxiv.org/html/2605.21458#Thmproposition12) gives a geometric bound for Thompson sampling that implies the same $o(1)$ conclusion for every prior with $p^{\star}<1$. The residual open item is the uniform-over-priors statement for the raw TS case; it does not affect the reading of the conjecture, which holds on the implemented A-SOP (in $\Pi_{\mathrm{passive}}^{\mathrm{reg}}$) and on TS under generic priors.

## Appendix F Additional Experiments

This appendix presents two supplementary simulation studies that illustrate specific theoretical predictions from the main text: the exponential reachability gap (combination lock) and the corridor bottleneck (hidden treasure). The hidden treasure experiment is a simplified, static-reward version of the HIV mobile testing experiment (Section [5.2](https://arxiv.org/html/2605.21458#S5.SS2)): it shares the two-region grid with a wall and corridor but omits disease dynamics, providing a controlled comparison that isolates the spatial reachability mechanism. The stateless threshold experiment appears in Appendix [A](https://arxiv.org/html/2605.21458#A1). All code is available in the supplementary material.

### F.1 Combination Lock MDP (extended sweep)

Appendix [E](https://arxiv.org/html/2605.21458#A5) introduces the combination-lock MDP as the construction witnessing Proposition [11](https://arxiv.org/html/2605.21458#Thmproposition11) (exponential reachability gap). This subsection presents the full experimental sweep over horizon $T_{\mathrm{eff}}$ and error budget $c$ with confidence bands, which supplements the single-column summary in Table [A13](https://arxiv.org/html/2605.21458#A5.T13).

##### Setup\.

The combination lock is a chain MDP with $T_{\mathrm{eff}}+1$ chain states ($s_{0},s_{1},\ldots,s_{T_{\mathrm{eff}}}$) plus an absorbing fail state $s_{\perp}$, and $K=2$ actions. In the true MDP, action $a_{0}$ advances the chain at every state ($s_{i}\to s_{i+1}$) and action $a_{1}$ sends the agent to $s_{\perp}$. The terminal state $s_{T_{\mathrm{eff}}}$ yields reward $R_{\max}=1$; all other states yield zero. The discount factor is $\gamma=1-1/T_{\mathrm{eff}}$, so the effective horizon $1/(1-\gamma)$ matches the chain length.

The simulator identifies the correct action at each state independently with probability $1-\epsilon_{p}$, where $\epsilon_{p}=c/T_{\mathrm{eff}}$ for a constant $c>0$. With probability $\epsilon_{p}$, the simulator swaps the two actions at that state (it believes $a_{1}$ advances and $a_{0}$ fails). This parameterization ensures that the total expected number of errors is $T_{\mathrm{eff}}\cdot\epsilon_{p}=c$ regardless of $T_{\mathrm{eff}}$, isolating the effect of horizon length from the total error budget.

##### What we test\.

Proposition [11](https://arxiv.org/html/2605.21458#Thmproposition11) predicts that the SOP's value scales as $(1-\epsilon_{p})^{T_{\mathrm{eff}}}=(1-c/T_{\mathrm{eff}})^{T_{\mathrm{eff}}}$, converging to $e^{-c}$ from below as $T_{\mathrm{eff}}\to\infty$: the SOP reaches the terminal state only if the simulator is correct at *every* chain state, and each state is an independent Bernoulli trial. The SEP, by contrast, can send a small number of exploratory units to learn the correct action at each state, then deploy the learned policy for the remaining units, achieving polynomial (rather than exponential) sample complexity.
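The predicted SOP curve is easy to tabulate. The sketch below evaluates $(1-c/T_{\mathrm{eff}})^{T_{\mathrm{eff}}}$ next to its limit $e^{-c}$; the function name `sop_success_prob` and the particular horizons printed are choices made for this illustration.

```python
import math


def sop_success_prob(t_eff: int, c: float) -> float:
    """Probability that the simulator is correct at all t_eff chain states."""
    eps_p = c / t_eff  # per-state error probability
    if not 0.0 <= eps_p <= 1.0:
        raise ValueError("need 0 <= c <= t_eff")
    return (1.0 - eps_p) ** t_eff


if __name__ == "__main__":
    for c in (0.25, 0.5, 1.0, 2.0):
        curve = [round(sop_success_prob(t, c), 3) for t in (5, 10, 30)]
        print(f"c={c}: {curve} -> limit {math.exp(-c):.3f}")
```

Because the curve approaches $e^{-c}$ from below, the SOP's success probability at a fixed error budget is lowest at short horizons and levels off as $T_{\mathrm{eff}}$ grows.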

##### Protocol\.

We sweep $T_{\mathrm{eff}}\in\{5,8,10,12,15,20,25,30\}$ and $c\in\{0.25,0.5,1.0,2.0\}$. For each $(T_{\mathrm{eff}},c)$ pair, we draw 300 independent simulator realizations (each with $\epsilon_{p}=c/T_{\mathrm{eff}}$ per-state error probability) and evaluate five policies in the true MDP:

- SOP ($\pi^{\star}_{\mathrm{sim}}$): Follow the simulator's recommended policy. The SOP reaches the terminal state if and only if the simulator is correct at all $T_{\mathrm{eff}}$ chain states. With $n=50$ units, the SOP value is $n\cdot R_{\max}$ if it reaches the terminal, and $0$ otherwise.
- $\epsilon$-greedy ($\epsilon=0.1$, non-learning): At each state, follow the simulator's action with probability $0.9$ and take a uniformly random action with probability $0.1$. Each unit independently traverses the chain; a unit reaches the terminal only if it takes the correct action at every state. Does not update beliefs from observations.
- Learning $\epsilon$-greedy ($\epsilon=0.1$): Same random exploration as the non-learning $\epsilon$-greedy, but updates beliefs about the correct action at each state from observed outcomes. The exploit action (90% of the time) follows the learned correct action when known, falling back to the simulator's recommendation at unvisited states.
- KG-SEP: Directed exploration at uncertain states. At states where the correct action is unknown, the KG-SEP tries action $a_{0}$ first (directed, not random). At states where the correct action has been learned, it exploits.
- SEP ($n_{\mathrm{explore}}=5$): Dedicate 5 units to exploration with a uniform policy at each state. Since transitions are deterministic, observing both actions at each state suffices to learn the correct action. The remaining $n-n_{\mathrm{explore}}=45$ units then follow the learned policy.

All values are normalized by the oracle value $W^{\star}=n\cdot R_{\max}=50$. We report means across the 300 simulator draws; confidence bands ($\pm 2\,\mathrm{SE}$) are shown in the figure.
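The SOP arm of this protocol can be sketched as a short Monte Carlo loop. The code below is an illustration under stated assumptions, not the supplementary code: all function names are hypothetical, and the SEP is *idealized* by assuming its explorers earn nothing and always succeed in learning the correct action at every state, so its normalized value is simply $(n-n_{\mathrm{explore}})/n$.

```python
import random


def draw_simulator(t_eff, c, rng):
    """One simulator realization: True where it recommends the correct action."""
    eps_p = c / t_eff  # per-state error probability
    return [rng.random() >= eps_p for _ in range(t_eff)]


def normalized_values(t_eff, c, draws=300, n=50, n_explore=5, seed=0):
    """Mean SOP value over simulator draws and idealized SEP value, both / W*."""
    rng = random.Random(seed)
    sop_total = 0.0
    for _ in range(draws):
        sim_correct = draw_simulator(t_eff, c, rng)
        # All n units share the SOP, so they succeed or fail together.
        sop_total += n if all(sim_correct) else 0
    sop = sop_total / (draws * n)
    # Assumption: explorers earn zero and the learned policy is always correct.
    sep = (n - n_explore) / n
    return sop, sep


if __name__ == "__main__":
    for t_eff in (5, 10, 20, 30):
        print(t_eff, normalized_values(t_eff, c=1.0))
```

Since every unit follows the same simulator recommendation, the SOP outcome for one draw is all-or-nothing, which is why the mean over draws estimates $(1-c/T_{\mathrm{eff}})^{T_{\mathrm{eff}}}$.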

##### Results\.

Figure[A5](https://arxiv.org/html/2605.21458#A6.F5)presents the results in three panels\.

Panel (a) compares three of the policies at $c=1.0$. The SOP's normalized value is ${\sim}33\%$ at $T_{\mathrm{eff}}=5$ and ${\sim}36\%$ at $T_{\mathrm{eff}}=30$, closely tracking the theoretical curve $(1-c/T_{\mathrm{eff}})^{T_{\mathrm{eff}}}$, which approaches $e^{-1}\approx 0.368$ from below. The SEP maintains ${\sim}90\%$ of oracle across all horizons (the $10\%$ loss comes from the 5 exploratory units that do not follow the optimal policy). The $\epsilon$-greedy policy fares *worse* than the SOP and degrades as the horizon grows: at $T_{\mathrm{eff}}=30$, it achieves only ${\sim}5\%$ of oracle.

Panel (b) shows the SOP's value across all four $c$ values. The simulated curves (markers) closely match the theoretical predictions (dashed lines) $(1-c/T_{\mathrm{eff}})^{T_{\mathrm{eff}}}$. For $c=2.0$, the SOP's value is below $15\%$ of oracle at $T_{\mathrm{eff}}=30$, close to the asymptote $e^{-2}\approx 0.135$. The confidence bands are narrow ($\mathrm{SE}<0.02$) due to the 300 simulator draws.

Panel (c) shows the SEP–SOP gap $\mathcal{G}/W^{\star}$ across $c$ values. The gap grows with both $T_{\mathrm{eff}}$ and $c$, confirming that the reachability component dominates for long horizons and large error budgets.

##### Why $\epsilon$-greedy performs worse than the SOP.

The poor performance of the $\epsilon$-greedy policy merits explanation. At each state, the $\epsilon$-greedy agent takes a random action with probability $\epsilon=0.1$. For a unit to reach the terminal, it must take the correct action at *every* state. At states where the simulator is correct, the $\epsilon$-greedy agent takes the wrong action with probability $\epsilon/K=0.05$. At states where the simulator is wrong, it takes the correct action with probability $\epsilon/K=0.05$. The probability of reaching the terminal is therefore

$$\mathbb{P}(\text{reach})=\prod_{s=0}^{T_{\mathrm{eff}}-1}p_{s},\qquad p_{s}=\begin{cases}1-\epsilon/K&\text{if simulator correct at }s,\\ \epsilon/K&\text{if simulator wrong at }s.\end{cases}\tag{69}$$

Even at states where the simulator is correct, the random exploration introduces a $5\%$ failure probability per state, compounding multiplicatively. At states where the simulator is incorrect, the agent must select the correct action by chance ($5\%$ probability). The net effect is that the $\epsilon$-greedy agent must simultaneously succeed at every incorrect state *and* avoid errors at every correct state, so two geometric penalties multiply. The SEP avoids this compounding by dedicating explorers to learn the correct action at each state independently; the remaining units then exploit the learned policy with certainty.
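Equation (69) can be evaluated directly for any simulator realization. The following minimal sketch does so for the two-action case used in this experiment; the function name `eps_greedy_reach_prob` and the boolean-list encoding of the simulator are assumptions for illustration.

```python
def eps_greedy_reach_prob(sim_correct, eps=0.1, n_actions=2):
    """Reach probability from Eq. (69) for one simulator realization.

    sim_correct[s] is True when the simulator recommends the correct
    action at chain state s.
    """
    if n_actions != 2:
        # The per-state factor 1 - eps/K is written for the K = 2 lock.
        raise ValueError("this sketch covers the K = 2 case only")
    p_if_correct = 1.0 - eps / n_actions  # avoid a random wrong move
    p_if_wrong = eps / n_actions          # stumble onto the right move
    prob = 1.0
    for ok in sim_correct:
        prob *= p_if_correct if ok else p_if_wrong
    return prob


if __name__ == "__main__":
    perfect = [True] * 30
    one_error = [True] * 29 + [False]
    print(eps_greedy_reach_prob(perfect), eps_greedy_reach_prob(one_error))
```

Comparing the two printed values shows how a single simulator error multiplies the reach probability by $(\epsilon/K)/(1-\epsilon/K)$, on top of the per-state losses incurred even when the simulator is entirely correct.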

![Refer to caption](https://arxiv.org/html/2605.21458v1/x5.png)

Figure A5: Combination lock MDP. (a) Normalized value $W(\pi)/W^{\star}$ versus effective horizon $T_{\mathrm{eff}}$ at $c=1.0$. The SOP (blue circles) approaches $e^{-1}$; the SEP (vermillion squares) maintains ${\sim}90\%$; $\epsilon$-greedy (teal triangles) falls well below the SOP. Dashed line: theoretical SOP curve $(1-c/T_{\mathrm{eff}})^{T_{\mathrm{eff}}}$. Shaded bands: $\pm 2\,\mathrm{SE}$ over 300 simulator draws. (b) SOP value across $c\in\{0.25,0.5,1.0,2.0\}$. Markers: simulated; dashed: theoretical. (c) SEP–SOP gap $\mathcal{G}/W^{\star}$. The gap grows with both $T_{\mathrm{eff}}$ and $c$, confirming the exponential reachability separation.

### F.2 Hidden Treasure MDP (Simplified Spatial Reachability)

##### Setup\.

The Hidden Treasure environment is a simplified precursor to the HIV mobile testing experiment (Section [5.2](https://arxiv.org/html/2605.21458#S5.SS2)). It uses a $4\times 6$ grid MDP with two regions separated by a wall, sharing the corridor bottleneck structure but replacing disease dynamics with static rewards. This isolates the spatial reachability mechanism: the SOP never crosses the wall because the simulator undervalues Region B, while the SEP deliberately navigates through the corridor to discover the hidden high-reward cluster.

Region A (columns $0$–$2$) is well-modeled by the simulator, with moderate rewards $r_{A}=0.3$ at every state. Region B (columns $3$–$5$) is poorly modeled: the simulator assigns $r_{B}^{\mathrm{sim}}=0.1$ to all states, but the true MDP contains a high-reward "treasure" cluster in the bottom-right corner, with rewards up to $r_{\max}=1.0$ at the treasure cell and $0.7$ at its eight neighbors. The wall between the two regions is impassable except at a single corridor cell at row $2$, column $3$ (the midpoint of the grid). Transitions are stochastic: with probability $0.85$, the agent moves in the intended direction; with probability $0.15$, it moves uniformly at random among the four cardinal directions. If a move would cross the wall (except through the corridor) or exit the grid, the agent stays in place. The simulator's transition model is a slightly noisy copy of the true transitions (exponential noise added, then renormalized), so transition errors are small; the dominant error is in the reward model for Region B.
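The transition rule described above can be sketched as a small step function. This is an illustrative reading of the setup, not the supplementary code: the 0-indexed `(row, col)` layout, the placement of the wall between columns 2 and 3, and the rule that this boundary can only be crossed in row 2 are assumptions, as are all names.

```python
import random

N_ROWS, N_COLS = 4, 6
CORRIDOR_ROW = 2                 # assumed: the only row where the wall is open
WALL_COLS = frozenset({2, 3})    # assumed: wall sits between columns 2 and 3
SLIP_PROB = 0.15                 # probability of a uniformly random move
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def attempt(cell, direction):
    """Deterministic result of trying to move one cell in `direction`."""
    row, col = cell
    d_row, d_col = MOVES[direction]
    new_row, new_col = row + d_row, col + d_col
    if not (0 <= new_row < N_ROWS and 0 <= new_col < N_COLS):
        return cell  # leaving the grid: stay in place
    crosses_wall = frozenset({col, new_col}) == WALL_COLS
    if crosses_wall and row != CORRIDOR_ROW:
        return cell  # blocked by the wall: stay in place
    return (new_row, new_col)


def step(cell, intended, rng):
    """Stochastic transition: intended move w.p. 0.85, random move w.p. 0.15."""
    if rng.random() < SLIP_PROB:
        direction = rng.choice(list(MOVES))
    else:
        direction = intended
    return attempt(cell, direction)


if __name__ == "__main__":
    rng = random.Random(0)
    cell = (0, 0)
    for move in ["down", "down", "right", "right", "right"]:
        cell = step(cell, move, rng)
    print(cell)
```

Keeping the wall and boundary logic in the deterministic `attempt` helper makes the bottleneck easy to inspect separately from the slip noise.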

##### What we test\.

The key prediction is the *reachability gap*: the SOP, optimizing under the simulator's reward model, never enters Region B (since $r_{B}^{\mathrm{sim}}=0.1<r_{A}=0.3$) and therefore never discovers the treasure. The SEP deliberately navigates through the corridor to explore Region B, discovers the high-reward cluster, and redirects all units to exploit it after the exploration phase. Undirected exploration ($\epsilon$-greedy) occasionally stumbles into Region B but cannot reliably navigate through the narrow corridor, especially with stochastic transitions.

##### Protocol\.

We use $n=100$ units, $T=90$ time steps (extended from 60 to allow sufficient time for exploration and exploitation), $\gamma=0.95$, and 50 independent trials. All units start at cell $(0,0)$ (top-left corner of Region A). Seven policies are compared:

- Oracle: Follows the optimal policy computed from the true rewards and transitions. Navigates directly to the treasure cluster and exploits it.
- SOP ($\pi^{\star}_{\mathrm{sim}}$): Follows the optimal policy computed from the simulator's rewards and transitions. Stays in Region A because the simulator undervalues Region B.
- $\epsilon$-greedy ($\epsilon=0.1$, non-learning): Follows the SOP with probability $0.9$; takes a uniformly random action with probability $0.1$. Occasionally enters Region B through random walks but cannot reliably navigate the corridor. Does not update beliefs.
- Learning $\epsilon$-greedy ($\epsilon=0.1$): Same random exploration as the non-learning $\epsilon$-greedy, but learns the true reward at each visited state-action pair. The exploit action follows the posterior-optimal policy (recomputed every 10 steps from learned rewards).
- KG-SEP: Directed exploration at uncertain states. At each state, if any action has not been tried, the KG-SEP explores it with 30% probability; otherwise it exploits the learned policy. Recomputes the policy every 10 steps.
- SEP ($n_{\mathrm{explore}}=15$, exploration duration $=15$ steps): Dedicates 15 units to directed exploration for the first 15 time steps. Explorers follow a hand-crafted navigation policy: move toward the corridor row, pass through the corridor into Region B, then explore randomly within Region B. At each visited state, the explorer records the true reward. The other 85 units follow the SOP during the exploration phase. After 15 steps, the learned rewards are used to recompute the optimal policy via value iteration, and all 100 units switch to the updated policy for the remaining 75 steps.
- Fisher-SEP ($n_{\mathrm{explore}}=15$, exploration duration $=15$ steps): Same three-phase structure as the SEP, but the exploration phase uses the Fisher-optimal stochastic policy (Definition [5](https://arxiv.org/html/2605.21458#Thmdefinition5)) instead of the hand-crafted navigation policy. At each replan step, the Fisher-SEP solves for the stochastic policy $\pi$ that maximizes $\mathrm{tr}(\mathcal{F}(\pi))$ on the current posterior MDP, then samples actions from this policy during exploration. The Fisher criterion naturally directs explorers toward the corridor and Region B because the value function is highly sensitive to the poorly estimated reward parameters there.

We record: (i) per-step reward at each time step (averaged over 50 trials, with $\pm 2\,\mathrm{SE}$ confidence bands); (ii) cumulative reward over the full horizon; and (iii) state visitation heatmaps for the SOP and SEP (from a single representative trial).

##### Results\.

Figure [A6](https://arxiv.org/html/2605.21458#A6.F6) presents the results in four panels.

Panel (a) shows cumulative reward over time. The Oracle accumulates reward fastest, reaching ${\sim}8{,}200$ by $t=90$. The SEP initially lags (during the 15-step exploration phase, the 15 explorers earn suboptimal rewards while navigating to Region B), but after the exploration phase ends (marked by the vertical dotted line), the SEP's slope increases sharply as all 100 units exploit the discovered treasure. By $t=90$, the SEP achieves ${\sim}84\%$ of the Oracle's cumulative reward. The SOP accumulates reward at a steady but low rate (all units earn $r_{A}=0.3$), reaching only ${\sim}33\%$ of Oracle.

The $\epsilon$-greedy policy performs comparably to the SOP (${\sim}33\%$), because its random exploration rarely navigates through the corridor. The learning $\epsilon$-greedy achieves ${\sim}46\%$: it updates reward estimates at visited states, but its random exploration rarely reaches Region B, so the learned policy offers only marginal improvement over the SOP. The KG-SEP achieves ${\sim}51\%$: its directed exploration at uncertain states is more effective than random perturbations, but it struggles with the corridor bottleneck because it explores locally (one state at a time) rather than planning a multi-step trajectory through the corridor. The SEP's deliberate navigation through the corridor is the decisive advantage: it reaches Region B within the first 15 steps and discovers the treasure cluster, enabling all 100 units to exploit it for the remaining 75 steps.

The Fisher-SEP achieves ${\sim}85\%$, matching the standard SEP (${\sim}86\%$). The Fisher criterion's advantage is that it optimizes over stochastic policies that naturally explore both regions: the Fisher-optimal policy assigns positive probability to actions that move units toward the corridor, because the value function is highly sensitive to the reward parameters in Region B. Unlike the EPI, which assigns zero weight to Region B states because the SOP never visits them, the Fisher criterion identifies these states through its optimization over the full stochastic policy space.

Panel (b) shows per-step reward (smoothed with a 3-step moving average) with $\pm 2\,\mathrm{SE}$ confidence bands. The SEP exhibits a characteristic "dip and surge" pattern: during exploration ($t<15$), the per-step reward dips below the SOP (the explorers earn less than $r_{A}$ while navigating), then surges above the SOP after the policy update at $t=15$. The confidence bands are tight for the SOP (deterministic policy, low variance) and wider for the SEP during the exploration phase (stochastic navigation through the corridor).

Panels (c) and (d) show state visitation heatmaps for the SOP and SEP, respectively, from a single representative trial. The SOP's visits are concentrated entirely in Region A (columns $0$–$2$), with the highest density near the starting cell. The SOP never crosses the wall. The SEP's visits span both regions: Region A is visited during the exploitation phase (and by the 85 non-exploring units during exploration), while Region B shows a clear trail through the corridor and into the treasure cluster. The treasure cell is marked with a star ($\bigstar$); the corridor is marked with an arrow.

##### The corridor bottleneck\.

The corridor constitutes the critical bottleneck in this environment. To reach Region B, an agent must navigate to the corridor row (row 2) and then move right through the single passable cell. With stochastic transitions ($15\%$ noise), an agent attempting to move right through the corridor has only an $85\%$ chance of succeeding on each attempt; with probability $15\%$, it is deflected up, down, or left. For the $\epsilon$-greedy policy, reaching the corridor requires a sequence of random actions that happen to direct the agent toward row 2 and then rightward, an event whose probability decays geometrically in the distance from the starting cell to the corridor. Even when an $\epsilon$-greedy agent reaches Region B, it has no mechanism to communicate the discovery to other units or to update the shared policy. The SEP overcomes both obstacles: the navigation policy deterministically guides explorers to the corridor, and the reward learning mechanism propagates the discovery to all units via the policy update at $t=15$.

This experiment illustrates the reachability component of the SEP–SOP gap in a spatial setting: the treasure is reachable in the true MDP but not reachable *under the SOP*, and the corridor creates a bottleneck that undirected exploration cannot reliably penetrate.

![Refer to caption](https://arxiv.org/html/2605.21458v1/x6.png)

Figure A6: Hidden Treasure MDP. (a) Cumulative reward over $T=90$ steps. The SEP (vermillion) initially lags during exploration but surges ahead after the policy update at $t=15$ (dotted line). The SOP (blue) stays in Region A; $\epsilon$-greedy (teal) barely improves over the SOP. (b) Per-step reward (3-step moving average) with $\pm 2\,\mathrm{SE}$ confidence bands over 50 trials. The SEP's "dip and surge" pattern is clearly visible. (c) SOP state visitation heatmap: visits concentrated in Region A. (d) SEP state visitation heatmap: visits span both regions, with a clear trail through the corridor to the treasure cluster ($\bigstar$). The wall is shown as a dashed white line; region labels A and B are indicated above each heatmap.

## Appendix G SEP Experiment Log

This appendix maps the SEP's experimental actions to the formal notation $\mathcal{E}=(\pi^{e},\mathcal{N}^{e},t_{\mathrm{start}},t_{\mathrm{end}})$ from Section [2](https://arxiv.org/html/2605.21458#S2) and documents the belief convergence process.

Figure [A7](https://arxiv.org/html/2605.21458#A7.F7) presents the SEP's actions on a single representative trial spanning 90 days. Panel (a) is a Gantt-style timeline. During the first 5 days, all machines undergo a balanced pilot phase: stocked at near-capacity with randomized prices near the minimum segment WTP to collect unbiased demand observations. On day 5, a hypothesis test identifies machines where the simulator's predictions diverge significantly from the pilot observations. During days 5–19, flagged machines (typically University and New Neighborhood) are over-stocked at capacity with prices set to 90% of the minimum segment WTP, while unflagged machines follow the SOP. From day 20 onward, all machines switch to the learned exploitation policy: stocking at $2\times$ the learned demand rate and pricing at $2.5$–$2.8\times$ wholesale. Structural breaks (the festival at New Neighborhood on day 35 and the construction shock at Downtown on day 50) trigger automatic re-exploration episodes (marked by red triangles), during which the affected machine is temporarily over-stocked for one day to recalibrate the belief.

Panel \(b\) shows the total demand estimate per machine converging from the simulator’s priors \(triangles\) toward the true rates \(dotted lines\)\. Convergence is rapid during the exploration phase \(shaded region\): within 10 days, the SEP’s estimates for University and New Neighborhood lie within 15% of the true rates\. During exploitation, beliefs stabilize, with perturbations at structural breaks followed by rapid re\-convergence\.

For comparison, the KG-SEP (Level 2) does not employ a distinct exploration phase. It continuously adjusts stocking based on the augmented index $\hat{\lambda}_{i,j}+\gamma_{\mathrm{eff}}\cdot\mathrm{KG}_{i,j}$, learning more slowly at University and New Neighborhood because it does not deliberately over-stock these machines. Over 400 days, the SEP achieves 67% of the oracle's cash versus 73% for the KG-SEP; over 1600 days the SEP leads at 74% versus 72%, consistent with the SEP's front-loaded exploration compounding into larger exploitation gains over longer horizons. The Fisher-SEP follows the same three-phase protocol but uses the Fisher-optimal stochastic policy during the exploration phase, achieving 66% at $T=400$ and 72% at $T=1600$, level with the KG-SEP and just behind the SEP at the long horizon.

![Refer to caption](https://arxiv.org/html/2605.21458v1/x7.png)

Figure A7: SEP experiment log (single trial, 90 days). (a) Timeline: orange bars indicate exploration (over-stocking at low prices), green bars indicate exploitation (learned policy). Red triangles mark detected structural breaks. (b) Per-VM demand belief convergence: triangles show simulator priors, dotted lines show true rates. Shaded region marks the exploration phase (days 0–14).

Table [A14](https://arxiv.org/html/2605.21458#A7.T14) maps each experimental action to the formal notation. In the vending machine setting, a "unit" is a product slot at a machine, and the experimental policy $\pi^{e}$ specifies the stocking level and price.

The connection to the two-phase pilot-to-policy protocol (Section [4](https://arxiv.org/html/2605.21458#S4)) is direct. The 5-day pilot phase (days 0–4) constitutes *Phase 0*: a balanced diagnostic that collects uncensored demand observations at all machines, answering the "if" question (is the simulator wrong?) and the "where" question (at which machines?). Experiments $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$ (days 5–19) constitute *Phase 1* (targeted exploration): they over-stock the flagged machines to refine the demand estimates. Experiment $\mathcal{E}_{4}$ constitutes *Phase 2* (exploitation): it deploys the learned demand rates across all machines, answering the "how" question. Experiments $\mathcal{E}_{5}$ and $\mathcal{E}_{6}$ are adaptive extensions: they re-run targeted exploration at specific machines when the structural break detector flags a regime change, answering the "when to re-experiment" question.

Table A14: SEP experiments mapped to $\mathcal{E}=(\pi^{e},\mathcal{N}^{e},t_{\mathrm{start}},t_{\mathrm{end}})$. Phase column indicates the correspondence to the two-phase pilot-to-policy protocol.

The opportunity cost of $\mathcal{E}_{0}$–$\mathcal{E}_{2}$ is the revenue lost from stocking at below-market prices and diverting depot capacity to the experimental machines. The information value is the unbiased demand observations that enable $\mathcal{E}_{4}$. Experiments $\mathcal{E}_{5}$ and $\mathcal{E}_{6}$ are triggered automatically by the structural break detector (7-day rolling mean deviates by $>80\%$). $\mathcal{E}_{0}$ is the general pilot phase (applied to all adaptive policies); $\mathcal{E}_{1}$, $\mathcal{E}_{2}$, $\mathcal{E}_{5}$, $\mathcal{E}_{6}$ are Level 3 actions (deliberate trajectory planning); $\mathcal{E}_{4}$ is Level 2 (myopic exploitation). The KG-SEP also uses $\mathcal{E}_{0}$ (the pilot) but then proceeds directly to $\mathcal{E}_{4}$-type actions, without the targeted exploration of $\mathcal{E}_{1}$–$\mathcal{E}_{2}$.
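The structural break trigger mentioned above (a 7-day rolling mean deviating by more than 80%) can be sketched as follows. The text does not say what the deviation is measured against, so this example *assumes* it is the relative deviation from the machine's currently believed demand rate; the class name `BreakDetector` and its interface are likewise hypothetical.

```python
from collections import deque


class BreakDetector:
    """Flags a break when the rolling mean strays too far from a baseline."""

    def __init__(self, baseline_rate, window=7, threshold=0.80):
        if baseline_rate <= 0:
            raise ValueError("baseline_rate must be positive")
        self.baseline_rate = baseline_rate  # assumed reference: believed rate
        self.threshold = threshold          # 0.80 corresponds to ">80%"
        self.recent = deque(maxlen=window)  # last `window` daily observations

    def update(self, observed_demand):
        """Add one day's demand; return True if a break is flagged."""
        self.recent.append(observed_demand)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough history for a full window yet
        rolling_mean = sum(self.recent) / len(self.recent)
        relative_dev = abs(rolling_mean - self.baseline_rate) / self.baseline_rate
        return relative_dev > self.threshold


if __name__ == "__main__":
    detector = BreakDetector(baseline_rate=10.0)
    demand = [10.0] * 7 + [30.0] * 4  # hypothetical demand jump on day 7
    print([detector.update(d) for d in demand])
```

A rolling window delays detection by a few days after the shock, which trades responsiveness for robustness to single-day noise.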

## Appendix H Related Work

This appendix surveys the literatures that intersect with our work\. For each theme we describe the core ideas, highlight the most relevant recent work, and explain how our framework relates to and differs from the existing literature\.

##### Reinforcement learning and Bayes\-adaptive MDPs\.

The tabular MDP framework [Puterman, [1994](https://arxiv.org/html/2605.21458#bib.bib108), Sutton and Barto, [2018](https://arxiv.org/html/2605.21458#bib.bib90)] underpins a substantial body of work on exploration. Regret-minimizing algorithms such as UCRL2 [Jaksch et al., [2010](https://arxiv.org/html/2605.21458#bib.bib93)], minimax-optimal methods [Azar et al., [2017](https://arxiv.org/html/2605.21458#bib.bib95)], and optimistic Q-learning [Jin et al., [2018](https://arxiv.org/html/2605.21458#bib.bib94)] achieve near-optimal worst-case guarantees, while PAC-MDP algorithms [Strehl et al., [2009](https://arxiv.org/html/2605.21458#bib.bib143), Kearns and Singh, [2002](https://arxiv.org/html/2605.21458#bib.bib114)] guarantee near-optimal behavior after polynomially many samples. A common assumption throughout this literature is that the agent learns entirely from direct interaction, with no simulator or structured prior available.

The closest methodological antecedent of our framework is Kearns and Singh [[2002](https://arxiv.org/html/2605.21458#bib.bib114)], who prove the simulation lemma our Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1) extends and propose the Explicit-Explore-or-Exploit (E$^3$) algorithm that partitions states into "known" (visited enough times to have accurate estimates) and "unknown" (to be reached via explicit exploration trajectories). E$^3$'s known–unknown partition is a close structural cousin of our visited–unreachable partition: both identify states whose empirical estimates are trustworthy and both plan trajectories to expand the trustworthy region.

The two settings still diverge on three points. Where E$^3$'s errors are finite-sample and shrink with additional interaction, ours split into a calibration–deployment regime-shift component ($\epsilon^{h}$: unobserved confounding and drift, which a randomized $S$-measurable pilot identifies and removes) and a misspecification residual ($\epsilon^{m}$: parametric mis-specification of the simulator's training procedure, which no real-world interaction reduces). Our framework therefore distinguishes randomization (required to address $\epsilon^{h}$) from passive interaction (which addresses neither $\epsilon^{h}$ nor $\epsilon^{m}$, and only sharpens the planner's posterior under the simulator's prior). Where E$^3$ starts with no model of the environment, our planner holds a pre-calibrated simulator and the question is whether real-world interaction is worth its opportunity cost relative to the simulator's recommendation; this opportunity cost (foregone reward under the deployed policy during the experimental phase) has no analog in PAC-MDP analyses, which count samples but do not price them. Where E$^3$'s world is a fully observable MDP, ours is a POMDP whose observed-state marginalization is the simulator's target.

These distinctions explain why our policy hierarchy (Table [A12](https://arxiv.org/html/2605.21458#A5.T12)) contains strata that E$^3$ does not: the A-SOP (simulator-as-prior with no designed exploration) has no counterpart in PAC-MDP because the prior has no counterpart; the Fisher-SEP augments the simulation lemma's error-bounding role with an A-optimal design criterion over stochastic policies.

While these frequentist approaches provide worst-case guarantees, the Bayesian perspective offers a more natural framework for our setting. The Bayes-adaptive MDP (BAMDP) framework [Duff, [2002](https://arxiv.org/html/2605.21458#bib.bib116), Guez et al., [2012](https://arxiv.org/html/2605.21458#bib.bib117)] formulates the exploration–exploitation tradeoff as planning in an augmented belief-state MDP: the state is the pair (physical state, posterior over unknown parameters), and the optimal policy in this augmented MDP automatically balances information gathering with reward maximization. Posterior sampling for reinforcement learning (PSRL) [Osband et al., [2013](https://arxiv.org/html/2605.21458#bib.bib91), Russo, [2019](https://arxiv.org/html/2605.21458#bib.bib92)] provides a computationally tractable approximation: at each episode, sample an MDP from the posterior and act optimally in it. Our Level 4 policy, the Bayes-optimal adaptive policy that jointly optimizes exploration and exploitation, corresponds exactly to the BAMDP solution, and our Thompson Sampling baseline implements PSRL.

The critical departure from BAMDPs is not in the solution concept \(our Level 4 *is* the BAMDP solution\) but in the question we ask\. BAMDPs ask “how should the agent explore?”; we ask “should the agent explore at all, given that a simulator already provides a reasonable policy?” Practical approximations that follow the posterior\-optimal policy without designed exploration correspond to our A\-SOP \(Level 1\)\. The gap between these levels, the value of designed exploration over passive learning, is the central object of our analysis\.

##### Bandits and adaptive experimentation\.

The multi\-armed bandit\[Thompson,[1933](https://arxiv.org/html/2605.21458#bib.bib49), Robbins,[1952](https://arxiv.org/html/2605.21458#bib.bib33)\]provides the canonical formulation of the exploration–exploitation tradeoff, with foundational solutions—the Gittins index\[Gittins,[1979](https://arxiv.org/html/2605.21458#bib.bib50), Gittinset al\.,[2011](https://arxiv.org/html/2605.21458#bib.bib51)\], UCB\[Aueret al\.,[2002](https://arxiv.org/html/2605.21458#bib.bib53), Lai and Robbins,[1985](https://arxiv.org/html/2605.21458#bib.bib52)\], and Thompson Sampling\[Russoet al\.,[2018](https://arxiv.org/html/2605.21458#bib.bib55), Agrawal and Goyal,[2012](https://arxiv.org/html/2605.21458#bib.bib54)\]—that achieve optimal or near\-optimal regret\. These solutions, however, model each arm as a stateless reward distribution, without transition dynamics, compounding errors over a planning horizon, or a simulator that provides prior knowledge\.

Our optimal allocation extends the knowledge gradient\[Frazieret al\.,[2008](https://arxiv.org/html/2605.21458#bib.bib48),[2009](https://arxiv.org/html/2605.21458#bib.bib115)\]from a single\-decision setting to a batch MDP setting where each experimental unit simultaneously earns immediate reward and generates information that propagates through the Bellman equations\. Best\-arm identification\[Audibert and Bubeck,[2010](https://arxiv.org/html/2605.21458#bib.bib56), Kaufmannet al\.,[2016](https://arxiv.org/html/2605.21458#bib.bib57), Russo,[2020](https://arxiv.org/html/2605.21458#bib.bib32)\]studies the pure\-exploration regime where the goal is to identify the best action with minimal samples; our EPI threshold can be viewed as a meta\-decision rule that determines whether the planner should enter a pure\-exploration phase at all, or whether the simulator’s recommendation is already sufficient\.

Recent work has extended the bandit framework in two directions relevant to ours\. Adaptive platform experiments\[Kasy and Sautmann,[2021](https://arxiv.org/html/2605.21458#bib.bib112), Simchi\-Levi and Wang,[2023](https://arxiv.org/html/2605.21458#bib.bib113), Katoet al\.,[2024](https://arxiv.org/html/2605.21458#bib.bib120)\]study the tension between learning and earning with batched observations and sequential treatment assignments, providing regret bounds for adaptive experimental design\. On a different track,Zhaoet al\.\[[2024](https://arxiv.org/html/2605.21458#bib.bib125)\]study “adaptive experimentation when you can’t experiment,” developing methods for learning from confounded observational data when randomization is infeasible\. In contrast to their setting, where experimentation is infeasible, our planner retains the option to experiment but must determine whether the cost is justified\. Our framework extends both directions by incorporating MDP structure—in which actions affect future states through transition dynamics—and the simulator as a structured prior that provides a warm start for both policy optimization and experimental design\.

##### Sim\-to\-real transfer\.

The sim\-to\-real literature studies*how*to deploy policies trained in simulation to the physical world\. The principal strategies—domain randomization\[Tobinet al\.,[2017](https://arxiv.org/html/2605.21458#bib.bib13), Penget al\.,[2018](https://arxiv.org/html/2605.21458#bib.bib109), Akkayaet al\.,[2019](https://arxiv.org/html/2605.21458#bib.bib14)\], active domain randomization\[Mehtaet al\.,[2020](https://arxiv.org/html/2605.21458#bib.bib15)\], and system identification\[Chebotaret al\.,[2019](https://arxiv.org/html/2605.21458#bib.bib16), Allevatoet al\.,[2020](https://arxiv.org/html/2605.21458#bib.bib17)\]—aim to close the gap between simulator and reality by improving the simulator’s fidelity or the policy’s robustness\[Muratoreet al\.,[2022](https://arxiv.org/html/2605.21458#bib.bib18), Salvatoet al\.,[2021](https://arxiv.org/html/2605.21458#bib.bib19)\]\. Throughout this literature, real\-world interaction is*assumed*: the question is how to use it efficiently, not whether it is warranted\.

Wagenmaker and Jamieson \[[2024](https://arxiv.org/html/2605.21458#bib.bib118)\] refine this perspective by distinguishing *policy transfer* \(deploying a simulator\-trained policy directly\) from *exploration transfer* \(using the simulator to design an exploration strategy\), and show that the latter can yield exponentially better sample complexity\. Our SOP–SEP distinction parallels theirs: SOP corresponds to policy transfer, and SEP to exploration transfer\. Our Proposition [11](https://arxiv.org/html/2605.21458#Thmproposition11) is a Bayesian, population\-cost counterpart of their frequentist sample\-complexity separation: their bound is on the number of real\-world samples the transferred exploration policy needs; ours is on the reward a planner forgoes as the opportunity cost of experimenting on real units\. The two results are complementary: our contribution is the prior decision of whether the transfer is worth its opportunity cost, given the simulator’s bias structure and the planning horizon\. Fickinger \[[2025](https://arxiv.org/html/2605.21458#bib.bib129)\] provide the first provable guarantees for sim\-to\-real transfer via offline domain randomization calibrated from real\-world data\.

A closer algorithmic cousin isMemmelet al\.\[[2024](https://arxiv.org/html/2605.21458#bib.bib145)\], whose ASID system uses an initial \(possibly inaccurate\) simulator to design Fisher\-information\-maximizing exploration policies for sim\-to\-real*system identification*in robotic manipulation\. ASID and our Fisher\-SEP share the core principle that an imperfect simulator can be repurposed as a design tool for real\-world data collection via Fisher information\. Three structural differences distinguish the settings\. First, ASID identifies physical*parameters*\(mass, articulation, friction\) in a fully observed dynamical system; Fisher\-SEP targets the A\-optimal posterior variance of the*planner’s policy value*\(Definition[5](https://arxiv.org/html/2605.21458#Thmdefinition5)\) in a POMDP where the residual bias is structural confounding and drift, not parameter mis\-calibration\. Second, ASID’s state space is fully reachable \(the robot can in principle visit any configuration\); our setting has reachability bottlenecks that make the value of exploration depend on whether designed trajectories cross regions the simulator\-biased policy cannot\. Third, ASID is trained to identify parameters and then deploy a downstream controller; Fisher\-SEP interleaves design with the planner’s own value function, so the “design variable” and the “exploitation variable” share the same Bellman equations\.

However, even exploration transfer presupposes that real\-world interaction will occur\. Our framework introduces the decision\-theoretic layer of *whether* and *when* to experiment: the EPI provides a quantitative threshold below which the simulator’s recommendation should be deployed without modification\.

##### Model\-based and hybrid RL\.

Dyna\[Sutton,[1991](https://arxiv.org/html/2605.21458#bib.bib110)\]introduced the idea of interleaving model learning with planning\. Modern successors—MBPO\[Janneret al\.,[2019](https://arxiv.org/html/2605.21458#bib.bib111)\], DreamerV3\[Hafneret al\.,[2023](https://arxiv.org/html/2605.21458#bib.bib30)\], and TD\-MPC\[Hansenet al\.,[2022](https://arxiv.org/html/2605.21458#bib.bib119),[2024](https://arxiv.org/html/2605.21458#bib.bib37)\]—learn world models that generate synthetic experience for policy optimization, with DayDreamer\[Wuet al\.,[2023](https://arxiv.org/html/2605.21458#bib.bib36)\]demonstrating that world models can enable physical robot learning with minimal real\-world interaction\. The critical difference from our setting is that these methods*retrain*the model with incoming data, treating it as an evolving approximation\. Our simulator, by contrast, represents pre\-existing institutional knowledge that the planner cannot modify\. This distinction is less restrictive than it may appear: as Proposition[8](https://arxiv.org/html/2605.21458#Thmproposition8)establishes, Bayesian updating renders the fixed\-simulator assumption without loss of generality, since the posterior\-predictive model after observing data is equivalent to an adaptive simulator\.

A related but distinct line of work combines pre\-collected offline data with limited online interaction\. Hybrid offline\-and\-online RL\[Ballet al\.,[2023](https://arxiv.org/html/2605.21458#bib.bib35), Songet al\.,[2023](https://arxiv.org/html/2605.21458#bib.bib123), Niuet al\.,[2022](https://arxiv.org/html/2605.21458#bib.bib34),[2023](https://arxiv.org/html/2605.21458#bib.bib124)\]demonstrates that even modest amounts of online data can substantially improve offline RL performance\.Niuet al\.\[[2022](https://arxiv.org/html/2605.21458#bib.bib34)\]develop dynamics\-aware methods that selectively incorporate simulated data based on estimated model error\.Chenet al\.\[[2025](https://arxiv.org/html/2605.21458#bib.bib126)\]propose multi\-fidelity hybrid RL via information gain maximization, treating the simulator as a low\-fidelity source, andFuet al\.\[[2024](https://arxiv.org/html/2605.21458#bib.bib127)\]provide benchmarks for RL with biased offline data and imperfect simulators\. The key difference from all of these approaches is that our framework explicitly models the*cost*of real\-world experimentation: each real\-world sample carries an opportunity cost because the experimental unit could have been served by the simulator\-trained policy instead\. Where hybrid RL addresses how to combine heterogeneous data sources, our framework addresses the prior question of whether the expensive source is worth acquiring\.

##### Multi\-fidelity Bayesian optimization\.

Multi\-fidelity BO\[Poloczeket al\.,[2017](https://arxiv.org/html/2605.21458#bib.bib11), Kandasamyet al\.,[2017](https://arxiv.org/html/2605.21458#bib.bib9),[2019](https://arxiv.org/html/2605.21458#bib.bib10), Peherstorferet al\.,[2018](https://arxiv.org/html/2605.21458#bib.bib7)\]addresses a closely related problem: determining when a planner should query an expensive high\-fidelity oracle rather than rely on a cheap low\-fidelity surrogate\. Acquisition functions such as the multi\-fidelity knowledge gradient\[Poloczeket al\.,[2017](https://arxiv.org/html/2605.21458#bib.bib11)\]and multi\-fidelity GP\-UCB\[Kandasamyet al\.,[2016](https://arxiv.org/html/2605.21458#bib.bib8)\]formalize this as a cost\-aware information\-value calculation, and cost\-aware BO\[Leeet al\.,[2020](https://arxiv.org/html/2605.21458#bib.bib12)\]explicitly accounts for query costs—ingredients shared with our EPI\.

The correspondence is limited, however, by three structural features of our MDP setting\. Multi\-fidelity BO optimizes a static function: querying one input does not change the function’s value at other inputs\. In our MDP, actions affect future states through transition dynamics, so the information value of an experiment depends on the entire state\-transition structure\. Each BO query is also an isolated evaluation with no population\-level consequence, whereas our experimental units bear the opportunity cost of not receiving the simulator’s recommendation\. And multi\-fidelity BO assumes the low\-fidelity function is a noisy version of the high\-fidelity function, whereas our simulator may be *systematically* biased due to confounding, a qualitatively different error structure that our extended simulation lemma \(Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1)\) characterizes\. Our EPI generalizes the multi\-fidelity acquisition function to sequential decision problems with population\-level costs and structured model bias\.

##### Simulation lemma and model approximation\.

The simulation lemma \[Kearns and Singh, [2002](https://arxiv.org/html/2605.21458#bib.bib114)\] bounds the value difference between the true MDP and an approximate model as a function of reward and transition errors\. In its standard form, for a discounted MDP with discount $\gamma$: $\lvert V^{\pi}(\mathcal{M}^{\star})-V^{\pi}(\hat{\mathcal{M}})\rvert\leq\frac{\epsilon_{r}}{1-\gamma}+\frac{\gamma\epsilon_{p}V_{\max}}{(1-\gamma)^{2}}$, where $\epsilon_{r}$ and $\epsilon_{p}$ are the maximum reward and transition errors\. The central insight of this result, that transition errors compound quadratically with the effective horizon while reward errors compound only linearly, is foundational to our analysis\. This asymmetry implies that transition\-dominated regimes are inherently more difficult to address through experimentation, since even exact reward estimation cannot compensate for compounding transition errors\.
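To make the asymmetry concrete, the two terms of the standard bound can be evaluated separately. The sketch below is purely illustrative: the function name and the numerical values (discount 0.9, a value ceiling of 10, errors of 0.01) are assumptions chosen for the example, not quantities from the case studies.

```python
def simulation_lemma_bound(eps_r, eps_p, gamma, v_max):
    """Standard simulation-lemma bound, returned as (reward term, transition term).

    eps_r: maximum reward error, eps_p: maximum transition error,
    gamma: discount factor in [0, 1), v_max: upper bound on the value function.
    """
    reward_term = eps_r / (1.0 - gamma)
    transition_term = gamma * eps_p * v_max / (1.0 - gamma) ** 2
    return reward_term, transition_term

# Same error magnitude in rewards and in transitions (assumed values).
reward_term, transition_term = simulation_lemma_bound(
    eps_r=0.01, eps_p=0.01, gamma=0.9, v_max=10.0
)
total_bound = reward_term + transition_term
```

With these assumed values the transition term is roughly ninety times the reward term, which illustrates why an equal\-sized error is far more costly when it sits in the transition model.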

Kakade and Langford \[[2002](https://arxiv.org/html/2605.21458#bib.bib141)\]extend the simulation lemma to approximate policy iteration, showing that near\-optimal performance is achievable when approximation error is controlled—a result we build on in our analysis of the simulator as an approximate model\. Recent work byLobel and Parr \[[2024](https://arxiv.org/html/2605.21458#bib.bib144)\]shows that the classical bound is not tight, demonstrating that the standard simulation lemma overestimates how transition errors compound over time\. This suggests that our Lemma[1](https://arxiv.org/html/2605.21458#Thmlemma1)bounds may also admit tightening in future work\.

Our extended simulation lemma \(Lemma [1](https://arxiv.org/html/2605.21458#Thmlemma1)\) specializes these results to the setting where the approximate model is a simulator with *structured* uncertainty: the reward error $\epsilon_{r}$ and transition error $\epsilon_{p}$ are functions of the hidden\-state distribution and the confounding bias $\beta_{\mathrm{conf}}$\. This provides a direct, interpretable connection between the simulator’s calibration quality and the value of experimentation and, through the quadratic\-vs\-linear asymmetry, explains why the EPI is more sensitive to transition uncertainty than to reward uncertainty\.

##### Safe exploration and deployment\-efficient RL\.

Safe exploration\[Suiet al\.,[2015](https://arxiv.org/html/2605.21458#bib.bib97), Berkenkampet al\.,[2017](https://arxiv.org/html/2605.21458#bib.bib98)\]constrains the learning agent to avoid catastrophic states, providing high\-probability guarantees that unsafe regions are never visited\. Deployment\-efficient RL\[Matsushimaet al\.,[2021](https://arxiv.org/html/2605.21458#bib.bib121)\]addresses a complementary practical constraint: the number of distinct data\-collection policies must be small, reflecting the cost of deploying new policies in production systems\.

Our framework differs from both literatures in the nature of the constraint it imposes\. Safe exploration guards against physical catastrophe; deployment efficiency limits the number of policy switches\. Our constraint is*economic*: the cost of experimentation is the opportunity cost of not following the simulator\-trained policy for each experimental unit\. However, the deployment\-efficiency perspective is directly complementary: our SEP can be viewed as a single deployment of an experimental policy, and the EPI threshold determines whether that deployment is worthwhile\. The safe\-exploration perspective also applies when the simulator\-trained policy is known to be safe and any deviation carries risk; in such settings, the EPI threshold implicitly incorporates a safety premium\.

##### Adaptive clinical trials\.

The adaptive clinical trials literature confronts the same exploration–exploitation tradeoff that motivates our work, applied to a population of patients\. Response\-adaptive randomization\[Rosenberger and Lachin,[2012](https://arxiv.org/html/2605.21458#bib.bib63), Hu and Rosenberger,[2006](https://arxiv.org/html/2605.21458#bib.bib64), Berry,[2006](https://arxiv.org/html/2605.21458#bib.bib62),[2012](https://arxiv.org/html/2605.21458#bib.bib65)\]adjusts treatment allocation as evidence accumulates; platform trials\[Berryet al\.,[2015](https://arxiv.org/html/2605.21458#bib.bib69), Woodcock and LaVange,[2017](https://arxiv.org/html/2605.21458#bib.bib70), Adaptive Platform Trials Coalition,[2019](https://arxiv.org/html/2605.21458#bib.bib71)\]enable simultaneous evaluation of multiple treatments under a shared infrastructure; and group sequential methods\[Pocock,[1977](https://arxiv.org/html/2605.21458#bib.bib66), O’Brien and Fleming,[1979](https://arxiv.org/html/2605.21458#bib.bib67), Jennison and Turnbull,[1999](https://arxiv.org/html/2605.21458#bib.bib68)\]provide stopping rules that determine when enough evidence has accumulated to act\. The I\-SPY 2 trial\[Barkeret al\.,[2009](https://arxiv.org/html/2605.21458#bib.bib72)\]exemplifies how Bayesian adaptive design can simultaneously identify effective treatments and allocate patients to the most promising arms\.Villaret al\.\[[2015](https://arxiv.org/html/2605.21458#bib.bib122)\]survey the benefits and challenges of bandit models for clinical trial design\.

Our work departs from this literature in two respects that interact\. First, the simulator gives us structured prior knowledge about system dynamics—a richer starting point than the minimal priors typical of clinical trials—which underpins the three\-uses framework \(Section[4](https://arxiv.org/html/2605.21458#S4)\): the simulator can serve as a policy source, a Bayesian prior, or an experimental design tool\. Second, the MDP structure introduces transition dynamics where today’s action affects tomorrow’s state, creating the reachability gap that distinguishes designed exploration from passive learning\. Clinical trials typically model each patient as an independent draw, without the sequential state\-transition structure that gives our problem a different geometry from a bandit\.

##### Bayesian experimental design and information\-directed sampling\.

The value of information\[Howard,[1966](https://arxiv.org/html/2605.21458#bib.bib1), Raiffa and Schlaifer,[1961](https://arxiv.org/html/2605.21458#bib.bib2), DeGroot,[1970](https://arxiv.org/html/2605.21458#bib.bib38)\]provides the decision\-theoretic basis for our framework\. The expected value of sample information \(EVSI\)\[Adeset al\.,[2004](https://arxiv.org/html/2605.21458#bib.bib40), Heathet al\.,[2020](https://arxiv.org/html/2605.21458#bib.bib4)\]quantifies the benefit of collecting additional data before making a decision, and has been widely applied in health economics\[Claxton,[1999](https://arxiv.org/html/2605.21458#bib.bib42), Briggset al\.,[2006](https://arxiv.org/html/2605.21458#bib.bib44), Wilson,[2015](https://arxiv.org/html/2605.21458#bib.bib45)\]to determine whether clinical trials are worth conducting\. Our EPI is an EVSI adapted to the MDP setting: it measures the expected improvement in policy value from observing the outcome of a specific state\-action pair, with the key novelty that the information value propagates through the Bellman equations rather than affecting a one\-shot decision\.
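For readers unfamiliar with EVSI, a one\-shot example may help before the MDP extension. The sketch below is not the EPI: it assumes a single decision between a known\-value option (standing in for the simulator's recommendation) and an alternative whose success probability has a Beta prior, with `n` Bernoulli pilot observations. All names and numbers are hypothetical.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, a, b):
    """Prior predictive probability of k successes in n trials under Beta(a, b)."""
    return comb(n, k) * exp(log_beta(a + k, b + n - k) - log_beta(a, b))

def evsi(v_known, a, b, n):
    """Expected value of sample information for n Bernoulli observations.

    v_known: value of the option whose payoff is known.
    a, b:    Beta prior on the alternative's success probability (its value).
    """
    value_without_data = max(v_known, a / (a + b))
    value_with_data = sum(
        beta_binomial_pmf(k, n, a, b) * max(v_known, (a + k) / (a + b + n))
        for k in range(n + 1)
    )
    return value_with_data - value_without_data

# One pilot observation under a uniform prior, for two known-option values.
gain_when_decision_can_flip = evsi(v_known=0.5, a=1.0, b=1.0, n=1)
gain_when_decision_cannot_flip = evsi(v_known=0.99, a=1.0, b=1.0, n=1)
```

The second call returns zero because no single observation can move the posterior mean above 0\.99\. Information that cannot change the decision has no value, which is the intuition behind comparing an information value against a threshold.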

Modern amortized BED \[Foster et al\., [2021](https://arxiv.org/html/2605.21458#bib.bib26), Ivanova et al\., [2021](https://arxiv.org/html/2605.21458#bib.bib27), Rainforth et al\., [2024](https://arxiv.org/html/2605.21458#bib.bib28)\] provides computational tools for sequential experimental design, using deep networks to amortize the cost of computing optimal designs\. Blau et al\. \[[2022](https://arxiv.org/html/2605.21458#bib.bib6)\] connect BED to deep RL, casting sequential design as an MDP\. Classical optimal experimental design \[Pukelsheim, [2006](https://arxiv.org/html/2605.21458#bib.bib136), Chaloner and Verdinelli, [1995](https://arxiv.org/html/2605.21458#bib.bib3)\] provides the theoretical foundation for our Fisher\-SEP: the A\-optimality criterion $\mathrm{tr}(\mathcal{F}(\pi))$ that we maximize is the standard A\-optimal design criterion applied to the value function’s dependence on reward parameters, with the key novelty being that the “design variable” is a stochastic MDP policy rather than a regression design matrix\.

Information\-directed sampling \(IDS\) \[Russo and Van Roy, [2014](https://arxiv.org/html/2605.21458#bib.bib31), Russo et al\., [2018](https://arxiv.org/html/2605.21458#bib.bib55)\] is the closest algorithmic ancestor of our Fisher\-SEP\. IDS selects actions that minimize the ratio of squared expected regret to mutual information gained about the optimal action, providing a principled way to balance exploration and exploitation\. More broadly, exploration criteria in the bandit and RL literatures divide into two families: *uncertainty\-based* criteria like UCB \[Auer et al\., [2002](https://arxiv.org/html/2605.21458#bib.bib53)\], Thompson sampling \[Thompson, [1933](https://arxiv.org/html/2605.21458#bib.bib49), Russo et al\., [2018](https://arxiv.org/html/2605.21458#bib.bib55)\], and the Bayesian knowledge gradient \[Frazier et al\., [2008](https://arxiv.org/html/2605.21458#bib.bib48)\], which direct data collection toward regions of high posterior variance; and *information\-gain\-based* criteria like IDS and BED, which direct data collection toward observations that most reduce posterior entropy about the decision\-relevant quantity\. Our Fisher criterion belongs to the second family but makes a specific choice: the decision\-relevant quantity is the value function $V^{\pi}$, and the Fisher information $(\nabla_{\theta}V^{\pi})^{\top}D_{\pi}(\nabla_{\theta}V^{\pi})$ is the \(local\) sensitivity of that value to perturbations in the reward parameters, weighted by the visitation distribution the policy induces\. This choice has three practical consequences\. The Fisher trace is closed\-form in tabular MDPs via the Bellman resolvent, whereas mutual information between observations and the optimal\-action posterior requires either variational approximation or MCMC\. 
It correctly ignores uncertainty that does not translate into decision uncertainty: a reward whose posterior variance is large but whose $\partial V^{\pi}/\partial\theta$ is small will be deprioritized relative to an uncertain reward whose perturbation flips the optimal action\. And it ignores visitation that does not translate into informational value: a state that is visited often but whose reward estimates are already well determined contributes zero to the trace\. Our Fisher\-SEP also differs from IDS in three respects not related to the information measure\. It separates the exploration and exploitation phases, reflecting the practical reality that experiments are planned in advance and run as a batch rather than interleaved with exploitation at each time step; it accounts for the opportunity cost of experimentation through the EPI threshold, which determines whether to explore at all; and it operates in sequential MDP settings where IDS is typically applied to bandits\.
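The closed\-form claim for tabular MDPs can be illustrated with a short sketch. It assumes tabular reward parameters $\theta_{s,a}=r(s,a)$, in which case $\partial V^{\pi}/\partial\theta_{s,a}$ equals the discounted visitation of $(s,a)$ obtained from the resolvent $(I-\gamma P^{\pi})^{-1}$. The `fisher_score` function is one hypothetical reading of the quadratic form above, with no prior precision term, and should not be taken as the exact Fisher\-SEP objective. The three\-state chain is a made\-up example.

```python
import numpy as np

def discounted_visitation(P, policy, mu0, gamma):
    """Unnormalized discounted visitation d(s, a) of a stochastic policy.

    P: (S, A, S) transitions, policy: (S, A), mu0: (S,) initial distribution.
    """
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", policy, P)        # state kernel under the policy
    resolvent = np.linalg.inv(np.eye(n_states) - gamma * P_pi)
    return (mu0 @ resolvent)[:, None] * policy        # shape (S, A)

def value_gradient(P, target_policy, mu0, gamma):
    """dV/dr(s, a) for tabular rewards, which equals the visitation of (s, a)."""
    return discounted_visitation(P, target_policy, mu0, gamma)

def fisher_score(P, explore_policy, target_policy, mu0, gamma):
    """Hypothetical score: sum over (s, a) of d_explore(s, a) * (dV_target/dr(s, a))^2."""
    grad = value_gradient(P, target_policy, mu0, gamma)
    visit = (1.0 - gamma) * discounted_visitation(P, explore_policy, mu0, gamma)
    return float(np.sum(visit * grad ** 2))

# Three-state chain: action 0 stays put, action 1 moves right, state 2 absorbs.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 2] = 1.0
P[2, :, 2] = 1.0
mu0 = np.array([1.0, 0.0, 0.0])
stay = np.tile([1.0, 0.0], (3, 1))
move = np.tile([0.0, 1.0], (3, 1))

grad_stay = value_gradient(P, stay, mu0, gamma=0.9)        # nonzero only at (0, 0)
score_passive = fisher_score(P, stay, move, mu0, gamma=0.9)
score_designed = fisher_score(P, move, move, mu0, gamma=0.9)
```

In this toy chain a pilot that never leaves state 0 scores zero against a target policy that moves right, while a pilot that follows the corridor scores strictly higher. This is a small\-scale analogue of the reachability argument.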

##### Novelty concentration: what Fisher\-SEP adds to the BED / IDS / KG / ASID family\.

The preceding paragraphs situate Fisher\-SEP inside an existing lineage of design\-theoretic exploration\. We close the related\-work discussion by stating what Fisher\-SEP contributes beyond that lineage, and what it does not\.

Fisher\-SEP does not introduce a new A\-optimal criterion in the sense of Chaloner\-Verdinelli\[Chaloner and Verdinelli,[1995](https://arxiv.org/html/2605.21458#bib.bib3)\]or Pukelsheim\[Pukelsheim,[2006](https://arxiv.org/html/2605.21458#bib.bib136)\]: the data\-rich limit reduces to classical A\-optimal design for reward parameters \(App\.[B](https://arxiv.org/html/2605.21458#A2), Cor\.[4](https://arxiv.org/html/2605.21458#Thmcorollary4)\)\. It does not propose a new information measure as an alternative to IDS; the Fisher trace is an established sensitivity statistic\. And it does not propose a new RL algorithm in the PSRL or UCRL2 lineage; the two\-phase explore\-then\-commit structure is deliberately non\-adaptive\.

The contribution is to use the target\-policy\-value\-weighted Bellman\-resolvent gradient as a design criterion over stochastic exploration policies, with the simulator supplying the resolvent\. Three specific contributions follow\.

*Target\-policy\-value weighting\.* Classical A\-optimal design and IDS both operate with information measures over unknown parameters or actions\. Fisher\-SEP weights the per\-$(s^{\prime},a^{\prime})$ contribution by $d_{\pi^{\mathrm{tgt}}}(s)(\partial V^{\pi^{\mathrm{tgt}}}/\partial\theta_{s^{\prime},a^{\prime}})^{2}$, which isolates the uncertainty that affects the target policy’s value\. Parameter uncertainty that does not flow through the Bellman resolvent to the target’s visitation receives zero weight\. This is the mechanism behind Proposition [4](https://arxiv.org/html/2605.21458#Thmremark4): Fisher\-SEP prioritizes pairs that are target\-sensitive \(nonzero gradient\) but unvisited \(no pilot data\), which ordinary A\-optimal design does not distinguish\.

*Bellman\-resolvent gradient, not parameter\-identification gradient\.* ASID \[Memmel et al\., [2024](https://arxiv.org/html/2605.21458#bib.bib145)\] uses the Fisher information of physical parameters $\theta_{\mathrm{phys}}$ for sim\-to\-real in robotics, aiming to identify $\theta_{\mathrm{phys}}$\. Fisher\-SEP uses the Fisher information of the value functional $V^{\pi^{\mathrm{tgt}}}(\theta)$, which is the Bellman\-resolvent\-propagated version of the parameter Fisher information\. The distinction matters whenever parameter uncertainty propagates non\-uniformly to value: uncertain parameters that leave $V^{\pi^{\mathrm{tgt}}}$ nearly invariant are not prioritized\. ASID’s objective and Fisher\-SEP’s objective coincide only in the fully reachable regime with uniform value sensitivity; on a bottleneck geometry like HIV, they differ\.

*Simulator as connectivity prior, not as accurate forward model\.* In sim\-to\-real \(domain randomization, ASID\), the simulator is treated as an approximate forward model whose accuracy one tries to improve\. Fisher\-SEP uses the simulator as a connectivity prior: the Bellman resolvent $(I-\gamma P^{\pi^{\mathrm{tgt}}})^{-1}$ that determines which $(s^{\prime},a^{\prime})$ pairs’ uncertainty propagates to target\-value uncertainty\. The simulator need not be accurate \(the HIV simulator is $15\times$ off on Region B prevalence\); its connectivity pattern on the target’s visitation must be approximately correct for the resolvent to be informative\. This is the requirement behind our positive result on HIV\.

In summary, Fisher\-SEP applies A\-optimal Bayesian experimental design to a specific target functional, the planner’s policy value, via the simulator\-supplied Bellman resolvent\. It separates the property the simulator must capture accurately \(connectivity\) from the property it is typically assumed to capture accurately \(parameter values\)\.

##### Domain randomization\.

Domain randomization \(DR\) \[Tobin et al\., [2017](https://arxiv.org/html/2605.21458#bib.bib13), Peng et al\., [2018](https://arxiv.org/html/2605.21458#bib.bib109), Akkaya et al\., [2019](https://arxiv.org/html/2605.21458#bib.bib14), Mehta et al\., [2020](https://arxiv.org/html/2605.21458#bib.bib15)\] is frequently described as a sim\-to\-real technique for generating robust policies by training across a distribution of simulator parameters\. In the framework of this paper, DR corresponds to a *simulator\-time* version of experimental design: instead of randomizing in the real world under an $S$\-measurable pilot policy \(our Lemma [2](https://arxiv.org/html/2605.21458#Thmlemma2)\), DR randomizes the simulator’s parameters at training time\. Active DR \[Mehta et al\., [2020](https://arxiv.org/html/2605.21458#bib.bib15)\] specifically chooses the parameter distribution to maximize Fisher information about policy robustness, which is structurally analogous to Fisher\-SEP but applied at simulator\-training time rather than deployment time\.

The two approaches address different sources of error\. DR handles *parameter uncertainty under a fixed structural model*: it prepares a policy to work across a range of simulator parameter values\. Fisher\-SEP addresses *structural model uncertainty*: it uses real\-world experimentation to correct biases \(confounding, drift\) that domain randomization cannot address because the simulator’s own parameter distribution is contaminated by those biases\. In the HIV case, no amount of domain randomization over Region\-B prevalence values in the simulator fixes the simulator’s systematic under\-sampling of Region B, because the DR distribution is itself calibrated from clinic data that never reached Region B\. Fisher\-SEP’s real\-world pilot is the only intervention that can identify $\mathbb{P}_{t}(H\mid s)$ at Region\-B states\.

DR and Fisher\-SEP are complementary rather than competing\. A deployment framework can use DR for robustness to known parameter uncertainty and Fisher\-SEP for real\-world identification of structural confounding\. We do not run DR as a numerical baseline because the case studies are constructed around structural confounding \(the regime DR does not address\), not parameter sensitivity \(the regime DR targets\)\.Xu and Zeevi \[[2023](https://arxiv.org/html/2605.21458#bib.bib128)\]develop Bayesian design principles for frequentist sequential learning, bridging the two traditions—a connection our framework also exploits, since the EPI is a Bayesian quantity employed to make a frequentist\-style decision about whether to experiment\.

##### Reward\-free and task\-agnostic exploration\.

Reward\-free RL\[Jinet al\.,[2020](https://arxiv.org/html/2605.21458#bib.bib156), Kaufmannet al\.,[2021](https://arxiv.org/html/2605.21458#bib.bib157)\]separates exploration \(phase 1, reward\-agnostic\) from exploitation \(phase 2, after a reward is specified\), proving that a single reward\-agnostic exploration phase can yield a policy near\-optimal for any reward function revealed later\. Maximum\-entropy exploration\[Hazanet al\.,[2019](https://arxiv.org/html/2605.21458#bib.bib158)\]is a closely related approach that maximizes the entropy of the induced state\-visitation distribution, providing broad coverage without a specific task objective\. Our Fisher\-SEP differs in its choice of objective—sensitivity to reward parameters rather than coverage or entropy—and in its phase structure \(pilot, explore, exploit\) being tied to an explicit decision rule \(the EPI threshold\)\. When the reward parameters are entirely unknown, the Fisher\-SEP’s objective tends toward uniform coverage and resembles reward\-free exploration; when the simulator supplies informative reward priors, the Fisher\-SEP directs exploration toward the state\-action pairs whose reward estimates most affect the optimal policy, a refinement that reward\-free exploration does not make\.

##### Optimism\-based exploration baselines\.

UCRL2 \[Jaksch et al\., [2010](https://arxiv.org/html/2605.21458#bib.bib93)\] and R\-max \[Brafman and Tennenholtz, [2002](https://arxiv.org/html/2605.21458#bib.bib154)\] are the canonical non\-Bayesian baselines for regret\-minimizing exploration, using optimism under uncertainty to implicitly prioritize under\-visited $(s,a)$ pairs\. BOSS \[Asmuth et al\., [2009](https://arxiv.org/html/2605.21458#bib.bib155)\] combines Bayesian exploration with optimism in the face of model uncertainty\. These methods achieve something structurally similar to the reachability component of our gap decomposition \(they navigate toward under\-visited states\), but via frequentist worst\-case guarantees rather than Bayesian posterior decomposition, and without an opportunity\-cost accounting\. Our empirical comparisons \(Section [5](https://arxiv.org/html/2605.21458#S5)\) benchmark against these methods where applicable\.

##### Transfer RL with misspecified simulators\.

Modi et al\. \[[2020](https://arxiv.org/html/2605.21458#bib.bib159)\] study transfer RL with linear mixtures of model ensembles, providing sample\-complexity bounds when the true MDP lies in a span of hypothesized models\. Our confounded\-simulator setting can be viewed as a nonlinear version: the true observed\-state MDP lies in a space of marginalizations of the joint $(S,H)$ POMDP, and the simulator is the marginalization under the calibration\-time distribution $\mathbb{P}_{t_0}(H \mid S, A)$\. Lu et al\. \[[2023](https://arxiv.org/html/2605.21458#bib.bib160)\] provide a recent information\-theoretic treatment of RL with informative priors that is directly complementary to our framework\.

##### Offline RL and confounded MDPs\.

Offline RL \[Levine et al\., [2020](https://arxiv.org/html/2605.21458#bib.bib24), Lange et al\., [2012](https://arxiv.org/html/2605.21458#bib.bib96), Fujimoto et al\., [2019](https://arxiv.org/html/2605.21458#bib.bib23)\] learns policies from fixed datasets without additional interaction\. A central challenge is *distribution shift*: the learned policy may visit states unseen in the data, where value estimates are unreliable\. Model\-free approaches address this through conservatism, penalizing out\-of\-distribution actions \[Kumar et al\., [2020](https://arxiv.org/html/2605.21458#bib.bib20), Kostrikov et al\., [2022](https://arxiv.org/html/2605.21458#bib.bib22)\], while model\-based methods like MOPO \[Yu et al\., [2020](https://arxiv.org/html/2605.21458#bib.bib21)\] and MOReL \[Kidambi et al\., [2020](https://arxiv.org/html/2605.21458#bib.bib25)\] learn a dynamics model and apply pessimistic penalties for states outside the data support\. Both families assume the offline data is *unconfounded*: the behavior policy’s action choices do not depend on unobserved variables that also affect outcomes\. When this assumption fails, a qualitatively different problem arises\.

Confounded MDPs \[Zhang and Bareinboim, [2016](https://arxiv.org/html/2605.21458#bib.bib133), Wang et al\., [2021](https://arxiv.org/html/2605.21458#bib.bib130)\] formalize settings where unobserved variables simultaneously affect actions and outcomes\. Zhang and Bareinboim \[[2016](https://arxiv.org/html/2605.21458#bib.bib133)\] introduce the causal approach to MDPs with unobserved confounders, showing that standard RL methods can fail when the behavior policy depends on hidden state; Wang et al\. \[[2021](https://arxiv.org/html/2605.21458#bib.bib130)\] provide the first provably efficient algorithms for this setting\. Off\-policy evaluation under confounding \[Namkoong et al\., [2020](https://arxiv.org/html/2605.21458#bib.bib131), Bennett and Kallus, [2024](https://arxiv.org/html/2605.21458#bib.bib132)\] develops methods to estimate counterfactual policy values when the behavior policy depends on unobserved variables\. Most relevant to our setting, Kallus et al\. \[[2018](https://arxiv.org/html/2605.21458#bib.bib79)\] demonstrate that even a small amount of experimental data can “ground” confounded observational estimates by correcting for hidden bias, a result that motivates our central question of *when* such experimental data is worth collecting\. Kallus and Zhou \[[2018](https://arxiv.org/html/2605.21458#bib.bib76)\] develop confounding\-robust policy improvement methods that provide worst\-case guarantees without experimental data, representing the alternative to our approach: accept the confounding and optimize conservatively, rather than experiment to resolve it\.

Our hidden\-state model is a confounded MDP: the simulator was calibrated from observational data where the behavior policy \(the knowledgeable manager’s stocking decisions\) depended on the hidden state \(true demand rates\), creating confounding bias $\beta_{\mathrm{conf}}$\. Rather than attempting to de\-bias the confounded model, we use it as a *prior* for experimental design\. Our framework asks: given that the simulator is confounded, is it worth collecting unconfounded experimental data, and if so, what experiments should be run? In this sense, our approach is complementary to the debiasing literature: rather than correcting the confounded model, we use it to determine *where* unconfounded data would be most valuable, and the EPI quantifies whether the expected gain justifies the cost\.

##### Generalizability and transportability\.

The generalizability and transportability literature in causal inference \[Stuart et al\., [2011](https://arxiv.org/html/2605.21458#bib.bib107), Dahabreh et al\., [2019](https://arxiv.org/html/2605.21458#bib.bib137), Bareinboim and Pearl, [2016](https://arxiv.org/html/2605.21458#bib.bib29), Degtiar and Rose, [2023](https://arxiv.org/html/2605.21458#bib.bib149)\] studies when causal effects estimated in one population can be applied to another\. Pearl and Bareinboim \[[2011](https://arxiv.org/html/2605.21458#bib.bib139)\] and Bareinboim and Pearl \[[2016](https://arxiv.org/html/2605.21458#bib.bib29)\] formalize this as a causal inference problem: given a causal model and knowledge of which variables differ between source and target, can the target\-population effect be identified from source experiments and target observations? Degtiar and Rose \[[2023](https://arxiv.org/html/2605.21458#bib.bib149)\] synthesize the assumptions, methods, and tests for treatment effect heterogeneity that this enterprise requires, emphasizing that both internal and external validity are necessary for unbiased target\-population estimates\. Huang and Parikh \[[2024](https://arxiv.org/html/2605.21458#bib.bib146)\] survey persistent hurdles in extrapolating experimental findings across disciplines\.

A closely related line of work operationalizes transportability through *data fusion*, combining experimental and observational data to improve causal effect estimation\. Parikh et al\. \[[2025a](https://arxiv.org/html/2605.21458#bib.bib147)\] propose double machine learning estimators that combine experimental and observational studies, providing falsification tests for external validity; their no\-free\-lunch theorem, which shows that one must correctly diagnose *which* assumption is violated, parallels our framework’s need to distinguish reward errors from transition errors\. Lanners et al\. \[[2025](https://arxiv.org/html/2605.21458#bib.bib148)\] extend data fusion to partial identification when both confounding and cross\-source exchangeability fail simultaneously, developing sensitivity analyses that quantify how severe assumption violations must be to overturn a given conclusion, a question structurally parallel to the logic of our EPI threshold\. Rosenman et al\. \[[2023](https://arxiv.org/html/2605.21458#bib.bib77)\] develop shrinkage estimators for combining observational and experimental datasets, and Yang and Ding \[[2020](https://arxiv.org/html/2605.21458#bib.bib78)\] study combining multiple observational sources\.

A recurring finding across this literature is that treatment effects vary substantially across populations: Vivalt \[[2020](https://arxiv.org/html/2605.21458#bib.bib105)\] documents this heterogeneity across impact evaluations, Meager \[[2019](https://arxiv.org/html/2605.21458#bib.bib106)\] uses Bayesian hierarchical models to synthesize evidence from seven microcredit experiments, and Parikh et al\. \[[2024a](https://arxiv.org/html/2605.21458#bib.bib138)\] develop methods to characterize the subgroups for whom trial results may not generalize\.

Our setting maps naturally onto the transportability framework: the simulator is analogous to a trial conducted in a non\-representative population \(the calibration population\), and the real world is the target\. The confounding bias $\beta_{\mathrm{conf}}$ measures the transportability gap arising from differences in the hidden\-state distribution\. The EPI threshold determines whether this gap is large enough to justify collecting new experimental data in the target population, a decision rule that the transportability literature identifies as important but does not provide\. The data fusion perspective of Parikh et al\. \[[2025a](https://arxiv.org/html/2605.21458#bib.bib147)\] and Lanners et al\. \[[2025](https://arxiv.org/html/2605.21458#bib.bib148)\] is directly relevant: our experimental data \(unconfounded\) and simulator data \(potentially confounded\) constitute precisely the two sources that data fusion methods seek to combine, and our EPI provides the decision rule, absent from the existing literature, for whether the experimental source is worth acquiring\. This connection suggests a natural extension of our framework to settings where the “simulator” is a body of evidence from a different population, and the question is whether to conduct a new trial in the target\.

##### Value of information in health economics and operations research\.

Two research communities have independently developed frameworks for determining whether additional data collection is warranted\. In health economics, the expected value of perfect information \(EVPI\) provides an upper bound on the value of any experiment \[Claxton, [1999](https://arxiv.org/html/2605.21458#bib.bib42), Claxton et al\., [2002](https://arxiv.org/html/2605.21458#bib.bib43)\], while the expected value of sample information \(EVSI\) quantifies the value of a specific experimental design \[Ades et al\., [2004](https://arxiv.org/html/2605.21458#bib.bib40), Brennan et al\., [2007](https://arxiv.org/html/2605.21458#bib.bib41), Strong et al\., [2014](https://arxiv.org/html/2605.21458#bib.bib5)\]\. These tools determine whether clinical trials merit funding, a question structurally identical to our “whether to experiment” problem\. Wilson \[[2015](https://arxiv.org/html/2605.21458#bib.bib45)\] provides a practical guide, and Jalal and Alarid\-Escudero \[[2018](https://arxiv.org/html/2605.21458#bib.bib39)\] develop Gaussian approximation methods for efficient computation\. In operations research, the ranking and selection literature \[Kim and Nelson, [2006](https://arxiv.org/html/2605.21458#bib.bib87), Chen et al\., [2000](https://arxiv.org/html/2605.21458#bib.bib88), Hong et al\., [2021](https://arxiv.org/html/2605.21458#bib.bib89)\] studies how to allocate simulation budget across competing alternatives, with sequential procedures that myopically maximize information value \[Chick and Inoue, [2001](https://arxiv.org/html/2605.21458#bib.bib46), Chick et al\., [2010](https://arxiv.org/html/2605.21458#bib.bib47)\] and the knowledge gradient \[Frazier et al\., [2008](https://arxiv.org/html/2605.21458#bib.bib48), [2009](https://arxiv.org/html/2605.21458#bib.bib115)\] extending these ideas to correlated beliefs\.
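
As a worked illustration of the two quantities, the sketch below computes EVPI and EVSI exactly for a one-shot decision with two actions and two scenarios. All payoffs, the prior, and the signal accuracies are made-up toy numbers, and the function names are hypothetical; the formulas are the standard definitions, EVPI $= \mathbb{E}_\theta[\max_a U(a,\theta)] - \max_a \mathbb{E}_\theta[U(a,\theta)]$, and EVSI as the same difference with the inner expectation taken under the posterior given the experimental signal.

```python
def expected_utility(utility, belief):
    """Expected utility of each action under a belief over scenarios."""
    return [sum(b * u for b, u in zip(belief, row)) for row in utility]

def evpi(utility, prior):
    """EVPI = E[max_a U(a, theta)] - max_a E[U(a, theta)]."""
    with_info = sum(p * max(row[k] for row in utility)
                    for k, p in enumerate(prior))
    return with_info - max(expected_utility(utility, prior))

def evsi(utility, prior, likelihood):
    """EVSI for a discrete signal; likelihood[y][k] = P(signal y | scenario k)."""
    value = 0.0
    for lik_y in likelihood:
        p_y = sum(l * p for l, p in zip(lik_y, prior))
        if p_y == 0:
            continue  # this signal never occurs under the prior
        posterior = [l * p / p_y for l, p in zip(lik_y, prior)]
        value += p_y * max(expected_utility(utility, posterior))
    return value - max(expected_utility(utility, prior))

# utility[a][k]: payoff of action a in scenario k (toy numbers).
utility = [[10.0, 0.0],   # action 0 is best in scenario 0
           [4.0, 6.0]]    # action 1 is best in scenario 1
prior = [0.5, 0.5]
noisy = [[0.8, 0.2], [0.2, 0.8]]    # signal is correct 80% of the time
perfect = [[1.0, 0.0], [0.0, 1.0]]  # signal reveals the scenario

value_perfect_info = evpi(utility, prior)
value_noisy_experiment = evsi(utility, prior, noisy)
value_perfect_experiment = evsi(utility, prior, perfect)
```

In this toy case EVPI is 3.0 and the 80%-accurate experiment is worth 1.8, so any experiment costing more than 3.0 can be ruled out without designing it, and the noisy experiment is worth running only if it costs less than 1.8. A perfectly informative signal recovers EVPI, consistent with EVPI being an upper bound on the value of any experiment.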

Our EPI connects to both traditions: it is an EVSI for the MDP setting \(health economics perspective\) and a simulation budget allocation criterion \(operations research perspective\)\. The key novelty is that our “alternatives” are not independent arms but states in an MDP with transition dynamics, so the information value of observing one state\-action pair depends on the entire MDP structure through the Bellman equations\. Neither the health economics nor the OR literature models this dependence: EVSI treats the decision as a one\-shot choice among treatments, and ranking\-and\-selection treats alternatives as independent simulation configurations\.
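
The dependence on MDP structure can be seen in a small fixed-policy example. For a fixed policy, the finite-horizon value is linear in the per-state rewards, and the coefficient on each state's reward is that state's expected visit count, which is determined entirely by the transition dynamics. The sketch below is an illustration under simplifying assumptions (state-only rewards, an undiscounted finite horizon, toy transition matrices); it is not the paper's EPI or Fisher-SEP computation, and all names are hypothetical.

```python
def occupancy(P, mu, horizon):
    """Expected number of visits to each state over `horizon` steps.

    P[s][s2]: transition probability under a fixed policy (toy numbers)
    mu[s]: initial state distribution
    """
    n = len(mu)
    dist = list(mu)
    visits = [0.0] * n
    for _ in range(horizon):
        visits = [v + d for v, d in zip(visits, dist)]
        # Push the state distribution one step forward.
        dist = [sum(dist[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return visits

def policy_value(P, mu, reward, horizon):
    """Finite-horizon value: occupancy-weighted sum of state rewards."""
    return sum(d * r for d, r in zip(occupancy(P, mu, horizon), reward))

# Toy 3-state chain in which state 2 is reachable only through state 1.
P_open = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
P_closed = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]  # never leaves state 0
mu = [1.0, 0.0, 0.0]

sens_open = occupancy(P_open, mu, horizon=10)
sens_closed = occupancy(P_closed, mu, horizon=10)
```

Under `P_closed` the sensitivity of the policy's value to the reward at state 2 is exactly zero, so learning that reward more precisely cannot change the value estimate for this policy; under `P_open` the same observation matters. An independent-arms treatment has no counterpart to this coupling between what is observed and where the dynamics lead.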

##### Online experimentation in technology\.

In the technology sector, online controlled experimentation has become standard practice\. A/B testing \[Kohavi et al\., [2009](https://arxiv.org/html/2605.21458#bib.bib80), [2020](https://arxiv.org/html/2605.21458#bib.bib81)\] is the dominant paradigm, with overlapping experiment infrastructure \[Tang et al\., [2010](https://arxiv.org/html/2605.21458#bib.bib82)\] supporting thousands of simultaneous tests\. Always\-valid inference \[Johari et al\., [2022](https://arxiv.org/html/2605.21458#bib.bib73), Howard et al\., [2021](https://arxiv.org/html/2605.21458#bib.bib74), Ramdas et al\., [2023](https://arxiv.org/html/2605.21458#bib.bib75)\] provides statistical methods for continuous monitoring without inflating Type I error, and Berman et al\. \[[2022](https://arxiv.org/html/2605.21458#bib.bib83)\] study false discovery at scale\. When treatments can be personalized based on user characteristics, A/B testing gives way to contextual bandits \[Li et al\., [2010](https://arxiv.org/html/2605.21458#bib.bib58), Agarwal et al\., [2014](https://arxiv.org/html/2605.21458#bib.bib59), Foster et al\., [2018](https://arxiv.org/html/2605.21458#bib.bib60), Foster and Rakhlin, [2020](https://arxiv.org/html/2605.21458#bib.bib61)\], with recent work addressing the statistical challenges of adaptively collected data \[Nie et al\., [2018](https://arxiv.org/html/2605.21458#bib.bib84), Hadad et al\., [2021](https://arxiv.org/html/2605.21458#bib.bib85), Zhang et al\., [2020](https://arxiv.org/html/2605.21458#bib.bib86)\]\.

Our framework addresses a distinct question: rather than asking “which treatment is superior?” \(a comparison problem\), we ask “should an experiment be conducted at all?” \(a meta\-decision\)\. Always\-valid inference determines when to *stop* an experiment that is already running; our EPI determines whether to *initiate* one\. The EPI threshold provides a principled criterion: experiment only when the expected information value exceeds the opportunity cost of deviating from the simulator’s recommendation, a consideration of particular importance when experiments affect large populations and the opportunity cost is measured in revenue or user welfare\.

##### Real options and the economics of experimentation\.

The decision to experiment can be cast as an investment under uncertainty: the planner incurs an upfront cost \(the opportunity cost of deviating from the simulator\-trained policy\) in exchange for information that may improve future decisions\. The real options literature \[Dixit and Pindyck, [1994](https://arxiv.org/html/2605.21458#bib.bib99), McDonald and Siegel, [1986](https://arxiv.org/html/2605.21458#bib.bib100), Pindyck, [1993](https://arxiv.org/html/2605.21458#bib.bib101)\] provides the canonical framework for such decisions, establishing that the option to delay and gather information carries positive value, which raises the threshold for action above the naive net\-present\-value rule\. Our “whether to experiment” question admits a direct mapping: the planner may experiment now \(exercising the option\) or continue with the simulator\-trained policy \(preserving the option to experiment later, informed by passive learning\)\.

The EPI threshold serves as the exercise boundary: experiment when the expected information value exceeds the opportunity cost\. The principal difference is that our “investment” possesses MDP structure: the value of information depends on the state, the transition dynamics, and the planning horizon, yielding a state\-dependent exercise boundary rather than a scalar threshold\. Rational inattention \[Sims, [2003](https://arxiv.org/html/2605.21458#bib.bib102), Matějka and McKay, [2015](https://arxiv.org/html/2605.21458#bib.bib103)\] provides a complementary perspective: the planner rationally chooses how much experimental effort to devote to reducing uncertainty\. Our framework admits interpretation through this lens: the EPI governs the *intensity* of attention \(the fraction of units allocated to experimentation\), while the Fisher\-SEP governs its *direction* \(which state\-action pairs to observe\)\. Morris and Strack \[[2019](https://arxiv.org/html/2605.21458#bib.bib104)\] connect the Wald sequential testing problem to information acquisition costs, providing a unified framework for optimal stopping and information gathering that parallels the role of the EPI as a criterion for initiating or terminating experimentation\.

##### Summary of positioning\.

The literatures surveyed above converge on a common tension: knowledge from one source \(a simulator, an observational dataset, a trial in a different population\) is valuable but imperfect, and the question is what to do about the imperfection\. The offline RL and confounded MDP literatures address this by *debiasing*: correcting the imperfect source using causal structure or pessimistic bounds\. The transportability and data fusion literatures address it by *combining*: integrating the imperfect source with a second, complementary source under explicit assumptions about what differs between them\. The BAMDP and bandit literatures address it by *exploring*: treating the imperfection as uncertainty to be resolved through interaction\. The sim\-to\-real and model\-based RL literatures address it by *transferring*: adapting the imperfect source to the target domain through domain randomization or model refinement\.

None of these literatures addresses the *meta\-question* that logically precedes all four strategies: is the imperfection large enough to warrant action? Our framework supplies this missing decision layer\. The simulator serves three roles \(policy source, Bayesian prior, and experimental design tool\) that correspond to increasing levels of engagement with the real world\. The EPI threshold answers the “if” question by comparing the expected information value to the opportunity cost, drawing on the VoI and real options traditions\. The gap decomposition answers the “what kind” question: passive learning suffices when informative states are reachable; designed exploration is necessary when they are not, a distinction absent from the clinical trials and bandit literatures, which lack transition dynamics\. The Fisher\-SEP answers the “how” question by directing experimental effort toward the reward parameters that most affect policy value, extending the IDS and BED traditions to MDP policies\. By integrating the cost–benefit logic of Bayesian experimental design with the MDP structure of reinforcement learning and the source\-discrepancy reasoning of causal transportability, our framework provides a unified answer to the question in the paper’s title: *if* and *when* to experiment\.

##### Baselines and scope limitations\.

Two baseline classes beyond those reported in Section [5](https://arxiv.org/html/2605.21458#S5) would refine the empirical picture but are beyond the scope of this paper\.

*Information\-directed sampling\.* IDS \[Russo and Van Roy, [2014](https://arxiv.org/html/2605.21458#bib.bib31), Russo et al\., [2018](https://arxiv.org/html/2605.21458#bib.bib55)\] minimizes the ratio of squared expected regret to mutual information about the optimal action\. The natural value\-IDS variant replaces the optimal\-action information term with mutual information about the target policy’s value, which matches the Fisher\-SEP objective in the Gaussian small\-noise limit\. A value\-IDS baseline on the vending and HIV case studies is a natural next step; we do not pursue it here because the Fisher\-SEP vs\. Thompson\-sampling comparison in Section [5](https://arxiv.org/html/2605.21458#S5) already isolates the “design criterion beats posterior sampling” direction at the paper’s claimed scale\.
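
For readers unfamiliar with the IDS criterion, the following sketch selects an action by minimizing the information ratio, squared expected regret divided by information gain. It is a deterministic simplification offered as an assumption: IDS as originally formulated optimizes over randomized action distributions, and the per-action regret and information-gain estimates here are supplied as hypothetical toy inputs rather than computed from a posterior.

```python
def ids_action(expected_regret, info_gain):
    """Deterministic IDS sketch: minimize (expected regret)^2 / information gain.

    expected_regret[a]: estimated expected regret of action a (toy input)
    info_gain[a]: estimated information gain of action a (toy input)
    Returns (best_action, best_ratio); best_action is None if every
    action has an infinite ratio.
    """
    best, best_ratio = None, float("inf")
    for a, (delta, gain) in enumerate(zip(expected_regret, info_gain)):
        if gain <= 0:
            # An uninformative action is acceptable only if it costs nothing.
            ratio = 0.0 if delta == 0 else float("inf")
        else:
            ratio = delta * delta / gain
        if ratio < best_ratio:
            best, best_ratio = a, ratio
    return best, best_ratio

# Action 0 is nearly greedy but uninformative; action 2 is informative but costly.
action, ratio = ids_action([0.1, 0.3, 0.5], [0.01, 0.3, 0.4])
```

With these toy inputs the ratios are 1.0, 0.3, and 0.625, so the middle action is chosen: it gives up some immediate reward but buys the most information per unit of squared regret. A value-IDS variant of the kind described above would keep the same selection rule and swap the information term for information about the target policy's value.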

*Representation\-matched UCRL2/UCBVI\.* UCRL2 \[Jaksch et al\., [2010](https://arxiv.org/html/2605.21458#bib.bib93)\] and UCBVI \[Azar et al\., [2017](https://arxiv.org/html/2605.21458#bib.bib95)\] in the Table [1](https://arxiv.org/html/2605.21458#S5.T1) columns run on a coarser $2 \times 2$ factored state discretization than the simulator\-informed policies\. A representation\-matched comparison \(UCRL2/UCBVI on the full observed\-state discretization, or Fisher\-SEP\-R on the $2 \times 2$ factored representation\) would tighten the “simulator\-free baseline” row to a method comparison rather than a representation\-and\-method comparison; the Table [1](https://arxiv.org/html/2605.21458#S5.T1) results as presented should be read as isolating the effect of the simulator\-as\-prior at the simulator\-informed representation, and as a lower bound on the simulator’s value\.

*Four additional scope limitations referenced from Section [6](https://arxiv.org/html/2605.21458#S6)\.* \(i\) *Rate vs\. PSRL/UCRL2\.* Theorem [4](https://arxiv.org/html/2605.21458#Thmtheorem4) gives a $T_{\mathrm{pilot}}^{1/3}$ Bayes\-regret bound for Fisher\-SEP\-R, slower than PSRL’s and UCRL2’s $\sqrt{T}$ \(Remark [8](https://arxiv.org/html/2605.21458#Thmremark8)\); the rate is a direct consequence of the two\-phase explore\-then\-commit structure, and an adaptive $\sqrt{T}$ variant is left to future work\. \(ii\) *Independence\-prior sensitivity\.* Proposition [9](https://arxiv.org/html/2605.21458#Thmproposition9) \(App\. [C\.18](https://arxiv.org/html/2605.21458#A3.SS18)\) shows that a spatial GP prior with moderate lengthscale partially closes the reachability gap via passive learning; the “passive cannot close $\mathcal{G}_{\mathrm{reach}}$” claim is therefore an Assumption [4](https://arxiv.org/html/2605.21458#Thmassumption4) statement, not a structural one\. \(iii\) *Contracting latent dynamics\.* The analysis treats $\epsilon^{\mathrm{hist}}$ as negligible under a contracting latent Jacobian \(App\. [D\.16](https://arxiv.org/html/2605.21458#A4.SS16)\); non\-contracting POMDPs give $\epsilon^{\mathrm{hist}} = \Theta(1)$ and the framework does not apply\. \(iv\) *Trial count\.* All results use $n=30$ common\-seed trials with paired Wilcoxon tests; $n=100$ replication with seed\-resampling would strengthen the vending $T=1600$ headline\.