Off-Policy Evaluation with Strategic Agents via Local Disclosure
Summary
This paper studies off-policy evaluation (OPE) when decision subjects (agents) strategically modify their covariates in response to a policy. It proposes a method that uses local disclosure via post-hoc explanations to reveal agents' pre-strategic covariates and construct a doubly robust estimator for policy value.
# Off-Policy Evaluation with Strategic Agents via Local Disclosure
Source: [https://arxiv.org/html/2606.07308](https://arxiv.org/html/2606.07308)
Kiet Q. H. Vo (corresponding author: huynh.vo@cispa.de); Abbavaram Gowtham Reddy; Rational Intelligence Lab, CISPA Helmholtz Center for Information Security, Saarbrücken, Germany; Relational Machine Learning Lab, CISPA Helmholtz Center for Information Security, Saarbrücken, Germany; Julian Rodemann; Rational Intelligence Lab, CISPA Helmholtz Center for Information Security, Saarbrücken, Germany; Department of Statistics, LMU Munich, Germany; Siu Lun Chau; Epistemic Intelligence & Computation Lab, Nanyang Technological University, Singapore; Krikamol Muandet; Rational Intelligence Lab, CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
###### Abstract
We study off-policy evaluation (OPE) under strategic behavior, where decision subjects (or agents) respond to a decision maker’s policy by strategically modifying their covariates. Such behavior induces a *policy-dependent* covariate shift, breaking the standard assumption in existing methods that covariates are exogenous to the policy. Related work addresses this challenge by imposing strong assumptions such as repeated interactions or full knowledge of agents’ response behavior, substantially limiting its applicability to OPE. In contrast, we consider a one-shot OPE setting where the decision maker has only partial knowledge of the agents’ response behavior. Our key insight is that disclosing local information through post-hoc explanations reveals agents’ pre-strategic covariates prior to adaptation, mitigating the information loss induced by strategic behavior. Leveraging this structure, we estimate a statistical model for the agents’ responses and construct a doubly robust estimator for policy value. By assuming that the agents’ cost sensitivity follows a conditional log-normal distribution, we establish consistency of the proposed estimator and validate our approach empirically. More broadly, our results highlight how interaction design can mitigate information asymmetry by revealing otherwise hidden structure in agents’ strategic responses.
Accepted at ICML 2026.

## Introduction
The abundance of individual-level data has made it increasingly feasible for decision makers (DMs) to design and deploy personalized policies across a wide range of domains, including healthcare (Murphy, [2003](https://arxiv.org/html/2606.07308#bib.bib28); Hamburg and Collins, [2010](https://arxiv.org/html/2606.07308#bib.bib12)), education (Mandel et al., [2014](https://arxiv.org/html/2606.07308#bib.bib25)), lending (Kilbertus et al., [2020](https://arxiv.org/html/2606.07308#bib.bib20)), and recommendation systems (Joachims et al., [2021](https://arxiv.org/html/2606.07308#bib.bib17)). In these applications, particularly in high-stakes settings such as healthcare, deploying new policies directly is often prohibitively risky, as poorly chosen decisions may lead to substantial harm, financial loss, or unfair outcomes. As a result, rather than being evaluated through experimentation, policy performance must be assessed using historically collected data generated under a different decision policy. This problem is commonly referred to as an *off-policy evaluation* (OPE) problem (Uehara et al., [2022](https://arxiv.org/html/2606.07308#bib.bib41)).
When decisions are personalized, agents, i.e., individuals subject to the decision rule, may strategically modify their observable covariates to receive more favorable decisions. For instance, if college admissions policies place greater weight on standardized test scores (e.g., GRE), applicants may reallocate effort toward those tests (Vo et al., [2024](https://arxiv.org/html/2606.07308#bib.bib43)); similarly, when a bank introduces a new lending policy, customers might alter their financial behavior to meet eligibility criteria (Tsirtsis and Gomez Rodriguez, [2020](https://arxiv.org/html/2606.07308#bib.bib39)). Such strategic responses induce a *policy-dependent* shift in the distribution of agents’ covariates: as the policy changes, so does the population it acts upon. This phenomenon is widely studied in strategic classification (Hardt et al., [2016](https://arxiv.org/html/2606.07308#bib.bib13)) and performative prediction (Perdomo et al., [2020](https://arxiv.org/html/2606.07308#bib.bib31)). Because policy performance is typically defined as an average over the induced population, a central insight from this literature is that evaluating a new policy requires *anticipating how the covariate distribution responds to it*; failing to do so amounts to evaluating the policy under an incorrect induced population and thus estimating the wrong policy value. Nonetheless, this challenge is largely overlooked in the OPE literature.
Figure 1: (a) Two agents modify to the same covariate. (b) Global information disclosure (GID) vs. local information disclosure (LID). The left panel *(a)* shows a situation where two agents, with different pre-strategic covariates $x^b_\bullet, x^b_\diamond$ and cost functions $c_\bullet, c_\diamond$, adapt to the same covariate value $x^s$. If the pre-strategic covariates are not observed, which happens under global information disclosure (right panel *(b)*), these two agents are indistinguishable. This makes it harder for the DM to reason about the effect of another policy $\pi'$ on the covariate shift. The right panel *(b)* illustrates the interactions in GID (where the DM makes their policy’s properties public) and LID (where the DM withholds disclosure until the agent gives information about themselves).

Prior work on strategic behavior in offline settings primarily originates from the strategic classification literature (Hardt et al., [2016](https://arxiv.org/html/2606.07308#bib.bib13); Levanon and Rosenfeld, [2021](https://arxiv.org/html/2606.07308#bib.bib22); Rosenfeld and Rosenfeld, [2024](https://arxiv.org/html/2606.07308#bib.bib34)). To enable tractable analysis and equilibrium characterization, these works typically assume that agents’ responses can be modeled precisely or that their behavior is homogeneous, commonly formalized through a *single* and/or *known* cost function governing covariate modification. While analytically convenient, such assumptions rarely hold in practice. Agents’ preferences and constraints are inherently heterogeneous and are generally unobserved by the DM. Hence, applying these approaches as-is to OPE can substantially limit their practical usefulness, as OPE is most often employed to support real-world decision making.
Motivated by this, we study OPE under strategic behavior while relaxing the assumption of a single and known cost function. Specifically, we allow agents to have heterogeneous cost functions, while assuming that the DM knows only their common components and not the agent-specific costs. This relaxation is possible when the DM employs local information disclosure (LID): disclosing partial information about their policy as a form of personalized feedback, for instance, through post-hoc explanations (Tsirtsis and Gomez Rodriguez, [2020](https://arxiv.org/html/2606.07308#bib.bib39); Xie and Zhang, [2024](https://arxiv.org/html/2606.07308#bib.bib45); Vo et al., [2026](https://arxiv.org/html/2606.07308#bib.bib44)); see [Figure 1(b)](https://arxiv.org/html/2606.07308#S1.F1.sf2).
Crucially, this interaction scheme allows the DM to observe agents’ original covariates (or *pre-strategic covariates*) before their strategic adaptation. Observing pre-strategic covariates is essential because different agents, with different baseline characteristics and behavioral models, may strategically modify to the same final covariate value. Without access to pre-strategic information, such trajectories are observationally indistinguishable, making it impossible to separate strategic modification from genuine baseline characteristics. [Figure 1(a)](https://arxiv.org/html/2606.07308#S1.F1.sf1) illustrates this. By contrast, under global information disclosure (GID), where the DM makes policy information public and observes only post-adaptation covariates, they lose access to pre-strategic information, e.g., as in Shavit et al. ([2020](https://arxiv.org/html/2606.07308#bib.bib36)); Harris et al. ([2022b](https://arxiv.org/html/2606.07308#bib.bib15)); Munro ([2025](https://arxiv.org/html/2606.07308#bib.bib29)); Cohen et al. ([2024](https://arxiv.org/html/2606.07308#bib.bib8)). While prior work has studied strategic behavior under LID (Tsirtsis and Gomez Rodriguez, [2020](https://arxiv.org/html/2606.07308#bib.bib39); Xie and Zhang, [2024](https://arxiv.org/html/2606.07308#bib.bib45); Vo et al., [2026](https://arxiv.org/html/2606.07308#bib.bib44)), it mainly focuses on online learning and equilibrium analysis. To our knowledge, no existing work leverages pre-strategic information for OPE with strategic agents, nor for one-shot learning with heterogeneous and unknown agent behavior.
More broadly, this limitation of GID reflects a form of *information asymmetry*: the DM observes agents only after strategic adaptation and therefore lacks access to key pre-adaptation information. Although LID is used as part of our problem setup, we emphasize that, practically, LID should be understood as a design choice in how the DM structures interactions with agents, rather than as a restrictive evaluation setup introduced solely to enable estimation. In many systems, the DM has discretion over whether policy information is revealed globally or through personalized feedback, and this choice fundamentally shapes agents’ strategic responses and what can be inferred from data.
Our work introduces this connection between interaction design and inferential structure as a novel perspective on OPE with heterogeneous and partially unknown agents’ behavior\. From this perspective, our main research question is therefore twofold: \(i\) under local information disclosure, how should the disclosure rule be designed for OPE under strategic behavior, and \(ii\) how can the corresponding policy value be estimated from historical data?
We summarize below our contributions:
- As the first work to apply LID to OPE under strategic behavior, we extend the action recommendation-based explanation (ARex) (Vo et al., [2026](https://arxiv.org/html/2606.07308#bib.bib44)) and adapt its explanation rule to handle covariate shift in OPE ([Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2)).
- We show that ARexes, as an instantiation of LID, can be used to infer the behavioral model of agents. Under some structural assumptions, we show that the estimator of unknown parameters is consistent ([Theorem 2.6](https://arxiv.org/html/2606.07308#S2.Thmtheorem6)).
- We propose a doubly robust estimator that can adjust for the strategic covariate shift, and prove the estimator’s consistency under standard conditions ([Theorem 3.2](https://arxiv.org/html/2606.07308#S3.Thmtheorem2)).
All proofs are provided in the appendices\.
## Strategic OPE under Local Disclosure
### 2.1 Problem Formulation
We use the lending scenario (Harris et al., [2022a](https://arxiv.org/html/2606.07308#bib.bib14)) as our running example and consider the setting in which a DM (e.g., a bank) interacts with a population of agents (e.g., their customers). Following prior work (Tsirtsis and Gomez Rodriguez, [2020](https://arxiv.org/html/2606.07308#bib.bib39); Harris et al., [2022b](https://arxiv.org/html/2606.07308#bib.bib15); Vo et al., [2024](https://arxiv.org/html/2606.07308#bib.bib43), [2026](https://arxiv.org/html/2606.07308#bib.bib44)), we assume agents interact with the DM independently of one another. Therefore, for ease of exposition, we describe the setup for a single agent and simply view a population of heterogeneous agents as concrete realizations of the same model. In particular, we adopt the strategic agent setting from Vo et al. ([2026](https://arxiv.org/html/2606.07308#bib.bib44)) and describe it below.
Let $X^b \sim P_{X^b}$ be an agent’s covariate vector and $x^b \in \mathcal{X}$ an independent realization representing the agent’s observable attributes, such as existing debt or bank account balance. At this stage, as the agent has not modified their covariates, we also refer to the base covariates $x^b$ as their pre-strategic covariates. We assume the covariate space $\mathcal{X}$ is discrete. This fits many real-world scenarios where the DM dictates which agents’ information is collected and where continuous values are often discretized. For example, a bank may only record coarse, pre-specified attributes such as credit score buckets, income ranges, and debt-to-income ratio categories.
In the beginning, the DM commits to a decision policy $\pi: \mathcal{X} \to [0,1]$ that controls the probability of the agent getting a positive treatment (e.g., a loan application getting approved). Let $T^b \mid x^b \sim \text{Bernoulli}(\pi(x^b))$ be a binary random variable that represents the treatment assigned to this agent, with $t^b \in \mathcal{T} = \{0,1\}$ denoting its realization. We note that our formulation of the (potentially stochastic) treatment policy $\pi$ is consistent with prior work in OPE; see, e.g., Uehara et al. ([2020](https://arxiv.org/html/2606.07308#bib.bib40)); Kallus et al. ([2022](https://arxiv.org/html/2606.07308#bib.bib18)). Moreover, this stochasticity is realistic in many situations, such as when the DM wants randomization for learning (Kilbertus et al., [2020](https://arxiv.org/html/2606.07308#bib.bib20); Munro, [2025](https://arxiv.org/html/2606.07308#bib.bib29); Vo et al., [2024](https://arxiv.org/html/2606.07308#bib.bib43)) or when it is a consequence of credit rationing (Stiglitz and Weiss, [1981](https://arxiv.org/html/2606.07308#bib.bib38)).
DM’s personalized feedback. If the agent receives a negative treatment ($t^b = 0$), they are provided with personalized feedback in the form of an action recommendation-based explanation (ARex) (Vo et al., [2026](https://arxiv.org/html/2606.07308#bib.bib44)) and are allowed to modify their observable features $x^b$ before reapplying. We extend the ARex framework of Vo et al. ([2026](https://arxiv.org/html/2606.07308#bib.bib44)) to allow the agent to receive an explanation $e$ that contains $k \geq 2$ recommendations, i.e., $e = \{(x^r_j, \pi(x^r_j))\}_{j=1}^{k}$, rather than being restricted to only two recommendations. In our lending example, these recommendations can inform the agent to pay off more debt or increase the amount in their savings account. This is similar to the concept of algorithmic recourse (Karimi et al., [2021](https://arxiv.org/html/2606.07308#bib.bib19); Harris et al., [2022a](https://arxiv.org/html/2606.07308#bib.bib14); König et al., [2026](https://arxiv.org/html/2606.07308#bib.bib21)). In addition, we refer to this set of recommended feature updates as $\mathcal{X}^r = \{x^r_j\}_{j=1}^{k}$ and we use $\tau: (x^b, \pi) \mapsto e$ to denote the explanation rule that generates ARexes. For a given choice of $k$, $\mathcal{E} = \mathcal{X}^k \times [0,1]^k$ denotes the space of ARexes and $E$ denotes the (random) explanation.
Agent’s adaptation. Following Tsirtsis and Gomez Rodriguez ([2020](https://arxiv.org/html/2606.07308#bib.bib39)); Vo et al. ([2026](https://arxiv.org/html/2606.07308#bib.bib44)), we model the agent’s strategic modification of their covariate vector as

$$x^s \in \operatorname*{arg\,max}_{x \in \{x^b\} \cup \mathcal{X}^r} \; \underbrace{\left\{\pi(x) - c\big(x, x^b\big)\right\}}_{u(\pi, c, x, x^b)}, \tag{1}$$

where $c: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$ denotes the cost function of this agent for modifying their base features $x^b$ to $x$.
While this choice-based behavioral assumption might appear strong, particularly when only a single recommendation is provided, our work mitigates this concern by allowing $k \geq 2$ recommendations. Moreover, following Vo et al. ([2026](https://arxiv.org/html/2606.07308#bib.bib44)), we assume that agents do not explore actions outside the presented choice set $\{x^b\} \cup \mathcal{X}^r$, so as to avoid the risk of inadvertently reducing their utility. [Section B.1](https://arxiv.org/html/2606.07308#A2.SS1) further discusses this behavioral assumption. We denote by $\mathcal{X}^f := \{x^b\} \cup \mathcal{X}^r$ the set of feasible actions.
Agent’s cost model. We assume that the agent’s cost function takes the form $c(x, x') := \alpha\, d(x, x')$, where $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$ is a deterministic function measuring the primitive cost of modifying covariates from $x'$ to $x$, matching the argument order of $c(x, x^b)$ in Eq. (1). The function $d$ is shared across agents and known by the DM, while the scalar $\alpha \in (0, \infty)$ captures agent-specific cost sensitivity, which is unknown to the DM. To model heterogeneity across agents, we assume $\alpha$ follows a conditional log-normal distribution:

$$\ln \alpha \mid x^b \sim \mathcal{N}\big(\beta^{\top} \phi(x^b) + \beta_0,\ \sigma^2\big), \tag{2}$$

where $\phi: \mathcal{X} \to \mathbb{R}^p$ denotes some known feature transformation function of the covariate vector $x^b$. Furthermore, the conditional CDF $F(\alpha \mid x^b; \theta)$ is parameterized by $\theta = (\beta, \beta_0, \sigma) \in \Theta \subseteq \mathbb{R}^p \times \mathbb{R} \times \mathbb{R}^{+}$.
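As a minimal sketch of the cost-sensitivity model in Eq. (2), the snippet below draws $\alpha$ for a given transformed covariate vector $\phi(x^b)$. The function names `log_alpha_mean` and `sample_alpha`, and the numeric parameter values in the demo, are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
import random


def log_alpha_mean(phi_xb, beta, beta0):
    """Conditional mean of ln(alpha): beta^T phi(x^b) + beta0."""
    return sum(b * f for b, f in zip(beta, phi_xb)) + beta0


def sample_alpha(phi_xb, beta, beta0, sigma, rng):
    """Draw alpha with ln(alpha) | x^b ~ N(beta^T phi(x^b) + beta0, sigma^2).

    rng is a random.Random instance; lognormvariate(mu, sigma) returns
    exp(Z) for Z ~ N(mu, sigma^2), so the draw is always positive.
    """
    return rng.lognormvariate(log_alpha_mean(phi_xb, beta, beta0), sigma)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical parameters: p = 2 features, illustrative coefficients.
    phi_xb, beta, beta0, sigma = [1.0, 2.0], [0.5, -0.25], 0.1, 0.5
    draws = [sample_alpha(phi_xb, beta, beta0, sigma, rng) for _ in range(5)]
    print(draws)
```

Agents with the same $x^b$ still receive different $\alpha$ values, which is how the model captures heterogeneity that the DM cannot observe directly.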
Intuitively, $d(x, x')$ captures the underlying burden of modifying covariates (e.g., time, effort, or monetary expenses), while $\alpha$ reflects how different agents internalize this burden. The log-normal model provides a tractable parameterization of heterogeneous cost sensitivities in the one-shot setting. ([Section B.2](https://arxiv.org/html/2606.07308#A2.SS2) provides further discussion of the shared primitive cost model and the log-normal assumption on agents’ cost sensitivities.) We rewrite the agent’s strategic adaptation as follows:

$$x^s \in \operatorname*{arg\,max}_{x \in \mathcal{X}^f} \; \underbrace{\left\{\pi(x) - \alpha\, d\big(x, x^b\big)\right\}}_{u(\pi, \alpha, x, x^b)}$$
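The best-response rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function `best_response`, the toy policy `TOY_POLICY`, and the toy cost `toy_d` are hypothetical, and the tie-breaking behavior (the earliest candidate wins, with $x^b$ listed first) is an assumption of the sketch rather than part of the model.

```python
def best_response(pi, d, alpha, x_b, recommendations):
    """Return the feasible covariate maximizing pi(x) - alpha * d(x, x_b).

    pi: callable mapping a covariate to an approval probability in [0, 1].
    d:  callable giving the shared primitive cost of moving from x_b to x.
    The feasible set is {x_b} together with the recommended covariates.
    Ties go to the earliest candidate (x_b first); this is an assumption
    of the sketch, not a modeling choice from the source.
    """
    feasible = [x_b] + [x for x in recommendations if x != x_b]
    return max(feasible, key=lambda x: pi(x) - alpha * d(x, x_b))


# Hypothetical toy instance: covariates are integer credit-score buckets.
TOY_POLICY = {0: 0.1, 1: 0.6, 2: 0.9}


def toy_pi(x):
    return TOY_POLICY[x]


def toy_d(x, x_b):
    return 0.2 * abs(x - x_b)


if __name__ == "__main__":
    for alpha in (1.0, 2.0, 3.0):
        print(alpha, best_response(toy_pi, toy_d, alpha, 0, [1, 2]))
```

In this toy instance, a low cost sensitivity leads the agent to jump to the farthest recommendation, a moderate one to the nearer recommendation, and a high one to keep $x^b$ unchanged.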
Agent’s outcome. If the agent modifies their feature vector to a new value $x^s \neq x^b$ and reports $x^s$ to the DM, they receive the final treatment $t^s \in \mathcal{T} = \{0,1\}$, modeled as $T^s \sim \text{Bernoulli}(\pi(x^s))$. Then, the agent’s outcome is realized as $y := h\big(x^s, t^s, z\big)$, where $y \in \mathcal{Y} \subseteq \mathbb{R}$, $h: \mathcal{X} \times \mathcal{T} \times \mathcal{Z} \to \mathcal{Y}$ denotes the (non-random) outcome function, and $z \in \mathcal{Z}$ is the unobserved noise factor. In our lending example, this outcome corresponds to the profit the bank makes from approving (i.e., $t^s = 1$) or rejecting (i.e., $t^s = 0$) this customer’s loan application (i.e., $x^s$), as some customers might be able to pay back the loan while others might not. In addition, we assume the random variable $Z$ represents exogenous noise that only affects the outcome $Y$. This captures external events that occur after the decision $t^s$ is made, such as unexpected income changes, unforeseen expenses, or a temporary economic slowdown.
On the other hand, the agent does not modify their base features $x^b$ if either they receive a positive treatment at the first try, i.e., $t^b = 1$, or all recommended actions $x^r_j \in \mathcal{X}^r$ yield lower utility than retaining $x^b$, e.g., due to high modification costs (Vo et al., [2026](https://arxiv.org/html/2606.07308#bib.bib44)). In these cases, we simply define $x^s := x^b$ and $t^s := t^b$. Consequently, their outcome is realized as $y := h\big(x^s, t^s, z\big) = h\big(x^b, t^b, z\big)$.
DM’s objective. We define the value of a policy $\pi$ as $V(\pi) = \mathbb{E}_{\pi}[Y]$, which, for instance, reflects the expected profit of the DM from a loan application. Given a logging policy $\pi_0$ and the respective observational data $D = \{(x^b_i, t^b_i, \mathcal{X}^r_i, x^s_i, t^s_i, y_i)\}_{i=1}^{k}$, the DM’s goal is to estimate $V(\pi)$ of a target policy $\pi$ via an estimator $\hat{V}(\pi, D)$. (Note that, in contrast, a dataset in GID does not contain observations of the pairs $(x^b, t^b)$.)
### 2.2 Policy Value Decomposition
The policy valueV\(π\)V\(\\pi\)can be decomposed as follows:
$$
\begin{aligned}
V(\pi) = \mathbb{E}_{\pi}\left[Y\right]
&= \sum_{t^s, x^s} \mathbb{E}_{\pi}\left[Y \mid t^s, x^s\right] p_{\pi}(t^s, x^s) \\
&= \sum_{t^s, x^s} \mathbb{E}_{\pi}\left[Y \mid t^s, x^s\right] \left( \sum_{t^b, x^b} p_{\pi}(t^s \mid x^s, t^b, x^b)\, p_{\pi}(x^s \mid t^b, x^b)\, p_{\pi}(t^b, x^b) \right) \\
&= \sum_{t^s, x^s, t^b, x^b} \underbrace{\mathbb{E}_{\pi}\left[Y \mid t^s, x^s\right]}_{\text{①}}\; \underbrace{p_{\pi}(t^s \mid x^s, t^b, x^b)}_{\text{②}}\; \underbrace{p_{\pi}(x^s \mid t^b, x^b)}_{\text{③}}\; \underbrace{p_{\pi}(t^b, x^b)}_{\text{④}}.
\end{aligned}
$$

In what follows, we explain each term in detail.
① Conditional expected outcome $\mathbb{E}_{\pi}\left[Y \mid t^s, x^s\right]$. This term models the expected outcome conditioned on the strategically updated covariate $x^s$ and treatment $t^s$. To further simplify it, we make the following assumption.
###### Assumption 2.1 (Unconfoundedness).
The noise $Z$ and the treatment $T^s$ are conditionally independent given the strategically updated covariate $X^s$.
This is a standard assumption in the OPE literature (Uehara et al., [2020](https://arxiv.org/html/2606.07308#bib.bib40); Kallus et al., [2022](https://arxiv.org/html/2606.07308#bib.bib18)) and holds trivially in our setting, by definition of $Z$. We note that in many real-world settings where the noise $Z$ is correlated with agents’ features $X^s$, unconfoundedness does not hold trivially under strategic behavior. This is because any intervention on $T^s$ (by the DM) induces a corresponding intervention on $X^s$ by the strategic agent, over which the DM has no control; see, e.g., Munro ([2025](https://arxiv.org/html/2606.07308#bib.bib29)); Vo et al. ([2024](https://arxiv.org/html/2606.07308#bib.bib43)). Although strategic behavior introduces multiple challenges in OPE, our work focuses on the challenge of anticipating the covariate shift and assumes that the noise $Z$ is exogenous.
Under [Assumption 2.1](https://arxiv.org/html/2606.07308#S2.Thmtheorem1), we can drop the subscript $\pi$ from the conditional expected outcome, as $\mathbb{E}[Y \mid t^s, x^s]$ becomes identifiable from observational data.
② Strategic propensity $p_{\pi}(t^s \mid x^s, t^b, x^b)$. This term captures an agent’s chance of receiving the treatment $t^s$, given a specific observation $\{x^s, t^b, x^b\}$. It largely depends on the DM’s policy $\pi$ and can be computed straightforwardly.
③ Strategic covariate shift $p_{\pi}(x^s \mid t^b, x^b)$. This term captures the covariate shift arising from strategic behavior. We consider two cases of $t^b$. If $t^b = 1$, then by definition, the agent does not update their features, thus $p_{\pi}(x^s \mid T^b = 1, x^b) = \mathbbm{1}(x^s = x^b)$. In contrast, when $t^b = 0$, we have

$$\text{③} = \sum_{e \in \mathcal{E}} p(x^s \mid T^b = 0, x^b, e)\, p_{\pi}(e \mid x^b). \tag{3}$$

Given that the agent might have multiple utility maximizers, we present the following result to ensure that there is only one maximizer, almost surely. This allows us to avoid making assumptions about how the agent might break ties.
###### Lemma 2.2 (Unique utility maximizer).
For any agent with $(x^b, t^b = 0)$ that receives recommendation set $\mathcal{X}^r$ (coming from the ARex $e$), if it holds that $d(x^r_{\bullet}, x^b) \neq d(x^r_{\diamond}, x^b)$ for any pair $x^r_{\bullet} \neq x^r_{\diamond}$ in $\mathcal{X}^r$, then

$$p_{\alpha \mid X^b}\Big( \big|\arg\max_{x \in \mathcal{X}^f} u(\pi, \alpha, x, x^b)\big| = 1 \;\Big|\; x^b \Big) = 1. \tag{4}$$
[Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2) says that when the DM provides recommendations to the agent, as long as no two feature updates have the same distance to $x^b$, there is a unique utility maximizer for this agent, almost surely. We use this result to further decompose $p(x^s \mid T^b = 0, x^b, e)$.
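The distinct-distance condition of the lemma is easy to check before issuing an explanation. The helper below is a hypothetical sketch (the name `has_distinct_distances` is not from the paper) and it assumes covariates are hashable values such as integers or tuples, and that exact equality of the returned costs is the right notion of a tie.

```python
def has_distinct_distances(d, x_b, recommendations):
    """Check that no two distinct recommendations are equally far from x_b.

    d: callable giving the shared primitive cost d(x, x_b).
    Duplicated recommendations are collapsed first, since the condition
    only concerns pairs of distinct covariates.
    """
    unique_recs = list(dict.fromkeys(recommendations))  # keeps order, drops repeats
    distances = [d(x, x_b) for x in unique_recs]
    return len(set(distances)) == len(distances)


def abs_cost(x, x_b):
    # Hypothetical primitive cost on integer-valued covariates.
    return abs(x - x_b)


if __name__ == "__main__":
    print(has_distinct_distances(abs_cost, 0, [1, 2]))   # distances 1 and 2
    print(has_distinct_distances(abs_cost, 0, [-1, 1]))  # both at distance 1
```

With floating-point costs, a tolerance-based comparison may be more appropriate than exact equality; that refinement is left out of this sketch.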
Let $\mathcal{X}^f_{-s} := \mathcal{X}^f \setminus \{x^s\}$ denote the set of feasible feature updates excluding the value $x^s$. Then, we have
$$
\begin{aligned}
p(x^s \mid e, T^b = 0, x^b)
&= p\big(\{\pi(x^s) - \alpha\, d(x^s, x^b) > \pi(x) - \alpha\, d(x, x^b)\}\ \forall x \in \mathcal{X}^f_{-s} \ \big|\ e, T^b = 0, x^b\big)\ \mathbbm{1}(x^s \in \mathcal{X}^f) \\
&= p\big(\{\pi(x^s) - \pi(x) > \alpha\,(d(x^s, x^b) - d(x, x^b))\}\ \forall x \in \mathcal{X}^f_{-s} \ \big|\ x^b\big)\ \mathbbm{1}(x^s \in \mathcal{X}^f),
\end{aligned}
$$

where the strict inequality follows because $x^s$ is the unique utility maximizer a.s., and we can drop the conditioning variables $\{E, T^b\}$ because the rewritten expression contains only $\alpha$ as the source of randomness. In the following, we write $\Delta_{\pi}(x, x') := \pi(x) - \pi(x')$ and $\Delta_{d}(x, x', x'') := d(x, x'') - d(x', x'')$ to simplify the notation.
Next, we define the three complementary sets for𝒳−sf\\mathcal\{X\}^\{f\}\_\{\-s\}\. Let𝒳−f:=\{x:x∈𝒳−sf&Δd\(xs,x,xb\)<0\}\\mathcal\{X\}^\{f\}\_\{\-\}:=\\big\\\{x:x\\in\\mathcal\{X\}^\{f\}\_\{\-s\}\\ \\&\\ \\Delta\_\{d\}\(\{x\}^\{s\},x,\{x\}^\{b\}\)<0\\big\\\}denote the subset of𝒳−sf\\mathcal\{X\}^\{f\}\_\{\-s\}where the distances between its members toxb\{x\}^\{b\}are larger than that ofxs\{x\}^\{s\}toxb\{x\}^\{b\}\. We define𝒳\+f\\mathcal\{X\}^\{f\}\_\{\+\}and𝒳0f\\mathcal\{X\}^\{f\}\_\{0\}analogously where the former corresponds to the caseΔd\(xs,x,xb\)\>0\\Delta\_\{d\}\(\{x\}^\{s\},x,\{x\}^\{b\}\)\>0and the latterΔd\(xs,x,xb\)=0\\Delta\_\{d\}\(\{x\}^\{s\},x,\{x\}^\{b\}\)=0\. We then define these three variables:
$$
\begin{aligned}
\delta^{l} &:= \max\Big\{0,\ \max_{x\in\mathcal{X}^{f}_{-}}\Big\{\frac{\Delta_{\pi}(x^{s},x)}{\Delta_{d}(x^{s},x,x^{b})}\Big\}\Big\},\\
\delta^{u} &:= \min_{x\in\mathcal{X}^{f}_{+}}\Big\{\frac{\Delta_{\pi}(x^{s},x)}{\Delta_{d}(x^{s},x,x^{b})}\Big\},\;\;\delta^{0}:=\min_{x\in\mathcal{X}^{f}_{0}}\;\Delta_{\pi}(x^{s},x).
\end{aligned}
$$

If $\mathcal{X}^{f}_{-}$ and/or $\mathcal{X}^{f}_{+}$ are empty, we can set $\delta^{l}:=0$ and $\delta^{u}:=\infty$. If $\mathcal{X}^{f}_{0}$ is empty, we can equivalently set $\delta^{0}$ to some arbitrarily positive value. We can then rewrite $p(x^{s}\mid e,T^{b}=0,x^{b})$ as

$$
\begin{aligned}
p(x^{s}\mid e,T^{b}=0,x^{b})
&= p\big(\{\Delta_{\pi}(x^{s},x)>\alpha\,\Delta_{d}(x^{s},x,x^{b})\}\ \forall x\in\mathcal{X}^{f}_{-s}\ \big\lvert\ x^{b}\big)\ \mathbbm{1}(x^{s}\in\mathcal{X}^{f})\\
&= p\Big(\frac{\Delta_{\pi}(x^{s},x)}{\Delta_{d}(x^{s},x,x^{b})}>\alpha\ \forall x\in\mathcal{X}^{f}_{+}\ \&\ \frac{\Delta_{\pi}(x^{s},x)}{\Delta_{d}(x^{s},x,x^{b})}<\alpha\ \forall x\in\mathcal{X}^{f}_{-}\ \&\ \Delta_{\pi}(x^{s},x)>0\ \forall x\in\mathcal{X}^{f}_{0}\ \big\lvert\ x^{b}\Big)\ \mathbbm{1}(x^{s}\in\mathcal{X}^{f})\\
&= p\Big(\max_{x\in\mathcal{X}^{f}_{-}}\frac{\Delta_{\pi}(x^{s},x)}{\Delta_{d}(x^{s},x,x^{b})}<\alpha<\min_{x\in\mathcal{X}^{f}_{+}}\frac{\Delta_{\pi}(x^{s},x)}{\Delta_{d}(x^{s},x,x^{b})}\ \&\ \min_{x\in\mathcal{X}^{f}_{0}}\Delta_{\pi}(x^{s},x)>0\ \big\lvert\ x^{b}\Big)\ \mathbbm{1}(x^{s}\in\mathcal{X}^{f})\\
&= p\big(\{\delta^{l}<\alpha<\delta^{u}\}\ \&\ \{\delta^{0}>0\}\mid x^{b}\big)\ \mathbbm{1}(x^{s}\in\mathcal{X}^{f})\\
&= p\big(\{\delta^{l}<\alpha<\delta^{u}\}\mid x^{b}\big)\ \mathbbm{1}(\delta^{0}>0)\ \mathbbm{1}(x^{s}\in\mathcal{X}^{f}),
\end{aligned}
$$

where the second to last equation follows from our definitions of $\delta^{l},\delta^{u},\delta^{0}$ and because $\alpha$ follows a log-normal distribution. Therefore, being able to evaluate the CDF of $\alpha$ allows the DM to estimate $p_{\pi}(x^{s}\mid t^{b},x^{b})$ and anticipate the strategic covariate shift.
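The derivation above reduces the response probability to a difference of two log-normal CDF values. The following Python sketch illustrates that computation on a toy example. It rests on assumptions that are ours rather than statements from the text: we read $\Delta_{\pi}(x^{s},x)$ as $\pi(x^{s})-\pi(x)$ and $\Delta_{d}(x^{s},x,x^{b})$ as $d(x^{s},x^{b})-d(x,x^{b})$, we treat the feasible set as a finite list, and all function names (`lognormal_cdf`, `alpha_bounds`, `response_probability`) and the toy policy and distance are hypothetical.

```python
import math


def lognormal_cdf(a, mu, sigma):
    """CDF at a of alpha, where ln(alpha) ~ N(mu, sigma^2)."""
    if a <= 0.0:
        return 0.0
    if math.isinf(a):
        return 1.0
    z = (math.log(a) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def alpha_bounds(x_s, feasible, x_b, pi, d):
    """Return (delta_l, delta_u, delta_0) for the chosen covariates x_s."""
    delta_l, delta_u, delta_0 = 0.0, math.inf, math.inf
    for x in feasible:
        if x == x_s:
            continue
        gain = pi(x_s) - pi(x)            # assumed form of Delta_pi
        extra = d(x_s, x_b) - d(x, x_b)   # assumed form of Delta_d
        if extra > 0:                     # x lies in X^f_+
            delta_u = min(delta_u, gain / extra)
        elif extra < 0:                   # x lies in X^f_-
            delta_l = max(delta_l, gain / extra)
        else:                             # x lies in X^f_0
            delta_0 = min(delta_0, gain)
    return delta_l, delta_u, delta_0


def response_probability(x_s, feasible, x_b, pi, d, mu, sigma):
    """p(x_s | e, T^b = 0, x_b) as a log-normal interval probability."""
    if x_s not in feasible:
        return 0.0
    delta_l, delta_u, delta_0 = alpha_bounds(x_s, feasible, x_b, pi, d)
    if delta_0 <= 0.0 or delta_l >= delta_u:
        return 0.0
    return lognormal_cdf(delta_u, mu, sigma) - lognormal_cdf(delta_l, mu, sigma)


# Toy example (hypothetical): one-dimensional covariates, logistic policy,
# squared distance, and ln(alpha) ~ N(0, 1).
def toy_pi(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_d(x, x_b):
    return (x - x_b) ** 2


toy_feasible = [0.0, 1.0, 2.0]
toy_probs = [
    response_probability(x, toy_feasible, 0.0, toy_pi, toy_d, mu=0.0, sigma=1.0)
    for x in toy_feasible
]
```

In this toy example the three intervals partition the positive half-line, so the response probabilities over the feasible set sum to one, which is a convenient sanity check for an implementation.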
④ Non-strategic joint distribution $p_{\pi}(t^{b},x^{b})$. This term is not influenced by the agents’ strategic behavior and can be computed straightforwardly. We discuss the estimation for $p(x^{b})$ in the next section.
The next subsection provides an estimation procedure for the unknown parameters $\theta=(\beta,\beta_{0},\sigma)$ of the cost model.
### 2.3 Learning the Agents’ Cost Model
Given each observation of agents’ strategic behavior, i.e., a tuple $(x^{s},\mathcal{X}^{r},t^{b}=0,x^{b})$, we define the per-observation likelihood contribution as
$$
f(\delta^{l},\delta^{u},x^{b};\theta):=p_{\theta}\Big(\delta^{l}_{(x^{s},e,x^{b})}<\alpha<\delta^{u}_{(x^{s},e,x^{b})}\ \big\lvert\ x^{b}\Big),
$$
if $\delta^{l}_{(x^{s},e,x^{b})}<\delta^{u}_{(x^{s},e,x^{b})}$. When $\delta^{l}_{(x^{s},e,x^{b})}=\delta^{u}_{(x^{s},e,x^{b})}$, we set $f(\delta^{l},\delta^{u},x^{b};\theta):=c$ for some constant $c\in(0,1)$.
Given $n$ observations $\{(x^{s}_{i},e_{i},t^{b}_{i}=0,x^{b}_{i})\}_{i=1}^{n}$ collected from agents who received negative base treatment, i.e., $t^{b}_{i}=0$, under the logging policy $\pi_{0}$, we define the empirical log-likelihood function as
$$
\mathcal{Q}_{n}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\ln f(\delta^{l}_{i},\delta^{u}_{i},x^{b}_{i};\theta),
$$
and denote $\hat{\theta}_{n}:=\arg\max_{\theta\in\Theta}\mathcal{Q}_{n}(\theta)$ as the maximum likelihood estimator of the agents’ cost model. We next state the conditions under which this estimator is consistent.
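Because $\ln\alpha$ is Gaussian given $x^{b}$, each likelihood contribution is a difference of two standard normal CDF values evaluated at $(\ln\delta-\beta^{\top}\phi(x^{b})-\beta_{0})/\sigma$. The sketch below evaluates $\mathcal{Q}_{n}(\theta)$ on simulated interval data. It is a minimal illustration, not the authors’ implementation: the helper names, the fixed cut points used to generate intervals, the feature values in `simulate`, and the choice $c=0.5$ are all assumptions made here for the example.

```python
import bisect
import math
import random


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interval_prob(delta_l, delta_u, mu, sigma):
    """P(delta_l < alpha < delta_u) when ln(alpha) ~ N(mu, sigma^2)."""
    upper = 1.0 if math.isinf(delta_u) else normal_cdf((math.log(delta_u) - mu) / sigma)
    lower = 0.0 if delta_l <= 0.0 else normal_cdf((math.log(delta_l) - mu) / sigma)
    return upper - lower


def avg_log_likelihood(theta, data, c=0.5):
    """Empirical log-likelihood Q_n(theta); data holds (delta_l, delta_u, phi(x_b))."""
    beta, beta0, sigma = theta
    total = 0.0
    for delta_l, delta_u, phi in data:
        if delta_l == delta_u:
            total += math.log(c)  # degenerate interval: constant contribution
            continue
        mu = sum(b * p for b, p in zip(beta, phi)) + beta0
        # Clamp to avoid log(0) from floating-point underflow.
        total += math.log(max(interval_prob(delta_l, delta_u, mu, sigma), 1e-300))
    return total / len(data)


def simulate(theta, n, cuts=(0.25, 1.0, 4.0, 16.0), seed=0):
    """Hypothetical data generator: alpha is only observed up to an interval."""
    rng = random.Random(seed)
    beta, beta0, sigma = theta
    data = []
    for _ in range(n):
        phi = [rng.choice([-1.0, 0.0, 1.0]) for _ in beta]
        mu = sum(b * p for b, p in zip(beta, phi)) + beta0
        alpha = rng.lognormvariate(mu, sigma)
        k = bisect.bisect_left(cuts, alpha)
        delta_l = cuts[k - 1] if k > 0 else 0.0
        delta_u = cuts[k] if k < len(cuts) else math.inf
        data.append((delta_l, delta_u, phi))
    return data


true_theta = ([1.0], 0.5, 1.0)
wrong_theta = ([-1.0], -0.5, 1.0)
sample = simulate(true_theta, n=2000)
q_true = avg_log_likelihood(true_theta, sample)
q_wrong = avg_log_likelihood(wrong_theta, sample)
```

On such simulated data the objective is larger at the generating parameters than at a clearly wrong parameter vector. In practice $\hat{\theta}_{n}$ would be obtained by passing the negative of `avg_log_likelihood` to a numerical optimizer over a compact parameter set.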
###### Assumption 2.3 (Compact parameter space).
The parameter space $\Theta\subset\mathbb{R}^{p}\times\mathbb{R}\times\mathbb{R}^{+}$ is compact.
###### Assumption 2.4 (Finite covariates).
The space of covariate vectors $\mathcal{X}\subset\mathbb{R}^{d}$ is finite.
###### Assumption 2.5 (Weak positivity & full-rank design matrix).
There exists a subset $\mathcal{X}^{b}_{\diamond}\subseteq\mathcal{X}$ of size at least $p+1$ such that
- for each $x^{b}\in\mathcal{X}^{b}_{\diamond}$, there exist two positive values $\delta^{u1}\neq\delta^{u2}$ where the two observations $(0,\delta^{u1},x^{b})$ and $(0,\delta^{u2},x^{b})$ occur with positive probabilities, i.e.,
  $$
  \begin{aligned}
  p\big(\delta^{l}=0,\delta^{u}=\delta^{u1},X^{b}=x^{b}\big)&>0,\\
  p\big(\delta^{l}=0,\delta^{u}=\delta^{u2},X^{b}=x^{b}\big)&>0;
  \end{aligned}
  $$
- the augmented design matrix $\tilde{\Phi}_{x^{b}}$ has full column rank, where we define
  $$
  \tilde{\Phi}_{x^{b}}=\begin{bmatrix}1&\phi(x^{b}_{1})^{\top}\\ \vdots&\vdots\\ 1&\phi(x^{b}_{p+1})^{\top}\end{bmatrix}\qquad\forall x^{b}_{1},\ldots,x^{b}_{p+1}\in\mathcal{X}^{b}_{\diamond}.
  $$

The first part of [Assumption 2.5](https://arxiv.org/html/2606.07308#S2.Thmtheorem5) imposes a weak positivity condition, requiring sufficient variation in the observed strategic responses. The second part is likewise mild, as it depends primarily on the marginal $P_{X^{b}}$ and the mapping $\phi$.
###### Theorem 2.6 (Consistency of $\hat{\theta}_{n}$).
Let $\theta^{*}=(\beta^{*},\beta_{0}^{*},\sigma^{*})$ be the true parameters of agents’ cost model. Under Assumptions [2.3](https://arxiv.org/html/2606.07308#S2.Thmtheorem3)–[2.5](https://arxiv.org/html/2606.07308#S2.Thmtheorem5), and provided that the DM recommends covariate updates with different distances to $x^{b}$ ([Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2)), $\hat{\theta}_{n}\xrightarrow{p}\theta^{*}$.
Based on these results, standard off\-policy evaluation estimators can be derived\. In the following section, we present a doubly robust estimator\.
## Strategy\-Robust Doubly Robust Estimator
In this section, we introduce a doubly robust estimator for off-policy evaluation that explicitly accounts for strategic covariate shift. We refer to this estimator as the strategy-robust doubly robust (SDR) estimator:
$$
\hat{V}_{\text{SDR}}(\pi)=\hat{V}_{\text{S-IPS-res}}(\pi)+\hat{V}_{\text{S-DM}}(\pi),
$$
where $\hat{V}_{\text{S-IPS-res}}$ refers to the (strategy-robust) inverse propensity score (IPS)-based estimate of the residual and $\hat{V}_{\text{S-DM}}$ to the (strategy-robust) direct method estimator of the policy value. In particular, they correspond to
$$
\begin{aligned}
\hat{V}_{\text{S-IPS-res}}(\pi)&=\frac{1}{m}\sum_{i=1}^{m}\big(y_{i}-\hat{\mu}(t^{s}_{i},x^{s}_{i})\big)\frac{\hat{p}_{\pi}(t^{s}_{i},x^{s}_{i},t^{b}_{i}\mid x^{b}_{i})}{\hat{p}_{\pi_{0}}(t^{s}_{i},x^{s}_{i},t^{b}_{i}\mid x^{b}_{i})},\\
\hat{V}_{\text{S-DM}}(\pi)&=\sum_{\mathcal{T}\times\mathcal{X}\times\mathcal{T}\times\mathcal{X}}\hat{\mu}(t^{s},x^{s})\,\hat{p}_{\pi}(t^{s},x^{s},t^{b},x^{b}),
\end{aligned}
$$
where $\hat{\mu}(t^{s},x^{s})$ denotes the estimator of the conditional expected outcome $\mu(t^{s},x^{s}):=\mathbb{E}[Y\mid t^{s},x^{s}]$. The importance weights can also be defined as:
$$
w(t^{s},x^{s},t^{b},x^{b}):=\frac{p_{\pi}(t^{s},x^{s},t^{b}\mid x^{b})}{p_{\pi_{0}}(t^{s},x^{s},t^{b}\mid x^{b})},
$$
whose empirical version $\hat{w}(t^{s},x^{s},t^{b},x^{b})$ can be obtained by replacing $p_{\pi}$ and $p_{\pi_{0}}$ with their empirical estimates $\hat{p}_{\pi}$ and $\hat{p}_{\pi_{0}}$, respectively.
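The two terms of the SDR estimator can be sketched in a few lines of Python. The example below is illustrative only: the function `sdr_estimate` and its argument names are hypothetical, and the toy check uses a degenerate setting without strategic shift (so $x^{s}=x^{b}$ and $t^{s}=t^{b}$) in which the true policy value, $1.8$, can be computed by hand.

```python
def sdr_estimate(logged, support, mu_hat, p_cond_pi, p_cond_pi0, p_joint_pi):
    """SDR = IPS-weighted residual term + direct-method term.

    logged:  list of (y, t_s, x_s, t_b, x_b) tuples from the logging policy
    support: iterable of all (t_s, x_s, t_b, x_b) combinations
    p_cond_*:   p(t_s, x_s, t_b | x_b) under the evaluation / logging policy
    p_joint_pi: p(t_s, x_s, t_b, x_b) under the evaluation policy
    """
    ips_res = 0.0
    for y, t_s, x_s, t_b, x_b in logged:
        weight = p_cond_pi(t_s, x_s, t_b, x_b) / p_cond_pi0(t_s, x_s, t_b, x_b)
        ips_res += (y - mu_hat(t_s, x_s)) * weight
    ips_res /= len(logged)
    dm = sum(mu_hat(t_s, x_s) * p_joint_pi(t_s, x_s, t_b, x_b)
             for t_s, x_s, t_b, x_b in support)
    return ips_res + dm


# Toy check (hypothetical): binary covariate with p(x) = 0.5, binary treatment.
def outcome(t, x):
    return 1.0 + 2.0 * t * x


def pi_eval(t, x):
    p_treat = 0.8 if x == 1 else 0.2
    return p_treat if t == 1 else 1.0 - p_treat


toy_support = [(t, x, t, x) for t in (0, 1) for x in (0, 1)]
# One logged point per cell: exact proportions under a uniform logging policy.
toy_logged = [(outcome(t, x), t, x, t, x) for t, x, _, _ in toy_support]

common = dict(
    p_cond_pi=lambda t_s, x_s, t_b, x_b: pi_eval(t_b, x_b),
    p_cond_pi0=lambda t_s, x_s, t_b, x_b: 0.5,
    p_joint_pi=lambda t_s, x_s, t_b, x_b: 0.5 * pi_eval(t_b, x_b),
)
v_wrong_outcome_model = sdr_estimate(toy_logged, toy_support, lambda t, x: 0.0, **common)
v_right_outcome_model = sdr_estimate(toy_logged, toy_support, outcome, **common)
```

With correct weights, the estimate matches the true value even when the outcome model is badly wrong (here, identically zero), because the residual term absorbs the error of the direct-method term. With the correct outcome model the residual term vanishes.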
###### Assumption 3.1 (Overlap).
Given the logging policy $\pi_{0}$ and the evaluation policy $\pi$, we assume that $p_{\pi}(t^{s},x^{s},t^{b}\mid x^{b})>0$ implies $p_{\pi_{0}}(t^{s},x^{s},t^{b}\mid x^{b})>0$, for all values $(t^{s},x^{s},t^{b},x^{b})\in\mathcal{T}\times\mathcal{X}\times\mathcal{T}\times\mathcal{X}$.
The density $\hat{p}(x^{b})$ can be estimated via empirical frequency, since the space $\mathcal{X}$ is finite. Under the i.i.d. sampling assumption, this estimator is consistent by the law of large numbers. We next establish the double robustness property of $\hat{V}_{\text{SDR}}$: the estimator is consistent if either the outcome model $\hat{\mu}(x^{s},t^{s})$ is consistent or [Assumption 3.1](https://arxiv.org/html/2606.07308#S3.Thmtheorem1) holds.
###### Theorem 3.2 (Consistency of $\hat{V}_{\text{SDR}}$).
Suppose that the estimator $\hat{p}_{\pi}(t^{s},x^{s},t^{b},x^{b})$ is consistent for any $\pi$, the nuisance components (i.e., $\hat{w}$, $\hat{\mu}$) are estimated independently, and the samples for $\hat{V}_{\text{S-IPS-res}}$ are collected separately. Then $\hat{V}_{\text{SDR}}$ is consistent if either the outcome model $\hat{\mu}(x^{s},t^{s})$ is consistent or [Assumption 3.1](https://arxiv.org/html/2606.07308#S3.Thmtheorem1) (overlap) holds.
## Experiments
We conduct experiments on a synthetic dataset and a real-world dataset to verify our theoretical results. Firstly, we show that as the sample size increases, our estimates for (i) the parameters of the agents’ cost model and for (ii) the policy value converge in probability towards the ground-truth, reflecting the consistency results in [Theorems 2.6](https://arxiv.org/html/2606.07308#S2.Thmtheorem6) and [3.2](https://arxiv.org/html/2606.07308#S3.Thmtheorem2). Secondly, we show that a standard doubly robust approach, without correct adjustment for the strategic covariate shift, will produce incorrect policy value estimates. This demonstrates the importance of correctly adjusting for the strategic covariate shift. Thirdly, we investigate a scenario where there are some agents who behave differently from our assumed behavioral model in [Section 2.1](https://arxiv.org/html/2606.07308#S2.SS1). This is to understand how such deviation from our behavioral assumption can bias the policy value estimates.
### 4.1 Synthetic Data
We generate $N\geq 1000$ agents with 2-dimensional observable feature vectors $x^{b}\in\mathcal{X}=\{-10,\ldots,10\}^{2}\subset\mathbb{Z}^{2}$. Each agent has a cost function of the form $c(x,x^{b}):=0.05\times\alpha\|x-x^{b}\|_{2}^{2}$ where $\ln\alpha\sim\mathcal{N}(\beta^{\top}\phi(x^{b})+\beta_{0},\ \sigma^{2})$. The true parameters are $\beta^{*}=[1.0,1.2]^{\top}$, $\beta_{0}^{*}=0.5$, and $\sigma^{*}=1.0$. In addition, any strategic movement outside of the space $\mathcal{X}$ results in an infinite cost. The outcome function is $h(x^{s},t^{s}):=([5,5]^{\top}x^{s})t^{s}+5$. The DM aims to estimate the conditional expected outcome $\mathbb{E}[Y\mid t^{s},x^{s}]$ via a predictive model $\hat{\mu}:\mathcal{T}\times\mathcal{X}\to\mathcal{Y}$.
For a logistic function $g(a):=1/(1+\exp(-a))$, we consider the two logging policies $\pi_{0}^{\text{strict}}$ and $\pi_{0}^{\text{lax}}$ such that $\pi_{0}^{\text{strict}}(x)=g([4,4]^{\top}x)$ and $\pi_{0}^{\text{lax}}(x)=g([1,1]^{\top}x)$, where $\pi_{0}^{\text{strict}}$ has stronger discriminative power (i.e., it is closer to a deterministic policy). These two logging policies induce two different logged datasets $D_{\pi_{0}^{\text{strict}}}$ and $D_{\pi_{0}^{\text{lax}}}$. We use the former to evaluate our SDR estimator when there is limited overlap, i.e., when [Assumption 3.1](https://arxiv.org/html/2606.07308#S3.Thmtheorem1) (overlap) is close to being violated, and the latter to evaluate it when $\hat{\mu}$ is misspecified.
The DM uses a deterministic explanation policy $\tau$ such that any agent with base covariates $x^{b}$ receives $\mathcal{X}^{r}=\{x^{b}+[0,1]^{\top},\ x^{b}+[1,3]^{\top},\ x^{b}+[1,4]^{\top}\}$. The DM’s goal is to estimate the value of the new treatment policy $\pi$ where $\pi(x)=g([1,1]^{\top}x-1)$.
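The synthetic policies and the explanation policy described above are simple enough to write down directly. The following Python sketch does so; the function and constant names are ours and purely illustrative, and the sketch does not model the agents’ responses.

```python
import math


def g(a):
    """Logistic function g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def pi0_strict(x):
    return g(4 * x[0] + 4 * x[1])


def pi0_lax(x):
    return g(x[0] + x[1])


def pi_eval(x):
    return g(x[0] + x[1] - 1)


# Offsets used by the deterministic explanation policy tau.
OFFSETS = [(0, 1), (1, 3), (1, 4)]


def recommendations(x_b):
    """Recommended covariate updates X^r for an agent with base covariates x_b."""
    return [(x_b[0] + dx, x_b[1] + dy) for dx, dy in OFFSETS]


def in_space(x, low=-10, high=10):
    """Moves outside {-10, ..., 10}^2 carry infinite cost in the synthetic setup."""
    return all(low <= v <= high for v in x)


squared_distances = [dx * dx + dy * dy for dx, dy in OFFSETS]
```

The three offsets have squared Euclidean distances 1, 10, and 17 from $x^{b}$, so the recommended updates lie at different distances from the base covariates. At any point with a nonzero coordinate sum, `pi0_strict` is further from 0.5 than `pi0_lax`, which is what makes overlap with the evaluation policy harder to satisfy in practice.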
Baseline & evaluation. We use our SDR estimator, with *mis-specified* cost model’s parameters $\theta$, as the baseline (i.e., wrong adjustment for the covariate shift). In particular, we set the parameters as $\beta=[1.5,0.8]^{\top}$, $\beta_{0}=0.2$, and $\sigma=0.7$. We repeat the experiment multiple times, while increasing the number of agents $N$ from $1000$ to $11000$ with a step size of $500$. For each $N$, we randomly generate a logged dataset of size $N$, then estimate the parameters of agents’ cost model using the maximum likelihood approach and compute the SDR estimator. We repeat $30$ times for each choice of $N$ to collect $30$ noisy estimates.
Figure 2: The plots show the consistency of the estimated parameters of the cost model (on the synthetic dataset). Each line shows the median of the noisy estimates. Each shaded region captures the estimates between the 25% and 75% quantiles.

(a) Despite limited overlap, our SDR estimator remains consistent, whereas the IPS-based estimator fails.
(b) When $\hat{\mu}$ is misspecified, our SDR estimator remains consistent, whereas the DM estimator fails.
(c) The baseline SDR approach produces incorrect estimates when it assumes wrong agents’ strategic behavior.

Figure 3: Illustrations of the differences between the estimates for policy value and the ground-truth (on the synthetic dataset). Each line shows the median of the errors of the noisy estimates. Each shaded region captures the errors within the 25% and 75% quantiles.

Result 1: All agents behave rationally. [Figure 2](https://arxiv.org/html/2606.07308#S4.F2) shows the convergence of the estimates for the log-normal distribution of $\alpha$. The shrinkage of shaded regions implies the concentration of the noisy estimates around the ground-truth values, as the dataset size increases. This demonstrates the consistency of the estimated parameter vector $\hat{\theta}$. [Figure 3](https://arxiv.org/html/2606.07308#S4.F3) shows the convergence of our SDR estimator in several cases. Similarly, the shrinkage of the shaded regions shows the concentration of the noisy estimates around the ground-truth. This demonstrates the consistency of our SDR estimator, while the baseline produces incorrect estimates due to a wrong assumption about agents’ behavior.


Figure 4: Similar to [Figure 3](https://arxiv.org/html/2606.07308#S4.F3), these plots show the errors of the noisy policy-value estimates, but when there are $1000$ agents who behave irrationally (on the synthetic dataset). This demonstrates that although such irrational behaviour biases our estimates, the bias tends to zero as we obtain more samples from rational agents.

Result 2: Some agents behave irrationally. We also test the robustness of our approach when agents behave differently from our assumed model in [Section 2.1](https://arxiv.org/html/2606.07308#S2.SS1). In particular, we fix the number of “irrational” agents to 1000 and observe how the bias introduced by such irrationality goes down as the dataset size increases. For any agent who fails to modify their base covariate vector $x^{b}$ to any recommended value $x^{r}$, this agent may choose an arbitrary value $x$ drawn uniformly from the covariate space $\mathcal{X}$. [Figure 4](https://arxiv.org/html/2606.07308#S4.F4) shows the convergence of our SDR estimator when there are $1000$ irrational agents. As the dataset size increases, our estimator converges because the samples from rational agents dominate. Since, in our LID setting, the DM can distinguish between agents’ rational and irrational responses (i.e., by checking if $x^{s}\in\mathcal{X}^{f}$), our MLE procedure (for estimating the agents’ cost model) simply discards these data points from irrational agents. This irrational behaviour is correlated with the cost sensitivity of agents, hence creating selection bias when the data is discarded.
### 4.2 German Credit Data
We preprocess the German credit dataset (Hofmann, [1994](https://arxiv.org/html/2606.07308#bib.bib37)) following Xie and Zhang ([2024](https://arxiv.org/html/2606.07308#bib.bib45)). In particular, we exclude two sensitive attributes and retain 18 features, among which 8 are considered modifiable by strategic agents. The original dataset contains $1000$ samples. We use CTGAN (Xu et al., [2019](https://arxiv.org/html/2606.07308#bib.bib46)) to generate $200000$ additional samples for our experiments. We use covariates from the German Credit dataset and simulate agents’ strategic behavior and outcomes according to our structural model.
The outcome model. Although the German Credit dataset provides a binary credit-risk label, our target outcome is the bank’s realized profit, which is naturally continuous-valued. Accordingly, we define the synthetic outcome function as $Y:=10((\vec{1})^{\top}X^{s})T+5$, where the first term represents treatment-dependent loan profit and the constant term represents a processing fee. We evaluate the SDR estimator under both correctly specified and misspecified outcome models $\hat{\mu}$. In particular, a correctly specified model has the multiplicative form $\hat{\mu}(t^{s},x^{s})=\zeta_{1}^{\top}x^{s}+\zeta_{2}^{\top}t^{s}+\zeta_{3}((\vec{1})^{\top}x^{s})t^{s}+\zeta_{0}$, while the misspecified model omits the interaction term.
DM-agents interactions. To set up the logging policy $\pi_{0}$ and evaluation policy $\pi$, we first fit a logistic regression model on the original German Credit binary labels and use the fitted model as a base scoring function. We then scale its coefficients to obtain policies with different levels of selectiveness. Suppose that the base policy is $\pi_{\text{base};\eta}(x)=1/(1+\exp(-(\eta_{0}+\eta_{1}^{\top}x)))$ where $\eta=(\eta_{0},\eta_{1})$. Then, the logging and evaluation policies are respectively $\pi_{0}:=\pi_{\text{base};0.1\eta}$ and $\pi:=\pi_{\text{base};2\eta}$.
To operationalize ARexes, we let the DM generate all possible candidates for feature updates $x^{r}$ whose Hamming distance (to the base covariate vector) is equal to one, i.e., $d_{\text{Hamming}}(x^{r},x^{b})=1$. Then, for each agent, the DM gives the top 3 recommendations $x^{r}$ with the highest weighted scores $\pi(x^{r})/d(x^{r},x^{b})$. Such recommendations aim to balance the benefit and the (base) effort necessary for an agent to adapt. For example, a bank might recommend a change in a customer’s profile that benefits the customer without incurring high cost for them.
The agents’ behavioral model. Similar to Vo et al. ([2026](https://arxiv.org/html/2606.07308#bib.bib44)), we set up an agent’s cost function as
$$
c(x,x^{b})=\alpha\,d(x,x^{b})=0.001\,\alpha\sum_{i\in\mathcal{I}}\frac{|x_{i}-x^{b}_{i}|}{|x_{i}^{U}-x_{i}^{L}|},
$$
where $\mathcal{I}$ is the set of indices of the 8 modifiable features and $[x_{i}^{L},x_{i}^{U}]$ denotes the valid range of a feature $x_{i}$. Any change in non-modifiable features incurs infinite cost.
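The range-normalized distance and the top-3 recommendation rule described above can be sketched as follows. This is a minimal illustration under assumptions of ours: covariate vectors are tuples, the set of candidate values per modifiable feature is given explicitly, and the names `base_distance`, `top_recommendations`, and the toy policy are hypothetical. It is not the preprocessing or policy used in the experiments.

```python
import math


def base_distance(x, x_b, modifiable, ranges, scale=0.001):
    """d(x, x_b): range-normalized L1 distance over the modifiable features."""
    return scale * sum(
        abs(x[i] - x_b[i]) / abs(ranges[i][1] - ranges[i][0]) for i in modifiable
    )


def top_recommendations(x_b, candidate_values, ranges, policy, k=3):
    """Top-k single-feature updates ranked by policy(x_r) / d(x_r, x_b)."""
    modifiable = sorted(candidate_values)
    scored = []
    for i in modifiable:
        for value in candidate_values[i]:
            if value == x_b[i]:
                continue  # Hamming distance must be exactly one
            x_r = list(x_b)
            x_r[i] = value
            x_r = tuple(x_r)
            score = policy(x_r) / base_distance(x_r, x_b, modifiable, ranges)
            scored.append((score, x_r))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [x_r for _, x_r in scored[:k]]


# Toy example (hypothetical): features 0 and 1 are modifiable, feature 2 is not.
def toy_policy(x):
    return 1.0 / (1.0 + math.exp(-(0.5 * x[0] + 0.2 * x[1] - 1.0)))


toy_x_b = (1, 5, 0)
toy_candidates = {0: [0, 1, 2, 3], 1: [4, 5, 6]}
toy_ranges = {0: (0, 3), 1: (0, 10)}
toy_recs = top_recommendations(toy_x_b, toy_candidates, toy_ranges, toy_policy)
```

Dividing by the base distance favors updates that raise the treatment probability at low effort. Because only modifiable features are enumerated, the infinite cost of changing a non-modifiable feature never has to be represented explicitly.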
Recall that $\ln\alpha\sim\mathcal{N}(\beta^{\top}\phi(x^{b})+\beta_{0},\ \sigma^{2})$. We apply PCA on the semi-synthetic dataset containing the agents’ base covariate vectors $x^{b}$, extract the top 4 principal components, and use them to construct a PCA transformation $\phi:\mathcal{X}\to\mathbb{R}^{4}$. We set the true parameters to $\beta^{*}=[0.25,0.25,0.25,0.25]^{\top}$, $\beta_{0}^{*}=0.5$, and $\sigma^{*}=1.0$.
Estimation & evaluation. To account for continuous features, we use Monte Carlo approximation to compute the SDR (policy-value) estimate. In particular, we do not fit a model to predict $p(x^{b})$, but compute only $\hat{p}(t^{s},x^{s},t^{b}\mid x^{b})$ and use it to weight the data points when computing the SDR estimate. We use our SDR estimator, with *mis-specified* cost model’s parameters $\theta$, as the baseline. In particular, we set the cost model’s parameters as $\beta=[0.225,0.225,0.225,0.225]^{\top}$, $\beta_{0}=0.2$, and $\sigma=0.7$. We repeat the experiment multiple times, while increasing the number of agents $N$ from $9500$ to $190000$ with a step size of $9500$. Similar to the synthetic data case, for each $N$, we repeat the experiment $30$ times to obtain $30$ noisy estimates.
(a) The errors of our estimates $\hat{\theta}$ decay as the sample size increases.
(b) Baseline SDR gives incorrect estimates under the wrong behavioral assumption.
(c) Our SDR estimator is consistent when $\hat{\mu}$ is misspecified, unlike the DM estimator.

Figure 5: These plots illustrate the convergence of our estimators on the German credit dataset. Each line shows the median of the errors of the noisy estimates. Each shaded region captures the errors within the 25% and 75% quantiles.

[Figure 5](https://arxiv.org/html/2606.07308#S4.F5) shows the convergence of our estimate $\hat{\theta}$ of the log-normal distribution’s parameters and the convergence of our SDR estimator. The shrinkage of shaded regions implies the concentration of the noisy estimates around the ground-truth values, as the dataset size increases.
## Related Work
(Off-)policy evaluation and policy learning. Existing work on OPE under covariate shift focuses on exogenous shifts, where changes in the covariate distribution do not arise in response to the policy being evaluated. For example, Uehara et al. ([2020](https://arxiv.org/html/2606.07308#bib.bib40)) and Guo et al. ([2024](https://arxiv.org/html/2606.07308#bib.bib11)) study settings in which the test distribution of covariates is assumed to be known, while Kallus et al. ([2022](https://arxiv.org/html/2606.07308#bib.bib18)) assume the test distribution of covariates falls within a pre-specified set. These assumptions allow the evaluation problem to condition on a fixed or externally specified target distribution. In contrast, when covariate shifts are policy-dependent, the test distribution varies with the policy being evaluated and is therefore neither known nor constrained to a policy-invariant set. As a result, existing approaches, which rely on exogenous specification of the target distribution, are not suitable in this setting. Another line of work studies optimizing decision policies and typically deals with strategic behavior through repeated interactions (Perdomo et al., [2020](https://arxiv.org/html/2606.07308#bib.bib31); Munro, [2025](https://arxiv.org/html/2606.07308#bib.bib29); Chen et al., [2024](https://arxiv.org/html/2606.07308#bib.bib6); Perdomo, [2025](https://arxiv.org/html/2606.07308#bib.bib32)). While leveraging repeated online interactions can help adapt policies or infer agents’ responses over time, this is often infeasible in high-stakes settings where experiments are costly or ethically constrained. This makes such methods unsuitable for OPE.
Strategic machine learning. A prominent line of work that examines strategic behavior comes from the strategic classification literature and its variants. While many of these works rely on repeated online interactions (Shavit et al., [2020](https://arxiv.org/html/2606.07308#bib.bib36); Harris et al., [2022b](https://arxiv.org/html/2606.07308#bib.bib15); Horowitz and Rosenfeld, [2023](https://arxiv.org/html/2606.07308#bib.bib16); Vo et al., [2024](https://arxiv.org/html/2606.07308#bib.bib43); Xie and Zhang, [2024](https://arxiv.org/html/2606.07308#bib.bib45)), several consider offline settings (Hardt et al., [2016](https://arxiv.org/html/2606.07308#bib.bib13); Levanon and Rosenfeld, [2021](https://arxiv.org/html/2606.07308#bib.bib22); Rosenfeld and Rosenfeld, [2024](https://arxiv.org/html/2606.07308#bib.bib34)). However, these works in offline settings typically assume that agents’ responses to policy changes can be modeled precisely, commonly formalized through a single and known cost function. This strong assumption limits the applicability of their approaches to OPE. Closest to our work in the spirit of relaxing such a strong assumption is the work of Rosenfeld and Rosenfeld ([2024](https://arxiv.org/html/2606.07308#bib.bib34)), which considers a one-shot setting without knowing the exact agents’ cost function. However, there are two main distinctions. First, their approach relies on an externally specified set of feasible cost functions, which cannot capture the heterogeneity of agents’ behavior. Second, their focus is on optimizing the classifier for the worst-case scenario, while our focus is on estimating the performance of a policy.
Contract design. The information loss induced by global disclosure is closely related to classical problems of information asymmetry studied in economics, such as adverse selection in contract design (Rothschild and Stiglitz, [1976](https://arxiv.org/html/2606.07308#bib.bib35); Bolton and Dewatripont, [2004](https://arxiv.org/html/2606.07308#bib.bib5)). While our setting differs substantially from standard economic models in both objectives and setup, a common insight is that the principal’s choice of how interactions are structured plays a central role in mitigating information asymmetry. In particular, screening models (Baron and Myerson, [1982](https://arxiv.org/html/2606.07308#bib.bib2)) emphasize that a principal can deliberately design an interaction scheme so that agents’ responses reveal information that would otherwise remain unobserved. Our perspective draws on this high-level idea: local information disclosure represents a design choice that structures interactions so that agents’ pre-strategic covariates are observed by the DM, thereby preserving information that is lost under global disclosure.
Recourse and explanation design. Our use of the action recommendation-based explanation (ARex) framework (Vo et al., [2026](https://arxiv.org/html/2606.07308#bib.bib44)) as a local disclosure mechanism is related to the algorithmic recourse and counterfactual explanations literature, which studies how decision makers can provide individuals with actionable feedback to achieve a desired decision outcome (see Section [2](https://arxiv.org/html/2606.07308#S2) and Karimi et al. ([2021](https://arxiv.org/html/2606.07308#bib.bib19)); Wachter et al. ([2018](https://arxiv.org/html/2606.07308#bib.bib47))). Several works model how releasing such feedback can induce strategic responses, creating feedback loops between the decision rule and the observed covariates (Tsirtsis and Gomez Rodriguez, [2020](https://arxiv.org/html/2606.07308#bib.bib39)) and potentially rendering explanations *performative* (König et al., [2026](https://arxiv.org/html/2606.07308#bib.bib21)). Recent work also stresses that recourse should be *set-valued*: offering multiple counterfactuals can better serve heterogeneous user preferences (Mothilal et al., [2020](https://arxiv.org/html/2606.07308#bib.bib27)).
## Conclusion and Discussion
Strategic behavior by agents gives rise to a policy-dependent covariate shift, which undermines existing OPE approaches that rely on a fixed or externally specified covariate distribution. While agents’ responses can be anticipated precisely with full knowledge of their behavioral model, such an assumption rarely holds in practice. Our work is not only among the first to examine the problem of OPE under strategic behavior, but also proposes an approach that does not assume full knowledge of the agents’ behavioral model, unlike much of the related work in strategic machine learning.
In summary, we extend the ARex framework (Vo et al., [2026](https://arxiv.org/html/2606.07308#bib.bib44)) specifically to the task of OPE under strategic behavior. We then propose an estimation procedure to learn the parameters of agents’ cost model. Finally, we construct a strategy-robust doubly robust estimator and prove its consistency. Beyond these technical contributions, our work draws attention to the issue of information asymmetry when GID is assumed in the presence of strategic behaviour. When agents modify their features strategically in response to the DM’s policy, two key challenges emerge.
The first challenge is the policy-dependent covariate shift, which the DM must account for to accurately estimate policy performance. In many applications, LID is a design choice that can be leveraged to elicit additional information from strategic agents. We demonstrate this using ARexes. More broadly, this suggests a new approach for mitigating information asymmetry in learning problems, enabling the DM to relax strong assumptions about agent behavior.
The second challenge is the breakdown of unconfoundedness, a standard assumption in OPE. While treatment assignment $T$ may be unconfounded with the outcome $Y$ given covariates $X$ in non-strategic settings, this assumption can fail under strategic behavior (see [Section 2.2](https://arxiv.org/html/2606.07308#S2.SS2)). Consequently, $\mathbb{E}[Y\mid t,x]$ may no longer be identifiable. Although not the main focus, our framework can be extended to infer latent variables governing strategic adaptations $X^{s}$, analogous to the approach in [Section 2.3](https://arxiv.org/html/2606.07308#S2.SS3). Conditioning on such latent information may help block confounding paths between $X^{s}$ and $Y$. Crucially, this becomes possible through the use of LID to reveal additional information about agents.
## Acknowledgments
We sincerely thank the members of the Rational Intelligence (RI) Lab, including Anurag Singh, Joseph Sheils, and Monseej Purkayastha for their insightful discussions, constructive feedback, and invaluable contributions to this work. We also thank the anonymous reviewers for their valuable feedback to improve our work. Kiet Q. H. Vo is a doctoral candidate at Saarland University.
## References
- D. P. Baron and R. B. Myerson (1982) Regulating a monopolist with unknown costs. Econometrica 50(4), pp. 911–930. Cited by: [§5](https://arxiv.org/html/2606.07308#S5.p3.1).
- P. Bolton and M. Dewatripont (2004) Contract theory. MIT press. Cited by: [§5](https://arxiv.org/html/2606.07308#S5.p3.1).
- Q. Chen, Y. Chen, and B. Li (2024) Practical performative policy learning with strategic agents. arXiv preprint arXiv:2412.01344. Cited by: [§5](https://arxiv.org/html/2606.07308#S5.p1.1).
- L. Cohen, S. Sharifi-Malvajerdi, K. Stang, A. Vakilian, and J. Ziani (2024) Bayesian strategic classification. Advances in Neural Information Processing Systems 37, pp. 111649–111678. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p5.1).
- R. Ebrahimi, K. Vaccaro, and P. Naghizadeh (2025) The double-edged sword of behavioral responses in strategic classification: theory and user studies. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pp. 868–886. Cited by: [§B.1](https://arxiv.org/html/2606.07308#A2.SS1.p2.1).
- Y. Guo, H. Liu, Y. Yue, and A. Liu (2024) Distributionally robust policy evaluation under general covariate shift in contextual bandits. Transactions on Machine Learning Research. ISSN 2835-8856. [Link](https://openreview.net/forum?id=R7PReNELww). Cited by: [§5](https://arxiv.org/html/2606.07308#S5.p1.1).
- M. A. Hamburg and F. S. Collins (2010) The path to personalized medicine. New England Journal of Medicine 363(4), pp. 301–304. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p1.1).
- M. Hardt, N. Megiddo, C. Papadimitriou, and M. Wootters (2016) Strategic classification. In Proceedings of the 2016 ACM conference on innovations in theoretical computer science, pp. 111–122. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p2.1), [§1](https://arxiv.org/html/2606.07308#S1.p3.1), [§5](https://arxiv.org/html/2606.07308#S5.p2.1).
- K. Harris, V. Chen, J. Kim, A. Talwalkar, H. Heidari, and S. Z. Wu (2022a) Bayesian persuasion for algorithmic recourse. Advances in Neural Information Processing Systems 35, pp. 11131–11144. Cited by: [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p1.1), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p4.10).
- K. Harris, D. D. T. Ngo, L. Stapleton, H. Heidari, and S. Wu (2022b) Strategic instrumental variable regression: recovering causal relationships from strategic responses. In International Conference on Machine Learning, pp. 8502–8522. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p5.1), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p1.1), [§5](https://arxiv.org/html/2606.07308#S5.p2.1).
- H. Hofmann (1994) Statlog (German Credit Data). UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5NC77. Cited by: [§4.2](https://arxiv.org/html/2606.07308#S4.SS2.p1.2).
- G. Horowitz and N. Rosenfeld (2023) Causal strategic classification: a tale of two shifts. In International Conference on Machine Learning, pp. 13233–13253. Cited by: [§5](https://arxiv.org/html/2606.07308#S5.p2.1).
- T. Joachims, B. London, Y. Su, A. Swaminathan, and L. Wang (2021) Recommendations as treatments. AI Magazine 42(3), pp. 19–30. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p1.1).
- N. Kallus, X. Mao, K. Wang, and Z. Zhou (2022) Doubly robust distributionally robust off-policy evaluation and learning. In International Conference on Machine Learning, pp. 10598–10632. Cited by: [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p3.4), [§2.2](https://arxiv.org/html/2606.07308#S2.SS2.p3.6), [§5](https://arxiv.org/html/2606.07308#S5.p1.1).
- A. Karimi, B. Schölkopf, and I. Valera (2021) Algorithmic recourse: from counterfactual explanations to interventions. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp. 353–362. Cited by: [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p4.10), [§5](https://arxiv.org/html/2606.07308#S5.p4.1).
- N. Kilbertus, M. G. Rodriguez, B. Schölkopf, K. Muandet, and I. Valera (2020) Fair decisions despite imperfect predictions. In International Conference on Artificial Intelligence and Statistics, pp. 277–287. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p1.1), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p3.4).
- G. König, H. Fokkema, T. Freiesleben, C. Mendler-Dünner, and U. Luxburg (2026) Performative validity of recourse explanations. Advances in Neural Information Processing Systems 38, pp. 139334–139370. Cited by: [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p4.10), [§5](https://arxiv.org/html/2606.07308#S5.p4.1).
- S. Levanon and N. Rosenfeld (2021) Strategic classification made practical. In International Conference on Machine Learning, pp. 6243–6253. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p3.1), [§5](https://arxiv.org/html/2606.07308#S5.p2.1).
- T. Mandel, Y. Liu, S. Levine, E. Brunskill, and Z. Popovic (2014) Offline policy evaluation across representations with applications to educational games. In AAMAS, Vol. 1077. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p1.1).
- R. K. Mothilal, A. Sharma, and C. Tan (2020) Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pp. 607–617. Cited by: [§5](https://arxiv.org/html/2606.07308#S5.p4.1).
- E. Munro (2025) Treatment allocation with strategic agents. Management Science 71(1), pp. 123–145. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p5.1), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p3.4), [§2.2](https://arxiv.org/html/2606.07308#S2.SS2.p3.6), [§5](https://arxiv.org/html/2606.07308#S5.p1.1).
- S. A. Murphy (2003) Optimal dynamic treatment regimes. Journal of the Royal Statistical Society Series B: Statistical Methodology 65(2), pp. 331–355. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p1.1).
- W. K. Newey and D. McFadden (1994) Large sample estimation and hypothesis testing. Handbook of econometrics 4, pp. 2111–2245. Cited by: [§D.1](https://arxiv.org/html/2606.07308#A4.SS1.12.p1.1), [§D.1](https://arxiv.org/html/2606.07308#A4.SS1.4.p1.1), [§D.1](https://arxiv.org/html/2606.07308#A4.SS1.8.p1.1).
- J. C. Perdomo (2025) Revisiting the predictability of performative, social events. In Forty-second International Conference on Machine Learning. [Link](https://openreview.net/forum?id=Q4yzASDktN). Cited by: [§5](https://arxiv.org/html/2606.07308#S5.p1.1).
- J. Perdomo, T. Zrnic, C. Mendler-Dünner, and M. Hardt (2020) Performative prediction. In International Conference on Machine Learning, pp. 7599–7609. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p2.1), [§5](https://arxiv.org/html/2606.07308#S5.p1.1).
- E. Rosenfeld and N. Rosenfeld (2024) One-shot strategic classification under unknown costs. In Proceedings of the 41st International Conference on Machine Learning, pp. 42719–42741. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p3.1), [§5](https://arxiv.org/html/2606.07308#S5.p2.1).
- M. Rothschild and J. Stiglitz (1976) Equilibrium in competitive insurance markets: an essay on the economics of imperfect information. The Quarterly Journal of Economics 90(4), pp. 629–649. [Document](https://dx.doi.org/10.2307/1885326). Cited by: [§5](https://arxiv.org/html/2606.07308#S5.p3.1).
- Y. Shavit, B. Edelman, and B. Axelrod (2020) Causal strategic linear regression. In International Conference on Machine Learning, pp. 8676–8686. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p5.1), [§5](https://arxiv.org/html/2606.07308#S5.p2.1).
- J. E. Stiglitz and A. Weiss (1981) Credit rationing in markets with imperfect information. The American economic review 71(3), pp. 393–410. Cited by: [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p3.4).
- S. Tsirtsis and M. Gomez Rodriguez (2020) Decisions, counterfactual explanations and strategic behavior. Advances in Neural Information Processing Systems 33, pp. 16749–16760. Cited by: [§B.1](https://arxiv.org/html/2606.07308#A2.SS1.p1.1), [§1](https://arxiv.org/html/2606.07308#S1.p2.1), [§1](https://arxiv.org/html/2606.07308#S1.p4.1), [§1](https://arxiv.org/html/2606.07308#S1.p5.1), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p1.1), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p5.4), [§5](https://arxiv.org/html/2606.07308#S5.p4.1).
- M. Uehara, M. Kato, and S. Yasui (2020) Off-policy evaluation and learning for external validity under a covariate shift. Advances in Neural Information Processing Systems 33, pp. 49–61. Cited by: [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p3.4), [§2.2](https://arxiv.org/html/2606.07308#S2.SS2.p3.6), [§5](https://arxiv.org/html/2606.07308#S5.p1.1).
- M. Uehara, C. Shi, and N. Kallus (2022) A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:2212.06355. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p1.1).
- A. W. Van der Vaart (2000) Asymptotic statistics. Vol. 3, Cambridge university press. Cited by: [§D.2](https://arxiv.org/html/2606.07308#A4.SS2.p2.4), [§F.1](https://arxiv.org/html/2606.07308#A6.SS1.p2.4), [§F.2](https://arxiv.org/html/2606.07308#A6.SS2.p4.5).
- K. Q. Vo, M. Aadil, S. L. Chau, and K. Muandet (2024) Causal strategic learning with competitive selection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 15411–15419. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p2.1), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p1.1), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p3.4), [§2.2](https://arxiv.org/html/2606.07308#S2.SS2.p3.6), [§5](https://arxiv.org/html/2606.07308#S5.p2.1).
- K. Q. Vo, S. L. Chau, M. Kato, Y. Wang, and K. Muandet (2026) Explanation design in strategic learning: sufficient explanations that induce non-harmful responses. In The 29th International Conference on Artificial Intelligence and Statistics. Cited by: [§B.1](https://arxiv.org/html/2606.07308#A2.SS1.p1.1), [§B.1](https://arxiv.org/html/2606.07308#A2.SS1.p2.1), [1st item](https://arxiv.org/html/2606.07308#S1.I1.i1.p1.1), [§1](https://arxiv.org/html/2606.07308#S1.p4.1), [§1](https://arxiv.org/html/2606.07308#S1.p5.1), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p1.1), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p10.7), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p4.10), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p5.4), [§2.1](https://arxiv.org/html/2606.07308#S2.SS1.p6.3), [§4.2](https://arxiv.org/html/2606.07308#S4.SS2.p5.4), [§5](https://arxiv.org/html/2606.07308#S5.p4.1), [§6](https://arxiv.org/html/2606.07308#S6.p2.1).
- S. Wachter, B. Mittelstadt, and C. Russell (2018) Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harvard Journal of Law & Technology 31(2), pp. 841–887. Cited by: [§5](https://arxiv.org/html/2606.07308#S5.p4.1).
- T. Xie and X. Zhang (2024) Non-linear welfare-aware strategic learning. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 7, pp. 1660–1671. Cited by: [§1](https://arxiv.org/html/2606.07308#S1.p4.1), [§1](https://arxiv.org/html/2606.07308#S1.p5.1), [§4.2](https://arxiv.org/html/2606.07308#S4.SS2.p1.2), [§5](https://arxiv.org/html/2606.07308#S5.p2.1).
- L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni (2019) Modeling tabular data using conditional gan. Advances in neural information processing systems 32. Cited by: [§4.2](https://arxiv.org/html/2606.07308#S4.SS2.p1.2).
## Appendix A Summary of Notation
Table [1](https://arxiv.org/html/2606.07308#A1.T1) presents a summary of the notation used in this paper.

Table 1: Summary of the notation

| Notation | Meaning |
| --- | --- |
| $X^{b}$ | A random variable denoting an agent’s base/original/pre-strategic covariates |
| $x^{b}$ | A value taken by the random variable $X^{b}$ |
| $\mathcal{X}$ | Space of covariates |
| $\pi:\mathcal{X}\to[0,1]$ | A DM’s decision policy |
| $T^{b}$ | A random variable denoting the treatment assigned to an agent with base covariates $X^{b}$ |
| $t^{b}$ | A value taken by the random variable $T^{b}$ |
| $\mathcal{T}=\{0,1\}$ | Binary treatment space |
| $\mathcal{X}^{r}=\{x^{r}_{j}\}_{j=1}^{k}$ | The set of $k$ recommendations |
| $e=\{(x_{j}^{r},\pi(x_{j}^{r}))\}_{j=1}^{k}$ | Explanation containing recommendations and policy outcomes |
| $\mathcal{X}^{f}:=\{x^{b}\}\cup\mathcal{X}^{r}$ | A set of feasible actions containing base covariates and recommendations |
| $c:\mathcal{X}\times\mathcal{X}\to\mathbb{R}_{\geq 0}$ | Cost function denoting the cost involved in modifying covariates |
| $d:\mathcal{X}\times\mathcal{X}\to\mathbb{R}_{\geq 0}$ | A (primitive cost) function known to the DM and common to all agents |
| $\alpha\in(0,\infty)$ | Agent-specific cost sensitivity |
| $Z, Y$ | Random variables denoting the exogenous noise and the agent’s outcome |
| $z, y$ | Values taken by the random variables $Z, Y$ |
| $\mathcal{Z},\mathcal{Y}$ | Spaces of values for $z, y$ |
## Appendix B Additional Discussion
### B.1 On the Agents’ Behavioral Model
The agents’ behavioral model in [Equation 1](https://arxiv.org/html/2606.07308#S2.E1) of [Section 2.1](https://arxiv.org/html/2606.07308#S2.SS1) follows prior work on strategic learning under local information disclosure (Tsirtsis and Gomez Rodriguez, [2020](https://arxiv.org/html/2606.07308#bib.bib39); Vo et al., [2026](https://arxiv.org/html/2606.07308#bib.bib44)). In particular, the motivation comes from the ARex framework (Vo et al., [2026](https://arxiv.org/html/2606.07308#bib.bib44)), where the authors argue that, in many real-world settings, the DM can collect additional information (e.g., through surveys) to generate recommendations better aligned with agents’ preferences, thereby reducing the likelihood that agents deviate from the recommended actions. Our work extends the ARex framework to allow an unrestricted number of recommendations, which can further reduce the chance that agents deviate from the presented recommendation set.
Recent work in strategic classification has questioned whether agents necessarily behave according to idealized best-response models. However, many such works operate under substantially different explanation settings. For example, in Ebrahimi et al. ([2025](https://arxiv.org/html/2606.07308#bib.bib9)), agents are provided with feature importance weights and must infer for themselves how feature modifications translate into utility gains. As discussed by Vo et al. ([2026](https://arxiv.org/html/2606.07308#bib.bib44)), such uncertainty, which is common in many explanation paradigms, can give rise to misinterpretation. In contrast, ARexes directly present actionable recommendations together with their associated policy outcomes, thereby substantially reducing ambiguity about the utility consequences of recommended actions.
Nevertheless, as with most strategic learning frameworks, our approach still relies on a stylized behavioral model of agents. Although ARexes can be made more practical through additional preference elicitation, and our extension reduces the likelihood that agents deviate from the recommendation set, such deviations cannot be ruled out entirely. Extending OPE under strategic behavior to accommodate richer or boundedly rational response models remains an important direction for future work. In [Section 4.1](https://arxiv.org/html/2606.07308#S4.SS1), we additionally provide a simple experiment to study how deviations from the assumed behavioral model may affect estimation performance.
### B.2 On the Agents’ Cost Model
#### The cost structure.
The cost model in [Section 2.1](https://arxiv.org/html/2606.07308#S2.SS1) separates the modification cost $c(x,x')$ into two components: a shared primitive cost $d(x,x')$ and an agent-specific sensitivity parameter $\alpha$. Intuitively, $d(x,x')$ captures the burden imposed by society (e.g., time, effort, and monetary expenses) required to modify one’s covariates. In contrast, $\alpha$ reflects how a specific agent internalizes this burden in their utility, translating the primitive cost $d(x,x')$ into a personal cost $c(x,x')$ that is weighed against the chance of receiving a positive treatment.
The current problem formulation can in principle be generalized to accommodate richer forms of heterogeneous cost functions. For example, instead of a scalar sensitivity parameter $\alpha$, one could consider feature-specific sensitivities that allow different agents to find some feature modifications easier or harder than others. This could correspond to replacing the scalar scaling factor with a quadratic form such as $(x-x')^{\top}\Lambda(x-x')$, where the matrix $\Lambda$ captures how different feature modifications are weighted. However, such extensions substantially complicate the analysis of the induced strategic covariate shift $p(x^{s} \mid t^{b}, x^{b})$, since one would need to derive the distribution of the scalar costs $(x-x')^{\top}\Lambda(x-x')$ from distributions over random vector- or matrix-valued latent variables $\Lambda$. Developing estimation procedures for such settings remains a promising direction for future work.
#### Modeling the cost sensitivity.
The modeling choice for the cost sensitivity $\alpha$ reflects that agents with different characteristics, captured by $x^{b}$, may differ in their willingness to bear effort or monetary expenses. The transformation $\phi(x^{b})$ allows the conditional mean of $\ln\alpha$ to depend on covariates while accommodating mixtures of categorical and numerical variables. In contrast, the variance parameter $\sigma^{2}$ captures residual heterogeneity arising from latent factors outside $x^{b}$, such as temporary personal constraints or unobserved motivational differences. Using a shared variance parameter therefore reflects the assumption that, while different groups may differ in their average attitudes toward effort, the dispersion around those averages is broadly comparable across groups.
The log-normal model has a natural motivation based on the central limit theorem (CLT): $\alpha>0$ is a positive scale parameter that describes an agent’s overall cost sensitivity, which can be interpreted as the aggregate effect of many latent factors (e.g., liquidity, time constraints, opportunity cost, motivation, risk tolerance, etc.). If these factors combine approximately multiplicatively on the original scale, then $\ln(\alpha)\mid x^{b}$ becomes approximately additive in the latent contributions. If the latent contributions additionally have finite variance, standard central-limit-type arguments motivate approximating $\ln(\alpha)\mid x^{b}$ by a normal distribution.
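The multiplicative-factor argument can be illustrated with a small simulation. The sketch below is not part of the paper's experiments: it assumes, purely for illustration, that $\alpha$ is the product of 20 independent factors drawn uniformly from $[0.5, 1.5]$, and compares the sample skewness of $\alpha$ with that of $\ln\alpha$. The number of factors, the uniform range, and the helper names `simulate_alpha` and `skewness` are hypothetical choices.

```python
import math
import random


def simulate_alpha(num_factors, rng):
    """Draw one cost sensitivity as a product of positive latent factors.

    Assumption for illustration only: factors are i.i.d. Uniform(0.5, 1.5).
    """
    alpha = 1.0
    for _ in range(num_factors):
        alpha *= rng.uniform(0.5, 1.5)
    return alpha


def skewness(values):
    """Sample skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5


rng = random.Random(0)
alphas = [simulate_alpha(20, rng) for _ in range(20000)]
log_alphas = [math.log(a) for a in alphas]

# On the original scale the product is heavily right-skewed; on the log
# scale it is a sum of i.i.d. terms and is much closer to symmetric.
skew_alpha = skewness(alphas)
skew_log_alpha = skewness(log_alphas)
print(f"skewness of alpha:     {skew_alpha:.3f}")
print(f"skewness of ln(alpha): {skew_log_alpha:.3f}")
```

Near-zero skewness of $\ln\alpha$ is only a coarse symptom of approximate normality; it does not validate the log-normal assumption for any real population of agents.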
We model the conditional mean of $\ln\alpha$ as a linear function of transformed covariates for interpretability and statistical tractability. Since the DM observes only one interaction per agent, the available data are substantially more limited than in repeated-interaction settings, motivating the use of a parsimonious parametric model while still allowing nonlinear relationships through the transformation $\phi(x^{b})$.
## Appendix C Uniqueness of the Agent’s Utility Maximizer
Here, we prove [Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2) under a parametric assumption on $\alpha$ and a design condition for $\mathcal{X}^{r}$.
Because we assume the cost function has the form $C(x,x') := \alpha\, d(x,x')$ with $\alpha\in\mathbb{R}^{+}$, when the DM designs $\mathcal{X}^{r}$ (for each agent) such that all recommended actions $x^{r}_{j}\in\mathcal{X}^{r}$ have different primitive costs $d(x^{r}_{j},x^{b})$, any two recommended actions can be simultaneous utility maximizers for at most one value of $\alpha$.
To see this, suppose that there are two utility maximizers $x^{r}_{\bullet},x^{r}_{\diamond}\in\mathcal{X}^{r}$ for an agent $i$ such that $d(x^{r}_{\bullet},x^{b})\neq d(x^{r}_{\diamond},x^{b})$. Note that the DM can enforce this condition because we assume the DM knows $d$. Then, for this agent, the following must hold:
$$
\begin{aligned}
& u(\pi,\alpha_{i},x^{r}_{\bullet},x^{b}) = u(\pi,\alpha_{i},x^{r}_{\diamond},x^{b}) && \text{(5)}\\
\Rightarrow\ & \pi(x^{r}_{\bullet}) - \alpha_{i}\, d(x^{r}_{\bullet},x^{b}) = \pi(x^{r}_{\diamond}) - \alpha_{i}\, d(x^{r}_{\diamond},x^{b}) && \text{(6)}\\
\Rightarrow\ & \pi(x^{r}_{\bullet}) - \pi(x^{r}_{\diamond}) = \alpha_{i}\big(\underbrace{d(x^{r}_{\bullet},x^{b}) - d(x^{r}_{\diamond},x^{b})}_{\neq 0}\big), && \text{(7)}
\end{aligned}
$$
which holds for at most one value of $\alpha_{i}$.
Given $(\pi,\alpha,\mathcal{X}^{f},x^{b})$, the event $\{|\arg\max_{x\in\mathcal{X}^{f}}u(\pi,\alpha,x,x^{b})|\geq 2\}$ is equivalent to $\alpha$ belonging to the set $A=\{\alpha':|\arg\max_{x\in\mathcal{X}^{f}}u(\pi,\alpha',x,x^{b})|\geq 2\}$, which contains finitely many values because $\mathcal{X}^{f}$ is finite and each pair of actions with distinct primitive costs ties for at most one value of $\alpha$.
When we assume $\ln\alpha\mid x^{b}\sim\mathcal{N}(\cdot,\cdot)$, the conditional distribution of $\alpha$ is continuous, so $p(\alpha\in A\mid x^{b})=0$ for any finite set $A$. Therefore, [Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2) holds as
$$
p_{\alpha\mid X^{b}}\Big(\big|\arg\max_{x\in\mathcal{X}^{f}}u(\pi,\alpha,x,x^{b})\big|\geq 2 \,\Big|\, x^{b}\Big)=0. \tag{8}
$$
Note that $\pi$ and $\mathcal{X}^{f}$ enter as arguments when evaluating the above quantity. This concludes the proof.
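The finiteness argument can be made concrete with a short sketch. Assuming utilities of the form $\pi(x)-\alpha\, d(x,x^{b})$ and pairwise distinct primitive costs, the hypothetical helper `tie_alphas` below solves Equation (7) for every pair of feasible actions. The returned set is a superset of $A$ (a pair may tie without both being maximizers) and has at most one element per pair. The policy values and costs in the example are made up for illustration.

```python
from itertools import combinations


def tie_alphas(policy_values, primitive_costs):
    """Return all alpha > 0 at which some pair of actions has equal utility.

    Utility of action j is policy_values[j] - alpha * primitive_costs[j].
    Assumes pairwise distinct primitive costs (the design condition above).
    """
    ties = set()
    for j, k in combinations(range(len(policy_values)), 2):
        cost_gap = primitive_costs[j] - primitive_costs[k]
        if cost_gap == 0:
            raise ValueError("primitive costs must be pairwise distinct")
        # Solve pi_j - alpha * d_j = pi_k - alpha * d_k for alpha.
        alpha = (policy_values[j] - policy_values[k]) / cost_gap
        if alpha > 0:
            ties.add(alpha)
    return ties


# Hypothetical feasible set: index 0 is the base covariates (zero cost),
# indices 1 and 2 are two recommendations.
policy_values = [0.2, 0.6, 0.9]
primitive_costs = [0.0, 1.0, 2.5]
candidate_ties = tie_alphas(policy_values, primitive_costs)
print(sorted(candidate_ties))
```

With three feasible actions there are three pairs, so at most three tie values; any continuous distribution over $\alpha$ assigns such a set probability zero.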
### C.1 Probability of the Agent’s Response
Here, we show how to arrive at the strict inequalities in the utility comparison, using the result of [Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2).
Let $M_{\mathcal{X}}$ denote the set of utility maximizers for an agent. From the previous result, we have
$$
p\big(|M_{\mathcal{X}}|=1\mid\mathcal{X}^{r},T^{b}=0,x^{b}\big)=1\qquad\&\qquad p\big(|M_{\mathcal{X}}|\geq 2\mid\mathcal{X}^{r},T^{b}=0,x^{b}\big)=0. \tag{9}
$$
Note that in our setup, $\alpha\perp\!\!\!\perp\{\mathcal{X}^{r},T^{b}\}\mid X^{b}$. Let $W:=\mathbbm{1}_{|M_{\mathcal{X}}|=1}$ be a binary random variable indicating whether the set of maximizers has size $1$. We have
$$
p\big(W=1\mid\mathcal{X}^{r},T^{b}=0,x^{b}\big)=1\qquad\&\qquad p\big(W=0\mid\mathcal{X}^{r},T^{b}=0,x^{b}\big)=0. \tag{10}
$$
Then,
$$
\begin{aligned}
& p(x^{s}\mid\mathcal{X}^{r},T^{b}=0,x^{b}) && \text{(11)}\\
={}& \sum_{w\in\{0,1\}}p\big(\{\text{agent picks }x^{s}\}\ \&\ W=w \,\big|\, \mathcal{X}^{r},T^{b}=0,x^{b}\big)\ \mathbbm{1}\big(x^{s}\in\mathcal{X}^{f}\big) && \text{(12)}\\
={}& \sum_{w\in\{0,1\}}p\big(\{\text{agent picks }x^{s}\} \,\big|\, W=w,\ldots\big)\ \underbrace{p\big(W=w\mid\mathcal{X}^{r},T^{b}=0,x^{b}\big)}_{=0\text{ if }w=0}\ \mathbbm{1}\big(x^{s}\in\mathcal{X}^{f}\big) && \text{(13)}\\
={}& p\big(\{\text{agent picks }x^{s}\}\ \&\ W=1 \,\big|\, \mathcal{X}^{r},T^{b}=0,x^{b}\big)\ \mathbbm{1}\big(x^{s}\in\mathcal{X}^{f}\big) && \text{(14)}\\
={}& p\big(\{\text{agent picks }x^{s}\}\ \&\ |M_{\mathcal{X}}|=1 \,\big|\, \mathcal{X}^{r},T^{b}=0,x^{b}\big)\ \mathbbm{1}\big(x^{s}\in\mathcal{X}^{f}\big) && \text{(15)}\\
={}& p\big(\big\{u(\pi,C,x^{s},x^{b})>u(\pi,C,x,x^{b})\big\}\ \forall x\in\mathcal{X}^{f}\setminus\{x^{s}\} \,\big|\, \mathcal{X}^{r},T^{b}=0,x^{b}\big)\ \mathbbm{1}\big(x^{s}\in\mathcal{X}^{f}\big) && \text{(16)}\\
={}& p\big(\big\{u(\pi,C,x^{s},x^{b})>u(\pi,C,x,x^{b})\big\}\ \forall x\in\mathcal{X}^{f}\setminus\{x^{s}\} \,\big|\, x^{b}\big)\ \mathbbm{1}\big(x^{s}\in\mathcal{X}^{f}\big). && \text{(17)}
\end{aligned}
$$
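Under the cost form $\alpha\, d(x,x^{b})$, the strict inequalities in Equations (16) and (17) restrict $\alpha$ to an interval. The sketch below derives that interval directly from the inequalities; it is an illustration, and the thresholds defined in the main paper (not reproduced in this appendix) may be constructed differently. The helper name `alpha_interval`, the `(policy value, primitive cost)` tuple encoding, and the numbers are hypothetical.

```python
import math


def alpha_interval(chosen, others):
    """Range of alpha for which `chosen` is the strict utility maximizer.

    Each action is a pair (policy_value, primitive_cost) and its utility is
    policy_value - alpha * primitive_cost. Returns (lower, upper); the range
    is empty whenever lower >= upper.
    """
    pi_s, d_s = chosen
    lower, upper = 0.0, math.inf
    for pi_x, d_x in others:
        if d_x > d_s:
            # A costlier alternative is beaten only if alpha is large enough.
            lower = max(lower, (pi_x - pi_s) / (d_x - d_s))
        elif d_x < d_s:
            # A cheaper alternative is beaten only if alpha is small enough.
            upper = min(upper, (pi_s - pi_x) / (d_s - d_x))
        elif pi_s <= pi_x:
            # Equal cost and no better policy value: never strictly preferred.
            return 0.0, 0.0
    return lower, upper


# Hypothetical example: base covariates plus two recommendations.
base, rec1, rec2 = (0.2, 0.0), (0.6, 1.0), (0.9, 2.5)
interval_base = alpha_interval(base, [rec1, rec2])
interval_rec1 = alpha_interval(rec1, [base, rec2])
interval_rec2 = alpha_interval(rec2, [base, rec1])
print(interval_rec2, interval_rec1, interval_base)
```

In this example the three intervals partition $(0,\infty)$ up to their endpoints: agents with low sensitivity pick the costliest recommendation, and agents with high sensitivity stay at their base covariates.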
## Appendix D Estimation of Agents’ Strategic Responses
Here, we prove the consistency of the maximum likelihood estimator $\hat{\theta}_{n}$. Recall that
$$
\ln\alpha\mid x^{b}\sim\mathcal{N}\big(\beta^{\top}\phi(x^{b})+\beta_{0},\sigma^{2}\big), \tag{18}
$$
where $\phi:\mathcal{X}\to\mathbb{R}^{p}$ denotes some known transformation of the covariate vector $x^{b}$. Furthermore, the conditional CDF, $F(\alpha\mid x^{b};\theta)$, is parameterized by $\theta=(\beta,\beta_{0},\sigma)\in\Theta\subseteq\mathbb{R}^{p}\times\mathbb{R}\times\mathbb{R}^{+}$.
Given each observation of agents’ strategic behavior, i.e., a tuple $(x^{s},\mathcal{X}^{r},t^{b}=0,x^{b})$, we define the per-observation likelihood contribution as
$$
f(\delta^{l},\delta^{u},x^{b};\theta)=
\begin{cases}
p_{\theta}\Big(\delta^{l}_{(x^{s},\mathcal{X}^{r},x^{b})}<\alpha<\delta^{u}_{(x^{s},\mathcal{X}^{r},x^{b})} \,\Big|\, x^{b}\Big) & \text{if } 0\leq\delta^{l}_{(x^{s},\mathcal{X}^{r},x^{b})}<\delta^{u}_{(x^{s},\mathcal{X}^{r},x^{b})},\\
f_{\text{const}} & \text{otherwise},
\end{cases} \tag{19}
$$
where any choice of the constant $f_{\text{const}}\in(0,1)$ works and each pair $(\delta^{l},\delta^{u})$ is the transformation of an observation $(x^{s},\mathcal{X}^{r},t^{b}=0,x^{b})$, as defined in the main paper. Note that $0\leq\delta^{l}\leq\delta^{u}$ by construction, so $f$ outputs $f_{\text{const}}$ exactly when $\delta^{l}=\delta^{u}$.
Note that in our log-normal model, in the special case $0=\delta^{l}<\delta^{u}$, we have
$$
\begin{aligned}
f(0,\delta^{u},x^{b};\theta) &= p_{\theta}\big(0<\alpha<\delta^{u} \,\big|\, x^{b}\big)=p_{\theta}\big(\alpha<\delta^{u} \,\big|\, x^{b}\big) && \text{(20)}\\
&= F_{\alpha\mid X^{b}}(\delta^{u}\mid x^{b};\theta). && \text{(21)}
\end{aligned}
$$
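As an illustration of Equations (19) to (21), the following sketch evaluates the per-observation contribution under the log-normal model using the standard identity $F(a)=\tfrac{1}{2}\big[1+\operatorname{erf}\big((\ln a-\mu)/(\sigma\sqrt{2})\big)\big]$ with $\mu=\beta^{\top}\phi(x^{b})+\beta_{0}$. The function names, the default `f_const=0.5`, the handling of an infinite upper threshold, and the example numbers are assumptions made for this sketch rather than the paper's implementation.

```python
import math


def lognormal_cdf(a, mu, sigma):
    """CDF of alpha at a when ln(alpha) ~ N(mu, sigma^2)."""
    if a <= 0.0:
        return 0.0
    if math.isinf(a):
        return 1.0
    return 0.5 * (1.0 + math.erf((math.log(a) - mu) / (sigma * math.sqrt(2.0))))


def likelihood_contribution(delta_l, delta_u, phi_xb, beta, beta0, sigma,
                            f_const=0.5):
    """Per-observation likelihood f(delta_l, delta_u, x_b; theta).

    phi_xb is the already-transformed covariate vector phi(x_b).
    Collapsed or invalid intervals return the constant f_const.
    """
    if not (0.0 <= delta_l < delta_u):
        return f_const
    mu = sum(b * p for b, p in zip(beta, phi_xb)) + beta0
    return lognormal_cdf(delta_u, mu, sigma) - lognormal_cdf(delta_l, mu, sigma)


# Hypothetical parameter values and thresholds.
example = likelihood_contribution(0.2, 0.4, [1.0, 0.0], [-0.5, 0.3], -0.7, 0.5)
print(example)
```

When `delta_l` is zero the second CDF term vanishes, which recovers the special case in Equation (21).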
Given $n$ observations $\{(x^{s}_{i},\mathcal{X}^{r}_{i},t^{b}_{i}=0,x^{b}_{i})\}_{i=1}^{n}$ collected from agents who received a negative base treatment, i.e., $t^{b}_{i}=0$, we define the empirical log-likelihood and its population version as follows:
$$
\begin{aligned}
\mathcal{Q}_{n}(\theta) &= \frac{1}{n}\sum_{i=1}^{n}\ln f(\delta^{l}_{i},\delta^{u}_{i},x^{b}_{i};\theta), && \text{(22)}\\
\mathcal{Q}(\theta) &= \mathbb{E}_{P_{\delta^{l},\delta^{u},X^{b}\mid T^{b}}}\Big[\ln f(\delta^{l},\delta^{u},X^{b};\theta) \,\Big|\, T^{b}=0\Big], && \text{(23)}
\end{aligned}
$$
where the distribution $P_{\delta^{l},\delta^{u},X^{b}\mid T^{b}}$ is induced by $P_{X^{s},\mathcal{X}^{r},X^{b}\mid T^{b}}$. Furthermore, the density of an observation can be expressed as
$$
\begin{aligned}
p(x^{s},\mathcal{X}^{r},x^{b}\mid T^{b}=0) &= p(x^{s}\mid\mathcal{X}^{r},x^{b},T^{b}=0)\ p(\mathcal{X}^{r},x^{b}\mid T^{b}=0) && \text{(24)}\\
&= p(\delta^{l}<\alpha<\delta^{u}\mid x^{b};\theta_{0})\ \mathbbm{1}\big(x^{s}\in\mathcal{X}^{r}\cup\{x^{b}\}\big)\ p(\mathcal{X}^{r},x^{b}\mid T^{b}=0), && \text{(25)}
\end{aligned}
$$
where the last line follows from the derivation in the main paper together with [Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2). Similarly, assuming the log-normal model for $\alpha$ and the (almost sure) uniqueness of the utility maximizer, the same decomposition holds for an arbitrary parameter value $\theta$, i.e.,
$$
\begin{aligned}
p(x^{s},\mathcal{X}^{r},x^{b}\mid T^{b}=0;\theta) &= p(x^{s}\mid\mathcal{X}^{r},x^{b},T^{b}=0;\theta)\ p(\mathcal{X}^{r},x^{b}\mid T^{b}=0) && \text{(26)}\\
&= p(\delta^{l}<\alpha<\delta^{u}\mid x^{b};\theta)\ \mathbbm{1}\big(x^{s}\in\mathcal{X}^{r}\cup\{x^{b}\}\big)\ p(\mathcal{X}^{r},x^{b}\mid T^{b}=0). && \text{(27)}
\end{aligned}
$$
We will use this decomposition later in the proof of [Lemma D.5](https://arxiv.org/html/2606.07308#A4.Thmtheorem5) (unique likelihood maximizer).
Let $\hat{\theta}_{n}:=\arg\max_{\theta\in\Theta}\mathcal{Q}_{n}(\theta)$ and let $\theta_{0}$ denote the true parameter value in $F(\alpha\mid x^{b};\theta)$. Our goal is to prove that $\hat{\theta}_{n}\xrightarrow{p}\theta_{0}$.
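To give a feel for what consistency means here, the sketch below maximizes $\mathcal{Q}_{n}$ on simulated interval data. It makes several simplifying assumptions that are not from the paper: an intercept-only model (so $\theta=(\mu,\sigma)$ with $\mu=\beta_{0}$), identical thresholds $0.2$ and $0.4$ for every agent, a true parameter $(\mu_{0},\sigma_{0})=(-1.2,0.5)$, and a coarse grid search in place of a proper numerical optimizer. All names are hypothetical.

```python
import math
import random


def lognormal_cdf(a, mu, sigma):
    """CDF of alpha at a when ln(alpha) ~ N(mu, sigma^2)."""
    if a <= 0.0:
        return 0.0
    if math.isinf(a):
        return 1.0
    return 0.5 * (1.0 + math.erf((math.log(a) - mu) / (sigma * math.sqrt(2.0))))


def mean_log_likelihood(cell_counts, cut_points, mu, sigma):
    """Empirical objective Q_n when every agent shares the same thresholds.

    Cell j corresponds to the interval (edges[j], edges[j + 1]) for alpha.
    """
    edges = [0.0] + list(cut_points) + [math.inf]
    total = 0.0
    for count, lo, hi in zip(cell_counts, edges[:-1], edges[1:]):
        if count == 0:
            continue
        prob = lognormal_cdf(hi, mu, sigma) - lognormal_cdf(lo, mu, sigma)
        if prob <= 0.0:
            return -math.inf  # numerically impossible under this theta
        total += count * math.log(prob)
    return total / sum(cell_counts)


# Simulate which interval each agent's latent alpha falls into.
rng = random.Random(0)
true_mu, true_sigma = -1.2, 0.5
cut_points = (0.2, 0.4)
cell_counts = [0, 0, 0]
for _ in range(5000):
    alpha = rng.lognormvariate(true_mu, true_sigma)
    cell_counts[sum(alpha > c for c in cut_points)] += 1

# Grid search stands in for the arg max over a compact parameter space.
mu_grid = [-2.0 + 0.05 * i for i in range(41)]
sigma_grid = [0.1 + 0.05 * i for i in range(29)]
mu_hat, sigma_hat = max(
    ((m, s) for m in mu_grid for s in sigma_grid),
    key=lambda th: mean_log_likelihood(cell_counts, cut_points, th[0], th[1]),
)
print(mu_hat, sigma_hat)
```

Only the interval containing each $\alpha_{i}$ is used, never $\alpha_{i}$ itself, which mirrors the interval-valued information that an observed response provides.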
We introduce the following lemma, which will help prove subsequent results.
###### Lemma D.1 (Continuity).
For each tuple $(\delta^{l},\delta^{u},x^{b})\in\mathbb{R}^{d+2}$, the corresponding functions $f(\delta^{l},\delta^{u},x^{b};\theta)$ and $\ln f(\delta^{l},\delta^{u},x^{b};\theta)$ are continuous w.r.t. $\theta\in\mathbb{R}^{p}\times\mathbb{R}\times\mathbb{R}^{+}$.
###### Proof\.
We expand the function $f$ for the case $0 < \delta^{l} < \delta^{u}$:

$$
\begin{align}
&p_{\theta}\Big(\delta^{l}_{(x^{s},\mathcal{X}^{r},x^{b})} < \alpha < \delta^{u}_{(x^{s},\mathcal{X}^{r},x^{b})}\ \big\lvert\ x^{b}\Big) \tag{28}\\
&= F_{\alpha|X^{b}}(\delta^{u}|x^{b};\theta) - F_{\alpha|X^{b}}(\delta^{l}|x^{b};\theta) \tag{29}\\
&= \frac{1}{2}\left[\operatorname{erf}\left(\frac{\ln\delta^{u}-\beta^{\top}\phi(x^{b})-\beta_{0}}{\sigma\sqrt{2}}\right) - \operatorname{erf}\left(\frac{\ln\delta^{l}-\beta^{\top}\phi(x^{b})-\beta_{0}}{\sigma\sqrt{2}}\right)\right], \tag{30}
\end{align}
$$

where $\operatorname{erf}$ denotes the error function.
Each member function inside $f$ is continuous w.r.t. $\theta$: the inversion $1/\sigma$ (recall that $\sigma \in \mathbb{R}^{+}$), the linear transformation $\beta^{\top}\phi(x^{b})+\beta_{0}$, and $\operatorname{erf}(\cdot)$. Their composition is therefore also continuous on $\Theta$. A similar argument applies to the case $0 = \delta^{l} < \delta^{u}$.
For the case of collapsed intervals, $f$ outputs a constant, so it is continuous. Thus $f$ is continuous on $\Theta$.
Similarly, as $\ln$ is continuous on the domain $\mathbb{R}^{+}$, $\ln\circ f$ is continuous on $\Theta$. This concludes the proof. ∎
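As a numerical companion to Lemma D.1, the following minimal Python sketch evaluates the interval probability in (28)-(30) under the log-normal model. The function names `lognormal_cdf` and `interval_prob`, the explicit feature vector `phi_xb`, and all numeric values are illustrative assumptions, not part of the paper.

```python
import math


def lognormal_cdf(a, mean_log, sigma):
    """CDF of a log-normal variable at a; returns 0.0 for a <= 0."""
    if a <= 0.0:
        return 0.0
    z = (math.log(a) - mean_log) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def interval_prob(delta_l, delta_u, phi_xb, beta, beta0, sigma):
    """P(delta_l < alpha < delta_u | x^b) with ln(alpha) ~ N(beta^T phi + beta0, sigma^2)."""
    mean_log = sum(b * p for b, p in zip(beta, phi_xb)) + beta0
    return lognormal_cdf(delta_u, mean_log, sigma) - lognormal_cdf(delta_l, mean_log, sigma)


# Hypothetical parameter values: a tiny change in sigma yields a tiny change in f,
# which is what continuity in theta predicts.
p_a = interval_prob(0.5, 2.0, [1.0, 0.0], [0.3, -0.2], 0.1, 0.8)
p_b = interval_prob(0.5, 2.0, [1.0, 0.0], [0.3, -0.2], 0.1, 0.8 + 1e-6)
print(p_a, p_b)
```

For $\delta^{l} = 0$ the lower CDF term vanishes, so the same function covers the case $0 = \delta^{l} < \delta^{u}$.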
### D.1 Consistency of $\hat{\theta}_{n}$
###### Lemma D\.2\(Dominance\)\.
Under [Assumption 2.3](https://arxiv.org/html/2606.07308#S2.Thmtheorem3), [Assumption 2.4](https://arxiv.org/html/2606.07308#S2.Thmtheorem4), and when [Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2) holds, there exists an upper bound $\eta\in\mathbb{R}$ such that

$$
|\ln f(\delta^{l},\delta^{u},x^{b};\theta)| < \eta \qquad \forall\theta\in\Theta,\ \forall(\delta^{l},\delta^{u},x^{b})\in\mathcal{C}_{g,\sigma,d,\mathcal{X}}, \tag{31}
$$

where $\mathcal{C}_{g,\sigma,d,\mathcal{X}}$ denotes the set of all possible values of $(\delta^{l},\delta^{u},x^{b})$ induced by $\{g,\sigma,d,\mathcal{X}\}$.
###### Proof\.
From [Lemma D.1](https://arxiv.org/html/2606.07308#A4.Thmtheorem1), for each configuration $(\delta^{l},\delta^{u},x^{b})$, the functions $f(\delta^{l},\delta^{u},x^{b};\theta)$ and $\ln f(\delta^{l},\delta^{u},x^{b};\theta)$ are continuous w.r.t. $\theta$. In addition, because the space $\Theta$ is compact, both functions are bounded on $\Theta$ by the extreme value theorem.
Because the space $\mathcal{X}$ is finite, $\mathcal{C}_{g,\sigma,d,\mathcal{X}}$ is also finite for a given instantiation of $\{g,\sigma,d\}$. Consequently, there are finitely many such bounds, one for each $(\delta^{l},\delta^{u},x^{b})\in\mathcal{C}_{g,\sigma,d,\mathcal{X}}$, each valid for all $\theta\in\Theta$. Taking a constant larger than all of them, there exists $\eta\in\mathbb{R}$ such that

$$
|\ln f(\delta^{l},\delta^{u},x^{b};\theta)| < \eta \qquad \forall\theta\in\Theta,\ \forall(\delta^{l},\delta^{u},x^{b})\in\mathcal{C}_{g,\sigma,d,\mathcal{X}}. \tag{32}
$$

This concludes the proof. ∎
###### Lemma D\.3\(Uniform convergence\)\.
Under [Assumption 2.3](https://arxiv.org/html/2606.07308#S2.Thmtheorem3) and [Assumption 2.4](https://arxiv.org/html/2606.07308#S2.Thmtheorem4), we have

$$
\sup_{\theta\in\Theta}|\mathcal{Q}_{n}(\theta)-\mathcal{Q}(\theta)| \xrightarrow{p} 0. \tag{33}
$$
###### Proof\.
From [Assumption 2.3](https://arxiv.org/html/2606.07308#S2.Thmtheorem3) (compactness), [Lemma D.1](https://arxiv.org/html/2606.07308#A4.Thmtheorem1) (continuity), and [Lemma D.2](https://arxiv.org/html/2606.07308#A4.Thmtheorem2) (dominance), the desired result follows directly from Lemma 2.4 of Newey and McFadden ([1994](https://arxiv.org/html/2606.07308#bib.bib30)). ∎
###### Lemma D\.4\(Identifiability\)\.
Under [Assumption 2.3](https://arxiv.org/html/2606.07308#S2.Thmtheorem3), [Assumption 2.4](https://arxiv.org/html/2606.07308#S2.Thmtheorem4), [Assumption 2.5](https://arxiv.org/html/2606.07308#S2.Thmtheorem5), and when [Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2) holds, we have

$$
\big\{f(\delta^{l},\delta^{u},x^{b};\theta) = f(\delta^{l},\delta^{u},x^{b};\theta_{0}) \qquad \forall(\delta^{l},\delta^{u},x^{b})\in\mathcal{C}_{g,\sigma,d,\mathcal{X}}\big\} \Rightarrow \theta=\theta_{0}. \tag{34}
$$
###### Proof\.
Suppose that there exists a parameter value $\theta_{\bullet}$ such that $f(\delta^{l},\delta^{u},x^{b};\theta_{\bullet}) = f(\delta^{l},\delta^{u},x^{b};\theta_{0})$ for all $(\delta^{l},\delta^{u},x^{b})\in\mathcal{C}_{g,\sigma,d,\mathcal{X}}$. From [Assumption 2.5](https://arxiv.org/html/2606.07308#S2.Thmtheorem5) (weak positivity and full rank), there exist two observations $(0,\delta^{u1},x^{b})$ and $(0,\delta^{u2},x^{b})$, for some $x^{b}\in\mathcal{X}$, such that for all $\delta^{u}\in\{\delta^{u1},\delta^{u2}\}$,

$$
\begin{align}
f(0,\delta^{u},x^{b};\theta_{\bullet}) &= f(0,\delta^{u},x^{b};\theta_{0}) \tag{35}\\
\Rightarrow\ \operatorname{erf}\left(\frac{\ln\delta^{u}-\beta(\theta_{\bullet})^{\top}\phi(x^{b})-\beta_{0}(\theta_{\bullet})}{\sigma(\theta_{\bullet})\sqrt{2}}\right) &= \operatorname{erf}\left(\frac{\ln\delta^{u}-\beta(\theta_{0})^{\top}\phi(x^{b})-\beta_{0}(\theta_{0})}{\sigma(\theta_{0})\sqrt{2}}\right), \tag{36}
\end{align}
$$

where we use $\beta(\theta),\beta_{0}(\theta),\sigma(\theta)$ to denote the respective elements in a parameter vector $\theta$. Because the error function is strictly increasing on the domain $\mathbb{R}$, hence invertible, we get the following for all $\delta^{u}\in\{\delta^{u1},\delta^{u2}\}$:

$$
\begin{align}
&\frac{\ln\delta^{u}-\beta(\theta_{\bullet})^{\top}\phi(x^{b})-\beta_{0}(\theta_{\bullet})}{\sigma(\theta_{\bullet})} = \frac{\ln\delta^{u}-\beta(\theta_{0})^{\top}\phi(x^{b})-\beta_{0}(\theta_{0})}{\sigma(\theta_{0})} \tag{37}\\
\Rightarrow\ &\ln\delta^{u}\left(\frac{1}{\sigma(\theta_{\bullet})}-\frac{1}{\sigma(\theta_{0})}\right) + \frac{-\beta(\theta_{\bullet})^{\top}\phi(x^{b})-\beta_{0}(\theta_{\bullet})}{\sigma(\theta_{\bullet})} - \frac{-\beta(\theta_{0})^{\top}\phi(x^{b})-\beta_{0}(\theta_{0})}{\sigma(\theta_{0})} = 0. \tag{38}
\end{align}
$$

Because (38) holds for two different values of $\delta^{u}\in\mathbb{R}^{+}$, subtracting the two instances gives $(\ln\delta^{u1}-\ln\delta^{u2})\big(1/\sigma(\theta_{\bullet})-1/\sigma(\theta_{0})\big)=0$, so it must hold that $\sigma(\theta_{\bullet})=\sigma(\theta_{0})$. We then obtain

$$
\begin{align}
&\beta(\theta_{\bullet})^{\top}\phi(x^{b})+\beta_{0}(\theta_{\bullet}) = \beta(\theta_{0})^{\top}\phi(x^{b})+\beta_{0}(\theta_{0}) \tag{39}\\
\Rightarrow\ &\big(\beta(\theta_{\bullet})-\beta(\theta_{0})\big)^{\top}\phi(x^{b}) + \big(\beta_{0}(\theta_{\bullet})-\beta_{0}(\theta_{0})\big) = 0. \tag{40}
\end{align}
$$

When this holds for $p+1$ distinct values of $x^{b}$ such that the augmented design matrix $\tilde{\Phi}_{x^{b}}$ has full column rank ([Assumption 2.5](https://arxiv.org/html/2606.07308#S2.Thmtheorem5)), by the closed-form solution for ordinary least squares, we get $\beta(\theta_{\bullet})-\beta(\theta_{0})=0$ and $\beta_{0}(\theta_{\bullet})-\beta_{0}(\theta_{0})=0$. Thus $\theta_{\bullet}=\theta_{0}$, and this concludes the proof. ∎
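The two steps of the identifiability argument can be traced numerically. The sketch below is a hedged illustration, not the paper's procedure: it writes the CDF through the standard normal $\Phi$ (equivalent to the $\operatorname{erf}$ form), recovers $\sigma$ and the linear predictor from two upper thresholds at the same $x^{b}$, and then solves the full-rank linear system for $(\beta,\beta_{0})$. All parameter values, thresholds, and names such as `recover_sigma_and_mean` are hypothetical.

```python
import math
from statistics import NormalDist

import numpy as np

STD_NORMAL = NormalDist()  # standard normal, used for Phi and its inverse


def recover_sigma_and_mean(d1, F1, d2, F2):
    """Invert F(d) = Phi((ln d - m) / sigma) at two distinct thresholds d1 != d2."""
    z1, z2 = STD_NORMAL.inv_cdf(F1), STD_NORMAL.inv_cdf(F2)
    sigma = (math.log(d1) - math.log(d2)) / (z1 - z2)
    mean_log = math.log(d1) - sigma * z1
    return sigma, mean_log


# Hypothetical ground truth with p = 2 features.
beta_true, beta0_true, sigma_true = np.array([0.7, -0.3]), 0.1, 0.5

# p + 1 = 3 feature vectors phi(x^b); appending a column of ones gives an
# augmented design matrix that has full column rank in this example.
Phi = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Phi_aug = np.hstack([Phi, np.ones((3, 1))])

sigmas, means = [], []
for phi in Phi:
    m = float(beta_true @ phi) + beta0_true
    # f(0, d, x^b; theta_0) equals the CDF at d because the CDF at 0 is 0.
    F1 = STD_NORMAL.cdf((math.log(0.8) - m) / sigma_true)
    F2 = STD_NORMAL.cdf((math.log(1.5) - m) / sigma_true)
    s, m_rec = recover_sigma_and_mean(0.8, F1, 1.5, F2)
    sigmas.append(s)
    means.append(m_rec)

# Solve Phi_aug @ [beta; beta0] = means for the coefficients.
coef = np.linalg.solve(Phi_aug, np.array(means))
print(sigmas, coef)
```

If the augmented matrix were rank deficient, `np.linalg.solve` would fail or return an arbitrary solution, which mirrors why the full-rank condition is needed.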
###### Lemma D\.5\(Unique likelihood maximizer\)\.
Under [Assumption 2.3](https://arxiv.org/html/2606.07308#S2.Thmtheorem3), [Assumption 2.4](https://arxiv.org/html/2606.07308#S2.Thmtheorem4), [Assumption 2.5](https://arxiv.org/html/2606.07308#S2.Thmtheorem5), and additionally when [Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2) holds, the true parameter $\theta_{0}$ is the unique global maximizer of $\mathcal{Q}(\theta)$.
###### Proof\.
We use the idea in Lemma 2.2 of Newey and McFadden ([1994](https://arxiv.org/html/2606.07308#bib.bib30)), where the strict version of Jensen's inequality is employed for a non-constant random variable. For any $\theta\neq\theta_{0}$,

$$
\begin{align}
\mathcal{Q}(\theta_{0})-\mathcal{Q}(\theta) &= \mathbb{E}_{P_{\delta^{l},\delta^{u},X^{b}|T^{b}}}\left[\ln\frac{f(\delta^{l},\delta^{u},X^{b};\theta_{0})}{f(\delta^{l},\delta^{u},X^{b};\theta)}\ \Big\lvert\ T^{b}=0\right] \tag{41}\\
&= \mathbb{E}_{P_{\delta^{l},\delta^{u},X^{b}|T^{b}}}\left[-\ln\frac{f(\delta^{l},\delta^{u},X^{b};\theta)}{f(\delta^{l},\delta^{u},X^{b};\theta_{0})}\ \Big\lvert\ T^{b}=0\right] \tag{42}\\
&> -\ln\mathbb{E}_{P_{\delta^{l},\delta^{u},X^{b}|T^{b}}}\left[\frac{f(\delta^{l},\delta^{u},X^{b};\theta)}{f(\delta^{l},\delta^{u},X^{b};\theta_{0})}\ \Big\lvert\ T^{b}=0\right], \tag{43}
\end{align}
$$

where $f(\delta^{l},\delta^{u},X^{b};\theta)/f(\delta^{l},\delta^{u},X^{b};\theta_{0})$ is a non-constant random variable with support in $\mathcal{C}_{g,\sigma,d,\mathcal{X}}$ thanks to [Lemma D.4](https://arxiv.org/html/2606.07308#A4.Thmtheorem4) (identifiability) and [Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2) (unique utility maximizer), which results in non-zero probabilities for non-degenerate intervals.
We further rewrite the right-hand side of this inequality, noting that the variables $\delta^{l},\delta^{u}$ are transformations of $x^{s},\mathcal{X}^{r},x^{b}$, which in turn are transformations of $\alpha,\mathcal{X}^{r},x^{b}$:

$$
\begin{align}
\mathcal{Q}(\theta_{0})-\mathcal{Q}(\theta) &> -\ln\mathbb{E}_{P_{\delta^{l},\delta^{u},X^{b}|T^{b}}}\left[\frac{f(\delta^{l},\delta^{u},X^{b};\theta)}{f(\delta^{l},\delta^{u},X^{b};\theta_{0})}\ \Big\lvert\ T^{b}=0\right] \tag{44}\\
&= -\ln\mathbb{E}_{P_{X^{s},\mathcal{X}^{r},X^{b}|T^{b}}}\left[\frac{f(\delta^{l},\delta^{u},X^{b};\theta)}{f(\delta^{l},\delta^{u},X^{b};\theta_{0})}\ \Big\lvert\ T^{b}=0\right] \tag{45}\\
&= -\ln\int_{\mathcal{X}\times\mathcal{P}(\mathcal{X})\times\mathcal{X}}\frac{f(\delta^{l},\delta^{u},x^{b};\theta)}{f(\delta^{l},\delta^{u},x^{b};\theta_{0})}\ p(x^{s},\mathcal{X}^{r},x^{b}\mid T^{b}=0;\theta_{0})\ d(x^{s},\mathcal{X}^{r},x^{b}) \tag{46}\\
&= -\ln\int_{\mathcal{X}\times\mathcal{P}(\mathcal{X})\times\mathcal{X}}\frac{p(\delta^{l}<\alpha<\delta^{u}\mid x^{b};\theta)}{p(\delta^{l}<\alpha<\delta^{u}\mid x^{b};\theta_{0})}\ p(x^{s},\mathcal{X}^{r},x^{b}\mid T^{b}=0;\theta_{0})\ d(x^{s},\mathcal{X}^{r},x^{b}) \tag{47}\\
&= -\ln\int_{\mathcal{X}\times\mathcal{P}(\mathcal{X})\times\mathcal{X}} p(x^{s},\mathcal{X}^{r},x^{b}\mid T^{b}=0;\theta)\ d(x^{s},\mathcal{X}^{r},x^{b}) \tag{48}\\
&= -\ln 1 \tag{49}\\
&= 0. \tag{50}
\end{align}
$$

Note that we can cancel the term $f(\delta^{l},\delta^{u},x^{b};\theta_{0})$ by using our earlier decomposition of $p(x^{s},\mathcal{X}^{r},x^{b}\mid T^{b}=0;\theta_{0})$ at the beginning of this section. Furthermore, we do not have to worry about the case where $f(\delta^{l},\delta^{u},x^{b};\theta_{0})=f_{\text{const}}$ because the probability of observing a degenerate interval is zero, thanks to [Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2).
Thus, we have shown that $\theta_{0}$ is the unique global maximizer of $\mathcal{Q}$. This concludes the proof. ∎
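The Jensen step can be checked on a small discrete example. The sketch below is an illustration under assumptions that are not in the paper: it replaces the random intervals by a fixed set of cut points, so that the gap $\mathcal{Q}(\theta_{0})-\mathcal{Q}(\theta)$ becomes a Kullback-Leibler divergence between two vectors of bin probabilities. The cut points and both parameter settings are hypothetical.

```python
import math


def bin_probs(cuts, mean_log, sigma):
    """Probabilities of the intervals (0, c1), (c1, c2), ..., (ck, inf) under a log-normal."""
    cdf = [0.0]
    for c in cuts:
        z = (math.log(c) - mean_log) / (sigma * math.sqrt(2.0))
        cdf.append(0.5 * (1.0 + math.erf(z)))
    cdf.append(1.0)
    return [hi - lo for lo, hi in zip(cdf[:-1], cdf[1:])]


def q_gap(p0, p):
    """E_{theta_0}[ln(f_0 / f)], i.e. Q(theta_0) - Q(theta) for discrete intervals."""
    return sum(a * math.log(a / b) for a, b in zip(p0, p))


cuts = [0.5, 1.0, 2.0]                     # hypothetical interval end points
p0 = bin_probs(cuts, 0.0, 1.0)             # probabilities under theta_0
p1 = bin_probs(cuts, 0.4, 0.7)             # probabilities under another theta

# E_{theta_0}[f / f_0] sums to one, which is the step from (47) to (49).
ratio_mean = sum(a * (b / a) for a, b in zip(p0, p1))
print(q_gap(p0, p1), ratio_mean)
```

The gap is strictly positive for the second parameter setting and zero when both arguments coincide, matching the strict inequality for $\theta\neq\theta_{0}$.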
We now show the main theorem on the consistency ofθ^n\\hat\{\\theta\}\_\{n\}\.
###### Theorem D.6 (Consistency of $\hat{\theta}_{n}$).
Under [Assumption 2.3](https://arxiv.org/html/2606.07308#S2.Thmtheorem3), [Assumption 2.4](https://arxiv.org/html/2606.07308#S2.Thmtheorem4), [Assumption 2.5](https://arxiv.org/html/2606.07308#S2.Thmtheorem5), and when [Lemma 2.2](https://arxiv.org/html/2606.07308#S2.Thmtheorem2) holds, we have $\hat{\theta}_{n}\xrightarrow{p}\theta_{0}$.
###### Proof\.
From [Assumption 2.3](https://arxiv.org/html/2606.07308#S2.Thmtheorem3) (compactness), [Lemma D.1](https://arxiv.org/html/2606.07308#A4.Thmtheorem1) (continuity), [Lemma D.3](https://arxiv.org/html/2606.07308#A4.Thmtheorem3) (uniform convergence), and [Lemma D.5](https://arxiv.org/html/2606.07308#A4.Thmtheorem5) (unique likelihood maximizer), the desired result follows directly from Theorem 2.1 of Newey and McFadden ([1994](https://arxiv.org/html/2606.07308#bib.bib30)). ∎
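To give a feel for what this consistency result means in practice, here is a small simulation sketch. It is a deliberately simplified stand-in for the paper's setting: there are no covariates, the intervals come from fixed cut points rather than from agents' responses, and the likelihood is maximized by grid search over a compact parameter box. Every value and name below is an assumption made for illustration.

```python
import math

import numpy as np


def bin_log_probs(cuts, mean_log, sigma):
    """Log-probabilities of the intervals (0, c1), ..., (ck, inf) under a log-normal."""
    cdf = [0.0]
    for c in cuts:
        z = (math.log(c) - mean_log) / (sigma * math.sqrt(2.0))
        cdf.append(0.5 * (1.0 + math.erf(z)))
    cdf.append(1.0)
    # Clip to avoid log(0) when a bin has numerically zero probability.
    return np.log(np.clip(np.diff(cdf), 1e-300, None))


rng = np.random.default_rng(0)
mu0, sigma0 = 0.2, 0.8                      # hypothetical true parameters
cuts = np.array([0.5, 1.0, 2.0, 4.0])       # hypothetical interval end points

# Only the interval containing alpha is observed, never alpha itself.
alpha = np.exp(mu0 + sigma0 * rng.standard_normal(20000))
counts = np.bincount(np.searchsorted(cuts, alpha), minlength=len(cuts) + 1)

# Grid search for the maximizer of the interval log-likelihood.
best_ll, mu_hat, sigma_hat = -np.inf, None, None
for mu in np.linspace(-1.0, 1.0, 81):
    for s in np.linspace(0.3, 2.0, 69):
        ll = float(counts @ bin_log_probs(cuts, mu, s))
        if ll > best_ll:
            best_ll, mu_hat, sigma_hat = ll, mu, s

print(mu_hat, sigma_hat)
```

With a large sample the grid maximizer should land near the true pair, up to the grid resolution; rerunning with a smaller sample typically gives a noisier estimate.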
### D.2 Consistency of $F(\alpha|x^{b};\hat{\theta}_{n})$
Because we assume the log-normal model where $\ln\alpha\mid x^{b}\sim\mathcal{N}(\beta^{\top}\phi(x^{b})+\beta_{0},\ \sigma^{2})$, we get

$$
F(\alpha|x^{b};\hat{\theta}_{n}) = \frac{1}{2}\left[1+\operatorname{erf}\left(\frac{\ln\alpha-\hat{\beta}^{\top}\phi(x^{b})-\hat{\beta}_{0}}{\hat{\sigma}\sqrt{2}}\right)\right], \tag{51}
$$

where $\hat{\theta}_{n}=(\hat{\beta},\hat{\beta}_{0},\hat{\sigma})$.
From the proof of [Lemma D.1](https://arxiv.org/html/2606.07308#A4.Thmtheorem1), the mapping $\theta\mapsto F(\alpha|x^{b};\theta)$ (for any given pair of values $\{\alpha,x^{b}\}$) is continuous on $\Theta$. Since $p(\theta_{0}\in\Theta)=1$, the continuous mapping theorem (Van der Vaart, [2000](https://arxiv.org/html/2606.07308#bib.bib42)) gives

$$
\big(\hat{\theta}_{n}\xrightarrow{p}\theta_{0}\big) \Rightarrow \big(F(\alpha|x^{b};\hat{\theta}_{n})\xrightarrow{p}F(\alpha|x^{b};\theta_{0})\big) \qquad \forall(\alpha,x^{b})\in\mathbb{R}^{+}\times\mathcal{X}. \tag{52}
$$

Similarly, $p(x^{s}\mid\mathcal{X}^{r},x^{b},T^{b}=0;\theta)$ is continuous on $\Theta$ because the other multiplicative terms are constant w.r.t. $\theta$. We then obtain consistency of $p(x^{s}\mid\mathcal{X}^{r},x^{b},T^{b}=0;\hat{\theta}_{n})$.
## Appendix E Strategy-Robust Doubly Robust Estimator
We use $\pi$ to denote the evaluation policy, $\pi_{0}$ the logging policy, and $\hat{\mu}(x^{s},t^{s})$ the regression-based estimator for the conditional expected outcome $\mathbb{E}[Y|x^{s},t^{s}]$.

$$
\begin{align}
\hat{V}_{\text{SDR}}(\pi) &= \hat{V}_{\text{S-IPS-res}}(\pi)+\hat{V}_{\text{S-DM}}(\pi) \tag{53}\\
&= \frac{1}{m}\sum_{i=1}^{m}\big(y_{i}-\hat{\mu}(x^{s}_{i},t^{s}_{i})\big)\frac{p_{\pi}(t^{s}_{i}|x^{s}_{i},t^{b}_{i},x^{b}_{i})}{p_{\pi_{0}}(t^{s}_{i}|x^{s}_{i},t^{b}_{i},x^{b}_{i})}\frac{\hat{p}_{\pi}(x^{s}_{i}|t^{b}_{i},x^{b}_{i})}{\hat{p}_{\pi_{0}}(x^{s}_{i}|t^{b}_{i},x^{b}_{i})}\frac{p_{\pi}(t^{b}_{i}|x^{b}_{i})}{p_{\pi_{0}}(t^{b}_{i}|x^{b}_{i})} \tag{54}\\
&\quad+\sum_{\mathcal{X}^{2}\times\mathcal{T}^{2}}\hat{\mu}(x^{s},t^{s})\,p_{\pi}(t^{s}|x^{s},t^{b},x^{b})\,\hat{p}_{\pi}(x^{s}|t^{b},x^{b})\,p_{\pi}(t^{b}|x^{b})\,\hat{p}(x^{b}). \tag{55}
\end{align}
$$

Note that we assume the above importance weights are well defined on the observed samples, which is mild because both the samples and the denominator terms are obtained under the same logging policy $\pi_{0}$.
For ease of presentation, we define the following terms:
$$
\begin{align}
w(t^{s},x^{s},t^{b},x^{b}) &:= \frac{p_{\pi}(t^{s}|x^{s},t^{b},x^{b})}{p_{\pi_{0}}(t^{s}|x^{s},t^{b},x^{b})}\frac{p_{\pi}(x^{s}|t^{b},x^{b})}{p_{\pi_{0}}(x^{s}|t^{b},x^{b})}\frac{p_{\pi}(t^{b}|x^{b})}{p_{\pi_{0}}(t^{b}|x^{b})}, \tag{56}\\
\hat{w}(t^{s},x^{s},t^{b},x^{b}) &:= \frac{p_{\pi}(t^{s}|x^{s},t^{b},x^{b})}{p_{\pi_{0}}(t^{s}|x^{s},t^{b},x^{b})}\frac{\hat{p}_{\pi}(x^{s}|t^{b},x^{b})}{\hat{p}_{\pi_{0}}(x^{s}|t^{b},x^{b})}\frac{p_{\pi}(t^{b}|x^{b})}{p_{\pi_{0}}(t^{b}|x^{b})}. \tag{57}
\end{align}
$$

We then use $w_{i}$ and $\hat{w}_{i}$ to refer to $w(t^{s}_{i},x^{s}_{i},t^{b}_{i},x^{b}_{i})$ and $\hat{w}(t^{s}_{i},x^{s}_{i},t^{b}_{i},x^{b}_{i})$, respectively.
In addition, we define the estimated densities:
$$
\begin{align}
\hat{p}_{\pi}(t^{s},x^{s},t^{b},x^{b}) &:= p_{\pi}(t^{s}|x^{s},t^{b},x^{b})\ \hat{p}_{\pi}(x^{s}|t^{b},x^{b})\ p_{\pi}(t^{b}|x^{b})\ \hat{p}(x^{b}), \tag{58}\\
\hat{p}_{\pi}(t^{s},x^{s}) &:= \sum_{\mathcal{X}\times\mathcal{T}}\hat{p}_{\pi}(t^{s},x^{s},t^{b},x^{b}), \tag{59}
\end{align}
$$

and let $\hat{P}_{T^{s},X^{s};\pi}$ denote the distribution that corresponds to the density $\hat{p}_{\pi}(t^{s},x^{s})$.
We rewrite the SDR estimator as
$$
\begin{align}
\hat{V}_{\text{SDR}}(\pi) &= \hat{V}_{\text{S-IPS-res}}(\pi)+\hat{V}_{\text{S-DM}}(\pi) \tag{60}\\
&= \frac{1}{m}\sum_{i=1}^{m}\big(y_{i}-\hat{\mu}(x^{s}_{i},t^{s}_{i})\big)\hat{w}_{i}+\sum_{\mathcal{X}^{2}\times\mathcal{T}^{2}}\hat{\mu}(x^{s},t^{s})\,\hat{p}_{\pi}(t^{s},x^{s},t^{b},x^{b}) \tag{61}\\
&= \frac{1}{m}\sum_{i=1}^{m}\big(y_{i}-\hat{\mu}(x^{s}_{i},t^{s}_{i})\big)\hat{w}_{i}+\sum_{\mathcal{X}\times\mathcal{T}}\hat{\mu}(x^{s},t^{s})\,\hat{p}_{\pi}(t^{s},x^{s}) \tag{62}\\
&= \frac{1}{m}\sum_{i=1}^{m}\big(y_{i}-\hat{\mu}(x^{s}_{i},t^{s}_{i})\big)\hat{w}_{i}+\mathbb{E}_{\hat{P}_{T^{s},X^{s};\pi}}\left[\hat{\mu}(X^{s},T^{s})\right]. \tag{63}
\end{align}
$$
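The compact form in (63) maps onto a few lines of array code. The following is a minimal sketch that assumes the nuisance quantities have already been computed; the function name `sdr_estimate`, its argument layout, and the toy numbers are hypothetical and only illustrate how the two terms combine.

```python
import numpy as np


def sdr_estimate(y, mu_hat_obs, w_hat, mu_hat_table, p_hat_pi):
    """Combine the weighted-residual term and the direct-method term of the SDR estimator.

    y            : logged outcomes y_i, shape (m,)
    mu_hat_obs   : fitted outcomes mu_hat(x_i^s, t_i^s) on the logged samples, shape (m,)
    w_hat        : estimated importance weights w_hat_i, shape (m,)
    mu_hat_table : mu_hat evaluated on every cell (x^s, t^s) of the finite space
    p_hat_pi     : estimated probability of each cell under the evaluation policy
    """
    ips_residual = np.mean((y - mu_hat_obs) * w_hat)   # first term in (63)
    direct_method = np.sum(mu_hat_table * p_hat_pi)    # second term in (63)
    return float(ips_residual + direct_method)


# Toy inputs, chosen only to make the arithmetic easy to follow.
y = np.array([1.0, 0.0])
mu_hat_obs = np.array([0.5, 0.5])
w_hat = np.array([2.0, 1.0])
mu_hat_table = np.array([0.5, 0.5])
p_hat_pi = np.array([0.3, 0.7])

value = sdr_estimate(y, mu_hat_obs, w_hat, mu_hat_table, p_hat_pi)
print(value)
```

When the outcome model fits the logged outcomes exactly, the residual term is zero and the estimate reduces to the direct-method term.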
## Appendix F Consistency of $\hat{V}_{\text{SDR}}$
We show the double robustness property for $\hat{V}_{\text{SDR}}$. Suppose that $\hat{p}_{\pi}(t^{s},x^{s},t^{b},x^{b})$ is consistent (for any $\pi$, including the case $\pi=\pi_{0}$), the nuisance components (i.e., $\hat{w}$, $\hat{\mu}$) are estimated independently, and the samples for $\hat{V}_{\text{S-IPS-res}}$ are collected separately. Then $\hat{V}_{\text{SDR}}$ is consistent if either $\hat{\mu}(x^{s},t^{s})$ is consistent or [Assumption 3.1](https://arxiv.org/html/2606.07308#S3.Thmtheorem1) (overlap) holds (i.e., $w(\cdot)$ is well defined).
### F.1 When $\hat{\mu}(x^{s},t^{s})$ Is Consistent
We inject the term $\mu(x^{s}_{i},t^{s}_{i})$ into $\hat{V}_{\text{SDR}}(\pi)$ as follows:

$$
\begin{align}
\hat{V}_{\text{SDR}}(\pi) &= \frac{1}{m}\sum_{i=1}^{m}\big(y_{i}-\hat{\mu}(x^{s}_{i},t^{s}_{i})\big)\hat{w}_{i}+\mathbb{E}_{\hat{P}_{T^{s},X^{s};\pi}}\left[\hat{\mu}(X^{s},T^{s})\right] \tag{64}\\
&= \underbrace{\frac{1}{m}\sum_{i=1}^{m}\big(y_{i}-\mu(x^{s}_{i},t^{s}_{i})\big)\hat{w}_{i}}_{A}+\underbrace{\frac{1}{m}\sum_{i=1}^{m}\big(\mu(x^{s}_{i},t^{s}_{i})-\hat{\mu}(x^{s}_{i},t^{s}_{i})\big)\hat{w}_{i}}_{B}+\underbrace{\mathbb{E}_{\hat{P}_{T^{s},X^{s};\pi}}\left[\hat{\mu}(X^{s},T^{s})\right]}_{C}. \tag{65}
\end{align}
$$
When $\hat{\mu}(x^{s},t^{s})$ and $\hat{p}_{\pi}(t^{s},x^{s},t^{b},x^{b})$ are consistent estimators, then by the continuous mapping theorem (Van der Vaart, [2000](https://arxiv.org/html/2606.07308#bib.bib42)), $C\xrightarrow{p}\mathbb{E}_{P_{T^{s},X^{s};\pi}}[\mu(X^{s},T^{s})]$, which means $C\xrightarrow{p}V(\pi)$.
When $\hat{w}$ is a fixed function (which can be achieved by estimating it from a separate dataset), we can use the law of large numbers to show that $A\xrightarrow{p}0$. In particular, we have

$$
\begin{align}
&\mathbb{E}_{\pi_{0}}\left[\big(Y-\mu(X^{s},T^{s})\big)\hat{w}(T^{s},X^{s},T^{b},X^{b})\right] \tag{66}\\
&= \mathbb{E}_{\pi_{0}}\left[\mathbb{E}\left[\big(Y-\mu(X^{s},T^{s})\big)\hat{w}(T^{s},X^{s},T^{b},X^{b})\ \big\lvert\ T^{s},X^{s},T^{b},X^{b}\right]\right] \tag{67}\\
&= \mathbb{E}_{\pi_{0}}\Big[\underbrace{\mathbb{E}\big[Y-\mu(X^{s},T^{s})\ \big\lvert\ T^{s},X^{s},T^{b},X^{b}\big]}_{=0\text{ from the definition of }\mu}\hat{w}(T^{s},X^{s},T^{b},X^{b})\Big] \tag{68}\\
&= 0, \tag{69}
\end{align}
$$

where $\hat{w}(T^{s},X^{s},T^{b},X^{b})$ can be moved outside of the conditional expectation because it is a constant when conditioned on $\{T^{s},X^{s},T^{b},X^{b}\}$, which comes from the fact that $\hat{w}$ is a non-random function.
As the above expectation exists, we can use the law of large numbers to show that $A\xrightarrow{p}\mathbb{E}_{\pi_{0}}\left[\big(Y-\mu(X^{s},T^{s})\big)\hat{w}(T^{s},X^{s},T^{b},X^{b})\right]$, which gives $A\xrightarrow{p}0$.
Similarly, in the $B$ term, when $\hat{\mu}(x^{s},t^{s})$ is a consistent estimator, we have $\hat{\mu}(x^{s},t^{s})-\mu(x^{s},t^{s})\xrightarrow{p}0$. Consequently, this gives us $B\xrightarrow{p}0$ as long as $\hat{w}(\cdot)$ is bounded. This boundedness of $\hat{w}$ can be obtained when $\hat{w}$ and $\hat{\mu}$ are estimated independently from separate datasets and the space $\mathcal{X}\times\mathcal{T}$ is finite.
Together, we have $\hat{V}_{\text{SDR}}\xrightarrow{p}V(\pi)$.
### F.2 When Overlap Holds
We inject the terms $w_{i}$ into $\hat{V}_{\text{SDR}}$ as follows:

$$
\begin{align}
\hat{V}_{\text{SDR}}(\pi) &= \frac{1}{m}\sum_{i=1}^{m}\big(y_{i}-\hat{\mu}(x^{s}_{i},t^{s}_{i})\big)\hat{w}_{i}+\mathbb{E}_{\hat{P}_{T^{s},X^{s};\pi}}\left[\hat{\mu}(X^{s},T^{s})\right] \tag{70}\\
&= \frac{1}{m}\sum_{i=1}^{m}y_{i}w_{i}+\frac{1}{m}\sum_{i=1}^{m}y_{i}\big(\hat{w}_{i}-w_{i}\big)+\frac{1}{m}\sum_{i=1}^{m}\hat{\mu}(x^{s}_{i},t^{s}_{i})\big(w_{i}-\hat{w}_{i}\big) \tag{71}\\
&\quad-\frac{1}{m}\sum_{i=1}^{m}\hat{\mu}(x^{s}_{i},t^{s}_{i})w_{i}+\mathbb{E}_{\hat{P}_{T^{s},X^{s};\pi}}\left[\hat{\mu}(X^{s},T^{s})\right]. \tag{72}
\end{align}
$$
We further denote the following for ease of presentation:
$$
\begin{align}
A &:= \frac{1}{m}\sum_{i=1}^{m}y_{i}w_{i}, \tag{73}\\
B &:= \frac{1}{m}\sum_{i=1}^{m}y_{i}\big(\hat{w}_{i}-w_{i}\big), \tag{74}\\
C &:= \frac{1}{m}\sum_{i=1}^{m}\hat{\mu}(x^{s}_{i},t^{s}_{i})\big(w_{i}-\hat{w}_{i}\big), \tag{75}\\
D &:= -\frac{1}{m}\sum_{i=1}^{m}\hat{\mu}(x^{s}_{i},t^{s}_{i})w_{i}+\mathbb{E}_{\hat{P}_{T^{s},X^{s};\pi}}\left[\hat{\mu}(X^{s},T^{s})\right], \tag{76}
\end{align}
$$

where $\hat{V}_{\text{SDR}}=A+B+C+D$. We will show that $A\xrightarrow{p}V(\pi)$ while the remaining terms converge to $0$.
Using the law of large numbers, we have $A\xrightarrow{p}\mathbb{E}_{\pi_{0}}[Yw(T^{s},X^{s},T^{b},X^{b})]$, where $\mathbb{E}_{\pi_{0}}[Yw(T^{s},X^{s},T^{b},X^{b})]=\mathbb{E}_{\pi}[Y]=V(\pi)$ if the overlap assumption holds.
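The change-of-measure identity behind this step can be verified exactly on a finite space. The sketch below collapses the tuple $(t^{s},x^{s},t^{b},x^{b})$ into a single cell index, which is a simplification made for illustration; the distributions, the per-cell mean outcomes, and the helper name `reweighted_value` are all hypothetical.

```python
import numpy as np


def reweighted_value(p_log, p_eval, mean_y):
    """E_{pi_0}[Y w] on a finite space, summing only over cells the logging policy can reach."""
    support = p_log > 0
    w = p_eval[support] / p_log[support]           # importance weights on the support
    return float(np.sum(p_log[support] * mean_y[support] * w))


p_eval = np.array([0.4, 0.3, 0.2, 0.1])            # cell probabilities under pi
mean_y = np.array([1.0, 0.0, 2.0, -1.0])           # E[Y | cell]
true_value = float(np.sum(p_eval * mean_y))        # V(pi) = E_pi[Y]

# With overlap: every cell that pi can reach is also reached by pi_0.
p_log = np.array([0.1, 0.2, 0.3, 0.4])
with_overlap = reweighted_value(p_log, p_eval, mean_y)

# Without overlap: pi_0 never visits cell 0, so its contribution is lost.
p_log_bad = np.array([0.0, 0.2, 0.4, 0.4])
without_overlap = reweighted_value(p_log_bad, p_eval, mean_y)

print(true_value, with_overlap, without_overlap)
```

The reweighted expectation matches $V(\pi)$ only in the first case, which is why the identity is stated conditionally on overlap.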
When $\hat{w}_{i}$ is a consistent estimator of $w_{i}$, by using the continuous mapping theorem (Van der Vaart, [2000](https://arxiv.org/html/2606.07308#bib.bib42)), we get $B\xrightarrow{p}0$ and $C\xrightarrow{p}0$, as long as the mapping $\hat{\mu}$ is fixed (e.g., by estimating it from a separate dataset).
When the mapping $\hat{\mu}$ is fixed, by the law of large numbers, the first term in $D$ converges to $\mathbb{E}_{P_{T^{s},X^{s};\pi}}[-\hat{\mu}(X^{s},T^{s})]$. When $\hat{p}_{\pi}(t^{s},x^{s},t^{b},x^{b})$ is consistent, the second term in $D$ converges to $\mathbb{E}_{P_{T^{s},X^{s};\pi}}[\hat{\mu}(X^{s},T^{s})]$. Thus, together, $D\xrightarrow{p}0$.
Putting everything together, $\hat{V}_{\text{SDR}}(\pi)\xrightarrow{p}V(\pi)$.