Quantifying Potential Observation Missingness in Inverse Reinforcement Learning

arXiv cs.LG Papers

Summary

This paper identifies the problem of missing observations in inverse reinforcement learning (IRL) that can make expert actions appear suboptimal, and develops a practical algorithm to quantify the minimal perturbations needed for expert actions to appear optimal, validated on synthetic tasks, cancer treatment simulation, and ICU data.

arXiv:2605.12831v1 Announce Type: new Abstract: Inverse reinforcement learning (IRL), which infers reward functions from demonstrations, is a valuable tool for modeling and understanding decision-making behavior. Many variants of IRL have been developed to capture complexities of human decision-making, such as subjective beliefs, imperfect planning, and dynamic goals. However, an often-overlooked issue in real-world behavioral datasets is that the recorded data may be missing observations that were available to the original decision-maker. In use-inspired settings such as healthcare, this can make expert actions appear suboptimal, even when they were near-optimal given the information available at the time. As a result, the rewards learned by standard IRL may be misleading. In this paper, we identify the minimal perturbations to the recorded observations needed for the expert's actions to appear optimal. We develop a practical algorithm for this problem and demonstrate its utility for quantifying the possible extent of missing observations in behavioral datasets through extensive experiments on synthetic navigation tasks, a cancer treatment simulator, and ICU treatment data.

Cached at: 05/14/26, 06:19 AM

# Quantifying Potential Observation Missingness in Inverse Reinforcement Learning
Source: [https://arxiv.org/html/2605.12831](https://arxiv.org/html/2605.12831)
Leo Benac (School of Engineering and Applied Sciences, Harvard University, lbenac@g.harvard.edu), Abhishek Sharma (School of Engineering and Applied Sciences, Harvard University, abhisheksharma@g.harvard.edu), Alihan Hüyük (School of Engineering and Applied Sciences, Harvard University, ahuyuk@fas.harvard.edu), Finale Doshi-Velez (School of Engineering and Applied Sciences, Harvard University, finale@seas.harvard.edu)

###### Abstract

Inverse reinforcement learning \(IRL\), which infers reward functions from demonstrations, is a valuable tool for modeling and understanding decision\-making behavior\. Many variants of IRL have been developed to capture complexities of human decision\-making, such as subjective beliefs, imperfect planning, and dynamic goals\. However, an often\-overlooked issue in real\-world behavioral datasets is that the recorded data may be missing observations that were available to the original decision\-maker\. In use\-inspired settings such as healthcare, this can make expert actions appear suboptimal, even when they were near\-optimal given the information available at the time\. As a result, the rewards learned by standard IRL may be misleading\. In this paper, we identify the minimal perturbations to the recorded observations needed for the expert’s actions to appear optimal\. We develop a practical algorithm for this problem and demonstrate its utility for quantifying the possible extent of missing observations in behavioral datasets through extensive experiments on synthetic navigation tasks, a cancer treatment simulator, and ICU treatment data\.

## 1 Introduction

A common use of inverse reinforcement learning (IRL) in healthcare is to analyze retrospective treatment data: given trajectories of patient vital signs (states) and clinician treatments (actions), can we infer the implicit objectives guiding those decisions? More generally, IRL aims to determine what reward function a decision-making agent might be optimizing given observations of their past actions (Ng et al., [2000](https://arxiv.org/html/2605.12831#bib.bib1)). Although IRL is typically performed as an intermediary step in imitation learning (mimicking the decision-making policy of some demonstrator) (e.g. Brown et al., [2020](https://arxiv.org/html/2605.12831#bib.bib16); Ruan et al., [2023](https://arxiv.org/html/2605.12831#bib.bib28)) or apprenticeship learning (matching the demonstrator’s performance in terms of some notion of ground-truth rewards) (e.g. Abbeel and Ng, [2004](https://arxiv.org/html/2605.12831#bib.bib17)), it has also been an invaluable tool for modeling and understanding human decision-making behavior, providing a means to efficiently infer the likely goals of expert decision-makers and concisely describe them as reward functions.

In line with this descriptive purpose, the literature on IRL has focused on accounting for various complexities of human decision-making that are not typically encountered in reinforcement learning: creating reward-based models of experts when they might have subjective beliefs about their environment that are different from the objective truth (Reddy et al., [2018](https://arxiv.org/html/2605.12831#bib.bib18); Hüyük et al., [2021](https://arxiv.org/html/2605.12831#bib.bib19)), when their ability to plan future actions is not perfect (Jarrett et al., [2021](https://arxiv.org/html/2605.12831#bib.bib20); Poiani et al., [2024](https://arxiv.org/html/2605.12831#bib.bib21)), or when their goals might be shifting over time (Likmeta et al., [2021](https://arxiv.org/html/2605.12831#bib.bib43); Hüyük et al., [2022](https://arxiv.org/html/2605.12831#bib.bib39)). Additionally, recent descriptive models have leveraged this assumption of expert near-optimality to directly infer the environment’s unknown transition dynamics rather than a reward function (Benac et al., [2024](https://arxiv.org/html/2605.12831#bib.bib2)).

However, one complication of analyzing observed behavior that is often overlooked in IRL is the possibility of missing observations: that is, some of the information observed by the original decision-maker might not have been recorded for us to observe. (Footnote 1: This is not to be confused with partial observations, where neither the original decision-maker nor we as researchers observe parts of the environment state.)

![Refer to caption](https://arxiv.org/html/2605.12831v1/x1.png)
Figure 1: Suppose states consist of shapes and colors, and actions are determined solely by colors. The behavior in two episodes begins to differ after different color changes at $t=4$. If the recorded data omits colors, conventional IRL cannot accurately predict actions from that time onward. Our approach provides an alternative perspective: some unobserved change at $t=4$ is needed for the actions to be perfectly predictable.

Missing observations can have significant implications, especially from a modeling perspective. For instance, consider a healthcare scenario where either Treatment A or Treatment B is most beneficial for a patient depending on some biomarker, say whether their blood pressure is low or high (suppose it is low 25% of the time). Accordingly, doctors always assign Treatment A to patients with low blood pressure and Treatment B to patients with high blood pressure. Now suppose blood pressure is not recorded in a dataset of treatment decisions, despite being available to doctors at decision time. If we use conventional IRL with this dataset, we might conclude that Treatment B has higher reward but that 25% of the doctors act sub-optimally. However, an equally plausible explanation given our data is that the doctors act near-optimally, and that the factor making Treatment A preferable is simply missing from the recorded observations. We might prefer one explanation over the other depending on our prior belief about the situation: the former if we trust the data collection process, or the latter if we trust the doctors’ expertise, which is our assumption in this work. This type of retrospective analysis is representative of existing healthcare uses of IRL, including work on sepsis management, ICU hypotension management, and ICU ventilation and sedation decisions (Lee et al., [2019](https://arxiv.org/html/2605.12831#bib.bib11); Srinivasan and Doshi-Velez, [2020](https://arxiv.org/html/2605.12831#bib.bib13); Yu et al., [2019](https://arxiv.org/html/2605.12831#bib.bib12)). We provide a toy example in Figure [1](https://arxiv.org/html/2605.12831#S1.F1) to convey the intuition of our setup.

Motivated by this gap in decision modeling with IRL, we ask the following question: *What is the smallest perturbation that needs to be made to observations so that the corresponding actions appear to be optimal?* Of course, making this question meaningful requires a careful notion of what counts as a *small* perturbation, which is one of the key technical issues we address. While answering this question does not allow us to recover the missing information itself, it can help us gauge how much information may be missing, or more precisely, the minimal extent of missingness required if the decision-makers are assumed to be experts acting optimally. Determining whether this potential missingness is substantial can inform downstream choices about how the dataset should be used. For instance, if we intend to perform policy evaluation or regular reinforcement learning using a behavioral dataset, we might prefer algorithms designed to mitigate missing observations (e.g. Kallus and Zhou, [2020](https://arxiv.org/html/2605.12831#bib.bib23); Wang et al., [2021](https://arxiv.org/html/2605.12831#bib.bib24)).

Contributions. Our contributions are three-fold. Conceptually, we introduce the idea of quantifying missingness in terms of minimal perturbations needed to align the actions of an expert with those of an optimal policy. We formulate this as a new optimization problem in Section [3](https://arxiv.org/html/2605.12831#S3). Technically, we develop an algorithm that can solve this optimization problem effectively (Section [4](https://arxiv.org/html/2605.12831#S4)). Empirically, we demonstrate that we can recover perturbations with magnitude proportional to missingness, in a way that is robust to missingness in unimportant features, and that the learned perturbations capture meaningful structure in behavior by separating trajectories according to the underlying hidden decision context.

## 2 Related Work

Missingness in Imitation Learning. As we briefly mentioned in the introduction, IRL is often performed as an intermediary step in imitation learning, which aims to copy the decision-making policy of a demonstrator. This is achieved by first inferring the demonstrator’s reward function via IRL, and then optimizing that reward function via regular reinforcement learning to obtain their policy. Besides this strategy, other methods have also been developed for imitation learning that avoid inferring reward functions altogether (e.g. Ho and Ermon, [2016](https://arxiv.org/html/2605.12831#bib.bib25)).

For reward-free imitation, Zhang et al. ([2020](https://arxiv.org/html/2605.12831#bib.bib26)); Kumor et al. ([2021](https://arxiv.org/html/2605.12831#bib.bib27)) study missing observations from a causal perspective. They characterize when the demonstrator’s policy can still be imitated despite missing variables in the underlying causal structure, i.e. despite hidden confounding. More recently, Ruan et al. ([2023](https://arxiv.org/html/2605.12831#bib.bib28)) extended this line of work to reward-based imitation. Our goal is complementary. Rather than asking when imitation or reward recovery remains possible under hidden confounding, we ask how much unobserved information would be needed for the observed behavior in a fixed dataset to be explained as optimal under a reward model. This provides a quantitative measure of potential missingness, and can localize where missing context matters, without requiring a causal graph or assumptions on the underlying causal mechanisms.

Modeling Variations in Behavior. As we discussed in the introduction, when observations are missing, behavior might seem to vary across episodes, whereas these variations can in fact be explained perfectly by variation in the missing observations. In our earlier healthcare example, the recorded data made it seem like different doctors assign different treatments; however, these variations were simply due to differences in patient blood pressures. In that sense, our method can be viewed as a way of explaining variations that occur specifically across episodes, and as such, it is related to other methods that aim to explain variations in behavior using IRL.

Multi-modal IRL (Wang et al., [2017](https://arxiv.org/html/2605.12831#bib.bib29); Hsiao et al., [2019](https://arxiv.org/html/2605.12831#bib.bib30); Myers et al., [2022](https://arxiv.org/html/2605.12831#bib.bib31); Qiao et al., [2024](https://arxiv.org/html/2605.12831#bib.bib32)) and multi-task IRL (Babes et al., [2011](https://arxiv.org/html/2605.12831#bib.bib33); Choi and Kim, [2012](https://arxiv.org/html/2605.12831#bib.bib34); Ramponi et al., [2020](https://arxiv.org/html/2605.12831#bib.bib35); Huang et al., [2021](https://arxiv.org/html/2605.12831#bib.bib36)) also explain variation in observed behavior, typically by inferring $K$ latent modes, decision-makers, or reward functions. Our setting is related but asks a different question: rather than introducing multiple experts or rewards, we ask how much heterogeneity can be explained by missing observations while retaining a shared reward function. Thus, these methods vary the number of latent modes, whereas we quantify the minimal unobserved context needed to make the observed actions appear optimal.

Finally, econometrics and statistics have long studied sensitivity to missing variables through omitted-variable-bias analyses for regression and treatment-effect estimation. Prior work uses selection on observables to reason about selection on unobservables (Altonji et al., [2005](https://arxiv.org/html/2605.12831#bib.bib10)), coefficient stability and $R^{2}$ bounds to assess unobserved confounding (Oster, [2019](https://arxiv.org/html/2605.12831#bib.bib9)), and robustness values or partial-$R^{2}$ summaries computable from standard regression outputs (Cinelli and Hazlett, [2020](https://arxiv.org/html/2605.12831#bib.bib8)). Our work is in the same spirit, but for sequential decision-making: instead of asking whether a scalar coefficient is robust to an omitted variable, we ask how much unobserved trajectory-level information is needed for observed decisions to be consistent with optimal behavior under a reward model.

## 3 Problem Formulation

Setting. We consider a setting where agents act through $N$ episodes, indexed by $n\in[N]$, for $T$ time steps each, indexed by $t\in[T]$. At each time step, they encounter a state $s_{nt}\in\mathcal{S}$ and take an action $a_{nt}\in\mathcal{A}$ based on that state according to some behavior policy: $a_{nt}\sim\pi_{\text{behavior}}(s_{nt})$, assumed to be near-optimal. Now, suppose each state consists of two parts: $s_{nt}=(x_{nt},\tilde{x}_{nt})\in\mathcal{S}=\mathcal{X}\times\tilde{\mathcal{X}}$. While the agents determine their actions based on both parts, $s_{nt}$, suppose only the first part, $x_{nt}$, is recorded in a dataset while the latter part, $\tilde{x}_{nt}$, goes unrecorded, resulting in a behavioral dataset $\bm{D}=\{x_{nt},a_{nt}\}\in(\mathcal{X}\times\mathcal{A})^{N\times T}$ with missing observations $\{\tilde{x}_{nt}\}$.

Objective. Our objective is to obtain a reward-based description of the behavior policy $\pi_{\text{behavior}}$ while accounting for the missing observations $\{\tilde{x}_{nt}\}$. In general, we cannot recover these missing observations, so we do *not* aim for imputation; nor do we aim to replicate the behavior policy, so we do *not* aim for imitation. Instead, we quantify how large the missing information might be under the prior belief that $\pi_{\text{behavior}}$ is near-optimal, for example because it reflects decisions by domain experts such as clinicians. This provides a way to assess the dataset $\bm{D}$ and inform downstream use: large potential missingness may suggest improving data collection or treating downstream analyses with caution, while small potential missingness supports greater confidence in tasks such as off-policy evaluation or offline reinforcement learning.

Conventional IRL. As a contrast, consider applying conventional IRL to the recorded observations alone, ignoring the missing observations $\{\tilde{x}_{nt}\}$. Given a reward function $r_{\theta}(x,a)$ with corresponding optimal policy $\pi^{*}_{r_{\theta}}$, IRL seeks a reward under which the recorded actions appear as optimal:

$$\text{minimize}_{\theta}~\sum_{n,t}\|a_{nt}-\pi^{*}_{r_{\theta}}(x_{nt})\|, \tag{1}$$

where $\|\cdot\|$ is a distance measure between the actions of the experts and the inferred policy’s decisions. For instance, in the case of maximum entropy IRL (Ziebart et al., [2008](https://arxiv.org/html/2605.12831#bib.bib44)), this would be the negative log-likelihood of actions under the inferred policy: $\|a_{nt}-\pi^{*}_{r_{\theta}}(x_{nt})\|\doteq-\log\pi^{*}_{r_{\theta}}(x_{nt})[a_{nt}]$.

It is important to note that a good match between actions and the inferred policy may not always be possible. In particular, if the same recorded observation $x_{nt}=x_{n't'}$ has led to different actions $a_{nt}\neq a_{n't'}$, then any policy of the form $\pi^{*}_{r_{\theta}}(x)$ must assign the same decision to both inputs and therefore cannot match both actions simultaneously. This is exactly the type of inconsistency that can arise with missing observations: the difference in actions may be due to differences in the unrecorded part of the state, $\tilde{x}_{nt}\neq\tilde{x}_{n't'}$, even though the recorded part is the same.
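To make this inconsistency concrete, the following minimal sketch (not taken from the paper) computes the best action-matching accuracy that any deterministic policy of the recorded observation alone could reach. It assumes discrete, hashable observations; the function name `best_achievable_accuracy` and the toy records, modeled on the blood-pressure example from the introduction, are illustrative assumptions.

```python
from collections import Counter, defaultdict


def best_achievable_accuracy(records):
    """Upper bound on accuracy for any deterministic policy that sees only x.

    records: list of (x, a) pairs with hashable recorded observation x.
    A policy of x alone can match at most the majority action for each x.
    """
    actions_by_obs = defaultdict(Counter)
    for x, a in records:
        actions_by_obs[x][a] += 1
    matched = sum(max(counts.values()) for counts in actions_by_obs.values())
    return matched / len(records)


# Hypothetical dataset: blood pressure is unrecorded, so every record
# carries the same observation while the treatments differ.
records = [("patient", "A")] * 25 + [("patient", "B")] * 75
print(best_achievable_accuracy(records))  # 0.75
```

Any accuracy above this bound has to come from information outside the recorded observations, which is the role the latent factors play in the formulation that follows. For continuous observations, exact duplicates are rare, so a check of this kind would need binning or a neighborhood criterion.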

Modeling Missingness. Recall that our goal is to reason about the information content of the missing observations $\{\tilde{x}_{nt}\}$, rather than to reconstruct each missing value itself. To avoid explicitly modeling the dynamics of the missing observations, we introduce for each trajectory $n\in[N]$ a static latent factor $z_{n}\in\mathcal{Z}$ that stands in for all missing information relevant to that trajectory. Here, $z_{n}$ may in general be a vector, and we do not assume that its dimension or structure is known a priori. It is *static* only in the sense that it does not vary across time within a trajectory, hence it carries no time index. From this point on, we model rewards and policies as functions of the recorded observations together with this latent factor: $r_{\theta}(z_{n},x_{nt},a_{nt})$ and $a_{nt}\sim\pi(z_{n},x_{nt})$.

This reformulation does not reduce expressiveness as long as $z_{n}$ can encode information equivalent to the full missing trajectory $(\tilde{x}_{n1},\tilde{x}_{n2},\ldots,\tilde{x}_{nT})$. We emphasize that $z_{n}$ is introduced as a compact representation of potentially missing context, not as an assumption that the true missing information is itself low-dimensional or static.

Then our objective can be stated as finding the smallest possible $\{z_{n}\}$ that allows a desired level of optimality, measured as the distance between recorded actions and an optimal policy:

$$\begin{aligned}
\text{minimize}_{\{z_{n}\},\theta}\quad & \sum_{n}\|z_{n}\| && (2)\\
\text{s.t.}\quad & \sum_{n,t}\|a_{nt}-\pi^{*}_{r_{\theta}}(z_{n},x_{nt})\|\leq\zeta. && (3)
\end{aligned}$$

Here $\zeta$ controls the required action-matching level. When $z_{n}=0$ for all $n$, the formulation reduces to conventional IRL and can only achieve thresholds at or above

$$\zeta_{0}=\min_{\theta}\sum_{n,t}\|a_{nt}-\pi^{*}_{r_{\theta}}(x_{nt})\|.$$

As $\zeta$ decreases below $\zeta_{0}$, nonzero $z_{n}$ must supply additional information. At the other extreme, $\zeta=0$ can be achieved by encoding the full action sequence in $z_{n}$, but our objective asks how much smaller a perturbation suffices.

The optimization in Equations ([2](https://arxiv.org/html/2605.12831#S3.E2))–([3](https://arxiv.org/html/2605.12831#S3.E3)) is still abstract at this stage. To make the size of $z_{n}$ meaningful, we must make concrete modeling choices about both the space $\mathcal{Z}$ and the way $z_{n}$ enters the reward function. We now turn to these choices.

Magnitude of Latent Factors. A key issue is that, unless we design the space of $z_{n}$ and the class of reward functions $r_{\theta}(z_{n},x_{nt},a_{nt})$ carefully, the magnitude of $\|z_{n}\|$ may not be meaningful. As a simple example, suppose $\mathcal{Z}=\mathbb{R}$, $\mathcal{X}=\mathbb{R}$, $\mathcal{A}=\{0,1\}$, and the reward functions are linear:

$$r_{\theta}(z_{n},x_{nt},a_{nt})=\phi_{a_{nt}}z_{n}+\psi_{a_{nt}}x_{nt},$$

with parameters $\theta=\{\phi_{0},\phi_{1},\psi_{0},\psi_{1}\}$. Then we can make $\{z_{n}\}$ arbitrarily small by making $\{\phi_{0},\phi_{1}\}$ sufficiently large, without changing the reward function or the optimal policy. In such a parameterization, the size of $z_{n}$ reflects an arbitrary scaling choice rather than the amount of missing information.
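The scaling argument can be checked numerically. The sketch below is illustrative only: the weights, the latent value, and the constant `c` are made-up numbers, chosen so that the floating-point arithmetic is exact. Multiplying the latent weights by `c` while dividing the latent factor by `c` leaves every reward unchanged, even though the latent factor shrinks.

```python
def linear_reward(z, x, a, phi, psi):
    """Linear reward r(z, x, a) = phi[a] * z + psi[a] * x from the example."""
    return phi[a] * z + psi[a] * x


phi = {0: 1.0, 1: -2.0}   # hypothetical weights on the latent factor
psi = {0: 0.5, 1: 0.25}   # hypothetical weights on the recorded observation
z, x = 4.0, 3.0

c = 8.0  # any positive constant works; a power of two keeps the arithmetic exact
phi_scaled = {a: c * w for a, w in phi.items()}
z_scaled = z / c

rewards = [linear_reward(z, x, a, phi, psi) for a in (0, 1)]
rewards_scaled = [linear_reward(z_scaled, x, a, phi_scaled, psi) for a in (0, 1)]
print(rewards == rewards_scaled, abs(z_scaled) < abs(z))  # True True
```

Because the rewards coincide for every action, the induced optimal policy is identical, so a norm on the latent factor alone cannot distinguish the two parameterizations.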

$\mathcal{Z}$ Representation. To make the magnitude of $z_{n}$ meaningful, we restrict how missing information enters the reward. In this work, we consider continuous state spaces and parameterize episode-specific perturbations using a finite set of $K$ kernels placed in the observed state space. Specifically, we introduce kernel centers $\mu_{1},\ldots,\mu_{K}\in\mathcal{X}$ and a bandwidth parameter $\sigma>0$, both shared across episodes. Each episode $n\in[N]$ then has its own action-specific kernel coefficients $z_{n}\in\mathcal{Z}=\mathbb{R}^{K\times|\mathcal{A}|}$. We define the reward function as:

$$r_{\theta}(z_{n},x_{nt},a_{nt})=\bar{r}_{\theta}(x_{nt},a_{nt})+\sum_{k=1}^{K}\kappa_{\sigma}(x_{nt},\mu_{k})\,(z_{n})_{k,a_{nt}}, \tag{4}$$

where $\bar{r}_{\theta}$ is a shared base reward function (represented as a neural network with parameters $\theta$) and $\kappa_{\sigma}(x,\mu_{k})$ measures the proximity of $x$ to the $k$-th kernel center. In our implementation, we use Gaussian radial basis function kernels: $\kappa_{\sigma}(x,\mu_{k})=\exp\!\left(-\frac{\|x-\mu_{k}\|_{2}^{2}}{2\sigma^{2}}\right)$.
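As a rough illustration of Equation (4), the sketch below evaluates a kernel-perturbed reward in plain Python. It is not the authors' implementation: the base reward is a stand-in function rather than a neural network, and the names `rbf` and `perturbed_reward`, the centers, and the coefficients are all hypothetical.

```python
import math


def rbf(x, mu, sigma):
    """Gaussian RBF kernel exp(-||x - mu||^2 / (2 sigma^2))."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def perturbed_reward(base_reward, z_n, x, a, centers, sigma):
    """Base reward plus the episode-specific kernel perturbation for action a.

    z_n[k][a] is the coefficient of kernel k for action a in this episode.
    """
    bump = sum(rbf(x, mu, sigma) * z_n[k][a] for k, mu in enumerate(centers))
    return base_reward(x, a) + bump


centers = [(0.0, 0.0), (5.0, 5.0)]   # hypothetical kernel centers
z_n = [[0.0, 2.0], [0.0, 0.0]]       # only kernel 0 perturbs action 1


def base(x, a):
    return 0.0                        # stand-in for the shared base reward


near = perturbed_reward(base, z_n, (0.0, 0.0), 1, centers, sigma=1.0)
far = perturbed_reward(base, z_n, (5.0, 5.0), 1, centers, sigma=1.0)
print(near, far)  # the perturbation is large at the center and negligible far away
```

The perturbation acts locally: the same coefficient that shifts the reward by its full value at the kernel center has almost no effect a few bandwidths away.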

Size of $z_{n}$. Under this parameterization, $z_{n}$ affects the reward only *linearly*: each coefficient $(z_{n})_{k,a}$ directly scales one localized perturbation pattern for action $a$. Thus, the size of $z_{n}$ is directly tied to the size of the reward perturbation it induces. We measure trajectory-level missingness by the $\ell_{1}$ norm $\|z_{n}\|_{1}$, and use $\|z_{n}\|=\|z_{n}\|_{1}$ in Equation ([2](https://arxiv.org/html/2605.12831#S3.E2)). Smaller $\|z_{n}\|_{1}$ corresponds to weaker episode-specific deviations from the shared reward function.

Importantly, $z_{n}$ is penalized once per trajectory, not once per time step. Thus, a single latent component can influence multiple decisions within the same trajectory without being counted multiple times, matching our interpretation of $z_{n}$ as trajectory-level missing context. More generally, this kernel parameterization avoids assigning an independent perturbation to every possible continuous observation; instead, it represents missing information through a small number of spatially localized perturbation patterns, smoothly interpolated across the observed state space.

## 4 Minimum-Perturbation IRL

We now describe the optimization used to learn the base-IRL reward and episode-specific perturbations. Our method builds on the maximum causal entropy IRL framework (Ziebart et al., [2010](https://arxiv.org/html/2605.12831#bib.bib15)), in which the expert is modeled as following a soft-optimal policy induced by a soft $Q$-function:

$$\pi(a|x)=\frac{\exp(Q(x,a))}{\sum_{a'\in\mathcal{A}}\exp(Q(x,a'))}. \tag{5}$$

The corresponding soft Bellman equations are:

$$Q(x,a)=\mathbb{E}_{x'\sim T(\cdot|x,a)}\!\left[r(x,a)+\gamma V(x')\right],\qquad V(x)=\log\sum_{a\in\mathcal{A}}\exp(Q(x,a)). \tag{6}$$

In our offline setting, we do not have access to the transition model $T$. We therefore enforce these equations only approximately on transitions in the dataset using a sample-based Bellman target and a penalized objective. This forces us to regularize the Q-function more heavily.
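Equations (5) and (6) are linked by the identity $\pi(a|x)=\exp(Q(x,a)-V(x))$, since $V(x)$ is the log of the softmax normalizer. A minimal sketch for a single state with a finite action set, using a numerically stable log-sum-exp (the function names and Q-values are illustrative):

```python
import math


def soft_value(q_values):
    """Soft value V(x) = log sum_a exp(Q(x, a)), computed stably."""
    m = max(q_values)
    return m + math.log(sum(math.exp(q - m) for q in q_values))


def soft_policy(q_values):
    """Soft-optimal policy pi(a | x) = exp(Q(x, a) - V(x))."""
    v = soft_value(q_values)
    return [math.exp(q - v) for q in q_values]


q = [1.0, 2.0, 0.5]        # hypothetical Q-values for three actions
probs = soft_policy(q)
print(probs, sum(probs))   # probabilities sum to 1 up to rounding
```

Subtracting the maximum before exponentiating avoids overflow when Q-values are large, without changing the result.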

Minimum-Perturbation IRL (MP-IRL). We optimize the model in two phases. Phase 1 performs standard IRL to extract the best *base*-reward explanation based solely on what the recorded data tells us. Phase 2 then freezes this base reward model and infers the episode-specific perturbations $z_{n}$ needed to recover the remaining behavior. This structure is central to our approach: it ensures that the latent variables only explain what the observed data cannot, allowing us to accurately quantify the perturbations corresponding to the missing observations.

We write

$$\pi_{\phi}(a\mid x,z,\mu)\propto\exp\!\left(\frac{Q_{\phi}(x,z,\mu,a)}{\tau}\right)$$

for the soft policy induced by the critic $Q_{\phi}$, where $\tau>0$ is a temperature parameter and $\mu=\{\mu_{k}\}_{k=1}^{K}$ are the shared kernel centers. We use $\tau=0.3$ throughout. Smaller values of $\tau$ make the policy closer to deterministic, but this does not by itself necessarily improve action matching or likelihood; it only sharpens the action probabilities implied by the critic. Since $\tau$ changes the scale of the perturbations needed to explain the same actions, we keep it fixed whenever comparing the learned size of $z$ across experiments. We also define the soft Bellman target:

$$y_{\phi,\theta}(x_{nt},z_{n},\mu,a_{nt},x_{n,t+1})=r_{\theta}(z_{n},x_{nt},a_{nt};\mu)+\gamma\log\sum_{a'\in\mathcal{A}}\exp\bigl(Q_{\bar{\phi}}(z_{n},x_{n,t+1},a';\mu)\bigr), \tag{7}$$

where $Q_{\bar{\phi}}$ denotes the target Q-network, used for stability, with the understanding that terminal next states have zero continuation value. Throughout, $\ell_{\mathrm{SL1}}$ denotes the smooth-$\ell_{1}$ (Huber) loss.
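The target in Equation (7) and the smooth-$\ell_{1}$ loss can be sketched for a single transition as follows. This is an illustration under assumptions, not the paper's code: the target network is replaced by a plain list of next-state Q-values, and the Huber threshold `beta=1.0` is an assumed default that the text does not specify.

```python
import math


def soft_bellman_target(reward, next_q_values, gamma, terminal):
    """Sample-based target: r + gamma * log sum_a' exp(Q_target(x', a')).

    Terminal next states contribute zero continuation value.
    """
    if terminal:
        return reward
    m = max(next_q_values)
    soft_v = m + math.log(sum(math.exp(q - m) for q in next_q_values))
    return reward + gamma * soft_v


def smooth_l1(pred, target, beta=1.0):
    """Smooth-l1 (Huber) loss: quadratic near zero, linear in the tails."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


y = soft_bellman_target(1.0, [0.0, 0.0], gamma=0.9, terminal=False)
print(y, smooth_l1(2.0, y))  # target is 1 + 0.9 * log(2); the loss uses the quadratic branch
```

The linear tails make the Bellman penalty less sensitive to occasional large errors than a squared loss, which matters when the target is built from sampled transitions.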

Decision regions and kernel initialization. Although we do not give a formal definition, it is useful to think in terms of *decision regions*: parts of the observed state space where nearly identical recorded observations can lead to different expert actions because relevant context is missing. These are precisely the regions where we expect reward perturbations to be needed. We normalize all features before computing kernel distances so that variables with larger numerical scale do not dominate the RBF kernels. If prior knowledge about decision regions is available, the centers can be initialized there directly. Otherwise, we initialize them with $K$-means on the recorded observations so that they start on the support of the data. We deliberately choose $K$ somewhat large; if some centers are not useful, the sparsity penalty in phase 2 drives their corresponding coefficients in $z_{n}$ close to zero.
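The text does not say which normalization is used; one common choice is per-feature standardization (z-scoring), sketched below as an assumption. The function name `standardize` and the toy data are illustrative.

```python
import math


def standardize(rows):
    """Z-score each feature column: subtract the mean, divide by the std.

    Constant columns (std of zero) are left centered but unscaled.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
        for j in range(d)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


# Two features on very different scales (hypothetical values).
raw = [[1.0, 100.0], [3.0, 300.0]]
print(standardize(raw))  # [[-1.0, -1.0], [1.0, 1.0]]
```

After this step both features contribute equally to the squared distance inside the RBF kernel, and the same standardized observations would be the input to the $K$-means initialization of the centers.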

Phase 1 (base IRL). We initialize $z_{n}\leftarrow\mathbf{0}$ for all episodes. Since the perturbation then vanishes, the reward reduces to the shared base reward $\bar{r}_{\theta}$. We solve

$$\begin{aligned}
\min_{\phi,\theta}\quad\mathcal{L}_{\mathrm{base}}(\phi,\theta)&=-\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\log\pi_{\phi}(a_{nt}\mid x_{nt},\mathbf{0};\mu)\\
&\quad+\lambda\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\ell_{\mathrm{SL1}}\!\left(Q_{\phi}(\mathbf{0},x_{nt},a_{nt};\mu),\,y_{\phi,\theta}(x_{nt},\mathbf{0},\mu,a_{nt},x_{n,t+1})\right).
\end{aligned} \tag{8}$$

Let $(\phi^{(1)},\theta^{(1)})$ denote the solution.

Phase 2 (MP-IRL). We freeze the reward parameters at $\theta^{(1)}$, initialize the critic from phase 1 with $\phi^{(1)}$, and optimize the critic together with the episode-specific kernel coefficients $\bm{z}=\{z_{n}\}_{n=1}^{N}$ and the kernel centers $\mu=\{\mu_{k}\}_{k=1}^{K}$. Thus, phase 2 solves

$$\begin{aligned}
\min_{\phi,\bm{z};\mu}\quad\mathcal{L}_{\mathrm{MP}}(\phi,\bm{z};\mu)&=-\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\log\pi_{\phi}(a_{nt}\mid x_{nt},z_{n};\mu)\\
&\quad+\lambda\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\ell_{\mathrm{SL1}}\!\left(Q_{\phi}(z_{n},x_{nt},a_{nt};\mu),\,y_{\phi,\theta^{(1)}}(x_{nt},z_{n},\mu,a_{nt},x_{n,t+1})\right)\\
&\quad+\alpha\sum_{n=1}^{N}\left(\sum_{k=1}^{K}\|z_{n,k,:}\|_{2}+\|z_{n}\|_{1}\right)\\
&\quad+\beta\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\sum_{a'\in\mathcal{A}}\ell_{\mathrm{SL1}}\!\left(Q_{\phi}(\mathbf{0},x_{nt},a';\mu),\,Q_{\phi^{(1)}}(\mathbf{0},x_{nt},a';\mu)\right).
\end{aligned} \tag{9}$$

The first term in both phases encourages action matching, while the second enforces soft Bellman consistency so that the learned policy remains tied to reward optimality rather than pure behavior cloning. In Phase 2, the sparsity and $\ell_{1}$ penalties implement the minimum-perturbation objective by encouraging localized, low-magnitude reward changes through $z_{n}$. The final term anchors the phase-2 critic to the phase-1 critic when $z=0$, so improvements over Phase 1 must be explained by nonzero perturbations rather than by changing the zero-perturbation policy. For $z_{n}$ to be meaningful as a missingness measure, the Bellman and anchor errors should remain small; otherwise the critic could absorb behavior-cloning-like changes. Algorithm [1](https://arxiv.org/html/2605.12831#alg1) in Appendix [A](https://arxiv.org/html/2605.12831#A1) gives the full training procedure.
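The third term of the Phase 2 objective combines a per-kernel group norm with an elementwise $\ell_{1}$ norm. The sketch below computes that penalty for coefficients stored as nested lists; the layout `z[n][k][a]`, the function name, and the numbers are assumptions made for illustration.

```python
import math


def perturbation_penalty(z, alpha):
    """alpha * sum_n ( sum_k ||z[n][k][:]||_2 + ||z[n]||_1 ).

    z[n][k][a] is the coefficient for episode n, kernel k, action a.
    """
    total = 0.0
    for z_n in z:
        group_norm = sum(math.sqrt(sum(c * c for c in row)) for row in z_n)
        l1_norm = sum(abs(c) for row in z_n for c in row)
        total += group_norm + l1_norm
    return alpha * total


# Two episodes, two kernels, two actions (hypothetical coefficients).
z = [
    [[3.0, 4.0], [0.0, 0.0]],   # episode 0 uses only kernel 0
    [[0.0, 0.0], [0.0, 0.0]],   # episode 1 needs no perturbation
]
print(perturbation_penalty(z, alpha=0.5))  # 6.0
```

The group term is not differentiable when a whole kernel row is zero, which is what lets it switch off unused kernels entirely, while the $\ell_{1}$ term shrinks individual coefficients within the kernels that remain active.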

## 5 Experimental Setup

Environments and data. We evaluate our method on five domains: three synthetic continuous navigation tasks, a cancer simulator based on Yauney and Shah ([2018](https://arxiv.org/html/2605.12831#bib.bib14)), and an ICU hypotension-management task using MIMIC-IV (Johnson et al., [2020](https://arxiv.org/html/2605.12831#bib.bib3)). The three navigation tasks share the same continuous 2D geometry but differ in how hidden information affects branching: a *single-decision* task, a *two-decision dependent* task, and a *two-decision independent* task. For the synthetic domains and the cancer simulator, we generate 500 expert trajectories per environment. More details are deferred to Appendices [B](https://arxiv.org/html/2605.12831#A2), [C](https://arxiv.org/html/2605.12831#A3), and [D](https://arxiv.org/html/2605.12831#A4).

Evaluation metrics. We quantify potential missingness by the average size of $z$: $\frac{1}{N}\sum_{n=1}^{N}\|z_{n}\|_{1}$, where $\|z_{n}\|_{1}$ is the elementwise $\ell_{1}$ norm of the trajectory-level perturbation. Since $z_{n}$ enters the reward linearly, this quantity measures the average magnitude of the reward perturbation needed beyond the base-IRL reward. We also report action-matching accuracy and the negative log-likelihood (NLL) of expert actions under the learned policy. For navigation tasks, decision regions are known by construction, so we additionally report accuracy inside and outside these regions; for other domains, we report overall accuracy and NLL.
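The three metrics can be computed as in the following sketch. It assumes the coefficients are stored as an array of shape `(N, K, A)` and that the policy is available as a table of per-step action probabilities; the function names are hypothetical. The NLL is summed over steps here, following the first term of Equation (9), although averaging would be an equally plausible convention.

```python
import numpy as np

def average_z_size(z):
    """Mean over trajectories of the elementwise l1 norm of z[n]; z has shape (N, K, A)."""
    z = np.asarray(z, dtype=float)
    return float(np.abs(z).reshape(z.shape[0], -1).sum(axis=1).mean())

def accuracy_and_nll(action_probs, expert_actions):
    """Action-matching accuracy and summed NLL.

    action_probs: shape (T, A), policy probabilities at each recorded step.
    expert_actions: shape (T,), integer expert action indices.
    """
    probs = np.asarray(action_probs, dtype=float)
    actions = np.asarray(expert_actions, dtype=int)
    accuracy = float((probs.argmax(axis=1) == actions).mean())
    nll = float(-np.log(probs[np.arange(len(actions)), actions]).sum())
    return accuracy, nll
```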

Continuous navigation tasks. The three navigation tasks are controlled settings where missing episode-level information affects behavior only in specific *decision regions*. In all tasks, the learner observes only position $(x,y)$, while the expert also observes hidden context. In the single-decision task, a binary variable $z\in\{0,1\}$ determines whether the agent should go left or right after reaching a central region, so similar observed positions can require different actions. The two-decision tasks add a second branching point. In the dependent version, one hidden variable controls both decisions, yielding two trajectory families: left-left and right-right. In the independent version, two hidden variables $(z^{(1)},z^{(2)})$ control the decisions separately, yielding four families: left-left, left-right, right-left, and right-right. Thus, the independent task contains more missing information despite the same two decision regions. Examples of expert demonstrations are shown in Figure [2](https://arxiv.org/html/2605.12831#S5.F2) for all three tasks.

Cancer simulator. The cancer simulator is a low-grade glioma treatment-planning task with $T=30$ monthly decisions and a binary chemotherapy action ($a=1$ for treatment, $a=0$ otherwise). The full state contains drug concentration, proliferative tissue, quiescent tissue, damaged quiescent tissue, and time step. Expert trajectories, shown in Figure [10](https://arxiv.org/html/2605.12831#A3.F10) of Appendix [C](https://arxiv.org/html/2605.12831#A3), mainly reflect one patient type up to transition noise; time step already explains much of the treatment pattern. To study missingness, we evaluate three masks of the same expert dataset. From most to least missing information, the learner observes: (i) time step only; (ii) time step, quiescent tissue, and damaged quiescent tissue; and (iii) all features except quiescent tissue, the weakest predictor.

ICU hypotension-management task. We evaluate on a real-life hypotension-management task (MIMIC-IV), where each trajectory is one ICU stay with hourly measurements and clinician treatments. States contain 14 clinical variables plus time, and actions are no treatment, vasopressor, IV fluid bolus, or both; preprocessing details are in Appendix [D](https://arxiv.org/html/2605.12831#A4). We use three masks: time only, time plus low-predictive features, and all recorded features. The low-predictive mask tests whether $z$ size remains stable when added variables contain little decision-relevant information.

![Refer to caption](https://arxiv.org/html/2605.12831v1/figures/mains_trajs.png)

![Refer to caption](https://arxiv.org/html/2605.12831v1/figures/Gridworlds_PCA.png)

Figure 2: Continuous navigation tasks. Top: demonstrations; shaded boxes mark decision regions. Bottom: PCA projections of learned perturbations $z_{n}$, which separate trajectories by hidden context.

## 6 Results

Continuous navigation tasks. The continuous navigation experiments test two questions: whether MP-IRL localizes where missing information is needed, and whether the learned perturbation size reflects the amount of missing information. Table [4](https://arxiv.org/html/2605.12831#A5.T4) shows that conventional IRL already matches the expert almost perfectly outside decision regions, with non-decision accuracy near 100% in all three tasks. Its errors are concentrated inside decision regions, where similar observed positions require different actions. In the single-decision task, conventional IRL achieves only 67.9% decision-region accuracy and NLL 183.39, while MP-IRL increases accuracy to 99.6% and reduces NLL to 31.89.

Figure [3](https://arxiv.org/html/2605.12831#S6.F3) illustrates the mechanism in the single-decision task. The base reward explains behavior away from the center but cannot resolve the decision region, where hidden context determines whether the expert goes left or right. The learned perturbations add reward to the context-appropriate action near the decision region and penalize competing actions, so $z_{n}$ locally modifies the reward in the direction needed to explain the expert action. This visualization uses one kernel initialized in the decision region; other experiments use $K$-means initialization.

![Refer to caption](https://arxiv.org/html/2605.12831v1/figures/Grid_1_R.png)

Figure 3: Learned reward structure for the single-decision navigation task. The top row shows the base-IRL reward $\bar{r}_{\theta}$. The lower rows show the average learned perturbation $z$ for right- and left-context trajectories across right, left, and up actions; red indicates positive perturbations and blue negative perturbations. MP-IRL learns localized perturbations near the decision region that explain the expert behavior.

The two-decision task results show the same pattern (Table [4](https://arxiv.org/html/2605.12831#A5.T4)). Conventional IRL performs well outside decision regions but poorly inside them, with decision-region accuracies of 59.6% and 59.4% for the dependent and independent tasks. MP-IRL raises these to 98.4% and 97.1%, respectively, while substantially reducing NLL. The learned perturbations activate near decision regions and favor the expert actions; unused kernels receive near-zero coefficients due to the sparsity penalty.

The perturbation sizes also match the hidden structure. The single-decision and dependent two-decision tasks have similar $z$ sizes, 1.2 and 1.1, because both require one hidden bit. The independent task requires two hidden decisions, yields four trajectory families, and learns a larger perturbation size, 2.7. Figures [7](https://arxiv.org/html/2605.12831#A2.F7) and [9](https://arxiv.org/html/2605.12831#A2.F9) show that perturbations concentrate near decision regions and favor the actions required by each hidden context. The PCA projections in Figure [2](https://arxiv.org/html/2605.12831#S5.F2) further show that $z_{n}$ separates into two groups for the single-decision and dependent tasks, and four groups for the independent task, matching the true hidden contexts.

Cancer simulator. For the cancer simulator, we create missingness by removing features observable to the IRL agent. We evaluate three observation masks, ordered from least to most missing information: all features except quiescent tissue; time step, quiescent tissue, and damaged quiescent tissue; and time step only. Example demonstrations are shown in Figure [10](https://arxiv.org/html/2605.12831#A3.F10), where time alone is already a strong feature for explaining most of the expert's actions. The more features are hidden, the more conventional IRL degrades: accuracy decreases from 93.73% to 91.72% and then 87.87%, while NLL increases from 1291.39 to 1687.03 and then 1988.29 (see Table [4](https://arxiv.org/html/2605.12831#A5.T4)). MP-IRL recovers near-perfect accuracy in all settings, but the required perturbation size increases with missingness, from 0.54 to 0.91 and then 1.65, consistent with interpreting $\|z\|_{1}$ as a measure of potential missingness.

We use $K=10$ kernels initialized with $K$-means in all cancer experiments. Figure [4](https://arxiv.org/html/2605.12831#S6.F4) shows the averaged accuracy over time (top row) and the learned kernel centers projected onto the time feature, along with the average perturbation magnitude $\|z_{\cdot,k,:}\|_{1}$ for each kernel (bottom row); gray horizontal bars indicate the kernel bandwidth. MP-IRL assigns larger perturbations to time points where conventional IRL has low accuracy. For example, in the time-only setting, conventional IRL fails around time steps 3, 9, and 22, and MP-IRL places high-magnitude kernels near those regions. In the all-but-quiescent setting, the remaining ambiguity is mostly concentrated near time step 3, where the learned kernels also concentrate. Thus, MP-IRL not only improves action matching, but also localizes where recorded observations are insufficient to explain expert treatment decisions, thereby learning meaningful minimal perturbations.

Table 1: Compact main results. Accuracy is decision-region accuracy for navigation and overall accuracy otherwise. Full results, including non-decision-region accuracy, are in Appendix [E](https://arxiv.org/html/2605.12831#A5), Table [4](https://arxiv.org/html/2605.12831#A5.T4).

![Refer to caption](https://arxiv.org/html/2605.12831v1/figures/cancer_main.png)

Figure 4: Cancer simulator results. Top: Accuracy over time. Bottom: Learned kernel centers, with marker size proportional to average perturbation magnitude $\|z_{\cdot,k,:}\|_{1}$; gray bars show bandwidth.

ICU hypotension-management task. We run three experiments on the MIMIC-IV hypotension task using different observation masks: time step only, time step plus low-predictive features, and all recorded features. Because this is a real clinical dataset, even the full recorded state may omit information available to clinicians at decision time, such as bedside assessments or patient-specific context. We use $K=30$ kernels in all settings.

Table [4](https://arxiv.org/html/2605.12831#A5.T4) shows that the time-only and time-plus-low-predictive-feature settings have similar accuracy, NLL, and learned $z$ size. This shows that MP-IRL is robust to unimportant features: the perturbation size reflects missing decision-relevant information, not simply the number of observed features. When all recorded features are included, conventional IRL achieves higher accuracy and lower NLL, and MP-IRL requires a smaller perturbation, again supporting the interpretation of $z$ size as a measure of potential missingness.

Across all three masks, MP-IRL substantially improves over conventional IRL, although action-matching accuracy remains around 90%. This is expected in a noisy and heterogeneous clinical setting where decisions may depend on information absent from MIMIC-IV. Still, this accuracy is higher than prior model-based RL results on the same task, which reported action-matching accuracies around 50–60% for the strongest methods (Benac et al., [2024](https://arxiv.org/html/2605.12831#bib.bib2)). Overall, MP-IRL remains useful in this real-world setting: it improves action matching, is robust to low-predictive features, and learns smaller perturbations when more decision-relevant information is observed.

## 7 Conclusion

We introduced MP\-IRL, a method for quantifying potential observation missingness in inverse reinforcement learning\. The method asks how much trajectory\-level information must be added to the recorded observations for expert actions to appear optimal under a reward model\. Across continuous navigation tasks, MP\-IRL localizes perturbations to decision regions and learns perturbation sizes that match the true amount of hidden context\. In the cancer simulator and MIMIC\-IV ICU hypotension task, the learned perturbation size decreases as more decision\-relevant information is observed and remains stable when low\-predictive features are added\. These results support the use\-inspired motivation of our work: in healthcare datasets, apparent expert suboptimality may reflect missing clinical context rather than poor decisions\.

Limitations. A limitation of MP-IRL is that the kernel-based perturbation representation may become difficult to learn effectively in high-dimensional state spaces, where placing and optimizing localized kernels requires substantially more data and careful regularization. In such settings, future work may need more scalable perturbation parameterizations while preserving the interpretability of the learned missingness measure.

## References

- Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p1.1).
- J. G. Altonji, T. E. Elder, and C. R. Taber (2005) Selection on observed and unobserved variables: assessing the effectiveness of catholic schools. Journal of political economy 113(1), pp. 151–184. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p5.2).
- M. Babes, V. Marivate, K. Subramanian, and M. L. Littman (2011) Apprenticeship learning about multiple intentions. In International Conference on Machine Learning. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p4.1).
- L. Benac, A. Sharma, S. Parbhoo, and F. Doshi-Velez (2024) Inverse transition learning: learning dynamics from demonstrations. arXiv preprint arXiv:2411.05174. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p2.1), [§6](https://arxiv.org/html/2605.12831#S6.p9.3).
- D. Brown, R. Coleman, R. Srinivasan, and S. Niekum (2020) Safe imitation learning via fast Bayesian reward inference from preferences. In International Conference on Machine Learning. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p1.1).
- J. Choi and K. Kim (2012) Nonparametric Bayesian inverse reinforcement learning for multiple reward functions. In Conference on Neural Information Processing Systems. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p4.1).
- C. Cinelli and C. Hazlett (2020) Making sense of sensitivity: extending omitted variable bias. Journal of the Royal Statistical Society Series B: Statistical Methodology 82(1), pp. 39–67. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p5.2).
- J. Ho and S. Ermon (2016) Generative adversarial imitation learning. Conference on Neural Information Processing Systems. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p1.1).
- F. Hsiao, J. Kuo, and M. Sun (2019) Learning a multi-modal policy via imitating demonstrations with mixed behaviors. In Conference on Neural Information Processing Systems. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p4.1).
- Z. Huang, J. Wu, and C. Lv (2021) Driving behavior modeling using naturalistic human driving data with inverse reinforcement learning. IEEE Transactions on Intelligent Transportation Systems 23(8), pp. 10239–10251. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p4.1).
- A. Hüyük, D. Jarrett, C. Tekin, and M. van der Schaar (2021) Explaining by imitating: understanding decisions by interpretable policy learning. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p2.1).
- A. Hüyük, D. Jarrett, and M. van der Schaar (2022) Inverse contextual bandits: learning how behavior evolves over time. In International Conference on Machine Learning. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p2.1).
- D. Jarrett, A. Hüyük, and M. Van Der Schaar (2021) Inverse decision modeling: learning interpretable representations of behavior. In International Conference on Machine Learning. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p2.1).
- A. Johnson, L. Bulgarelli, T. Pollard, S. Horng, L. A. Celi, and R. Mark (2020) MIMIC-IV. PhysioNet. Available online at: https://physionet.org/content/mimiciv/1.0/ (accessed August 23, 2021). Cited by: [§5](https://arxiv.org/html/2605.12831#S5.p1.1).
- N. Kallus and A. Zhou (2020) Confounding-robust policy evaluation in infinite-horizon reinforcement learning. In Conference on Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p5.1).
- D. Kumor, J. Zhang, and E. Bareinboim (2021) Sequential causal imitation learning with unobserved confounders. In Conference on Neural Information Processing Systems. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p2.1).
- D. Lee, S. Srinivasan, and F. Doshi-Velez (2019) Truly batch apprenticeship learning with deep successor features. arXiv preprint arXiv:1903.10077. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p4.1).
- A. Likmeta, A. M. Metelli, G. Ramponi, A. Tirinzoni, M. Giuliani, and M. Restelli (2021) Dealing with multiple experts and non-stationarity in inverse reinforcement learning: an application to real-life problems. Machine Learning 110, pp. 2541–2576. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p2.1).
- V. Myers, E. Biyik, N. Anari, and D. Sadigh (2022) Learning multimodal rewards from rankings. In Conference on Robot Learning. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p4.1).
- A. Y. Ng, S. Russell, et al. (2000) Algorithms for inverse reinforcement learning. In International Conference on Machine Learning. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p1.1).
- E. Oster (2019) Unobservable selection and coefficient stability: theory and evidence. Journal of Business & Economic Statistics 37(2), pp. 187–204. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p5.2).
- R. Poiani, G. Curti, A. M. Metelli, and M. Restelli (2024) Inverse reinforcement learning with sub-optimal experts. arXiv preprint arXiv:2401.03857. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p2.1).
- G. Qiao, G. Liu, P. Poupart, and Z. Xu (2024) Multi-modal inverse constrained reinforcement learning from a mixture of demonstrations. In Conference on Neural Information Processing Systems. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p4.1).
- G. Ramponi, A. Likmeta, A. M. Metelli, A. Tirinzoni, and M. Restelli (2020) Truly batch model-free inverse reinforcement learning about multiple intentions. In International Conference on Artificial Intelligence and Statistics. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p4.1).
- S. Reddy, A. Dragan, and S. Levine (2018) Where do you think you’re going?: inferring beliefs about dynamics from behavior. In Conference on Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p2.1).
- K. Ruan, J. Zhang, X. Di, and E. Bareinboim (2023) Causal imitation learning via inverse reinforcement learning. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p1.1), [§2](https://arxiv.org/html/2605.12831#S2.p2.1).
- S. Srinivasan and F. Doshi-Velez (2020) Interpretable batch IRL to extract clinician goals in ICU hypotension management. AMIA Summits on Translational Science Proceedings 2020, pp. 636. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p4.1).
- L. Wang, Z. Yang, and Z. Wang (2021) Provably efficient causal reinforcement learning with confounded observational data. In Conference on Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p5.1).
- Z. Wang, J. S. Merel, S. E. Reed, N. de Freitas, G. Wayne, and N. Heess (2017) Robust imitation of diverse behaviors. In Conference on Neural Information Processing Systems. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p4.1).
- G. Yauney and P. Shah (2018) Reinforcement learning with action-derived rewards for chemotherapy and clinical trial dosing regimen selection. In Machine Learning for Healthcare Conference, pp. 161–226. Cited by: [§5](https://arxiv.org/html/2605.12831#S5.p1.1).
- C. Yu, J. Liu, and H. Zhao (2019) Inverse reinforcement learning for intelligent mechanical ventilation and sedative dosing in intensive care units. BMC medical informatics and decision making 19(Suppl 2), pp. 57. Cited by: [§1](https://arxiv.org/html/2605.12831#S1.p4.1).
- J. Zhang, D. Kumor, and E. Bareinboim (2020) Causal imitation learning with unobserved confounders. In Conference on Neural Information Processing Systems. Cited by: [§2](https://arxiv.org/html/2605.12831#S2.p2.1).
- B. D. Ziebart, J. A. Bagnell, and A. K. Dey (2010) Modeling interaction via the principle of maximum causal entropy. Cited by: [§4](https://arxiv.org/html/2605.12831#S4.p1.1).
- B. D. Ziebart, A. L. Maas, J. A. Bagnell, A. K. Dey, et al. (2008) Maximum entropy inverse reinforcement learning. In AAAI. Cited by: [§3](https://arxiv.org/html/2605.12831#S3.p3.5).

## Appendix A Training Procedure

### A.1 Algorithm

Minimum-Perturbation IRL algorithm. Algorithm [1](https://arxiv.org/html/2605.12831#alg1) summarizes our training procedure. The method has a simple structure: it first runs conventional IRL on the recorded observations alone, and then asks how much additional episode-specific perturbation is needed to explain the remaining mismatch with expert behavior.

In Phase 1 (Base IRL), we set all perturbations to zero and optimize the shared reward model and critic by minimizing Equation ([8](https://arxiv.org/html/2605.12831#S4.E8)). Since $z_{n}=0$ for all episodes, the reward reduces to the shared component $\bar{r}_{\theta}$. The resulting parameters $(\phi^{(1)},\theta^{(1)})$ therefore capture the part of the behavior that can already be explained from the recorded observations alone.

In Phase 2 (Minimum-Perturbation IRL), we freeze the shared reward at $\theta^{(1)}$ and optimize the critic together with the episode-specific coefficients $\{z_{n}\}_{n=1}^{N}$ and, when desired, the kernel centers $\mu$. This is the key step of the method. Because the shared reward is fixed, the model can no longer improve action matching by changing the common reward structure across all trajectories. Instead, any additional improvement must come from nonzero episode-specific perturbations. In this sense, Phase 2 directly operationalizes the question posed in Section [3](https://arxiv.org/html/2605.12831#S3): what is the smallest perturbation needed for the observed actions to appear optimal?

The minimum-perturbation aspect is enforced by the penalty term in Equation ([9](https://arxiv.org/html/2605.12831#S4.E9)), which encourages only a small number of kernels to become active and keeps their coefficients small. As discussed above, this is meaningful because $z_{n}$ enters the reward only as a linear perturbation, and the penalty is applied once per trajectory rather than once per time step. The critic anchoring term is also important: it keeps the phase-2 critic close to the phase-1 critic when $z=0$, preventing the critic from explaining the demonstrations better on its own without using the perturbation variables. As a result, the gains achieved in Phase 2 are attributable to the learned perturbations rather than to additional flexibility in the critic.

Finally, the kernel centers determine *where* in the observed state space perturbations can be expressed. If prior knowledge suggests where decision regions are likely to occur, the centers can be initialized there. Otherwise, we initialize them with $K$-means on the recorded observations so that they begin on the support of the data. We intentionally choose $K$ somewhat large; if some centers are not useful, the sparsity penalty drives their corresponding coefficients in $z_{\cdot,k,\cdot}$ close to zero.
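To make the role of the kernel centers concrete, the sketch below shows one way a linear, kernel-localized perturbation could enter the reward. The kernel itself is defined in the main text and is not restated in this appendix, so the Gaussian (RBF) form with bandwidth $\sigma$ used here is an assumption, as are the function names and the array shapes.

```python
import numpy as np

def kernel_features(x, centers, sigma):
    """Assumed Gaussian kernel: kappa_k(x) = exp(-||x - mu_k||^2 / (2 sigma^2))."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    sq_dist = ((x[None, :] - centers) ** 2).sum(axis=1)
    return np.exp(-sq_dist / (2.0 * sigma**2))

def perturbed_reward(base_reward, x, z_n, centers, sigma):
    """Base reward plus a perturbation that is linear in z_n.

    base_reward: shape (A,), base-IRL reward for each action at observation x.
    z_n: shape (K, A), coefficients of one trajectory.
    """
    return np.asarray(base_reward, dtype=float) + kernel_features(x, centers, sigma) @ np.asarray(z_n, dtype=float)
```

Under this assumed form, a coefficient only changes the reward near its kernel center, so a kernel whose coefficients are driven to zero leaves the base reward untouched everywhere.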

Algorithm 1: Minimum-Perturbation IRL (MP-IRL)

Require: Offline demonstrations $\mathbb{D}=\{(x_{nt},a_{nt},x_{n,t+1})\}_{n=1,t=1}^{N,T_{n}}$, kernel bandwidth $\sigma$, number of kernels $K$, tradeoff weights $\lambda,\alpha,\beta$, learning rates $\eta_{\theta},\eta_{\phi},\eta_{z},\eta_{\mu}$.

Require: Initial kernel centers $\mu=\{\mu_{k}\}_{k=1}^{K}$ (e.g., from prior knowledge or $K$-means on recorded observations).

Return: Base reward parameters $\theta^{(1)}$, final critic parameters $\phi$, kernel centers $\mu$, episode-specific coefficients $\{z_{n}\}_{n=1}^{N}$.

1. Initialize base reward network $\bar{r}_{\theta}$, critic $Q_{\phi}$, and target critic $Q_{\bar{\phi}}$.
2. Initialize $z_{n}\leftarrow 0\in\mathbb{R}^{K\times|\mathcal{A}|}$ for all $n\in[N]$.
3. Phase 1: Base IRL. For each phase-1 iteration:
   - Sample a mini-batch of trajectories from $\mathbb{D}$.
   - Compute $\mathcal{L}_{\mathrm{base}}(\phi,\theta)$ using Equation ([8](https://arxiv.org/html/2605.12831#S4.E8)).
   - Update $\theta\leftarrow\theta-\eta_{\theta}\nabla_{\theta}\mathcal{L}_{\mathrm{base}}$.
   - Update $\phi\leftarrow\phi-\eta_{\phi}\nabla_{\phi}\mathcal{L}_{\mathrm{base}}$.
   - Update target critic parameters $\bar{\phi}$.
4. Store $(\phi^{(1)},\theta^{(1)})\leftarrow(\phi,\theta)$.
5. Phase 2: Minimum-Perturbation IRL. Freeze reward parameters at $\theta^{(1)}$ and initialize $\phi\leftarrow\phi^{(1)}$.
6. For each phase-2 iteration:
   - Sample a mini-batch of trajectories from $\mathbb{D}$.
   - Compute $\mathcal{L}_{\mathrm{MP}}(\phi,\bm{z},\mu)$ using Equation ([9](https://arxiv.org/html/2605.12831#S4.E9)).
   - Update $\phi\leftarrow\phi-\eta_{\phi}\nabla_{\phi}\mathcal{L}_{\mathrm{MP}}$.
   - Update $z_{n}\leftarrow z_{n}-\eta_{z}\nabla_{z_{n}}\mathcal{L}_{\mathrm{MP}}$ for trajectories $n$ in the mini-batch.
   - If kernel centers are learnable, update $\mu\leftarrow\mu-\eta_{\mu}\nabla_{\mu}\mathcal{L}_{\mathrm{MP}}$.
   - Update target critic parameters $\bar{\phi}$.
7. Return $\theta^{(1)},\phi,\mu,\{z_{n}\}_{n=1}^{N}$.
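The effect of the Phase 2 update on $z_{n}$ can be illustrated on a deliberately reduced toy problem. The sketch below is not the authors' procedure: it drops the critic, the Bellman term, and the anchor term, treats a fixed vector of base logits as the frozen Phase 1 model, keeps only the $\ell_{1}$ part of the penalty, and replaces the plain gradient step of Algorithm 1 with a proximal (soft-thresholding) step so that zeros are exact. All names and hyperparameter values are hypothetical.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_threshold(v, tau):
    """Proximal operator of tau * ||v||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fit_perturbations(base_logits, expert_actions, alpha=0.2, lr=0.5, iters=500):
    """Minimize sum_n [ -log softmax(base + z_n)[a_n] + alpha * ||z_n||_1 ] over z."""
    n, a = len(expert_actions), len(base_logits)
    z = np.zeros((n, a))
    onehot = np.eye(a)[expert_actions]
    for _ in range(iters):
        grad = softmax(base_logits + z) - onehot      # gradient of the NLL w.r.t. z
        z = soft_threshold(z - lr * grad, lr * alpha)  # proximal gradient step
    return z

# The base model favors action 0. Trajectory 0 agrees with it; trajectory 1 does not.
z = fit_perturbations(np.array([2.0, 0.0]), np.array([0, 1]))
```

In this toy setting the trajectory already explained by the base model keeps a zero perturbation, while the unexplained one receives a nonzero perturbation toward the expert action, which mirrors the interpretation of $\|z_{n}\|_{1}$ as unexplained, potentially missing information.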

## Appendix B Continuous Navigation Tasks

We use three synthetic continuous navigation tasks. In all three tasks, the recorded observation is the two-dimensional position

$$x_{t}=(p_{t}^{x},p_{t}^{y})\in[0,1]^{2},$$

and the action space is

$$\mathcal{A}=\{\textsc{Right},\textsc{Left},\textsc{Up},\textsc{Down}\}.$$

Each episode starts from a position sampled uniformly from the bottom-center rectangle

$$[0.47,0.53]\times[0.00,0.05].$$

We use the term *decision region* for a subset of the observed state space where trajectories that look similar from the recorded observation split because the optimal action depends on an unobserved episode-level context.

### B.1 Single-Decision Task

The single-decision task contains one binary latent variable $z\in\{0,1\}$, sampled once at the beginning of the episode and fixed for the whole trajectory. The agent moves in continuous space according to

$$p_{t+1}=\mathrm{clip}\!\left(p_{t}+0.1\,d(a_{t})+\varepsilon_{t},\ [0,1]^{2}\right),$$

where $d(a_{t})$ is the unit displacement associated with action $a_{t}$, $\mathrm{clip}(\cdot,[0,1]^{2})$ clips each coordinate to $[0,1]$, and the transition noise is

$$\varepsilon_{t}\sim\mathcal{U}([-0.02,0.02]^{2}).$$

The maximum episode length is 30 steps.
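The transition model above can be sketched as a single step function. The mapping from action names to unit displacements (for example, `Up` as $+y$) is an assumption consistent with experts starting at the bottom and moving upward toward the center; the function and constant names are hypothetical.

```python
import numpy as np

# Assumed unit displacements d(a); the sign conventions are not stated explicitly.
DISPLACEMENT = {
    "Right": (1.0, 0.0),
    "Left": (-1.0, 0.0),
    "Up": (0.0, 1.0),
    "Down": (0.0, -1.0),
}

def step(p, action, rng, step_size=0.1, noise=0.02):
    """p_{t+1} = clip(p_t + 0.1 d(a_t) + eps_t, [0, 1]^2), eps_t ~ U([-noise, noise]^2)."""
    eps = rng.uniform(-noise, noise, size=2)
    nxt = np.asarray(p, dtype=float) + step_size * np.array(DISPLACEMENT[action]) + eps
    return np.clip(nxt, 0.0, 1.0)

rng = np.random.default_rng(0)
start = np.array([rng.uniform(0.47, 0.53), rng.uniform(0.00, 0.05)])  # bottom-center start
position = step(start, "Up", rng)
```

Passing `noise=0.01` gives the smaller transition noise used in the two-decision tasks.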

Each step incurs a reward of $-0.1$. Let

$$C=[0.4,0.6]\times[0.4,0.6]$$

denote the center square. The first time the agent enters $C$, it receives an intermediate reward of $+20$. After this intermediate reward has been collected, the episode terminates with reward $+20$ when the agent reaches the left boundary $p_{t}^{x}\leq 0.02$ if $z=0$, or the right boundary $p_{t}^{x}\geq 0.98$ if $z=1$.

The unique decision region is the center square $C$. Expert trajectories first move upward toward $C$. Once inside $C$, the hidden context determines whether the agent should move left or right. Therefore, two trajectories can reach nearly the same observed position in the center of the domain but require different optimal actions because they correspond to different values of $z$.

![Refer to caption](https://arxiv.org/html/2605.12831v1/figures/toy_grid_1.png)

Figure 5: Expert demonstrations for the single-decision task. The center square is the only decision region: trajectories first move upward into this region and then branch left or right depending on the hidden episode-level context.

### B.2 Two-Decision Task (Independent)

The independent two\-decision task extends the previous environment by introducing two separate branching points\. The hidden context is now

z=\(z\(1\),z\(2\)\)∈\{0,1\}2,z=\(z^\{\(1\)\},z^\{\(2\)\}\)\\in\\\{0,1\\\}^\{2\},wherez\(1\)z^\{\(1\)\}controls the first decision andz\(2\)z^\{\(2\)\}controls the second one\. These two latent variables are sampled independently at the start of the episode and remain fixed throughout the trajectory\.

The dynamics are

pt\+1=clip​\(pt\+0\.1​d​\(at\)\+εt,\[0,1\]2\),p\_\{t\+1\}=\\mathrm\{clip\}\\\!\\left\(p\_\{t\}\+0\.1\\,d\(a\_\{t\}\)\+\\varepsilon\_\{t\},\\ \[0,1\]^\{2\}\\right\),with smaller transition noise

εt∼𝒰​\(\[−0\.01,0\.01\]2\),\\varepsilon\_\{t\}\\sim\\mathcal\{U\}\(\[\-0\.01,0\.01\]^\{2\}\),and the maximum episode length is5050steps\. Each step incurs reward−1\-1\.

As before, the first intermediate reward is obtained when the agent first enters the center square

C=\[0\.4,0\.6\]×\[0\.4,0\.6\],C=\[0\.4,0\.6\]\\times\[0\.4,0\.6\],which yields reward\+20\+20\. The first left\-right decision happens in this region\. From there, the agent moves toward one of two side portals:

BL=\[0\.0,0\.2\]×\[0\.4,0\.6\],BR=\[0\.8,1\.0\]×\[0\.4,0\.6\]\.B\_\{L\}=\[0\.0,0\.2\]\\times\[0\.4,0\.6\],\\qquad B\_\{R\}=\[0\.8,1\.0\]\\times\[0\.4,0\.6\]\.If the agent entersBLB\_\{L\}andz\(1\)=0z^\{\(1\)\}=0, or entersBRB\_\{R\}andz\(1\)=1z^\{\(1\)\}=1, it receives another reward of\+20\+20\. Upon reaching either side portal, the agent is immediately teleported to the top\-center landing region

T=\[0\.45,0\.55\]×\[0\.90,0\.98\]\.T=\[0\.45,0\.55\]\\times\[0\.90,0\.98\]\.
The second decision is made after teleportation\. From the landing regionTT, the episode terminates with reward\+100\+100at the left boundary ifz\(2\)=0z^\{\(2\)\}=0, and at the right boundary ifz\(2\)=1z^\{\(2\)\}=1\.

This task therefore has two decision regions: the center squareCCand the teleport landing regionTT\. The key property of this environment is that the two decisions are*independent*\. The first latent variable determines whether the agent takes the left or right side portal, while the second latent variable independently determines whether the final destination is the left or right boundary\. As a result, all four trajectory types can occur:

$$(\text{left},\text{left}),\quad(\text{left},\text{right}),\quad(\text{right},\text{left}),\quad(\text{right},\text{right}).$$

![Refer to caption](https://arxiv.org/html/2605.12831v1/figures/Grid_2_indep.png)

Figure 6: Expert demonstrations for the independent two-decision task. The first decision region is the center square, and the second one is the top-center landing region reached after teleportation. Because the two hidden decisions are independent, all four trajectory families can appear.

#### B.2.1 Learned Rewards

![Refer to caption](https://arxiv.org/html/2605.12831v1/figures/GRID_2_IND_R.png)

Figure 7: Learned reward structure for the two-decision independent navigation task. The top row shows the base-IRL reward $\bar{r}_{\theta}$. The lower rows show the average learned perturbation $z$ for left-left, left-right, right-left, and right-right trajectories across the right, left, and up actions; red indicates positive perturbations and blue negative perturbations. MP-IRL learns distinct perturbation patterns for the four hidden decision contexts.

### B.3 Two-Decision Task (Dependent)

The dependent two\-decision task uses the same geometry, rewards, teleportation mechanism, and transition model as the independent two\-decision task\. In particular, the agent again moves according to

$$p_{t+1}=\mathrm{clip}\!\left(p_{t}+0.1\,d(a_{t})+\varepsilon_{t},\ [0,1]^{2}\right),\qquad\varepsilon_{t}\sim\mathcal{U}([-0.01,0.01]^{2}),$$

with horizon $50$, step reward $-1$, center reward $+20$, side-portal reward $+20$, and terminal reward $+100$.

The important difference is that the whole trajectory is now governed by a *single* binary latent variable

$$z\in\{0,1\},$$

sampled once per episode. The first decision region is still the center square $C=[0.4,0.6]\times[0.4,0.6]$, and the second decision region is still the teleport landing region $T=[0.45,0.55]\times[0.90,0.98]$. However, the two decisions are no longer independent. Instead, they are tied together by the same hidden context. If $z=0$, the agent should take the left side portal and then finish on the left boundary. If $z=1$, it should take the right side portal and then finish on the right boundary.

Thus, this environment again contains two decision regions, but only two trajectory families are possible:

$$(\text{left},\text{left})\qquad\text{and}\qquad(\text{right},\text{right}).$$

This makes the dependent task more structured than the independent one: the first decision already predicts the second.
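To make the contrast between the two tasks concrete, the following sketch enumerates the trajectory families implied by each latent structure. It assumes only the convention stated in the task descriptions (latent value 0 maps to left, 1 to right); the function names are hypothetical.

```python
from itertools import product

# Convention from the task descriptions: latent value 0 -> left, 1 -> right.
SIDES = ("left", "right")


def independent_families():
    """Families when z = (z1, z2) is drawn from {0, 1}^2 with independent components."""
    return [(SIDES[z1], SIDES[z2]) for z1, z2 in product((0, 1), repeat=2)]


def dependent_families():
    """Families when one binary z controls both decisions."""
    return [(SIDES[z], SIDES[z]) for z in (0, 1)]
```

In the dependent task, observing the first choice removes all uncertainty about the second, whereas in the independent task it carries no information about it.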

![Refer to caption](https://arxiv.org/html/2605.12831v1/figures/grid_2_dep.png)

Figure 8: Expert demonstrations for the dependent two-decision task. The two decision regions are the same as in the independent task, but both choices are controlled by the same hidden context, so only the left-left and right-right trajectory families occur.

#### B.3.1 Learned Rewards

![Refer to caption](https://arxiv.org/html/2605.12831v1/figures/GRID_2_DEP_R.png)

Figure 9: Learned reward structure for the two-decision dependent navigation task. The top row shows the base-IRL reward $\bar{r}_{\theta}$. The lower rows show the average learned perturbation $z$ for left-left and right-right trajectories across the right, left, and up actions; red indicates positive perturbations and blue negative perturbations. MP-IRL learns localized perturbations near the two decision regions, with both decisions controlled by the same hidden context.

## Appendix C Cancer Simulator

We also evaluate our method on a cancer treatment simulator based on a low\-grade glioma growth model\. The environment is a finite\-horizon Markov decision process in which each time step corresponds to one month, and the action specifies whether chemotherapy is administered during that month\.

The latent physiological state consists of four continuous variables:

$$s_{t}=(C_{t},P_{t},Q_{t},Q^{p}_{t}),$$

where $C_{t}$ is the drug concentration, $P_{t}$ is the proliferative tissue diameter, $Q_{t}$ is the quiescent tissue diameter, and $Q^{p}_{t}$ is the damaged quiescent tissue diameter. In some experiments, the simulator also appends the current month $t$ to the state, yielding a 5-dimensional state. The action space is binary,

$$a_{t}\in\{0,1\},$$

where $a_{t}=1$ denotes treatment and $a_{t}=0$ denotes no treatment. Episodes last for at most $30$ months.

Let

$$P_{t}^{\star}=P_{t}+Q_{t}+Q^{p}_{t}$$

denote the total tumor diameter. The simulator evolves according to a discrete-time update of the underlying tumor-growth model. If treatment is given, the drug concentration is first increased by one unit. The state is then updated as

$$
\begin{aligned}
C_{t+1} &= C_{t}+a_{t}-K_{DE}C_{t}, && (10)\\
P_{t+1} &= P_{t}+\lambda_{P}P_{t}\left(1-\frac{P_{t}^{\star}}{K}\right)+K_{Q_{p}P}Q^{p}_{t}-K_{PQ}P_{t}-\gamma K_{DE}C_{t+1}P_{t}, && (11)\\
Q_{t+1} &= Q_{t}+K_{PQ}P_{t}-\gamma K_{DE}C_{t+1}Q_{t}, && (12)\\
Q^{p}_{t+1} &= Q^{p}_{t}+\gamma K_{DE}C_{t+1}Q_{t}-K_{Q_{p}P}Q^{p}_{t}-\delta_{Q_{p}}Q^{p}_{t}. && (13)
\end{aligned}
$$

Here $K_{DE}$ controls drug decay, $\lambda_{P}$ is the proliferative growth rate, $K_{PQ}$ is the transition rate from proliferative to quiescent tissue, $K_{Q_{p}P}$ is the rate at which damaged quiescent tissue returns to the proliferative compartment, $\gamma$ controls treatment efficacy, $\delta_{Q_{p}}$ is the elimination rate of damaged quiescent tissue, and $K$ is the carrying-capacity parameter. Optionally, multiplicative Gaussian transition noise can be added to the four physiological state variables after the deterministic update.
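The deterministic update in Eqs. (10)-(13) can be written as a single step function. The sketch below is illustrative only: the numeric defaults in `TumorParams` are placeholders chosen for readability, not the simulator's constants (which this appendix does not list), and the optional multiplicative noise is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TumorParams:
    # PLACEHOLDER values for illustration; not the constants used in the paper.
    k_de: float = 0.1       # K_DE: drug decay
    lambda_p: float = 0.1   # lambda_P: proliferative growth rate
    k_pq: float = 0.05      # K_PQ: proliferative -> quiescent rate
    k_qpp: float = 0.01     # K_{Q_p P}: damaged quiescent -> proliferative rate
    gamma: float = 1.0      # gamma: treatment efficacy
    delta_qp: float = 0.5   # delta_{Q_p}: elimination rate of damaged tissue
    k: float = 100.0        # K: carrying capacity


def step_tumor(state, action, prm):
    """One monthly update of (C, P, Q, Qp) following Eqs. (10)-(13)."""
    c, p, q, qp = state
    p_star = p + q + qp                      # total tumor diameter P*_t
    c_next = c + action - prm.k_de * c       # Eq. (10)
    kill = prm.gamma * prm.k_de * c_next     # drug-effect factor shared by (11)-(13)
    p_next = (p + prm.lambda_p * p * (1.0 - p_star / prm.k)
              + prm.k_qpp * qp - prm.k_pq * p - kill * p)   # Eq. (11)
    q_next = q + prm.k_pq * p - kill * q                    # Eq. (12)
    qp_next = qp + kill * q - prm.k_qpp * qp - prm.delta_qp * qp   # Eq. (13)
    return (c_next, p_next, q_next, qp_next)
```

Note that the drug-effect terms use the updated concentration $C_{t+1}$, so a dose given in month $t$ already acts on the tissue compartments in the same update.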

The reward encourages tumor shrinkage while discouraging excessive treatment\. At each month, the reward is

$$r_{t}=\left(P_{t}^{\star}-P_{t+1}^{\star}\right)-\eta C_{t+1},$$

where $\eta$ is a dose-penalty coefficient. Thus, the agent is rewarded for reducing total tumor burden and penalized for large treatment intensity. At the final month, the simulator adds an additional terminal term proportional to the reduction from the initial tumor size:

$$r_{T}\leftarrow r_{T}+\beta\left(P_{0}^{\star}-P_{T}^{\star}\right),$$

where $\beta$ is a fixed terminal-reward weight.
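As a small worked illustration of the two reward terms, the sketch below computes the monthly reward and the terminal bonus from state tuples ordered as $(C, P, Q, Q^{p})$. The function names are hypothetical, and $\eta$ and $\beta$ are passed in as arguments because their values are not given here.

```python
def total_diameter(state):
    """P* = P + Q + Qp for a state ordered as (C, P, Q, Qp)."""
    _, p, q, qp = state
    return p + q + qp


def monthly_reward(state, next_state, eta):
    """r_t = (P*_t - P*_{t+1}) - eta * C_{t+1}."""
    shrinkage = total_diameter(state) - total_diameter(next_state)
    return shrinkage - eta * next_state[0]


def terminal_bonus(initial_state, final_state, beta):
    """Extra terminal term beta * (P*_0 - P*_T), added to the last reward."""
    return beta * (total_diameter(initial_state) - total_diameter(final_state))
```

For example, if the total diameter falls from 50.0 to 49.5 while the next drug concentration is 1.0 and $\eta = 0.5$, the shrinkage of 0.5 is exactly cancelled by the dose penalty.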

Unlike the continuous navigation tasks, the cancer simulator does not contain a localized spatial branching point\. A treatment decision is made at every nonterminal step, so every valid state can be viewed as lying in the decision region\. In experiments with missing observations, we simulate partial observability by masking selected components of the state before they are given to the learning algorithm, while the simulator itself continues to evolve according to the full latent state\.
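One simple way to realize this kind of controlled missingness is to drop the hidden components from each recorded state while leaving the simulator state untouched. The sketch below assumes masking by removal; the text does not say whether masked components are removed, zeroed, or otherwise replaced, so this choice and the names used are assumptions.

```python
# Component names for a state ordered as (C, P, Q, Qp).
STATE_NAMES = ("C", "P", "Q", "Qp")


def mask_state(state, hidden):
    """Return the recorded observation with the hidden components removed.

    ASSUMPTION: masking is done by dropping components; the full state is
    still what the simulator uses to compute the next step.
    """
    return tuple(v for name, v in zip(STATE_NAMES, state) if name not in hidden)


def mask_trajectory(states, hidden):
    """Apply the same mask to every state in a trajectory."""
    return [mask_state(s, hidden) for s in states]
```

Because the expert acts on the full state while the learner sees only the masked one, identical recorded observations can be paired with different expert actions.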

![Refer to caption](https://arxiv.org/html/2605.12831v1/figures/cancer_data.png)

Figure 10: Expert demonstrations in the cancer simulator. Each point is a monthly state from an expert trajectory, colored by the empirical probability of treatment at that time step. The strong time-dependent pattern shows that treatment timing explains much of the behavior, while the physiological variables provide additional context that we selectively mask to create controlled missingness.

## Appendix D Real-life ICU Environment

Following the experiments in synthetic environments, we now evaluate our method in a real-world scenario. To this end, we selected the Medical Information Mart for Intensive Care IV (MIMIC-IV) dataset as our experimental setting. This dataset offers a rich, diverse, and challenging testbed for our method, especially given its potential to contribute to advances in healthcare analytics and patient care strategies.

### D.1 About the MIMIC-IV Dataset

The MIMIC-IV dataset, developed by the MIT Lab for Computational Physiology and publicly available, aggregates a wide range of anonymized health data from critical care units at Beth Israel Deaconess Medical Center in Boston. Covering over a decade of patient admissions, it provides detailed records on demographics, vital signs, lab tests, medications, and more, making it a key resource for healthcare model development. Its scope spans many aspects of patient care, enabling models that predict diverse patient outcomes. The dataset's richness lies in its variety, covering over 40,000 patients of different ages, ethnicities, and conditions, and in its granularity, offering high-resolution, time-stamped records that are essential for developing precise, dynamic healthcare models. Moreover, because MIMIC-IV is publicly accessible, it supports collaboration across a global research community.

Using the MIMIC-IV dataset, we showcase the applicability of our method to real-world healthcare and the insights it can extract from data in such a complicated environment.

### D.2 Data Preprocessing for Hypotension Analysis

In our investigation into hypotension within ICU settings, we tailored our preprocessing steps to include only patients affected by this condition. We began by applying filters to the MIMIC-IV dataset to identify the patient cohort of interest. These filters were designed to capture adults aged 18 to 80 years who had ICU stays of at least 24 hours and exhibited Mean Arterial Pressure (MAP) readings of 65 mmHg or below, indicative of acute hypotension.
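The three cohort filters can be expressed as a single predicate over per-stay records. The sketch below is a schematic illustration, not the authors' preprocessing code: the record layout and the field names `age`, `icu_hours`, and `map_mmhg` are hypothetical and do not correspond to actual MIMIC-IV table or column names, and "exhibited MAP readings of 65 mmHg or below" is read here as at least one such reading.

```python
MIN_AGE, MAX_AGE = 18, 80   # adults aged 18 to 80 years
MIN_ICU_HOURS = 24          # ICU stay of at least 24 hours
MAP_THRESHOLD = 65          # mmHg; at or below indicates acute hypotension


def in_cohort(stay):
    """Return True if a stay record passes all three inclusion filters.

    ASSUMPTION: `stay` is a dict with hypothetical keys `age`, `icu_hours`,
    and `map_mmhg` (a list of MAP readings); one reading at or below the
    threshold is treated as sufficient.
    """
    return (
        MIN_AGE <= stay["age"] <= MAX_AGE
        and stay["icu_hours"] >= MIN_ICU_HOURS
        and any(m <= MAP_THRESHOLD for m in stay["map_mmhg"])
    )


def select_cohort(stays):
    """Keep only the stays that satisfy the inclusion criteria."""
    return [s for s in stays if in_cohort(s)]
```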

The analytical framework of this study is structured around a state space defined by a specific set of 15 clinical variables\. These variables are categorized into five functional subgroups to facilitate multidimensional analysis:

Table 2: Categorization of Clinical State Space Variables

Additionally, the variable `time_step` is used to account for the temporal dimension of the state space.

The action space encompasses two primary treatment modalities: intravenous (IV) fluid bolus therapy and vasopressor therapy. This filtering approach yielded a dataset comprising 1,684 distinct ICU admissions, from which we derived approximately 100,000 tuples $(\text{state}, \text{action}, \text{next\_state})\in\mathcal{D}$. This dataset serves as the foundation to evaluate our method and the baselines.
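As an illustration of how per-stay sequences become $(\text{state}, \text{action}, \text{next\_state})$ tuples, the sketch below pairs consecutive time steps of one trajectory. It is a generic construction under the assumption that states and actions are aligned by time step; the function name is hypothetical.

```python
def to_transitions(states, actions):
    """Build (s_t, a_t, s_{t+1}) tuples from one stay's aligned sequences.

    ASSUMPTION: states[t] and actions[t] refer to the same time step, and the
    last state has no successor, so it only appears as a next_state.
    """
    if len(actions) < len(states) - 1:
        raise ValueError("need an action for every state except the last")
    return [(states[t], actions[t], states[t + 1]) for t in range(len(states) - 1)]


def build_dataset(trajectories):
    """Concatenate transitions from many (states, actions) trajectories."""
    dataset = []
    for states, actions in trajectories:
        dataset.extend(to_transitions(states, actions))
    return dataset
```

A stay with $n$ recorded time steps contributes $n-1$ tuples under this construction.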

![Refer to caption](https://arxiv.org/html/2605.12831v1/figures/mimic_trajs.png)

Figure 11: ICU hypotension treatment trajectories from MIMIC-IV. Each trajectory corresponds to one patient stay, with hourly clinical measurements and clinician-prescribed actions. The plots show the temporal evolution of the recorded clinical variables together with the four possible treatment actions: no treatment, vasopressor, IV fluid bolus, or both.

### D.3 Action Space Definition

The action space in our model encapsulates the range of possible treatments administered to patients suffering from hypotension\. It consists of four discrete actions, each representing a specific treatment strategy\. The actions are enumerated as follows:

Table 3: Definition of Actions in the Treatment Strategy Space

Each action is designed to reflect the clinical decisions made in the intensive care unit for managing patients' blood pressure levels. Action 0 (no treatment) represents a conservative approach, where no immediate intervention is applied. Action 1 (vasopressor therapy) and Action 2 (IV fluid bolus) correspond to the administration of specific treatments aimed at increasing blood pressure.
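The four discrete actions can be encoded from two binary treatment indicators. In the sketch below, indices 0 to 2 follow the text above, while assigning index 3 to the combined treatment is an assumption based on the four actions listed in the Figure 11 caption; the class and function names are hypothetical.

```python
from enum import IntEnum


class Treatment(IntEnum):
    NONE = 0            # no treatment
    VASOPRESSOR = 1     # vasopressor therapy only
    IV_FLUID_BOLUS = 2  # IV fluid bolus only
    BOTH = 3            # ASSUMPTION: index of the combined action


def encode_action(vasopressor, fluid_bolus):
    """Map two binary treatment indicators to one discrete action index."""
    return Treatment(int(vasopressor) + 2 * int(fluid_bolus))
```

Treating the two indicators as bits keeps the encoding invertible, so the individual treatments can be recovered from the action index when inspecting learned policies.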

## Appendix E Full Results

Table 4: For navigation, accuracy is reported inside and outside decision regions; for cancer and ICU, the first accuracy column is overall accuracy. Higher accuracy and lower NLL/$z$ size are better.

## NeurIPS Paper Checklist

1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
   - Answer: [Yes]
   - Justification: The claims are stated in the abstract and introduction and are supported by the experimental results in Section [5](https://arxiv.org/html/2605.12831#S5) and Section 6.
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: [Yes]
   - Justification: Limitations are discussed at the end of Section 7.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
   - Answer: [N/A]
   - Justification: The paper does not present formal theoretical results or proofs.
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
   - Answer: [Yes]
   - Justification: We provide algorithmic and experimental details in the main paper and appendix.
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   - Answer: [No]
   - Justification: Code is not released with this submission.
6. Experimental setting/details
   - Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?
   - Answer: [Yes]
   - Justification: Yes, see the Appendix.
7. Experiment statistical significance
   - Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
   - Answer: [N/A]
   - Justification: Our method is more descriptive in nature than quantitative, so we provide results with corresponding visualizations on one set of demonstrations per experiment.
8. Experiments compute resources
   - Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
   - Answer: [N/A]
   - Justification: Only a local computer was needed to run the experiments.
9. Code of ethics
   - Answer: [Yes]
   - Justification: The research conforms with the NeurIPS Code of Ethics.
10. Broader impacts
   - Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
   - Answer: [Yes]
   - Justification: Yes, we discuss how the work relates to real clinician settings.
11. Safeguards
   - Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?
   - Answer: [N/A]
   - Justification:
12. Licenses for existing assets
   - Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
   - Answer: [Yes]
   - Justification: We credited the dataset used.
13. New assets
   - Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
   - Answer: [N/A]
   - Justification: N/A
14. Crowdsourcing and research with human subjects
   - Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
   - Answer: [N/A]
   - Justification: N/A
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
   - Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
   - Answer: [N/A]
   - Justification: N/A
16. Declaration of LLM usage
   - Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required.
   - Answer: [N/A]
   - Justification: N/A
