Imperfect World Models are Exploitable

arXiv cs.AI Papers

Summary

This paper formalizes model exploitation in reinforcement learning, proving it is unavoidable in large policy sets, and establishes a theoretical bridge between reward hacking and model exploitation.

arXiv:2605.15960v1 Announce Type: new Abstract: We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.

Cached at: 05/18/26, 06:34 AM

# Imperfect World Models are Exploitable
Source: [https://arxiv.org/html/2605.15960](https://arxiv.org/html/2605.15960)
Logan Mondal Bhamidipaty¹, Esmeralda S. Whitammer¹, David Abel¹, Mykel J. Kochenderfer², Subramanian Ramamoorthy¹

¹University of Edinburgh, ²Stanford University

###### Abstract

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment’s true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.

## 1 Introduction

Efficient real-world planning typically requires approximating complex dynamics (Simon, [1955](https://arxiv.org/html/2605.15960#bib.bib46); Javed and Sutton, [2024](https://arxiv.org/html/2605.15960#bib.bib55)). In the best case, well-chosen approximations unravel otherwise intractable problems, as with admissible heuristics in search (Hart et al., [1968](https://arxiv.org/html/2605.15960#bib.bib45)), model reduction in dynamical systems (Antoulas, [2005](https://arxiv.org/html/2605.15960#bib.bib44)), and latent-space world models in sequential decision-making (Hafner et al., [2019](https://arxiv.org/html/2605.15960#bib.bib40)). In the worst case, however, ill-chosen approximations can compromise safety, as with mimicry in evolutionary biology (Bates, [1862](https://arxiv.org/html/2605.15960#bib.bib63)), arbitrage in financial markets (Ross, [1976](https://arxiv.org/html/2605.15960#bib.bib42)), and speedrunning in video games (Scully-Blaker, [2014](https://arxiv.org/html/2605.15960#bib.bib43)). In reinforcement learning (RL), this liability arises not only exogenously from adversaries but also endogenously from the optimizer itself: an agent trained to maximize total expected reward under an imperfect model¹ may discover behavior that performs well in simulation but poorly in reality (Jakobi et al., [1995](https://arxiv.org/html/2605.15960#bib.bib18); Tobin et al., [2017](https://arxiv.org/html/2605.15960#bib.bib39)). This failure mode is called *model exploitation* (Ha and Schmidhuber, [2018](https://arxiv.org/html/2605.15960#bib.bib41); Janner et al., [2019](https://arxiv.org/html/2605.15960#bib.bib37)), and a natural question is whether it can be avoided.

¹ We use “model” to mean the transition model of a Markov decision process ([Section 2](https://arxiv.org/html/2605.15960#S2)) or an agent’s approximation to it, sometimes called a world model (Ha and Schmidhuber, [2018](https://arxiv.org/html/2605.15960#bib.bib41)). We do not use it to mean the policy (as a generative model of actions), as is sometimes meant in other RL literature.

This question pertains to notable results in RL theory including the simulation lemma (Kearns and Singh, [2002](https://arxiv.org/html/2605.15960#bib.bib60)) and the value equivalence principle (Grimm et al., [2020](https://arxiv.org/html/2605.15960#bib.bib61)). Both implicitly use predictive performance to measure the quality of a model’s approximation. The simulation lemma does so cardinally by bounding the error from optimizing under an imperfect model, while the value equivalence principle does so locally by characterizing when two models induce identical Bellman updates. Modern world models also emphasize predictive performance, typically learning latent-space models of transitions (Hafner et al., [2019](https://arxiv.org/html/2605.15960#bib.bib40); Balestriero and LeCun, [2025](https://arxiv.org/html/2605.15960#bib.bib11)).

We argue that while this emphasis on performance is necessary for efficient planning, it is not sufficient for safe planning. Recent work on reward specification (Hadfield-Menell et al., [2017](https://arxiv.org/html/2605.15960#bib.bib12); Abel et al., [2021](https://arxiv.org/html/2605.15960#bib.bib54)) views reward functions as observations of the designer’s intent to communicate a goal, which may be an imperfect reflection of a true objective. Empirical transition models are similarly imperfect. In this setting, the natural notion of safety is ordinal rather than cardinal (does the model preserve which policies are better than which?) and global rather than local (does it do so across the entire policy set, not just on the chosen classes?). Concretely, a proxy model is *safe* in our sense if it does not invert the policy ordering induced by the true model. That is, whenever reality prefers a policy $\pi$ over $\pi'$, a safe approximation should not prefer $\pi'$ over $\pi$. When such an inversion exists, we say the pair of transition models is *exploitable* ([Definition 1](https://arxiv.org/html/2605.15960#Thmdefinition1)).

Our formalization of model exploitation is analogous to the notion of reward hacking in Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)). Unfortunately, their characterization of when hacking exists and is avoidable does not transfer to exploitation ([Section 3.2](https://arxiv.org/html/2605.15960#S3.SS2)). We overcome this by introducing a more general theory, from which four main contributions emerge.

1. We propose a new definition of model exploitation ([Definition 1](https://arxiv.org/html/2605.15960#Thmdefinition1)) that, to the best of our knowledge, is the first ordinal notion of safety for world models.
2. We develop a unified theory of reward hacking and model exploitation that characterizes exploitation on common policy sets ([Corollaries 1](https://arxiv.org/html/2605.15960#Thmcorollary1) and [2](https://arxiv.org/html/2605.15960#Thmcorollary2)), yields [Theorem 1](https://arxiv.org/html/2605.15960#Thmskalsetheorem1) of Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)) as a special case ([Corollary 3](https://arxiv.org/html/2605.15960#Thmcorollary3)), and provides a shared geometric intuition for both phenomena ([Lemmas 1](https://arxiv.org/html/2605.15960#Thmlemma1) and [2](https://arxiv.org/html/2605.15960#Thmlemma2)).
3. We give a construction reducing any instance of model exploitation to one of reward hacking ([Proposition 2](https://arxiv.org/html/2605.15960#Thmproposition2)), but show that a converse construction does not exist ([Counterexample 1](https://arxiv.org/html/2605.15960#Thmcounterexample1)). We further prove that the conditions that guarantee unhackability are insufficient to guarantee unexploitability (see the finite-policy-set counterexample in [Appendix B](https://arxiv.org/html/2605.15960#A2)).
4. We introduce a relaxed notion of model exploitation, $\varepsilon$-exploitability ([Definition 3](https://arxiv.org/html/2605.15960#Thmdefinition3)), and use the tight form of the simulation lemma (Kearns and Singh, [2002](https://arxiv.org/html/2605.15960#bib.bib60)) recently proved by Lobel and Parr ([2024](https://arxiv.org/html/2605.15960#bib.bib57)) to derive a safe horizon for discounted Markov decision processes within which $\varepsilon$-unexploitability is guaranteed ([Theorem 2](https://arxiv.org/html/2605.15960#Thmtheorem2)).

## 2 Preliminaries

We first recap the preliminaries for reinforcement learning (Sutton et al., [1998](https://arxiv.org/html/2605.15960#bib.bib58)) and introduce our definition of model exploitation.

### 2.1 Reinforcement learning

We consider a *Markov decision process (MDP)* $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{T},d_0,\mathcal{R},\gamma)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}\colon\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the transition model, $d_0\in\Delta(\mathcal{S})$ is the initial state distribution, $\mathcal{R}\colon\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the reward function, and $\gamma\in[0,1)$ is the discount factor. As in Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)), we assume finite $\mathcal{S}$ with all states reachable and finite $\mathcal{A}$ with $|\mathcal{A}|>1$.

A *stationary policy* $\pi\colon\mathcal{S}\to\Delta(\mathcal{A})$ maps each state to a distribution over actions. A *non-stationary policy* $\pi=(\pi_0,\pi_1,\pi_2,\dots)$ is a sequence of mappings $\pi_t\colon\mathcal{S}\to\Delta(\mathcal{A})$, allowing the action distribution to depend on the time step. Rolling out a policy produces a trajectory $\tau=(s_0,a_0,r_0,\dots)$ whose discounted return is $G(\tau)=\sum_{t=0}^{\infty}\gamma^{t}r_t$. The *value* of a policy is its expected return $J(\pi)=\mathbb{E}_{\tau\sim\pi}[G(\tau)]$, or, equivalently, $J(\pi)=\langle\mathcal{R},\mathcal{F}^{\pi}_{\mathcal{T}}\rangle$ where $\mathcal{F}^{\pi}_{\mathcal{T}}(s,a)=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\mathbbm{1}(s_t=s,a_t=a)\right]$ are the *discounted visit counts* of $\pi$ under $\mathcal{T}$. We make the standard assumption in classical RL and optimal control (Sutton et al., [1998](https://arxiv.org/html/2605.15960#bib.bib58); Bertsekas, [2012](https://arxiv.org/html/2605.15960#bib.bib64)) that designers use value to compare policies and prefer high-value policies over low-value ones.

An *environment* is an MDP without a reward function $(\mathcal{S},\mathcal{A},\mathcal{T},d_0,\_,\gamma)$,² and a *task* is an MDP without a transition model $(\mathcal{S},\mathcal{A},\_,d_0,\mathcal{R},\gamma)$. When considering multiple reward functions in a fixed environment, we write $J_{\mathcal{R}}(\pi)=\langle\mathcal{R},\mathcal{F}^{\pi}\rangle$ (dropping the subscript on $\mathcal{F}^{\pi}$ when $\mathcal{T}$ is fixed). When considering multiple transition models in a fixed task, we write $J_{\mathcal{T}}(\pi)=\langle\mathcal{R},\mathcal{F}^{\pi}_{\mathcal{T}}\rangle$. We use $J_i$ as shorthand for $J_{\mathcal{R}_i}$ or $J_{\mathcal{T}_i}$ depending on whether the index ranges over reward functions or transition models.

² This is sometimes called a controlled Markov process.

Since $\mathcal{S}$ and $\mathcal{A}$ are finite, the space of stationary policies can be represented as a product of $|\mathcal{S}|$ probability simplices $\Delta(\mathcal{A})^{|\mathcal{S}|}$. We write $\Pi^{+}$ for the interior of this space, consisting of all policies with $\pi(a\mid s)>0$ for every $s\in\mathcal{S}$ and $a\in\mathcal{A}$. Unless otherwise specified, openness is taken with respect to $\Pi^{+}$. We provide a reference list of notation and terminology in [Appendix A](https://arxiv.org/html/2605.15960#A1).
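The identity $J(\pi)=\langle\mathcal{R},\mathcal{F}^{\pi}_{\mathcal{T}}\rangle$ can be illustrated with a minimal pure-Python sketch. The two-state "stay/switch" MDP, the function names `visit_counts` and `value`, and the truncation horizon are all assumptions made for illustration; none of them come from the paper.

```python
def visit_counts(T, d0, pi, gamma, horizon=2000):
    """Approximate F[s][a] = sum_t gamma^t * P(s_t = s, a_t = a), truncated at `horizon`."""
    n_s, n_a = len(pi), len(pi[0])
    F = [[0.0] * n_a for _ in range(n_s)]
    d, disc = list(d0), 1.0            # state distribution at time t, and gamma^t
    for _ in range(horizon):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                w = d[s] * pi[s][a]    # P(s_t = s, a_t = a)
                F[s][a] += disc * w
                for s2 in range(n_s):
                    nxt[s2] += w * T[s][a][s2]
        d, disc = nxt, disc * gamma
    return F


def value(R, F):
    """J(pi) = <R, F>: inner product of rewards and discounted visit counts."""
    return sum(R[s][a] * F[s][a] for s in range(len(R)) for a in range(len(R[0])))


# Hypothetical toy MDP: action 0 stays in place, action 1 switches state.
T = [[[1.0, 0.0], [0.0, 1.0]],    # from state 0: stay -> 0, switch -> 1
     [[0.0, 1.0], [1.0, 0.0]]]    # from state 1: stay -> 1, switch -> 0
R = [[0.0, 0.0], [1.0, 1.0]]      # reward 1 for acting in state 1
d0, gamma = [1.0, 0.0], 0.9

always_switch = [[0.0, 1.0], [0.0, 1.0]]
always_stay = [[1.0, 0.0], [1.0, 0.0]]

# Switching visits state 1 at odd time steps: J = gamma / (1 - gamma^2), about 4.74.
J_switch = value(R, visit_counts(T, d0, always_switch, gamma))
# Staying never leaves state 0, so no reward is collected.
J_stay = value(R, visit_counts(T, d0, always_stay, gamma))
```

In this sketch the visit counts of any policy sum to roughly $1/(1-\gamma)$, which is a quick sanity check on the truncation.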

### 2.2 Defining model exploitation and reward hacking

![Refer to caption](https://arxiv.org/html/2605.15960v1/x1.png)
Figure 1: Taxonomy of transition model relationships in a 3-state MDP ([Appendix D](https://arxiv.org/html/2605.15960#A4)). Policies are parameterized by $\pi_{\theta}(a_0\mid s)=\theta$ for all $s\in\mathcal{S}$. Each panel has the same $\mathcal{T}_1$ but a unique $\mathcal{T}_2$. (a) $\mathcal{T}_2$ is trivial, so $J_2$ is constant. (b) $\mathcal{T}_1$, $\mathcal{T}_2$ are equivalent (in policy ordering): both value functions increase in $\theta$, preserving the policy ordering. (c, d) $\mathcal{T}_1$, $\mathcal{T}_2$ are exploitable: dotted lines show the exploiting pair $(\pi,\pi')$ from [Definition 1](https://arxiv.org/html/2605.15960#Thmdefinition1), with $J_1(\pi')>J_1(\pi)$ but $J_2(\pi)>J_2(\pi')$.

We define model exploitation as a binary relation between two transition models.

###### Definition 1 (Model Exploitation).

Transition functions $\mathcal{T}$ and $\mathcal{T}'$ are *exploitable* relative to a policy set $\Pi$ and a task $(\mathcal{S},\mathcal{A},\_,d_0,\mathcal{R},\gamma)$ if there exist $\pi,\pi'\in\Pi$ such that

$$J_{\mathcal{T}}(\pi)>J_{\mathcal{T}}(\pi')\quad\text{and}\quad J_{\mathcal{T}'}(\pi')>J_{\mathcal{T}'}(\pi),$$

otherwise they are *unexploitable*.

Intuitively, model exploitation says that optimizing under one transition model may look like a mistake under the other. This definition directly analogizes the notion of reward hacking in Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)) with transition models in place of reward functions. We restate their definition below.

###### Definition 2 (Reward Hacking, Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1))).

Reward functions $\mathcal{R}$ and $\mathcal{R}'$ are *hackable* relative to a policy set $\Pi$ and an environment $(\mathcal{S},\mathcal{A},\mathcal{T},d_0,\_,\gamma)$ if there exist $\pi,\pi'\in\Pi$ such that

$$J_{\mathcal{R}}(\pi)>J_{\mathcal{R}}(\pi')\quad\text{and}\quad J_{\mathcal{R}'}(\pi')>J_{\mathcal{R}'}(\pi),$$

otherwise they are *unhackable*.

Both definitions are instances of a common primitive, which we call *value inversion*. We say two value functions $J_1$ and $J_2$ admit a *value inversion* on $\Pi$ if there exist $\pi,\pi'\in\Pi$ such that $J_1(\pi)>J_1(\pi')$ and $J_2(\pi')>J_2(\pi)$. Thus, model exploitation is a value inversion between $J_{\mathcal{T}}$ and $J_{\mathcal{T}'}$, and reward hacking is a value inversion between $J_{\mathcal{R}}$ and $J_{\mathcal{R}'}$.

We further say $J_1$ and $J_2$ are *equivalent* on $\Pi$ if they induce the same ordering on $\Pi$ and that a value function $J$ is *trivial* on $\Pi$ if $J(\pi)$ is constant on $\Pi$ ([Figure 1](https://arxiv.org/html/2605.15960#S2.F1)). Informally, equivalence means that two value functions agree on every pairwise comparison between policies, while triviality means there are no meaningful comparisons to make. We also use trivial and equivalent to describe transition models and reward functions as shorthand for the corresponding properties of the value functions they induce. We note that triviality precludes value inversion and that value inversion is symmetric, irreflexive, and not necessarily transitive.
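On a finite policy set, the three notions above can be checked by brute force. The following is a minimal sketch, assuming value functions are given as dictionaries from (hypothetical) policy labels to values; the function names are illustrative, and `are_equivalent` reads "same ordering" as agreement on every pairwise comparison, ties included.

```python
from itertools import permutations


def find_value_inversion(J1, J2):
    """Return a pair (pi, pi_prime) with J1[pi] > J1[pi_prime] and
    J2[pi_prime] > J2[pi], or None if no such pair exists."""
    for pi, pi_prime in permutations(J1, 2):
        if J1[pi] > J1[pi_prime] and J2[pi_prime] > J2[pi]:
            return pi, pi_prime
    return None


def is_trivial(J):
    """A value function is trivial on the set if it is constant there."""
    return len(set(J.values())) <= 1


def are_equivalent(J1, J2):
    """Assumed reading of equivalence: every pairwise comparison agrees, ties included."""
    def sign(x):
        return (x > 0) - (x < 0)
    return all(sign(J1[p] - J1[q]) == sign(J2[p] - J2[q])
               for p, q in permutations(J1, 2))


# Hypothetical values for three policies under a "true" and a "proxy" model.
J_true = {"A": 1.0, "B": 2.0, "C": 3.0}
J_proxy = {"A": 1.0, "B": 3.0, "C": 2.0}    # swaps the ranking of B and C
J_scaled = {"A": 10.0, "B": 20.0, "C": 30.0}
J_flat = {"A": 5.0, "B": 5.0, "C": 5.0}

inversion = find_value_inversion(J_true, J_proxy)   # the pair ("C", "B")
```

Here `J_scaled` is equivalent to `J_true` and `J_flat` is trivial, so neither admits an inversion with `J_true`, matching the remark that triviality precludes value inversion.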

## 3 Results

To understand when model exploitation is avoidable, we ask a single question with increasing precision: *on which policy sets can we find non-trivial, non-equivalent, unexploitable transition model pairs?* We begin with the largest possible policy set and progressively tighten the scope.

### 3.1 Non-stationary policies

Consider first the set of all non-stationary policies $\Pi_{\text{NS}}$. If exploitation can be avoided here, it can be avoided on any policy set, since all policy sets are subsets of $\Pi_{\text{NS}}$. Unfortunately, it cannot: we find that no interesting unexploitability exists.

###### Proposition 1 (No unexploitability on $\Pi_{\text{NS}}$).

On the set of all non-stationary policies, every non-trivial, non-equivalent pair of transition models is exploitable.

We defer all formal proofs to [Appendix B](https://arxiv.org/html/2605.15960#A2). The proof uses a similar construction to the one in Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)), relying on the richness of $\Pi_{\text{NS}}$ to find an exploiting policy pair, but says nothing about the set of stationary policies, where the question is considerably harder.

### 3.2 Distinguishing model exploitation from reward hacking

Having seen that interesting unexploitability is impossible in the most general policy set, a natural question is whether exploitation can be avoided in smaller policy sets, such as deterministic or optimal policies. From [Theorem 1](https://arxiv.org/html/2605.15960#Thmskalsetheorem1) of Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)), we know that reward hacking is inevitable on any policy set that contains an open subset (under mild regularity conditions). Given that our proof of [Proposition 1](https://arxiv.org/html/2605.15960#Thmproposition1) follows a similar construction to the one in Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)), it is reasonable to expect that we might derive analogous results using the same or comparable techniques. Unfortunately, this expectation fails, as we argue in this section.

#### 3.2.1 Distinguishing transition models and reward functions

Transition models and reward functions have significantly different geometries that make it hard to reason about them in a unified manner. Two of these differences are important for our results.

First, the policy value function $J(\pi)=\langle\mathcal{R},\mathcal{F}_{\mathcal{T}}^{\pi}\rangle$ is linear in the reward function but nonlinear in the transition model’s probability mass function. This means that the linear-algebraic and real-analytic machinery that Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)) develop for their characterization of reward hacking generally does not transfer to the model exploitation setting, so analogous results (if they exist) must rely on different methods.

Second, the only constraint imposed on the reward function is typically boundedness: rewards may not go to infinity. Transition models, on the other hand, are probability mass functions, which are nonnegative and sum to 1 for each state-action pair. This means reward functions live in the full Euclidean space $\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, which is closed under negation, scaling, and addition, but transition models live on a compact product of simplices $\Delta(\mathcal{S})^{|\mathcal{S}||\mathcal{A}|}$, which does not have vector space structure (at least in the probability domain).

The primary consequence of these differences is that neither the techniques nor the results from reward hacking cleanly apply to model exploitation. In fact, as we show next, results that are true for hacking are sometimes false for exploitation.
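The first difference can be seen numerically in a small sketch. The two-state MDP, the helper names `value` and `mix`, and the truncation horizon below are illustrative assumptions rather than anything taken from the paper: averaging two reward functions averages the values, while averaging two transition models does not.

```python
def value(T, R, d0, pi, gamma, horizon=2000):
    """J(pi) by propagating the state distribution, truncated at `horizon` steps."""
    n_s, n_a = len(pi), len(pi[0])
    d, disc, total = list(d0), 1.0, 0.0
    for _ in range(horizon):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                w = d[s] * pi[s][a]            # P(s_t = s, a_t = a)
                total += disc * w * R[s][a]
                for s2 in range(n_s):
                    nxt[s2] += w * T[s][a][s2]
        d, disc = nxt, disc * gamma
    return total


def mix(X, Y, alpha=0.5):
    """Entrywise convex combination alpha * X + (1 - alpha) * Y of nested lists."""
    if isinstance(X, list):
        return [mix(x, y, alpha) for x, y in zip(X, Y)]
    return alpha * X + (1 - alpha) * Y


gamma, d0 = 0.9, [1.0, 0.0]
pi = [[1.0, 0.0], [1.0, 0.0]]                                     # always take action 0
T_stay = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]     # action 0 stays put
T_switch = [[[0.0, 1.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]]   # action 0 switches state
R1 = [[0.0, 0.0], [1.0, 1.0]]                                     # reward for being in state 1
R2 = [[1.0, 1.0], [0.0, 0.0]]                                     # reward for being in state 0

# Linear in the reward: the value of the averaged reward is the averaged value.
J_R1 = value(T_switch, R1, d0, pi, gamma)
J_R2 = value(T_switch, R2, d0, pi, gamma)
J_R_mix = value(T_switch, mix(R1, R2), d0, pi, gamma)

# Nonlinear in the transition model: the averaged model is not the averaged value.
J_stay = value(T_stay, R1, d0, pi, gamma)
J_switch = value(T_switch, R1, d0, pi, gamma)
J_T_mix = value(mix(T_stay, T_switch), R1, d0, pi, gamma)
```

In this example the mixed transition model gives a value of about 4.5, whereas the average of the two individual values is about 2.37.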

#### 3.2.2 Finite policy sets

Unlike reward hacking, model exploitation persists even in finite policy sets. Formally, [Theorem 2](https://arxiv.org/html/2605.15960#Thmskalsetheorem2) of Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)) shows that, under some mild regularity conditions, a non-trivial, non-equivalent unhackable reward pair always exists on any finite policy set.

Making the corresponding existence guarantee for unexploitable transition model pairs requires ruling out two degeneracies that have no analogue for reward hacking. When $\gamma=0$, the value $J(\pi)=\sum_{s}d_0(s)\sum_{a}\pi(a\mid s)\,\mathcal{R}(s,a)$ is independent of $\mathcal{T}$, so every transition model pair is equivalent. The same obstruction applies when both $\mathcal{R}$ and every policy in $\Pi$ are stateless (i.e., $\mathcal{R}(s,a)=\mathcal{R}(a)$ and $\pi(a\mid s)=\pi(a)$ for all $s\in\mathcal{S}$), since then $J(\pi)=\frac{1}{1-\gamma}\sum_{a}\mathcal{R}(a)\,\pi(a)$. In both cases, no transition model pairs are non-equivalent.

In examining whether the remaining regularity conditions of Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)) suffice once these degeneracies are excluded, we noticed that [Theorem 2](https://arxiv.org/html/2605.15960#Thmskalsetheorem2) requires an additional hypothesis that is not present in the original statement but implicitly used in the proof. (We view this as a minor oversight that does not affect the broader conclusions of their work, since the condition is degenerate and easily excluded.)

###### Proposition (Non-collinearity is necessary for [Theorem 2](https://arxiv.org/html/2605.15960#Thmskalsetheorem2) of Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1))).

On any policy set with collinear visit counts (i.e., all visit counts $\mathcal{F}^{\pi}$ for $\pi\in\Pi$ lie on a single line in $\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$), every non-trivial, non-equivalent pair of reward functions is hackable.

We provide a corrected statement in [Appendix C](https://arxiv.org/html/2605.15960#A3) with the additional hypothesis that visit counts are not collinear. Unfortunately, even all of the regularity conditions above, together with those of [Theorem 2](https://arxiv.org/html/2605.15960#Thmskalsetheorem2), are insufficient to guarantee the existence of a non-trivial, non-equivalent unexploitable transition model pair on finite policy sets (see the finite-policy-set counterexample in [Appendix B](https://arxiv.org/html/2605.15960#A2)). We suspect that the finite-policy-set case admits no clean existence guarantee for unexploitability, because the nonlinear dependence of $J$ on $\mathcal{T}$ makes the geometry of the induced value orderings fundamentally richer than in the reward hacking setting.

###### Question 1.

Are there necessary and sufficient conditions on a task $(\mathcal{S},\mathcal{A},\_,d_0,\mathcal{R},\gamma)$ and finite policy set that guarantee the existence of a non-trivial, non-equivalent, unexploitable pair of transition models?

### 3.3 Unifying model exploitation and reward hacking

We next return to our main question: *can exploitability be avoided by restricting the policy set?* As discussed in [Section 3.2](https://arxiv.org/html/2605.15960#S3.SS2) above, [Theorem 1](https://arxiv.org/html/2605.15960#Thmskalsetheorem1) of Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)) shows that the answer is no for reward hacking on any policy set containing an open subset, but the result does not directly apply to exploitation. Surprisingly, this does not mean that reward hacking has nothing to tell us about model exploitation. In this section, we first show that every instance of model exploitation can be reduced to an instance of reward hacking, then develop this insight into a unified theory of exploitation and hacking.

###### Proposition 2 (Exploitation implies hacking).

For any exploitable pair $(\mathcal{T},\mathcal{T}')$ relative to a policy set $\Pi$ and a task $(\mathcal{S},\mathcal{A},\_,d_0,\mathcal{R},\gamma)$, there exists a reward function $\mathcal{R}'$ such that $(\mathcal{R},\mathcal{R}')$ is hackable relative to $\Pi$ and the environment $(\mathcal{S},\mathcal{A},\mathcal{T},d_0,\_,\gamma)$.

While the result is perhaps surprising, the proof is a short construction: set $\mathcal{R}'$ to be the difference in visit counts $\mathcal{F}_{\mathcal{T}}^{\pi'}-\mathcal{F}_{\mathcal{T}}^{\pi}$ for some exploiting policy pair $\pi,\pi'\in\Pi$. Interestingly, the converse does not hold.
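The construction behind Proposition 2 can be checked in a few lines. The sketch below is our own expansion of the one-sentence argument, not the appendix proof; the symbol $\Delta$ for the visit-count difference is introduced here for convenience, and $(\pi,\pi')$ is an exploiting pair with $J_{\mathcal{T}}(\pi)>J_{\mathcal{T}}(\pi')$.

```latex
\begin{align*}
\Delta &:= \mathcal{F}_{\mathcal{T}}^{\pi'} - \mathcal{F}_{\mathcal{T}}^{\pi},
  \qquad \mathcal{R}' := \Delta, \\
J_{\mathcal{R}}(\pi) - J_{\mathcal{R}}(\pi')
  &= \langle \mathcal{R}, \mathcal{F}_{\mathcal{T}}^{\pi} - \mathcal{F}_{\mathcal{T}}^{\pi'} \rangle
   = J_{\mathcal{T}}(\pi) - J_{\mathcal{T}}(\pi') > 0, \\
J_{\mathcal{R}'}(\pi') - J_{\mathcal{R}'}(\pi)
  &= \langle \Delta, \mathcal{F}_{\mathcal{T}}^{\pi'} - \mathcal{F}_{\mathcal{T}}^{\pi} \rangle
   = \lVert \Delta \rVert_2^2 > 0 .
\end{align*}
```

The last inequality is strict because $\Delta\neq 0$: if the two visit-count vectors coincided, the values $J_{\mathcal{T}}(\pi)$ and $J_{\mathcal{T}}(\pi')$ would be equal, contradicting the second line. Together the two inequalities are exactly the hackability condition of Definition 2 in the environment with transition model $\mathcal{T}$.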

###### Counterexample 1 (Hacking may not imply exploitation).

Consider any MDP with $|\mathcal{A}|\geq 2$ and $\gamma>0$. Let $\mathcal{R}(s,a)=\mathbbm{1}\{a=a_1\}$ and $\mathcal{R}'(s,a)=\mathbbm{1}\{a=a_2\}$, and let $\pi$ always select $a_1$ and $\pi'$ always select $a_2$. Then $(\mathcal{R},\mathcal{R}')$ is hackable, since $J_{\mathcal{R}}(\pi)=1/(1-\gamma)>0=J_{\mathcal{R}}(\pi')$ and $J_{\mathcal{R}'}(\pi')=1/(1-\gamma)>0=J_{\mathcal{R}'}(\pi)$, but no perturbation of the transition model can alter the relative ordering of $\pi$ and $\pi'$ under either reward function.
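Counterexample 1 can be reproduced numerically. In the sketch below, the two transition models `T_a` and `T_b` are arbitrary hypothetical choices, the helper `value` and the truncation horizon are assumptions, and actions $a_1$, $a_2$ are mapped to indices 0 and 1. Because the rewards depend only on the action and the policies ignore the state, the values come out the same under both models.

```python
def value(T, R, d0, pi, gamma, horizon=2000):
    """J(pi) by propagating the state distribution, truncated at `horizon` steps."""
    n_s, n_a = len(pi), len(pi[0])
    d, disc, total = list(d0), 1.0, 0.0
    for _ in range(horizon):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                w = d[s] * pi[s][a]            # P(s_t = s, a_t = a)
                total += disc * w * R[s][a]
                for s2 in range(n_s):
                    nxt[s2] += w * T[s][a][s2]
        d, disc = nxt, disc * gamma
    return total


gamma, d0 = 0.9, [1.0, 0.0]
R = [[1.0, 0.0], [1.0, 0.0]]          # R(s, a) = 1 if a is a_1 (index 0)
R_prime = [[0.0, 1.0], [0.0, 1.0]]    # R'(s, a) = 1 if a is a_2 (index 1)
pi = [[1.0, 0.0], [1.0, 0.0]]         # always select a_1
pi_prime = [[0.0, 1.0], [0.0, 1.0]]   # always select a_2

# Two arbitrary (hypothetical) transition models on two states.
T_a = [[[0.3, 0.7], [0.9, 0.1]], [[0.5, 0.5], [0.2, 0.8]]]
T_b = [[[1.0, 0.0], [0.0, 1.0]], [[0.6, 0.4], [0.1, 0.9]]]

# Each entry is (J_R(pi), J_R(pi'), J_R'(pi'), J_R'(pi)) under one transition model.
results = [
    (value(T, R, d0, pi, gamma), value(T, R, d0, pi_prime, gamma),
     value(T, R_prime, d0, pi_prime, gamma), value(T, R_prime, d0, pi, gamma))
    for T in (T_a, T_b)
]
```

With $\gamma=0.9$, each tuple is approximately `(10, 0, 10, 0)`, that is, $1/(1-\gamma)$ and $0$, so the reward pair is hackable while the two transition models order $\pi$ and $\pi'$ identically.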

[Proposition 2](https://arxiv.org/html/2605.15960#Thmproposition2) is unassuming by itself, but it hints at something more fundamental: a unified theory of model exploitation and reward hacking may exist *despite* the anatomical differences between reward functions and transition models. If we can find such a theory, we may also discover an answer to our initial question.

Both model exploitation and reward hacking are inversions in the value ordering of a policy set. This suggests that the correct place to begin developing a unified treatment of exploitation and hacking is the value function. Unfortunately, the difficulties that make transition models unwieldy relative to reward functions ([Section 3.2.1](https://arxiv.org/html/2605.15960#S3.SS2.SSS1)) also apply to the value function, so we still cannot apply the machinery of Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)). The key insight to breaking past this obstacle is that the value function is rational in policy space.

###### Proposition 3 (Value function rationality).

For any MDP with finite $\mathcal{S}$ and $\mathcal{A}$, the value function $J(\pi)$ is rational in $\pi$. In particular, $J$ is real-analytic on $\Pi^{+}$.
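One standard route to this statement for stationary policies, which may differ in detail from the proof deferred to the appendix, goes through the closed form of policy evaluation. The symbols $P_\pi$ and $r_\pi$ below are notation introduced for this sketch.

```latex
\begin{align*}
P_\pi(s, s') &= \sum_{a} \pi(a \mid s)\, \mathcal{T}(s' \mid s, a),
  \qquad r_\pi(s) = \sum_{a} \pi(a \mid s)\, \mathcal{R}(s, a), \\
J(\pi) &= d_0^{\top} \left(I - \gamma P_\pi\right)^{-1} r_\pi
        = \frac{d_0^{\top}\, \operatorname{adj}\!\left(I - \gamma P_\pi\right) r_\pi}
               {\det\!\left(I - \gamma P_\pi\right)} .
\end{align*}
```

The entries of $P_\pi$ and $r_\pi$ are linear in $\pi$, so the numerator and denominator are polynomials in the entries of $\pi$. Since $P_\pi$ is a stochastic matrix and $\gamma<1$, the matrix $\gamma P_\pi$ has spectral radius at most $\gamma<1$, so the determinant never vanishes on the policy space. A ratio of polynomials with a non-vanishing denominator is real-analytic, which is consistent with the second claim of the proposition.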

Rational functions are typically easier to reason about than general nonlinear functions because they have a host of desirable properties (Rudin, [1974](https://arxiv.org/html/2605.15960#bib.bib13)). The two properties that are useful for our purposes are analyticity and rigidity. The former provides access to an arsenal of powerful results from analysis and differential geometry, while the latter allows us to promote properties from open subsets to the supersets containing them.

More concretely, from these two properties, we can derive the following lemmas, which relate the gradients of value functions to the existence (or lack thereof) of inversions and give a simple geometric picture: at every point in policy space, the relationship between $\nabla J_1$ and $\nabla J_2$ determines whether the two value functions can disagree on which policy is better.³

³ From now on, we identify the tangent space to the policy space $\Delta(\mathcal{A})^{|\mathcal{S}|}$ at any $\pi$ with $\mathbb{R}^{|\mathcal{S}|(|\mathcal{A}|-1)}$, although the identification is not canonical. We can thus write $\nabla_{v}J(\pi)$ for $v\in\mathbb{R}^{|\mathcal{S}|(|\mathcal{A}|-1)}$ an infinitesimal perturbation of $\pi$.

![Refer to caption](https://arxiv.org/html/2605.15960v1/x2.png)
Figure 2: Gradients for the value curves in [Figure 1](https://arxiv.org/html/2605.15960#S2.F1). (a) $\nabla J_2$ vanishes (trivial). (b) Both gradients positive: proportional gradients force equivalence ([Lemma 2](https://arxiv.org/html/2605.15960#Thmlemma2)). (c, d) The gradients have opposite signs, yielding exploitation ([Lemma 1](https://arxiv.org/html/2605.15960#Thmlemma1)). In (c) this occurs only near $\theta=0$; in (d) it holds everywhere.

###### Lemma 1 (Local inversion).

Let $\Pi$ be open. If there is a direction $v\in\mathbb{R}^{|\mathcal{S}|(|\mathcal{A}|-1)}$ along which $J_1$ increases and $J_2$ decreases at some $\pi\in\Pi$ (i.e., $\nabla_{v}J_2(\pi)<0<\nabla_{v}J_1(\pi)$), then $J_1$ and $J_2$ admit an inversion on $\Pi$.
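A one-parameter illustration of Lemma 1, in the spirit of the parameterization $\pi_\theta(a_0\mid s)=\theta$ used in Figure 1, is sketched below. The two-state MDP, the two transition models, and the finite-difference step are hypothetical choices made for this example, not the 3-state MDP from the paper's appendix.

```python
def value(T, R, d0, pi, gamma, horizon=2000):
    """J(pi) by propagating the state distribution, truncated at `horizon` steps."""
    n_s, n_a = len(pi), len(pi[0])
    d, disc, total = list(d0), 1.0, 0.0
    for _ in range(horizon):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                w = d[s] * pi[s][a]            # P(s_t = s, a_t = a)
                total += disc * w * R[s][a]
                for s2 in range(n_s):
                    nxt[s2] += w * T[s][a][s2]
        d, disc = nxt, disc * gamma
    return total


gamma, d0 = 0.9, [1.0, 0.0]
R = [[0.0, 0.0], [1.0, 1.0]]                                # reward for being in state 1
T1 = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]]   # a_0 leads to state 1
T2 = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]   # a_0 leads to state 0


def J(T, theta):
    """Value of the policy that plays a_0 with probability theta in every state."""
    pi = [[theta, 1.0 - theta] for _ in range(2)]
    return value(T, R, d0, pi, gamma)


def directional_derivative(T, theta, h=1e-4):
    """Central finite difference of J along the direction of increasing theta."""
    return (J(T, theta + h) - J(T, theta - h)) / (2 * h)


theta = 0.5
g1 = directional_derivative(T1, theta)   # positive: J_1 increases with theta
g2 = directional_derivative(T2, theta)   # negative: J_2 decreases with theta

# A small step along theta produces a pair of policies that forms a value inversion.
step = 0.1
inverted = (J(T1, theta + step) > J(T1, theta)) and (J(T2, theta) > J(T2, theta + step))
```

In this toy case $J_1(\theta)=9\theta$ and $J_2(\theta)=9(1-\theta)$, so the derivatives are about $9$ and $-9$, and the policies at $\theta=0.6$ and $\theta=0.5$ form an exploiting pair.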


###### Lemma 2 (Global equivalence).

Let $\tilde{\Pi}\subseteq\Pi$ be open. If $J_1$ and $J_2$ are non-trivial and have positively proportional gradients wherever both are nonzero on $\tilde{\Pi}$, then $J_1$ and $J_2$ are equivalent on $\Pi$.

The intuition for [Lemma 1](https://arxiv.org/html/2605.15960#Thmlemma1) is as follows. If there is a direction in policy space that improves performance under one model but decreases it under the other, then a step in that direction leads to two policies that form an exploiting pair. [Lemma 2](https://arxiv.org/html/2605.15960#Thmlemma2) has a similar intuition. If every direction that improves one value function also improves the other, then the two can never disagree on which of any two policies is better, so they must induce the same ordering. While the result is simple, the formal proof is surprisingly involved and relies heavily on value function rationality.

Together, these two lemmas cover every possible relationship between $\nabla J_1$ and $\nabla J_2$ ([Figure 2](https://arxiv.org/html/2605.15960#S3.F2)). At every point in policy space, the gradients are either linearly independent, antiparallel, or positively proportional. The first two cases produce an inversion via [Lemma 1](https://arxiv.org/html/2605.15960#Thmlemma1); the third forces equivalence via [Lemma 2](https://arxiv.org/html/2605.15960#Thmlemma2). This gives our main characterization.

###### Theorem 1 (Value inversions).

Let $J_1$ and $J_2$ be non-trivial, non-equivalent value functions. If the policy set contains an open subset, then it admits a value inversion for $J_1$ and $J_2$.

The theorem immediately yields the desired characterization for model exploitation.

###### Corollary 1 (Imperfect world models are exploitable).

On any policy set containing an open subset, every non\-trivial, non\-equivalent pair of transition models is exploitable\.

In practice, one might hope that [Corollary 1](https://arxiv.org/html/2605.15960#Thmcorollary1) only applies to high-entropy or poorly performing policies that no reasonable learning algorithm would ever produce. Unfortunately, [Theorem 1](https://arxiv.org/html/2605.15960#Thmtheorem1) still applies both to the set of $\varepsilon$-suboptimal policies (those $\pi$ with $J(\pi) \geq \sup_{\pi'} J(\pi') - \varepsilon$ for $\varepsilon > 0$) and to the set of $\delta$-deterministic policies (those $\pi$ with $\max_{a} \pi(a \mid s) \geq \delta$ for $\delta < 1$ and every $s \in \mathcal{S}$).

###### Corollary 2 (Common exploitable policy sets).

Every non-trivial, non-equivalent pair of transition models is exploitable on (i) the set of all stationary policies, (ii) the set of all $\varepsilon$-suboptimal policies ($\varepsilon > 0$), and (iii) the set of all $\delta$-deterministic policies ($\delta < 1$).
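To make the notion of a value inversion concrete, the following is a minimal sketch in Python on a small tabular MDP. The two transition models `T_true` and `T_model`, the helper names `policy_value` and `mix`, and all numbers are illustrative assumptions and are not taken from the paper. The sketch evaluates two strictly stochastic stationary policies exactly under both models and checks whether the models rank them in opposite orders.

```python
import numpy as np

def policy_value(P, R, pi, d0, gamma):
    """Exact discounted return J(pi) in a tabular MDP.

    P[a, s, t] = T(t | s, a), R[s, a] is the reward, pi[s, a] = pi(a | s),
    and d0 is the initial state distribution.
    """
    P_pi = np.einsum("sa,ast->st", pi, P)   # state kernel induced by pi
    r_pi = (pi * R).sum(axis=1)             # expected one-step reward per state
    v = np.linalg.solve(np.eye(len(d0)) - gamma * P_pi, r_pi)
    return float(d0 @ v)

def mix(theta, n_states=2):
    """Stationary policy with pi(a0 | s) = theta in every state."""
    return np.tile([theta, 1.0 - theta], (n_states, 1))

# Hypothetical transition models (assumed for illustration only).
T_true = np.array([[[0.70, 0.30], [0.70, 0.30]],    # action a0
                   [[0.50, 0.50], [0.50, 0.50]]])   # action a1
T_model = np.array([[[0.55, 0.45], [0.55, 0.45]],   # action a0
                    [[0.60, 0.40], [0.60, 0.40]]])  # action a1
R = np.array([[1.0, 1.0], [0.0, 0.0]])              # reward 1 only in s0
d0 = np.array([0.5, 0.5])
gamma = 0.9

pi_a, pi_b = mix(0.9), mix(0.1)                     # both strictly stochastic
J_true = [policy_value(T_true, R, p, d0, gamma) for p in (pi_a, pi_b)]
J_model = [policy_value(T_model, R, p, d0, gamma) for p in (pi_a, pi_b)]

# A value inversion: the two models rank pi_a and pi_b in opposite orders.
inverted = J_true[0] > J_true[1] and J_model[0] < J_model[1]
print(J_true, J_model, inverted)
```

In this toy instance, action $a_0$ is the better choice under `T_true` but the worse one under `T_model`, so any two policies that differ in how often they pick $a_0$ are ranked in opposite orders by the two models.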

[Theorem 1](https://arxiv.org/html/2605.15960#Thmtheorem1) implies characterizations of both reward hacking and model exploitation. Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)) arrive at the reward hacking result through an argument in the visit-count space, exploiting the linearity of $J$ in the reward vector. The same conclusion follows a fortiori from [Theorem 1](https://arxiv.org/html/2605.15960#Thmtheorem1), since the only property of $J_1$ and $J_2$ used in the proof is rationality in $\pi$, which holds regardless of whether the index ranges over reward functions or transition models.

###### Corollary 3 ([Theorem 1](https://arxiv.org/html/2605.15960#Thmskalsetheorem1) of Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1))).

On any policy set containing an open subset, every non\-trivial, non\-equivalent pair of reward functions is hackable\.

### 3.4 Relaxation

![Refer to caption](https://arxiv.org/html/2605.15960v1/x3.png)
Figure 3: Examples of $\varepsilon$-exploitability and a contour plot for the safe horizon. (a, b) The exploitable transition model pairs from [Figure 1](https://arxiv.org/html/2605.15960#S2.F1)(c, d), annotated with exploiting policies. The pair in (a) is $0.3$-exploitable and the pair in (b) is $2$-exploitable. (c) The safe horizon $H(\varepsilon,\delta)$ from [Theorem 2](https://arxiv.org/html/2605.15960#Thmtheorem2). Each contour marks the largest effective horizon $1/(1-\gamma)$ under which $\varepsilon$-unexploitability is guaranteed for a given tolerance $\varepsilon$ and maximum total variation distance $\delta(\mathcal{T}_1,\mathcal{T}_2)$.

We now introduce a relaxed notion of exploitability to pursue two goals. First, practitioners may not care about small-value inversions in real-world applications, so we seek a more graded measure of exploitability that distinguishes between negligible and catastrophic differences in value. Second, we want to answer our initial question about when exploitation is avoidable, even if only approximately. Thus far, we have seen that interesting unexploitability is impossible on large policy sets ([Corollary 1](https://arxiv.org/html/2605.15960#Thmcorollary1)) and resists characterization on finite ones ([Section 3.2.2](https://arxiv.org/html/2605.15960#S3.SS2.SSS2)). A weaker definition, in contrast, may be more likely to admit positive existence conditions than our usual notion of exploitation.

###### Definition 3 ($\varepsilon$-Exploitation).

Transition models $\mathcal{T}$ and $\mathcal{T}'$ are $\varepsilon$-exploitable relative to a policy set $\Pi$ and a task $(\mathcal{S},\mathcal{A},\_,d_0,\mathcal{R},\gamma)$ if there exist $\pi,\pi' \in \Pi$ such that

$$J_{\mathcal{T}}(\pi) - J_{\mathcal{T}}(\pi') > \varepsilon \quad\text{and}\quad J_{\mathcal{T}'}(\pi') - J_{\mathcal{T}'}(\pi) > \varepsilon;$$

otherwise they are $\varepsilon$-unexploitable.
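On a finite policy set, this definition can be checked by brute force over all ordered policy pairs. The sketch below assumes the values of each policy under both models have already been computed; the function names `exploitation_gap` and `is_eps_exploitable` and the example values are hypothetical. It returns the largest margin by which both inequalities in the definition hold simultaneously, so the pair of models is $\varepsilon$-exploitable exactly when that margin exceeds $\varepsilon$.

```python
import numpy as np

def exploitation_gap(J, J_prime):
    """Max over ordered pairs (pi, pi') of
    min(J_T(pi) - J_T(pi'), J_T'(pi') - J_T'(pi)).

    J[i] and J_prime[i] are the values of policy i under T and T'.
    The diagonal contributes 0, so the result is never negative.
    """
    J = np.asarray(J, dtype=float)
    Jp = np.asarray(J_prime, dtype=float)
    gain = J[:, None] - J[None, :]     # gain[i, j] = J_T(pi_i) - J_T(pi_j)
    loss = Jp[None, :] - Jp[:, None]   # loss[i, j] = J_T'(pi_j) - J_T'(pi_i)
    return float(np.minimum(gain, loss).max())

def is_eps_exploitable(J, J_prime, eps):
    """True iff some pair satisfies both strict inequalities with margin eps."""
    return exploitation_gap(J, J_prime) > eps

# Hypothetical values of three policies under two models.
J_T = [6.62, 5.18, 6.00]
J_Tp = [5.495, 5.855, 5.70]
gap = exploitation_gap(J_T, J_Tp)
print(gap, is_eps_exploitable(J_T, J_Tp, 0.3), is_eps_exploitable(J_T, J_Tp, 0.4))
```

The returned gap is also the smallest $\varepsilon$ at which the two models are $\varepsilon$-unexploitable on the given finite set, which gives one way to read exploitability as a graded quantity.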

Intuitively, $\varepsilon$-unexploitability says that while small inversions may exist, $\mathcal{T}$ and $\mathcal{T}'$ can never confidently disagree about how to rank a policy pair. This definition immediately satisfies our first goal, because it generalizes exploitation (recovering [Definition 1](https://arxiv.org/html/2605.15960#Thmdefinition1) when $\varepsilon = 0$) and yields a continuous rather than binary measure of "exploitability" (namely, the minimum $\varepsilon$ for which a pair of transition models remains $\varepsilon$-unexploitable). To address our second goal and establish sufficient conditions for $\varepsilon$-unexploitability, we derive the following theorem.

###### Theorem 2 (Safe horizon).

Let $\mathcal{R}\colon \mathcal{S}\times\mathcal{A}\to[0,1]$ and let $\delta = \frac{1}{2}\max_{s,a}\|\mathcal{T}(\cdot \mid s,a) - \mathcal{T}'(\cdot \mid s,a)\|_1$ be the total variation distance between distinct $\mathcal{T}$ and $\mathcal{T}'$. For any $\varepsilon > 0$, define the safe horizon

$$H(\varepsilon,\delta) = \frac{(1+\varepsilon) + \sqrt{(1-\varepsilon)^{2} + 4\varepsilon/\delta}}{2}.$$

Every pair $(\mathcal{T},\mathcal{T}')$ is $\varepsilon$-unexploitable on every policy set $\Pi$ whenever $1/(1-\gamma) \leq H(\varepsilon,\delta)$. Furthermore, this bound is tight.
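As a rough numerical companion to the theorem, the sketch below evaluates the closed form for $H(\varepsilon,\delta)$ and compares it with the effective horizon $1/(1-\gamma)$. The function names (`max_tv_distance`, `safe_horizon`, `within_safe_horizon`) and the example transition arrays are assumptions made for illustration; only the two formulas themselves come from the theorem statement.

```python
import math
import numpy as np

def max_tv_distance(T, T_prime):
    """delta = (1/2) * max_{s,a} || T(.|s,a) - T'(.|s,a) ||_1.

    Both arrays are indexed as [a, s, next_state].
    """
    l1 = np.abs(np.asarray(T) - np.asarray(T_prime)).sum(axis=-1)
    return 0.5 * float(l1.max())

def safe_horizon(eps, delta):
    """Closed-form safe horizon H(eps, delta) from the theorem statement."""
    if eps <= 0 or not 0 < delta <= 1:
        raise ValueError("need eps > 0 and 0 < delta <= 1")
    return ((1 + eps) + math.sqrt((1 - eps) ** 2 + 4 * eps / delta)) / 2

def within_safe_horizon(gamma, eps, delta):
    """True when the effective horizon 1/(1-gamma) does not exceed H."""
    return 1 / (1 - gamma) <= safe_horizon(eps, delta)

# Hypothetical pair of models that differ only in action a0.
T = np.array([[[0.7, 0.3], [0.7, 0.3]], [[0.5, 0.5], [0.5, 0.5]]])
T_prime = np.array([[[0.6, 0.4], [0.6, 0.4]], [[0.5, 0.5], [0.5, 0.5]]])
delta = max_tv_distance(T, T_prime)          # approximately 0.1

print(safe_horizon(0.5, delta))              # approximately 3.0
print(within_safe_horizon(0.5, 0.5, delta))  # horizon 2 is inside the bound
print(within_safe_horizon(0.9, 0.5, delta))  # horizon 10 is outside the bound
```

With $\varepsilon = 0.5$ and $\delta = 0.1$ the square root evaluates to $\sqrt{20.25} = 4.5$, so $H = 3$. A discount of $\gamma = 0.5$ (effective horizon $2$) falls within the guarantee, while $\gamma = 0.9$ (effective horizon $10$) does not.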

The safe horizon $H(\varepsilon,\delta)$ is the longest effective horizon $1/(1-\gamma)$ at which $\varepsilon$-unexploitability is guaranteed. Its dependence on both arguments aligns with standard intuitions in model-based RL. Increasing $\delta$ lowers $H$: a less accurate model exhausts its error budget sooner, so safe planning requires a shorter horizon. This is exactly in line with the main result by Jiang et al. ([2015](https://arxiv.org/html/2605.15960#bib.bib51)), who show that a shorter effective horizon should be used when data is scarce. In this sense, we corroborate and expand their findings through the lens of $\varepsilon$-exploitability. Increasing $\varepsilon$ raises $H$: a designer who tolerates larger inversions can plan further ahead. At the extremes, $H \to \infty$ as $\delta \to 0$ (a perfect model is never exploitable) and $H \to 1+\varepsilon$ as $\delta \to 1$ (a maximally inaccurate model is safe only near single-step planning). This formalizes the widely held intuition that model errors compound over the planning horizon (Ross et al., [2011](https://arxiv.org/html/2605.15960#bib.bib6); Talvitie, [2014](https://arxiv.org/html/2605.15960#bib.bib27)), and provides a closed-form expression for exactly how much horizon a given level of model error can afford ([Figure 3](https://arxiv.org/html/2605.15960#S3.F3)c).

The closed form of $H$ is exact but unwieldy. A looser but more memorable sufficient condition follows from dropping lower-order terms.

###### Corollary 4 (Square-root heuristic).

Let $\mathcal{R}\colon \mathcal{S}\times\mathcal{A}\to[0,1]$ and let $\delta = \frac{1}{2}\max_{s,a}\|\mathcal{T}(\cdot \mid s,a) - \mathcal{T}'(\cdot \mid s,a)\|_1 > 0$. For any $\varepsilon > 0$, the pair $(\mathcal{T},\mathcal{T}')$ is $\varepsilon$-unexploitable whenever $1/(1-\gamma) < \sqrt{\varepsilon/\delta}$.

[Corollary 4](https://arxiv.org/html/2605.15960#Thmcorollary4) gives a quick sense of the magnitude of possible inversions: at $\gamma = 0.9$ the effective horizon is $10$, so the worst-case inversion that cannot be ruled out scales as $100\delta$; at $\gamma = 0.99$ it scales as $10{,}000\delta$. Unlike the bound in [Theorem 2](https://arxiv.org/html/2605.15960#Thmtheorem2), the approximate bound in [Corollary 4](https://arxiv.org/html/2605.15960#Thmcorollary4) is not tight, so the actual inversion may be much smaller, but it provides an immediate sense of how model error and planning horizon interact without requiring further analysis. For the transition model pairs in [Figure 1](https://arxiv.org/html/2605.15960#S2.F1), we compare both bounds against the actual exploitation gaps in [Appendix E](https://arxiv.org/html/2605.15960#A5).
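For readers who want to see where the square-root condition and the $100\delta$ figure come from, here is a short algebraic sketch. It is a reconstruction from the closed form of $H$ and may differ from the argument in the paper's appendix. The first line uses only $1+\varepsilon > 0$ and $(1-\varepsilon)^2 \geq 0$; the second line squares both (positive) sides of the corollary's condition.

```latex
\begin{align*}
H(\varepsilon,\delta)
  &= \frac{(1+\varepsilon) + \sqrt{(1-\varepsilon)^{2} + 4\varepsilon/\delta}}{2}
   \;>\; \frac{\sqrt{4\varepsilon/\delta}}{2}
   \;=\; \sqrt{\varepsilon/\delta}, \\[4pt]
\frac{1}{1-\gamma} < \sqrt{\varepsilon/\delta}
  &\iff \varepsilon > \frac{\delta}{(1-\gamma)^{2}}.
\end{align*}
```

Any effective horizon below $\sqrt{\varepsilon/\delta}$ is therefore also below $H(\varepsilon,\delta)$, so the safe-horizon guarantee applies. Reading the second line at $\gamma = 0.9$ gives $\varepsilon > 100\,\delta$, and at $\gamma = 0.99$ gives $\varepsilon > 10{,}000\,\delta$, matching the scaling quoted above.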

## 4 Related work

##### History and etymology of model exploitation\.

The term “model exploitation” has no single origin in RL. Adjacent concepts, such as “model bias” and “compounding error,” have appeared in the literature since at least the early 2010s (Deisenroth and Rasmussen, [2011](https://arxiv.org/html/2605.15960#bib.bib3); Ross et al., [2011](https://arxiv.org/html/2605.15960#bib.bib6)), themselves inherited from control theory (Schneider, [1996](https://arxiv.org/html/2605.15960#bib.bib5); Kappen, [2005](https://arxiv.org/html/2605.15960#bib.bib4)). As deep RL grew in popularity toward the end of that decade, practitioners turned to sample-efficient model-based RL (MBRL) to meet rising data demands. Learned models, however, are uncertain where data is scarce, and unlike in model-free methods, no trial-and-error mechanism exists to correct the resulting errors. Researchers began describing policies as “exploiting” these uncertain regions (Kurutach et al., [2018](https://arxiv.org/html/2605.15960#bib.bib23)), though this usage of “exploit” also predates deep RL (cf. Schneider [1996](https://arxiv.org/html/2605.15960#bib.bib5)). The set phrase “model exploitation” congealed only more recently, following Ha and Schmidhuber ([2018](https://arxiv.org/html/2605.15960#bib.bib41)) and Janner et al. ([2019](https://arxiv.org/html/2605.15960#bib.bib37)), and has since motivated a lineage of conservative offline MBRL methods (Yu et al., [2020](https://arxiv.org/html/2605.15960#bib.bib17); Kidambi et al., [2020](https://arxiv.org/html/2605.15960#bib.bib36)).

##### Other adjacent phenomena\.

Model exploitation overlaps with several independent concepts. Planner overfitting (Arumugam et al., [2018](https://arxiv.org/html/2605.15960#bib.bib49); Jiang et al., [2015](https://arxiv.org/html/2605.15960#bib.bib51)) refers to a planner discovering spurious shortcuts in a learned model; work on it has developed regularization-based remedies. Jiang et al. ([2015](https://arxiv.org/html/2605.15960#bib.bib51)) were the first to formalize and isolate this phenomenon by highlighting the role that the effective horizon plays in regularizing a learned transition model when data is scarce. The sim-to-real gap (Jakobi et al., [1995](https://arxiv.org/html/2605.15960#bib.bib18); Tobin et al., [2017](https://arxiv.org/html/2605.15960#bib.bib39)) describes the same failure when transferring controllers from simulated to physical environments, typically addressed through domain randomization or system identification. Objective robustness (Langosco et al., [2021](https://arxiv.org/html/2605.15960#bib.bib28)) studies agents that achieve high reward in training environments but pursue unintended objectives at deployment, which can be viewed as exploitation of a training-time model of the deployment distribution. Lastly, as discussed, the value equivalence principle (Grimm et al., [2020](https://arxiv.org/html/2605.15960#bib.bib61), [2021](https://arxiv.org/html/2605.15960#bib.bib24)) provides a framing of when two models can be understood as equivalent in terms of value preservation under the Bellman operator. This principle is closely tied to our notion of equivalent transition models ([Figure 1](https://arxiv.org/html/2605.15960#S2.F1)).

##### MDP extensions\.

Several lines of work extend the MDP formalism to defend against imperfect transition models, typically by modifying the optimization problem rather than studying the misspecification itself. Robust MDPs (Iyengar, [2005](https://arxiv.org/html/2605.15960#bib.bib19); Nilim and El Ghaoui, [2005](https://arxiv.org/html/2605.15960#bib.bib20)) represent transition uncertainty through uncertainty sets and optimize worst-case value, a cardinal safety criterion rather than the ordinal one we propose. MOReL (Kidambi et al., [2020](https://arxiv.org/html/2605.15960#bib.bib36)) constructs a pessimistic MDP by introducing absorbing states at high-uncertainty transitions, and provides strong suboptimality guarantees for the resulting policy, but does not formalize the notion of exploitation it defends against. Both modify the MDP to cope with model uncertainty; our work instead characterizes when that uncertainty inverts the policy ordering.

##### World models\.

A growing family of methods trains policies entirely inside learned dynamics models, from the Dreamer line (Hafner et al., [2020](https://arxiv.org/html/2605.15960#bib.bib26), [2021](https://arxiv.org/html/2605.15960#bib.bib31), [2025a](https://arxiv.org/html/2605.15960#bib.bib32), [2025b](https://arxiv.org/html/2605.15960#bib.bib33)) to MuZero (Schrittwieser et al., [2020](https://arxiv.org/html/2605.15960#bib.bib25)) to recent JEPA-style predictive representations (Balestriero and LeCun, [2025](https://arxiv.org/html/2605.15960#bib.bib11)). These methods measure model quality through predictive accuracy or downstream task performance, both of which are cardinal. Our results show that low prediction error does not preclude ordinal failures: a model can be accurate yet still invert the policy ordering induced by the true dynamics ([Theorem 2](https://arxiv.org/html/2605.15960#Thmtheorem2)).

##### Outside machine learning\.

Optimizing against an imperfect model is a recurring failure mode across disciplines, so an extensive literature on it exists outside machine learning. For example, in macroeconomics, the Lucas critique (Lucas, [1976](https://arxiv.org/html/2605.15960#bib.bib35)) warns that policy rules derived from an estimated model can fail when agents adapt to the policy itself, and robust control (Hansen and Sargent, [2008](https://arxiv.org/html/2605.15960#bib.bib34)) responds by optimizing against worst-case model perturbations. In sim-to-real robotics, controllers optimized in simulation routinely exploit discrepancies with real-world physics (Jakobi et al., [1995](https://arxiv.org/html/2605.15960#bib.bib18); Tobin et al., [2017](https://arxiv.org/html/2605.15960#bib.bib39)). Our formalization of exploitation as a value inversion applies in principle to any setting where a proxy model guides optimization over a structured decision space.

## 5 Conclusion

There is increasing interest in approaches to decision\-making that combine a forward predictive model with planning\. Although there is substantial literature on the design and analysis of such model\-based algorithms, the emergence of world models that make predictions directly from rich low\-level sensor streams pushes the boundaries of these methods\. This calls for a fresh look at the questions of model robustness and safety of the computed policies\. In this paper, we formalize the concept of model exploitation and study the conditions under which policies trained on empirical models may be safe\. We find that in sufficiently rich policy classes, exploitability is inevitable\. However, under a relaxed definition, we can establish conditions for unexploitability\. We suggest that this could guide the development of new approaches to safe policy synthesis, as well as parallel methods for quantifying the quality of synthesized policies with respect to such safety objectives\.

##### Limitations and future work\.

We identify three main limitations of our work. First, our results, like those in Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)), concern MDPs with finite state and action spaces. We defer extensions to continuous and partially observable MDPs (POMDPs) to future work, noting that solving the former may yield the latter (since every POMDP can be written as a continuous belief-state MDP (Åström, [1965](https://arxiv.org/html/2605.15960#bib.bib7))). Second, as in Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)), our notion of exploitation is binary and restrictive. In practice, it is unlikely that small, rare value inversions will pose meaningful safety concerns. Future work could refine our notion of exploitation by introducing a more granular measure of exploitability (e.g., deriving bounds using correlation metrics (Laidlaw et al., [2025](https://arxiv.org/html/2605.15960#bib.bib2)) or expanding upon $\varepsilon$-exploitation). Third, there are, as of yet, no known closed-form necessary and sufficient conditions that guarantee unexploitability in finite policy sets. Although we question whether such simple conditions exist ([Question 1](https://arxiv.org/html/2605.15960#Thmquestion1)), there is still room to explore. Researchers interested in pursuing this direction may, for instance, find it productive to begin with special finite policy sets and specific MDPs (e.g., characterizing exploitation on deterministic policies in linear MDPs).

## References

- D\. Abel, W\. Dabney, A\. Harutyunyan, M\. K\. Ho, M\. Littman, D\. Precup, and S\. Singh \(2021\)On the expressivity of Markov reward\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4)\.
- A\. C\. Antoulas \(2005\)Approximation of large\-scale dynamical systems\.SIAM\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4)\.
- D\. Arumugam, D\. Abel, K\. Asadi, N\. Gopalan, C\. Grimm, J\. K\. Lee, L\. Lehnert, and M\. L\. Littman \(2018\)Mitigating planner overfitting in model\-based reinforcement learning\.arXiv preprint arXiv:1812\.01129\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px2.p1.1)\.
- K\. J\. Åström \(1965\)Optimal control of Markov processes with incomplete state information i\.Journal of Mathematical Analysis and Applications10,pp\. 174–205\.Cited by:[§5](https://arxiv.org/html/2605.15960#S5.SS0.SSS0.Px1.p1.1)\.
- R\. Balestriero and Y\. LeCun \(2025\)LeJEPA: provable and scalable self\-supervised learning without the heuristics\.arXiv preprint arXiv:2511\.08544\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4),[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px4.p1.1)\.
- H\. W\. Bates \(1862\)Contributions to an insect fauna of the amazon valley\. Lepidoptera: Heliconidæ\.\.Transactions of the Linnean Society of London23\(3\),pp\. 495–566\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4)\.
- D\. Bertsekas \(2012\)Dynamic programming and optimal control: volume i\.Vol\.4,Athena scientific\.Cited by:[§2\.1](https://arxiv.org/html/2605.15960#S2.SS1.p1.38)\.
- M\. Deisenroth and C\. E\. Rasmussen \(2011\)PILCO: a model\-based and data\-efficient approach to policy search\.International Conference on Machine Learning \(ICML\)\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px1.p1.1)\.
- C\. Grimm, A\. Barreto, G\. Farquhar, D\. Silver, and S\. Singh \(2021\)Proper value equivalence\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px2.p1.1)\.
- C\. Grimm, A\. Barreto, S\. Singh, and D\. Silver \(2020\)The value equivalence principle for model\-based reinforcement learning\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4),[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px2.p1.1)\.
- D\. Ha and J\. Schmidhuber \(2018\)Recurrent world models facilitate policy evolution\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4),[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px1.p1.1),[footnote 1](https://arxiv.org/html/2605.15960#footnote1)\.
- D\. Hadfield\-Menell, S\. Milli, P\. Abbeel, S\. J\. Russell, and A\. Dragan \(2017\)Inverse reward design\.Neural Information Processing Systems \(NIPS\)\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4)\.
- D\. Hafner, T\. Lillicrap, J\. Ba, and M\. Norouzi \(2020\)Dream to control: learning behaviors by latent imagination\.International Conference on Learning Representations \(ICLR\)\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px4.p1.1)\.
- D\. Hafner, T\. Lillicrap, I\. Fischer, R\. Villegas, D\. Ha, H\. Lee, and J\. Davidson \(2019\)Learning latent dynamics for planning from pixels\.International Conference on Machine Learning \(ICML\)\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4)\.
- D\. Hafner, T\. Lillicrap, M\. Norouzi, and J\. Ba \(2021\)Mastering Atari with discrete world models\.International Conference on Learning Representations \(ICLR\)\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px4.p1.1)\.
- D\. Hafner, J\. Pasukonis, J\. Ba, and T\. Lillicrap \(2025a\)Mastering diverse control tasks through world models\.Nature640\(8059\),pp\. 647–653\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px4.p1.1)\.
- D\. Hafner, W\. Yan, and T\. Lillicrap \(2025b\)Training agents inside of scalable world models\.arXiv preprint arXiv:2509\.24527\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px4.p1.1)\.
- L\. P\. Hansen and T\. J\. Sargent \(2008\)Robustness\.Princeton University Press\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px5.p1.1)\.
- P\. E\. Hart, N\. J\. Nilsson, and B\. Raphael \(1968\)A formal basis for the heuristic determination of minimum cost paths\.IEEE Transactions on Systems Science and Cybernetics4\(2\),pp\. 100–107\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4)\.
- G\. N\. Iyengar \(2005\)Robust dynamic programming\.Mathematics of Operations Research30\(2\),pp\. 257–280\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px3.p1.1)\.
- N\. Jakobi, P\. Husbands, and I\. Harvey \(1995\)Noise and the reality gap: the use of simulation in evolutionary robotics\.InEuropean Conference on Artificial Life,pp\. 704–720\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4),[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px5.p1.1)\.
- M\. Janner, J\. Fu, M\. Zhang, and S\. Levine \(2019\)When to trust your model: model\-based policy optimization\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4),[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px1.p1.1)\.
- K\. Javed and R\. S\. Sutton \(2024\)The big world hypothesis and its ramifications for artificial intelligence\.InFinding the Frame: An RLC Workshop for Examining Conceptual Frameworks,Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4)\.
- N\. Jiang, A\. Kulesza, S\. Singh, and R\. Lewis \(2015\)The dependence of effective planning horizon on model accuracy\.Autonomous Agents and Multiagent Systems \(AAMAS\)\.Cited by:[§3\.4](https://arxiv.org/html/2605.15960#S3.SS4.p3.13),[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px2.p1.1)\.
- H\. J\. Kappen \(2005\)Path integrals and symmetry breaking for optimal control theory\.Journal of statistical mechanics: theory and experiment2005\(11\),pp\. P11011–P11011\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px1.p1.1)\.
- M\. Kearns and S\. Singh \(2002\)Near\-optimal reinforcement learning in polynomial time\.Machine learning49\(2\),pp\. 209–232\.Cited by:[item 4](https://arxiv.org/html/2605.15960#S1.I1.i4.p1.2),[§1](https://arxiv.org/html/2605.15960#S1.p1.4)\.
- R\. Kidambi, A\. Rajeswaran, P\. Netrapalli, and T\. Joachims \(2020\)MOReL: model\-based offline reinforcement learning\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px3.p1.1)\.
- T\. Kurutach, I\. Clavera, Y\. Duan, A\. Tamar, and P\. Abbeel \(2018\)Model\-ensemble trust\-region policy optimization\.International Conference on Learning Representations \(ICLR\)\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px1.p1.1)\.
- C\. Laidlaw, S\. Singhal, and A\. Dragan \(2025\)Correlated proxies: a new definition and improved mitigation for reward hacking\.International Conference on Learning Representations \(ICLR\)\.Cited by:[§5](https://arxiv.org/html/2605.15960#S5.SS0.SSS0.Px1.p1.1)\.
- L\. Langosco, J\. Koch, L\. Sharkey, J\. Pfau, L\. Orseau, and D\. Krueger \(2021\)Goal misgeneralization in deep reinforcement learning\.International Conference on Machine Learning \(ICML\)\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px2.p1.1)\.
- S\. Lobel and R\. Parr \(2024\)An optimal tightness bound for the simulation lemma\.Reinforcement Learning Conference \(RLC\)\.Cited by:[item 4](https://arxiv.org/html/2605.15960#S1.I1.i4.p1.2)\.
- R\. E\. Lucas \(1976\)Econometric policy evaluation: a critique\.InCarnegie\-Rochester conference series on public policy,Vol\.1,pp\. 19–46\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px5.p1.1)\.
- A\. Nilim and L\. El Ghaoui \(2005\)Robust control of Markov decision processes with uncertain transition matrices\.Operations Research53\(5\),pp\. 780–798\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px3.p1.1)\.
- S\. Ross, G\. Gordon, and D\. Bagnell \(2011\)A reduction of imitation learning and structured prediction to no\-regret online learning\.Artificial Intelligence and Statistics \(AISTATS\)\.Cited by:[§3\.4](https://arxiv.org/html/2605.15960#S3.SS4.p3.13),[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px1.p1.1)\.
- S\. A\. Ross \(1976\)The arbitrage theory of capital asset pricing\.Journal of Economic Theory13\(3\),pp\. 341–360\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4)\.
- W\. Rudin \(1974\)Real and complex analysis\.Cited by:[§3\.3](https://arxiv.org/html/2605.15960#S3.SS3.p4.2)\.
- J\. Schneider \(1996\)Exploiting model uncertainty estimates for safe dynamic control learning\.Neural Information Processing Systems \(NIPS\)\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px1.p1.1)\.
- J\. Schrittwieser, I\. Antonoglou, T\. Hubert, K\. Simonyan, L\. Sifre, S\. Schmitt, A\. Guez, E\. Lockhart, D\. Hassabis, T\. Graepel,et al\.\(2020\)Mastering Atari, go, chess and shogi by planning with a learned model\.Nature588\(7839\),pp\. 604–609\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px4.p1.1)\.
- R\. Scully\-Blaker \(2014\)A practiced practice: speedrunning through space with de certeau and virilio\.Game Studies14\(1\),pp\. 2016\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4)\.
- H\. A\. Simon \(1955\)A behavioral model of rational choice\.The Quarterly Journal of Economics,pp\. 99–118\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4)\.
- J\. Skalse, N\. Howe, D\. Krasheninnikov, and D\. Krueger \(2022\)Defining and characterizing reward hacking\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[Appendix C](https://arxiv.org/html/2605.15960#A3.1.p1.11),[Appendix C](https://arxiv.org/html/2605.15960#A3.p1.1),[Appendix C](https://arxiv.org/html/2605.15960#A3.p2.1),[item 2](https://arxiv.org/html/2605.15960#S1.I1.i2.p1.1),[§1](https://arxiv.org/html/2605.15960#S1.p1.4),[§2\.1](https://arxiv.org/html/2605.15960#S2.SS1.p1.38),[§2\.2](https://arxiv.org/html/2605.15960#S2.SS2.p2.1),[§3\.1](https://arxiv.org/html/2605.15960#S3.SS1.p2.1),[§3\.2\.1](https://arxiv.org/html/2605.15960#S3.SS2.SSS1.p1.3),[§3\.2\.2](https://arxiv.org/html/2605.15960#S3.SS2.SSS2.p1.14),[§3\.2](https://arxiv.org/html/2605.15960#S3.SS2.p1.1),[§3\.3](https://arxiv.org/html/2605.15960#S3.SS3.p1.1),[§3\.3](https://arxiv.org/html/2605.15960#S3.SS3.p3.1),[§3\.3](https://arxiv.org/html/2605.15960#S3.SS3.p9.4),[§5](https://arxiv.org/html/2605.15960#S5.SS0.SSS0.Px1.p1.1),[Corollary 3](https://arxiv.org/html/2605.15960#Thmcorollary3),[Definition 2](https://arxiv.org/html/2605.15960#Thmdefinition2),[Theorem S1](https://arxiv.org/html/2605.15960#Thmskalsetheorem1),[Theorem S2](https://arxiv.org/html/2605.15960#Thmskalsetheorem2)\.
- R\. S\. Sutton, A\. G\. Barto,et al\.\(1998\)Reinforcement learning: an introduction\.MIT Press Cambridge\.Cited by:[§2\.1](https://arxiv.org/html/2605.15960#S2.SS1.p1.38),[§2](https://arxiv.org/html/2605.15960#S2.p1.1)\.
- E\. Talvitie \(2014\)Model regularization for stable sample rollouts\.InUAI,pp\. 780–789\.Cited by:[§3\.4](https://arxiv.org/html/2605.15960#S3.SS4.p3.13)\.
- J\. Tobin, R\. Fong, A\. Ray, J\. Schneider, W\. Zaremba, and P\. Abbeel \(2017\)Domain randomization for transferring deep neural networks from simulation to the real world\.International Conference on Intelligent Robots and Systems \(IROS\)\.Cited by:[§1](https://arxiv.org/html/2605.15960#S1.p1.4),[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px5.p1.1)\.
- T\. Yu, G\. Thomas, L\. Yu, S\. Ermon, J\. Y\. Zou, S\. Levine, C\. Finn, and T\. Ma \(2020\)MOPO: model\-based offline policy optimization\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[§4](https://arxiv.org/html/2605.15960#S4.SS0.SSS0.Px1.p1.1)\.

## Appendix A Notation

We first provide a table summarizing all relevant notation.

Table 1: A summary of notation.

## Appendix B Proofs

## Appendix C Relevant reward hacking theorems

For convenience, we restate the two main characterization results from Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)) in our notation. Note that [Corollary 3](https://arxiv.org/html/2605.15960#Thmcorollary3) and [Theorem S1](https://arxiv.org/html/2605.15960#Thmskalsetheorem1) are contrapositives of the same fact: [Theorem S1](https://arxiv.org/html/2605.15960#Thmskalsetheorem1) says that any unhackable, non-trivial pair must be equivalent, while [Corollary 3](https://arxiv.org/html/2605.15960#Thmcorollary3) says that any non-trivial, non-equivalent pair must be hackable. We use the latter phrasing to parallel our exploitability results.

###### Theorem S1 (Theorem 1 of Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1))).

In any environment $(\mathcal{S},\mathcal{A},\mathcal{T},d_0,\_,\gamma)$, if $\hat{\Pi}$ contains an open set, then any pair of reward functions that are unhackable and non-trivial on $\hat{\Pi}$ are equivalent on $\hat{\Pi}$.

As shown in [Section 3.2.2](https://arxiv.org/html/2605.15960#S3.SS2.SSS2), the following result requires an additional hypothesis that is not present in its original statement: that the visit counts of $\hat{\Pi}$ are not collinear. The correction is minor and the strengthened condition is easily satisfied in practice; the core insights of Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)) are unaffected.

###### Theorem S2\(Theorem 2 ofSkalseet al\.\([2022](https://arxiv.org/html/2605.15960#bib.bib1)\), corrected\)\.

For any environment\(𝒮,𝒜,𝒯,d0,\_,γ\)\(\\mathcal\{S\},\\mathcal\{A\},\\mathcal\{T\},d\_\{0\},\\\_,\\gamma\), any finite set of policiesΠ^\\hat\{\\Pi\}whose visit counts are not collinear, and any reward functionℛ1\\mathcal\{R\}\_\{1\}, there is a non\-trivial reward functionℛ2\\mathcal\{R\}\_\{2\}such thatℛ1\\mathcal\{R\}\_\{1\}andℛ2\\mathcal\{R\}\_\{2\}are unhackable but not equivalent\.

###### Proof\.

The proof is identical to that of Skalse et al. ([2022](https://arxiv.org/html/2605.15960#bib.bib1)). The only step requiring the strengthened hypothesis is ensuring that there is a path from $\vec{\mathcal{R}}_{1}$ to $-\vec{\mathcal{R}}_{1}$ along which the first reward function where some inequality becomes an equality is not trivial. With collinear visit counts, every pairwise value difference is a scalar multiple of a single inner product $\langle\mathcal{R},\mathcal{F}^{\pi_{1}}-\mathcal{F}^{\pi_{0}}\rangle$, so zeroing any one difference zeros them all. Non-collinearity guarantees three policies $\pi_{a},\pi_{b},\pi_{c}$ with $\mathcal{F}^{\pi_{b}}-\mathcal{F}^{\pi_{a}}$ and $\mathcal{F}^{\pi_{c}}-\mathcal{F}^{\pi_{a}}$ linearly independent. The set of rewards trivial on $\hat{\Pi}$ must be orthogonal to both, giving a subspace of codimension at least $2$, which cannot disconnect $\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$. A path from $\vec{\mathcal{R}}_{1}$ to $-\vec{\mathcal{R}}_{1}$ therefore exists that avoids trivial rewards entirely, and the remainder of the argument proceeds unchanged. ∎
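
The collinear case in the proof can be spelled out in one line. This is a sketch under two assumptions: that returns are linear in visit counts, $J_{\mathcal{R}}(\pi)=\langle\mathcal{R},\mathcal{F}^{\pi}\rangle$, as the inner-product notation in the proof suggests, and that collinearity is written with scalars $c_{i}$, which are illustrative symbols not used in the original.

$$
\mathcal{F}^{\pi_{i}}-\mathcal{F}^{\pi_{0}}=c_{i}\,\bigl(\mathcal{F}^{\pi_{1}}-\mathcal{F}^{\pi_{0}}\bigr)
\;\Longrightarrow\;
J_{\mathcal{R}}(\pi_{i})-J_{\mathcal{R}}(\pi_{j})=(c_{i}-c_{j})\,\bigl\langle\mathcal{R},\mathcal{F}^{\pi_{1}}-\mathcal{F}^{\pi_{0}}\bigr\rangle .
$$

Under these assumptions, whenever $c_{i}\neq c_{j}$, a reward that equalizes $\pi_{i}$ and $\pi_{j}$ must satisfy $\langle\mathcal{R},\mathcal{F}^{\pi_{1}}-\mathcal{F}^{\pi_{0}}\rangle=0$ and therefore equalizes every pair of policies, which is why the first equality reached along the path would be a trivial reward.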

## Appendix D Transition models for [Figures 1](https://arxiv.org/html/2605.15960#S2.F1), [2](https://arxiv.org/html/2605.15960#S3.F2) and [3](https://arxiv.org/html/2605.15960#S3.F3)

The MDP used in [Figures 1](https://arxiv.org/html/2605.15960#S2.F1), [2](https://arxiv.org/html/2605.15960#S3.F2) and [3](https://arxiv.org/html/2605.15960#S3.F3) has $\mathcal{S}=\{s_{0},s_{1},s_{2}\}$, $\mathcal{A}=\{a_{0},a_{1}\}$, $\mathcal{R}(s,a)=\mathbbm{1}\{s=s_{0}\}$, $\gamma=0.9$, and $d_{0}=(1/3,\,1/3,\,1/3)$. Policies are parameterized by $\pi_{\theta}(a_{0}\mid s)=\theta$ for all $s\in\mathcal{S}$. Each transition model is specified by two stochastic matrices $P_{a_{0}}$ and $P_{a_{1}}$, where entry $(i,j)$ gives $\mathcal{T}(s_{j}\mid s_{i},a)$. The shared transition model $\mathcal{T}_{1}$ is

$$
P_{a_{0}}^{(1)}=\begin{pmatrix}0.7&0.2&0.1\\ 0.5&0.3&0.2\\ 0.4&0.3&0.3\end{pmatrix},\qquad P_{a_{1}}^{(1)}=\begin{pmatrix}0.1&0.3&0.6\\ 0.1&0.2&0.7\\ 0.1&0.1&0.8\end{pmatrix}.
$$

Under $\mathcal{T}_{1}$, action $a_{0}$ concentrates transitions toward $s_{0}$ (the rewarding state), while action $a_{1}$ pushes toward $s_{2}$, making $J_{1}$ monotone increasing in $\theta$.
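
The monotonicity of $J_{1}$ can be checked numerically. The following is a minimal sketch, assuming that $J_{1}(\theta)$ is the expected discounted return $d_{0}^{\top}(I-\gamma P_{\theta})^{-1}r$ with $P_{\theta}=\theta P_{a_{0}}^{(1)}+(1-\theta)P_{a_{1}}^{(1)}$; the function name `expected_return` and the grid of $\theta$ values are illustrative choices, not taken from the paper.

```python
import numpy as np

GAMMA = 0.9
D0 = np.full(3, 1 / 3)              # uniform initial state distribution d_0
REWARD = np.array([1.0, 0.0, 0.0])  # R(s, a) = 1{s = s_0}, independent of a

# Transition matrices of T_1; entry (i, j) is T(s_j | s_i, a).
P_A0 = np.array([[0.7, 0.2, 0.1],
                 [0.5, 0.3, 0.2],
                 [0.4, 0.3, 0.3]])
P_A1 = np.array([[0.1, 0.3, 0.6],
                 [0.1, 0.2, 0.7],
                 [0.1, 0.1, 0.8]])


def expected_return(theta, p_a0, p_a1, gamma=GAMMA):
    """Discounted return of pi_theta, assuming J = d_0^T (I - gamma P_theta)^{-1} r."""
    # pi_theta picks a_0 with probability theta in every state,
    # so the induced state-to-state kernel is a convex mixture.
    p_theta = theta * p_a0 + (1 - theta) * p_a1
    values = np.linalg.solve(np.eye(3) - gamma * p_theta, REWARD)
    return float(D0 @ values)


thetas = np.linspace(0.0, 1.0, 11)
j1 = np.array([expected_return(t, P_A0, P_A1) for t in thetas])
is_increasing = bool(np.all(np.diff(j1) > 0))
```

On this grid, `is_increasing` records whether every step in $\theta$ raises the return, which is the property the text attributes to $J_{1}$.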
##### Panel (a): $\mathcal{T}_{2}$ trivial.

Both actions produce identical transitions, so $J_{2}$ is constant in $\theta$:

$$
P_{a_{0}}^{(2a)}=P_{a_{1}}^{(2a)}=\begin{pmatrix}0.4&0.3&0.3\\ 0.2&0.5&0.3\\ 0.3&0.3&0.4\end{pmatrix}.
$$
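
The reason $J_{2}$ is constant can be made explicit. This is a sketch assuming that $J_{2}(\theta)$ is the expected discounted return and writing $r=(1,0,0)^{\top}$ for the state reward vector, a symbol introduced here for illustration. Because $\pi_{\theta}$ mixes the two actions with weights $\theta$ and $1-\theta$ in every state, the induced state kernel collapses:

$$
P_{\theta}^{(2a)}=\theta\,P_{a_{0}}^{(2a)}+(1-\theta)\,P_{a_{1}}^{(2a)}=P_{a_{0}}^{(2a)},
\qquad
J_{2}(\theta)=d_{0}^{\top}\bigl(I-\gamma P_{a_{0}}^{(2a)}\bigr)^{-1}r .
$$

The right-hand side does not depend on $\theta$, so under this model every policy in the family receives the same return.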

##### Panel (b): $\mathcal{T}_{1}$, $\mathcal{T}_{2}$ equivalent.

We construct $\mathcal{T}_{2}$ by blending $\mathcal{T}_{1}$ toward the uniform kernel with mixing weight $\alpha=0.3$, i.e., $\mathcal{T}_{2}(s^{\prime}\mid s,a)=0.7\cdot\mathcal{T}_{1}(s^{\prime}\mid s,a)+0.3/|\mathcal{S}|$:

$$
P_{a_{0}}^{(2b)}=\begin{pmatrix}0.59&0.24&0.17\\ 0.45&0.31&0.24\\ 0.38&0.31&0.31\end{pmatrix},\qquad P_{a_{1}}^{(2b)}=\begin{pmatrix}0.17&0.31&0.52\\ 0.17&0.24&0.59\\ 0.17&0.17&0.66\end{pmatrix}.
$$

This preserves the qualitative structure of $\mathcal{T}_{1}$ (and thus the policy ordering) while attenuating the transition probabilities.
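
The blend can be reproduced directly from $\mathcal{T}_{1}$. The sketch below recomputes the panel (b) matrices and compares the policy ordering under both models on a grid of $\theta$ values. It assumes that $J$ is the expected discounted return $d_{0}^{\top}(I-\gamma P_{\theta})^{-1}r$; the array layout `T1[a, i, j]` and the helper `expected_return` are illustrative and not taken from the paper.

```python
import numpy as np

GAMMA = 0.9
ALPHA = 0.3                         # mixing weight toward the uniform kernel
D0 = np.full(3, 1 / 3)
REWARD = np.array([1.0, 0.0, 0.0])  # R(s, a) = 1{s = s_0}

# T1[a, i, j] = T_1(s_j | s_i, a); index 0 is a_0 and index 1 is a_1.
T1 = np.array([[[0.7, 0.2, 0.1],
                [0.5, 0.3, 0.2],
                [0.4, 0.3, 0.3]],
               [[0.1, 0.3, 0.6],
                [0.1, 0.2, 0.7],
                [0.1, 0.1, 0.8]]])

# Blend every row toward the uniform distribution over the three states.
T2B = (1 - ALPHA) * T1 + ALPHA / T1.shape[-1]


def expected_return(theta, model, gamma=GAMMA):
    """Discounted return of pi_theta under `model` (assumed form of J)."""
    p_theta = theta * model[0] + (1 - theta) * model[1]
    values = np.linalg.solve(np.eye(3) - gamma * p_theta, REWARD)
    return float(D0 @ values)


thetas = np.linspace(0.0, 1.0, 11)
j1 = np.array([expected_return(t, T1) for t in thetas])
j2 = np.array([expected_return(t, T2B) for t in thetas])

# Equal rankings on the grid means the two models order these policies alike.
same_ordering = bool(np.array_equal(np.argsort(j1), np.argsort(j2)))
```

Comparing rankings only on a finite grid is a spot check rather than a proof of equivalence over the whole policy family.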

##### Panels (c) and (d): $\mathcal{T}_{1}$, $\mathcal{T}_{2}$ exploitable.

The transition models $\mathcal{T}_{2}^{(c)}$ and $\mathcal{T}_{2}^{(d)}$ were found by computational search over random transition matrices drawn from a Dirichlet distribution. Full numerical values and the search code are available in the supplementary material.
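
A search of this kind might look like the sketch below. It is not the authors' supplementary code: the Dirichlet concentration (all ones), the $\theta$ grid, the tolerance, and the names `returns_on_grid`, `is_exploitable`, and `search` are assumptions made for illustration. The exploitability check follows the informal definition from the abstract (one model strictly prefers a policy that the other strictly disprefers), restricted to policies on the grid and assuming $J$ is the expected discounted return.

```python
import numpy as np

GAMMA = 0.9
D0 = np.full(3, 1 / 3)
REWARD = np.array([1.0, 0.0, 0.0])   # R(s, a) = 1{s = s_0}
THETAS = np.linspace(0.0, 1.0, 21)   # illustrative policy grid

# T1[a, i, j] = T_1(s_j | s_i, a), the shared transition model.
T1 = np.array([[[0.7, 0.2, 0.1],
                [0.5, 0.3, 0.2],
                [0.4, 0.3, 0.3]],
               [[0.1, 0.3, 0.6],
                [0.1, 0.2, 0.7],
                [0.1, 0.1, 0.8]]])


def returns_on_grid(model):
    """Return J(theta) for each grid point under the given model."""
    out = []
    for theta in THETAS:
        p_theta = theta * model[0] + (1 - theta) * model[1]
        values = np.linalg.solve(np.eye(3) - GAMMA * p_theta, REWARD)
        out.append(D0 @ values)
    return np.array(out)


def is_exploitable(j_a, j_b, tol=1e-9):
    """True if some policy pair is strictly ordered in opposite ways."""
    diff_a = j_a[:, None] - j_a[None, :]  # diff_a[i, j] = J_a(i) - J_a(j)
    diff_b = j_b[:, None] - j_b[None, :]
    return bool(np.any((diff_a > tol) & (diff_b < -tol)))


J_TRUE = returns_on_grid(T1)


def search(rng, max_tries=1000):
    """Sample Dirichlet-distributed models until one is exploitable against T1."""
    for _ in range(max_tries):
        # Each of the 2 x 3 rows is an independent draw from Dirichlet(1, 1, 1).
        candidate = rng.dirichlet(np.ones(3), size=(2, 3))
        if is_exploitable(J_TRUE, returns_on_grid(candidate)):
            return candidate
    return None


found = search(np.random.default_rng(0))
```

The tolerance guards against floating-point noise being mistaken for a strict preference, which matters for near-trivial candidates whose returns are almost constant in $\theta$.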

## Appendix E Bound comparison

Table 2: Exploitation gaps for the transition model pairs in [Figure 1](https://arxiv.org/html/2605.15960#S2.F1) at $\gamma=0.9$. The actual gap is the largest $\varepsilon$ for which the pair is $\varepsilon$-exploitable. The [Theorem 2](https://arxiv.org/html/2605.15960#Thmtheorem2) and [Corollary 4](https://arxiv.org/html/2605.15960#Thmcorollary4) bounds give the threshold $\varepsilon$ above which each result guarantees $\varepsilon$-unexploitability. Both bounds are conservative.

[Table 2](https://arxiv.org/html/2605.15960#A5.T2) compares both bounds against the actual exploitation gaps for the transition model pairs in [Figure 1](https://arxiv.org/html/2605.15960#S2.F1). Both thresholds are conservative in every case and overestimate the actual gap. This conservatism is unavoidable in general: we prove [Theorem 2](https://arxiv.org/html/2605.15960#Thmtheorem2) is tight.
