Modification-Considering Value Learning for Reward Hacking Mitigation in RL
Summary
Proposes Modification-Considering Value Learning (MCVL), a safeguard for off-policy value-based RL that mitigates reward hacking by evaluating each transition's impact on a frozen bootstrapped-return estimator before admitting it into training.
Cached at: 06/30/26, 05:30 AM
# Modification-Considering Value Learning for Reward Hacking Mitigation in RL
Source: [https://arxiv.org/html/2606.28955](https://arxiv.org/html/2606.28955)
Evgenii Opryshko, Umangi Jain, Igor Gilitschenski
Keywords: reinforcement learning, reward hacking, reward tampering, value learning, AI safety
###### Summary
Reinforcement learning agents can exploit misspecified reward signals to achieve high apparent returns while failing on the intended objective, a failure mode known as reward hacking. Existing practical defenses typically constrain policy updates to stay near a known safe reference, creating a tension between suppressing hacking and permitting legitimate improvement. We propose Modification-Considering Value Learning (MCVL), which operationalizes the theoretical idea of current utility optimization for standard value-based RL. MCVL wraps an off-policy learner and treats each incoming transition as a candidate modification: it forecasts two training paths, one that includes the transition and one that does not, and scores both with a frozen bootstrapped-return estimator derived from a learned reward model and value function. The transition is admitted only if inclusion does not decrease the score. MCVL mitigates reward hacking across diverse environments while continuing to improve the intended objective.
###### Contribution(s)
1. We propose Modification-Considering Value Learning (MCVL), a safeguard for off-policy value-based RL that operationalizes current utility optimization: for each incoming transition, MCVL forecasts two training branches (with and without the transition), scores both with a frozen bootstrapped-return estimator built from a learned reward model and Q-function, and admits the transition only if inclusion does not decrease that score. MCVL wraps any off-policy value-based learner and requires no access to a safe reference policy. We instantiate MCVL with DDQN and TD3.
   Context: MCVL requires a seed dataset of non-hacking transitions for pretraining the return estimator to distinguish task progress from reward hacking. Checking transitions introduces computational overhead. Current utility optimization was discussed in the context of AI safety (Yudkowsky, [2011](https://arxiv.org/html/2606.28955#bib.bib32); Hibbard, [2012](https://arxiv.org/html/2606.28955#bib.bib8); Yampolskiy, [2014](https://arxiv.org/html/2606.28955#bib.bib31)), but has not been operationalized for standard value-based RL.
2. We show empirically that MCVL mitigates reward hacking across four safety-relevant gridworlds and three modified MuJoCo tasks while achieving performance comparable to an Oracle trained on the true reward; for continuous control, random-policy data suffices for the seed dataset.
   Context: The evaluation demonstrates effectiveness across diverse hacking mechanisms rather than providing a controlled benchmark. Gridworld tasks require a Safe variant with the hacking affordance removed for the seed dataset.
3. We formalize safety, permissiveness, and bounded-degradation guarantees for MCVL's gating rule, parameterized by evaluator accuracy $\epsilon$ decomposed into reward-model and value-function error.
   Context: The guarantees depend on an $\epsilon$-accurate return estimator. The bound is conservative: both branches share the same frozen networks and start states, producing correlated errors that partially cancel. Achieving small $\epsilon$ is not guaranteed in general, though pretraining on hack-free data provides an initial fit, and successful filtering preserves buffer quality, helping maintain or improve the accuracy over time.
###### Abstract
Reinforcement learning agents can exploit misspecified reward signals to achieve high apparent returns while failing on the intended objective, a failure mode known as reward hacking\. Existing practical defenses typically constrain policy updates to stay near a known safe reference, creating a tension between suppressing hacking and permitting legitimate improvement\. We propose Modification\-Considering Value Learning \(MCVL\), which operationalizes the theoretical idea of current utility optimization for standard value\-based RL\. MCVL wraps an off\-policy learner and treats each incoming transition as a candidate modification: it forecasts two training paths, one that includes the transition and one that does not, and scores both with a frozen bootstrapped\-return estimator derived from a learned reward model and value function\. The transition is admitted only if inclusion does not decrease the score\. We formalize conditions under which this filtering is both safe and permissive, and instantiate MCVL with DDQN and TD3\. Across four safety\-relevant gridworlds and three modified MuJoCo continuous\-control tasks with diverse hacking mechanisms, MCVL mitigates reward hacking while continuing to improve the intended objective\. Project website:[ktolnos\.github\.io/mcvl/](https://ktolnos.github.io/mcvl/)\.
## 1 Introduction
Optimizing poorly defined or incomplete rewards can push RL agents toward unintended behaviors, leading to *reward hacking* (Skalse et al., [2022](https://arxiv.org/html/2606.28955#bib.bib26)). For instance, an agent tasked with stacking blocks may learn to flip blocks if the reward is based on the height of the bottom face (Popov et al., [2017](https://arxiv.org/html/2606.28955#bib.bib24)). As RL systems scale to safety-critical applications (e.g., autonomous driving (Kiran et al., [2021](https://arxiv.org/html/2606.28955#bib.bib10)) or medical diagnostics (Ghesu et al., [2017](https://arxiv.org/html/2606.28955#bib.bib7))), ensuring reliable and safe behavior becomes increasingly important. Reward hacking can become more prevalent as models grow in complexity (Pan et al., [2022](https://arxiv.org/html/2606.28955#bib.bib23)), and it also affects large language models where RL is used for post-training (Denison et al., [2024](https://arxiv.org/html/2606.28955#bib.bib2); OpenAI, [2024](https://arxiv.org/html/2606.28955#bib.bib21); MacDiarmid et al., [2025](https://arxiv.org/html/2606.28955#bib.bib17)). A common mitigation constrains policy updates around a trusted reference (Laidlaw et al., [2024](https://arxiv.org/html/2606.28955#bib.bib13)), often at a cost to optimality.
A complementary safeguard is to *optimize what the agent currently values* while being conservative about changing those values, an idea discussed as *current utility optimization* (Orseau & Ring, [2011](https://arxiv.org/html/2606.28955#bib.bib22); Hibbard, [2012](https://arxiv.org/html/2606.28955#bib.bib8); Everitt et al., [2016](https://arxiv.org/html/2606.28955#bib.bib4); [2021](https://arxiv.org/html/2606.28955#bib.bib5)). None of these works, however, provides a practical evaluation of this concept. We address this gap by investigating whether individual transitions can be predictive of reward hacking in the context of value-based RL. Our method, *Modification-Considering Value Learning (MCVL)*, wraps a standard off-policy learner and treats each update as a candidate modification. For a newly observed transition, the agent forecasts two scenarios: one in which it learns from the transition and one in which it ignores it. MCVL then scores both resulting policies using its *current* learned return estimator, an $n$-step bootstrapped return combining a learned reward model with a value-function bootstrap, and accepts the transition only if inclusion does not decrease this score. MCVL blocks updates that, according to the agent's current return estimator, would shift behavior toward undesirable strategies. To study it empirically, we instantiate MCVL with DDQN and TD3.
Our method only assumes a seed dataset with non-hacking transitions so that the evaluator can identify the intended behavior; for gridworlds we collect this in a *Safe* variant with the hacking affordance removed, while for continuous control a random-policy dataset suffices ([Section 3](https://arxiv.org/html/2606.28955#S3)). Under these conditions, MCVL mitigates reward hacking in multiple safety-relevant gridworlds (Leike et al., [2017](https://arxiv.org/html/2606.28955#bib.bib14); Everitt et al., [2021](https://arxiv.org/html/2606.28955#bib.bib5)) and modified Gymnasium continuous-control environments (Towers et al., [2024](https://arxiv.org/html/2606.28955#bib.bib29)) (Reacher, Ant, HalfCheetah), which we introduce to enable reward-hacking research in continuous control. All code will be made publicly available.
## 2 Notation and Preliminaries
We denote a Markov decision process (MDP) by $(S, A, P, R, \rho, \gamma)$ with state space $S$, action space $A$, transition model $P(s' \mid s, a) \in [0, 1]$, reward function $R: S \times A \rightarrow \mathbb{R}$, initial state distribution $\rho$, and discount factor $\gamma \in (0, 1)$. For any reward function $r$, we write $J_{r}(\pi) = \mathbb{E}_{\rho,\pi}[\sum_{t \geq 0} \gamma^{t} r(s_{t}, a_{t})]$ for the expected return of the policy $\pi$ under $r$. In standard RL, the agent's training objective is to learn a policy $\pi$ maximizing $J_{R}(\pi)$. The state-action value $Q^{\pi}(s, a)$ is the expected return starting from $(s, a)$ and following $\pi$ thereafter (Sutton & Barto, [2018](https://arxiv.org/html/2606.28955#bib.bib27)). Deep value-based methods like DDQN (van Hasselt et al., [2016](https://arxiv.org/html/2606.28955#bib.bib30)) and TD3 (Fujimoto et al., [2018](https://arxiv.org/html/2606.28955#bib.bib6)) approximate $Q$ with a neural network and learn from transitions $(s, a, r, s')$ sampled from a replay buffer via temporal-difference (TD) updates.
##### Reward hacking\.
Let $R$ denote the observed training reward and $R^{*}$ the intended reward (unobserved by the agent). The agent's true objective is to maximize $J_{R^{*}}(\pi)$, while observing only rewards from $R$. For a policy update from $\pi$ to $\pi'$, we say the update *induces reward hacking* if $J_{R}(\pi') > J_{R}(\pi)$ but $J_{R^{*}}(\pi') < J_{R^{*}}(\pi)$. In words, the update looks better under the proxy while making intended performance worse. Skalse et al. ([2022](https://arxiv.org/html/2606.28955#bib.bib26)) define *hackability* as a property of reward-function pairs; our notion focuses on the policy update and is complementary.
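The definition can be read as a two-sided comparison of returns before and after an update. The snippet below is a minimal sketch of that predicate; the function name and the numeric returns are hypothetical, and in practice the intended return is unobserved, so such a check is only available to an experimenter who knows the intended reward.

```python
def induces_reward_hacking(j_proxy_old, j_proxy_new, j_true_old, j_true_new):
    """True when the proxy return rises while the intended return falls."""
    return j_proxy_new > j_proxy_old and j_true_new < j_true_old


# Proxy return rises (3.0 -> 5.0) while intended return falls (2.0 -> 1.0).
hacking_update = induces_reward_hacking(3.0, 5.0, 2.0, 1.0)

# Both returns improve: an ordinary, legitimate update.
legitimate_update = induces_reward_hacking(3.0, 5.0, 2.0, 2.5)
```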
## 3 Method
*Modification-Considering Value Learning* (MCVL) wraps an off-policy learner and treats each learning update as a candidate modification to be evaluated before adoption. Because the desired objective is not observed, MCVL uses a learned *current return estimator* as a proxy to accept or reject updates. MCVL poses a counterfactual question: if it were to allocate the next $l$ training steps either (i) to its current replay buffer $\mathcal{D}$ alone or (ii) to $\mathcal{D}$ augmented with the new transition $T_{\mathrm{new}}$, which resulting policy would achieve a higher expected return according to the agent's *current bootstrapped-return estimator*? Both branches use the same compute budget and are scored by the same evaluator. The transition is accepted if and only if adding $T_{\mathrm{new}}$ does not decrease the score. This yields a locally rational accept/reject rule under the agent's present preferences.
##### Current bootstrapped\-return estimator\.
MCVL maintains a reward model $R_{\psi}(s, a)$ trained by supervised regression on observed rewards and an action-value function $Q_{\theta}(s, a)$ trained with standard TD targets. Together they define a current $n$-step bootstrapped return estimator for a trajectory $\tau = (s_{0}, a_{0}, \ldots, s_{n-1}, a_{n-1}, s_{n}, a_{n})$:

$$\hat{G}_{n}(\tau) \;=\; \sum_{t=0}^{n-1} \gamma^{t}\, R_{\psi}(s_{t}, a_{t}) \;+\; \gamma^{n}\, Q_{\theta}(s_{n}, a_{n}). \tag{1}$$

During scoring, the estimator parameters $(\psi, \theta)$ are *frozen to the live agent's current values*.
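To make Equation 1 concrete, the following sketch computes the $n$-step bootstrapped return for a trajectory given as a list of $n+1$ state-action pairs. The function name and the constant toy reward model and Q-function are assumptions chosen so that the arithmetic is easy to follow; they stand in for the learned $R_{\psi}$ and $Q_{\theta}$.

```python
def bootstrapped_return(trajectory, reward_model, q_function, gamma):
    """n-step bootstrapped return for [(s_0, a_0), ..., (s_n, a_n)]."""
    *steps, (s_n, a_n) = trajectory
    n = len(steps)
    # Discounted sum of model-predicted rewards for the first n steps.
    model_rewards = sum(
        gamma**t * reward_model(s, a) for t, (s, a) in enumerate(steps)
    )
    # Bootstrap the tail of the return from the Q-function at step n.
    return model_rewards + gamma**n * q_function(s_n, a_n)


# Toy stand-ins (hypothetical): every step predicts reward 1, tail value 10.
toy_reward_model = lambda s, a: 1.0
toy_q_function = lambda s, a: 10.0

trajectory = [(0, "up"), (1, "up"), (2, "up")]  # n = 2
g_hat = bootstrapped_return(trajectory, toy_reward_model, toy_q_function, gamma=0.5)
# 1 + 0.5 * 1 + 0.25 * 10 = 4.0
```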
##### Policy forecasting and comparison\.
Upon observing $T_{\mathrm{new}} = (s, a, r, s')$, MCVL constructs two forecasts under an identical training budget of $l$ learner updates:

$$\tilde{\pi}^{\,0} \;=\; \mathsf{Forecast}(\mathcal{D},\, l), \qquad \tilde{\pi}^{\,+} \;=\; \mathsf{Forecast}(\mathcal{D} \cup \{T_{\mathrm{new}}\},\, l).$$

The operator $\mathsf{Forecast}(\cdot, l)$ clones the current networks and runs $l$ base-learner updates on minibatches from the specified dataset. These updates do not affect the live agent. Both forecasts are scored by the same frozen evaluator from [Equation 1](https://arxiv.org/html/2606.28955#S3.E1). Let $\{s^{(i)}\}_{i=1}^{k} \sim \rho$ be a set of initial states and let rollouts be of length $n$ under the same transition model $\hat{P}$ for both branches. Define

$$\hat{J}(\pi) \;=\; \frac{1}{k} \sum_{i=1}^{k} \big[\hat{G}_{n}(\tau \sim (\hat{P}, \pi) \,|\, s_{0} = s^{(i)})\big] \tag{2}$$

as the approximated expected on-policy return under the current bootstrapped return estimator. It is used for evaluating $T_{\mathrm{new}}$: MCVL admits $T_{\mathrm{new}}$ if and only if $\hat{J}(\tilde{\pi}^{\,+}) \geq \hat{J}(\tilde{\pi}^{\,0})$. Using matched compute, evaluator parameters, and transition model isolates the effect of admitting $T_{\mathrm{new}}$ and makes the comparison less sensitive to estimation errors. The training procedure is described in [Algorithm 1](https://arxiv.org/html/2606.28955#alg1).
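The scoring and comparison step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the transition model is a deterministic toy function (the paper's $\hat{P}$ may be stochastic), states are integers, and the names `estimate_score` and `admits` are hypothetical.

```python
def estimate_score(policy, start_states, step_model, reward_model, q_function, gamma, n):
    """Average n-step bootstrapped return over the given start states."""
    total = 0.0
    for state in start_states:
        ret, discount = 0.0, 1.0
        for _ in range(n):
            action = policy(state)
            ret += discount * reward_model(state, action)
            state = step_model(state, action)
            discount *= gamma
        # Bootstrap from the frozen Q-function at the final state-action pair.
        ret += discount * q_function(state, policy(state))
        total += ret
    return total / len(start_states)


def admits(policy_plus, policy_zero, **evaluator):
    """Gate: admit iff the 'with transition' branch does not score lower."""
    return estimate_score(policy_plus, **evaluator) >= estimate_score(policy_zero, **evaluator)


# Shared frozen evaluator for both branches (all values are toy assumptions).
evaluator = dict(
    start_states=[0, 1],
    step_model=lambda s, a: s + a,                     # deterministic toy dynamics
    reward_model=lambda s, a: 1.0 if a == 1 else 0.0,  # evaluator values moving up
    q_function=lambda s, a: 0.0,
    gamma=0.9,
    n=3,
)
always_up = lambda s: 1
always_down = lambda s: -1

score_up = estimate_score(always_up, **evaluator)       # 1 + 0.9 + 0.81
hack_admitted = admits(always_down, always_up, **evaluator)
tie_admitted = admits(always_up, always_up, **evaluator)
```

Because both branches share start states, dynamics, and frozen evaluator, any difference in score comes from the forecasted policies alone; a tie is admitted, matching the "does not decrease" rule.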
##### Instantiations \(MC\-DDQN and MC\-TD3\)\.
MC-DDQN wraps a DDQN agent with an $\epsilon$-greedy behavior policy. Forecasting clones all parameters, including targets, and runs $l$ ordinary DDQN updates; forecasted policies are greedy with respect to their respective Q-functions. MC-TD3 analogously clones the actor and critics and runs $l$ standard TD3 updates. During scoring, rollouts use a transition model $\hat{P}$ (either the environment itself or a learned approximation; [Section 4.3](https://arxiv.org/html/2606.28955#S4.SS3)) and compute rewards using the learned reward model. If accepted, $T_{\mathrm{new}}$ is inserted into $\mathcal{D}$ and future updates may sample it to update both $Q_{\theta}$ and $R_{\psi}$. If rejected, the transition is discarded and the episode is reset, since future transitions may not be connected to the transitions in the replay buffer, which impedes policy forecasting. Full algorithmic details appear in [Appendices E](https://arxiv.org/html/2606.28955#A5) and [F](https://arxiv.org/html/2606.28955#A6).
Algorithm 1: MCVL (wrapper around an off-policy value-based learner)

1. **while** training **do**
2. Observe $T_{\mathrm{new}} = (s, a, r, s')$ ▷ Action is selected using the policy of a base learner
3. **if** $|\,r - R_{\psi}(s, a)\,| < \delta_{r}$ **then** ▷ Optional check to avoid excessive evaluations
4. Insert $T_{\mathrm{new}}$ into replay buffer $\mathcal{D}$; perform a training step; **continue**
5. **end if**
6. $\tilde{\pi}^{\,0} \leftarrow \mathsf{Forecast}(\mathcal{D},\, l)$ ▷ Forecast performs $l$ training steps on a provided replay buffer
7. $\tilde{\pi}^{\,+} \leftarrow \mathsf{Forecast}(\mathcal{D} \cup \{T_{\mathrm{new}}\},\, l)$
8. Estimate $\hat{J}(\tilde{\pi}^{\,0})$ and $\hat{J}(\tilde{\pi}^{\,+})$ via $k$ rollouts of length $n$ using [Equation 2](https://arxiv.org/html/2606.28955#S3.E2)
9. **if** $\hat{J}(\tilde{\pi}^{\,+}) \geq \hat{J}(\tilde{\pi}^{\,0})$ **then**
10. Insert $T_{\mathrm{new}}$ into $\mathcal{D}$
11. **else** ▷ The transition is suspected to be reward hacking-inducing
12. Reset the environment ▷ Continued exploration might impede the detection accuracy
13. **end if**
14. Perform a training step: sample a batch from $\mathcal{D}$ and update the base learner and $R_{\psi}$ on it.
15. **end while**
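The forecast-and-compare core of the algorithm above (steps 6 to 9) can be illustrated with a deliberately tiny example. Everything here is an assumption made for illustration: the learner is a bandit-style table with one value per action, transitions are reduced to `(action, reward)` pairs, forecasting uses full passes over the dataset instead of sampled minibatches so the outcome is deterministic, and the frozen evaluator is a fixed lookup table rather than rollouts with a learned reward model and Q-function.

```python
import copy


class ToyLearner:
    """Stand-in for the base learner: one value per action, greedy policy."""

    def __init__(self, actions, lr=0.5):
        self.q = {a: 0.0 for a in actions}
        self.lr = lr

    def update(self, batch):
        for action, reward in batch:
            self.q[action] += self.lr * (reward - self.q[action])

    def greedy_action(self):
        return max(self.q, key=self.q.get)


def forecast(learner, dataset, num_updates):
    """Clone the learner and train only the clone; the live learner is untouched."""
    clone = copy.deepcopy(learner)
    for _ in range(num_updates):
        clone.update(dataset)  # assumption: full passes, not sampled minibatches
    return clone


def mcvl_admits(learner, buffer, t_new, score_fn, num_updates):
    pi_zero = forecast(learner, buffer, num_updates)
    pi_plus = forecast(learner, buffer + [t_new], num_updates)
    return score_fn(pi_plus) >= score_fn(pi_zero)


# Frozen evaluator (assumption): fit on hack-free data, it values "up".
frozen_reward = {"up": 1.0, "down": 0.0}
score = lambda policy: frozen_reward[policy.greedy_action()]

live = ToyLearner(["up", "down"])
buffer = [("up", 1.0), ("down", 0.0)]

# A spurious +5 on "down" would pull the forecasted policy toward "down".
hack_admitted = mcvl_admits(live, buffer, ("down", 5.0), score, num_updates=10)
# A slightly better "up" reward leaves the forecasted policy on "up".
benign_admitted = mcvl_admits(live, buffer, ("up", 1.2), score, num_updates=10)
```

The spurious transition is rejected because the branch trained with it ends up preferring an action the frozen evaluator scores lower, while the benign transition ties and is admitted.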
##### Pretraining\.
MCVL assumes a seed dataset of non-hacking transitions $\mathcal{D}_{0}$. The motivation is identifiability: without such data, a learned return estimator cannot distinguish genuine task progress from reward hacking. We fit $R_{\psi}$ by supervised regression on the observed proxy rewards in $\mathcal{D}_{0}$ and train $Q_{\theta}$ with standard TD targets as part of the normal RL training. After pretraining, every newly observed transition is screened by the forecast-and-score check; since $R_{\psi}$ and $Q_{\theta}$ continue to update with each base-learner step, the evaluator incorporates new information beyond the seed data.
We study two practical sources of $\mathcal{D}_{0}$: (i) *Safe* variants that match observations, actions, and reward but remove the hacking affordance, so that hacking transitions are unreachable (used for gridworlds where undirected exploration quickly discovers hacks); and (ii) random-policy data collected in the *Full* environment, which suffices when hacking requires specific conditions unlikely under random exploration (used for all three continuous-control tasks). Importantly, neither source reveals $R^{*}$ where $R$ and $R^{*}$ differ, and policies trained on $\mathcal{D}_{0}$ transfer only imperfectly to the Full environment.
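As a rough illustration of the reward-model half of pretraining, the sketch below fits a tabular reward model by least squares on a seed dataset, which for a table reduces to averaging the observed rewards per state-action pair. The tabular form, the name `fit_tabular_reward_model`, and the toy transitions are assumptions for illustration only and are not the paper's implementation.

```python
from collections import defaultdict


def fit_tabular_reward_model(seed_dataset):
    """Least-squares tabular fit: the mean observed reward per (s, a)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s, a, r, _s_next in seed_dataset:
        sums[(s, a)] += r
        counts[(s, a)] += 1
    table = {key: sums[key] / counts[key] for key in sums}

    def reward_model(s, a, default=0.0):
        # Unseen pairs fall back to a default; a learned model would generalize.
        return table.get((s, a), default)

    return reward_model


# Hypothetical hack-free seed transitions (s, a, r, s').
seed = [(0, "up", 1.0, 1), (0, "up", 3.0, 1), (1, "down", 0.0, 0)]
r_psi = fit_tabular_reward_model(seed)

# A later transition with a reward the seed data never showed produces a
# large discrepancy against the model's prediction.
discrepancy = abs(5.0 - r_psi(2, "down"))
```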
##### Hyperparameters and cost\.
To limit overhead, we invoke forecasting only when the observed reward disagrees with the reward model, $\lvert r - R_{\psi}(s, a) \rvert \geq \delta_{r}$; otherwise $T_{\mathrm{new}}$ is admitted without a check as a heuristic. As shown in [Section 4.3](https://arxiv.org/html/2606.28955#S4.SS3), this filtering does not change conclusions in our experiments. The marginal per-transition cost is $2l$ base-learner updates plus $2k \cdot n$ rollout steps; $\delta_{r}$ controls how often this cost is paid. Detailed hyperparameter guidance appears in [Appendix K](https://arxiv.org/html/2606.28955#A11).
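The trigger condition and the per-check cost are simple enough to state as code. The sketch below is illustrative; the function names and the particular values of $l$, $k$, $n$, and $\delta_{r}$ are hypothetical and are not the hyperparameters used in the paper.

```python
def needs_check(observed_reward, predicted_reward, delta_r):
    """Forecasting is triggered only on a large reward-model discrepancy."""
    return abs(observed_reward - predicted_reward) >= delta_r


def check_cost(l, k, n):
    """Marginal cost of one check: two forecasts and two sets of rollouts."""
    return {"learner_updates": 2 * l, "rollout_steps": 2 * k * n}


surprising = needs_check(observed_reward=5.0, predicted_reward=0.1, delta_r=0.5)
expected = needs_check(observed_reward=1.0, predicted_reward=0.9, delta_r=0.5)
cost = check_cost(l=100, k=8, n=10)  # 200 updates and 160 rollout steps
```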
##### Reward hacking prevention\.
MCVL evaluates the *policy change* from admitting a transition using the agent's current bootstrapped-return estimator, relative to an equally trained counterfactual that excludes it. This yields a local self-consistency test: if inclusion steers learning toward behavior the evaluator already scores worse over horizon $n$ (e.g., shifting effort from task completion to reward tampering), the update is vetoed. If inclusion raises (or leaves unchanged) the score, the transition is admitted. This captures ordinary competence gains (shorter paths, reduced control effort) the evaluator already values. While not every hack is guaranteed to lower the score, in our experiments MCVL rejects the updates that produce undesired behaviors across diverse environments.
##### Theoretical analysis\.
We formalize the conditions under which MCVL correctly accepts or rejects transitions. Let $J_{R^{*}}(\pi) = \mathbb{E}_{\rho,\pi}\bigl[\sum_{t \geq 0} \gamma^{t} R^{*}(s_{t}, a_{t})\bigr]$ denote the true expected return under the intended reward $R^{*}$, which the agent does not observe.
###### Assumption 3.1 ($\epsilon$-accurate evaluator).
The bootstrapped-return estimator $\hat{J}$ is *$\epsilon$-accurate* over a policy class $\Pi$ if $|\hat{J}(\pi) - J_{R^{*}}(\pi)| \leq \epsilon$ for all $\pi \in \Pi$.
###### Proposition 3.2 (Safety, permissiveness, and bounded degradation).
Suppose $\hat{J}$ is $\epsilon$-accurate over $\Pi \supseteq \{\tilde{\pi}^{\,+}, \tilde{\pi}^{\,0}\}$. Then:
1. Safety. If $J_{R^{*}}(\tilde{\pi}^{\,+}) < J_{R^{*}}(\tilde{\pi}^{\,0}) - 2\epsilon$, then MCVL rejects $T_{\mathrm{new}}$.
2. Permissiveness. If MCVL rejects $T_{\mathrm{new}}$, then $J_{R^{*}}(\tilde{\pi}^{\,+}) < J_{R^{*}}(\tilde{\pi}^{\,0}) + 2\epsilon$. That is, MCVL never rejects a transition whose inclusion would improve true performance by more than $2\epsilon$.
3. Bounded degradation. If MCVL accepts $T_{\mathrm{new}}$, then $J_{R^{*}}(\tilde{\pi}^{\,+}) \geq J_{R^{*}}(\tilde{\pi}^{\,0}) - 2\epsilon$.

The $2\epsilon$ threshold can be made explicit in terms of the reward-model error $\epsilon_{R}$ and the value-function error $\epsilon_{Q}$; see [Appendix A](https://arxiv.org/html/2606.28955#A1) for proofs and the full error decomposition. The bound clarifies two practical aspects: (i) pretraining on hack-free data reduces $\epsilon_{R}$ and $\epsilon_{Q}$, tightening the safety bound, and (ii) successful filtering preserves buffer quality, keeping the evaluator accurate over time. In practice, the effective gap is likely tighter than $2\epsilon$ because both branches are scored by the same frozen evaluator on the same start states and rolled out under the same transition model, producing positively correlated errors that largely cancel in the comparison.
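All three claims follow from Assumption 3.1 by chaining two $\epsilon$-accuracy bounds with the gating rule. The derivation below is a sketch reconstructed from the statement of the proposition; it is not taken from the proofs in Appendix A, which may be organized differently. For brevity it writes $J^{*}$ for $J_{R^{*}}$, a shorthand introduced only here.

$$
\begin{aligned}
\text{Safety:}\quad
& J^{*}(\tilde{\pi}^{\,+}) < J^{*}(\tilde{\pi}^{\,0}) - 2\epsilon
\;\Rightarrow\;
\hat{J}(\tilde{\pi}^{\,+}) \le J^{*}(\tilde{\pi}^{\,+}) + \epsilon
< J^{*}(\tilde{\pi}^{\,0}) - \epsilon
\le \hat{J}(\tilde{\pi}^{\,0})
&& \text{(so MCVL rejects)} \\
\text{Permissiveness:}\quad
& \hat{J}(\tilde{\pi}^{\,+}) < \hat{J}(\tilde{\pi}^{\,0})
\;\Rightarrow\;
J^{*}(\tilde{\pi}^{\,+}) \le \hat{J}(\tilde{\pi}^{\,+}) + \epsilon
< \hat{J}(\tilde{\pi}^{\,0}) + \epsilon
\le J^{*}(\tilde{\pi}^{\,0}) + 2\epsilon \\
\text{Bounded degradation:}\quad
& \hat{J}(\tilde{\pi}^{\,+}) \ge \hat{J}(\tilde{\pi}^{\,0})
\;\Rightarrow\;
J^{*}(\tilde{\pi}^{\,+}) \ge \hat{J}(\tilde{\pi}^{\,+}) - \epsilon
\ge \hat{J}(\tilde{\pi}^{\,0}) - \epsilon
\ge J^{*}(\tilde{\pi}^{\,0}) - 2\epsilon
\end{aligned}
$$

Each line uses $|\hat{J}(\pi) - J^{*}(\pi)| \le \epsilon$ once for each branch, which is where the factor of two comes from.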
## 4 Experiments

Figure 1: *Box Moving*, with panels (a) Safe, (b) Full, and (c) No-Hack. The agent moves up or down; stepping on an arrow moves the box that way, and the box teleports to the center on reaching the top. (a) *Safe*: the agent learns to move the box upward. (b) *Full*: a bottom cell yields a spurious $+5$ when pressed; reward-maximizing behavior repeatedly triggers it with down-arrows, moving the box downward against the true objective, whereas a non-hacking strategy can still raise return by alternating two up-arrows. (c) *No-Hack*: collecting the $+5$ does not block moving the box up, so such transitions are aligned with the objective and should not be rejected.

We evaluate whether MCVL mitigates reward hacking while continuing to improve task performance. We compare MCVL to its base learner (DDQN in discrete domains; TD3 in continuous control), an Oracle agent trained with the base learner on the *true* reward (which MCVL cannot access), and a Frozen policy that fixes the pretrained networks and performs no further learning in the *Full* environment. All methods share hyperparameters, pretrained weight initialization, and the pretraining replay buffer. We use a learned reward model and true transition dynamics for scoring in all experiments unless stated otherwise. We report the *true performance* (intended objective) and the *observed return* for each environment, with means and bootstrapped 95% CIs over 10 seeds.
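For readers unfamiliar with bootstrapped confidence intervals over seeds, the sketch below shows one common variant, the percentile bootstrap of the mean. The text does not say which bootstrap variant was used, so the method, the function name, the number of resamples, and the per-seed values are all assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(values, num_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(num_resamples)
    )
    lo_idx = int((1 - level) / 2 * num_resamples)
    hi_idx = int((1 + level) / 2 * num_resamples) - 1
    return mean(values), resampled_means[lo_idx], resampled_means[hi_idx]


# Hypothetical final true-performance values for 10 seeds.
per_seed = [9.1, 8.7, 9.4, 9.0, 8.9, 9.3, 9.2, 8.8, 9.0, 9.1]
center, lower, upper = bootstrap_ci(per_seed)
```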
### 4.1 Environments
We introduce the *Box Moving* environment ([Figure 1](https://arxiv.org/html/2606.28955#S4.F1)) to illustrate representative accept/reject decisions under MCVL, and adopt *Absent Supervisor* and *Tomato Watering* from AI Safety Gridworlds (Leike et al., [2017](https://arxiv.org/html/2606.28955#bib.bib14)) and *Rocks and Diamonds* from Everitt et al. ([2021](https://arxiv.org/html/2606.28955#bib.bib5)); all are shown in [Figure 2](https://arxiv.org/html/2606.28955#S4.F2). For each gridworld we provide a *Safe* variant for pretraining.
##### Continuous control \(MuJoCo\)\.
For continuous control, we modify Reacher-v5, Ant-v5, and HalfCheetah-v5 (Towers et al., [2024](https://arxiv.org/html/2606.28955#bib.bib29)) by introducing spurious reward sources that model reward tampering or sensor malfunctions (details in [Appendix G](https://arxiv.org/html/2606.28955#A7)). We augment Reacher-v5 with a hidden button yielding a one-time $+50$ after the manipulator has spent 15 consecutive frames in a small region. We modify Ant-v5 with a sensor-malfunction bonus triggered by high contact force and forward velocity, modeling IMU saturation in legged robots, and HalfCheetah-v5 with an optical-flow sensor failure bonus triggered when the front tip is low and speed is high. In each case the performance metric excludes the bonus, and random exploration is unlikely to trigger the hack.
Figure 2: Gridworld environments, with panels (a) Absent Supervisor, (b) Tomato Watering, and (c) Rocks and Diamonds. (a) *Absent Supervisor*: the shortest path to the goal passes through a punishment cell whose cost depends on whether a supervisor is present. In *Safe* the supervisor is always present; in *Full* it appears with probability $0.5$. True performance penalizes the punishment cell regardless of supervision. (b) *Tomato Watering*: the agent waters tomatoes that dry stochastically; in *Full* a bucket causes *perceptual delusion*, making all cells appear as watered tomatoes. True performance counts true watered tomatoes; *Safe* removes the bucket. We provide a non-delusional transition model for scoring. (c) *Rocks and Diamonds*: diamonds yield $+1$ per step, rocks $-1$. In *Full* a switch flips the observed rock reward to $+10$ while true performance still prefers diamonds; *Safe* removes the switch.
### 4.2 Main results

Figure 3: Main results, with panels (a) Box Moving, (b) Absent Supervisor, (c) Tomato Watering, (d) Rocks & Diamonds, (e) Reacher, (f) Ant, and (g) HalfCheetah. Top: true performance metric (intended objective, unobserved). Bottom: observed return (proxy). We compare the base learner (DDQN/TD3), MCVL, an Oracle trained on true reward, and a Frozen policy that stops learning after pretraining. Bold lines: mean over 10 seeds; bands: bootstrapped 95% CI.

[Figure 3](https://arxiv.org/html/2606.28955#S4.F3) shows that MCVL maintains high performance across all tasks while the base learners hack. MCVL attains final performance that is comparable to the Oracle in all tasks despite never accessing the true reward. In *Box Moving*, *Absent Supervisor*, and *Tomato Watering*, MCVL reaches strong performance faster than the Oracle, which we hypothesize is due to an implicit curriculum induced by rejecting transitions that cause large behavioral shifts early in training.
In continuous\-control environments, MC\-TD3 achieves performance comparable to the Oracle while the Frozen policy substantially underperforms, suggesting that continued learning beyond pretraining is important in the considered environments\. Further details of the continuous\-control experiments are provided in[Appendix G](https://arxiv.org/html/2606.28955#A7)\.
Across environments, the qualitative pattern is consistent: transitions inducing reward hacking produce forecasted policies that score lower than counterfactuals trained without them. In *Full* Box Moving, exploiting the $+5$ tile yields lower forecasted return than pursuing up-arrows (while in the *No-Hack* variant, MCVL admits all transitions and recovers the Oracle policy, [Figure 5(b)](https://arxiv.org/html/2606.28955#A3.F5.sf2)); in Absent Supervisor, unsupervised punishment-cell transitions produce policies that use the cell more often, scoring lower under an evaluator pretrained to always penalize it regardless of supervision state; in Rocks and Diamonds, rock-pushing transitions score below diamond-prioritizing ones. In Tomato Watering, MCVL mitigates hacking only with a non-delusional transition model, which tracks the true world state and lets the evaluator distinguish genuine watering from the bucket's illusion, much as a person in VR still knows their physical location. For continuous control, sensor-corruption and reward-tampering transitions produce policies that prioritize the bonus over movement. In Ant, pursuing high contact forces impedes stable policy learning for TD3, but MC-TD3 tracks the Oracle.
### 4.3 Ablations and sensitivity
We study when to trigger checks, how to conduct them, and how to handle harmful transitions ([Figure 4](https://arxiv.org/html/2606.28955#S4.F4)). Triggering only when $|r - R_{\psi}(s, a)| \geq \delta_{r}$ (*Check-by-reward*) performs comparably to *Check-all* but with lower computational cost, and outperforms *Discard-by-reward* (which never admits large-discrepancy transitions), as the latter filters out legitimately informative data and fails to reach optimal performance. Removing the gate entirely and bootstrapping on a frozen reward model (*Freeze RM*) isolates the gate's contribution: it matches the Oracle in five of our seven tasks, so a reward model fit on the pretraining data is often already sufficient, but it collapses where that model contains errors. Notably, in *Absent Supervisor*, dropping the gate induces reward hacking and a sharp fall in true performance; in one Ant seed, the no-gate policy even drove its predicted return far above any achievable true return while true performance collapsed ([Appendix H](https://arxiv.org/html/2606.28955#A8)).
##### Importance of forecasting\.
An *Each-step* variant that compares the policy before and after a *single* gradient step does not reliably prevent hacking. Policy changes only occur once the critic begins assigning higher value to the new behavior, at which point both the critic and reward model already endorse it. By contrast, allowing $l$ standard updates during forecasting gives the base learner enough room to translate a transition into a meaningful policy shift, which the evaluator can then assess effectively using current live networks. The scoring noise for the branch comparison does not depend on $l$, while forecasted policies are more likely to diverge as $l$ grows, so increasing $l$ improves detection reliability (see [Appendix A](https://arxiv.org/html/2606.28955#A1)).
##### Reject vs\. penalize\.
Replacing rejected transitions with large negative rewards \(*Punishment*\) is less effective than discarding them\. Punished transitions accumulate in the buffer which discourages learning policies that exploit them\. However, this also prevents learning these policies during forecasting, which decreases the reliability of the forecast\-and\-score gate\.
##### Pretraining budget\.
As shown in [Figure 4(b)](https://arxiv.org/html/2606.28955#S4.F4.sf2), some seeds avoid hacking with as few as 100 pretraining steps in *Safe*; by 300 steps all seeds succeed, even though most have not converged to the optimal policy in the *Safe* environment. With zero pretraining, MCVL matches the base learner.

Figure 4: Additional experiments in Box Moving, with panels (a) Alternative training schemes and (b) Pretraining steps. (a) Comparison of training schemes: *Check all* checks all transitions; *Check by reward* checks only transitions for which predicted reward differs from the observed by at least $\delta$; *Discard by reward* discards all transitions where predicted reward sufficiently differs from the observed; *Each step* evaluates policies before and after each gradient step without forecasting future policies; *Punishment* replaces rejected transitions' rewards with a punishment reward. (b) Effect of different amounts of pretraining; 0 means no pretraining. After as few as 300 steps, MCVL can achieve optimal performance across all seeds.
##### Learned transition model\.
Because both branches share the same $\hat{P}$, exact dynamics are unnecessary; it suffices that the model preserves the relative ranking of hacking and non-hacking trajectories. Replacing the environment with a learned forward model in *Box Moving* preserves Oracle performance while avoiding reward hacking ([Appendix D](https://arxiv.org/html/2606.28955#A4)).
##### Forecast budget $l$.
If $l$ is too small, forecasting fails to capture the policy change induced by a transition, so harmful updates are rejected less reliably: learning of reward hacking is slowed but not completely prevented. Increasing $l$ resolves this ([5(a)](https://arxiv.org/html/2606.28955#A3.F5.sf1)). Additional experiments in different variations of gridworld environments are provided in [Appendix C](https://arxiv.org/html/2606.28955#A3).
### 4\.4Comparison to occupancy\-regularized objectives
The closest practical baseline in standard RL settings is occupancy-regularized policy optimization toward a known safe policy (Laidlaw et al., [2024](https://arxiv.org/html/2606.28955#bib.bib13)). When the Oracle policy deviates substantially from the reference, the regularization strength sufficient to suppress hacking also suppresses Oracle-level improvements, so no single $\lambda$ separates the two. We test this with the objective $F(\pi,\pi_{\text{ref}})=J(\pi)-\lambda\,D(\mu_{\pi}\|\mu_{\pi_{\text{ref}}})$, holding the reference fixed to Frozen. Across 10 seeds and two divergences ($\sqrt{\chi^{2}}$, KL), such a $\lambda$ *often does not exist* ([Appendix B](https://arxiv.org/html/2606.28955#A2)). By contrast, MCVL achieves performance comparable to the Oracle without relying on a safe policy.
## 5Limitations and Future Work
Computation. MCVL adds forecasting and scoring overhead. Triggering checks only on reward discrepancies ([Section 3](https://arxiv.org/html/2606.28955#S3)) limits the wall-clock slowdown to 1.8× to 4.2× relative to the base learner on continuous-control tasks ([Appendix J](https://arxiv.org/html/2606.28955#A10)). Further reductions appear feasible through caching and batched rollouts.
Scope of applicability\.MCVL assumes the evaluator ranks hacking\-inducing trajectories below non\-hacking ones at scoring horizons\. If proxy misspecification is already valued by the evaluator, harmful updates may be admitted\. This may happen due to incorrect reward shaping, as in CoastRunners\(OpenAI,[2023](https://arxiv.org/html/2606.28955#bib.bib20)\)where the agent learns to repeatedly collect boosts instead of following the track\. If the reward for boosts is learned by the reward model, the gating mechanism would not reject policies that exploit it\. MCVL is therefore complementary to better reward design, including potential\-based shaping\(Ng et al\.,[1999](https://arxiv.org/html/2606.28955#bib.bib19)\)\.
Transition dynamics. Because both branches share $\hat{P}$, relative ranking accuracy matters more than absolute fidelity. In *Box Moving*, learned-model rollouts preserve Oracle-level MCVL performance ([Appendix D](https://arxiv.org/html/2606.28955#A4)). Other environments were tested with true dynamics for scoring; learning accurate world models for higher-dimensional environments is left for future work.
Pretraining dependence. MCVL needs a seed dataset with no hacking transitions to identify the intended behavior; miscalibrated initial evaluators can entrench bad preferences. In our tasks, *Safe*-variant data (gridworlds) or random exploration (continuous control) was sufficient; alternatives include simpler-task pretraining, trajectory filtering, or human demonstrations. MCVL does not modify the base learner's exploration policy; it only gates which transitions enter the replay buffer, so exploration is not affected, which is consistent with the Oracle-comparable performance.
## 6Related Work
Agents exploiting misspecified objectives are studied as*reward hacking*\(Skalse et al\.,[2022](https://arxiv.org/html/2606.28955#bib.bib26)\),*reward gaming*\(Leike et al\.,[2018](https://arxiv.org/html/2606.28955#bib.bib15)\), and*specification gaming*\(Krakovna et al\.,[2020](https://arxiv.org/html/2606.28955#bib.bib11)\)\.Krakovna et al\. \([2020](https://arxiv.org/html/2606.28955#bib.bib11)\)survey such failures, andSkalse et al\. \([2022](https://arxiv.org/html/2606.28955#bib.bib26)\)formalize reward hacking and show that it cannot be prevented without restricting the set of possible policies or controlling optimization\.
A common mitigation is constraining the policy to remain close to a trusted behavior distribution (Laidlaw et al., [2024](https://arxiv.org/html/2606.28955#bib.bib13); Liu et al., [2026](https://arxiv.org/html/2606.28955#bib.bib16)). MCVL instead filters individual replay updates using counterfactual forecasting, without requiring a safe reference policy or specific proxy-family assumptions. In our settings, MCVL reaches optimal returns where ORPO-style objectives often cannot simultaneously avoid hacking and learn the optimal policy ([Appendix B](https://arxiv.org/html/2606.28955#A2)).
Direct manipulation of reward channels is studied as*wireheading*\(Amodei et al\.,[2016](https://arxiv.org/html/2606.28955#bib.bib1); Taylor et al\.,[2016](https://arxiv.org/html/2606.28955#bib.bib28); Everitt & Hutter,[2016](https://arxiv.org/html/2606.28955#bib.bib3); Majha et al\.,[2019](https://arxiv.org/html/2606.28955#bib.bib18)\)or*reward tampering*\(Kumar et al\.,[2020](https://arxiv.org/html/2606.28955#bib.bib12); Everitt et al\.,[2021](https://arxiv.org/html/2606.28955#bib.bib5)\)\. A long\-standing idea is*current utility optimization*: choose actions that improve the current objective without changing what is optimized\(Yudkowsky,[2011](https://arxiv.org/html/2606.28955#bib.bib32); Hibbard,[2012](https://arxiv.org/html/2606.28955#bib.bib8); Yampolskiy,[2014](https://arxiv.org/html/2606.28955#bib.bib31)\)\.Schmidhuber \([2003](https://arxiv.org/html/2606.28955#bib.bib25)\)describes a self\-modifying*Gödel machine*agent that adopts only code or utility changes provably beneficial according to the current objective\.Everitt & Hutter \([2016](https://arxiv.org/html/2606.28955#bib.bib3)\)consider Bayesian agents over hand\-specified utility functions that select actions to avoid altering beliefs about the reward mechanism, andEveritt et al\. \([2021](https://arxiv.org/html/2606.28955#bib.bib5)\)give conditions under which optimizing the*current*reward avoids incentives to tamper\. MCVL operationalizes this perspective in practical off\-policy value\-based RL, including settings beyond direct reward/sensor tampering\.
## 7Conclusion
We introduced*Modification\-Considering Value Learning*, a forecast\-and\-score safeguard for off\-policy value\-based RL\. MCVL evaluates two counterfactual update paths with a frozen bootstrapped\-return estimator and admits a transition only if inclusion does not reduce that score, optimizing what the agent currently values while remaining conservative about changing those values\.
Our implementations, MC-DDQN and MC-TD3, mitigate reward hacking in safety-relevant gridworlds and modified MuJoCo tasks while remaining close to the Oracle on true reward. By operationalizing ideas from current utility optimization within standard deep RL, MCVL offers a practical path toward avoiding reward hacking without sacrificing performance on the intended objective.
#### Acknowledgments
Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, companies sponsoring the Vector Institute, Trajectory Labs \([trajectorylabs\.org](https://trajectorylabs.org/)\) and the Digital Research Alliance of Canada \([alliancecan\.ca](https://alliancecan.ca/)\)\. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the Canadian government\.
## Appendix AProofs and Error Decomposition
###### Proof of [Proposition 3.2](https://arxiv.org/html/2606.28955#S3.Thmtheorem2).
All three claims follow from the $\epsilon$-accuracy of $\hat{J}$.

*(i) Safety.* By $\epsilon$-accuracy, $\hat{J}(\tilde{\pi}^{\,+})\leq J_{R^{*}}(\tilde{\pi}^{\,+})+\epsilon$ and $\hat{J}(\tilde{\pi}^{\,0})\geq J_{R^{*}}(\tilde{\pi}^{\,0})-\epsilon$. Combining with the hypothesis $J_{R^{*}}(\tilde{\pi}^{\,+})<J_{R^{*}}(\tilde{\pi}^{\,0})-2\epsilon$:

$$\hat{J}(\tilde{\pi}^{\,+})\leq J_{R^{*}}(\tilde{\pi}^{\,+})+\epsilon<J_{R^{*}}(\tilde{\pi}^{\,0})-2\epsilon+\epsilon=J_{R^{*}}(\tilde{\pi}^{\,0})-\epsilon\leq\hat{J}(\tilde{\pi}^{\,0}).$$

Hence $\hat{J}(\tilde{\pi}^{\,+})<\hat{J}(\tilde{\pi}^{\,0})$ and the accept condition fails.

*(ii) Permissiveness.* MCVL rejecting means $\hat{J}(\tilde{\pi}^{\,+})<\hat{J}(\tilde{\pi}^{\,0})$. By $\epsilon$-accuracy:

$$J_{R^{*}}(\tilde{\pi}^{\,+})\leq\hat{J}(\tilde{\pi}^{\,+})+\epsilon<\hat{J}(\tilde{\pi}^{\,0})+\epsilon\leq J_{R^{*}}(\tilde{\pi}^{\,0})+2\epsilon.$$

Hence $J_{R^{*}}(\tilde{\pi}^{\,+})<J_{R^{*}}(\tilde{\pi}^{\,0})+2\epsilon$, so no transition with true improvement exceeding $2\epsilon$ is rejected.

*(iii) Bounded degradation.* Acceptance implies $\hat{J}(\tilde{\pi}^{\,+})\geq\hat{J}(\tilde{\pi}^{\,0})$. By $\epsilon$-accuracy:

$$J_{R^{*}}(\tilde{\pi}^{\,+})\geq\hat{J}(\tilde{\pi}^{\,+})-\epsilon\geq\hat{J}(\tilde{\pi}^{\,0})-\epsilon\geq J_{R^{*}}(\tilde{\pi}^{\,0})-2\epsilon.\qquad\blacksquare$$
##### Error decomposition\.
When rollouts use the true dynamics ($\hat{P}=P$) and the number of rollouts $k\to\infty$ (or the environment is deterministic), the evaluator error decomposes as follows.
###### Lemma A.1 (Evaluator error bound).
Under the conditions above,

$$|\hat{J}(\pi)-J_{R^{*}}(\pi)|\;\leq\;\frac{1-\gamma^{n}}{1-\gamma}\,\epsilon_{R}\;+\;\gamma^{n}\,\epsilon_{Q},$$

where $\epsilon_{R}=\max_{s,a}|R_{\psi}(s,a)-R^{*}(s,a)|$ and $\epsilon_{Q}=\max_{s,a}|Q_{\theta}(s,a)-Q^{\pi}_{R^{*}}(s,a)|$, with $Q^{\pi}_{R^{*}}$ denoting the true action-value function of $\pi$ under $R^{*}$.
###### Proof\.
Under exact dynamics and expectations,

$$\hat{J}(\pi)-J_{R^{*}}(\pi)=\mathbb{E}_{\rho,\pi}\!\left[\sum_{t=0}^{n-1}\gamma^{t}\bigl(R_{\psi}(s_{t},a_{t})-R^{*}(s_{t},a_{t})\bigr)\;+\;\gamma^{n}\bigl(Q_{\theta}(s_{n},a_{n})-Q^{\pi}_{R^{*}}(s_{n},a_{n})\bigr)\right].$$

Taking absolute values and applying the triangle inequality,

$$|\hat{J}(\pi)-J_{R^{*}}(\pi)|\;\leq\;\sum_{t=0}^{n-1}\gamma^{t}\,\epsilon_{R}+\gamma^{n}\,\epsilon_{Q}\;=\;\frac{1-\gamma^{n}}{1-\gamma}\,\epsilon_{R}+\gamma^{n}\,\epsilon_{Q}.\qquad\blacksquare$$
Substituting into [Proposition 3.2](https://arxiv.org/html/2606.28955#S3.Thmtheorem2), MCVL rejects a hacking-inducing transition whenever the true return drop $\delta=J_{R^{*}}(\tilde{\pi}^{\,0})-J_{R^{*}}(\tilde{\pi}^{\,+})$ satisfies

$$\delta\;>\;2\!\left(\frac{1-\gamma^{n}}{1-\gamma}\,\epsilon_{R}+\gamma^{n}\,\epsilon_{Q}\right).\tag{3}$$

This highlights the role of the scoring rollout steps $n$: each additional step replaces reliance on the bootstrap $Q_{\theta}$ with a direct reward-model observation. The bound $g(n)=\frac{1-\gamma^{n}}{1-\gamma}\,\epsilon_{R}+\gamma^{n}\,\epsilon_{Q}$ is monotonically decreasing in $n$ when $\epsilon_{Q}>\frac{\epsilon_{R}}{1-\gamma}$, which we expect in practice since $R_{\psi}$ directly observes the forecasted policy's behavior during rollouts while $Q_{\theta}$ was not trained on the forecasted policy. When $\hat{P}$ is a learned model, compounding prediction error provides a practical upper bound on useful $n$; in episodic tasks with a simulator, the episode length is a natural choice. Successful filtering keeps the buffer free of hacking transitions, helping maintain small $\epsilon_{R}$ and $\epsilon_{Q}$ over time.
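The monotonicity condition can be checked numerically. The sketch below is illustrative only: the function names and the example values of $\gamma$, $\epsilon_R$, and $\epsilon_Q$ are assumptions, not values from the paper. It relies on the algebraic rewrite $g(n)=\frac{\epsilon_R}{1-\gamma}+\gamma^{n}\bigl(\epsilon_Q-\frac{\epsilon_R}{1-\gamma}\bigr)$, which makes the sign of the trend in $n$ explicit.

```python
def evaluator_error_bound(n: int, gamma: float, eps_r: float, eps_q: float) -> float:
    """g(n) = (1 - gamma^n) / (1 - gamma) * eps_r + gamma^n * eps_q."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if n < 0:
        raise ValueError("n must be non-negative")
    reward_term = (1.0 - gamma ** n) / (1.0 - gamma) * eps_r  # n reward-model steps
    bootstrap_term = gamma ** n * eps_q                        # final Q-value bootstrap
    return reward_term + bootstrap_term


def rejection_threshold(n: int, gamma: float, eps_r: float, eps_q: float) -> float:
    """True-return drops strictly above 2 * g(n) are rejected under Eq. (3)."""
    return 2.0 * evaluator_error_bound(n, gamma, eps_r, eps_q)


# Hypothetical values: eps_q = 0.5 exceeds eps_r / (1 - gamma) = 0.1,
# so the bound shrinks as the scoring horizon n grows.
bounds = [evaluator_error_bound(n, 0.9, 0.01, 0.5) for n in range(50)]
```

When $\epsilon_Q$ falls below $\epsilon_R/(1-\gamma)$ the same function increases with $n$, so longer scoring rollouts only help while the bootstrap is the dominant error source.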
##### Why forecasting improves detection\.
MCVL scores each branch with $k$ rollouts of length $n$ under the frozen evaluator. Because evaluator parameters are frozen and $n$ is fixed, the variance of the per-rollout bootstrapped return is bounded by some $\sigma^{2}$ independently of the forecast budget $l$. Using paired rollouts with shared start states exploits positive covariance between the two branch estimates, so the variance of the estimator $\hat{J}_{k}(\tilde{\pi}^{\,0})-\hat{J}_{k}(\tilde{\pi}^{\,+})$ is at most $2\sigma^{2}/k$, independent of $l$.
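The variance bound follows in one step. The derivation below is a sketch under two assumptions that the text states only implicitly: the $k$ rollouts within a branch are independent, so each branch average has variance at most $\sigma^{2}/k$, and pairing makes the covariance between the two branch averages non-negative.

```latex
\operatorname{Var}\!\bigl(\hat{J}_{k}(\tilde{\pi}^{\,0})-\hat{J}_{k}(\tilde{\pi}^{\,+})\bigr)
  = \operatorname{Var}\!\bigl(\hat{J}_{k}(\tilde{\pi}^{\,0})\bigr)
  + \operatorname{Var}\!\bigl(\hat{J}_{k}(\tilde{\pi}^{\,+})\bigr)
  - 2\operatorname{Cov}\!\bigl(\hat{J}_{k}(\tilde{\pi}^{\,0}),\hat{J}_{k}(\tilde{\pi}^{\,+})\bigr)
  \;\leq\; \frac{\sigma^{2}}{k}+\frac{\sigma^{2}}{k}-0
  \;=\; \frac{2\sigma^{2}}{k}.
```

None of the terms on the right involves $l$, which is why the noise level stays fixed as the forecast budget grows.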
The expected scoring gap $\Delta(l)=\hat{J}(\tilde{\pi}^{\,0}_{l})-\hat{J}(\tilde{\pi}^{\,+}_{l})$ (positive when the hacking branch scores lower) in general grows with $l$: each additional forecast step gives the base learner more room to translate the hacking transition into a policy change the evaluator can distinguish. So increasing the forecast budget improves detection without increasing scoring cost.
At $l=1$ (the *Each-step* variant), a single gradient step barely changes the policy, especially before the critic has begun to assign high value to the hacking behavior, so $\Delta(1)\approx 0$ and the check is unreliable. Increasing $l$ to the point where the hacking transition induces a meaningful policy divergence produces a gap the evaluator can reliably detect, explaining why the Each-step variant fails while moderate $l$ succeeds ([5(a)](https://arxiv.org/html/2606.28955#A3.F5.sf1)).
## Appendix BFeasibility of Occupancy\-Regularized Objectives
It is trivial to show that regularizing toward a safe policy either matches the frozen safe policy or reward hacks, depending on whether a high or low regularization coefficient is selected. Instead, we test whether the ORPO-style objective of Laidlaw et al. ([2024](https://arxiv.org/html/2606.28955#bib.bib13)) could, *in principle*, select the desired behavior in our settings. For each gridworld environment we train DDQN Q-functions for *Frozen* (safe, post-pretraining), *Hacking* (trained on observed reward), and *Oracle* (trained on true reward). From these Q-functions, we derive stochastic policies via (i) softmax over Q-values and (ii) $\epsilon$-greedy with $\epsilon=0.05$. We estimate occupancy measures and empirical on-policy discounted episodic returns under observed reward $J_{\pi}$ with 1000 rollouts. We compute the ORPO objective $F(\pi,\pi_{\mathrm{Frozen}})=J(\pi)-\lambda D(\mu_{\pi}\|\mu_{\pi_{\mathrm{Frozen}}})$ for $D\in\{\mathrm{KL},\sqrt{\chi^{2}}\}$. We record the fraction of seeds (out of 10) for which some $\lambda>0$ satisfies *both* $F(\pi_{\mathrm{Oracle}},\pi_{\mathrm{Frozen}})>F(\pi_{\mathrm{Frozen}},\pi_{\mathrm{Frozen}})$ and $F(\pi_{\mathrm{Oracle}},\pi_{\mathrm{Frozen}})>F(\pi_{\mathrm{Hacking}},\pi_{\mathrm{Frozen}})$.

Let $J_{O},J_{F},J_{H}$ denote the observed returns of the Oracle, Frozen, and Hacking policies; let $D_{O}=D(\mu_{O}\|\mu_{F})$ and $D_{H}=D(\mu_{H}\|\mu_{F})$. The first inequality gives $\lambda<\frac{J_{O}-J_{F}}{D_{O}}$ when $D_{O}>0$ (and requires $J_{O}>J_{F}$ when $D_{O}=0$). The second inequality is $\lambda(D_{H}-D_{O})>J_{H}-J_{O}$, which yields three cases: if $D_{H}>D_{O}$, then $\lambda>\frac{J_{H}-J_{O}}{D_{H}-D_{O}}$; if $D_{H}<D_{O}$, then $\lambda<\frac{J_{H}-J_{O}}{D_{H}-D_{O}}$; if $D_{H}=D_{O}$, it requires $J_{O}>J_{H}$. A seed is feasible iff the resulting open interval intersects $(0,+\infty)$. Per-seed feasibility does not imply a single global $\lambda$ across seeds. We present results in [Table 1](https://arxiv.org/html/2606.28955#A2.T1).
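The case analysis above reduces to intersecting at most two half-lines with $(0,+\infty)$. The following sketch shows one way to compute the per-seed feasible interval; the function name and the numeric inputs are hypothetical, and the divergences are assumed to be non-negative.

```python
import math
from typing import Optional, Tuple


def feasible_lambda_interval(
    j_o: float, j_f: float, j_h: float, d_o: float, d_h: float
) -> Optional[Tuple[float, float]]:
    """Open interval (lo, hi) of lambda > 0 ranking Oracle above Frozen and Hacking.

    Returns None when no such lambda exists. Assumes d_o, d_h >= 0.
    """
    lo, hi = 0.0, math.inf

    # First inequality: J_O - lambda * D_O > J_F (Frozen has zero divergence).
    if d_o > 0:
        hi = min(hi, (j_o - j_f) / d_o)
    elif not j_o > j_f:
        return None

    # Second inequality: lambda * (D_H - D_O) > J_H - J_O.
    gap = d_h - d_o
    if gap > 0:
        lo = max(lo, (j_h - j_o) / gap)
    elif gap < 0:
        hi = min(hi, (j_h - j_o) / gap)
    elif not j_o > j_h:
        return None

    # The open interval is non-empty only if lo < hi.
    return (lo, hi) if lo < hi else None
```

For example, with hypothetical returns $J_O=5$, $J_F=3$, $J_H=10$, a Hacking policy far from the reference ($D_O=1$, $D_H=5$) leaves the interval $(1.25, 2)$, whereas a Hacking policy closer to the reference than the Oracle ($D_O=4$, $D_H=1$) leaves no feasible $\lambda$.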
Table 1: Percentage of seeds (of 10) where a regularization weight $\lambda>0$ exists that ranks the Oracle policy above both Frozen and Hacking under an ORPO-like objective. In many cases, no such $\lambda$ exists, suggesting that occupancy regularization fails to suppress high-value hacks without also suppressing Oracle-like improvements. In contrast, MCVL achieves performance comparable to the Oracle across all these tasks.
## References
- Amodei et al\. \(2016\)Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané\.Concrete problems in AI safety\.*ArXiv preprint*, 2016\.
- Denison et al\. \(2024\)Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, et al\.Sycophancy to subterfuge: Investigating reward\-tampering in large language models\.*ArXiv preprint*, 2024\.
- Everitt & Hutter \(2016\)Tom Everitt and Marcus Hutter\.Avoiding wireheading with value reinforcement learning\.In*Artificial General Intelligence*, 2016\.
- Everitt et al\. \(2016\)Tom Everitt, Daniel Filan, Mayank Daswani, and Marcus Hutter\.Self\-modification of policy and utility function in rational agents\.In*Artificial General Intelligence*, 2016\.
- Everitt et al\. \(2021\)Tom Everitt, Marcus Hutter, Ramana Kumar, and Victoria Krakovna\.Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective\.*Synthese*, \(Suppl 27\), 2021\.
- Fujimoto et al\. \(2018\)Scott Fujimoto, Herke van Hoof, and David Meger\.Addressing function approximation error in actor\-critic methods\.In*ICML*, Proceedings of Machine Learning Research, 2018\.
- Ghesu et al\. \(2017\)Florin\-Cristian Ghesu, Bogdan Georgescu, Yefeng Zheng, Sasa Grbic, Andreas Maier, Joachim Hornegger, and Dorin Comaniciu\.Multi\-scale deep reinforcement learning for real\-time 3d\-landmark detection in ct scans\.*IEEE transactions on pattern analysis and machine intelligence*, \(1\), 2017\.
- Hibbard \(2012\)Bill Hibbard\.Model\-based utility functions\.*Journal of Artificial General Intelligence*, \(1\), 2012\.
- Huang et al\. \(2022\)Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G\.M\. Araújo\.Cleanrl: High\-quality single\-file implementations of deep reinforcement learning algorithms\.*Journal of Machine Learning Research*, \(274\), 2022\.
- Kiran et al\. \(2021\)B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez\.Deep reinforcement learning for autonomous driving: A survey\.*IEEE Transactions on Intelligent Transportation Systems*, \(6\), 2021\.
- Krakovna et al\. \(2020\)Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, and Shane Legg\.Specification gaming: the flip side of AI ingenuity\.*DeepMind Blog*, 2020\.
- Kumar et al\. \(2020\)Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, and Shane Legg\.REALab: An embedded perspective on tampering\.*ArXiv preprint*, 2020\.
- Laidlaw et al\. \(2024\)Cassidy Laidlaw, Shivam Singhal, and Anca Dragan\.Correlated proxies: A new definition and improved mitigation for reward hacking, 2024\.
- Leike et al\. \(2017\)Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg\.AI safety gridworlds\.*ArXiv preprint*, 2017\.
- Leike et al\. \(2018\)Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg\.Scalable agent alignment via reward modeling: a research direction\.*ArXiv preprint*, 2018\.
- Liu et al\. \(2026\)Zixuan Liu, Xiaolin Sun, and Zizhan Zheng\.Robust optimization for mitigating reward hacking with correlated proxies\.In*ICLR*, 2026\.
- MacDiarmid et al\. \(2025\)Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, and Evan Hubinger\.Natural emergent misalignment from reward hacking in production RL\.*ArXiv preprint*, 2025\.
- Majha et al\. \(2019\)Arushi Majha, Sayan Sarkar, and Davide Zagami\.Categorizing wireheading in partially embedded agents\.*ArXiv preprint*, 2019\.
- Ng et al\. \(1999\)A\. Ng, Daishi Harada, and Stuart J\. Russell\.Policy invariance under reward transformations: Theory and application to reward shaping\.In*ICML*, 1999\.
- OpenAI \(2023\)OpenAI\.Faulty reward functions\.[https://openai\.com/research/faulty\-reward\-functions](https://openai.com/research/faulty-reward-functions), 2023\.Accessed: 2024\-04\-10\.
- OpenAI \(2024\)OpenAI\.Openai o1 system card\.[https://openai\.com/index/openai\-o1\-system\-card/](https://openai.com/index/openai-o1-system-card/), 2024\.Accessed: 2024\-09\-26\.
- Orseau & Ring \(2011\)Laurent Orseau and Mark Ring\.Self\-modification and mortality in artificial agents\.In*Artificial General Intelligence*, 2011\.
- Pan et al\. \(2022\)Alexander Pan, Kush Bhatia, and Jacob Steinhardt\.The effects of reward misspecification: Mapping and mitigating misaligned models\.In*ICLR*, 2022\.
- Popov et al\. \(2017\)Ivaylo Popov, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth\-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, and Martin Riedmiller\.Data\-efficient deep reinforcement learning for dexterous manipulation\.*ArXiv preprint*, 2017\.
- Schmidhuber \(2003\)Jürgen Schmidhuber\.Gödel machines: self\-referential universal problem solvers making provably optimal self\-improvements\.*arXiv preprint cs/0309048*, 2003\.
- Skalse et al\. \(2022\)Joar Skalse, Nikolaus H\. R\. Howe, Dmitrii Krasheninnikov, and David Krueger\.Defining and characterizing reward gaming\.In*NeurIPS*, 2022\.
- Sutton & Barto \(2018\)Richard S Sutton and Andrew G Barto\.*Reinforcement learning: An introduction*\.2018\.
- Taylor et al\. \(2016\)Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch\.Alignment for advanced machine learning systems\.*Ethics of Artificial Intelligence*, 2016\.
- Towers et al\. \(2024\)Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al\.Gymnasium: A standard interface for reinforcement learning environments\.*ArXiv preprint*, 2024\.
- van Hasselt et al\. \(2016\)Hado van Hasselt, Arthur Guez, and David Silver\.Deep reinforcement learning with double q\-learning\.In*AAAI*, 2016\.
- Yampolskiy \(2014\)Roman V Yampolskiy\.Utility function security in artificially intelligent agents\.*Journal of Experimental & Theoretical Artificial Intelligence*, \(3\), 2014\.
- Yudkowsky \(2011\)Eliezer Yudkowsky\.Complex value systems in friendly ai\.In*Artificial General Intelligence: 4th International Conference, AGI 2011, Mountain View, CA, USA, August 3\-6, 2011\. Proceedings 4*\. Springer, 2011\.
Supplementary Materials
*The following content was not necessarily subject to peer review\.*
## Appendix CAdditional Experiments
Figure 5: (a) Forecasting training steps: sensitivity to the number of forecasting training steps $l$ in Box Moving. (b) Box Moving *No-Hack*: results in the *No-Hack* version of Box Moving. (c) Number of supervisors: varying the number of supervisors in Absent Supervisor. (d) Removing walls in *Absent Supervisor*: a variant of Absent Supervisor where a shorter path becomes available in *Full*.

In [5(a)](https://arxiv.org/html/2606.28955#A3.F5.sf1), we investigate the number of forecasting training steps $l$ needed to avoid undesired behavior in Box Moving. With too few training steps, some undesired transitions are not rejected, yet our algorithm still slows the learning of reward hacking behavior.
In[5\(b\)](https://arxiv.org/html/2606.28955#A3.F5.sf2), we examine the behavior of MC\-DDQN in the*No\-Hack*version of*Box Moving*\([Figure 1](https://arxiv.org/html/2606.28955#S4.F1)\)\. In this version, the agent receives a \+5 reward on the top cell which does not interfere with moving the box upward\. As anticipated, in this scenario our agent does not reject transitions and learns the optimal policy\.
We also conducted experiments in*Absent Supervisor*, varying the number of supervisors\. In[5\(c\)](https://arxiv.org/html/2606.28955#A3.F5.sf3), increasing the number of supervisors from 1 to 10 leads to less consistent detection of transitions that induce reward hacking, despite the change being purely visual\. Qualitative analysis revealed that our neural networks struggled to adapt to this distribution shift \(all gridworld environments are encoded as multi\-hot vectors\), resulting in predicted rewards deviating significantly from the ground truth\.
Furthermore, we explored the impact of removing two walls from*Absent Supervisor*after training in*Safe*\. Without these two walls, a shorter path to the goal is available that bypasses the punishment cell, although going through the punishment cell remains faster\. In[5\(d\)](https://arxiv.org/html/2606.28955#A3.F5.sf4), it is evident that while our algorithm can learn a better policy that avoids the punishment cell, the rejection of reward hacking transitions becomes less reliable\. This decline is attributed to the increased distribution shift between*Safe*and*Full*\.
## Appendix DLearned Transition Model
Figure 6: MC-DDQN with a learned transition model in *Box Moving*.

Scoring uses a transition model $\hat{P}$ solely to *compare* two candidate policies under a frozen evaluator; exact dynamics are unnecessary as long as the evaluator continues to rank hacking trajectories below non-hacking ones. To verify that a learned model suffices, we train a forward dynamics model and use it in place of the environment during scoring rollouts. The model is a two-hidden-layer MLP with 128 units per layer, ReLU activations, and layer normalization. It takes the concatenation of the current observation and a one-hot encoded action as input and predicts the next observation. The model is trained with MSE loss using Adam (learning rate $10^{-2}$) on 50 episodes of random exploration data (1000 gradient steps, batch size 256) and frozen during deployment. We run MC-DDQN in *Box Moving* with the same hyperparameters as in the main experiments, replacing only the transition source. As shown in [Figure 6](https://arxiv.org/html/2606.28955#A4.F6), MC-DDQN with the learned $\hat{P}$ avoids reward hacking and reaches Oracle performance, matching the result obtained with the true environment. This supports the claim that approximate dynamics suffice for reliable gating.
## Appendix EImplementation Details of MC\-DDQN
Algorithm 2 Policy Forecasting
Input: Set of transitions $T$, replay buffer $D$, current Q-network parameters $\theta$, training steps $l$
Output: Forecasted policy $\pi_{f}$
1: $\theta_{f}\leftarrow\textsc{Copy}(\theta)$ ⊳ Copy current Q-network parameters
2: for training step $t=1$ to $l$ do
3: Sample random mini-batch $B$ of transitions from $D$
4: $\theta_{f}\leftarrow\textsc{TrainDDQN}(\theta_{f},B\cup T)$ ⊳ $T$ is added to each batch for deterministic environments
5: end for
6: return $\pi_{f}(s)=\operatorname{arg\,max}_{a}Q_{\theta_{f}}(s,a)$ ⊳ Return forecasted policy
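The structure of Algorithm 2 can be illustrated with a much simpler learner. The sketch below is not the authors' implementation: it swaps DDQN for a tabular Q-learning update, ignores terminal states, and uses hypothetical names (`forecast_policy`, `q_table`, `extra`) and hyperparameters. What it keeps from the algorithm is the copy of the live parameters, the $l$ update steps, and the addition of $T$ to every mini-batch.

```python
import copy
import random


def forecast_policy(q_table, extra, replay_buffer, steps, actions,
                    batch_size=4, lr=0.5, gamma=0.9, seed=0):
    """Forecast a greedy policy after `steps` updates on replay data plus `extra`.

    q_table maps (state, action) -> value; transitions are (s, a, r, s_next).
    Tabular Q-learning stands in for TrainDDQN (an assumption of this sketch).
    """
    rng = random.Random(seed)
    q = copy.deepcopy(q_table)  # forecast on a copy; the live table is untouched
    for _ in range(steps):
        sampled = rng.sample(replay_buffer, min(batch_size, len(replay_buffer)))
        batch = sampled + list(extra)  # B ∪ T: checked transitions join every batch
        for s, a, r, s_next in batch:
            target = r + gamma * max(q.get((s_next, b), 0.0) for b in actions)
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + lr * (target - old)
    # Greedy policy with respect to the forecasted values.
    return lambda s: max(actions, key=lambda b: q.get((s, b), 0.0))


# Toy usage: one state, a safe action in the buffer, a high-reward candidate.
live_q = {}
buffer = [(0, "up", 1.0, 0)]
pi_plus = forecast_policy(live_q, [(0, "hack", 10.0, 0)], buffer, 5, ["up", "hack"])
pi_zero = forecast_policy(live_q, [], buffer, 5, ["up", "hack"])
```

Calling the function twice from the same `live_q`, once with the candidate transition and once without, yields the two branches that the scoring step then compares.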
Algorithm 3 Scoring
Input: Policy $\pi$, transition model $\hat{P}$, return estimator parameters $\theta$ and $\psi$, initial states $\rho$, rollout steps $n$, number of rollouts $k$
Output: Estimated bootstrapped return of the policy $\pi$
1: for rollout $r=1$ to $k$ do
2: $g_{r}\leftarrow 0$ ⊳ Initialize return for this rollout
3: $s_{0}\sim\rho$ ⊳ Sample an initial state
4: $a_{0}\leftarrow\pi(s_{0})$ ⊳ Get action from policy
5: for step $t=0$ to $n-1$ do
6: $g_{r}\leftarrow g_{r}+\gamma^{t}R_{\psi}(s_{t},a_{t})$ ⊳ Accumulate predicted reward
7: $s_{t+1}\sim\hat{P}(s_{t},a_{t})$ ⊳ Sample next state from transition model
8: $a_{t+1}\leftarrow\pi(s_{t+1})$ ⊳ Get action from policy
9: end for
10: $g_{r}\leftarrow g_{r}+\gamma^{n}Q_{\theta}(s_{n},a_{n})$ ⊳ Add final Q-value
11: end for
12: return $\frac{1}{k}\sum_{r=1}^{k}g_{r}$ ⊳ Return average return over rollouts
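Algorithm 3 translates almost line for line into plain Python. The sketch below is a minimal illustration rather than the authors' code: `score` and its callable arguments `policy`, `transition`, `reward_model`, and `q_function` are hypothetical stand-ins for $\pi$, $\hat{P}$, $R_{\psi}$, and $Q_{\theta}$, the transition model is assumed deterministic, and start states are passed in explicitly so that both branches can share them.

```python
def score(policy, transition, reward_model, q_function, start_states, n, gamma):
    """Average n-step bootstrapped return of `policy` over the given start states."""
    if not start_states:
        raise ValueError("need at least one start state")
    total = 0.0
    for s in start_states:          # one rollout per start state (k = len(start_states))
        a = policy(s)
        g = 0.0
        for t in range(n):
            g += gamma ** t * reward_model(s, a)   # accumulate predicted reward
            s = transition(s, a)                   # step the (frozen) transition model
            a = policy(s)
        g += gamma ** n * q_function(s, a)         # bootstrap with the frozen Q-value
        total += g
    return total / len(start_states)


# Toy chain: "right" moves one cell and is predicted to earn reward 1.
step = lambda s, a: s + 1 if a == "right" else s
r_hat = lambda s, a: 1.0 if a == "right" else 0.0
always_right = lambda s: "right"
always_stay = lambda s: "stay"
```

Scoring both forecasted policies on the same `start_states` list is one simple way to obtain the paired rollouts discussed in Appendix A.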
Algorithm 4: Modification-Considering Double Deep Q-learning (MC-DDQN)
Input: Pretrained return estimator parameters $\theta$ and $\psi$, replay buffer $D$, transition model $\hat{P}$, initial states $\rho$, rollout steps $n$, number of rollouts $k$, forecasting train steps $l$, threshold $\delta_{r}$.
Output: Trained Q-network and reward model
1: Initialize state $s_{0}$
2: for time step $t = 0$ to end of training do
3: $a_{t} \leftarrow \epsilon$-greedy($\operatorname{arg\,max}_{a} Q_{\theta}(s_{t}, a)$)
4: Execute action $a_{t}$, observe reward $r_{t}$, and transition to state $s_{t+1}$
5: $T_{t} \leftarrow (s_{t}, a_{t}, r_{t}, s_{t+1})$
6: if $\lvert r_{t} - R_{\psi}(s_{t}, a_{t}) \rvert < \delta_{r}$ then
7: $\mathit{accept} \leftarrow$ True
8: else
9: $\tilde{\pi}^{+} \leftarrow$ Forecast($\{T_{t}\}, D, \theta, l$) ▷ Forecast a policy with new transition
10: $\tilde{\pi}^{0} \leftarrow$ Forecast($\{\}, D, \theta, l$) ▷ Forecast a policy without new transition
11: $J_{\tilde{\pi}^{+}} \leftarrow$ Score($\tilde{\pi}^{+}, \hat{P}, \theta, \psi, \rho, n, k$) ▷ Estimate n-step bootstrapped return for $\tilde{\pi}^{+}$
12: $J_{\tilde{\pi}^{0}} \leftarrow$ Score($\tilde{\pi}^{0}, \hat{P}, \theta, \psi, \rho, n, k$) ▷ Estimate n-step bootstrapped return for $\tilde{\pi}^{0}$
13: $\mathit{accept} \leftarrow (J_{\tilde{\pi}^{+}} \geq J_{\tilde{\pi}^{0}})$ ▷ Accept if $\tilde{\pi}^{+}$ is not worse by current estimator
14: end if
15: if $\mathit{accept}$ then
16: Store transition $T_{t}$ in $D$
17: else
18: Reset the environment ▷ Without $T_{t}$ in $D$, future forecasted policies might fail to learn.
19: end if
20: Sample random mini-batch $B$ of transitions from $D$
21: $\theta \leftarrow$ TrainDDQN($\theta, B$) ▷ Update Q-network
22: $\psi \leftarrow$ Train($\psi, B$) ▷ Update reward model using $L_{2}$ loss
23: $s_{t} \leftarrow s_{t+1}$
24: end for
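The admission decision in lines 6 to 13 of Algorithm 4 can be sketched in Python. This is an illustrative sketch only: `reward_model`, `forecast`, and `score` are hypothetical callables (a real `forecast` would copy and train the Q-network for $l$ steps, and a real `score` would roll out the forecasted policy), and the function name `admit_transition` is not from the paper.

```python
def admit_transition(transition, reward_model, forecast, score, delta_r):
    """Return True if the transition may be stored in the replay buffer."""
    s, a, r, s_next = transition
    # Cheap pre-check: the observed reward agrees with the reward model,
    # so the expensive forecast-and-score comparison is skipped.
    if abs(r - reward_model(s, a)) < delta_r:
        return True
    policy_with = forecast([transition])   # branch trained with the transition
    policy_without = forecast([])          # counterfactual branch without it
    # Admit only if inclusion does not decrease the frozen estimator's score.
    return score(policy_with) >= score(policy_without)
```

On rejection, the caller would skip storing the transition and reset the environment, as in lines 17 and 18 of Algorithm 4.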
## Appendix F Implementation Details of MC-TD3
Our implementation is based on the implementation provided by Huang et al. ([2022](https://arxiv.org/html/2606.28955#bib.bib9)). The overall structure of the algorithm is consistent with MC-DDQN, described in [Appendix E](https://arxiv.org/html/2606.28955#A5), with key differences outlined below. TD3 is an actor-critic algorithm, meaning that the parameters $\theta$ define both a policy (actor) and a Q-function (critic). In [Algorithm 2](https://arxiv.org/html/2606.28955#alg2) and [Algorithm 4](https://arxiv.org/html/2606.28955#alg4), calls to TrainDDQN are replaced with TrainTD3, which updates the actor and critic parameters $\theta$ as specified by Fujimoto et al. ([2018](https://arxiv.org/html/2606.28955#bib.bib6)). Additionally, in [Algorithm 2](https://arxiv.org/html/2606.28955#alg2), the returned policy $\pi_{f}(s)$ corresponds to the actor rather than $\arg\max_{a} Q_{\theta}(s,a)$, and in [Algorithm 4](https://arxiv.org/html/2606.28955#alg4) the action executed in the environment is also selected by the actor.
##### Forecast implementation detail\.
In [Algorithm 2](https://arxiv.org/html/2606.28955#alg2), the transition set $T$ is added to each forecast minibatch ($B \cup T$) rather than relying on stochastic replay sampling from $D \cup T$. This is an implementation choice to reduce variance in the branch comparison by ensuring the checked transition is represented in the “with-$T$” branch under a fixed update budget. The theoretical analysis is unchanged, because it only assumes an abstract $\mathsf{Forecast}(\cdot, l)$ operator.
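A minimal Python sketch of this minibatch construction is given below. The helper name `forecast_minibatch` and the list-based buffer are assumptions for illustration; the only behavior taken from the text is that the checked transitions are appended to every sampled batch rather than mixed into the buffer.

```python
import random

def forecast_minibatch(buffer, checked, batch_size, rng):
    """Sample B from the replay buffer and return B ∪ T for one forecast update."""
    batch = rng.sample(buffer, min(batch_size, len(buffer)))
    # The checked transitions are appended deterministically, so the
    # "with-T" branch sees them on every update regardless of sampling luck.
    return batch + list(checked)
```

Passing generators seeded identically to the two branches makes them draw the same $B$, so the forecasts differ only in the presence of $T$.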
## Appendix G Details of the Continuous-Control Experiments
All three continuous-control environments follow the standard procedure where the policy and critics are first trained on transitions collected by a random policy. No *Safe* variant is required because random exploration is unlikely to simultaneously satisfy the velocity and physical-configuration conditions that trigger hacking or keep the arm in place for multiple frames in Reacher. This demonstrates that MCVL can function effectively without a separate safe environment whenever the hacking affordance is difficult to discover by chance.
##### Reacher\.
The rewards in the original Reacher\-v5 environment are calculated as the sum of the negative distance to the target and the negative joint actuation strength\. This reward structure encourages the robotic arm to reach the target while minimizing large, energy\-intensive actions\. The target’s position is randomized at the start of each episode, and random noise is added to the joint rotations and velocities\. Observations include the angles and angular velocities of each joint, the target’s coordinates, and the difference between the target’s coordinates and the coordinates of the arm’s end\. Actions consist of torques applied to the joints, and each episode is truncated after 50 steps\.
We modified the environment by introducing a \+50 reward when the arm’s end remains within a small, fixed region for 15 consecutive steps\. This region remains unchanged across episodes, simulating a scenario where the robot can tamper with its reward function by holding a button, but such behavior is difficult to discover\. In our setup, a reward\-tampering policy is highly unlikely to emerge through random actions and is typically discovered only when the target happens to be inside the reward\-tampering region\.
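The modified Reacher reward can be sketched as a small stateful shaper. This is a sketch under assumptions that the text does not specify: the region is modeled as a circle with hypothetical `center` and `radius`, and the bonus is paid on every step once the streak reaches 15 steps.

```python
import math

class RegionHoldBonus:
    """Adds a bonus once the arm's end has stayed in a fixed region long enough."""

    def __init__(self, center, radius, hold_steps=15, bonus=50.0):
        self.center = center          # hypothetical region center (x, y)
        self.radius = radius          # hypothetical region radius
        self.hold_steps = hold_steps  # 15 consecutive steps, as in the text
        self.bonus = bonus            # +50 reward, as in the text
        self.streak = 0

    def reset(self):
        """Call at the start of each episode."""
        self.streak = 0

    def __call__(self, base_reward, fingertip_xy):
        inside = math.dist(fingertip_xy, self.center) <= self.radius
        self.streak = self.streak + 1 if inside else 0
        if self.streak >= self.hold_steps:
            return base_reward + self.bonus
        return base_reward
```

Because the region is fixed across episodes while the target is randomized, the streak is rarely completed unless the target happens to lie inside the region.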
In our experiments, TD3 consistently discovers the hack when the goal in one of the episodes falls within the hacking region\. MC\-TD3 successfully filters out such transitions\.
##### Ant\.
The standard Ant-v5 environment rewards forward movement (velocity in the $x$-direction), adds a healthy-posture bonus, and penalizes control effort and contact forces. We augment it with a sensor-malfunction model motivated by real-world IMU failures in legged robots: when the total contact-force norm exceeds a threshold (400) and the agent’s forward velocity is above a minimum (3.5), the reward is augmented with a bonus equal to 10 times the contact force. This models the self-reinforcing loop in which aggressive gaits produce high ground-reaction forces that saturate accelerometers, causing inflated velocity readings that feed back into the reward. The performance metric is the reward *before* the sensor-corruption bonus is applied. Random exploration is unlikely to simultaneously achieve the required high velocity and high contact forces.
In our experiments, the TD3 learner trained with this bonus achieved a lower evaluation return than the Oracle, which did not receive the bonus ([Figure 3(f)](https://arxiv.org/html/2606.28955#S4.F3.sf6)). We hypothesize that pursuing transitions that achieved high contact forces in the past is brittle and does not lead to stable policies.
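The Ant corruption rule can be written as a short function. This is a sketch for illustration: the function and argument names are hypothetical, and strict inequalities are assumed from the words "exceeds" and "above".

```python
def ant_observed_reward(true_reward, contact_force_norm, forward_velocity,
                        force_threshold=400.0, velocity_threshold=3.5,
                        scale=10.0):
    """Proxy reward seen by the learner; `true_reward` is the evaluation metric."""
    triggered = (contact_force_norm > force_threshold
                 and forward_velocity > velocity_threshold)
    if triggered:
        # Bonus equal to 10 times the contact force.
        return true_reward + scale * contact_force_norm
    return true_reward
```

Both conditions must hold at once, which is why random exploration rarely produces a corrupted transition.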
##### HalfCheetah\.
The standard HalfCheetah-v5 environment rewards forward velocity and penalizes control effort. We augment it with an optical-flow sensor failure model: when the front tip of the cheetah drops below a height threshold (-0.23) while the forward velocity exceeds a minimum (6), a bonus of 1 is added for each millimeter of the tip’s negative height. This models the velocity over-estimation that occurs when a downward-facing optical-flow sensor operates too close to the ground, where proximity saturation and height-division error compound. The performance metric excludes this bonus. Random exploration rarely produces the combination of high forward speed and low front-tip height needed to trigger the hack.
In HalfCheetah, MC\-TD3’s mean falls slightly below the Oracle though within the 95% CI, likely because of occasional over\-rejection of legitimate transitions or due to environment resets when hacking is discovered by exploration\.
## Appendix H Frozen Reward Model Baseline
To isolate the contribution of the forecast-and-score gate from that of merely having a learned reward model, we evaluate a *Freeze RM* baseline. Unlike the *Frozen* baseline of [Figure 3](https://arxiv.org/html/2606.28955#S4.F3), which stops policy and critic learning after pretraining, here only the reward model is frozen: the policy and critic continue to train, but bootstrap on the frozen pretrained reward model $R_{\psi}$ instead of the observed proxy reward, and *every* transition is admitted (no gate). The reward model is frozen at the end of pretraining (for continuous control, at the switch from random exploration to the learned policy). All other components (pretrained initialization, replay buffer, and hyperparameters) are identical to MCVL (MC-DDQN in the gridworlds, MC-TD3 in continuous control). Freezing $R_{\psi}$ is necessary because, without a gate, a live reward model would simply fit the hacking bonus and the baseline would collapse to the base learner.
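The difference from the base learner can be sketched as a change to the TD target. This is a simplified sketch, not the authors' implementation: `frozen_reward_model` is a hypothetical callable for the pretrained $R_{\psi}$, and `next_value` abstracts whichever bootstrap the learner uses (the double-Q target in DDQN or the clipped double-Q target in TD3).

```python
def freeze_rm_td_target(transition, frozen_reward_model, next_value, gamma, done):
    """TD target that ignores the observed proxy reward in the transition."""
    s, a, _observed_reward, _s_next = transition
    r_hat = frozen_reward_model(s, a)   # prediction from the frozen R_psi
    bootstrap = 0.0 if done else gamma * next_value
    return r_hat + bootstrap
```

Every transition is still stored and trained on; only the reward label changes, which is why errors in the frozen model can be exploited by the policy.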
Figure 7: Freeze RM baseline across all environments. Panels: (a) Box Moving, (b) Absent Supervisor, (c) Tomato Watering, (d) Rocks & Diamonds, (e) Reacher, (f) Ant, (g) HalfCheetah. Top: true performance (unobserved objective); bottom: observed return (proxy). Same format as [Figure 3](https://arxiv.org/html/2606.28955#S4.F3), with the Freeze RM baseline (brown) added. The dashed brown line in the continuous-control panels is the reward model’s own predicted return for Freeze RM (the quantity it maximizes). Bold lines: mean over 10 seeds; bands: bootstrapped 95% CI.

[Figure 7](https://arxiv.org/html/2606.28955#A8.F7) reports Freeze RM against MCVL and the Oracle over 10 seeds per environment. On the *true* objective, Freeze RM matches the Oracle in five of the seven tasks (Box Moving, Rocks & Diamonds, Reacher, Ant, and HalfCheetah): freezing the reward model immediately after pretraining is, perhaps surprisingly, enough to recover Oracle-level performance. The two exceptions show that the gate can offer an advantage in some environments. In *Absent Supervisor*, Freeze RM’s true performance collapses far below the Oracle (paired $t$-test vs. MC-DDQN, $p < 10^{-4}$; $9/10$ seeds fall below the worst Oracle seed), whereas MC-DDQN stays at Oracle level. The observed return also stays low, which indicates that the reward model is not accurately capturing the true reward. In *Tomato Watering*, Freeze RM exhibits the classic hacking signature, with the observed return inflating well above the achievable true return while true performance drops; however, MCVL’s mitigation in this environment relies on a non-delusional transition model used by the gate, and since Freeze RM has no gate, the comparison conflates the gate with the world model and cannot be made fair.
Within continuous control, we observe a high proxy return for Freeze RM in the Ant environment. It is significantly higher than MC-TD3’s (paired $t$-test, $p < 0.01$; $\approx 2.7\times$), and its reward-model-predicted return (dashed) tracks the inflated proxy rather than the true performance. In the single seed where this is most severe, the policy exploits errors in the frozen $R_{\psi}$, driving the predicted return far above any achievable true return while true performance collapses; because only one seed is affected, the effect on mean true performance is small and not significant. Freeze RM is also somewhat more failure-prone than MCVL, with more seeds whose converged true performance falls below the worst Oracle seed (Ant 3 vs. 2, HalfCheetah 3 vs. 1).
Overall, the gate’s necessity is environment-dependent. Where a reward model fit on the pretraining data already captures the true reward, bootstrapping on it without a gate suffices; where it does not, as in *Absent Supervisor*, removing the gate leads to reward hacking and collapse. We therefore expect the gate’s contribution to grow in settings where a small pretraining set cannot capture the true reward.
## Appendix I Qualitative Observations
During preliminary experiments, we encountered instances where the algorithm failed to reject transitions that induce reward hacking\. Here we describe these occurrences and how they can be addressed\.
##### Return estimation rollout steps\.
When using much smaller rollout steps $n$, we noticed that during evaluation of forecasted trajectories, the non-hacking policy sometimes needed to traverse several states with low rewards to reach a high-reward region. In such cases, the reward hacking policy, which remained stationary, had a higher estimated utility. Increasing $n$ resolved this issue.
##### Forecasting without a counterfactual\.
Initially, we forecasted only one future policy by training with the checked transition added to each mini\-batch, and compared the resulting policy to the current one\. However, in some cases this led to situations where the copy learned better non\-hacking behaviors than the current policy simply because it was trained for longer\. The solution was to forecast two policies, one with the checked transition added to each mini\-batch and one without\.
##### Sensitivity to stochasticity\.
Evaluations in stochastic environments were noisy\. To mitigate this, we compared the two policies starting from the same set of states and using the same random seeds of the transition model\. We also kept the random seeds fixed while sampling mini\-batches\.
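This paired evaluation is an instance of common random numbers, sketched below in Python. The names `paired_scores` and `toy_rollout_return` are hypothetical, and the toy rollout merely stands in for a rollout under a stochastic transition model; the point is only that both policies consume identically seeded generators.

```python
import random

def paired_scores(policy_a, policy_b, rollout_return, seeds):
    """Score two policies on the same start states and the same model noise."""
    seeds = list(seeds)
    mean = lambda xs: sum(xs) / len(xs)
    # Each seed fixes the start state and all stochastic transitions,
    # so any score gap is due to the policies rather than sampling noise.
    a = mean([rollout_return(policy_a, random.Random(s)) for s in seeds])
    b = mean([rollout_return(policy_b, random.Random(s)) for s in seeds])
    return a, b

def toy_rollout_return(policy, rng):
    """Toy stand-in: random start state plus Gaussian transition noise."""
    start = rng.random()
    noise = rng.gauss(0.0, 1.0)
    return policy(start) + noise
```

With independent seeds per branch, the same comparison would need many more rollouts to separate two policies whose true scores differ only slightly.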
##### Handling rejected transitions\.
We observed that if a hacking-inducing transition was removed from the replay buffer and another such transition occurred in the same episode, the algorithm sometimes failed to detect it the second time because there was no set of transitions in the buffer connecting this second transition to the starting state. We therefore reset the environment on every rejection, as shown in [Algorithm 1](https://arxiv.org/html/2606.28955#alg1). In practical applications, it would be reasonable to assume that after detecting potential reward hacking, the agent would be returned to a safe state instead of continuing exploration. Alternatively, learning can simply be disabled until the end of the episode.
##### Irreversible changes\.
In *Rocks and Diamonds*, when comparing policies starting from the current state after the rock was pushed into the goal area, the comparison results were always the same, as it was impossible to move the rock out of the goal area. We addressed this by evaluating from the initial state of the environment. In cases where reset is not possible, the agent may store starting states in a buffer. This issue underscores the importance of future research into avoiding irreversible changes.
## Appendix J Computational Requirements
All gridworld experiments were conducted on workstations equipped with Intel® Core™ i9-13900K processors and NVIDIA® GeForce RTX™ 4090 GPUs. Continuous control experiments were conducted on NVIDIA® H100 MIG partitions (1/8 GPU, 8 vCPU cores). All experiments in the *Absent Supervisor* and *Tomato Watering* environments each required 12-14 GPU-hours, running 10 seeds in parallel. In *Rocks and Diamonds*, experiments took 1 GPU-day, while in *Box Moving* they required 2 hours each. For the continuous-control environments (1M training steps for Ant and HalfCheetah, 200k for Reacher), average per-seed wall-clock times on an H100 MIG partition were: Reacher TD3 75 min / MC-TD3 135 min ($1.8\times$), Ant TD3 68 min / MC-TD3 206 min ($3.0\times$), HalfCheetah TD3 65 min / MC-TD3 272 min ($4.2\times$). The variation in overhead across environments depends primarily on how frequently the reward-discrepancy check is triggered. In total, the main experiments described in [Section 4](https://arxiv.org/html/2606.28955#S4) required approximately 20 GPU-days, including around 6 GPU-days for baselines.
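The overhead factors follow directly from the reported wall-clock times, as this short Python check shows. The dictionary layout is only for illustration; the minute values are the ones quoted above.

```python
# (TD3 minutes, MC-TD3 minutes) per seed, as reported in the text.
wall_clock_minutes = {
    "Reacher": (75, 135),
    "Ant": (68, 206),
    "HalfCheetah": (65, 272),
}

# Overhead factor = MC-TD3 time divided by TD3 time, rounded to one decimal.
overhead = {
    env: round(mc_td3 / td3, 1)
    for env, (td3, mc_td3) in wall_clock_minutes.items()
}
```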
## Appendix K Hyperparameters of MC-DDQN
All hyperparameters are listed in [Table 2](https://arxiv.org/html/2606.28955#A11.T2). Our algorithm introduces several additional hyperparameters beyond those typically used by standard RL algorithms:

Table 2: Hyperparameters used for the experiments.

| Hyperparameter Name | Value |
| --- | --- |
| $Q_{\theta}$ and $R_{\psi}$ hidden layers | 2 |
| $Q_{\theta}$ and $R_{\psi}$ hidden layer size | 128 |
| $Q_{\theta}$ and $R_{\psi}$ activation function | ReLU |
| $Q_{\theta}$ and $R_{\psi}$ optimizer | Adam |
| $Q_{\theta}$ learning rate | 0.0001 |
| $R_{\psi}$ learning rate | 0.01 |
| $Q_{\theta}$ loss | SmoothL1 |
| $R_{\psi}$ loss | $L_{2}$ |
| Batch size | 32 |
| Discount factor $\gamma$ | 0.95 |
| Training steps on *Safe* | 10000 |
| Training steps on *Full* | 10000 |
| Replay buffer size | 10000 |
| Exploration steps | 1000 |
| Exploration $\epsilon_{\text{start}}$ | 1.0 |
| Exploration $\epsilon_{\text{end}}$ | 0.05 |
| Target network EMA coefficient | 0.005 |
| Forecasting training steps $l$ | 5000 |
| Scoring rollout steps $n$ | 30 |
| Number of scoring rollouts $k$ | 20 |
| Predicted reward difference threshold $\delta_{r}$ | 0.05 |
| Add transitions from transition model | False |

##### Reward model architecture and learning rate.
Hyperparameters specify the architecture and learning rate of the reward model $R_{\psi}$. Since learning a reward model is a supervised learning task, these hyperparameters can be tuned on a dataset of transitions collected by any policy. The reward model architecture may be chosen to match the Q-function $Q_{\theta}$.
##### Forecasting training steps $l$.
This parameter describes the number of updates to the Q-function needed to predict the future policy based on a new transition. As shown in [Figure 5(a)](https://arxiv.org/html/2606.28955#A3.F5.sf1), this value must be sufficiently large to update the learned values and corresponding policy. It can be selected by artificially adding a transition that alters the optimal policy and observing the number of training steps required to learn the new policy.
##### Scoring rollout steps $n$.
This parameter controls the length of the trajectories used to compare two forecasted policies. Longer trajectories provide more direct behavioral information about each policy and reduce reliance on the value-function bootstrap. The error decomposition in [Appendix A](https://arxiv.org/html/2606.28955#A1) confirms that, under expected conditions ($\epsilon_{Q} > \frac{\epsilon_{R}}{1-\gamma}$), the evaluator bound is monotonically decreasing in $n$, favoring longer rollouts. In episodic tasks with access to a simulator, a safe choice is the maximum episode length; in continuing tasks, a truncation horizon typically used in training may be suitable. When using a learned transition model, compounding prediction error provides a practical upper bound on useful $n$. Computational costs scale linearly in $n$ and can be reduced by choosing a smaller value based on domain knowledge.
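To see where this condition comes from, suppose the evaluator bound takes the usual $n$-step form, with per-step reward-model error $\epsilon_{R}$ and bootstrap error $\epsilon_{Q}$. This form is an assumption made here for illustration; the exact statement is the one in Appendix A.

```latex
\epsilon(n) \;=\; \epsilon_{R} \sum_{t=0}^{n-1} \gamma^{t} \;+\; \gamma^{n} \epsilon_{Q}
          \;=\; \epsilon_{R}\,\frac{1-\gamma^{n}}{1-\gamma} \;+\; \gamma^{n} \epsilon_{Q},
\qquad
\epsilon(n+1) - \epsilon(n) \;=\; \gamma^{n}\bigl(\epsilon_{R} - (1-\gamma)\,\epsilon_{Q}\bigr).
```

The difference is negative exactly when $\epsilon_{Q} > \frac{\epsilon_{R}}{1-\gamma}$, so under that condition each extra rollout step tightens the bound. With $\gamma = 0.95$, the condition reads $\epsilon_{Q} > 20\,\epsilon_{R}$.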
##### Number of scoring rollouts $k$.
This parameter specifies the number of trajectories obtained by rolling out each forecasted policy for comparison. The required number depends on the stochasticity of the environment and policies. If both the policy and environment are deterministic, $k$ can be set to 1. Otherwise, $k$ can be selected using domain knowledge or by measuring how many rollouts are required to distinguish between multiple known policies with different behaviors.
##### Predicted reward difference threshold $\delta_{r}$.
This threshold defines the minimum difference between the predicted and observed rewards for a transition to trigger a check. As discussed in [Section 4.3](https://arxiv.org/html/2606.28955#S4.SS3), this parameter is introduced only to reduce computations and can be set to 0. However, it can be adjusted based on domain knowledge to speed up training by minimizing unnecessary checks. The key requirement is that any reward hacking behavior must increase the reward by more than this threshold relative to the reward predicted by the reward model. In our gridworld and Reacher experiments, 0.05 performed well when rewards were normalized to $[-1,1]$. For Ant and HalfCheetah, where the reward scale is larger, we use $\delta_{r} = 2.0$.
### K.1 Environment-specific Parameters
The training steps in *Box Moving* were reduced to speed up training. *Tomato Watering* has many stochastic transitions because each tomato has a chance of drying out at each step. To increase the robustness of evaluations, we increased the number of scoring rollouts $k$. *Rocks and Diamonds* required more steps to converge to the optimal policy. Additionally, using the transition model to collect fresh data while forecasting in *Rocks and Diamonds* makes reward hacking detection more reliable. Each gridworld environment’s rewards were scaled to $[-1,1]$.

Table 3: Environment-specific hyperparameter overrides.

| Environment | Hyperparameter Name | Value |
| --- | --- | --- |
| Box Moving | Training steps on *Safe* | 1000 |
| Box Moving | Training steps on *Full* | 1000 |
| Box Moving | Replay buffer size | 1000 |
| Box Moving | Exploration steps | 100 |
| Box Moving | Forecasting training steps $l$ | 500 |
| Absent Supervisor | Number of supervisors | 1 |
| Absent Supervisor | Remove walls | False |
| Tomato Watering | Number of scoring rollouts $k$ | 100 |
| Rocks and Diamonds | Training steps on *Safe* | 15000 |
| Rocks and Diamonds | Training steps on *Full* | 15000 |
| Rocks and Diamonds | Forecasting training steps $l$ | 7500 |
| Rocks and Diamonds | Add transitions from transition model | True |
### K.2 Hyperparameters of MC-TD3
Table 4: Hyperparameters used for the MC-TD3 experiments (Reacher, Ant, HalfCheetah).

| Hyperparameter Name | Value |
| --- | --- |
| Actor, critic, and reward model hidden layers | 2 |
| Actor, critic, and reward model hidden layer size | 256 |
| Actor, critic, and reward model activation function | ReLU |
| Actor, critic, and reward model optimizer | Adam |
| Actor and critic learning rate | 0.0003 |
| $R_{\psi}$ learning rate | 0.003 |
| Batch size | 256 |
| Discount factor $\gamma$ | 0.99 |
| Training steps | 200000 |
| Replay buffer size | 200000 |
| Exploration steps | 30000 |
| Target networks EMA coefficient | 0.005 |
| Policy noise | 0.01 |
| Exploration noise | 0.1 |
| Policy update frequency | 2 |
| Forecasting training steps $l$ | 10000 |
| Scoring rollout steps $n$ | 50 |
| Number of scoring rollouts $k$ | 100 |
| Predicted reward difference threshold $\delta_{r}$ | 0.05 |

We did not perform extensive hyperparameter tuning; most hyperparameters are inherited from the implementation provided by Huang et al. ([2022](https://arxiv.org/html/2606.28955#bib.bib9)). The same MC-TD3 hyperparameters are used for all three continuous-control environments (Reacher, Ant, HalfCheetah), with the following environment-specific overrides:
Table 5: Environment-specific hyperparameter overrides for MC-TD3. Ant and HalfCheetah require more training steps (1M vs. 200k) because TD3 does not converge within 200k steps in these environments. The exploration budget is scaled proportionally. The reward difference threshold $\delta_{r}$ is increased from 0.05 to 2.0 to match the larger reward scale. The number of scoring rollout steps $n$ is set to the full episode length (1000).