Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying
Summary
This paper introduces ReMax, a new objective for reinforcement learning that induces exploration as an emergent property by evaluating policies based on expected maximum return over multiple samples, without explicit exploration bonuses. The authors derive a policy gradient formulation and propose RePPO, a PPO variant that achieves efficient exploration on MinAtar and Craftax benchmarks.
# Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying
Source: [https://arxiv.org/html/2606.00151](https://arxiv.org/html/2606.00151)
Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno, Toshinori Kitamura, Shin Ishii, Yutaka Matsuo
###### Abstract
In reinforcement learning (RL), agents benefit from exploration *only* because they repeatedly encounter similar states: trying different actions can then improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over $M$ samples ($M \in \mathbb{N}$), while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMaxPPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration on the MinAtar and Craftax benchmarks without any explicit exploration bonuses. The official code is available at [https://github.com/nissymori/remax-rl](https://github.com/nissymori/remax-rl).
Machine Learning, ICML
## 1 Introduction
Exploration is a central problem in reinforcement learning (RL) and has been extensively studied (Sutton and Barto, [2018](https://arxiv.org/html/2606.00151#bib.bib1); Haarnoja et al., [2018](https://arxiv.org/html/2606.00151#bib.bib30); Ziebart et al., [2008](https://arxiv.org/html/2606.00151#bib.bib34); Mnih et al., [2016](https://arxiv.org/html/2606.00151#bib.bib28)). A prevailing approach adds bonuses, such as entropy (Haarnoja et al., [2018](https://arxiv.org/html/2606.00151#bib.bib30); Ziebart et al., [2008](https://arxiv.org/html/2606.00151#bib.bib34)) or count-based bonuses (Bellemare et al., [2016](https://arxiv.org/html/2606.00151#bib.bib10); Ostrovski et al., [2017](https://arxiv.org/html/2606.00151#bib.bib46)), to the environment’s reward to encourage exploration explicitly. Other lines of work instantiate posterior sampling (Thompson, [1933](https://arxiv.org/html/2606.00151#bib.bib58)) using ensembles or Bayesian networks (Osband et al., [2016](https://arxiv.org/html/2606.00151#bib.bib38), [2018](https://arxiv.org/html/2606.00151#bib.bib39), [2019](https://arxiv.org/html/2606.00151#bib.bib40)). We study a distinct mechanism that drives exploration via greedy reward maximization.
The objective of RL is not to maximize the reward on the current trial, but rather to learn to maximize the rewards after several trials\. Exploration matters because it enables achieving higher rewards on subsequent trials\. Without the opportunity for retrying, exploration is unnecessary: the rational choice is the action currently believed to yield the highest reward\. Environmental uncertainty also motivates exploration, encouraging attempts at alternative actions for information gathering\. If there is no uncertainty, we do not need to explore: the problem is reduced to pure optimization\.
Building on this principle, that *decision-makers retry under uncertainty*, we propose the ReMax objective, which formalizes exploration as reward maximization under uncertainty. We first present ReMax in a bandit setting and compare it to the standard RL objective.
The ReMax objective (bandit). Let $K \geq 2$ be the number of actions, let $\mu = (\mu_1, \ldots, \mu_K)$ denote per-action values, and let $\Pi$ be a distribution over $\mu$ (e.g., the agent’s current belief/posterior over unknown values). For a policy $\pi \in \Delta^{K-1}$ (where $\Delta^{K-1}$ is the probability simplex over $K$ actions), $M \in \mathbb{N}$, and $[M] := \{1, \ldots, M\}$, we define
$$\begin{aligned} J_{\mathrm{RL}}(\pi) &:= \mathbb{E}_{A \sim \pi}\left[\mu_A\right], \\ J^{M}_{\mathrm{ReMax}}(\pi) &:= \mathbb{E}_{{\color{red}\mu \sim \Pi}}\left[\mathbb{E}_{{\color{blue}A_{[M]} \sim \pi}}\left[{\color{blue}\max_{m \in [M]}}\, \mu_{A_m} \mid \mu\right]\right]. \end{aligned} \tag{1}$$
Blue captures retrying: we score a policy by the best of $M$ draws. For $M = 1$ with fixed rewards, $J^{M}_{\mathrm{ReMax}}$ reduces to $J_{\mathrm{RL}}(\pi)$ and the optimum is deterministic (Sutton and Barto, [2018](https://arxiv.org/html/2606.00151#bib.bib1)); for $M \geq 2$ it can be stochastic (Sec. [2](https://arxiv.org/html/2606.00151#S2)). Red captures (epistemic) uncertainty (Ghosh et al., [2021](https://arxiv.org/html/2606.00151#bib.bib2)) over $\mu$, that is, uncertainty due to limited data rather than inherent randomness, which evolves during exploration; we address it via explicit modeling (Osband et al., [2016](https://arxiv.org/html/2606.00151#bib.bib38)) or by sampling from nonstationary return estimates (Moalla et al., [2024](https://arxiv.org/html/2606.00151#bib.bib72)).
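To make the nested expectation in Eq. (1) concrete, the following minimal Python sketch evaluates the bandit ReMax objective exactly by enumeration. It assumes a discrete belief given as (probability, value-vector) pairs; the function name `remax_objective` and the three-armed belief are illustrative assumptions and are not taken from the paper or its code release.

```python
from itertools import product


def remax_objective(policy, belief, M):
    """Exact J^M_ReMax for a small bandit (illustrative helper).

    policy: list of action probabilities summing to 1.
    belief: list of (probability, mu_vector) pairs, a discrete stand-in for Pi.
    M: number of retries (positive integer).
    """
    K = len(policy)
    total = 0.0
    for weight, mu in belief:                        # outer expectation: mu ~ Pi
        inner = 0.0
        for actions in product(range(K), repeat=M):  # inner expectation: A_[M] ~ pi
            prob = 1.0
            for a in actions:
                prob *= policy[a]
            inner += prob * max(mu[a] for a in actions)
        total += weight * inner
    return total


# Hypothetical three-armed belief, used only for illustration.
belief = [(0.5, (1.0, 0.0, 0.2)), (0.5, (0.0, 1.0, 0.2))]
uniform = [1 / 3, 1 / 3, 1 / 3]
greedy = [1.0, 0.0, 0.0]

for M in (1, 2):
    print(M, remax_objective(greedy, belief, M), remax_objective(uniform, belief, M))
```

With these illustrative numbers, the greedy policy scores higher for $M = 1$, while the uniform policy scores higher for $M = 2$ because a second draw can recover from a wrong first guess.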
Comparatively, one canonical approach to exploration is to augment the reward with curiosity-based bonuses, including pseudo-count methods (Bellemare et al., [2016](https://arxiv.org/html/2606.00151#bib.bib10); Lobel et al., [2023](https://arxiv.org/html/2606.00151#bib.bib47)) and prediction-error-based approaches (Pathak et al., [2017](https://arxiv.org/html/2606.00151#bib.bib12); Burda et al., [2019](https://arxiv.org/html/2606.00151#bib.bib21)). While these methods have proven effective in ALE game domains, they typically require additional models to estimate the statistics needed to construct the bonuses, increasing algorithmic complexity and imposing extra computational overhead.
Unlike methods that add explicit bonuses, ReMax induces exploration *without* bonuses by optimizing a purely reward-based objective. Recent subsequent or concurrent works on LLM training for reasoning tasks have studied retry-style objectives such as pass@$K$ (Walder and Karkhanis, [2025](https://arxiv.org/html/2606.00151#bib.bib36); Tang et al., [2025a](https://arxiv.org/html/2606.00151#bib.bib37)), which share a similar idea to ours. (Footnote 1: See (Koyamada et al., [2022](https://arxiv.org/html/2606.00151#bib.bib80)) for an early preprint of the current paper that proposed the ReMax objective and its simple optimization with REINFORCE already in 2022, preceding later max@$K$/pass@$K$ policy-optimization works in LLM training.) The key distinction is that ReMax explicitly accounts for reward uncertainty, whereas LLM reasoning tasks typically assume fixed, verifiable rewards. For a broader discussion and connections to related work, see App. [A](https://arxiv.org/html/2606.00151#A1).
Throughout this paper, we address the following question:
Can we promote exploration without adding explicit bonuses by optimizing ReMax in RL?
To answer this question, we organize the paper as follows\.
Step 1: Empirical study in Bandits. We empirically illustrate the core idea of how ReMax (defined in Eq. ([1](https://arxiv.org/html/2606.00151#S1.E1))) induces effective exploration in bandits in Sec. [2](https://arxiv.org/html/2606.00151#S2). As the retry parameter $M$ increases, the optimal policy becomes more exploratory, and ReMax adapts exploration to the scale of reward uncertainty; in a posterior bandit setting, it exhibits *empirically* sublinear regret in our experiments.
Step 2: ReMax in RL. We define the ReMax objective for RL in Sec. [3](https://arxiv.org/html/2606.00151#S3). Unlike bandits, state transitions hinder retrying multiple actions from the same state to observe returns, so we emulate retries via queries to a $Q$-function and discuss possible instantiations of ReMax in RL.
Step 3: Policy Gradient for ReMax. To optimize ReMax, we develop a practical policy-gradient (PG) method in Sec. [4](https://arxiv.org/html/2606.00151#S4). We derive a new PG formulation that is estimable from trajectory returns and generalize the integer draw count $M$ to a positive real parameter $m > 0$, enabling finer control of the exploration–exploitation trade-off. Building on this formulation, we introduce ReMaxPPO (RePPO), a PPO-based deep actor–critic algorithm (Schulman et al., [2017](https://arxiv.org/html/2606.00151#bib.bib35)).
Step 4: Experiments. Finally, we evaluate RePPO on MinAtar (Young and Tian, [2019](https://arxiv.org/html/2606.00151#bib.bib52)) and Craftax (Matthews et al., [2024](https://arxiv.org/html/2606.00151#bib.bib73)) in Sec. [5](https://arxiv.org/html/2606.00151#S5). RePPO optimizes ReMax without exploration bonuses, achieves better performance, and maintains higher policy entropy than PPO with an entropy bonus; peak performance occurs around $m = 1.2$–$1.4$. On Craftax, a larger-scale open-ended RL environment, RePPO achieves competitive performance compared to a tuned entropy-regularized PPO, despite not using an exploration bonus. Overall, ReMax emerges as a promising objective for exploration in reinforcement learning.



Figure 1: Bandit problems. (Left) ReMax objective for a binary bandit. The standard reinforcement learning objective ($M = 1$) has a deterministic optimal policy ($p^* = 1$); in contrast, increasing the retry count to $M \geq 2$ shifts the optimal policy toward stochastic exploration to hedge against reward uncertainty. (Center) How the optimal policy changes as the reward variance changes. While Softmax remains fixed regardless of variance, ReMax automatically increases exploration as reward uncertainty (scale $\alpha_1$) grows, targeting rare high-reward outcomes. (Right) ReMax objective for a fixed deterministic reward vector. The continuous parameter $m$ reshapes the objective’s curvature. Values of $m > 1$ flatten the gradient to slow convergence and sustain exploration, while $m < 1$ accelerates updates.
## 2 ReMax in Bandits: An Empirical Study
This section builds intuition for how ReMax (Eq. ([1](https://arxiv.org/html/2606.00151#S1.E1))) promotes exploration via retrying and uncertainty. We first design simple reward distributions to illustrate ReMax’s optima (Sec. [2.1](https://arxiv.org/html/2606.00151#S2.SS1)), then move to a posterior-updating bandit where $\Pi$ is learned from data (i.e., a standard Bayesian bandit learning setting as in Thompson sampling; Sec. [2.2](https://arxiv.org/html/2606.00151#S2.SS2)).
### 2.1 Warm-up: Properties of ReMax
#### ReMax optima yield stochastic policies\.
This example shows how retrying can induce stochastic policies. Consider a two-armed bandit with arms indexed by $a \in \{0, 1\}$ (so $\mu = (\mu_0, \mu_1)$): $\mu = (0, 1)$ w.p. $0.75$ and $\mu = (1, 0)$ w.p. $0.25$. For the RL objective, the optimal policy is deterministic and always chooses arm $1$. For ReMax (Eq. ([1](https://arxiv.org/html/2606.00151#S1.E1))), the optimal policy is stochastic: mixing between arms hedges against not knowing which one is rewarding (repeating the same arm cannot improve the max-over-$M$ value). Since $\mu = (0, 1)$ is more likely, a limited retry budget $M$ still assigns substantial mass to arm $1$ to avoid missing it.
As $M$ grows, the policy can explore arm $0$ more often and the optimum becomes increasingly exploratory. Fig. [1](https://arxiv.org/html/2606.00151#S1.F1) (Left) plots $J^{M}_{\mathrm{ReMax}}(p)$ vs. $p := \pi(a{=}1)$ for $M = 1, \dots, 5$ (analytic values; see App. [C.1](https://arxiv.org/html/2606.00151#A3.SS1)). Dots mark the maximizer $p^*$: $p^* = 1$ for $M = 1$ (value $0.75$), and for $M \geq 2$ it shifts toward exploration as the value approaches $1$. Therefore, the retry mechanism induces stochastic behavior in the presence of uncertainty.
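The shift of the maximizer can be reproduced with a few lines of Python. The sketch below uses an expression derived here directly from Eq. (1) for this two-armed example (it is not copied from App. C.1): conditional on $\mu$, the best of $M$ draws equals $1$ unless every draw picks the unrewarded arm. The names `j_remax_binary` and `best_p` and the grid resolution are illustrative assumptions.

```python
def j_remax_binary(p, M, w=0.75):
    """ReMax value of the binary example for p = pi(a=1).

    With probability w, mu = (0, 1): the max is 1 unless all M draws pick arm 0.
    With probability 1 - w, mu = (1, 0): the max is 1 unless all M draws pick arm 1.
    """
    return w * (1 - (1 - p) ** M) + (1 - w) * (1 - p ** M)


def best_p(M, grid=10001):
    """Grid search for the maximizer p* on [0, 1]."""
    ps = [i / (grid - 1) for i in range(grid)]
    return max(ps, key=lambda p: j_remax_binary(p, M))


for M in range(1, 6):
    p_star = best_p(M)
    print(M, round(p_star, 3), round(j_remax_binary(p_star, M), 4))
```

For $M = 1$ the search returns $p^* = 1$ with value $0.75$, matching the text; for $M = 2$ the objective simplifies to $0.25 + 1.5p - p^2$, which is maximized at $p^* = 0.75$.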
#### ReMax adapts exploration to reward uncertainty\.
The previous example showed that ReMax induces stochastic behavior by the retry mechanism; here we show that it adapts to the magnitude of reward uncertainty. We consider a two-armed Bernoulli bandit with $\mu_i = \alpha_i X_i$, where $X_i \sim \mathrm{Bernoulli}(p_i)$ and $\mathbb{E}[\mu_i] = \alpha_i p_i$. We fix $p_0 = 1$ and $\alpha_0 = 2$, and vary $\alpha_1$ from $1$ to $10$, adjusting $p_1$ so that $\alpha_1 p_1 = 1$ remains constant (fixed mean, varying variance). Fig. [1](https://arxiv.org/html/2606.00151#S1.F1) (Center) shows the optimal probability of selecting arm 1, $\pi^*(a = 1)$, for $M = 2$, alongside that of the softmax policy, which is the analytical optimum of the RL objective with an entropy bonus (App. [C.2](https://arxiv.org/html/2606.00151#A3.SS2)). When $\alpha_1 \leq 2$, arm 1 is never chosen since its maximum cannot exceed that of arm 0. As $\alpha_1$ increases beyond $2$, ReMax increasingly favors arm 1, reflecting its adaptiveness to rare but high-reward outcomes. In contrast, Softmax remains flat across $\alpha_1 \in [1, 10]$ as it depends solely on the mean, which is fixed.
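The dependence on $\alpha_1$ can be checked numerically. The following sketch evaluates Eq. (1) exactly for this Bernoulli bandit by enumerating the two outcomes of $X_1$ and all action tuples, then grid-searches the optimal $\pi(a = 1)$. It assumes, as in Eq. (1), that $\mu$ is drawn once and shared across the $M$ retries; the function names and the grid size are illustrative choices rather than the procedure of App. C.2.

```python
from itertools import product


def j_remax_scaled(p, alpha1, M=2):
    """Exact ReMax value when arm 0 pays 2 surely and arm 1 pays alpha1 w.p. 1/alpha1."""
    p1 = 1.0 / alpha1                          # keeps the mean alpha1 * p1 = 1 fixed
    policy = (1.0 - p, p)                      # p = pi(a = 1)
    value = 0.0
    for x1, px in ((1, p1), (0, 1.0 - p1)):    # draw mu once, share it across retries
        mu = (2.0, alpha1 * x1)
        for actions in product((0, 1), repeat=M):
            prob = 1.0
            for a in actions:
                prob *= policy[a]
            value += px * prob * max(mu[a] for a in actions)
    return value


def best_p(alpha1, grid=2001):
    """Grid search for the optimal probability of arm 1."""
    ps = [i / (grid - 1) for i in range(grid)]
    return max(ps, key=lambda p: j_remax_scaled(p, alpha1))


for alpha1 in (1.0, 2.0, 4.0, 10.0):
    print(alpha1, round(best_p(alpha1), 3))
```

In this sketch the maximizer stays at $0$ for $\alpha_1 \leq 2$ and grows for larger $\alpha_1$, consistent with the behavior described for Fig. 1 (Center).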
#### ReMax with deterministic rewards\.
The previous examples focused on ReMax with stochastic rewards. We now turn to the deterministic setting and show that, *even with fixed rewards*, ReMax reshapes the objective geometry and thus modulates convergence via the retry parameter.
Consider a binary bandit with rewards fixed to $\mu = (0, 1)$ and let $p := \pi(a{=}1) \in [0, 1]$. In this case, the ReMax objective admits a closed form: for $m > 0$, $J^{\,m}_{\mathrm{ReMax}}(p) = 1 - (1 - p)^{m}$, where $m$ can be an integer or a positive real. Fig. [1](https://arxiv.org/html/2606.00151#S1.F1) (Right) plots $J^{\,m}_{\mathrm{ReMax}}(p)$ for $m = 0.5, 0.75, 1, 1.5, 2$, foreshadowing our continuous-$m$ formulation in Sec. [3](https://arxiv.org/html/2606.00151#S3) and Sec. [4](https://arxiv.org/html/2606.00151#S4). The maximizer remains $p^{\star} = 1$ for all $m$, indicating that after sufficient exploration removes epistemic uncertainty, the policy will converge to the optimal policy. However, the local geometry near high $p$ depends strongly on $m$: larger $m$ flattens the objective and reduces gradient magnitudes, whereas smaller $m$ sharpens curvature and amplifies gradients. Thus, tuning $m$ controls convergence even in deterministic settings: $m > 1$ slows updates (encouraging exploration), while $m < 1$ accelerates them, mitigating the slow convergence often observed with softmax policies (Hennes et al., [2020](https://arxiv.org/html/2606.00151#bib.bib61)). Furthermore, non-integer $m$ naturally interpolates between integer retry counts (e.g., $m = 1.5$ fits between $m = 1$ and $m = 2$), enabling finer-grained control.
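The effect of $m$ on the gradient near high $p$ follows from differentiating the closed form: $\frac{d}{dp}\bigl[1 - (1 - p)^{m}\bigr] = m (1 - p)^{m - 1}$. A small sketch, with illustrative function names and an arbitrarily chosen evaluation point $p = 0.9$, prints the gradient for the values of $m$ used in the figure.

```python
def j_det(p, m):
    """Closed-form ReMax objective for fixed rewards mu = (0, 1)."""
    return 1.0 - (1.0 - p) ** m


def grad_det(p, m):
    """Derivative with respect to p: m * (1 - p)^(m - 1)."""
    return m * (1.0 - p) ** (m - 1.0)


for m in (0.5, 0.75, 1.0, 1.5, 2.0):
    print(m, round(j_det(0.9, m), 4), round(grad_det(0.9, m), 4))
```

At $p = 0.9$ the gradient is $1$ for $m = 1$, shrinks to $0.2$ for $m = 2$, and exceeds $1$ for $m < 1$, which is the flattening and sharpening described above.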
### 2.2 Bandit with Posterior: Empirical Sublinear Regret
In the previous section, we intentionally designed the distribution over rewards $\Pi$ to illustrate properties of ReMax. In practice, the distribution is *estimated* and updated from observed data as the agent explores the environment. To validate ReMax in this more realistic setting, we consider a $K$-armed bandit with posterior updates, optimize ReMax using samples from the evolving posterior (Thompson, [1933](https://arxiv.org/html/2606.00151#bib.bib58)), and evaluate cumulative regret.
Problem setup. At each run, we draw means $(\mu_1, \ldots, \mu_K) \sim \Pi^*$ and fix them, with $\mu^* = \max_i \mu_i$. At round $t \in [T]$, we choose $A_t$, observe $R_t$ (mean $\mu_{A_t}$), and update the posterior $\Pi_{t+1}$ (prior $\Pi_0 = \Pi^*$). ReMax optimizes $\widehat{J}^{M}_{\mathrm{ReMax}}(\pi_\theta)$ using samples from $\Pi_t$ to produce $\pi_t$ (see App. [C.3](https://arxiv.org/html/2606.00151#A3.SS3)). We study (i) Beta–Bernoulli ($\Pi_0 = \mathrm{Beta}(1, 1)$, $R_t \sim \mathrm{Bernoulli}(\mu_{A_t})$) and (ii) Gaussian–Gaussian ($\Pi_0 = \mathcal{N}(0, 1)$, $R_t \sim \mathcal{N}(\mu_{A_t}, 1)$). We compare with Thompson sampling (Thompson, [1933](https://arxiv.org/html/2606.00151#bib.bib58); Agrawal and Goyal, [2017](https://arxiv.org/html/2606.00151#bib.bib56); Honda and Takemura, [2014](https://arxiv.org/html/2606.00151#bib.bib57)), UCB (Auer et al., [2002](https://arxiv.org/html/2606.00151#bib.bib59)) (both sublinear-regret), and a Softmax baseline, which selects an arm by sampling from the softmax of the posterior means. We use $K = 10$, $T = 1000$, and $M \in \{2, 3\}$, and report the mean $\pm$ standard error of cumulative regret over 256 runs (instantaneous regret $\mu^* - \mu_{A_t}$); full details are in App. [C.3](https://arxiv.org/html/2606.00151#A3.SS3).
Results. ReMax exhibits empirically sublinear cumulative regret, comparable to the classic UCB and Thompson sampling baselines, while Softmax incurs higher cumulative regret, especially for the Gaussian–Gaussian bandit (Fig. [2](https://arxiv.org/html/2606.00151#S2.F2)). We do not claim superiority; rather, these results demonstrate that ReMax yields effective exploration in practice. We leave theoretical regret bounds to future work.


Figure 2: Average cumulative regret and standard error over 256 runs. ReMax with $M = 2$ and $M = 3$, optimized by gradient ascent.
## 3 ReMax in RL
We extend ReMax from bandits to episodic Markov Decision Processes (MDPs) (Puterman, [2014](https://arxiv.org/html/2606.00151#bib.bib68)) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, P, T)$ with discrete actions $\mathcal{A} = \{1, \dots, K\}$. For $\tau \sim (\pi, P)$, let $\mathcal{R}(\tau) := \sum_{t=0}^{T-1} r(s_t, a_t)$; the standard objective is to maximize $\mathbb{E}\left[\mathcal{R}(\tau)\right]$. When needed, we add the subscript $\mathcal{M}$, i.e., $\mathcal{R}_{\mathcal{M}}, \mathcal{Q}_{\mathcal{M}}$.
In extending ReMax from bandits to RL, two issues arise: the unit of uncertainty and the feasibility of retries. Uncertainty in RL concerns the entire environment, including rewards and transitions, so we place a distribution over MDPs, $\mathcal{M} \sim P(\mathcal{M})$, capturing (epistemic) uncertainty over unexplored regions of the environment (Ghosh et al., [2021](https://arxiv.org/html/2606.00151#bib.bib2)). Retrying in RL would naïvely require multiple rollouts from the same state until termination, which is infeasible without a resettable simulator (Ecoffet et al., [2021](https://arxiv.org/html/2606.00151#bib.bib25)) and often prohibitively expensive even with one. To address this, we introduce function approximation via a Q-function $Q^{\pi}_{\mathcal{M}}(s, a) := \mathbb{E}_{\tau \sim (\pi, P)}\left[\mathcal{R}_{\mathcal{M}}(\tau) \mid s, a\right]$, i.e., the expected episodic return from starting in state $s$, taking action $a$, and then following $\pi$; this replaces Monte Carlo returns from $(s, a)$ with queries to $Q^{\pi}_{\mathcal{M}}$. To make the retry structure explicit, we fix a state $s \in \mathcal{S}$ at which retries are considered. In practice, $s$ ranges over states encountered during training; our actor–critic implementation optimizes an average of this objective over the on-policy state distribution. The ReMax objective is defined as
$$J^{M}_{\mathrm{ReMax}}(\pi, s) := \mathbb{E}_{{\color{red}\mathcal{M} \sim P(\mathcal{M})}}\left[\mathbb{E}_{{\color{blue}A_{[M]} \sim \pi}}\left[{\color{blue}\max_{m \in [M]}}\, Q^{\pi}_{\mathcal{M}}(s, A_m)\right]\right]. \tag{2}$$
As in the bandit setting, setting $M = 1$ with a fixed environment $\mathcal{M}$ recovers the standard RL objective, and the optimal policy is deterministic (Sutton and Barto, [2018](https://arxiv.org/html/2606.00151#bib.bib1)). In Eq. ([2](https://arxiv.org/html/2606.00151#S3.E2)), all within-environment randomness is already absorbed into $Q^{\pi}_{\mathcal{M}}$, so the dependence on $\mathcal{M}$ is only through $Q^{\pi}_{\mathcal{M}}$. Let $\mathcal{Q}$ denote the induced distribution over Q-functions (from $\mathcal{M} \sim P(\mathcal{M})$ or an uncertainty model); replacing the outer expectation over $\mathcal{M}$ by an expectation over $Q \sim \mathcal{Q}$ yields a practical form for optimization:
###### Definition 3.1 (ReMax objective with Q distribution).
$$J^{M}_{\mathrm{ReMax}}(\pi, s, \mathcal{Q}) := \mathbb{E}_{{\color{red}Q \sim \mathcal{Q}}}\left[\mathbb{E}_{{\color{blue}A_{[M]} \sim \pi}}\left[{\color{blue}\max_{m \in [M]}}\, Q(s, A_m)\right]\right]. \tag{3}$$
For a fixed state, the Q-values $Q(s, \cdot)$ form a $K$-dimensional vector, reducing the problem to the bandit case. Therefore, we can expect that ReMax enjoys the exploration advantages observed in Sec. [2](https://arxiv.org/html/2606.00151#S2). In practice, optimizing the ReMax objective involves several algorithmic considerations.
Modeling Q-function uncertainty. The objective in Eq. ([3](https://arxiv.org/html/2606.00151#S3.E3)) can be estimated given samples of Q-values from the distribution $\mathcal{Q}$, via two approaches: (1) Explicit modeling: explicitly model $\mathcal{Q}$ with ensembles or Bayesian methods (Osband et al., [2016](https://arxiv.org/html/2606.00151#bib.bib38), [2018](https://arxiv.org/html/2606.00151#bib.bib39), [2019](https://arxiv.org/html/2606.00151#bib.bib40); An et al., [2021](https://arxiv.org/html/2606.00151#bib.bib65)); model-based methods can likewise capture environment-level uncertainty (Hafner et al., [2020](https://arxiv.org/html/2606.00151#bib.bib69)). (2) Implicit modeling: in standard deep RL, critics $Q_{\phi}(s, a)$ are inherently nonstationary due to distribution shift and algorithmic randomness (Moalla et al., [2024](https://arxiv.org/html/2606.00151#bib.bib72); Tang and Berseth, [2024](https://arxiv.org/html/2606.00151#bib.bib71)), so we treat $Q_{\phi}(s, \cdot)$ at each update as a sample from this *implicit* distribution. As in the bandit setting (Sec. [2](https://arxiv.org/html/2606.00151#S2)), ReMax is expected to adapt to this variability for exploration. Furthermore, even with nearly deterministic Q-values, ReMax with $M > 1$ can still promote exploration by slowing down convergence, as demonstrated in the deterministic bandit in Sec. [2.1](https://arxiv.org/html/2606.00151#S2.SS1.SSS0.Px3).
Computing the expected maximum over $M$ trials. Suppose a sample of Q-values for a state $s$ is given by $q = (Q(s, a_1), \dots, Q(s, a_K))$ with $Q \sim \mathcal{Q}$ via either approach in the previous paragraph. The question is how to compute the inner expectation for a fixed $q$, defined as $J^{M}_{\mathrm{ReMax}}(\pi, s, q) := \mathbb{E}_{A_{[M]} \sim \pi}\left[\max_{m \in [M]} q_{A_m}\right]$. Sampling $M$ actions yields an unbiased but high-variance estimator, so we provide a closed form.
###### Proposition 3.2 (Closed-form expression of the inner expectation).
Given a Q-value vector $q \in \mathbb{R}^{K}$, sort it as $q_{(1)} \geq \cdots \geq q_{(K)}$, breaking ties arbitrarily, with aligned policy masses $\pi_{(j)}$. Define $C_0 := 0$ and $C_j := \sum_{u=1}^{j} \pi_{(u)}$, $j = 1, \ldots, K$. Then,
$$J^{M}_{\mathrm{ReMax}}(\pi, s, q) = q_{(1)} + \sum_{j=1}^{K-1} \bigl(q_{(j+1)} - q_{(j)}\bigr)\bigl(1 - C_j\bigr)^{M}. \tag{4}$$
Refer to App. [B.1](https://arxiv.org/html/2606.00151#A2.SS1) for the proof. The computational cost is $O(K \log K)$ to sort $q$ and does not depend on $M$. This closed-form expression allows us to calculate the inner expectation exactly, and it is differentiable with respect to $\pi$. Note that this relies on a (relatively small) discrete action space, where Q-values for all actions are available and we can sort them; for continuous or vast action spaces (e.g., language models), sample-based estimation is required.
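As a sanity check on Eq. (4), the sketch below implements the sorted closed form and compares it with a brute-force expectation over all $K^{M}$ action tuples. The function names and the example values of `pi` and `q` are illustrative assumptions; this is not the implementation from the official repository.

```python
from itertools import product


def remax_closed_form(pi, q, M):
    """Eq. (4): q_(1) + sum_j (q_(j+1) - q_(j)) * (1 - C_j)^M, with q sorted descending."""
    order = sorted(range(len(q)), key=lambda a: q[a], reverse=True)
    value = q[order[0]]
    cum = 0.0                                   # running cumulative mass C_j
    for j in range(len(q) - 1):
        cum += pi[order[j]]
        value += (q[order[j + 1]] - q[order[j]]) * (1.0 - cum) ** M
    return value


def remax_brute_force(pi, q, M):
    """Direct expectation of the max over all K^M action tuples (integer M only)."""
    total = 0.0
    for actions in product(range(len(q)), repeat=M):
        prob = 1.0
        for a in actions:
            prob *= pi[a]
        total += prob * max(q[a] for a in actions)
    return total


pi = [0.1, 0.2, 0.3, 0.4]    # illustrative policy masses
q = [1.5, -0.3, 0.7, 0.2]    # illustrative Q-values for one state

for M in (1, 2, 3, 4):
    print(M, remax_closed_form(pi, q, M), remax_brute_force(pi, q, M))
```

The closed form costs one sort regardless of $M$, whereas the brute-force loop grows as $K^{M}$; the closed form also accepts a non-integer exponent, which the enumeration cannot.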
Base algorithm. ReMax can be integrated into a broad family of RL algorithms that rely on Q-functions as critics. We highlight two classes: (1) Actor–critic: Actor–critic methods (Schulman et al., [2017](https://arxiv.org/html/2606.00151#bib.bib35); Haarnoja et al., [2018](https://arxiv.org/html/2606.00151#bib.bib30)) are a natural fit since they already optimize policies with respect to critic signals. Thus, ReMax can be instantiated by replacing the policy-optimization module of the actor–critic algorithm with ReMax. (2) Q-learning: ReMax can also be combined with Q-learning variants (Mnih et al., [2013](https://arxiv.org/html/2606.00151#bib.bib62); Gallici et al., [2025](https://arxiv.org/html/2606.00151#bib.bib51); Vieillard et al., [2020](https://arxiv.org/html/2606.00151#bib.bib67)), though it requires training a policy model in addition to the Q-function.
Our instantiation. Because ReMax is a new policy-optimization objective, we adopt a simple and efficient instantiation: an on-policy actor–critic method with implicit Q-uncertainty and closed-form computation. Specifically, we use PPO (Schulman et al., [2017](https://arxiv.org/html/2606.00151#bib.bib35)) as the base method due to its strong performance with discrete actions. We do not use Eq. ([4](https://arxiv.org/html/2606.00151#S3.E4)) directly; instead, we derive a closed-form policy gradient that similarly leverages sorting in Sec. [4](https://arxiv.org/html/2606.00151#S4). Other instantiations, such as explicit Q-distribution modeling, optimizing ReMax through Eq. ([4](https://arxiv.org/html/2606.00151#S3.E4)), or integrating ReMax with other RL algorithms (e.g., Q-learning (Mnih et al., [2013](https://arxiv.org/html/2606.00151#bib.bib62))), are left for future work.
## 4 Policy Gradient in ReMax
To optimize the ReMax objective in Eq. ([3](https://arxiv.org/html/2606.00151#S3.E3)) within on-policy actor–critic methods, we develop a practical policy gradient (PG) approach. Since only a single-trajectory return from $(s, a)$ is observable, we design a PG estimator based on such returns. In Sec. [4.1](https://arxiv.org/html/2606.00151#S4.SS1), we show that a naïve PG derivation is not directly estimable from trajectory returns and derive an estimation-friendly reformulation. We then provide a closed form of our proposed PG and generalize the number of draws to a positive real $m$ to enable fine-grained control over exploration in Sec. [4.2](https://arxiv.org/html/2606.00151#S4.SS2). Finally, we present an actor–critic instantiation, ReMaxPPO (RePPO), based on Q-critic PPO (Schulman et al., [2017](https://arxiv.org/html/2606.00151#bib.bib35)).
### 4.1 Estimation-Friendly Policy Gradient for ReMax
We seek an unbiased PG for ReMax that is estimable from single-trajectory returns; we first recall why standard RL admits the REINFORCE estimator (Williams, [1992](https://arxiv.org/html/2606.00151#bib.bib27)).
#### Policy Gradient in Standard RL\.
Let $\pi_{\theta} : \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$ be a parametrized policy, and define $J_{\mathrm{RL}}(\pi_{\theta}, s) := \mathbb{E}_{\tau \sim (\pi_{\theta}, P)}\left[\mathcal{R}(\tau) \mid s\right]$. The policy gradient theorem (Sutton and Barto, [2018](https://arxiv.org/html/2606.00151#bib.bib1)) gives
$$\nabla_{\theta} J_{\mathrm{RL}}(\pi_{\theta}, s) = \mathbb{E}_{a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s, a)\right]. \tag{5}$$
Because the outer expectation $\mathbb{E}_{a \sim \pi_{\theta}}\left[\cdot\right]$ is over a single action, replacing $Q^{\pi_{\theta}}(s, a)$ by the trajectory return $\mathcal{R}(s, a)$ yields an unbiased REINFORCE estimator: $\hat{g}_{\mathrm{RL}} := \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, \mathcal{R}(s, a)$.
#### Problem with the naïve PG for ReMax\.
For ReMax with fixed Q-values, $J^{M}_{\mathrm{ReMax}}(\pi_{\theta}, s, q)$ with $q \in \mathbb{R}^{K}$, applying the policy gradient theorem yields
$$\nabla_{\theta} J^{M}_{\mathrm{ReMax}}(\pi_{\theta}, s, q) = \mathbb{E}_{A_{[M]} \sim \pi_{\theta}}\left[\left(\max_{m \in [M]} q_{A_m}\right) \sum_{m=1}^{M} \nabla_{\theta} \log \pi_{\theta}(A_m \mid s)\right]. \tag{6}$$
Unlike Eq. ([5](https://arxiv.org/html/2606.00151#S4.E5)), the expectation is over $M$ actions, so an unbiased estimator would require observing returns for all $M$ sampled actions, which is infeasible in episodic RL. Moreover, $A_1, \dots, A_M$ are coupled through the $\max$ operator, so a single-action expectation does not follow directly. We resolve this by introducing a baseline that decouples the max and enables a single-action expectation.
#### Policy gradient via expected improvement\.
Following Tang et al. ([2025a](https://arxiv.org/html/2606.00151#bib.bib37)), for each $m$ in Eq. ([6](https://arxiv.org/html/2606.00151#S4.E6)), we can insert a baseline $b_m$ that may depend on $(s, A_{-m})$ but not on $A_m$:
$$\nabla_{\theta} J^{M}_{\mathrm{ReMax}}(\pi_{\theta}, s, q) = \mathbb{E}_{A_{[M]} \sim \pi_{\theta}}\left[\sum_{m=1}^{M} \nabla_{\theta} \log \pi_{\theta}(A_m \mid s) \left(\max_{j \in [M]} \bigl(q_{A_j} - b_m\bigr)\right)\right], \tag{7}$$
which preserves unbiasedness of the policy gradient because $\mathbb{E}_{A_m}\left[\nabla_{\theta} \log \pi_{\theta}(A_m \mid s)\, b_m\right] = 0$. We choose $b_m$ as $W_{-m} := \max\{q_{A_1}, \dots, q_{A_{m-1}}, q_{A_{m+1}}, \dots, q_{A_M}\}$, which gives
$$\max_{j \in [M]} \bigl(q_{A_j} - W_{-m}\bigr) = \bigl(q_{A_m} - W_{-m}\bigr)_{+},$$
where $(x)_{+} = \max(x, 0)$ for $x \in \mathbb{R}$. Intuitively, $W_{-m}$ is the best value among the other $M{-}1$ sampled actions, so $(q_{A_m} - W_{-m})_{+}$ measures how much $A_m$ improves on that best alternative (clipped at zero). This turns the $\max$ into an action-specific term plus an “others” term, enabling the following single-action form:
###### Proposition 4\.1\.
Let $W_{M-1}:=\max\{q_{A_{1}},\dots,q_{A_{M-1}}\}$. Then, we have

$$\nabla_{\theta}J^{M}_{\mathrm{ReMax}}(\theta,s,q)=\mathbb{E}_{a\sim\pi_{\theta}}\left[M\,\nabla_{\theta}\log\pi_{\theta}(a\mid s)\,{\color[rgb]{0.4,0.4,1}\mathbb{E}_{A_{[M-1]}\sim\pi_{\theta}}\left[(q_{a}-W_{M-1})_{+}\right]}\right].\tag{8}$$
See App. [B.2](https://arxiv.org/html/2606.00151#A2.SS2) for the proof. The blue term is the expected improvement of the $M$-th draw when that draw is fixed to action $a$: it compares $q_{a}$ to $W_{M-1}$, the max over the other $M-1$ draws. Crucially, we take the expectation over the other $M-1$ actions $A_{[M-1]}\sim\pi_{\theta}$, which turns the coupled $\max$ into a scalar weight that depends only on $a$, $\pi_{\theta}$, and $q$. This is the key step that yields a single-action expectation and makes the PG estimable from single-trajectory returns. Since this quantity is central to our reformulated policy gradient, we refer to it as Expected Improvement (EI), borrowing terminology from Bayesian optimization (Jones et al., [1998](https://arxiv.org/html/2606.00151#bib.bib60)). (In Bayesian optimization, EI is the expected gain over the current best; here it is the improvement of an action over the best of $M-1$ other draws, a related but distinct notion.) For a reference $R\in\mathbb{R}$, policy $\pi$, and Q-values $q\in\mathbb{R}^{K}$, define $\mathrm{EI}_{M}(R,\pi,q):=\mathbb{E}_{A_{[M-1]}\sim\pi}\left[(R-W_{M-1})_{+}\right]$. Then, we have the EI-based PG.
###### Definition 4\.2\(EI\-based PG\)\.
For fixed Q-values $q\in\mathbb{R}^{K}$, the EI-based PG is

$$\nabla_{\theta}J^{M}_{\mathrm{ReMax}}(\theta,s,q)=M\,\mathbb{E}_{a\sim\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(a\mid s)\,\mathrm{EI}_{M}(q_{a},\pi_{\theta},q)\right].\tag{9}$$
Taking the expectation over the Q-values $q\in\mathbb{R}^{K}$ recovers the PG for ReMax in Eq. ([3](https://arxiv.org/html/2606.00151#S3.E3)). As desired, a single-trajectory return from $(s,a)$ then yields the estimator $\hat{g}_{\mathrm{ReMax}}:=M\,\nabla_{\theta}\log\pi_{\theta}(a\mid s)\,\mathrm{EI}_{M}(R,\pi_{\theta},q)$ with $R=\mathcal{R}(s,a)$. While (Walder and Karkhanis, [2025](https://arxiv.org/html/2606.00151#bib.bib36); Tang et al., [2025a](https://arxiv.org/html/2606.00151#bib.bib37)) use related comparator baselines mainly for variance reduction in sampling-based estimators for retry-style objectives, our reformulation expresses the gradient in terms of the policy probabilities $\pi_{\theta}$ and is therefore estimable from single-trajectory returns. In the following section, we show that EI also admits a closed-form computation.
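As a sanity check of the EI-based PG in Eq. (9), the sketch below compares it against a finite-difference gradient of the fixed-Q ReMax objective $\mathbb{E}[\max_{m}q_{A_{m}}]$ on a tiny problem. It is an illustration only, not the authors' code: it assumes a softmax policy over logits (so that $\partial\log\pi(a)/\partial\theta_{k}=\mathbb{1}[a=k]-\pi_{k}$), enumerates all action tuples by brute force, requires $M\ge 2$, and uses hypothetical function names and toy values.

```python
import math
from itertools import product

def softmax(logits):
    z = max(logits)
    e = [math.exp(x - z) for x in logits]
    s = sum(e)
    return [x / s for x in e]

def remax_objective(logits, q, M):
    """J = E[max_m q_{A_m}] with A_1..A_M drawn i.i.d. from softmax(logits)."""
    pi = softmax(logits)
    total = 0.0
    for actions in product(range(len(q)), repeat=M):
        prob = math.prod(pi[a] for a in actions)
        total += prob * max(q[a] for a in actions)
    return total

def expected_improvement(R, pi, q, M):
    """EI_M(R, pi, q) = E[(R - W_{M-1})_+], enumerating the other M-1 draws (M >= 2)."""
    total = 0.0
    for others in product(range(len(q)), repeat=M - 1):
        prob = math.prod(pi[a] for a in others)
        total += prob * max(R - max(q[a] for a in others), 0.0)
    return total

def ei_policy_gradient(logits, q, M):
    """Eq. (9) specialised to softmax logits (an assumption of this sketch)."""
    pi = softmax(logits)
    ei = [expected_improvement(q[a], pi, q, M) for a in range(len(q))]
    mean_ei = sum(p * e for p, e in zip(pi, ei))
    return [M * pi[k] * (ei[k] - mean_ei) for k in range(len(q))]

def finite_difference_gradient(logits, q, M, eps=1e-6):
    """Central differences of the brute-force objective, one logit at a time."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        grad.append((remax_objective(up, q, M) - remax_objective(down, q, M)) / (2 * eps))
    return grad

logits = [0.2, -0.5, 1.0]   # hypothetical policy logits
q = [1.0, 0.0, -0.5]        # hypothetical fixed Q-values
M = 3                       # number of retries
analytic = ei_policy_gradient(logits, q, M)
numeric = finite_difference_gradient(logits, q, M)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)
```

The two gradients should agree up to finite-difference error, even though the EI-based form only ever weights the score of a single action.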
### 4\.2Efficient and Generalized Computation of EI
We provide a closed\-form computation of the EI by leveraging the sorting of Q\-values as in Prop\.[3\.2](https://arxiv.org/html/2606.00151#S3.Thmtheorem2)\.
###### Proposition 4\.3\(Closed\-form computation of EI\)\.
Let $q\in\mathbb{R}^{K}$ be Q-values at a state, $\pi\in\Delta^{K-1}$ a policy, $R\in\mathbb{R}$ a reference, and $M\in\mathbb{N}$. Define $v_{i}:=(R-q_{i})_{+}$ and sort $q$ as $q_{(1)}\geq\cdots\geq q_{(K)}$, breaking ties arbitrarily, with aligned masses $\pi_{(j)}$ and aligned values $v_{(j)}$. Define $C_{0}:=0$ and $C_{j}:=\sum_{u=1}^{j}\pi_{(u)}$, $j=1,\ldots,K$. Then, we have

$$\mathrm{EI}_{M}(R;\pi,q)=v_{(1)}+\sum_{j=1}^{K-1}\big(v_{(j+1)}-v_{(j)}\big)\,(1-C_{j})^{M-1}.\tag{10}$$
See App. [B.3](https://arxiv.org/html/2606.00151#A2.SS3) for the proof. This also costs $\mathcal{O}(K\log K)$ time. Although ReMax is motivated by an integer number of draws $M$, Eq. (10) naturally extends to real $m>0$ by replacing $(1-C_{j})^{M-1}$ with $(1-C_{j})^{m-1}$. For $m<1$, this finite-valued extension requires $C_{j}<1$ for all $j<K$, which holds for full-support policies; in our implementation, we additionally clip $1-C_{j}$ from below for numerical stability as in App. [D](https://arxiv.org/html/2606.00151#A4). We therefore define the generalized EI, $\mathrm{EI}_{m}(R;\pi,q)$, by substituting $m$ for $M$, enabling finer control of the exploration–exploitation trade-off. The closed form of ReMax with fixed Q-values (Eq. [4](https://arxiv.org/html/2606.00151#S3.E4)) is also valid for any real $m>0$.
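The closed form in Eq. (10) and its real-valued extension can be sketched in a few lines. The code below is a minimal illustration, not the implementation from App. D: the function names, the toy values, and the clipping floor `1e-8` are assumptions. For integer $M\ge 2$ it can be compared against a brute-force enumeration of the other $M-1$ draws.

```python
import math
from itertools import product

def closed_form_ei(R, pi, q, m, floor=1e-8):
    """Eq. (10) with a real-valued m > 0 in place of M."""
    # Sort actions so that q_(1) >= ... >= q_(K); v is then non-decreasing.
    order = sorted(range(len(q)), key=lambda i: q[i], reverse=True)
    v = [max(R - q[i], 0.0) for i in order]
    ei = v[0]
    cum = 0.0
    for j in range(len(q) - 1):
        cum += pi[order[j]]            # C_j: mass of the j highest-valued actions
        tail = max(1.0 - cum, floor)   # clip 1 - C_j from below (floor value is an assumption)
        ei += (v[j + 1] - v[j]) * tail ** (m - 1.0)
    return ei

def brute_force_ei(R, pi, q, M):
    """E[(R - W_{M-1})_+] by enumerating the other M-1 draws (integer M >= 2)."""
    total = 0.0
    for others in product(range(len(q)), repeat=M - 1):
        prob = math.prod(pi[a] for a in others)
        total += prob * max(R - max(q[a] for a in others), 0.0)
    return total

pi = [0.5, 0.3, 0.2]     # hypothetical full-support policy
q = [1.0, 0.0, -0.5]     # hypothetical Q-values
R = 0.4                  # hypothetical reference return

ei_closed = closed_form_ei(R, pi, q, 3)
ei_brute = brute_force_ei(R, pi, q, 3)
ei_m12 = closed_form_ei(R, pi, q, 1.2)   # non-integer retry parameter
ei_m14 = closed_form_ei(R, pi, q, 1.4)
print(ei_closed, ei_brute, ei_m12, ei_m14)
```

Because every factor $1-C_{j}$ is below one for a full-support policy, the EI value shrinks as $m$ grows whenever some increment $v_{(j+1)}-v_{(j)}$ is positive, which is one way to see how $m$ rescales the weight given to each return.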
**Algorithm 1** RePPO

1. **repeat**
2. Collect trajectories under $\pi_{\theta}$ and compute returns $R^{\lambda}_{t}$.
3. For each $(s_{t},a_{t})$: form $q\leftarrow Q_{\phi}(s_{t},\cdot)$; compute $R_{+}(t):=\mathrm{EI}_{m}(R^{\lambda}_{t};\pi_{\theta},q)$, $Q_{+}(s_{t},a)=\mathrm{EI}_{m}(Q_{\phi}(s_{t},a);\pi_{\theta},q)$, and advantage $A_{+}(t)=R_{+}(t)-b_{+}(s_{t})$.
4. Update actor by PPO objective (Eq. ([12](https://arxiv.org/html/2606.00151#S4.E12))) with $A_{+}(t)$; update critic $Q_{\phi}$ toward $R^{\lambda}_{t}$.
5. **until** convergence


Figure 3: MinAtar results. Left: normalized scores aggregated with median, IQM, and mean across four games; boxes denote RLiable summaries over 10 seeds. RePPO, without entropy bonus, outperforms PPO-V, PPO-Q with entropy, and PPO-V + RND. Right: policy entropy during Breakout training. RePPO keeps high entropy without an entropy bonus, indicating that it promotes exploration.

Figure 4: MinAtar results. Effect of retry parameter $m$. Left: median, IQM, and mean of normalized evaluation return across all games. The best performance occurred around $m\in[1.2,1.4]$. Right: policy entropy on Breakout. Larger $m$ slowed entropy decay and encouraged exploration, while smaller $m$ led to faster entropy decay.
### 4\.3RePPO: Practical Policy Gradient for ReMax
We obtain a policy gradient that can be estimated directly from trajectory returns (Def. [4.2](https://arxiv.org/html/2606.00151#S4.Thmtheorem2)) and computed in closed form (Eq. ([10](https://arxiv.org/html/2606.00151#S4.E10))). Incorporating this into PPO (Schulman et al., [2017](https://arxiv.org/html/2606.00151#bib.bib35)) leads to our new algorithm, ReMaxPPO (RePPO). We begin by revisiting the PPO surrogate.
#### PPO surrogate\.
PPO collects trajectories under $\pi_{\mathrm{old}}$, computes $\lambda$-returns $R^{\lambda}_{t}$, and forms advantages $A(t)=R^{\lambda}_{t}-V_{\phi}(s_{t})$. It then optimizes a clipped, importance-weighted surrogate with ratio $r_{\theta}(t)=\pi_{\theta}(a_{t}\mid s_{t})/\pi_{\mathrm{old}}(a_{t}\mid s_{t})$ and clipping parameter $\varepsilon>0$, yielding:

$$\mathcal{L}_{\mathrm{PPO}}(\theta):=\mathbb{E}\left[\min\left(r_{\theta}(t)A(t),\ \mathrm{clip}\big(r_{\theta}(t),1-\varepsilon,1+\varepsilon\big)A(t)\right)\right].\tag{11}$$
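The clipped surrogate of Eq. (11) can be sketched as a sample average over a batch of precomputed ratios and advantages. This is a minimal illustration rather than the PPO implementation used in the experiments: the function name, the default `eps=0.2`, and the toy batch are assumptions.

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Sample average of min(r * A, clip(r, 1 - eps, 1 + eps) * A), as in Eq. (11)."""
    terms = []
    for r, adv in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
        terms.append(min(r * adv, clipped_r * adv))
    return sum(terms) / len(terms)

# Hypothetical batch of two transitions.
ratios = [1.5, 0.5]        # pi_theta / pi_old for the sampled actions
advantages = [1.0, -1.0]   # advantage estimates
objective = clipped_surrogate(ratios, advantages)
print(objective)
```

In this toy batch both terms are clipped: the first ratio already exceeds $1+\varepsilon$ with a positive advantage, and the second is already below $1-\varepsilon$ with a negative advantage, so neither sample rewards moving the policy further in the same direction. The advantage enters only as an input, so a different advantage estimate can be plugged in without changing the surrogate.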
#### RePPO surrogate\.
RePPO modifies PPO in two ways: (1) it uses a Q-critic $Q_{\phi}$; (2) it uses an EI-based advantage. Given $R^{\lambda}_{t}$, define $R_{+}(t):=\mathrm{EI}_{m}(R^{\lambda}_{t},\pi_{\theta},Q_{\phi}(s_{t},\cdot))$ and $Q_{+}(s_{t},a):=\mathrm{EI}_{m}(Q_{\phi}(s_{t},a),\pi_{\theta},q)$ with $q=Q_{\phi}(s_{t},\cdot)$. Set $b_{+}(s_{t}):=\mathbb{E}_{a\sim\pi_{\theta}}\left[Q_{+}(s_{t},a)\right]$ and $A_{+}(t):=R_{+}(t)-b_{+}(s_{t})$, so the surrogate objective for RePPO becomes

$$\mathcal{L}_{\mathrm{RePPO}}(\theta):=\mathbb{E}\left[\min\left(r_{\theta}(t)A_{+}(t),\ \mathrm{clip}\big(r_{\theta}(t),1-\varepsilon,1+\varepsilon\big)A_{+}(t)\right)\right].\tag{12}$$

The algorithm is summarized in Alg. [1](https://arxiv.org/html/2606.00151#alg1). Since RePPO differs from PPO solely through the use of a Q-critic and an EI-based advantage, its implementation is simple and the extra computation is minimal (see App. [E.2](https://arxiv.org/html/2606.00151#A5.SS2)). We provide the advantage computation code in App. [D](https://arxiv.org/html/2606.00151#A4).
Q-replacement for efficient exploration. When we compute the EI for the trajectory return, $R_{+}(s,a):=\mathrm{EI}_{m}(R(s,a);\pi_{\theta},Q_{\phi}(s,\cdot))$, we expect $v_{a}=(R(s,a)-Q_{\phi}(s,a))_{+}$ to be zero in Eq. ([10](https://arxiv.org/html/2606.00151#S4.E10)), since it measures an action’s improvement over itself. In practice, an underspecified critic may yield $R(s,a)>Q_{\phi}(s,a)$, overestimating $R_{+}$ and causing the policy to overfit to $a$, harming exploration. To mitigate this, we *replace* the $a$-th element of $q=Q_{\phi}(s,\cdot)$ with $\mathcal{R}(s,a)$ when evaluating EI. This enforces $v_{a}=0$ for the sampled action by construction and reduces spurious “self-improvement” caused by critic underestimation.
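The EI-based advantage and the effect of Q-replacement can be illustrated for a single transition as follows. This is a hedged sketch, not the code from App. D: the function names, the clipping floor, and the toy numbers are assumptions, and the baseline $b_{+}$ is computed here from the unmodified critic vector, which the text does not spell out.

```python
def closed_form_ei(R, pi, q, m, floor=1e-8):
    """Generalized EI of Eq. (10) with real m > 0 (floor value is an assumption)."""
    order = sorted(range(len(q)), key=lambda i: q[i], reverse=True)
    v = [max(R - q[i], 0.0) for i in order]
    ei = v[0]
    cum = 0.0
    for j in range(len(q) - 1):
        cum += pi[order[j]]
        ei += (v[j + 1] - v[j]) * max(1.0 - cum, floor) ** (m - 1.0)
    return ei

def reppo_advantage(ret, action, pi, q, m, replace=True):
    """A_+ = R_+ - b_+ for one transition, with optional Q-replacement."""
    q_for_return = list(q)
    if replace:
        # Q-replacement: overwrite the sampled action's Q-value with its return,
        # so that v_a = (ret - q_a)_+ is zero by construction.
        q_for_return[action] = ret
    r_plus = closed_form_ei(ret, pi, q_for_return, m)
    # Assumption of this sketch: the baseline uses the original critic values.
    q_plus = [closed_form_ei(q[a], pi, q, m) for a in range(len(q))]
    b_plus = sum(p * x for p, x in zip(pi, q_plus))
    return r_plus - b_plus

pi = [0.5, 0.3, 0.2]     # hypothetical policy at state s
q = [1.0, 0.0, -0.5]     # hypothetical critic output Q_phi(s, .)
action, m = 1, 1.2

# Critic underestimates the sampled action (return 0.6 > Q = 0.0).
adv_rep = reppo_advantage(0.6, action, pi, q, m, replace=True)
adv_raw = reppo_advantage(0.6, action, pi, q, m, replace=False)
# Return below the critic value (-0.2 < 0.0): replacement changes nothing here.
low_rep = reppo_advantage(-0.2, action, pi, q, m, replace=True)
low_raw = reppo_advantage(-0.2, action, pi, q, m, replace=False)
print(adv_rep, adv_raw, low_rep, low_raw)
```

In this example the replacement lowers the advantage only when the return exceeds the critic's value for the sampled action, which is the underestimation case the paragraph above is concerned with.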
## 5Experiments
To confirm the emergence of exploration in RePPO, we used three benchmark environments: *MinAtar* (Young and Tian, [2019](https://arxiv.org/html/2606.00151#bib.bib52)), *Atari* (Bellemare et al., [2013](https://arxiv.org/html/2606.00151#bib.bib53)), and *Craftax* (Matthews et al., [2024](https://arxiv.org/html/2606.00151#bib.bib73)) (an open-ended, long-horizon exploration benchmark). MinAtar is a simplified version of Atari 2600 games providing Breakout, Asterix, Freeway, and Space Invaders; we used the pgx implementation (Koyamada et al., [2023](https://arxiv.org/html/2606.00151#bib.bib50)) for efficient vectorized simulation. For Atari, we used 10 games from Bellemare’s hard-exploration problems (Bellemare et al., [2016](https://arxiv.org/html/2606.00151#bib.bib10)) to verify exploration benefits (App. [F](https://arxiv.org/html/2606.00151#A6)). To further validate scalability, we used Craftax (Matthews et al., [2024](https://arxiv.org/html/2606.00151#bib.bib73)), a vectorizable version of Crafter (Hafner, [2022](https://arxiv.org/html/2606.00151#bib.bib74)), an open-ended RL environment.
### 5\.1MinAtar
Here, we evaluate RePPO on the MinAtar benchmark to demonstrate its effectiveness in promoting exploration. We also analyze the impact of key components, specifically the continuous retry parameter $m$ and the Q-replacement strategy, on the agent’s performance and behavior.
Baselines and hyperparameters. We compared RePPO to PPO-V (state-value critic (Schulman et al., [2017](https://arxiv.org/html/2606.00151#bib.bib35))), PPO-Q (Q-critic), PPO-V + RND (Random Network Distillation (Burda et al., [2019](https://arxiv.org/html/2606.00151#bib.bib21))), and PQN (Gallici et al., [2025](https://arxiv.org/html/2606.00151#bib.bib51)), a strong Q-learning baseline. PPO baselines followed the default pgx settings (based on PureJaxRL (Lu et al., [2022](https://arxiv.org/html/2606.00151#bib.bib54))), and PQN followed the official implementation. We report runs with and without entropy regularization (+Ent indicates it was enabled). Main results used RePPO with $m\in\{1.2,1.4\}$; for speed comparisons vs. PPO-V and PPO-Q, see App. [E](https://arxiv.org/html/2606.00151#A5). For RePPO, we tuned $m$ and set $\lambda=0.8$ for the $\lambda$-return, keeping all other hyperparameters identical to PPO-V. PQN’s official setup used 128 parallel environments; we used 1024 and adjusted the number of environments, minibatch size, and update epochs (tuning $\lambda$ and the learning rate) to match the number of gradient updates. Additional comparisons with the unmodified PQN hyperparameters (yielding $\sim 5\times$ more updates) are provided in App. [E.2](https://arxiv.org/html/2606.00151#A5.SS2). Full hyperparameters are listed in App. [E.1](https://arxiv.org/html/2606.00151#A5.SS1).
Training and evaluation. Agents were trained for 10M environment steps with 10 seeds; evaluation averaged 100 episodes per seed. For PPO variants, during evaluation, the action was selected by argmax over policy logits; during training, we sampled from the policy. We reported *normalized scores* across games (median, IQM, mean) via RLiable (Agarwal et al., [2021](https://arxiv.org/html/2606.00151#bib.bib55)), normalized by the best returns across all methods; per-game scores are in App. [E.2](https://arxiv.org/html/2606.00151#A5.SS2).
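For readers unfamiliar with the interquartile mean (IQM), it is the mean of the middle 50% of the pooled run scores, which makes it less sensitive to outlier runs than the mean and less noisy than the median. The sketch below is a plain 25% trimmed mean written for illustration; it is not RLiable's API, it omits RLiable's bootstrap confidence intervals, and the function name and toy scores are assumptions.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of scores: drop the lowest and highest 25% of runs."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)

# Hypothetical normalized scores from eight runs, one of which is an outlier.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 10.0]
iqm = interquartile_mean(scores)
mean = sum(scores) / len(scores)
print(iqm, mean)
```

In this toy example the single outlier run dominates the mean but is trimmed away by the IQM.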
Main results. [Figure 3](https://arxiv.org/html/2606.00151#S4.F3) (Left) aggregates median, IQM, and mean across the four games. Among PPO variants, RePPO with $m\in[1.2,1.4]$ performed best overall, even compared to variants with entropy or RND bonuses. RePPO was comparable to PQN on median and IQM, and it consistently outperformed PQN on mean, indicating better tail performance. In [Figure 3](https://arxiv.org/html/2606.00151#S4.F3) (Right), RePPO *without* an entropy bonus maintained higher policy entropy than PPO *with* an entropy bonus. Finally, PPO-V + RND performed poorly when entropy was disabled, highlighting RND’s reliance on entropy bonuses and showing that RePPO promotes exploration without bonuses.
Effect of $m$: exploration–exploitation trade-off. We tested $m\in\{0.9,1.0,1.2,1.4,1.6,2,3\}$ and reported median, IQM, and mean of normalized scores and policy entropy on Breakout ([Figure 4](https://arxiv.org/html/2606.00151#S4.F4)). The sweep showed that increasing $m$ systematically slowed entropy decay, with returns peaking near $m\approx 1.2$–$1.4$ and falling when $m$ was larger or smaller, showing that the continuous parameter $m$ enabled more precise control of exploration.
Figure 5: MinAtar results. Median (all games). Removing either the action-independent baseline or the Q-replacement strategy substantially degrades performance.

Ablation on baseline and Q-replacement. We further evaluated two key components of RePPO: the action-independent baseline and the Q-replacement strategy. With $m=1.2$ and no entropy bonus, we compared (i) *w/o base* (no baseline, Q-replacement enabled), (ii) *w/o rep* (baseline enabled, no Q-replacement), and (iii) *RePPO (full)* with both. Removing either component substantially degraded performance ([Figure 5](https://arxiv.org/html/2606.00151#S5.F5)), indicating that both are necessary to realize the full benefits of RePPO (App. [E.2](https://arxiv.org/html/2606.00151#A5.SS2)).
### 5\.2Craftax
Craftax (Matthews et al., [2024](https://arxiv.org/html/2606.00151#bib.bib73)) is an open-ended RL environment built on Crafter (Hafner, [2022](https://arxiv.org/html/2606.00151#bib.bib74)) and NetHack (Küttler et al., [2020](https://arxiv.org/html/2606.00151#bib.bib75)), in which the agent must both plan over long horizons and continually adapt to newly revealed parts of the environment. We use the symbolic version of Craftax.
Setup. We compared RePPO ($m\in\{1.2,1.4\}$, no entropy bonus) against PPO-V, PPO-Q, and RND, each with/without an entropy bonus (coef. 0.01). All methods used the official Craftax implementation and hyperparameters ([https://github.com/MichaelTMatthews/Craftax_Baselines](https://github.com/MichaelTMatthews/Craftax_Baselines)); RePPO changed only method-specific settings. Following Matthews et al. ([2024](https://arxiv.org/html/2606.00151#bib.bib73)), agents were trained for 1B timesteps with 5 seeds and evaluated for 100 episodes per seed; we reported mean and standard deviation across seeds as % of the max reward (226). Hyperparameters are in App. [G](https://arxiv.org/html/2606.00151#A7).
Table 1: Craftax results. Mean and std of the % of the max reward (226) over 5 seeds. RePPO (1.2) is bolded and is comparable to PPO with entropy and RND bonuses (underlined).

Results. [Table 1](https://arxiv.org/html/2606.00151#S5.T1) shows that RePPO (1.2), without entropy or intrinsic bonuses, was competitive with PPO/RND variants that use such bonuses and outperformed PPO without them. App. [G.1](https://arxiv.org/html/2606.00151#A7.SS1) shows RePPO maintained higher entropy even without bonuses (RND degraded without entropy), supporting that RePPO promotes exploration at larger scale.
## 6Concluding Remarks
We conclude with the discussion and summary of the paper\.
### 6\.1Scope and Future Work\.
Our study focuses on discrete-action problems with a relatively small number of actions, where the full action-value vector is available and the ReMax gradient can be computed efficiently by sorting Q-values. Extending ReMax to large discrete or continuous action spaces via sampling-based estimators is an important direction. As discussed in Sec. [3](https://arxiv.org/html/2606.00151#S3), ReMax can be integrated into RL in multiple ways, and exploring alternatives to our PPO-based implementation, such as Q-learning (Mnih et al., [2013](https://arxiv.org/html/2606.00151#bib.bib62)) and off-policy actor–critic methods (Haarnoja et al., [2018](https://arxiv.org/html/2606.00151#bib.bib30)), is left for future work.
#### Stochastic and deep exploration.
One can distinguish between stochastic exploration and deep exploration (Gupta et al., [2018](https://arxiv.org/html/2606.00151#bib.bib85); Osband et al., [2016](https://arxiv.org/html/2606.00151#bib.bib38)). The former injects randomness into action selection to diversify data collection, as in entropy-regularized methods (Haarnoja et al., [2018](https://arxiv.org/html/2606.00151#bib.bib30)). The latter aims at temporally coherent information gathering, for example through visitation counts (Ostrovski et al., [2017](https://arxiv.org/html/2606.00151#bib.bib46)) or posterior sampling (Osband et al., [2016](https://arxiv.org/html/2606.00151#bib.bib38)), based on the mechanisms behind bandit algorithms with sublinear regret, such as UCB (Auer et al., [2002](https://arxiv.org/html/2606.00151#bib.bib59)) and Thompson sampling (Thompson, [1933](https://arxiv.org/html/2606.00151#bib.bib58)). In principle, both forms of exploration can emerge under ReMax. In the deterministic bandit example (Sec. [2.1](https://arxiv.org/html/2606.00151#S2.SS1)), ReMax keeps the policy stochastic by slowing its convergence, while in posterior bandits it leverages return uncertainty and yields empirically sublinear regret (Sec. [2.2](https://arxiv.org/html/2606.00151#S2.SS2)). In our deep RL experiments, however, the observed behavior was more consistent with stochastic exploration, as indicated by increased policy entropy. We conjecture that more structured exploration may also have contributed to RePPO’s performance gains, but confirming this requires more targeted empirical analysis. A promising direction is therefore to instantiate posterior-based ReMax by modeling epistemic uncertainty in Q-values explicitly, for example with ensembles or Bayesian critics. Finally, formal regret bounds for such variants would further strengthen the theoretical support for ReMax.
### 6\.2Summary\.
We introduced ReMax, a novel RL objective that encourages exploration by *directly* maximizing reward, without auxiliary bonus terms. We demonstrated its effectiveness as an exploration mechanism in bandit settings and extended the formulation to RL. To optimize ReMax efficiently, we derived a new policy-gradient expression and relaxed the retry count to a continuous parameter $m$, enabling fine-grained control over the exploration–exploitation trade-off. We instantiated these ideas in RePPO, a PPO-style deep actor–critic method with a Q-critic, and showed on MinAtar that optimizing ReMax induces exploratory behavior and improves performance without entropy bonuses.
## Impact Statement
This work proposes a novel objective for exploration in RL\. Its main potential impact is to improve exploration in domains where repeated attempts or diverse trials are important, including reasoning tasks with LLMs\. Unlike exploration methods based on explicit bonuses or posterior sampling, ReMax does not require a separately designed exploration bonus or an explicit posterior model, but rather is optimized using return samples\. This may broaden the applicability of RL algorithms to settings where reliable bonus design or posterior modeling is difficult\.
## Author Contributions
The project started in 2021 when Paavo Parmas proposed the idea to Sotetsu Koyamada who initially worked on it based on Paavo’s guidance\. Paavo moved to UTokyo and the project restarted in August 2024 when Soichiro Nishimori started working on it as his internship project under the guidance of Paavo Parmas\. The contributions in the final manuscript are as follows:
Soichiro Nishimori: Lead writer, key implementations (the RePPO implementation and all experiments concerning it, as well as other final implementations) and experimentation, figures, discussion and literature search, proposal and experimentation with some early versions of continuous $m$ (not included in the final paper), and proposal of KL-extension to our earlier AC style approaches and its experiment (also not included in the final paper).
Paavo Parmas: Conceptualized the idea, did all of the key mathematical derivations (derivation of the gradient estimator, the expected improvement formulation, proposal of the final version of continuous $m$, came up with the bandit algorithm, proposed RePPO, hypothesized the adaptive properties of ReMax, e.g., the ones in Figure 1), example implementations (first REINFORCE implementation for ReMax, first ReMax bandit implementations, some debugging), significant comments and editing on the paper.
Sotetsu Koyamada: Initial development and implementation on the project\. This led to a resetting based ReMax implementation similar to the A2C algorithm with experiments on MinAtar and maze domains\. Wrote an early preprint on the project based on guidance mainly from Paavo\.
Tadashi Kozuno was involved in some early discussion and guidance of Sotetsu. Toshinori Kitamura provided comments on the paper. Shin Ishii is the PI of the lab at Kyoto University to which Paavo and Sotetsu belonged when the project started, and provided general supervision of Sotetsu at the time. Yutaka Matsuo is the PI of the UTokyo lab to which Paavo belongs, and handled funding acquisition and general management and supervision at the lab.
## Acknowledgment
This work was supported by JSPS KAKENHI Grant Number JP22H04998\. SN was supported by JSPS KAKENHI Grant Number JP24KJ0818\.
## References
- R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. Bellemare (2021). Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems 34, pp. 29304–29320.
- S. Agrawal and N. Goyal (2017). Near-optimal regret bounds for Thompson sampling. Journal of the ACM (JACM) 64(5), pp. 1–24.
- G. An, S. Moon, J. Kim, and H. O. Song (2021). Uncertainty-based offline reinforcement learning with diversified Q-ensemble. Advances in Neural Information Processing Systems 34, pp. 7436–7447.
- P. Auer, N. Cesa-Bianchi, and P. Fischer (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2), pp. 235–256.
- M. G. Azar, I. Osband, and R. Munos (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning.
- K. Azizzadenesheli, E. Brunskill, and A. Anandkumar (2018). Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA), pp. 1–9.
- A. P. Badia, P. Sprechmann, A. Vitvitskyi, D. Guo, B. Piot, S. Kapturowski, O. Tieleman, M. Arjovsky, A. Pritzel, A. Bolt, and C. Blundell (2020). Never give up: learning directed exploration strategies. In International Conference on Learning Representations.
- N. Baram, G. Tennenholtz, and S. Mannor (2021). Action redundancy in reinforcement learning. In Conference on Uncertainty in Artificial Intelligence.
- J. Bayrooti, C. Ek, and A. Prorok (2025). Efficient model-based reinforcement learning through optimistic Thompson sampling. In International Conference on Learning Representations, Vol. 2025, pp. 5338–5358.
- M. G. Bellemare, W. Dabney, and R. Munos (2017). A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458.
- M. G. Bellemare, W. Dabney, and M. Rowland (2023). Distributional reinforcement learning. MIT Press.
- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013). The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
- M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016). Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems.
- Y. Burda, H. Edwards, A. Storkey, and O. Klimov (2019). Exploration by random network distillation. In International Conference on Learning Representations.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- Z. Chen, X. Qin, Y. Wu, Y. Ling, Q. Ye, W. X. Zhao, and G. Shi (2025). Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751.
- W. Dabney, M. Rowland, M. Bellemare, and R. Munos (2018). Distributional reinforcement learning with quantile regression. Proceedings of the AAAI Conference on Artificial Intelligence 32(1).
- C. Dann and E. Brunskill (2015). Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems.
- A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune (2021). First return, then explore. Nature 590(7847), pp. 580–586.
- L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu (2018). IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning.
- B. Eysenbach and S. Levine (2022). Maximum entropy RL (provably) solves some robust RL problems. In International Conference on Learning Representations.
- Y. Fei, Z. Yang, Y. Chen, and Z. Wang (2021). Exponential Bellman equation and improved regret bounds for risk-sensitive reinforcement learning. Advances in Neural Information Processing Systems 34, pp. 20436–20446.
- L. Fox, L. Choshen, and Y. Loewenstein (2018). DORA The Explorer: directed outreaching reinforcement action-selection. In International Conference on Learning Representations.
- J. Fu, J. Co-Reyes, and S. Levine (2017). EX2: exploration with exemplar models for deep reinforcement learning. In Advances in Neural Information Processing Systems.
- M. Gallici, M. Fellows, B. Ellis, B. Pou, I. Masmitja, J. Foerster, and M. Martin (2025). Simplifying deep temporal difference learning. In International Conference on Learning Representations, Vol. 2025, pp. 78148–78190.
- D. Ghosh, J. Rahme, A. Kumar, A. Zhang, R. P. Adams, and S. Levine (2021). Why generalization in RL is difficult: epistemic POMDPs and implicit partial observability. Advances in Neural Information Processing Systems.
- A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine (2018). Meta-reinforcement learning of structured exploration strategies. Advances in Neural Information Processing Systems 31.
- T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning.
- D\. Hafner, T\. Lillicrap, J\. Ba, and M\. Norouzi \(2020\)Dream to control: learning behaviors by latent imagination\.InInternational Conference on Learning Representations,Cited by:[§3](https://arxiv.org/html/2606.00151#S3.p5.5)\.
- D\. Hafner \(2022\)Benchmarking the spectrum of agent capabilities\.InInternational Conference on Learning Representations,Cited by:[§5\.2](https://arxiv.org/html/2606.00151#S5.SS2.p1.1),[§5](https://arxiv.org/html/2606.00151#S5.p1.1)\.
- J\. I\. Hamid, I\. H\. Orney, E\. Xu, C\. Finn, and D\. Sadigh \(2026\)Polychromic objectives for reinforcement learning\.InInternational Conference on Learning Representations,Cited by:[§A\.2](https://arxiv.org/html/2606.00151#A1.SS2.p1.6)\.
- D\. Hennes, D\. Morrill, S\. Omidshafiei, R\. Munos, J\. Perolat, M\. Lanctot, A\. Gruslys, J\. Lespiau, P\. Parmas, E\. Duéñez\-Guzmán,et al\.\(2020\)Neural replicator dynamics: multiagent learning via hedging policy gradients\.InProceedings of the 19th international conference on autonomous agents and multiagent systems,pp\. 492–501\.Cited by:[§2\.1](https://arxiv.org/html/2606.00151#S2.SS1.SSS0.Px3.p2.21)\.
- J\. Honda and A\. Takemura \(2014\)Optimality of thompson sampling for gaussian bandits depends on priors\.InArtificial Intelligence and Statistics,pp\. 375–383\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px4.p1.1),[§C\.3](https://arxiv.org/html/2606.00151#A3.SS3.SSS0.Px4.p1.8),[§2\.2](https://arxiv.org/html/2606.00151#S2.SS2.p2.20)\.
- R\. Houthooft, X\. Chen, X\. Chen, Y\. Duan, J\. Schulman, F\. De Turck, and P\. Abbeel \(2016\)VIME: variational information maximizing exploration\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p2.1)\.
- R\. A\. Howard and J\. E\. Matheson \(1972\)Risk\-sensitive markov decision processes\.Management science18\(7\),pp\. 356–369\.Cited by:[§A\.3](https://arxiv.org/html/2606.00151#A1.SS3.p1.1)\.
- S\. Huang, R\. F\. J\. Dossa, C\. Ye, J\. Braga, D\. Chakraborty, K\. Mehta, and J\. G\.M\. Araújo \(2022\)CleanRL: high\-quality single\-file implementations of deep reinforcement learning algorithms\.Journal of Machine Learning Research23\(274\),pp\. 1–18\.Cited by:[Appendix F](https://arxiv.org/html/2606.00151#A6.p1.1)\.
- H\. Ishfaq, Q\. Cui, V\. Nguyen, A\. Ayoub, Z\. Yang, Z\. Wang, D\. Precup, and L\. Yang \(2021\)Randomized exploration in reinforcement learning with general value function approximation\.InInternational Conference on Machine Learning,pp\. 4607–4616\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px4.p1.1)\.
- H\. Ishfaq, Q\. Lan, P\. Xu, A\. R\. Mahmood, D\. Precup, K\. Azizzadenesheli,et al\.\(2024\)Provable and practical: efficient exploration in reinforcement learning via langevin monte carlo\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 47602–47647\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px4.p1.1)\.
- H\. Ishfaq, G\. Wang, S\. Islam, and D\. Precup \(2025\)Langevin soft actor\-critic: efficient exploration through uncertainty\-driven critic learning\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 13758–13784\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px4.p1.1)\.
- T\. Jaksch, R\. Ortner, and P\. Auer \(2010\)Near\-optimal regret bounds for reinforcement learning\.Journal of Machine Learning Research11\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px1.p1.1)\.
- C\. Jin, Z\. Allen\-Zhu, S\. Bubeck, and M\. I\. Jordan \(2018\)Is Q\-learning provably efficient?\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px1.p1.1)\.
- D\. R\. Jones, M\. Schonlau, and W\. J\. Welch \(1998\)Efficient global optimization of expensive black\-box functions\.Journal of Global Optimization13\(4\),pp\. 455–492\.Cited by:[§4\.1](https://arxiv.org/html/2606.00151#S4.SS1.SSS0.Px3.p2.15)\.
- S\. Koyamada, S\. Okano, S\. Nishimori, Y\. Murata, K\. Habara, H\. Kita, and S\. Ishii \(2023\)Pgx: hardware\-accelerated parallel game simulators for reinforcement learning\.Advances in Neural Information Processing Systems36,pp\. 45716–45743\.Cited by:[1st item](https://arxiv.org/html/2606.00151#A5.I1.i1.p1.1),[2nd item](https://arxiv.org/html/2606.00151#A5.I1.i2.p1.1),[§5](https://arxiv.org/html/2606.00151#S5.p1.1)\.
- S\. Koyamada, P\. Parmas, T\. Kozuno, and S\. Ishii \(2022\)Emergence of exploration in policy gradient reinforcement learning via resetting\.External Links:[Link](https://openreview.net/forum?id=GKsNIC_mQRG)Cited by:[§A\.2](https://arxiv.org/html/2606.00151#A1.SS2.p1.6),[footnote 1](https://arxiv.org/html/2606.00151#footnote1)\.
- H\. Küttler, N\. Nardelli, A\. Miller, R\. Raileanu, M\. Selvatici, E\. Grefenstette, and T\. Rocktäschel \(2020\)The NetHack learning environment\.Advances in Neural Information Processing Systems33,pp\. 7671–7684\.Cited by:[§5\.2](https://arxiv.org/html/2606.00151#S5.SS2.p1.1)\.
- S\. Labbi, D\. Tiapkin, P\. Mangold, and E\. Moulines \(2026\)Beyond softmax and entropy: convergence rates of policy gradients with𝒇\\bm\{f\}\-softargmax parameterization & coupled regularization\.InInternational Conference on Learning Representations,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px3.p1.1)\.
- S\. Levine \(2018\)Reinforcement learning and control as probabilistic inference: tutorial and review\.arXiv preprint arXiv:1805\.00909\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px3.p1.1)\.
- S\. Lobel, A\. Bagaria, and G\. Konidaris \(2023\)Flipping coins to estimate pseudocounts for exploration in reinforcement learning\.InInternational Conference on Machine Learning,pp\. 22594–22613\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px1.p1.1),[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p3.1),[§1](https://arxiv.org/html/2606.00151#S1.p6.1)\.
- C\. Lu, J\. Kuba, A\. Letcher, L\. Metz, C\. Schroeder de Witt, and J\. Foerster \(2022\)Discovered policy optimisation\.Advances in Neural Information Processing Systems35,pp\. 16455–16468\.Cited by:[2nd item](https://arxiv.org/html/2606.00151#A5.I1.i2.p1.1),[§5\.1](https://arxiv.org/html/2606.00151#S5.SS1.p2.7)\.
- M\. Matthews, M\. Beukman, B\. Ellis, M\. Samvelyan, M\. Jackson, S\. Coward, and J\. Foerster \(2024\)Craftax: a lightning\-fast benchmark for open\-ended reinforcement learning\.InProceedings of the 41st International Conference on Machine Learning,pp\. 35104–35137\.Cited by:[§1](https://arxiv.org/html/2606.00151#S1.p14.2),[§5\.2](https://arxiv.org/html/2606.00151#S5.SS2.p1.1),[§5\.2](https://arxiv.org/html/2606.00151#S5.SS2.p2.2),[§5](https://arxiv.org/html/2606.00151#S5.p1.1)\.
- V\. Mnih, A\. P\. Badia, M\. Mirza, A\. Graves, T\. Lillicrap, T\. Harley, D\. Silver, and K\. Kavukcuoglu \(2016\)Asynchronous methods for deep reinforcement learning\.InInternational Conference on Machine Learning,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.00151#S1.p1.1)\.
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. Graves, I\. Antonoglou, D\. Wierstra, and M\. Riedmiller \(2013\)Playing Atari with deep reinforcement learning\.arXiv preprint arXiv:1312\.5602\.Cited by:[§3](https://arxiv.org/html/2606.00151#S3.p8.1),[§3](https://arxiv.org/html/2606.00151#S3.p9.1),[§6\.1](https://arxiv.org/html/2606.00151#S6.SS1.p1.1)\.
- S\. Moalla, A\. Miele, D\. Pyatko, R\. Pascanu, and C\. Gulcehre \(2024\)No representation, no trust: connecting representation, collapse, and trust issues in PPO\.Advances in Neural Information Processing Systems37,pp\. 69652–69699\.Cited by:[§1](https://arxiv.org/html/2606.00151#S1.p5.6),[§3](https://arxiv.org/html/2606.00151#S3.p5.5)\.
- M\. Mutti, L\. Pratissoli, and M\. Restelli \(2021\)Task\-agnostic exploration via policy gradient of a non\-parametric state entropy estimate\.Proceedings of the AAAI Conference on Artificial Intelligence35\(10\),pp\. 9028–9036\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px3.p1.1)\.
- I\. Osband, J\. Aslanides, and A\. Cassirer \(2018\)Randomized prior functions for deep reinforcement learning\.Advances in Neural Information Processing Systems31\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px4.p1.1),[§1](https://arxiv.org/html/2606.00151#S1.p1.1),[§3](https://arxiv.org/html/2606.00151#S3.p5.5)\.
- I\. Osband, C\. Blundell, A\. Pritzel, and B\. Van Roy \(2016\)Deep exploration via bootstrapped DQN\.InAdvances in Neural Information Processing Systems,Vol\.29,pp\. 4026–4034\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px4.p1.1),[§1](https://arxiv.org/html/2606.00151#S1.p1.1),[§1](https://arxiv.org/html/2606.00151#S1.p5.6),[§3](https://arxiv.org/html/2606.00151#S3.p5.5),[§6\.1](https://arxiv.org/html/2606.00151#S6.SS1.SSS0.Px1.p1.1)\.
- I\. Osband, B\. Van Roy, D\. J\. Russo, and Z\. Wen \(2019\)Deep exploration via randomized value functions\.Journal of Machine Learning Research20\(124\),pp\. 1–62\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px4.p1.1),[§1](https://arxiv.org/html/2606.00151#S1.p1.1),[§3](https://arxiv.org/html/2606.00151#S3.p5.5)\.
- I\. Osband, Z\. Wen, S\. M\. Asghari, V\. Dwaracherla, M\. Ibrahimi, X\. Lu, and B\. Van Roy \(2023\)Approximate thompson sampling via epistemic neural networks\.InUncertainty in Artificial Intelligence,pp\. 1586–1595\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px4.p1.1)\.
- G\. Ostrovski, M\. G\. Bellemare, A\. Oord, and R\. Munos \(2017\)Count\-based exploration with neural density models\.InInternational Conference on Machine Learning,pp\. 2721–2730\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px1.p1.1),[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p3.1),[§1](https://arxiv.org/html/2606.00151#S1.p1.1),[§6\.1](https://arxiv.org/html/2606.00151#S6.SS1.SSS0.Px1.p1.1)\.
- D\. Pathak, P\. Agrawal, A\. A\. Efros, and T\. Darrell \(2017\)Curiosity\-driven exploration by self\-supervised prediction\.InInternational Conference on Machine Learning,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2606.00151#S1.p6.1)\.
- S\. Pitis, H\. Chan, S\. Zhao, B\. Stadie, and J\. Ba \(2020\)Maximum entropy gain exploration for long horizon multi\-goal reinforcement learning\.InInternational Conference on Machine Learning,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px3.p1.1)\.
- M\. L\. Puterman \(2014\)Markov decision processes: discrete stochastic dynamic programming\.John Wiley & Sons\.Cited by:[§3](https://arxiv.org/html/2606.00151#S3.p1.7)\.
- R\. T\. Rockafellar, S\. Uryasev,et al\.\(2000\)Optimization of conditional value\-at\-risk\.Journal of Risk2,pp\. 21–42\.Cited by:[§A\.3](https://arxiv.org/html/2606.00151#A1.SS3.p1.1)\.
- R\. Sasso, M\. Conserva, and P\. Rauber \(2023\)Posterior sampling for deep reinforcement learning\.InInternational Conference on Machine Learning,pp\. 30042–30061\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px4.p1.1)\.
- N\. Savinov, A\. Raichuk, D\. Vincent, R\. Marinier, M\. Pollefeys, T\. Lillicrap, and S\. Gelly \(2019\)Episodic curiosity through reachability\.InInternational Conference on Learning Representations,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p3.1)\.
- J\. Schmidhuber \(1991a\)A Possibility for Implementing Curiosity and Boredom in Model\-Building Neural Controllers\.InInternational Conference on Simulation of Adaptive Behavior: From Animals to Animats,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p2.1)\.
- J\. Schmidhuber \(1991b\)Curious model\-building control systems\.InInternational Joint Conference on Neural Networks,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p2.1)\.
- J\. Schmidhuber \(2010\)Formal theory of creativity, fun, and intrinsic motivation \(1990–2010\)\.IEEE Transactions on Autonomous Mental Development2\(3\),pp\. 230–247\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p2.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§1](https://arxiv.org/html/2606.00151#S1.p13.2),[§3](https://arxiv.org/html/2606.00151#S3.p8.1),[§3](https://arxiv.org/html/2606.00151#S3.p9.1),[§4\.3](https://arxiv.org/html/2606.00151#S4.SS3.p1.1),[§4](https://arxiv.org/html/2606.00151#S4.p1.2),[§5\.1](https://arxiv.org/html/2606.00151#S5.SS1.p2.7)\.
- B\. C\. Stadie, S\. Levine, and P\. Abbeel \(2015\)Incentivizing exploration in reinforcement learning with deep predictive models\.arXiv preprint arXiv:1507\.00814\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p2.1)\.
- A\. L\. Strehl, L\. Li, E\. Wiewiora, J\. Langford, and M\. L\. Littman \(2006\)PAC model\-free reinforcement learning\.InInternational Conference on Machine Learning,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px1.p1.1)\.
- A\. L\. Strehl and M\. L\. Littman \(2005\)A theoretical analysis of model\-based interval estimation\.InInternational Conference on Machine Learning,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px1.p1.1)\.
- A\. L\. Strehl and M\. L\. Littman \(2008\)An analysis of model\-based interval estimation for Markov decision processes\.Journal of Computer and System Sciences74\(8\),pp\. 1309–1331\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px1.p1.1)\.
- B\. Sukhija, S\. Coros, A\. Krause, P\. Abbeel, and C\. Sferrazza \(2025\)MaxInfoRL: boosting exploration in reinforcement learning through information gain maximization\.InInternational Conference on Learning Representations,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p2.1)\.
- Y\. Sun, F\. J\. Gomez, and J\. Schmidhuber \(2011\)Planning to be surprised: Optimal Bayesian exploration in dynamic environments\.InConference on Artificial General Intelligence,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p2.1)\.
- R\. S\. Sutton and A\. G\. Barto \(2018\)Reinforcement learning: An introduction\.2nd edition,The MIT Press\.Cited by:[§1](https://arxiv.org/html/2606.00151#S1.p1.1),[§1](https://arxiv.org/html/2606.00151#S1.p5.6),[§3](https://arxiv.org/html/2606.00151#S3.p3.9),[§4\.1](https://arxiv.org/html/2606.00151#S4.SS1.SSS0.Px1.p1.2)\.
- A\. A\. Taiga, W\. Fedus, M\. C\. Machado, A\. Courville, and M\. G\. Bellemare \(2020\)On bonus\-based exploration methods in the arcade learning environment\.InInternational Conference on Learning Representations,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p3.1)\.
- H\. Tang, R\. Houthooft, D\. Foote, A\. Stooke, O\. Xi Chen, Y\. Duan, J\. Schulman, F\. DeTurck, and P\. Abbeel \(2017\)\#Exploration: A study of count\-based exploration for deep reinforcement learning\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p3.1)\.
- H\. Tang and G\. Berseth \(2024\)Improving deep reinforcement learning by reducing the chain effect of value and policy churn\.Advances in Neural Information Processing Systems37,pp\. 15320–15355\.Cited by:[§3](https://arxiv.org/html/2606.00151#S3.p5.5)\.
- Y\. Tang, K\. Zheng, G\. Synnaeve, and R\. Munos \(2025a\)Optimizing language models for inference time objectives using reinforcement learning\.InInternational Conference on Machine Learning,pp\. 59066–59085\.Cited by:[§A\.2](https://arxiv.org/html/2606.00151#A1.SS2.p1.6),[§1](https://arxiv.org/html/2606.00151#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.00151#S4.SS1.SSS0.Px3.p1.4),[§4\.1](https://arxiv.org/html/2606.00151#S4.SS1.SSS0.Px3.p3.5)\.
- Y\. Tang, Y\. Zhang, J\. Ackermann, Y\. Zhang, S\. Nishimori, and M\. Sugiyama \(2025b\)Recursive reward aggregation\.InReinforcement Learning Conference,Cited by:[§A\.2](https://arxiv.org/html/2606.00151#A1.SS2.p1.6)\.
- W\. R\. Thompson \(1933\)On the likelihood that one unknown probability exceeds another in view of the evidence of two samples\.Biometrika25\(3/4\),pp\. 285–294\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px4.p1.1),[§C\.3](https://arxiv.org/html/2606.00151#A3.SS3.SSS0.Px4.p1.8),[§1](https://arxiv.org/html/2606.00151#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.00151#S2.SS2.p1.2),[§2\.2](https://arxiv.org/html/2606.00151#S2.SS2.p2.20),[§6\.1](https://arxiv.org/html/2606.00151#S6.SS1.SSS0.Px1.p1.1)\.
- S\. B\. Thrun and K\. Möller \(1991\)On planning and exploration in non\-discrete environments\.Technical reportGesellschaft fur Mathematik und Datenverarbeitung, D\-5205 St\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px2.p2.1)\.
- N\. Vieillard, O\. Pietquin, and M\. Geist \(2020\)Munchausen reinforcement learning\.Advances in Neural Information Processing Systems33,pp\. 4235–4246\.Cited by:[§3](https://arxiv.org/html/2606.00151#S3.p8.1)\.
- C\. Walder and D\. T\. Karkhanis \(2025\)Pass@K policy optimization: solving harder reinforcement learning problems\.Advances in Neural Information Processing Systems38,pp\. 152416–152445\.Cited by:[§A\.2](https://arxiv.org/html/2606.00151#A1.SS2.p1.6),[§1](https://arxiv.org/html/2606.00151#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.00151#S4.SS1.SSS0.Px3.p3.5)\.
- J\. Weng, M\. Lin, S\. Huang, B\. Liu, D\. Makoviichuk, V\. Makoviychuk, Z\. Liu, Y\. Song, T\. Luo, Y\. Jiang,et al\.\(2022\)EnvPool: a highly parallel reinforcement learning environment execution engine\.Advances in Neural Information Processing Systems35,pp\. 22409–22421\.Cited by:[Appendix F](https://arxiv.org/html/2606.00151#A6.p1.1)\.
- R\. J\. Williams and J\. Peng \(1991\)Function optimization using connectionist reinforcement learning algorithms\.Connection Science3\(3\),pp\. 241–268\.Cited by:[§A\.1](https://arxiv.org/html/2606.00151#A1.SS1.SSS0.Px3.p1.1)\.
- R\. J\. Williams \(1992\)Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.Machine Learning8\(3\),pp\. 229–256\.Cited by:[§4\.1](https://arxiv.org/html/2606.00151#S4.SS1.p1.1)\.
- C\. Ying, X\. Zhou, H\. Su, D\. Yan, N\. Chen, and J\. Zhu \(2022\)Towards safe reinforcement learning via constraining conditional value\-at\-risk\.InProceedings of the Thirty\-First International Joint Conference on Artificial Intelligence, IJCAI\-22,L\. D\. Raedt \(Ed\.\),pp\. 3673–3680\.Note:Main TrackExternal Links:[Document](https://dx.doi.org/10.24963/ijcai.2022/510)Cited by:[§A\.3](https://arxiv.org/html/2606.00151#A1.SS3.p1.1)\.
- K\. Young and T\. Tian \(2019\)MinAtar: an Atari\-inspired testbed for thorough and reproducible reinforcement learning experiments\.arXiv preprint arXiv:1903\.03176\.Cited by:[§1](https://arxiv.org/html/2606.00151#S1.p14.2),[§5](https://arxiv.org/html/2606.00151#S5.p1.1)\.
- B\. D\. Ziebart, A\. Maas, J\. A\. Bagnell, and A\. K\. Dey \(2008\)Maximum entropy inverse reinforcement learning\.InAAAI Conference on Artificial Intelligence,Cited by:[§1](https://arxiv.org/html/2606.00151#S1.p1.1)\.
###### Appendix Contents
1. [1 Introduction](https://arxiv.org/html/2606.00151#S1)
2. [2 ReMax in Bandits: An Empirical Study](https://arxiv.org/html/2606.00151#S2)
   1. [2\.1 Warm\-up: Properties of ReMax\.](https://arxiv.org/html/2606.00151#S2.SS1)
   2. [2\.2 Bandit with Posterior: Empirical Sublinear Regret\.](https://arxiv.org/html/2606.00151#S2.SS2)
3. [3 ReMax in RL](https://arxiv.org/html/2606.00151#S3)
4. [4 Policy Gradient in ReMax](https://arxiv.org/html/2606.00151#S4)
   1. [4\.1 Estimation\-Friendly Policy Gradient for ReMax](https://arxiv.org/html/2606.00151#S4.SS1)
   2. [4\.2 Efficient and Generalized Computation of EI](https://arxiv.org/html/2606.00151#S4.SS2)
   3. [4\.3 RePPO: Practical Policy Gradient for ReMax](https://arxiv.org/html/2606.00151#S4.SS3)
5. [5 Experiments](https://arxiv.org/html/2606.00151#S5)
   1. [5\.1 MinAtar](https://arxiv.org/html/2606.00151#S5.SS1)
   2. [5\.2 Craftax](https://arxiv.org/html/2606.00151#S5.SS2)
6. [6 Concluding Remarks](https://arxiv.org/html/2606.00151#S6)
   1. [6\.1 Scope and Future Work\.](https://arxiv.org/html/2606.00151#S6.SS1)
   2. [6\.2 Summary\.](https://arxiv.org/html/2606.00151#S6.SS2)
7. [References](https://arxiv.org/html/2606.00151#bib)
8. [A Extended Related Work](https://arxiv.org/html/2606.00151#A1)
   1. [A\.1 Exploration in RL](https://arxiv.org/html/2606.00151#A1.SS1)
   2. [A\.2 Retry\-based objectives](https://arxiv.org/html/2606.00151#A1.SS2)
   3. [A\.3 Risk\-sensitive RL](https://arxiv.org/html/2606.00151#A1.SS3)
9. [B Proofs](https://arxiv.org/html/2606.00151#A2)
   1. [B\.1 Proof of Proposition 3\.2](https://arxiv.org/html/2606.00151#A2.SS1)
   2. [B\.2 Proof of Proposition 4\.1\.](https://arxiv.org/html/2606.00151#A2.SS2)
   3. [B\.3 Proof of Proposition 4\.3\.](https://arxiv.org/html/2606.00151#A2.SS3)
10. [C Details of the Bandit Experiments](https://arxiv.org/html/2606.00151#A3)
    1. [C\.1 Binary Bandit](https://arxiv.org/html/2606.00151#A3.SS1)
    2. [C\.2 Bernoulli bandit](https://arxiv.org/html/2606.00151#A3.SS2)
    3. [C\.3 Bandit with Posterior](https://arxiv.org/html/2606.00151#A3.SS3)
11. [D RePPO](https://arxiv.org/html/2606.00151#A4)
12. [E MinAtar Experiment](https://arxiv.org/html/2606.00151#A5)
    1. [E\.1 Experimental setup](https://arxiv.org/html/2606.00151#A5.SS1)
    2. [E\.2 Additional results](https://arxiv.org/html/2606.00151#A5.SS2)
13. [F Atari Experiment](https://arxiv.org/html/2606.00151#A6)
    1. [F\.1 Experimental setup](https://arxiv.org/html/2606.00151#A6.SS1)
    2. [F\.2 Results](https://arxiv.org/html/2606.00151#A6.SS2)
14. [G Craftax Experiment](https://arxiv.org/html/2606.00151#A7)
    1. [G\.1 Results](https://arxiv.org/html/2606.00151#A7.SS1)
15. [H Speed Benchmark](https://arxiv.org/html/2606.00151#A8)
16. [I LLM Usage](https://arxiv.org/html/2606.00151#A9)
17. [J Hyperparameter Tables](https://arxiv.org/html/2606.00151#A10)
#### Reproducibility Statement\.
We strive to make our results easy to reproduce\. The ReMax objective, its closed‑form components, and the policy‑gradient estimator are specified in Secs\. [3](https://arxiv.org/html/2606.00151#S3) and [4](https://arxiv.org/html/2606.00151#S4), with all assumptions and complete proofs in App\. [B](https://arxiv.org/html/2606.00151#A2)\. The exact advantage computation used in our implementation is provided in App\. [D](https://arxiv.org/html/2606.00151#A4)\. The MinAtar setup, environments, evaluation protocol, normalization, and all hyperparameters for every method appear in Sec\. [5](https://arxiv.org/html/2606.00151#S5) and App\. [E](https://arxiv.org/html/2606.00151#A5) \(including tables and additional analyses\)\. We report results over 10 random seeds and 100 evaluation episodes per seed and summarize performance with RLiable aggregates; per‑game curves and ablations are in App\. [E\.2](https://arxiv.org/html/2606.00151#A5.SS2)\. Complete details for the bandit experiments \(problem setups, optimization procedure, and baselines\) are in App\. [C](https://arxiv.org/html/2606.00151#A3)\. Official code is available at [https://github\.com/nissymori/remax\-rl](https://github.com/nissymori/remax-rl)\.
## Appendix A Extended Related Work
This section reviews related work on exploration in RL \(App\. A\.1\), retry\-based objectives \(App\. A\.2\), and risk\-sensitive RL \(App\. A\.3\)\.
### A\.1 Exploration in RL
#### Optimism in the Face of Uncertainty \(OFU\)\.
A prominent exploration strategy is OFU\. Methods based on OFU mainly fall into two categories: confidence\-based \(Strehl and Littman, [2005](https://arxiv.org/html/2606.00151#bib.bib8); Jaksch et al\., [2010](https://arxiv.org/html/2606.00151#bib.bib7); Dann and Brunskill, [2015](https://arxiv.org/html/2606.00151#bib.bib9)\) and bonus\-based \(Strehl et al\., [2006](https://arxiv.org/html/2606.00151#bib.bib6); Strehl and Littman, [2008](https://arxiv.org/html/2606.00151#bib.bib5); Azar et al\., [2017](https://arxiv.org/html/2606.00151#bib.bib3); Jin et al\., [2018](https://arxiv.org/html/2606.00151#bib.bib4)\)\. While these approaches enjoy strong theoretical guarantees, they do not directly extend to deep RL, where explicit visitation counts are impractical to maintain\. Bellemare et al\. \([2016](https://arxiv.org/html/2606.00151#bib.bib10)\) generalize visitation counts to enable OFU\-style exploration in deep RL\. Several follow\-ups pursue the count\-based approach in deep RL, including Lobel et al\. \([2023](https://arxiv.org/html/2606.00151#bib.bib47)\) and Ostrovski et al\. \([2017](https://arxiv.org/html/2606.00151#bib.bib46)\)\. In contrast to OFU, ReMax does not employ an explicit exploration mechanism; instead, exploration emerges by maximizing a multiple‑retry objective defined over Q‑values\.
#### Intrinsic Motivation \(IM\)\.
A complementary line of work that scales well with deep RL is intrinsic motivation \(IM\), which is broadly categorized into prediction\-error\-based, information\-gain\-based, and novelty\-based methods\.
Prediction\-error\-based methods learn a dynamics model and encourage visiting states \(or state–action pairs\) whose successor states are hard to predict \(Stadie et al\., [2015](https://arxiv.org/html/2606.00151#bib.bib11); Pathak et al\., [2017](https://arxiv.org/html/2606.00151#bib.bib12)\); the idea dates back to Schmidhuber \([1991a](https://arxiv.org/html/2606.00151#bib.bib13)\) and Thrun and Möller \([1991](https://arxiv.org/html/2606.00151#bib.bib14)\)\. However, they can overemphasize inherently noisy states \(Schmidhuber, [1991b](https://arxiv.org/html/2606.00151#bib.bib15)\)\. Information\-gain\-based methods instead seek states that reduce model uncertainty \(Schmidhuber, [2010](https://arxiv.org/html/2606.00151#bib.bib17); Sun et al\., [2011](https://arxiv.org/html/2606.00151#bib.bib16); Houthooft et al\., [2016](https://arxiv.org/html/2606.00151#bib.bib18); Sukhija et al\., [2025](https://arxiv.org/html/2606.00151#bib.bib64)\)\.
Novelty\-based methods directly incentivize visiting "novel" states \(or state–action pairs\)\. Notions of novelty include pseudo\-counts \(Bellemare et al\., [2016](https://arxiv.org/html/2606.00151#bib.bib10); Lobel et al\., [2023](https://arxiv.org/html/2606.00151#bib.bib47); Taiga et al\., [2020](https://arxiv.org/html/2606.00151#bib.bib48); Ostrovski et al\., [2017](https://arxiv.org/html/2606.00151#bib.bib46)\), the estimated probability of a state appearing in a replay buffer \(Fu et al\., [2017](https://arxiv.org/html/2606.00151#bib.bib20)\), reachability\-based metrics \(Savinov et al\., [2019](https://arxiv.org/html/2606.00151#bib.bib19)\), and intra\-episode state diversity \(Badia et al\., [2020](https://arxiv.org/html/2606.00151#bib.bib24)\)\. Tang et al\. \([2017](https://arxiv.org/html/2606.00151#bib.bib23)\) discretize the state space with hashing to obtain counts\.
These methods typically require estimating auxiliary density or dynamics models \(e\.g\., transition models or visitation frequencies\)\. By contrast, our method does not introduce any additional estimation targets\. Two notable exceptions that avoid explicit density estimation are RND \(Burda et al\., [2019](https://arxiv.org/html/2606.00151#bib.bib21)\) and E\-values \(Fox et al\., [2018](https://arxiv.org/html/2606.00151#bib.bib22)\)\. Using such novelty detectors to flag unexplored states, and to intensify ReMax\-driven exploration there, is an interesting future direction\.
#### Entropy Maximization\.
Entropy\-based exploration is widely used with policy\-gradient and actor–critic methods \(Williams and Peng, [1991](https://arxiv.org/html/2606.00151#bib.bib26); Mnih et al\., [2016](https://arxiv.org/html/2606.00151#bib.bib28); Espeholt et al\., [2018](https://arxiv.org/html/2606.00151#bib.bib29); Levine, [2018](https://arxiv.org/html/2606.00151#bib.bib70); Eysenbach and Levine, [2022](https://arxiv.org/html/2606.00151#bib.bib66)\)\. SAC is a popular example for continuous control \(Haarnoja et al\., [2018](https://arxiv.org/html/2606.00151#bib.bib30)\)\. While these approaches increase *action* entropy, increasing *state* entropy may align better with exploration goals \(Pitis et al\., [2020](https://arxiv.org/html/2606.00151#bib.bib32); Mutti et al\., [2021](https://arxiv.org/html/2606.00151#bib.bib31)\), since action entropy alone promotes undirected exploration\. Baram et al\. \([2021](https://arxiv.org/html/2606.00151#bib.bib33)\) propose maximizing transition entropy \(the entropy of the next\-state distribution given the current state\) as a proxy for state entropy\. Recent work has also explored regularizers beyond entropy, such as f\-divergence regularization \(Labbi et al\., [2026](https://arxiv.org/html/2606.00151#bib.bib83)\)\. In contrast, with ReMax, as shown in Sec\. [2](https://arxiv.org/html/2606.00151#S2), directed \(uncertainty\-aware\) exploration emerges from maximizing the retry objective\.
#### Posterior Sampling\.
Another family draws inspiration from Thompson sampling \(Thompson, [1933](https://arxiv.org/html/2606.00151#bib.bib58); Honda and Takemura, [2014](https://arxiv.org/html/2606.00151#bib.bib57); Agrawal and Goyal, [2017](https://arxiv.org/html/2606.00151#bib.bib56)\)\. Bootstrapped DQN \(Osband et al\., [2016](https://arxiv.org/html/2606.00151#bib.bib38)\) samples a Q\-function from an ensemble to emulate Thompson sampling, an approach later enhanced with randomized priors and value functions \(Osband et al\., [2018](https://arxiv.org/html/2606.00151#bib.bib39), [2019](https://arxiv.org/html/2606.00151#bib.bib40); Ishfaq et al\., [2021](https://arxiv.org/html/2606.00151#bib.bib79)\)\. Other works use Bayesian neural networks to model the Q\-function posterior \(Azizzadenesheli et al\., [2018](https://arxiv.org/html/2606.00151#bib.bib41); Bayrooti et al\., [2025](https://arxiv.org/html/2606.00151#bib.bib42); Osband et al\., [2023](https://arxiv.org/html/2606.00151#bib.bib43)\) or adopt model\-based posterior sampling \(Sasso et al\., [2023](https://arxiv.org/html/2606.00151#bib.bib44)\)\. Some works use Langevin Monte Carlo and efficient approximate sampling schemes to realize Thompson\-style exploration in deep RL \(Ishfaq et al\., [2024](https://arxiv.org/html/2606.00151#bib.bib78), [2025](https://arxiv.org/html/2606.00151#bib.bib45)\)\. Integrating ReMax with such methods to *explicitly* model the Q\-value distribution is a promising avenue\.
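As a rough illustration of the Thompson\-sampling idea that these methods build on, the sketch below runs Beta\-Bernoulli Thompson sampling on a toy bandit\. It is not taken from any of the cited works: the function name `thompson_bernoulli`, the uniform Beta\(1, 1\) prior, and the arm means are assumptions chosen for illustration only\.

```python
import random


def thompson_bernoulli(true_means, steps, seed=0):
    """Beta-Bernoulli Thompson sampling on a toy bandit (illustrative sketch).

    true_means: hypothetical success probability of each arm.
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    alpha = [1] * n_arms  # Beta posterior parameter: 1 + observed successes
    beta = [1] * n_arms   # Beta posterior parameter: 1 + observed failures
    pulls = [0] * n_arms

    for _ in range(steps):
        # Draw one plausible mean per arm from its posterior ...
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # ... and act greedily with respect to that single draw.
        arm = max(range(n_arms), key=lambda a: samples[a])

        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(thompson_bernoulli([0.2, 0.5, 0.8], steps=2000))
```

Exploration here comes entirely from posterior randomness: arms with few observations have wide posteriors and are occasionally sampled as the best, while well\-estimated poor arms are pulled less and less\.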
### A\.2Retry\-based objectives
Pass@K was introduced as an evaluation metric for LLM reasoning, measuring whether at least one of $K$ samples is correct \(Chen et al\.,[2021](https://arxiv.org/html/2606.00151#bib.bib63)\)\. Beyond evaluation, recent work optimizes pass@K\-like objectives directly\. An early preprint of this submission proposed the ReMax objective and its simple optimization with REINFORCE already in 2022 \(Koyamada et al\.,[2022](https://arxiv.org/html/2606.00151#bib.bib80)\), preceding subsequent max@K/pass@K policy\-optimization works in LLM training\. Walder and Karkhanis \([2025](https://arxiv.org/html/2606.00151#bib.bib36)\) extend to continuous rewards via max@K, deriving unbiased estimators via reward transformations computed from mini\-batches of sampled rewards/returns, and show that maximizing the best of $K$ trials improves both pass@K and pass@1\. Their formulation coincides with ours when per\-action values are fixed\. Tang et al\. \([2025a](https://arxiv.org/html/2606.00151#bib.bib37)\) analyze $K$\-sample objectives \(pass@K, majority vote\), studying bias–variance and KL efficiency and proposing leave\-one\-out control variates\. Chen et al\. \([2025](https://arxiv.org/html/2606.00151#bib.bib49)\) propose RLVR with pass@K as the training signal, using bootstrap grouping and analytic advantages to cut rollout cost\. These works consistently show that retry\-based training improves exploration and robustness in reasoning tasks\.

Despite sharing a similar idea, several key distinctions separate the scope of our work from theirs\. *Uncertainty over rewards:* We model uncertainty over rewards or Q\-values and adapt exploration to \(epistemic\) uncertainty, whereas LLM reasoning benchmarks typically assume a fixed reward\. *Retry feasibility:* In episodic RL, multiple returns from the same state are generally infeasible; we emulate retries via a learned Q\-function and derive a policy gradient estimable from single\-trajectory data\. *Closed\-form vs\. sample\-based computation:* Our ReMax objective and EI admit closed\-form expressions in discrete\-action RL that compute directly from the policy action probabilities \(the $p$/$\pi$ terms\), whereas LLM pass@K/max@K training typically constructs objectives and gradient estimates from a batch of sampled completions and their rewards/returns \(Walder and Karkhanis,[2025](https://arxiv.org/html/2606.00151#bib.bib36); Tang et al\.,[2025a](https://arxiv.org/html/2606.00151#bib.bib37)\)\. *Novel PG with continuous retry:* Moreover, prior work controls retries with an integer $K$, whereas we introduce a continuous retry parameter, enabling fine\-grained exploration–exploitation trade\-offs\. This extension was enabled by our novel policy gradient formulation and the closed\-form expression as shown in Sec\.[4](https://arxiv.org/html/2606.00151#S4)\. Thus, although max@K coincides with ReMax under fixed values, our contribution extends retry\-based objectives to uncertainty\-aware, episodic RL with continuous control of retries\.

Recent work has proposed other objectives that move beyond standard expected return\. Hamid et al\. \([2026](https://arxiv.org/html/2606.00151#bib.bib82)\) proposed polychromic objectives that explicitly encourage diverse behaviors by assigning high value to sets of rollouts only when they are both rewarding and diverse\. Tang et al\. \([2025b](https://arxiv.org/html/2606.00151#bib.bib81)\) explored other aggregation functions for evaluation of the sequence of actions, such as max, min, and variance\.
### A\.3Risk\-sensitive RL
Risk\-sensitive reinforcement learning modifies the criterion used to evaluate return distributions, rather than relying solely on their expectation\. Classical risk\-sensitive Markov decision processes study exponential utility or entropic criteria, which lead to risk\-aware Bellman equations\(Howard and Matheson,[1972](https://arxiv.org/html/2606.00151#bib.bib86); Feiet al\.,[2021](https://arxiv.org/html/2606.00151#bib.bib87)\)\. Another prominent criterion is Conditional Value\-at\-Risk \(CVaR\), which focuses on the lower tail of the return distribution and has been used to formulate safer RL objectives\(Rockafellaret al\.,[2000](https://arxiv.org/html/2606.00151#bib.bib89); Yinget al\.,[2022](https://arxiv.org/html/2606.00151#bib.bib88)\)\. These approaches are related to our method in that they also go beyond the standard expected\-return objective and reason about distributions of returns\. However, their primary goal is to encode a risk preference over environmental returns\. In contrast, ReMax evaluates the expected maximum over multiple sampled actions, thereby favoring actions that may yield high values under multiple retries and encouraging exploration\.
Distributional RL provides another related perspective by explicitly modeling the full return distribution in deep RL\(Bellemareet al\.,[2017](https://arxiv.org/html/2606.00151#bib.bib92); Dabneyet al\.,[2018](https://arxiv.org/html/2606.00151#bib.bib90); Bellemareet al\.,[2023](https://arxiv.org/html/2606.00151#bib.bib91)\)\. This line of work represents the inherent randomness of returns, often viewed as*aleatoric*uncertainty, and has led to practical algorithms based on categorical\(Bellemareet al\.,[2017](https://arxiv.org/html/2606.00151#bib.bib92)\)or quantile approximations of return distributions\(Dabneyet al\.,[2018](https://arxiv.org/html/2606.00151#bib.bib90); Bellemareet al\.,[2023](https://arxiv.org/html/2606.00151#bib.bib91)\)\. By contrast, the motivation of ReMax is primarily*epistemic*: it targets uncertainty arising from insufficient exploration and limited knowledge of action values\. Nevertheless, ReMax can in principle be combined with distributional RL\. For example, one could combine a retry\-based objective with an explicit distributional value model, using the learned return distribution to evaluate the expected maximum over sampled action values or to control the degree of risk\-seeking behavior\. Exploring such combinations may be a promising direction\.
## Appendix BProofs
This section contains the proofs of the propositions and theorems in the main text\.
### B\.1Proof of Proposition[3\.2](https://arxiv.org/html/2606.00151#S3.Thmtheorem2)
Proposition[3\.2](https://arxiv.org/html/2606.00151#S3.Thmtheorem2)\. Let $q=(q_{1},\dots,q_{K})$ and write the order statistics $q_{(1)}\geq\cdots\geq q_{(K)}$, breaking ties arbitrarily, with aligned masses $\pi_{(j)}$\. Define $C_{0}:=0$ and $C_{j}:=\sum_{u=1}^{j}\pi_{(u)}$, $j=1,\ldots,K$\. Then

$$J^{M}_{\scriptscriptstyle\mathrm{ReMax}}(\pi,s,q)=q_{(1)}+\sum_{j=1}^{K-1}\big(q_{(j+1)}-q_{(j)}\big)\,\big(1-C_{j}\big)^{M}. \tag{13}$$
###### Proof\.
We compute the expectation by conditioning on the best sampled rank\. Let $(j)$ denote the action with the $j$\-th largest value after sorting the entries of $q$, with ties broken arbitrarily\. For sampled actions $A_{1},\ldots,A_{M}$, define

$$R^{\star}:=\min\{j\in[K]:A_{m}=(j)\text{ for some }m\in[M]\}.$$

Then the maximum sampled value is

$$\max_{m\in[M]}q_{A_{m}}=q_{(R^{\star})}.$$

Therefore,

$$\begin{align}
J^{M}_{\scriptscriptstyle\mathrm{ReMax}}(\pi,s,q)&=\mathbb{E}_{A_{[M]}\sim\pi}\left[\max_{m\in[M]}q_{A_{m}}\right] \tag{14}\\
&=\sum_{j=1}^{K}\mathbb{P}(R^{\star}=j)\,q_{(j)}. \tag{15}
\end{align}$$

It remains to compute $\mathbb{P}(R^{\star}=j)$\. The event $R^{\star}=j$ occurs iff all $M$ draws miss the top\-$(j-1)$ actions, but not all $M$ draws miss the top\-$j$ actions\. Since the total policy mass on the top\-$j$ actions is $C_{j}$, we have

$$\begin{align}
\mathbb{P}(R^{\star}=j)&=\mathbb{P}(\text{all }M\text{ draws miss the top-}(j-1))-\mathbb{P}(\text{all }M\text{ draws miss the top-}j) \tag{16}\\
&=(1-C_{j-1})^{M}-(1-C_{j})^{M}. \tag{17}
\end{align}$$

Let

$$\alpha_{j}:=(1-C_{j})^{M}.$$

Then $\alpha_{0}=1$ and $\alpha_{K}=0$, and hence

$$\begin{align}
J^{M}_{\scriptscriptstyle\mathrm{ReMax}}(\pi,s,q)&=\sum_{j=1}^{K}(\alpha_{j-1}-\alpha_{j})\,q_{(j)} \tag{18}\\
&=\alpha_{0}q_{(1)}+\sum_{j=1}^{K-1}\alpha_{j}\bigl(q_{(j+1)}-q_{(j)}\bigr)-\alpha_{K}q_{(K)} \tag{19}\\
&=q_{(1)}+\sum_{j=1}^{K-1}\bigl(q_{(j+1)}-q_{(j)}\bigr)(1-C_{j})^{M}. \tag{20}
\end{align}$$

This completes the proof\. ∎
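As a sanity check on Eq. (13), the closed form can be compared against a brute-force enumeration of all $K^{M}$ action tuples. The following is a minimal sketch in plain Python; the function names and the example values of `q` and `pi` are illustrative assumptions, not taken from the paper or its code.

```python
import itertools


def remax_closed_form(q, pi, M):
    """Eq. (13): q_(1) + sum_j (q_(j+1) - q_(j)) * (1 - C_j)^M."""
    order = sorted(range(len(q)), key=lambda i: -q[i])  # indices by descending value
    q_sorted = [q[i] for i in order]
    pi_sorted = [pi[i] for i in order]
    total, C = q_sorted[0], 0.0
    for j in range(len(q) - 1):
        C += pi_sorted[j]  # C_j: policy mass on the top-j actions
        total += (q_sorted[j + 1] - q_sorted[j]) * (1.0 - C) ** M
    return total


def remax_brute_force(q, pi, M):
    """E[max_m q_{A_m}] by enumerating every M-tuple of i.i.d. actions."""
    total = 0.0
    for actions in itertools.product(range(len(q)), repeat=M):
        prob = 1.0
        for a in actions:
            prob *= pi[a]
        total += prob * max(q[a] for a in actions)
    return total


q = [0.3, 1.0, -0.5, 0.7]   # illustrative action values (assumption)
pi = [0.1, 0.2, 0.3, 0.4]   # illustrative policy (assumption)
for M in (1, 2, 3, 4):
    assert abs(remax_closed_form(q, pi, M) - remax_brute_force(q, pi, M)) < 1e-9
```

For $M=1$ both functions reduce to the ordinary expected value $\sum_{i}\pi_{i}q_{i}$, which is a quick way to see that the telescoping step is consistent.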
### B\.2Proof of Proposition[4\.1](https://arxiv.org/html/2606.00151#S4.Thmtheorem1)\.
Proposition[4\.1](https://arxiv.org/html/2606.00151#S4.Thmtheorem1)\. Let $W_{M-1}:=\max\{q_{A_{1}},\dots,q_{A_{M-1}}\}$\. Then

$$\nabla_{\theta}J^{M}_{\scriptscriptstyle\mathrm{ReMax}}(\theta,s,q)=M\,\mathbb{E}_{a\sim\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(a|s)\,\mathbb{E}_{A_{[M-1]}}\left[(q_{a}-W_{M-1})_{+}\right]\right]. \tag{21}$$
###### Proof\.
We start by considering the PG with a per\-term baseline:

$$\nabla_{\theta}J_{M}(\pi_{\theta},s,q)=\mathbb{E}_{A_{[M]}}\left[\sum_{m=1}^{M}\nabla_{\theta}\log\pi_{\theta}(A_{m})\,\big(\max_{j\in[M]}q_{A_{j}}-b_{m}\big)\right]. \tag{22}$$

Choose $b_{m}$ as

$$b_{m}:=W_{-m}:=\max\{q_{A_{1}},\dots,q_{A_{m-1}},q_{A_{m+1}},\dots,q_{A_{M}}\}, \tag{23}$$

which preserves unbiasedness since

$$\mathbb{E}_{A_{m}\sim\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(A_{m})\,b_{m}\right]=b_{m}\,\nabla_{\theta}\sum_{i=1}^{K}\pi_{\theta}(i)=0. \tag{24}$$

With this baseline,

$$\max_{j\in[M]}q_{A_{j}}-W_{-m}=\big(q_{A_{m}}-W_{-m}\big)_{+}. \tag{25}$$

Hence

$$\nabla_{\theta}J_{M}(\pi_{\theta},s,q)=\mathbb{E}_{A_{[M]}}\left[\sum_{m=1}^{M}\nabla_{\theta}\log\pi_{\theta}(A_{m})\,(q_{A_{m}}-W_{-m})_{+}\right]. \tag{26}$$
#### Condition on the first $M-1$ samples\.
By symmetry of i\.i\.d\. draws,

$$\nabla_{\theta}J_{M}(\pi_{\theta},s,q)=M\,\mathbb{E}_{A_{[M-1]}}\left[\mathbb{E}_{A_{M}\sim\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(A_{M})\,(q_{A_{M}}-W_{M-1})_{+}\right]\right], \tag{27}$$

where

$$W_{M-1}:=\max\{q_{A_{1}},\dots,q_{A_{M-1}}\}. \tag{28}$$

For fixed $(q,A_{[M-1]})$, $W_{M-1}$ is a constant and $A_{M}\sim\pi_{\theta}$, so

$$\mathbb{E}_{A_{M}\sim\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(A_{M})\,(q_{A_{M}}-W_{M-1})_{+}\right]=\sum_{i=1}^{K}\pi_{\theta}(i)\,\nabla_{\theta}\log\pi_{\theta}(i)\,(q_{i}-W_{M-1})_{+}. \tag{29}$$
#### Separate the action expectation\.
Then, we have

$$\nabla_{\theta}J_{M}(\pi_{\theta},s,q)=M\,\mathbb{E}_{i\sim\pi_{\theta}}\left[\mathbb{E}_{A_{[M-1]}\sim\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(i)\,(q_{i}-W_{M-1})_{+}\right]\right], \tag{30}$$

which completes the proof\. ∎
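The identity in Eq. (21) can be checked numerically for a small softmax policy by comparing it with central finite differences of the objective. The sketch below assumes a softmax parameterization over logits `theta` (so that $\partial\log\pi(a)/\partial\theta_{k}=\mathbb{1}[k=a]-\pi_{k}$); the function names and example values are hypothetical, and all expectations are enumerated exactly rather than sampled.

```python
import itertools
import math


def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]


def remax(theta, q, M):
    """J^M = E[max_m q_{A_m}], by enumerating all M-tuples of actions."""
    pi = softmax(theta)
    return sum(
        math.prod(pi[a] for a in acts) * max(q[a] for a in acts)
        for acts in itertools.product(range(len(q)), repeat=M)
    )


def grad_eq21(theta, q, M):
    """Right-hand side of Eq. (21) for a softmax policy (requires M >= 2)."""
    pi = softmax(theta)
    K = len(q)
    grad = [0.0] * K
    for a in range(K):
        # Inner expectation E_{A_[M-1]}[(q_a - W_{M-1})_+], enumerated exactly.
        ei = sum(
            math.prod(pi[b] for b in acts) * max(q[a] - max(q[b] for b in acts), 0.0)
            for acts in itertools.product(range(K), repeat=M - 1)
        )
        for k in range(K):
            score = (1.0 if k == a else 0.0) - pi[k]  # d log pi(a) / d theta_k
            grad[k] += M * pi[a] * score * ei
    return grad


def grad_finite_diff(theta, q, M, h=1e-6):
    """Central finite differences of remax with respect to each logit."""
    grad = []
    for k in range(len(theta)):
        up = [t + (h if i == k else 0.0) for i, t in enumerate(theta)]
        dn = [t - (h if i == k else 0.0) for i, t in enumerate(theta)]
        grad.append((remax(up, q, M) - remax(dn, q, M)) / (2 * h))
    return grad


theta = [0.2, -0.4, 0.1]   # illustrative logits (assumption)
q = [1.0, 0.3, -0.5]       # illustrative action values (assumption)
for M in (2, 3):
    g_pg, g_fd = grad_eq21(theta, q, M), grad_finite_diff(theta, q, M)
    assert all(abs(x - y) < 1e-6 for x, y in zip(g_pg, g_fd))
```

For two actions with `q = [1, 0]`, uniform logits, and $M=2$, the formula gives a gradient of $2p(1-p)^{2}=0.25$ on the first logit, which matches differentiating $J=1-(1-p)^{2}$ by hand.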
### B\.3Proof of Proposition[4\.3](https://arxiv.org/html/2606.00151#S4.Thmtheorem3)\.
Proposition[4\.3](https://arxiv.org/html/2606.00151#S4.Thmtheorem3)\. Let $q\in\mathbb{R}^{K}$ be Q\-values at a state, $\pi\in\Delta^{K-1}$ a policy, $R\in\mathbb{R}$ a reference, and $M\in\mathbb{N}$\. Define $v_{i}:=(R-q_{i})_{+}$ and sort $q$ as $q_{(1)}\geq\cdots\geq q_{(K)}$ with aligned masses $\pi_{(j)}$\. Define $C_{0}:=0$ and $C_{j}:=\sum_{u=1}^{j}\pi_{(u)}$, $j=1,\ldots,K$\. Then

$$\mathrm{EI}_{M}(R;\pi,q)=v_{(1)}+\sum_{j=1}^{K-1}\big(v_{(j+1)}-v_{(j)}\big)\,(1-C_{j})^{M-1}. \tag{31}$$
###### Proof\.
By definition,

$$\mathrm{EI}_{M}(R;\pi,q)=\mathbb{E}_{A_{[M-1]}\sim\pi}\left[\left(R-\max_{m\in[M-1]}q_{A_{m}}\right)_{+}\right].$$

Since $v_{i}=(R-q_{i})_{+}$ and $q_{(1)}\geq\cdots\geq q_{(K)}$, we have $v_{(1)}\leq\cdots\leq v_{(K)}$ and

$$\left(R-\max_{m\in[M-1]}q_{A_{m}}\right)_{+}=\min_{m\in[M-1]}v_{A_{m}}.$$

Using the same best\-rank decomposition as in the proof of Proposition[3\.2](https://arxiv.org/html/2606.00151#S3.Thmtheorem2), with $M-1$ draws, the probability that the best sampled rank is $j$ is

$$(1-C_{j-1})^{M-1}-(1-C_{j})^{M-1}.$$

Therefore,

$$\mathrm{EI}_{M}(R;\pi,q)=\sum_{j=1}^{K}\left\{(1-C_{j-1})^{M-1}-(1-C_{j})^{M-1}\right\}v_{(j)}.$$

Applying the same telescoping argument as in Proposition[3\.2](https://arxiv.org/html/2606.00151#S3.Thmtheorem2) gives

$$\mathrm{EI}_{M}(R;\pi,q)=v_{(1)}+\sum_{j=1}^{K-1}\bigl(v_{(j+1)}-v_{(j)}\bigr)(1-C_{j})^{M-1}.$$

This completes the proof\. ∎
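The exponent $M-1$ in Eq. (31) is easy to get wrong in an implementation, so it may help to check the closed form against direct enumeration over the $M-1$ retry draws. This is a plain-Python sketch with hypothetical function names and made-up inputs; it assumes $M\geq 2$ so that the maximum over retries is well defined.

```python
import itertools
import math


def ei_closed_form(R, q, pi, M):
    """Eq. (31): v_(1) + sum_j (v_(j+1) - v_(j)) * (1 - C_j)^(M-1)."""
    order = sorted(range(len(q)), key=lambda i: -q[i])   # descending q
    v = [max(R - q[i], 0.0) for i in order]              # nondecreasing in rank
    total, C = v[0], 0.0
    for j in range(len(q) - 1):
        C += pi[order[j]]                                # mass on the top-j actions
        total += (v[j + 1] - v[j]) * (1.0 - C) ** (M - 1)
    return total


def ei_brute_force(R, q, pi, M):
    """E[(R - max of M-1 sampled q-values)_+] by exact enumeration."""
    total = 0.0
    for acts in itertools.product(range(len(q)), repeat=M - 1):
        w = max(q[a] for a in acts)
        total += math.prod(pi[a] for a in acts) * max(R - w, 0.0)
    return total


q = [0.3, 1.0, -0.5, 0.7]   # illustrative action values (assumption)
pi = [0.1, 0.2, 0.3, 0.4]   # illustrative policy (assumption)
for M in (2, 3, 4):
    for R in (-1.0, 0.5, 0.9, 2.0):
        assert abs(ei_closed_form(R, q, pi, M) - ei_brute_force(R, q, pi, M)) < 1e-9
```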
## Appendix CDetails of the Bandit Experiments
This appendix provides details of the bandit experiments\.
### C\.1Binary Bandit
We can compute the ReMax objective for the binary bandit analytically as follows:

$$J^{M}_{\scriptscriptstyle\mathrm{ReMax}}(p):=0.75\cdot\big(1-(1-p)^{M}\big)+0.25\cdot\big(1-p^{M}\big), \tag{32}$$

where $p$ is the probability of pulling arm 1\.
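Taking Eq. (32) as given, the maximizer can be worked out directly. Setting the derivative $0.75\,M(1-p)^{M-1}-0.25\,M\,p^{M-1}$ to zero for $M\geq 2$ gives $p^{\star}=1/\big(1+3^{-1/(M-1)}\big)$, and the objective is concave in $p$ for $M\geq 2$, so this stationary point is the maximum. The sketch below checks this against a grid search; the function names and the grid resolution are illustrative choices, not part of the paper.

```python
def remax_binary(p, M):
    """Eq. (32): ReMax objective of the binary bandit, p = P(pull arm 1)."""
    return 0.75 * (1.0 - (1.0 - p) ** M) + 0.25 * (1.0 - p ** M)


def argmax_on_grid(M, n=10001):
    """Dense grid search over p in [0, 1] (n is an arbitrary resolution)."""
    grid = [i / (n - 1) for i in range(n)]
    return max(grid, key=lambda p: remax_binary(p, M))


def stationary_point(M):
    """Root of dJ/dp = 0 for M >= 2, i.e. (p / (1 - p))^(M-1) = 3."""
    return 1.0 / (1.0 + 3.0 ** (-1.0 / (M - 1)))


# M = 1: the objective is linear in p, so the greedy policy p = 1 is optimal.
assert argmax_on_grid(1) == 1.0
# M = 2: the maximizer is p = 0.75, a stochastic policy.
assert abs(argmax_on_grid(2) - 0.75) < 1e-3
for M in (2, 3, 5, 10):
    assert abs(argmax_on_grid(M) - stationary_point(M)) < 1e-3
```

Under this formula the optimal policy moves from greedy at $M=1$ toward uniform as $M$ grows, since $3^{-1/(M-1)}\to 1$.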
### C\.2Bernoulli bandit
We describe \(i\) the Bernoulli\-bandit setup, \(ii\) the experimental design for Fig\.[1](https://arxiv.org/html/2606.00151#S1.F1)\(Center\), and \(iii\) the computations for each method\.
#### \(i\) Bernoulli\-bandit setup\.
We consider two arms with rewards $R_{i}=\alpha_{i}X_{i}$, where $X_{i}\sim\mathrm{Bernoulli}(p_{i})$ and $\mu_{i}=\mathbb{E}[R_{i}]=\alpha_{i}p_{i}$\. For a fixed realization $r=(r_{0},r_{1})$ and a policy $\pi\in[0,1]$ denoting $\Pr(a=0)=\pi$ \(thus $\Pr(a=1)=1-\pi$\), the ReMax objective with $M=2$ i\.i\.d\. draws is

$$J^{2}(\pi\mid r)=\pi^{2}r_{0}+2\pi(1-\pi)\max\{r_{0},r_{1}\}+(1-\pi)^{2}r_{1}.$$

Taking expectation over Bernoulli outcomes $e_{1}=(\alpha_{0},0)$, $e_{2}=(0,\alpha_{1})$, $e_{3}=(\alpha_{0},\alpha_{1})$, $e_{4}=(0,0)$ with probabilities $p_{e_{1}}=p_{0}(1-p_{1})$, $p_{e_{2}}=(1-p_{0})p_{1}$, $p_{e_{3}}=p_{0}p_{1}$, $p_{e_{4}}=(1-p_{0})(1-p_{1})$ gives

$$\mathbb{E}\left[J^{2}(\pi\mid R)\right]=\sum_{k=1}^{4}p_{e_{k}}\,J^{2}(\pi\mid e_{k}).$$
#### \(ii\) Experimental design\.
We fix $p_{0}=1$ and $\alpha_{0}=2$, and sweep the arm\-1 scale $\alpha_{1}\in[1,10]$ while adjusting $p_{1}$ to keep the mean constant: $\alpha_{1}p_{1}=1$\. For each $\alpha_{1}$, we evaluate $\pi^{\star}(a=1)=1-\pi^{\star}$ under $M=2$ and plot it together with the softmax baseline in Fig\.[1](https://arxiv.org/html/2606.00151#S1.F1) \(Center\)\.
#### \(iii\) Method computations\.
*ReMax \($M=2$\)\.* Compute $\mathbb{E}\left[J^{2}(\pi\mid R)\right]$ via the above event decomposition and maximize over $\pi\in[0,1]$ \(we use a dense numerical search; a closed form exists because the objective is quadratic in $\pi$\)\.

*Softmax \(entropy\-regularized\)\.* For temperature $\beta>0$ \(we use $\beta=1$\),

$$\pi_{\mathrm{soft}}(a=1)=\frac{\exp(\mu_{1}/\beta)}{\exp(\mu_{0}/\beta)+\exp(\mu_{1}/\beta)},\quad\mu_{0}=\alpha_{0}p_{0},\ \mu_{1}=\alpha_{1}p_{1}\ (=1\ \text{in our sweep}).$$
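The two computations above can be reproduced in a few lines. The sketch below follows the four-outcome decomposition and the dense numerical search described in the text; the function names, the grid size, and the two evaluated values of $\alpha_{1}$ are illustrative assumptions rather than the authors' script.

```python
import math


def expected_remax2(pi0, alpha0, p0, alpha1, p1):
    """E[J^2(pi | R)] via the four Bernoulli outcomes; pi0 = Pr(a = 0)."""
    def j2(r0, r1):
        return (pi0 ** 2) * r0 + 2 * pi0 * (1 - pi0) * max(r0, r1) + ((1 - pi0) ** 2) * r1

    outcomes = [
        (p0 * (1 - p1), (alpha0, 0.0)),        # e1: only arm 0 pays
        ((1 - p0) * p1, (0.0, alpha1)),        # e2: only arm 1 pays
        (p0 * p1, (alpha0, alpha1)),           # e3: both pay
        ((1 - p0) * (1 - p1), (0.0, 0.0)),     # e4: neither pays
    ]
    return sum(prob * j2(r0, r1) for prob, (r0, r1) in outcomes)


def optimal_prob_arm1(alpha1, alpha0=2.0, p0=1.0, n=10001):
    """Dense search for pi* = Pr(a = 0); returns pi*(a = 1) = 1 - pi*."""
    p1 = 1.0 / alpha1                          # keeps the mean alpha1 * p1 = 1
    grid = [i / (n - 1) for i in range(n)]
    best = max(grid, key=lambda x: expected_remax2(x, alpha0, p0, alpha1, p1))
    return 1.0 - best


def softmax_prob_arm1(mu0, mu1, beta=1.0):
    return math.exp(mu1 / beta) / (math.exp(mu0 / beta) + math.exp(mu1 / beta))


# With alpha1 = 1, arm 1 never beats arm 0, so ReMax puts no mass on it.
assert optimal_prob_arm1(1.0) < 1e-3
# With alpha1 = 10, the expected objective is -2.6 pi^2 + 3.6 pi + 1, so pi*(a=1) = 4/13.
assert abs(optimal_prob_arm1(10.0) - 4.0 / 13.0) < 1e-3
# The softmax baseline depends only on the means (2 and 1), not on alpha1.
assert abs(softmax_prob_arm1(2.0, 1.0) - 1.0 / (1.0 + math.e)) < 1e-12
```

The contrast is the point of the sweep: the ReMax policy reacts to the spread of arm 1's reward even though its mean is fixed, while the softmax probability stays constant.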
### C\.3Bandit with Posterior
Algorithm 2: ReMax with Posterior

1. Initialize prior $\Pi_{0,i}$ for each arm\.
2. for $t=1,2,\dots,T$ do
3. Compute $\pi_{t}=\operatorname*{argmax}_{\pi}\mathbb{E}_{\mu_{t}\sim\Pi_{t}}\left[J^{M}_{\scriptscriptstyle\mathrm{ReMax}}(\pi;\mu_{t})\right]$ by Alg\.[3](https://arxiv.org/html/2606.00151#alg3)\.
4. Play $a_{t}\sim\pi_{t}$, observe $r_{t}\in\mathbb{R}$\.
5. Update posterior of arm $a_{t}$, $\Pi_{t+1,a_{t}}$, by the posterior update rule\.
6. end for

Algorithm 3: ReMax optimization

Input: Posterior $\Pi_{t}$, batch size $B$, draws $M$, epochs $S$, and policy $\theta_{t-1}$

1. Initialize policy $\pi_{\theta}$ with $\theta_{t-1}$
2. for $s=1,2,\dots,S$ do
3. Sample $\mu^{b,t}\sim\Pi_{t}$ for $b=1,2,\dots,B$\.
4. Compute $J^{M}_{\scriptscriptstyle\mathrm{ReMax}}(\pi_{\theta};\mu^{b,t})$ for $b=1,2,\dots,B$\.
5. Update $\pi_{\theta}$ by gradient ascent\.
6. end for
7. return $\pi_{t}=\pi_{\theta}$
To empirically confirm sublinear regret, we conduct a posterior\-driven bandit experiment in two settings: Beta–Bernoulli and Gaussian–Gaussian\. We consider a ground\-truth prior $\Pi^{*}$ over the arm means $\{\mu_{i}\}_{i=1}^{K}$\. Initially, each $\mu_{i}\sim\Pi^{*}$ and the learner’s prior is $\Pi_{0}=\Pi^{*}$\. At each round $t$, we select $a_{t}$, observe $r_{t}$ drawn with mean $\mu_{a_{t}}$, and update the posterior $\Pi_{t+1,a_{t}}$\.
#### Beta–Bernoulli\.
The prior $\Pi^{*}$ is $\mathrm{Beta}(\alpha_{0},\beta_{0})$ with $\alpha_{0}=\beta_{0}=1$ for all arms\. Rewards are Bernoulli with mean $\mu_{i}$\. The posterior update is

$$\Pi_{t+1,i}=\mathrm{Beta}(\alpha_{t}+r_{t},\,\beta_{t}+1-r_{t}).$$
#### Gaussian–Gaussian\.
The prior $\Pi^{*}$ is $\mathcal{N}(\mu_{0},\sigma_{0}^{2})$ with $\mu_{0}=0$ and $\sigma_{0}^{2}=1$ for all arms\. Rewards are $\mathcal{N}(\mu_{i},\sigma_{R}^{2})$ with $\sigma_{R}^{2}=1$\. The posterior update is

$$\Pi_{t+1,i}=\mathcal{N}(\mu_{i,t+1},\sigma_{i,t+1}^{2}),\qquad\sigma_{i,t+1}^{2}=\Big(\tfrac{1}{\sigma_{i,t}^{2}}+\tfrac{1}{\sigma_{R}^{2}}\Big)^{-1},\quad\mu_{i,t+1}=\sigma_{i,t+1}^{2}\left(\tfrac{\mu_{i,t}}{\sigma_{i,t}^{2}}+\tfrac{r_{t}}{\sigma_{R}^{2}}\right).$$
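Both conjugate updates are short enough to write out. The following sketch transcribes the two update rules above for a single pulled arm; the function names and the tuple representation of a posterior are assumptions made for illustration.

```python
def beta_bernoulli_update(alpha, beta, r):
    """Beta(alpha, beta) posterior after observing a Bernoulli reward r in {0, 1}."""
    return alpha + r, beta + 1 - r


def gaussian_update(mu, var, r, var_r=1.0):
    """N(mu, var) posterior on the arm mean after a reward r ~ N(mean, var_r)."""
    new_var = 1.0 / (1.0 / var + 1.0 / var_r)       # precisions add
    new_mu = new_var * (mu / var + r / var_r)       # precision-weighted mean
    return new_mu, new_var


# Starting from the priors used in the text: Beta(1, 1) and N(0, 1).
assert beta_bernoulli_update(1, 1, 1) == (2, 1)
assert beta_bernoulli_update(1, 1, 0) == (1, 2)
mu, var = gaussian_update(0.0, 1.0, 2.0)
assert abs(mu - 1.0) < 1e-12 and abs(var - 0.5) < 1e-12
```

Only the posterior of the pulled arm $a_{t}$ changes in a round; the other arms keep their previous parameters.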
#### ReMax optimization\.
Arm selection by ReMax is shown in Alg\.[2](https://arxiv.org/html/2606.00151#alg2)\. At each round $t$, we optimize the ReMax objective via Alg\.[3](https://arxiv.org/html/2606.00151#alg3), computing the objective exactly with Prop\.[3\.2](https://arxiv.org/html/2606.00151#S3.Thmtheorem2), with batch size $B=16$ and $S=50$ epochs\. For ease of optimization, at each round $t$ we initialize the policy parameters $\theta_{t}$ with those from the previous round $t-1$\. To avoid optimizer state carrying over across rounds, we reinitialize the optimizer at each round\.
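One way to picture the inner optimization loop is the sketch below, written for the Beta–Bernoulli case. It is not the authors' implementation: instead of differentiating the closed-form objective automatically, it uses the analytic gradient of Prop. 4.1 with the closed-form EI of Prop. 4.3 for a softmax policy, and it applies plain gradient ascent with a hypothetical step size `lr`. All function names, the softmax parameterization, and the default values other than $B=16$ and $S=50$ are assumptions.

```python
import math
import random


def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]


def expected_improvement(R, q, pi, M):
    """Closed form of Prop. 4.3: E[(R - max of M-1 sampled q-values)_+]."""
    order = sorted(range(len(q)), key=lambda i: -q[i])
    v = [max(R - q[i], 0.0) for i in order]
    total, C = v[0], 0.0
    for j in range(len(q) - 1):
        C += pi[order[j]]
        total += (v[j + 1] - v[j]) * max(1.0 - C, 0.0) ** (M - 1)
    return total


def remax_grad(theta, q, M):
    """Gradient of J^M w.r.t. softmax logits, from Prop. 4.1."""
    pi = softmax(theta)
    adv = [expected_improvement(q[a], q, pi, M) for a in range(len(q))]
    mean_adv = sum(p * a for p, a in zip(pi, adv))
    # M * sum_a pi_a (e_a - pi) adv_a simplifies to M * pi_k * (adv_k - mean_adv).
    return [M * p * (a - mean_adv) for p, a in zip(pi, adv)]


def optimize_policy(posteriors, theta, M=2, batch=16, epochs=50, lr=0.5, seed=0):
    """Sketch of Alg. 3; posteriors is a list of Beta (alpha, beta) pairs."""
    rng = random.Random(seed)
    theta = list(theta)                      # warm start from the previous round
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for _ in range(batch):
            mu = [rng.betavariate(a, b) for a, b in posteriors]   # mu^{b,t} ~ Pi_t
            for k, g in enumerate(remax_grad(theta, mu, M)):
                grad[k] += g / batch
        theta = [t + lr * g for t, g in zip(theta, grad)]         # gradient ascent
    return theta


# With near-certain posteriors (arm 0 good, arm 1 bad), the policy should favor arm 0.
theta = optimize_policy([(1000, 1), (1, 1000)], [0.0, 0.0], M=2)
assert softmax(theta)[0] > 0.8
```

Averaging the gradient over posterior samples is what makes the resulting policy sensitive to uncertainty: arms whose sampled means occasionally come out on top keep receiving positive EI and therefore retain some probability mass.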
#### Baselines\.
We compare against Thompson sampling \(Thompson,[1933](https://arxiv.org/html/2606.00151#bib.bib58); Honda and Takemura,[2014](https://arxiv.org/html/2606.00151#bib.bib57); Agrawal and Goyal,[2017](https://arxiv.org/html/2606.00151#bib.bib56)\) and UCB \(Auer et al\.,[2002](https://arxiv.org/html/2606.00151#bib.bib59)\), both with sublinear\-regret guarantees\. Thompson sampling: sample $\mu_{a}$ from $\Pi_{t,a}$ and select $a=\arg\max_{a}\mu_{a}$\. UCB: after initializing by pulling all arms, select the arm with the highest empirical mean plus a bonus $c\sqrt{\log(t)/(2N_{a})}$, where $N_{a}$ is the number of pulls; we use $c=1.0$ for both settings\. To compare with entropy\-regularized exploration, we also include a Softmax baseline, which takes the softmax of the posterior means of the arms and samples an arm from the resulting distribution\. We tuned the temperature over $\{0.01, 0.1, 1.0\}$ and used $0.1$ for the experiments\.
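For reference, the three baseline selection rules can be sketched as follows for the Beta–Bernoulli setting. This is an illustrative reading of the descriptions above, not the code used for the experiments; the function names and argument layouts are hypothetical.

```python
import math
import random


def thompson_select(posteriors, rng):
    """Sample a mean from each Beta(alpha, beta) posterior and pick the argmax."""
    samples = [rng.betavariate(a, b) for a, b in posteriors]
    return max(range(len(samples)), key=samples.__getitem__)


def ucb_select(means, counts, t, c=1.0):
    """Empirical mean plus c * sqrt(log(t) / (2 N_a)); assumes every N_a >= 1."""
    scores = [m + c * math.sqrt(math.log(t) / (2 * n)) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def softmax_select(posterior_means, temperature, rng):
    """Sample an arm from the softmax of the posterior means."""
    m = max(posterior_means)
    weights = [math.exp((x - m) / temperature) for x in posterior_means]
    return rng.choices(range(len(weights)), weights=weights)[0]


rng = random.Random(0)
# A rarely pulled arm receives a much larger UCB bonus than a frequently pulled one.
assert ucb_select([0.5, 0.5], [100, 1], t=101) == 1
# With equal counts, UCB falls back to the empirical means.
assert ucb_select([0.9, 0.1], [50, 50], t=100) == 0
# A very low temperature makes the softmax baseline effectively greedy.
assert softmax_select([0.0, 100.0], 0.1, rng) == 1
```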
## Appendix DRePPO
Below is the code for the advantage computation in RePPO\.
Listing 1: Advantage computation for RePPO\.

```python
import jax
import jax.numpy as jnp


def expected_improvement_min(
    R: jnp.ndarray,
    q: jnp.ndarray,
    pi: jnp.ndarray,
    M: float,
):
    """EI_M(R; pi) = E[min_{1..M} (R - q_A)_+], A ~ pi i.i.d.

    Shapes: R is (B, N_ref); q and pi are (B, K).
    Returns: (B, N_ref)
    """
    # Sort actions by descending Q-value and align the policy masses.
    idx = jnp.argsort(-q, axis=-1)
    q_sorted = jnp.take_along_axis(q, idx, axis=-1)
    pi_sorted = jnp.take_along_axis(pi, idx, axis=-1)

    # C[..., j] is the cumulative policy mass on the top-(j + 1) actions.
    C = jnp.cumsum(pi_sorted, axis=-1)

    # v[b, n, j] = (R[b, n] - q_sorted[b, j])_+, nondecreasing in j.
    v = jnp.maximum(R[..., None] - q_sorted[:, None, :], 0.0)
    v_first = v[..., 0]
    dv = v[..., 1:] - v[..., :-1]
    eps = 1e-8
    # Clip the base to [eps, 1] before raising it to the (possibly non-integer) power M.
    w = jnp.power(jnp.clip(1.0 - C[..., :-1], eps, 1.0), M)
    EI = v_first + jnp.sum(dv * w[:, None, :], axis=-1)
    return EI


def reppo_advantage(
    R: jnp.ndarray,
    q: jnp.ndarray,
    pi: jnp.ndarray,
    action: jnp.ndarray,
    M: float,
):
    """Compute RePPO advantage with Q-replacement by return.

    Returns: (B,)
    """
    if R.ndim == 1:
        R = R[:, None]

    # Q-replacement: overwrite the critic value at the taken action with the return.
    q_ref = q.at[jnp.arange(q.shape[0]), action].set(R[:, 0])

    # EI of the return and of every action value, each against M - 1 retries.
    R_plus = expected_improvement_min(R, q_ref, pi, M - 1)[..., 0]
    q_plus = expected_improvement_min(q, q_ref, pi, M - 1)
    baseline = jnp.sum(pi * jax.lax.stop_gradient(q_plus), axis=-1)
    return R_plus - baseline
```
Figure 6: Per\-game learning curves for \(a\) Breakout, \(b\) Asterix, \(c\) Freeway, and \(d\) Space Invaders\. Mean $\pm$ s\.e\. over 10 seeds\.
## Appendix EMinAtar Experiment
In this appendix, we provide additional details on the MinAtar experiment\.
### E\.1Experimental setup
#### Network architecture\.
We use the same network architecture as the public implementations of PPO and PQN \(official implementation\)\. *PPO \(Actor–Critic\)\.* A shared CNN \(Conv $2\times 2$ \+ ReLU \+ avg\-pool\) and MLP produce a latent state, which branches into \(i\) an actor head with two hidden layers \(ReLU/Tanh\) outputting action logits, and \(ii\) a critic head with two hidden layers outputting per\-action Q\-values \(i\.e\., a Q\-critic instead of a scalar $V$\)\. *PQN \(Q\-Network\)\.* Inputs are scaled \(optional BatchNorm\), then passed to a CNN feature extractor \(Conv $3\times 3$ with LayerNorm/BatchNorm \+ ReLU\) and an MLP; a final linear layer outputs Q\-values for all actions\. This is a single\-head, value\-based model tailored to pure Q\-learning\.
#### Code references\.
We list the code references used in our experiments\.
- •
- •
- •
#### Hyperparameters\.
For PQN with 1024 parallel environments, we tuned the learning rate \($0.0005$, $0.001$\) and GAE $\lambda$ \($0.65$, $0.8$, $0.95$\) for each environment\. For RND, we tuned the learning rate \($0.001$, $0.0003$, $0.0001$\) of the RND network and bonus coefficient \($0.5$, $1.0$, $1.5$\) for each environment\. We report the hyperparameters for RePPO, PPO\-V, PPO\-Q, and PQN \(both original and tuned configurations\) in Tables[2](https://arxiv.org/html/2606.00151#A10.T2),[3](https://arxiv.org/html/2606.00151#A10.T3), and[4](https://arxiv.org/html/2606.00151#A10.T4) \(App\.[J](https://arxiv.org/html/2606.00151#A10)\), respectively\.
#### Training and evaluation\.
Agents are trained for $10$M environment steps with $10$ random seeds\. During evaluation, we average $100$ test episodes per seed\. For PPO variants, evaluation uses the argmax of the policy’s logits, whereas training samples actions from the policy\. We report *normalized scores* aggregated across games using median, interquartile mean \(IQM\), and mean, following the RLiable framework \(Agarwal et al\.,[2021](https://arxiv.org/html/2606.00151#bib.bib55)\)\. Scores are normalized by the maximum score achieved across all methods\. Per\-game scores are reported in App\.[E\.2](https://arxiv.org/html/2606.00151#A5.SS2)\. The maximum scores used for normalization are: Breakout 251\.15, Asterix 64\.95, Freeway 67\.05, Space Invaders 880\.91\.
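To make the normalization and the IQM concrete, here is a small sketch. It assumes that normalization is a plain division by the per-game maximum listed above (no minimum is subtracted) and that the IQM is the mean of the values left after trimming the lowest and highest 25%; the exact trimming convention of the RLiable library may differ in edge cases, and the function names are hypothetical.

```python
MAX_SCORES = {
    "Breakout": 251.15,
    "Asterix": 64.95,
    "Freeway": 67.05,
    "Space Invaders": 880.91,
}


def normalized_score(game, raw_score):
    """Raw score divided by the maximum score achieved across all methods."""
    return raw_score / MAX_SCORES[game]


def interquartile_mean(values):
    """Mean of the middle 50% of the values (25% trimmed from each tail)."""
    xs = sorted(values)
    k = int(0.25 * len(xs))          # number of entries dropped from each tail
    middle = xs[k:len(xs) - k]
    return sum(middle) / len(middle)


assert normalized_score("Breakout", 251.15) == 1.0
# For 1..8 the two lowest and two highest values are trimmed, leaving 3, 4, 5, 6.
assert interquartile_mean([1, 2, 3, 4, 5, 6, 7, 8]) == 4.5
# A single outlier does not move the IQM, unlike the mean.
assert interquartile_mean([0.0, 0.0, 0.0, 100.0]) == 0.0
```

The last assertion illustrates why the IQM and the mean can rank methods differently when one game or seed has an unusually large score.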
### E\.2Additional results
We present supplementary analyses to complement the main results\. Unless stated otherwise, curves show the mean and standard error over 10 seeds\.
#### Per\-game results\.
Fig\.[6](https://arxiv.org/html/2606.00151#A4.F6)reports learning curves for each MinAtar game\. RePPO clearly outperforms PPO on*Breakout*and*Asterix*, and converges faster than the baselines on*Freeway*, consistent with the main findings\. On*Asterix*, PQN outperforms all other methods\.
Figure 7: Aggregate metrics \(Median, IQM, Mean\) across all games\.

Figure 8: Policy entropy for all environments over $m\in\{0.9,1.0,1.2,1.4,1.6,2,3\}$, for \(a\) Breakout, \(b\) Asterix, \(c\) Freeway, and \(d\) Space Invaders\.
#### Entropy across environments\.
Fig\.[8](https://arxiv.org/html/2606.00151#A5.F8) shows policy entropy for all environments and $m\in\{0.9,1.0,1.2,1.4,1.6,2,3\}$\. As expected, entropy aligns with the retry parameter $m$: small $m$ leads to rapid entropy decay, while larger $m$ sustains higher entropy\. Entropy can also be tuned more finely by varying $m$ between these values\.
#### Comparison to the original PQN\.
Figure 9: Entropy with and without Q\-replacement on *Breakout*\.

In the main text, we compared RePPO to PQN tuned for our setup\. Fig\.[7](https://arxiv.org/html/2606.00151#A5.F7) adds results for the original PQN\. Overall, the original PQN is comparable to the version we tuned for our setup\. PQN is comparable to RePPO on Median and IQM but underperforms on Mean, the same trend we observed in the main text\. This suggests that the difference in setup has little effect on PQN’s performance, which supports the validity of our analysis\.
#### Details on the analysis of Q\-replacement\.
Recall that Q\-replacement substitutes the critic estimate $Q(s,a)$ at the sampled action with the empirical return $R$ before computing the EI advantage\. We hypothesize that this substitution mitigates distortions of the EI advantage caused by inaccurate Q\-values, which may otherwise reduce exploration\. To test this, we compare entropy with and without Q\-replacement on *Breakout* under the same retry parameters $m\in\{1.2,1.4\}$ \(Fig\.[9](https://arxiv.org/html/2606.00151#A5.F9)\)\. Across both values of $m$, the no\-replacement variant exhibits consistently lower entropy throughout training\. This suggests that Q\-replacement helps preserve the exploratory pressure induced by the retry parameter and contributes to RePPO’s overall performance\.
#### The standard deviation of the EI\-based advantage\.
Fig\.[10](https://arxiv.org/html/2606.00151#A5.F10)reports the standard deviation of the EI\-based advantage for*Breakout*during training, computed over the minibatch and averaged over 5 random seeds, with error bars denoting the standard error across seeds\. This statistic measures the variability of the scalar advantage signal used in the actor update, and can therefore serve as a diagnostic of the stability of the advantage estimates\.
Figure 10: Standard deviation of the EI\-based advantage on *Breakout*\.

We observe that retry parameters that performed well in our experiments, such as $m=1.2$ and $m=1.4$, tend to yield smaller advantage standard deviations than a less exploratory setting such as $m=0.9$\. This suggests that the same range of $m$ that promotes exploration can also produce a more stable EI\-based advantage signal on *Breakout*\. We emphasize, however, that this observation does not by itself isolate the cause of the variance reduction; it may reflect the combined effects of the EI transformation, the induced policy entropy, critic accuracy, and Q\-replacement\.
## Appendix FAtari Experiment
We use the 10 Atari environments identified as dense\-reward, hard\-exploration problems byBellemareet al\.\([2016](https://arxiv.org/html/2606.00151#bib.bib10)\), namely: Alien, Amidar, BattleZone, Frostbite, Hero, Ms\. Pacman, Q\*bert, Surround, Wizard of Wor, and Zaxxon\. Our implementation is based on the CleanRL codebase\(Huanget al\.,[2022](https://arxiv.org/html/2606.00151#bib.bib76)\)and employs EnvPool\(Wenget al\.,[2022](https://arxiv.org/html/2606.00151#bib.bib77)\)for parallel environment execution\.
### F\.1Experimental setup
We follow the hyperparameters listed in Tables[5](https://arxiv.org/html/2606.00151#A10.T5) and[6](https://arxiv.org/html/2606.00151#A10.T6) \(App\.[J](https://arxiv.org/html/2606.00151#A10)\), which are adapted from CleanRL\. As baselines, we use PPO\-V and PPO\-Q, each evaluated with and without entropy regularization\. RePPO is trained with retry parameters $m\in\{0.8,0.9,1.0,1.2,1.4\}$, and we report results for all configurations\. Training is conducted for $1\times 10^{7}$ environment steps\. Evaluation is carried out in parallel with training using eight environments, and we report normalized scores at the final training step\. All results are averaged over five random seeds and aggregated across the 10 environments using the RLiable framework \(Agarwal et al\.,[2021](https://arxiv.org/html/2606.00151#bib.bib55)\) to compute the median, interquartile mean \(IQM\), and mean performance, shown in Fig\.[11](https://arxiv.org/html/2606.00151#A6.F11)\. We also plot the raw return curves for all games in Fig\.[12](https://arxiv.org/html/2606.00151#A6.F12)\.
### F\.2Results
Figure 11: Normalized scores aggregated with median, IQM, and mean across 10 games; boxes denote RLiable summaries over 5 seeds\. For RePPO, performance peaks around $m=0.9$ to $1.0$, and PPO without entropy regularization performs better than PPO with it\. This indicates that these environments require relatively little exploration\.

Fig\.[11](https://arxiv.org/html/2606.00151#A6.F11) shows the normalized scores aggregated with median, IQM, and mean across 10 games\. Across all environments, methods with weaker exploration, such as PPO\-V, PPO\-Q, and RePPO with $m=0.9$ or $1.0$, achieve the highest performance\. In contrast, RePPO with larger retry parameters \($m=1.2$ and $1.4$\) and entropy\-regularized PPO variants exhibit lower performance\. This results in a performance peak around $m=0.9$ to $1.0$\. These findings suggest that, in practice, Bellemare’s suite of hard\-exploration environments may require relatively little exploration\. Fig\.[13](https://arxiv.org/html/2606.00151#A6.F13) shows the evolution of policy entropy during training on all games\. For $m=1.2$ and $1.4$, the policy maintains higher entropy, indicating that RePPO indeed encourages more exploratory behavior even in complex, pixel\-based environments such as Atari\. Taken together, these results demonstrate that RePPO promotes exploration in Atari, but also remains flexible: when little exploration is needed, choosing a smaller retry parameter \(e\.g\., $m=0.9$ or $1.0$\) yields strong performance\. In contrast, larger values encourage exploration when desired\.
Figure 12: Return curves for all games. Mean ± s.e. over 5 seeds. Overall, PPO without entropy regularization and RePPO with $m = 0.9$ and $1.0$ perform better than PPO with entropy regularization and RePPO with $m = 1.2$ and $1.4$. This indicates that these environments indeed require less exploration.

Figure 13: Policy entropy during training on all games. Mean ± s.e. over 5 seeds. RePPO with $m = 1.2$ and $1.4$ maintains higher entropy, while RePPO with $m = 0.9$ and $1.0$ exhibits faster entropy decay, demonstrating RePPO's ability to control the trade-off between exploration and exploitation.
## Appendix G Craftax Experiment
#### Hyperparameters\.
For PPO-V and PPO-V + RND (RND), we used the same hyperparameters as in the original implementation ([https://github.com/MichaelTMatthews/Craftax_Baselines](https://github.com/MichaelTMatthews/Craftax_Baselines)). For RePPO, we adopted the same hyperparameters and modified only the RePPO-specific retry parameter $m$. The full set of hyperparameters is listed in Table [7](https://arxiv.org/html/2606.00151#A10.T7) (App. [J](https://arxiv.org/html/2606.00151#A10)).
### G.1 Results
Figure 14: Entropy (Craftax).

Fig. [14](https://arxiv.org/html/2606.00151#A7.F14) reports the policy entropy during training over 1B environment steps for RePPO ($m \in \{1.2, 1.4\}$, without entropy bonus) and the baseline PPO variants with and without entropy regularization, as well as PPO-V combined with an RND (Burda et al., [2019](https://arxiv.org/html/2606.00151#bib.bib21)) intrinsic bonus. Three regimes are clearly visible. First, both RePPO configurations maintain the highest entropy throughout training, with $m = 1.4$ sustaining a noticeably higher level than $m = 1.2$, mirroring the behavior we observed on MinAtar and Atari. Second, PPO-V and PPO-Q without an entropy bonus collapse to low-entropy policies within the first $\sim$10% of training, indicating premature exploitation. Third, the entropy-regularized baselines (PPO-V/Q + Ent) and PPO-V + RND settle at an intermediate entropy level controlled by the bonus coefficient.

Crucially, RePPO achieves this elevated entropy purely through its retry mechanism, without any explicit exploration bonus. Combined with [Table 1](https://arxiv.org/html/2606.00151#S5.T1) in Sec. [5](https://arxiv.org/html/2606.00151#S5), where RePPO ($m = 1.2$) matches the performance of entropy-regularized PPO and PPO + RND and clearly outperforms PPO variants without an entropy bonus, this confirms that RePPO effectively promotes exploration even on large-scale, open-ended environments such as Craftax, without requiring entropy regularization or an additional intrinsic-reward model.
## Appendix H Speed Benchmark
Figure 15: Average training time of RePPO and baseline methods. (a) MinAtar (Breakout). (b) Craftax.

#### MinAtar\.
We compare the training speed of RePPO, PPO-V, and PPO-Q in MinAtar. Fig. [15(a)](https://arxiv.org/html/2606.00151#A8.F15.sf1) reports the average wall-clock time on *Breakout* for 10M timesteps (no evaluation), averaged over 5 seeds. Hyperparameters match Sec. [5](https://arxiv.org/html/2606.00151#S5). Since our implementation uses the JAX framework, we exclude JIT warmup time for all methods. As expected, RePPO is slower than PPO-V and PPO-Q because it computes EI for the advantage. However, the additional cost is comparable to the gap between PPO-V and PPO-Q, that is, to the overhead incurred by replacing a $V$-critic with a $Q$-critic. This suggests that the additional cost of RePPO, such as sorting Q-values and computing EI, is small relative to the improvement in performance.
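To give a sense of why a sort over Q-values appears in a retry-style objective, the sketch below computes the expected maximum Q-value over $m$ independent action draws from a discrete policy, using the order-statistics identity $\mathbb{E}[\max] = \sum_i q_{(i)} \left( F_i^{m} - F_{i-1}^{m} \right)$, where $q_{(i)}$ are the sorted Q-values and $F_i$ is the cumulative policy probability up to the $i$-th sorted action. This is a simplified illustration only: it treats the Q-values as known and noise-free, so it ignores the return uncertainty that ReMax accounts for, and it is not the EI-based advantage that RePPO computes. Allowing non-integer $m$ by raising the CDF to the power $m$ is an assumption of this sketch, and the function name `expected_max_value` is hypothetical.

```python
def expected_max_value(q_values, probs, m):
    """Expected maximum Q-value over m i.i.d. action draws (m > 0, possibly non-integer).

    Assumes deterministic Q-values and a discrete policy given by `probs`.
    """
    # Sort actions by Q-value; this is the O(A log A) step per state.
    order = sorted(range(len(q_values)), key=lambda a: q_values[a])
    total = 0.0
    cdf_prev = 0.0
    for a in order:
        cdf = min(1.0, cdf_prev + probs[a])  # clamp against float round-off
        # Probability that the maximum over m draws is exactly this action's Q-value.
        total += q_values[a] * (cdf ** m - cdf_prev ** m)
        cdf_prev = cdf
    return total


q = [1.0, 3.0, 2.0]
pi = [0.2, 0.3, 0.5]
for m in (0.9, 1.0, 1.2, 1.4):
    print(m, expected_max_value(q, pi, m))
```

With $m = 1$ the expression reduces to the ordinary expected Q-value under the policy, and larger $m$ places more weight on the higher-valued actions, so the per-state overhead beyond a standard $Q$-critic is essentially one sort plus a pass over the actions.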
#### Craftax\.
We also compare the training speed of RePPO with PPO-V and PPO-V + RND in Fig. [15(b)](https://arxiv.org/html/2606.00151#A8.F15.sf2). Computation time is averaged over 5 random seeds. RePPO's training time is comparable to that of PPO-V and PPO-Q, and it runs faster than PPO-V + RND. This is expected, as RePPO does not require additional models beyond those used in PPO-V and PPO-Q.
## Appendix I LLM Usage
We made limited, assistive use of Large Language Models \(LLMs\) for presentation\-related tasks\. In particular, we used LLMs to revise wording for readability, provide minor assistance when organizing proof steps, suggest code refactoring options, and propose small figure improvements\. LLMs were not used for research ideation, study design, or the development of substantive scientific contributions\.
## Appendix J Hyperparameter Tables
This appendix collects the hyperparameter tables referenced throughout the experiment sections\.
Table 2: Hyperparameters for RePPO (MinAtar).

Table 3: Hyperparameters for PPO-V and PPO-Q (MinAtar).

Table 4: Hyperparameters for PQN on MinAtar: original (128 parallel envs) and tuned (1024 parallel envs).

Table 5: Hyperparameters for RePPO (Atari).

Table 6: Hyperparameters for PPO-V and PPO-Q (Atari).

Table 7: Hyperparameters for RePPO, PPO-V and PPO-Q (Craftax).