Finding the Time to Think: Learning Planning Budgets in Real-Time RL
Summary
This paper introduces variable-delay real-time RL, where agents decide how long to deliberate in environments that progress during decision-making, and proposes a lightweight gating policy to select state-dependent planning budgets, outperforming fixed-budget and heuristic baselines in several real-time games.
View Cached Full Text
Cached at: 06/26/26, 05:19 AM
# Learning Planning Budgets in Real-Time RL
Source: [https://arxiv.org/html/2606.26463](https://arxiv.org/html/2606.26463)
## Finding the Time to Think: Learning Planning Budgets in Real\-Time RL
###### Abstract
Deliberating takes time\. In real\-time settings, that time is not free\. Standard reinforcement learning \(RL\) sidesteps this as the environment waits indefinitely for the agent’s decision\. Instead, we study real\-time RL environments where the environment progresses while waiting for the agent’s action\. Building on prior real\-time formalizations, we introduce*variable\-delay real\-time RL*, where the agent chooses how long to deliberate at each decision point since the environment progresses\. For the planning agents we use, the right delay is state\-dependent, and naively planning how long to plan can paralyze the agent\. We instead approach this setting by training a lightweight gating policy on top of a planner to select state\-dependent planning budgets\. Across real\-time Pac\-Man, Tetris, Snake, Speed Hex, and Speed Go, our gating policy outperforms fixed\-budget and heuristic baselines, and transfers to a real\-time setup where the environment and agent run on two different GPUs\.
## 1Introduction
Expert decision\-makers do not think equally hard about every choice\. They must recognize which decisions require more time or effort to think through and which decisions can be made quickly, as they are constrained in the cognitive resources they can devote to decision\-making\(otto2019opportunity\)\. For example, a skilled chess player managing the clock in speed chess will play most moves quickly and reserve their time for critical board positions when they need to think for longer\. In standard RL, and Markov Decision Processes \(MDPs\), the environment waits as the agent may deliberate indefinitely before committing to an action, and computation does not carry any cost\.
Real\-time settings break this assumption: the world keeps progressing even if the agent takes longer to act\. The cost of deliberation is not wall\-clock latency, but steps taken by the environment while the agent is still deciding\. A perfect action executed one moment too late may be far worse than a timely, good\-enough action\.
ramstedt2019rtrlformalized this concern by fixing the action delay at exactly one timestep\. We generalizeramstedt2019rtrl’s framework and introduce*variable\-delay real\-time RL*, where the agent is a procedure that runs in time rather than a one\-frame function, so the time any particular action\-producing computation spans is paid as progress in the world\.
We explore this setting with agents that use anytime planning algorithms to improve the quality of their actions, but must confront that doing so takes time in which the environment continues to evolve\. Specifically, we use AlphaZero\-style models\(silver2018general\)that run Monte Carlo Tree Search \(MCTS\) at decision time to improve action quality \(see Appendix[B](https://arxiv.org/html/2606.26463#A2)for a primer\)\. A property of these agents is that more search produces better actions but takes longer to run, and we verify empirically \(Figure[3](https://arxiv.org/html/2606.26463#S4.F3)\) that planning quality and inference latency scale together\.
Our setting introduces three challenges\. Firstly, choosing the right delay is state\-dependent: some states reward careful planning while others demand immediate reaction\. Secondly, the agent must commit to*something*during the intermediate frames, since the environment will not wait for the planner\. Lastly, choosing how long to deliberate is fundamentally a problem of*meta\-reasoning*\(russell1991right;horvitz2013reasoning\), which risks an obvious*paradox of planning\-about\-planning*: the meta\-decision occurs while the environment continues to evolve, so deciding how long to think could incur the same per\-frame cost as thinking itself\.
We address these challenges by training a lightweight*gating policy*on top of a*frozen*planner that decides how long to plan at each decision point\. Our design escapes the paradox in practice because a single gate forward pass is orders of magnitude cheaper than an MCTS rollout, and therefore adds negligible real\-time overhead\. We employ our*gating policy*in*real\-time*Pac\-Man, Tetris, Snake, in which the world progresses while the planner runs in the background \(Figure[1](https://arxiv.org/html/2606.26463#S1.F1)\), and*clock environments*\(Speed Hex, Speed Go\), in which the board is static but each player has a clock that depletes with thinking time\. Section[3](https://arxiv.org/html/2606.26463#S3)formalizes both\.
Our contributions are the following:
1. 1\.Variable\-delay real\-time RL\.A generalization oframstedt2019rtrl’s fixed\-delay framework: under the real\-time interaction protocol the agent is a procedure that runs in time rather than a function evaluated instantaneously, and the time spent producing any action is paid as progress in the world\.
2. 2\.Planning quality vs\. inference time\.We empirically characterize how planning quality and real\-time inference cost co\-scale with MCTS simulation count, establishing the joint tradeoff that motivates adaptive allocation\.
3. 3\.Adaptive gating on a frozen planner\.A lightweight gating policy, trained with PPO on top of a frozen AlphaZero planner, that selects a state\-dependent planning budget at each decision\.
Figure 1:Given the current state, the gating policy chooses whether to react immediately or spend time planning, selecting the number of timestepskkover which to plan\. The agent then takesk−1k\{\-\}1committed actions usingπreflex\\pi\_\{\\mathrm\{reflex\}\}\(π0\\pi\_\{\\mathrm\{0\}\}\) while MCTS plans, and finally executes the planned action\.Across five environments, the gating policy which adaptively selects among a set of planning budgets outperforms baselines constrained to use a single fixed planning budget at every step\. To validate that the environments and method we designed capture the challenges of real\-time decision\-making, we transfer the learned gating policy to a real\-time hardware setting where the environment runs on one GPU and the agent runs on a second GPU without any retraining or architectural changes to demonstrate that it performs effectively\.
## 2Related Work
Real\-time RL and delayed MDPs\.The standard MDP assumes the environment waits during action selection\.travnik2018reactiveflagged this as a fundamental mismatch with real\-time interaction, andramstedt2019rtrlformalized an alternative in which the agent has exactly one timestep to compute its next action, mathematically equivalent to a 1\-step constant\-delay MDP\(walsh2009delayed\)\. Subsequent work extends this to fixed action and observation delays of larger length
\(derman2021delayed;bouteiller2021random;katsikopoulos2003markov\), to concurrent control\(xiao2020thinking\), and to asynchronous and pipelined architectures that mitigate inference cost\(riemer2025staggered;anokhin2025handling\)\. Across this literature, delay is imposed by the system\. We instead let the agent choose its delay at each decision point, turning delay selection itself into the learning problem\.
Value of computation and meta\-reasoning\.The view that rational agents must reason not only about what to do but about how much to think traces to bounded rationality\(simon1972theories\)and Good’s*Type II rationality*\(good1952rational\)\.russell1991rightmade this operational by defining the*value of computation*\(VOC\) as the expected improvement in decision quality minus its cost, and bounded optimality\(russell1991architecture\)reframed rational agency as the optimal use of fixed computational resources\. Anytime algorithms\(dean1988analysis;zilberstein1996using;likhachev2003ara;korf1990rta\)produce monotonically improving answers given more time, and Bayesian flexible\-computation methods\(horvitz1989reflection\)formalize when to stop\. Applied to MCTS, value\-of\-computation ideas yield Bayesian formulations of which simulation to run next\(hay2012selecting;tolpin2012simple;sezener2020static;lin2015metareasoning\), while classical chess\- and Go\-engine time management allocates clock per move via hand\-designed heuristics\(baier2016time;huang2010timemanagement;baudivs2011pachi\)\. The cognitive\-science literature models bounded deliberation itself as a meta\-MDP whose actions are computations\(lieder2017strategy;griffiths2019doing;callaway2018learning;cope\_learning\_2023\)and shows empirically that humans allocate planning effort to high\-stakes decisions in ways predicted by this model\. Our gating policy fits this lineage: it is a learned estimator that chooses total budget per decision rather than which simulation to run next, formalized as a meta\-MDP over a frozen planner\.
Adaptive compute in model\-based RL\.The closest neighbors to our work learn to allocate planning compute on top of a model\-based RL agent\. Metacontrol\(hamrick2017metacontrol\)chooses how many imagination steps to run; Thinker and Dynamic Thinker\(chung2023thinker;wang2024dynamic\)let the agent decide when to imagine alternative trajectories on top of a learned world model;hamrick2021roleprovides the key empirical motivation: shallow MuZero trees often suffice, and the marginal value of additional simulations varies sharply by state\. AlphaZero\-family planners\(silver2018general;schrittwieser2020muzero;danihelka2022policy\)treat MCTS budget as a fixed deployment hyperparameter\. Other axes of “learning to search”\(guez2018mctsnets;farquhar2018treeqn;hamrick2020save;racaniere2017i2a\)modify the planner’s internal behavior rather than its budget; we hold the planner fixed and control only how long it runs\.
Adaptive compute in language models\.The same thesis has been independently developed for sequence models\. PonderNet and adaptive computation time\(graves2016act;banino2021pondernet;schuster2022calm\)learn per\-input halting; test\-time compute scaling\(snell2024scaling;deepseek2025r1;muennighoff2025s1\)and RL\-trained reasoning\-length controllers\(aggarwal2025l1;muppidi2025predictive;fang2025thinkless;shen2025dast\)learn per\-query thinking budgets; cascaded routing\(kim2023bigl;chen2023frugalgpt;ong2025routellm\)defers easy queries to cheap models\. The shared thesis is that a small learned policy can profitably decide how much computation to spend per decision\. Our setting differs structurally: the cost of thinking is endogenous to the environment dynamics, so thinking longer changes the state the agent eventually acts in rather than incurring an external penalty\.
Our setting\.We share the thesis of the works above but address a setting none of them does\. We generalizeramstedt2019rtrl’s fixed\-delay framework to variable delay, where the cost of deliberation is paid in*environmental progression*rather than artificial reward shaping, so the gating policy learns the value of simulation budgets from outcomes alone\. We gate a*frozen*AlphaZero\-style planner with no joint training, formalized as a meta\-MDP over options\(sutton1999between;bacon2017optioncritic\)whose holding times are the chosen budgets\. We also introduce a protocol that makes this real\-time cost legible during training and that describes the dynamics of true asynchronous execution, so the trained policy transfers to a two\-GPU deployment with no architectural change\.
## 3Variable\-Delay Real\-Time RL
### 3\.1Problem: real\-time MDPs
Consider a standard MDPE=\(𝒮,𝒜,P,r,γ\)E=\(\\mathcal\{S\},\\mathcal\{A\},P,r,\\gamma\)\. The*real\-time interaction protocol*changes only how the agent and environment exchange actions: the environment advances by one frame at fixed intervalst=0,1,2,…t=0,1,2,\\dotsregardless of the agent, applying at each frame whatever action the agent has submitted by then and defaulting to an environment\-defined fallback \(typically a no\-op\) if none\. The MDP and the return𝔼\[∑tγtrt\]\\mathbb\{E\}\\\!\\left\[\\sum\_\{t\}\\gamma^\{t\}r\_\{t\}\\right\]are unchanged\.
The agent is therefore no longer a functionπ:𝒮→Δ\(𝒜\)\\pi:\\mathcal\{S\}\\to\\Delta\(\\mathcal\{A\}\)but a*procedure*that runs in time and emits an action at each frame\. This is the only change, but it has a sharp consequence: thinking costs progress in the world, since the environment advances while the agent computes\.
#### Relation to existing real\-time RL\.
The Real\-Time MDP oframstedt2019rtrlis the special case in which the agent’s procedure is constrained to be a functionπ:𝒮→Δ\(𝒜\)\\pi:\\mathcal\{S\}\\to\\Delta\(\\mathcal\{A\}\)that takes exactly one frame to evaluate—equivalent to a 1\-step constant\-delay MDP\(walsh2009delayed\)\. Fixed\-delay MDPs of lengthKK\(walsh2009delayed;derman2021delayed\)fix the procedure’s latency toKKframes\. Our setting subsumes both: any procedure satisfying the one\-action\-per\-frame contract is valid, and the agent \(not the environment\) decides how many frames any particular computation spans\.
### 3\.2Solution: an SMDP over budgeted options
The real\-time protocol leaves the agent’s procedure unspecified\. Our framework constructs one from three ingredients: a fast*reflex policy*that supplies an action every frame, a finite set of slow\-but\-better*anytime action\-refinement computations*that the agent invokes when it can afford to wait, and a learned*gating policy*that decides at each meta\-decision which computation \(if any\) to run\. The first two are combined into temporally extended*budgeted options*; the gating policy operates as a meta\-policy over those options in a semi\-Markov decision process \(SMDP\)\.
#### Reflex policy\.
We commit to a single fast policyπreflex\(a∣s\)\\pi\_\{\\mathrm\{reflex\}\}\(a\\mid s\)that runs in well under one frame\. It supplies the agent’s frame\-by\-frame output under the real\-time protocol regardless of what else is being computed in the background\. Sections[3\.3](https://arxiv.org/html/2606.26463#S3.SS3)and[3\.4](https://arxiv.org/html/2606.26463#S3.SS4)instantiateπreflex\\pi\_\{\\mathrm\{reflex\}\}per environment\.
#### Anytime action\-refinement computations\.
We additionally equip the agent with a finite family of*anytime action\-refinement computations*\{ck\}k∈𝒦\\\{c\_\{k\}\\\}\_\{k\\in\\mathcal\{K\}\}\(russell1991right;zilberstein1996using\), indexed by discrete durationk∈𝒦k\\in\\mathcal\{K\}: eachckc\_\{k\}is an anytime algorithm whose job is to refine the action choice given more compute\. Computationckc\_\{k\}runs for exactlykkframes once initiated and, at the start of itskk\-th frame, produces a refined action distributionπk\(⋅∣stn\)\\pi\_\{k\}\(\\cdot\\mid s\_\{t\_\{n\}\}\)at the statestns\_\{t\_\{n\}\}where it was initiated; longer\-runningckc\_\{k\}produce expectedly better actions, a property we verify for our instantiation in Figure[3](https://arxiv.org/html/2606.26463#S4.F3)\. Throughout this paper we instantiateckc\_\{k\}as MCTS run forkkframes fromstns\_\{t\_\{n\}\}, but the framework applies to any anytime action\-refinement algorithm with known per\-budget duration\.
#### Budgeted options\.
For eachk∈𝒦k\\in\\mathcal\{K\}we define a*budgeted option*oko\_\{k\}111Our budgeted options are a special case of the semi\-Markov options ofsutton1999between: an option is a triple⟨ℐ,π,β⟩\\langle\\mathcal\{I\},\\pi,\\beta\\ranglewith input setℐ\\mathcal\{I\}, internal policyπ\\pi, and termination conditionβ\\beta\.sutton1999betweennote that “\[s\]ometimes it is useful for options to ‘timeout,’ to terminate after some period of time has elapsed even if they have failed to reach any particular state,” and that this requires the semi\-Markov generalization in whichβ\\betamay depend on the history since the option was initiated\. Eachoko\_\{k\}uses exactly this construction: deterministic termination afterkkprimitive frames\. See[Sutton, Precup, and Singh \(1999\), §4](http://www.incompleteideas.net/papers/SPS-98.pdf)for the full formalism\.that wrapsckc\_\{k\}into a single temporally extended action:oko\_\{k\}initiatesckc\_\{k\}, emitsk−1k\-1*committed actions*drawn fromπreflex\\pi\_\{\\mathrm\{reflex\}\}during the wait window, and on the terminal frame appliesckc\_\{k\}’s output\. Concretely, ifoko\_\{k\}is initiated in statests\_\{t\}, then for the intermediate framesj=0,…,k−2j=0,\\dots,k\-2,
at\+j∼πreflex\(⋅∣st\+j\),st\+j\+1∼P\(⋅∣st\+j,at\+j\),a\_\{t\+j\}\\sim\\pi\_\{\\mathrm\{reflex\}\}\(\\cdot\\mid s\_\{t\+j\}\),\\qquad s\_\{t\+j\+1\}\\sim P\(\\cdot\\mid s\_\{t\+j\},a\_\{t\+j\}\),\(1\)and on the terminal frame
at\+k−1∼πk\(⋅∣st\),st\+k∼P\(⋅∣st\+k−1,at\+k−1\),a\_\{t\+k\-1\}\\sim\\pi\_\{k\}\(\\cdot\\mid s\_\{t\}\),\\qquad s\_\{t\+k\}\\sim P\(\\cdot\\mid s\_\{t\+k\-1\},a\_\{t\+k\-1\}\),\(2\)whereπk\\pi\_\{k\}is the action distribution produced byckc\_\{k\}initiated atsts\_\{t\}\. The option induces a transition kernelPk\(s′∣s\)=Pr\(st\+k=s′∣st=s,ok\)P\_\{k\}\(s^\{\\prime\}\\mid s\)=\\Pr\(s\_\{t\+k\}=s^\{\\prime\}\\mid s\_\{t\}=s,o\_\{k\}\)and an option\-level reward
Rk\(s\)=𝔼\[∑j=0k−1γjr\(st\+j,at\+j\)\|st=s,ok\]\.R\_\{k\}\(s\)\\;=\\;\\mathbb\{E\}\\\!\\left\[\\sum\_\{j=0\}^\{k\-1\}\\gamma^\{j\}\\,r\(s\_\{t\+j\},a\_\{t\+j\}\)\\;\\Big\|\\;s\_\{t\}=s,o\_\{k\}\\right\]\.\(3\)
Figure 2:Timeline of a budgeted optionoko\_\{k\}\. The agent emitsk−1k\{\-\}1committed actions fromπreflex\\pi\_\{\\mathrm\{reflex\}\}while computationckc\_\{k\}runs \(instantiated as MCTS forkkframes\), then appliesckc\_\{k\}’s outputπk\\pi\_\{k\}and returns to the meta level atst\+ks\_\{t\+k\}\. In clock environments, the committed steps are no\-ops that only consume clock\.
#### Meta\-level SMDP\.
The gating policyπgate\(k∣st\)\\pi\_\{\\mathrm\{gate\}\}\(k\\mid s\_\{t\}\)is a meta\-policy that chooses among the\|𝒦\|\|\\mathcal\{K\}\|budgeted options\. At each meta\-decision statests\_\{t\}the agent sampleskk, executesoko\_\{k\}, and returns to the meta level in statest\+ks\_\{t\+k\}afterkkprimitive frames\. The induced control problem is a semi\-Markov decision process\(sutton1999between;bacon2017optioncritic\)with meta\-state space𝒮\\mathcal\{S\}, meta\-action space𝒦\\mathcal\{K\}, holding timeτ\(k\)=k\\tau\(k\)=k, kernelPkP\_\{k\}, and rewardRkR\_\{k\}\. Its meta\-Bellman equation is
V\(st\)=𝔼k∼πgate\(⋅∣st\)\[∑j=0k−1γjrt\+j⏟option reward overkframes\+γkV\(st\+k\)⏟discounted bootstrap\]\.V\(s\_\{t\}\)\\;=\\;\\mathbb\{E\}\_\{k\\sim\\pi\_\{\\mathrm\{gate\}\}\(\\cdot\\mid s\_\{t\}\)\}\\\!\\left\[\\,\\underbrace\{\\sum\_\{j=0\}^\{k\-1\}\\gamma^\{j\}r\_\{t\+j\}\}\_\{\\text\{option reward over $k$ frames\}\}\\;\+\\;\\underbrace\{\\gamma^\{k\}V\(s\_\{t\+k\}\)\}\_\{\\text\{discounted bootstrap\}\}\\,\\right\]\.\(4\)The gating policy is therefore solving a discrete control problem: not which primitive action to take now, but which temporally extended computation\-and\-control routine to invoke\. In simulation we throttle MCTS to a fixed number of simulations per frame \(32 in the committed\-action environments; see Section[3\.3](https://arxiv.org/html/2606.26463#S3.SS3)\), so the holding time ofoko\_\{k\}is exactlykkframes by construction and this SMDP has deterministic holding times\. In real\-time deployment, MCTS runtime is slightly stochastic, so the hardware system is closer to a continuous\-time SMDP\. Appendix[L](https://arxiv.org/html/2606.26463#A12)discusses this distinction\.
#### Why options rather than state augmentation\.
The RTMDP framework\(ramstedt2019rtrl\)handles its fixed one\-step delay by augmenting the state with the action currently in flight,𝐱t=\(st,at\)\\mathbf\{x\}\_\{t\}=\(s\_\{t\},a\_\{t\}\), with transition kernel
PRT\(𝐱t\+1∣𝐱t,a^t\)=P\(st\+1∣st,at\)δ\(at\+1−a^t\),P\_\{\\mathrm\{RT\}\}\(\\mathbf\{x\}\_\{t\+1\}\\mid\\mathbf\{x\}\_\{t\},\\hat\{a\}\_\{t\}\)=P\(s\_\{t\+1\}\\mid s\_\{t\},a\_\{t\}\)\\,\\delta\(a\_\{t\+1\}\-\\hat\{a\}\_\{t\}\),\(5\)whereδ\\deltais a Dirac delta\. This is the natural representation when delay is fixed and the agent emits nothing during the wait window\. In our setting both assumptions fail:kkis chosen by the agent at every decision frame, and during the wait window the agent is actively controlling the environment throughπreflex\\pi\_\{\\mathrm\{reflex\}\}\. State augmentation would have to grow with\|𝒦\|\|\\mathcal\{K\}\|and encode the within\-option phase; budgeted options collapse the within\-option dynamics into a single SMDP transition, leaving the gating policy a standard meta\-decision problem\.
### 3\.3Committed\-action environments: Pac\-Man, real\-time Tetris, Snake
In committed\-action environments, acting and planning happen concurrently\. Thek−1k\-1committed actions are drawn fromπreflex\\pi\_\{\\mathrm\{reflex\}\}, which we instantiate as the planner’s own policy network evaluated without MCTS, computed via a single forward pass in roughly two milliseconds, while the full MCTS runs in the background\. We choose this instantiation rather than a separately trained reflex policy because it requires no additional training and keeps theπreflex\\pi\_\{\\mathrm\{reflex\}\}aligned with the MCTS tree’s internal rollouts\. On thekk\-th frame the planner’s chosen action is applied\. MCTS rolls out the samek−1k\{\-\}1committed steps that the agent will actually execute, so search is performed over the future state in which the planned action will land\. The cost of deeper planning is therefore exactlyk−1k\{\-\}1additional frames of world progression before the carefully chosen action lands\.
We instantiate this setting in three environments derived fromJumanji\(bonnet2024jumanji\): Pac\-Man, Snake, and a real\-time variant of Jumanji Tetris \(Appendix[D\.1](https://arxiv.org/html/2606.26463#A4.SS1)\)\. The source of progression differs across environments \(ghost motion in Pac\-Man, gravity in real\-time Tetris, body elongation in Snake\), but in all cases a deeper plan arrives later, after the world has changed\.
We use𝒦=\{1,2,3,4\}\\mathcal\{K\}=\\\{1,2,3,4\\\}and calibrate 32 MCTS simulations to one environment frame, which corresponds to roughly 9 FPS on an H100 \(Section[7](https://arxiv.org/html/2606.26463#S7)\)\. At budgetkkthe planner runs32k32ksimulations overkkframes, so each frame still receives 32 simulations on average regardless ofkk\. Differences in performance reflect*when*compute is allocated, not the average compute spent per frame\.
### 3\.4Clock environments: Speed Hex and Speed Go
Clock environments instantiate our framework with a degenerate reflex policy:πreflex\\pi\_\{\\mathrm\{reflex\}\}deterministically emits the no\-op action, so committed steps leave the board unchanged and onlyckc\_\{k\}’s terminal action affects play\. The clock is therefore the only dynamic element of the environment; it decrements with thinking time over thekkframes of an option, while the board state remains fixed until the planner’s chosen action is applied\. Intuitively, this is the same pressure as in speed chess: the board waits for your move, but your limited clock keeps running while you think\.
We instantiate this setting in Speed Hex \(11×1111\\times 11\) and Speed Go \(9×99\\times 9\), both board\-game environments built onpgx\(koyamada2023pgx\)\. We use a one\-to\-one mapping between MCTS simulations and clock ticks: each simulation increments the player’s clock by one unit\.
For clock environments the equal\-compute\-per\-frame constraint does not apply, so we calibrate the simulation options separately to ensure each option represents a meaningfully distinct tradeoff between inference latency and planning quality\. Calibration details and the resulting option sets𝒦=\{2,8,32,128\}\\mathcal\{K\}=\\\{2,8,32,128\\\}for Speed Hex and𝒦=\{16,32,64,96\}\\mathcal\{K\}=\\\{16,32,64,96\\\}for Speed Go are in Appendix[E](https://arxiv.org/html/2606.26463#A5)\.
## 4Adaptive Gating Policy
### 4\.1The planning tradeoff
The core observation motivating our approach is that planning quality and inference cost scale together\. As an agent runs more MCTS simulations, its action quality improves; but more simulations also take longer to run, and in real\-time settings that longer inference is exactly what advances the environment, or depletes the clock, before the agent acts\. We empirically verify this relationship in Figure[3](https://arxiv.org/html/2606.26463#S4.F3)\. The full five\-environment scaling results appear in Appendix[E](https://arxiv.org/html/2606.26463#A5)\. This joint scaling means we do not need to learn*how*to plan as the planner handles that, only*when*the additional quality gain is worth the additional real\-time cost\.
### 4\.2The gating policy
The gating policy takes three inputs at each meta\-step: \(1\) the raw game observation, \(2\) the planner’s intermediate spatial features extracted from its frozen trunk and \(3\) the planner’s scalar value estimateV\(st\)V\(s\_\{t\}\)\. In clock environments the observation also includes the remaining time budget\. Together these give the gating policy both a direct view of the board and a compressed summary of the planner’s own confidence, without requiring any access to the planner’s internals beyond a single forward pass\. A lightweight network processes these inputs and produces a distribution overkkand a baseline value for training\. All architecture details are in Appendix[I](https://arxiv.org/html/2606.26463#A9)\.
Figure 3:Across Pac\-Man, Tetris, and 2\-player Speed Hex, planning quality rises with simulation count while inference latency rises alongside it\. Blue denotes performance; dashed curves denote per\-step latency on H100, A100, and A40; shading shows±\\pmSE\.
### 4\.3Training
We train in two phases\. First, we train AlphaZero\-style base planners for each environment\. We then freeze the selected base planner and train the gating policy with PPO\(schulman2017proximal\)on top of it\. For the clock environments, these base planners are trained by self\-play \(Appendix[I](https://arxiv.org/html/2606.26463#A9)\)\. Because each meta\-step has variable durationkk, we adapt Generalized Advantage Estimation\(schulman2015high\)to carry the per\-meta\-step discountγk\\gamma^\{k\}through the advantage computation \(see Appendix[C](https://arxiv.org/html/2606.26463#A3)\)\. Making this meta\-MCTS AlphaZero stack computationally feasible requires careful attention to several JAX implementation details, which we summarize in Appendix[I](https://arxiv.org/html/2606.26463#A9)\.
## 5Experiments
#### Baselines\.
For real\-time Pacman, Tetris, and Snake we compare against:*always\-kk*policies for eachk∈𝒦k\\in\\mathcal\{K\}; and a*random*policy selectingkkuniformly at each meta\-step\. For speed environments we additionally compare against*greedy*\(always use maximum simulations\) and*midpeak*\(a bell\-curve allocation that concentrates compute at the critical mid\-game, a common heuristic in game enginesbaier2016time;huang2010timemanagement\)\. For these clock heuristics we tune the bell\-curve shape separately at each evaluation budget and report the best resulting strategy\. All results are averaged over 100 independent episode seeds; we report mean±\\pmstandard error in Fig\.[4](https://arxiv.org/html/2606.26463#S5.F4)\.
Figure 4:Across all five environments, the gating policy outperforms fixed\-budget and heuristic baselines, showing that adaptive allocation matters more than committing to a single search budget\. Bars show mean±\\pmSE over 100 episodes; for Speed Hex and Speed Go, expected score is averaged over the shared sampled clock budgets\.For both Speed Hex and Speed Go, we evaluate across the same five sampled clock budgets,T∈\{300,1200,2300,3500,4100\}T\\in\\\{300,1200,2300,3500,4100\\\}, and average the resulting head\-to\-head expected scores\. The fixed\-budget baselines use 2/8/32/128 simulations for Speed Hex and 16/32/64/96 simulations for Speed Go\. In these main clock benchmarks, exhausting the clock does not immediately end the game\. We found that simply exhausting the clock was too easy a setting \(Appendix[H](https://arxiv.org/html/2606.26463#A8)\), so we study the harderπreflex\\pi\_\{\\mathrm\{reflex\}\}\-fallback setting, where control passes toπreflex\\pi\_\{\\mathrm\{reflex\}\}after the clock runs out and the task becomes one of resource allocation rather than simply avoiding timeout\.
#### The gating policy outperforms every fixed\-budget baseline across all environments\.
In Pac\-Man, the best fixed policy \(k=3k\{=\}3, 96 sims/step\) scores 2149; the gating policy reaches 2370 \(\+10\.3%\+10\.3\\%\)\. In real\-time Tetris, the best fixed policy \(k=3k\{=\}3\) scores 27\.6; the gating policy reaches 45\.6 \(\+65%\+65\\%\)\. In Snake, the best fixed policy \(k=1k\{=\}1\) scores 14\.91; the gating policy reaches 16\.54 \(\+10\.9%\+10\.9\\%\)\. In Speed Hex, averaged over the shared sampled clock budgets, the gating policy reaches an expected score of 0\.58, compared with 0\.43 for the best fixed\-budget baseline \(k=128k\{=\}128\) and 0\.46 for the best heuristic baseline \(greedy\)\. In Speed Go, averaged over the same budgets, the gating policy reaches an expected score of 0\.59, compared with 0\.51 for the best fixed\-budget baseline \(k=64k\{=\}64\) and 0\.50 for the best heuristic baseline \(midpeak\)\.
Across environments, the gate learns dynamic, state and budget\-dependent allocation strategies that outperform both fixed\-budget policies and hand\-designed heuristics \(Figure[6](https://arxiv.org/html/2606.26463#S6.F6), Appendix Figure[11](https://arxiv.org/html/2606.26463#A7.F11)\)\. The random baseline falls well below every fixed\-kkpolicy in both environments, confirming that*when*to allocate compute matters as much as how much\.
## 6Analysis
### 6\.1What triggers deeper planning?
Across the environments, our gating policy allocates compute in states where the consequence of a suboptimal action is large and additional search budget is likely to change the decision \(Figure[5](https://arxiv.org/html/2606.26463#S6.F5)\)\.
#### Pac\-Man\.
In the current evaluation set, larger chosen budgets are associated with greater nearest\-ghost distance in Pac\-Man: when the nearest ghost is close, the policy defaults to the more reactivek=1k\{=\}1option; at intermediate distances it shifts towardk=2k\{=\}2; and when the threat is farthest away it can afford the deeperk=4k\{=\}4plan\. We also see in Appendix[G](https://arxiv.org/html/2606.26463#A7)that the gating policy becomes more reactive late in the episode\.
#### Real\-time Tetris\.
Board density is the dominant feature in real\-time Tetris\.k=1k\{=\}1is selected almost exclusively on near\-empty boards \(fill≈0\.05\\approx 0\.05, stack≈3\.5\\approx 3\.5rows\) where any reasonable placement is acceptable\.k=2k\{=\}2andk=4k\{=\}4are selected on dense boards \(fill≈0\.25\\approx 0\.25–0\.320\.32, stacks≈12\\approx 12rows\) where placement precision is consequential\.k=3k\{=\}3is never used \(0%0\\%\), producing the bimodal “react or plan deeply” strategy visible in Figure[5](https://arxiv.org/html/2606.26463#S6.F5)\. Piece type also modulates budget: the Z\-piece \(k¯=2\.98±0\.04\\bar\{k\}\{=\}2\.98\{\\scriptstyle\\pm 0\.04\}\) and O\- and T\-pieces \(k¯=2\.88±0\.04\\bar\{k\}\{=\}2\.88\{\\scriptstyle\\pm 0\.04\}\) receive the most deliberation, while the J\-piece \(k¯=2\.66±0\.04\\bar\{k\}\{=\}2\.66\{\\scriptstyle\\pm 0\.04\}\) receives the least, reflecting differences in placement geometry and expected outcome variance\.
#### Snake\.
Spatial constraint, rather than simple goal distance, is the main trigger\.k=1k\{=\}1dominates on open boards, whilek=3k\{=\}3appears when reachability drops, local body density rises, and the snake has fewer safe continuations\.k=2k\{=\}2occupies a milder regime associated with navigating toward more distant fruit on otherwise open boards\. Immediately after eating, when the body has just grown,k=3k\{=\}3usage jumps from3\.2%3\.2\\%to13\.5%13\.5\\%, indicating that the policy treats body\-growth events as a prospective spatial\-risk signal\.
#### Clock environments\.
In the clocked PGX games, a key trigger is the interaction between board state and remaining time\. Figure[6](https://arxiv.org/html/2606.26463#S6.F6)shows the resulting shift in allocation for both Speed Hex and Speed Go\. Under the small clock budget \(T=300T\{=\}300\), the policy heavily favors the cheapest option, reflecting the high opportunity cost of thinking\. Under the larger clock budget \(T=4100T\{=\}4100\), the same learned policy spreads probability mass much more evenly across the budgets, allocating deeper search substantially more often\. Because the policy is trained across many clock settings rather than tuned separately at each one, this shift is evidence of budget\-conditioned generalization\.
Figure 5:The policy plans deeply precisely when the state is dangerous or constrained\. Across Pac\-Man, real\-time Tetris, and Snake, larger chosen budgets are associated with higher threat, denser boards, or fewer safe continuations, indicating that the gate is responding to meaningful decision difficulty\. Plots show state features conditioned on chosen budgetkk\(mean±\\pm1 SE, 100 episodes\)\.Figure 6:In both Speed Hex and Speed Go, the learned policy is much more reactive under the small budget \(T=300T\{=\}300\) and distributes mass toward deeper options once more clock is available \(T=4100T\{=\}4100\)\.
## 7Real\-Time Deployment
We validate the committed\-action framework in a two\-GPU asynchronous setup across three committed\-action environments: real\-time Tetris, Pac\-Man, and Snake\. One GPU runs the environment and executes committed actions continuously; a second GPU runs the MCTS planner\. At each meta\-decision the current game state is transferred to the planning GPU, which begins search while the environment GPU keeps the game moving with committed actions\. When planning finishes, afterk×32k\\times 32simulations, the result is applied at the next environment step \(Figure[7](https://arxiv.org/html/2606.26463#S7.F7)\)\.
Figure[8](https://arxiv.org/html/2606.26463#S7.F8)summarizes 45 measured deployments: 3 environments×\\times3 GPU classes×\\times5 frame rates \(8–12 FPS\)\. The asynchronous execution pattern is unchanged from training \(see Section[3\.3](https://arxiv.org/html/2606.26463#S3.SS3)\): ak=4k\{=\}4meta\-step at 9 FPS spans four 111 ms frames, while committed actions from the reflex policy keep the environment moving and MCTS finishes asynchronously on the second GPU\.
The main result, as seen in Fig\.[8](https://arxiv.org/html/2606.26463#S7.F8), is that simulation\-trained policies transfer cleanly to asynchronous hardware deployment over a broad operating range\. Latency remains well within budget on H100 in all three environments, and remains reliable on A100 except near the tightest deadline regimes\. Return transfer is similarly strong\. At 9 FPS, deployed returns remain close to their simulation counterparts across all three environments, and the learned budget distribution also transfers cleanly\. In real\-time Tetris on H100, the policy preserves the same strongly bimodal “react or plan deeply” strategy seen in simulation\. Appendix[J](https://arxiv.org/html/2606.26463#A10)gives the full per\-environment, per\-FPS, per\-GPU analysis\.
This result tests a key hypothesis behind the committed\-action training protocol\. We treat thek−1k\{\-\}1committed steps executed in simulation as a model of real asynchronous deployment: if this abstraction is correct, then a policy trained under it should transfer to a hardware\-separated environment\-and\-planner system without modification\. Because training simulates the planning delay by actually executing committed actions in the environment, the transfer result supports this hypothesis and suggests that the separation between environment and planner is the right abstraction for real\-time deployment\.
Figure 7:Two\-GPU asynchronous deployment pipeline\.GPU 0 runs the environment continuously via committed \(reflex\) actions; GPU 1 runs MCTS in parallel, so a largerkkbuys more planning time without pausing the game\.*Left:*schematic of the two\-GPU loop\.*Right:*execution trace at 9 FPS, ak=4k\{=\}4meta\-step spans four 111 ms frames, followed by ak=1k\{=\}1recovery step\.Figure 8:Simulation\-trained policies transfer cleanly to hardware deployment\.*Left:*MCTS latency variance tightens with GPU class—H100 is most consistent, A100 intermediate, A40 widest\.*Centre:*H100 and A100 miss deadlines rarely; A40 breaks down only at the tightest frame rates\.*Right:*Deployed returns at 9 FPS match simulation closely across all environments and GPU classes\.
## 8Conclusion
We generalizedramstedt2019rtrl’s fixed\-delay real\-time RL framework by treating the agent as a procedure that runs in time rather than a function evaluated instantaneously: under the real\-time interaction protocol the environment advances every frame regardless, and the time spent producing any action is paid as progress in the world\. We then instantiated this with an SMDP over budgeted options, in which a lightweight gating policy trained with PPO on top of a frozen AlphaZero planner learns to invest more search where it matters and react quickly elsewhere\. Across five environments spanning committed\-action and clock\-based real\-time mechanisms, the gating policy outperforms every fixed\-budget baseline, and the simulation\-trained policy transfers to a true two\-GPU asynchronous deployment with no architectural change\. While we instantiate the framework with MCTS, it applies more broadly to any anytime action\-refinement algorithm with known per\-budget duration\.
#### Limitations\.
Our committed\-action protocol relies on the MCTS tree faithfully simulating thek−1k\{\-\}1committed steps, which assumes a perfect environment simulator; extending to learned\-dynamics planners \(e\.g\., MuZero\) is natural future work\. The gating policy is trained on top of a*frozen*base planner, and Appendix[F](https://arxiv.org/html/2606.26463#A6)shows that the choice of base checkpoint substantially affects downstream performance; joint optimization of planner and gate remains open\. We hand\-calibrate a small discrete budget set per environment; continuous or richer compute vocabularies are natural extensions but require recalibrating the cost\-quality tradeoff\.
## Acknowledgments
We thank[Justin Svegliato](https://justinsvegliato.com/)for valuable feedback on our metareasoning definitions and framing, and[Mattie Fellows](https://scholar.google.com/citations?user=7TVJf1gAAAAJ&hl=en)and[Uljad Berdica](https://uljad.com/)for helpful discussions and feedback on earlier drafts\. A\. Muppidi and F\. Darwish are supported by the[Rhodes Scholarship](https://www.rhodeshouse.ox.ac.uk/)\(Rhodes Trust\)\. The authors declare no competing interests\.
## References
## Appendix AReproducibility
The full JAX implementation of every environment, base planner, and gating policy in this paper, together with pretrained checkpoints for all five environments and the two\-GPU deployment harness, is released at[https://aneeshers\.github\.io/realtime\-rl/](https://aneeshers.github.io/realtime-rl/)\. Each result reported in Sections[5](https://arxiv.org/html/2606.26463#S5)and[7](https://arxiv.org/html/2606.26463#S7)has a corresponding environment\-variable\-driven launcher script \(e\.g\.eval\_pacman\_gating\.sh,deploy\.sh\) that reproduces the reported number from a single command using the shipped checkpoints, and the release README lists expected returns per command so any discrepancy is immediately visible\. Full training hyperparameters are in Appendix[I](https://arxiv.org/html/2606.26463#A9), base\-checkpoint selection is in Appendix[F](https://arxiv.org/html/2606.26463#A6), and the two\-GPU deployment grid is in Appendix[J](https://arxiv.org/html/2606.26463#A10)\.
## Appendix BAlphaZero and MCTS
This appendix explains the planning infrastructure used in this paper\. It targets readers familiar with deep RL but not with tree\-search methods\.
The planner used throughout this paper is Monte Carlo Tree Search \(MCTS\) guided by a neural network, the same combination introduced in AlphaZero\(silver2018general\)\. The intuition is that the network alone is fast but imperfect: it gives a quick estimate of which actions look promising, but it never actually checks\. MCTS turns a fixed compute budget into action\-quality: it spends that budget on simulated rollouts, focusing on the actions the network considers most promising while still exploring less\-likely alternatives, and uses what it learns to choose a better action than the network’s raw guess\. More simulations produce better actions but take longer, which is the tradeoff the gating policy in the main paper learns to manage\.
What does “running a simulation” mean in MCTS?
A*simulation*\(or*playout*\) is one traversal of the search tree from the root \(the current state\) down to a leaf, followed by a backup of the resulting value estimate up the tree\. Each step of the traversal picks a child to descend into, balancing two pressures: prefer children that have looked good so far \(high empirical value\), but also try children the search has barely visited \(high prior, low visit count\) in case those are even better\. The PUCT rule formalizes this:
a∗=argmaxa\[Q\(s,a\)\+c⋅πθ\(a\|s\)⋅N\(s\)1\+N\(s,a\)\],a^\{\*\}=\\arg\\max\_\{a\}\\Bigl\[Q\(s,a\)\+c\\cdot\\pi\_\{\\theta\}\(a\|s\)\\cdot\\frac\{\\sqrt\{N\(s\)\}\}\{1\+N\(s,a\)\}\\Bigr\],whereQ\(s,a\)Q\(s,a\)is the empirical mean value of subtreeaa,πθ\(a\|s\)\\pi\_\{\\theta\}\(a\|s\)is the network’s prior over actions,N\(s\)N\(s\)is the total visit count of nodess, andN\(s,a\)N\(s,a\)is the visit count of edge\(s,a\)\(s,a\)\. Repeated simulations concentrate visits on high\-value, high\-prior subtrees while continuing to explore less\-visited ones\. Afternnsimulations the recommended action is the*most\-visited*child of the root\.
How is the final action selected?
At evaluation time the agent picks the most\-visited root child\. During training, the visit\-count distributionπ^\(a\)∝N\(s0,a\)1/τ\\hat\{\\pi\}\(a\)\\propto N\(s\_\{0\},a\)^\{1/\\tau\}\(temperatureτ\>0\\tau\>0\) is used as a*search target*: the network is updated to predictπ^\\hat\{\\pi\}at states0s\_\{0\}\. Lowerτ\\tauconcentrates probability on the most\-visited action; higherτ\\tauspreads probability across alternatives, useful for exploration early in training\. This setup \(search produces a better\-than\-raw\-policy distribution, and the network learns to imitate it\) is called*expert iteration*\(anthony2021expert\), and over many iterations it tightens the network’s prior, making future searches better\-targeted\.
What is the difference between AlphaZero and MuZero?
AlphaZerosilver2018generalrequires a*perfect simulator*: given statessand actionaa, the simulator returns the exact next states′s^\{\\prime\}\. MCTS runs entirely inside this simulator; the network supplies priors and value estimates but never models transitions\.
MuZeroschrittwieser2020muzeroreplaces the perfect simulator with a*learned latent\-space dynamics model*\. Transitions during search are predicted by a network rather than a ground\-truth engine, so MuZero can plan in environments without a simulator \(e\.g\. Atari frames\)\.
All environments in this paper expose a JAX\-native step function, so we use the AlphaZero paradigm\. This choice is also what enables our committed\-action protocol \(Section[3\.3](https://arxiv.org/html/2606.26463#S3.SS3)\): the search tree explicitly rolls forward thek−1k\{\-\}1committed steps to reach the future state where the planned action will land, which is only possible when the simulator is perfect\.
AlphaZero was designed for two\-player zero\-sum games\. How does it transfer to single\-agent settings?
The core loop is*expert iteration*anthony2021expert: at each iteration the agent uses MCTS to produce an expert distribution over actions \(better than the network’s raw policy because it incorporates lookahead\), and the network is trained to imitate that expert\. Over many iterations, the network’s prior gets stronger, which makes future searches better\-targeted\. The value head uses a bootstrapped return instead of a zero\-sum game outcome; otherwise the algorithm is identical to two\-player AlphaZero\. Algorithm[1](https://arxiv.org/html/2606.26463#alg1)formalizes this\.
Algorithm 1Single\-agent AlphaZero via Expert Iteration1:Network
fθ=\(πθ,vθ\)f\_\{\\theta\}=\(\\pi\_\{\\theta\},v\_\{\\theta\}\), environment
ℰ\\mathcal\{E\}, simulations per step
nn, replay buffer
ℬ\\mathcal\{B\}
2:repeat
3:
𝒟←\{\}\\mathcal\{D\}\\leftarrow\\\{\\\}⊳\\trianglerightepisode data buffer
4:
s0←ℰ\.reset\(\)s\_\{0\}\\leftarrow\\mathcal\{E\}\.\\text\{reset\}\(\)
5:for
t=0,1,…,T−1t=0,1,\\ldots,T\-1do⊳\\trianglerightAct
6:Run
nnMCTS simulations from
sts\_\{t\}using
fθf\_\{\\theta\}
7:
π^t\(a\)∝N\(st,a\)1/τ\\hat\{\\pi\}\_\{t\}\(a\)\\propto N\(s\_\{t\},a\)^\{1/\\tau\}⊳\\trianglerightvisit\-count policy
8:
at∼π^ta\_\{t\}\\sim\\hat\{\\pi\}\_\{t\};
st\+1←ℰ\.step\(at\)s\_\{t\+1\}\\leftarrow\\mathcal\{E\}\.\\text\{step\}\(a\_\{t\}\)
9:
𝒟←𝒟∪\{\(st,π^t\)\}\\mathcal\{D\}\\leftarrow\\mathcal\{D\}\\cup\\\{\(s\_\{t\},\\,\\hat\{\\pi\}\_\{t\}\)\\\}
10:endfor
11:Compute returns
zt=∑k≥0γkrt\+kz\_\{t\}=\\sum\_\{k\\geq 0\}\\gamma^\{k\}r\_\{t\+k\}for each
tt⊳\\trianglerightStore
12:
ℬ←ℬ∪\{\(st,π^t,zt\)\}t=0T−1\\mathcal\{B\}\\leftarrow\\mathcal\{B\}\\cup\\\{\(s\_\{t\},\\hat\{\\pi\}\_\{t\},z\_\{t\}\)\\\}\_\{t=0\}^\{T\-1\}
13:Sample mini\-batch from
ℬ\\mathcal\{B\}⊳\\trianglerightLearn
14:
ℒ\(θ\)=1\|ℬ\|∑\(s,π^,z\)∈ℬ\[\(vθ\(s\)−z\)2−π^⊤logπθ\(s\)\]\\mathcal\{L\}\(\\theta\)=\\displaystyle\\frac\{1\}\{\|\\mathcal\{B\}\|\}\\sum\_\{\(s,\\hat\{\\pi\},z\)\\in\\mathcal\{B\}\}\\Bigl\[\\bigl\(v\_\{\\theta\}\(s\)\-z\\bigr\)^\{2\}\-\\hat\{\\pi\}^\{\\top\}\\log\\pi\_\{\\theta\}\(s\)\\Bigr\]
15:
θ←θ−α∇θℒ\(θ\)\\theta\\leftarrow\\theta\-\\alpha\\,\\nabla\_\{\\theta\}\\,\\mathcal\{L\}\(\\theta\)
16:untilconvergence
How does MCTS scale on modern accelerators?
A common worry is that MCTS is inherently sequential, each simulation updates theQQandNNtables that the next simulation reads\. In practice, two layers of parallelism make this much less of a bottleneck than it appears\.
Environment\-level parallelism\.We maintain a batch ofEEindependent environment instances \(32 in our experiments\)\. Each instance has its own search tree; allEEleaf nodes are evaluated in a*single batched forward pass*of the neural network, which is the GPU/TPU bottleneck\.
Intra\-tree simulation loop viajax\.lax\.scan\.Within each tree thennsimulations are unrolled as ascan\. Each iteration: \(1\) traverse allEEtrees to their current leaf using PUCT \(parallelised overEE\), \(2\) batch\-evaluate theEEleaves in one network call, \(3\) back\-propagate values\. Because JAX traces the full scan at compile time, the tree\-update logic is fused into a single XLA kernel with zero Python\-level iteration overhead at runtime\.
The result is that allnnsimulations across allEEenvironments are compiled into a singlejit\-ted function that runs entirely on the accelerator, with one network forward pass per simulation step and no host\-device transfers during rollout\.
## Appendix CVariable\-duration GAE
We re\-introduce the meta\-step subscriptktk\_\{t\}in this appendix because the derivation reasons about a sequence of meta\-decisions rather than a single budget choice\.
Standard GAE\(schulman2015high\)estimates the advantage at timestepttas aλ\\lambda\-weighted sum of one\-step TD residuals,
A^t=∑l=0∞\(γλ\)lδt\+l,δt=rt\+γV\(st\+1\)−V\(st\),\\hat\{A\}\_\{t\}\\;=\\;\\sum\_\{l=0\}^\{\\infty\}\(\\gamma\\lambda\)^\{l\}\\,\\delta\_\{t\+l\},\\qquad\\delta\_\{t\}\\;=\\;r\_\{t\}\+\\gamma\\,V\(s\_\{t\+1\}\)\-V\(s\_\{t\}\),\(6\)which assumes a unit time gap between consecutive states\. In our SMDP, consecutive meta\-statessts\_\{t\}andst\+kts\_\{t\+k\_\{t\}\}are separated by a variable number of environment framesktk\_\{t\}, so the per\-step discount in the TD residual becomesγkt\\gamma^\{k\_\{t\}\}:
δt=Rt\+γktV\(st\+kt\)−V\(st\),Rt=∑j=0kt−1γjrt\+j\.\\delta\_\{t\}\\;=\\;R\_\{t\}\+\\gamma^\{k\_\{t\}\}\\,V\(s\_\{t\+k\_\{t\}\}\)\-V\(s\_\{t\}\),\\qquad R\_\{t\}\\;=\\;\\sum\_\{j=0\}^\{k\_\{t\}\-1\}\\gamma^\{j\}\\,r\_\{t\+j\}\.\(7\)The discount accumulated between meta\-stepsttandt\+lt\+lisγ∑j=0l−1kt\+j\\gamma^\{\\sum\_\{j=0\}^\{l\-1\}k\_\{t\+j\}\}rather thanγl\\gamma^\{l\}, giving the meta\-step advantage
A^t=∑l=0∞λl\(∏j=0l−1γkt\+j\)δt\+l,\\hat\{A\}\_\{t\}\\;=\\;\\sum\_\{l=0\}^\{\\infty\}\\lambda^\{l\}\\left\(\\prod\_\{j=0\}^\{l\-1\}\\gamma^\{k\_\{t\+j\}\}\\right\)\\delta\_\{t\+l\},\(8\)or, recursively,
A^t=δt\+γktλA^t\+1\.\\hat\{A\}\_\{t\}\\;=\\;\\delta\_\{t\}\+\\gamma^\{k\_\{t\}\}\\lambda\\,\\hat\{A\}\_\{t\+1\}\.\(9\)The structure is identical to standard GAE; the only change is that the per\-step discount isγkt\\gamma^\{k\_\{t\}\}rather thanγ\\gamma\. We compute these advantages in a single backward pass over each rollout using equation \([9](https://arxiv.org/html/2606.26463#A3.E9)\)\.
Without this correction, treating each meta\-step as a unit time gap would systematically understate the discount on long\-budget meta\-steps, biasing the value function in favor of many short budgets over fewer long ones\. Theγkt\\gamma^\{k\_\{t\}\}correction makes the value function comparable across budget choices\.
## Appendix DEnvironment Details
### D\.1Real\-time Tetris
Our real\-time Tetris environment is a modification of Jumanji Tetris in which each environment step is one gravity tick rather than one full placement decision\. The board is20×1020\\times 10, episodes last at most 2000 gravity ticks, and the observation contains the locked board, the current falling tetromino, its\(x,y\)\(x,y\)position, and the gravity\-tick count\.
Every call toenv\.step\(action\)first applies the chosen control and then applies one gravity update\. The action set has six elements:
- •Left / Right: translate the falling piece one column in the respective direction; invalid lateral moves are ignored\.
- •Rotate\-CW / Rotate\-CCW: rotate the piece90°90°clockwise or counter\-clockwise; invalid rotations are ignored\.
- •Hard\-drop: instantly translate the piece to the lowest valid row and lock it; a new piece spawns immediately\.
- •Noop: no horizontal or rotational change; the piece descends one row by gravity\.
A piece also locks whenever gravity would move it into an occupied cell; completed rows are then cleared and scored with the standard convex Tetris reward schedule\. The episode ends on board top\-out, i\.e\. when a newly spawned piece cannot be placed at the spawn position, or when the 2000\-tick horizon is reached\.
For the committed\-action experiments in the main paper we use theTetrisRTKTvariant\. The underlying environment dynamics are identical, but during thek−1k\{\-\}1delay steps inside MCTS the tree rolls out policy\-guided committed actions rather than assuming a pure gravity\-only noop sequence\. This makes the search\-time delay model match the deployment\-time reflex behavior more closely\.
### D\.2Committed\-action execution
During thek−1k\{\-\}1delay steps while the MCTS planner is running, the agent executes committed actions in the environment\. In all variants these are drawn from the argmax of the base model’s raw policy logits—a single inexpensive forward pass, with no additional MCTS compute\. The two variants differ only in what the*MCTS tree simulates*during its internal rollout of the delay window, not in what is executed in the environment\. We refer to these internal simulator variants as RT\_KStep \(noop\) and RT\_KT \(logit\)\.
RT\_KStep \(noop\)simulates the delay steps inside the tree using a fixed action \(gravity/noop\), which is fast but mismatches what the agent will actually do\.
RT\_KT \(logit\)simulates the delay steps inside the tree usingargmax\(πlogits\)\\mathrm\{argmax\}\(\\pi\_\{\\mathrm\{logits\}\}\), matching the committed actions the agent executes at deployment and making the search model more accurate at no extra environment cost\. All real\-time Tetris results reported in the main paper use the RT\_KT variant\.
Crucially, neither variant uses MCTS for the committed actions executed in the environment, so there is no compute\-parity violation: every optionk∈𝒦k\\in\\mathcal\{K\}uses the same number of policy forward passes for its delay steps\.
### D\.3Speed Hex
Speed Hex is built on the pgx Hex library \(11×\\times11 board\)\. Each player begins withT=300T=300clock ticks; time spent planning is deducted after each move\. MCTS search uses board\-only dynamics \(no clock deduction during simulation\), but the observation \(and therefore the network’s value estimates\) includes the remaining clock for both players\. The gating policy is a GRU network that maintains hidden state across moves within an episode, allowing it to track game progression and clock pressure jointly\.
#### Training notes\.
Early experiments fine\-tuned the gating policy against fixed\-budget opponents rather than via self\-play\. This regime exhibited plasticity loss, the network’s ability to adapt to new opponents degraded over successive fine\-tuning rounds, consistent with findings in continual RL settings\(muppidi2024fast;boopathy2024permutation\)\. We therefore switched to self\-play, which eliminated plasticity loss entirely\. In self\-play we found that increasing entropy regularization was the single most impactful training intervention, consistent with recent results in other self\-play board\-game domains\(forkel2025entropy\)\.
### D\.4Speed Go
Speed Go is built on the pgx Go library \(9×\\times9 board\)\. As in Speed Hex, each player has an explicit clock, planning consumes clock only after the move is played, and MCTS search itself uses board\-only dynamics while the network observes the remaining clock for both players\. We use the same recurrent gating architecture and evaluate the policy over the same shared five sampled clock budgets used in the main text\. The same training observations apply: self\-play with elevated entropy regularization was decisive, and no plasticity loss was observed once fine\-tuning against fixed opponents was abandoned\.
## Appendix ESimulation Option Calibration for Clock Environments
For committed\-action environments, the equal\-budget constraint \(32 sims/frame\) naturally produces meaningful option spacing\. For clock environments this constraint does not apply, so we select simulation options by examining two curves simultaneously: planning quality \(win rate or return\) vs\. simulation count, and inference latency vs\. simulation count\.
An option is*useful*if it is distinguishable from its neighbors on both dimensions: adding more simulations should noticeably improve play*and*noticeably increase latency\. Options that are close in both cost and quality give the gating policy nothing to learn; options that differ in cost but not quality waste planning budget\.
For Speed Hex we found that options\{2,8,32,128\}\\\{2,8,32,128\\\}satisfy this criterion across the relevant range of clock settings, and for Speed Go we found that\{16,32,64,96\}\\\{16,32,64,96\\\}yield similarly distinct quality\-latency tradeoffs\. Figure[9](https://arxiv.org/html/2606.26463#A5.F9)shows the full co\-scaling results across all five evaluation environments\.
#### Go and Hex calibration\.
To calibrate the clock\-environment options more directly, we examine how average expected score changes as inference budget increases\. The resulting curves are shown in Figure[10](https://arxiv.org/html/2606.26463#A5.F10)\. For Speed Go, the selected set\{16,32,64,96\}\\\{16,32,64,96\\\}spans the steep part of the quality curve: average expected score versus all other budgets rises from0\.3140\.314at 16 simulations to0\.5060\.506at 32,0\.7040\.704at 64, and0\.7820\.782at 96\. The largest gains occur at16→3216\{\\rightarrow\}32\(\+0\.193\+0\.193\) and32→6432\{\\rightarrow\}64\(\+0\.198\+0\.198\), while64→9664\{\\rightarrow\}96still provides a clear additional improvement \(\+0\.078\+0\.078\)\. Pairwise tournament results among the selected Go options also remain well separated: 64 simulations scores0\.7130\.713against 32 and0\.8680\.868against 16, while 96 scores0\.5950\.595against 64 and0\.7730\.773against 32\.
For Speed Hex, the selected set\{2,8,32,128\}\\\{2,8,32,128\\\}follows the same design principle and is now supported by the completed inference\-budget tournament\. Average expected score versus all other budgets rises from0\.3400\.340at 2 simulations to0\.4220\.422at 8,0\.4950\.495at 32, and0\.7100\.710at 128, with successive gains of\+0\.082\+0\.082,\+0\.073\+0\.073, and\+0\.214\+0\.214\. The pairwise Hex tournament shows the same separation: 8 simulations scores0\.5500\.550against 2, 32 scores0\.5640\.564against 8 and0\.6430\.643against 2, and 128 scores0\.6930\.693against 32 and0\.7270\.727against 8\. Thus, in both clocked domains, the selected options provide the gating policy with meaningfully separated quality\-latency tradeoffs\.
Figure 9:Full co\-scaling results across Pac\-Man, real\-time Tetris, Speed Hex, Speed Go, and Snake\. Blue denotes planning quality; dashed curves denote per\-step latency on H100, A100, and A40; shading shows±\\pmSE\.Figure 10:Simulation\-option calibration for the two clock environments\. Each point shows a budget’s average expected score against all other candidate budgets\. Red circles mark the options used in our clocked experiments: 16/32/64/96 simulations for Speed Go and 2/8/32/128 for Speed Hex\.
## Appendix FCross\-Evaluation and Base Model Selection
For the committed\-action environments, we trained four AlphaZero checkpoints at budgetsk∈𝒦k\\in\\mathcal\{K\}and evaluated all 16 \(train\-kk, eval\-kk\) combinations\. Here, train\-kkdenotes the action\-delay budget used during training, while eval\-kkdenotes the action\-delay budget used at test time\. For each train\-kkcheckpoint we additionally swept the MCTS simulation budget at evaluation time and report the best raw episode return achieved for each eval\-kk\. The evaluation sweeps were: real\-time Tetris with simulation counts\{32,64,96,128\}\\\{32,64,96,128\\\}, Pac\-Man with\{8,16,32,64,96,128\}\\\{8,16,32,64,96,128\\\}, and Snake with\{32,64,96,128\}\\\{32,64,96,128\\\}\.
Table 1:Cross\-evaluation of committed\-action base models\. Entries report the best raw episode return for each train\-kk/eval\-kkcombination after optimizing over the evaluation simulation\-budget sweep; the selected simulation count is shown after the ‘@’ symbol\. Bold marks the best train\-kkfor each eval\-kkwithin an environment\.The resulting pattern is environment\-dependent but clear\. For real\-time Tetris,k=1k\{=\}1is the strongest base model overall: although thek=3k\{=\}3checkpoint is best at eval\-k=1k\{=\}1, thek=1k\{=\}1checkpoint wins at eval\-k∈\{2,3,4\}k\\in\\\{2,3,4\\\}and has the highest mean return when averaging the best\-per\-eval\-kkscores\. For Pac\-Man,k=1k\{=\}1also wins overall, dominating eval\-k∈\{1,3,4\}k\\in\\\{1,3,4\\\}and achieving the highest average performance across eval budgets, whilek=4k\{=\}4is only slightly better at eval\-k=2k\{=\}2\. Snake differs qualitatively: thek=3k\{=\}3checkpoint is the strongest across all four eval budgets and therefore serves as the frozen base model in that domain\.
These results match the intuition that low\-delay training is often causally cleaner, since less delay\-induced mismatch enters the value targets\. In addition, committed actions during delay steps are drawn from the base model’s policy, so a stronger base policy also produces better intermediate states before the final MCTS action lands\. Accordingly, we use thek=1k\{=\}1checkpoint as the frozen base for Pac\-Man and real\-time Tetris, and thek=3k\{=\}3checkpoint for Snake\.
## Appendix GStrategy Profiles Across Environments
This section expands the main\-text analysis by showing the full learned budget\-allocation profiles for each environment \(Figure[11](https://arxiv.org/html/2606.26463#A7.F11)\)\.
Figure 11:Different environments induce different allocation profiles\. Pac\-Man becomes more reactive later in the episode, real\-time Tetris shifts toward deeper planning as the board densifies, Snake remains mostly reactive with occasional deeper planning in constrained states, and both clock games become less reactive when more time is available\. This figure complements Figure[6](https://arxiv.org/html/2606.26463#S6.F6)by showing the full cross\-environment strategy profiles rather than only the clock\-game comparison\.
## Appendix HStrict\-Timeout Speed Hex Control
The main Speed Hex setup allows the game to continue after the acting side exhausts its remaining clock budget, so the experiment measures*resource allocation*rather than a pure timeout\-avoidance game\. To verify that this distinction matters, we ran a strict\-timeout control in which overspending the selected search budget causes an immediate loss and there is no fallback behavior\.
Figure 12:Strict\-timeout Speed Hex is substantially easier than the main Speed Hex benchmark\. Using the unique\-game expected\-score metric, the learned gate remains above parity against every opponent on average, with means of 0\.952 against the policy\-only opponent, 0\.903 against fixed 2\-simulation play, 0\.904 against random allocation, and 0\.893 even against the fixed 128\-simulation opponent\. At large budgets, the more aggressive opponents become the weakest because they overspend clock and lose on time\.Figure[12](https://arxiv.org/html/2606.26463#A8.F12)shows the resulting per\-budget head\-to\-head curves under the unique\-game metric used in the main paper, evaluated on a wider clock sweep than the main benchmark\. The learned gate stays above parity against every opponent*on average*, and against five of the eight opponents it is above parity at*every*evaluated budget\. Even the hardest fixed\-budget baseline under this metric, always\-32, averages only 0\.690 against the gate, while policy\-only averages 0\.952 and random allocation averages 0\.904\. The only sub\-parity points occur for the more clock\-aggressive baselines at the largest budgets: fixed 128\-simulation play dips to 0\.490 atT=5700T\{=\}5700, midpeak dips to 0\.491 atT=4100T\{=\}4100, and greedy dips to 0\.487 atT=5700T\{=\}5700\. This pattern is the opposite of what we observe in the main benchmark\. Once timeout is an immediate terminal event, stronger search is no longer reliably beneficial; it often just increases the probability of burning too much clock\. In other words, the game becomes easier because the gating policy can win largely by learning a conservative time\-management rule rather than by solving the underlying Hex positions more effectively\.
## Appendix IImplementation Details
### I\.1Gating network architecture
Spatial branch\.The game grid is processed by a1×11\{\\times\}1convolution lifting to 64 channels, followed by three residual blocks with Layer Normalization and global average pooling to a 64\-dimensional spatial representation\. Layer Normalization is used because the gating policy operates on a single environment instance per rollout step, making batch statistics undefined\.
Time / clock embedding\.For committed\-action environments, the normalized step count is encoded as a 2\-vector and projected through a small MLP before being added to the spatial features\. For clock environments, the remaining clock fraction is similarly embedded and added\.
Fusion and heads\.The spatial representation \(64\-dim\), planner trunk features \(128\-dim\), and planner value \(1\-dim\) are concatenated and passed through dual MLP heads: a policy head producing logits overkk, and a value head producing a scalar baseline\.
Speed Hex gating network\.Uses a GRU backbone rather than a feedforward network, maintaining a recurrent hidden state across moves within each game to track temporal context\.
### I\.2Training hyperparameters
Table 2:PPO hyperparameters for committed\-action environments\.#### Clock\-environment training details\.
The clock\-game training runs use a separate self\-play stack built onpgx\. For Speed Go, the base AlphaZero planner is trained by pure self\-play on 1024 parallel games with 32 MCTS simulations per move, a 256\-move horizon, training batch size 4096, Adam with learning rate10−310^\{\-3\}, and 800 training iterations; evaluation against thepgxbaseline occurs every 5 iterations and checkpoints are saved every 200 iterations\. The recurrent Speed Go gate then loads the frozen 16\-simulation base checkpoint matched to the same seed and is trained in pure self\-play for 500 PPO updates on 256 parallel environments\. Each PPO update collects 32768 rollout steps \(128 per environment\), chunks them into GRU sequences of length 64, and uses 4 PPO epochs with batch size 512,γ=0\.99\\gamma\{=\}0\.99,λ=0\.95\\lambda\{=\}0\.95, PPO clipϵ=0\.2\\epsilon\{=\}0\.2, learning rate3×10−43\\times 10^\{\-4\}, value\-loss coefficient 0\.5, entropy coefficient 0\.01, and global gradient clipping at 1\.0\. The gate chooses among simulation options\{16,32,64,96\}\\\{16,32,64,96\\\}, domain\-randomizes episode clocks over\{100,300,600,900,1200,1800,2300,2900,3500,4100,4800,5700\}\\\{100,300,600,900,1200,1800,2300,2900,3500,4100,4800,5700\\\}, and saves checkpoints every 25 updates\. Although the codebase contains hooks for richer opponent schedules, the experiments reported here use pure self\-play throughout\.
### I\.3JAX implementation and batching strategy
The main computational challenge is that our system nests a meta\-policy on top of an AlphaZero\-style planner, so a naive implementation would repeatedly invoke MCTS inside an outer RL loop with substantial Python overhead\. Our implementation avoids this by pushing nearly all control flow into JAX primitives\.
Base planner training\.The frozen planners are trained with our own Gumbel\-AlphaZero stack built on top of the Jumanji environments, usingmctx\.gumbel\_muzero\_policy\(deepmind2020jax\)for search\. Within each planner step, the MCTS simulations are executed inside a compiledlax\.scan; leaf expansion and backup are batched across parallel environments withvmap; and the full self\-play/training pipeline is wrapped injit\. In the multi\-device setting, the leading batch dimension is sharded withpmap, so each accelerator runs the same compiled self\-play/update program on its local slice of environments\.
Gating\-policy training\.For the gating layer, we again compile whole rollouts rather than individual steps\. Each PPO rollout is onejit\-compiledlax\.scanover meta\-decisions, and advantage computation is another reverselax\.scan\. The recurrent Speed Hex gate uses the same pattern: the GRU unroll is expressed as a scan, while PPO operates on sequence chunks rather than on Python loops over timesteps\.
Evaluating all budget options in parallel\.The key batching trick is that, during gating training and evaluation, we compute the transition for every available budget option at each meta\-step and only then select the branch corresponding to the gate’s chosen action\. Concretely, fork∈\{1,2,3,4\}k\\in\\\{1,2,3,4\\\}we run all fourmeta\_step\_one\_kbranches, stack the resulting rewards, discounts, next states, and next observations, and use a batched selection operator to pick the chosen branch per environment\. This spends extra compute relative to evaluating only the selected option, but it preserves static shapes, keeps the whole computation batched, and avoids recompilation or Python\-side control flow\. In practice this systems tradeoff is what makes the meta\-MCTS stack fast enough to train routinely rather than as a one\-off systems effort\.
Practical effect\.These engineering choices —mctx\(deepmind2020jax\)for search,vmapfor environment parallelism,jitfor whole\-program compilation,pmapfor device parallelism, andlax\.scanfor both MCTS inner loops and PPO rollout loops — reduce wall\-clock training time dramatically\. In our internal runs, the end\-to\-end training workflow dropped from roughly two weeks in an earlier less\-batched pipeline to roughly six hours in the optimized JAX implementation\. The full implementation is available at[https://aneeshers\.github\.io/realtime\-rl/](https://aneeshers.github.io/realtime-rl/)\.
## Appendix JDetailed Two\-GPU Deployment Breakdown
We ran the full asynchronous deployment grid for all 45 settings: 3 environments \(real\-time Tetris, Pac\-Man, Snake\)×\\times3 GPUs \(H100, A100, A40\)×\\times5 frame rates \(8–12 FPS\)\. Figure[13](https://arxiv.org/html/2606.26463#A10.F13)gives the complete breakdown by environment, GPU, and FPS for return, deadline miss rate, and p95 slack to deadline\.
Several patterns are clear\. First, H100 is consistently reliable across the entire tested range\. In real\-time Tetris its miss rate stays at0\.014%0\.014\\%for all FPS values, with p95 slack remaining positive from\+291\.0\+291\.0ms at 8 FPS to\+123\.6\+123\.6ms at 12 FPS\. Snake is even easier: H100 and A100 stay at0\.001%0\.001\\%miss rate throughout, and A40 stays at or below0\.005%0\.005\\%\.
Second, the deployment boundary appears where slack collapses\. Real\-time Tetris on A40 is the clearest example: p95 slack decreases monotonically from\+162\.2\+162\.2ms \(8 FPS\) to−4\.1\-4\.1ms \(12 FPS\), and the miss rate jumps from0\.046%0\.046\\%to50\.669%50\.669\\%\. By contrast, real\-time Tetris on A100 remains usable even at 12 FPS, with p95 slack still positive at\+24\.5\+24\.5ms and miss rate only0\.188%0\.188\\%\.
Third, Pac\-Man is the most timing\-sensitive environment\. Even with positive p95 slack throughout, A100 and A40 exhibit non\-trivial miss rates across all FPS values\. At 12 FPS, Pac\-Man reaches30\.369%30\.369\\%misses on A100 and13\.073%13\.073\\%on A40, whereas H100 stays much lower at3\.927%3\.927\\%\. This indicates that for Pac\-Man, the far latency tail beyond the p95 threshold matters operationally even when the central mass remains in\-budget\.
Finally, return remains comparatively stable over a broad deployment range despite these timing differences\. Real\-time Tetris returns remain in the4141–5050range on H100/A100 and in the high\-30s to low\-40s on A40 except at the most constrained settings\. Pac\-Man degrades more gradually with hardware and FPS than its miss\-rate curve might suggest, while Snake is nearly flat across all tested settings\. Together these results support the main\-text conclusion: the committed\-action training protocol transfers robustly to asynchronous hardware deployment, and the observed failures emerge in predictable deadline\-limited regimes rather than as qualitatively new behaviors\.
Figure 13:Detailed two\-GPU deployment breakdown across real\-time Tetris, Pac\-Man, and Snake\. Top: return vs\. FPS\. Middle: deadline miss rate\. Bottom: p95 slack to thek=4k\{=\}4deadline\. Snake remains robust, real\-time Tetris fails only near the tightest A40 regime, and Pac\-Man is the most latency\-sensitive\.
## Appendix KAdditional Results and Ablations
We also ran input\-ablation evaluations for the committed\-action environments, in which the gating policy received zeroed observation features, zeroed time features, zeroed planner trunk features, or a zeroed planner value input while the environment dynamics and MCTS planner remained unchanged\. Across environments, ablating planner trunk features was consistently disruptive, while zeroing the scalar value input had only minor effect; in real\-time Tetris, zeroing either the board observation or planner trunk substantially reduced return, while in Pac\-Man and Snake the learned behavior was more robust to removing any single input channel\.
Table 3:Input\-ablation results for the committed\-action gating policies\. Entries report mean episode return under each ablation condition while the environment and MCTS planner remain unchanged\.
## Appendix LReal\-Time RL, SMDPs, and Committed\-Action Training
This appendix expands Section[3\.2](https://arxiv.org/html/2606.26463#S3.SS2)by focusing on the distinction between the discrete\-time SMDP used in training and the continuous\-time SMDP approximated in deployment\.
### The RTRL framework and its single\-step restriction
Section[3\.2](https://arxiv.org/html/2606.26463#S3.SS2)already gives the RTMDP equation and explains why Ramstedt et al\.’s fixed one\-step delay formalism is not the natural representation for our variable\-budget committed\-action setting\. The additional point needed here is that, once delay is represented as budgeted options, training and deployment instantiate two closely related SMDPs with different time models\.
### Training is a discrete SMDP; deployment is a genuine SMDP
#### Simulation\.
In training the environment is synchronous: MCTS runs in zero simulated time\. As described in Section[3\.2](https://arxiv.org/html/2606.26463#S3.SS2), each meta\-step proceeds by choosingkk, executingk−1k\{\-\}1committed argmax steps, and then applying the MCTS result\. From the meta\-policy’s perspective this is a*Semi\-Markov Decision Process*\(puterman1994mdp\)with*deterministic*holding timeτ=k\\tau=k, so successive meta\-decisions are spacedkkenvironment steps apart and the appropriate discount isγk\\gamma^\{k\}\. This is the discrete\-time SMDP solved during training\.
#### Deployment\.
In real\-time deployment the holding time between meta\-decisions equalsk×Tframek\\times T\_\{\\mathrm\{frame\}\}: MCTS runs concurrently on GPU 1 while the environment executes thekkreflex frames on GPU 0, so the two durations*overlap*rather than sum\. Givenkk, the elapsed wall\-clock time isk×Tframek\\times T\_\{\\mathrm\{frame\}\}regardless of MCTS latency, provided search completes within the deadline\. The holding time is stochastic only becausekkis chosen stochastically byπgate\\pi\_\{\\mathrm\{gate\}\}; this is a true SMDP in the continuous\-time sense\(puterman1994mdp\)\. Our measurements confirm that MCTS latency stays well within the frame budget222Note the distinction between*frame budget*and*planning budget*kk\. The frame budget is the time between frames at the chosen FPS \(e\.g\., 111,ms at 9,FPS\)\(H100: mean122122ms, p95205205ms; A40: mean197197ms, p95338338ms vs\. ak=4k\{=\}4frame budget of444444ms at 9 FPS\), so the deterministic simulation discountγk\\gamma^\{k\}matches the deployment discount closely, which is why sim\-to\-real transfer succeeds without retraining\.
### How committed\-action training avoids state augmentation
The central reason thekk\-step RTMDP state augmentation is unnecessary in our framework is that*the committed action at each intermediate step is computed from the current state, not from a statekksteps ago*\. Specifically, the committed action at framettisat=argmaxaπθ\(a∣st\)a\_\{t\}=\\arg\\max\_\{a\}\\pi\_\{\\theta\}\(a\\mid s\_\{t\}\), evaluated fresh in∼\\sim2 ms—well within the111111ms frame budget\. The MCTS search is initiated ats0s\_\{0\}\(start of meta\-step\), but the tree explicitly rolls forward the samek−1k\{\-\}1committed steps that will be executed in the environment and selects its terminal action for the resulting future landing state\. This delay is therefore explicitly modeled in training: thek−1k\{\-\}1committed steps actually execute in the environment before the MCTS action lands, so the policy learns to handle thekk\-step gap without any state augmentation\.
In the RTRL vocabulary, the committed\-action component operates in the*turn\-based*regime \(computation time≪\\llframe time, so the environment effectively pauses during action selection\), while the MCTS component operates in akk\-step delay regime whose effects are absorbed by the training curriculum rather than by augmenting the state space\. The key property that makes this possible—and that the standard RTMDP does not exploit—is that the intermediate policyπreflex\\pi\_\{\\mathrm\{reflex\}\}is a*deterministic function of the current state*, so its actions do not need to be carried in the state to recover the Markov property\.Similar Articles
Not only where, But when: Temporal Scheduling for RLVR
Introduces temporal scheduling for credit allocation criteria in reinforcement learning with verifiable rewards, showing that scheduling when learning signals are applied improves policy evolution and stability.
Plan Before You Trade: Inference-Time Optimization for RL Trading Agents
FPILOT is a plugin inference-time optimization framework for RL trading agents that leverages price forecasts without retraining, yielding consistent improvements in returns and risk-adjusted metrics on the TradeMaster DJ30 benchmark.
Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning
Introduces Thoughts-as-Planning, a framework that models chain-of-thought optimization as sequential decision-making using latent world models and reinforcement learning, outperforming existing methods in efficiency and generalization.
Performance-Driven Environment Abstraction with Multi-Timescale Learning
This paper proposes a performance-driven state abstraction method for reinforcement learning that directly optimizes decision quality, using a multi-timescale framework to jointly adapt the policy and a tree-structured abstraction. The algorithm refines or aggregates state space based on Q-value discrepancies, achieving better sample efficiency and faster replanning than baselines.
@blc_16: If you want to understand why RL struggles with long-horizon agent tasks, this is a good explanation. The core issue is…
The post explains why Reinforcement Learning struggles with long-horizon tasks due to sparse rewards and highlights GEPA, a method that uses trajectory-level textual reflection to preserve richer feedback signals for optimization.