# ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network
Source: [https://arxiv.org/html/2605.11009](https://arxiv.org/html/2605.11009)
Qian Chen, Junqiao Zhao, Hongtu Zhou, Hang Yu, Yanping Zhao, Chen Ye, Guang Chen (Tongji University)
###### Abstract
Long-horizon, sparse-reward tasks pose a fundamental challenge for reinforcement learning, since single-step TD learning suffers from bootstrapping error accumulation across successive Bellman updates. Actor-critic methods with action chunking address this by operating over temporally extended actions, which reduce the effective horizon, enable fast value backups, and support temporally consistent exploration. However, existing methods rely on a fixed chunk size and therefore cannot adaptively balance reactivity against temporal consistency. A large fixed chunk size reduces responsiveness to new observations, while a small one produces incoherent motions, forcing task-specific tuning of the chunk size. To address this limitation, we propose Adaptive Chunk Size Actor-Critic (ACSAC). ACSAC leverages a causal Transformer critic to evaluate expected returns for action chunks of different sizes. At each chunk boundary, it adaptively selects the chunk size that maximizes the expected return, supporting flexible, state-dependent chunk sizes without task-specific tuning. We prove that the ACSAC Bellman operator is a contraction whose unique fixed point is the action-value function of the adaptive policy. Experiments on OGBench demonstrate that ACSAC achieves state-of-the-art performance on long-horizon, sparse-reward manipulation tasks across both offline RL and offline-to-online RL settings.
## 1 Introduction
Offline reinforcement learning (RL) trains policies from previously collected datasets without environment interaction [LevineSurvey], and the offline-to-online setting further refines the offline-pretrained policy through additional online interaction [AWAC]. However, on long-horizon, sparse-reward tasks, single-step temporal difference (TD) learning suffers from bootstrapping error accumulation, since each update regresses toward the Q-network's own next-state estimate and small errors compound across many recursive Bellman updates [SHARSA]. A common remedy is to use multi-step returns, which accelerate value propagation by shifting the regression target further into the future, but they introduce an off-policy bias because the intermediate rewards are collected under the behavior policy rather than the current policy [TOP-ERL, QC]. Action chunking has emerged as a recent and effective response to these limitations.
Action chunking trains the critic and the policy on action sequences rather than single actions, which enables multi-step value backups without incurring off-policy bias, since the critic conditions on the full sequence [QC]. Beyond these critic-side benefits, predicting full action sequences also produces temporally coherent behaviors that capture non-Markovian patterns in the data and improve exploration in sparse-reward settings [ACT, QC]. However, existing action-chunking actor-critic methods rely on a fixed chunk size, manually tuned per task and shared across all states [QC, DQC, MAC, DEAS]. Yet the optimal balance between reactivity and temporal consistency is itself state-dependent [DQC, AAC]. Stable phases admit long open-loop chunks, whereas sensitive states demand frequent replanning. Over-committing to a long chunk in such states can drive the agent off the goal-reaching path (Figure [1](https://arxiv.org/html/2605.11009#S1.F1)) [DQC, AAC].
Figure 1: Motivation for adaptive action chunk size. (A) Single-step execution preserves fine-grained reactivity by replanning at every step, but suffers from slow value backups and produces incoherent motions. (B) A fixed chunk size improves motion coherence and accelerates value propagation, but its open-loop execution reduces reactivity within the chunk. In sensitive states such as turns, this over-commitment to a long chunk can drive the agent off the goal-reaching path, lowering the return. (C) ACSAC adaptively selects the execution horizon per state via return-aware chunk size selection, executing longer chunks on the straight route and shorter chunks near the turn, achieving a balance between reactivity and coherence.

To address this limitation, we propose Adaptive Chunk Size Actor-Critic (ACSAC), which adaptively selects the chunk size to maximize the critic's value estimate. At each replanning state, ACSAC samples multiple candidate action chunks of length $H$ from an expressive flow BC policy. A cross-horizon calibrated causal Transformer critic evaluates the Q-value of every *prefix* (i.e., the first $h$ actions of a chunk, $h=1,\ldots,H$) of each candidate. ACSAC then applies rejection sampling jointly over the candidate index and the prefix length, executes the highest-value prefix, and replans at the next chunk boundary. ACSAC thus retains the multi-step value backups and coherent exploration of action chunking, while the chunk size becomes state-dependent rather than fixed, adaptively balancing reactivity and temporal consistency at each state.
Our contributions can be summarized as follows: 1) We propose ACSAC, an action-chunking actor-critic that adaptively selects the chunk size per state, using a causal Transformer critic and joint rejection sampling over candidate index and prefix length. 2) We prove that ACSAC's prefix-conditioned Q-values are cross-horizon comparable, and that its Bellman operator is a contraction whose unique fixed point is the action-value function of the adaptive policy. 3) On long-horizon, sparse-reward manipulation tasks from OGBench [OGBench], ACSAC outperforms single-step, multi-step, and fixed chunk size baselines in both offline and offline-to-online settings.
## 2 Related Work
**Offline RL and offline-to-online RL.** Offline RL aims to learn a policy from a fixed dataset without environment interaction [LevineSurvey]. The main challenge is the distributional shift between the behavior policy and the learned policy, which can cause value overestimation and suboptimal performance [TD3+BC, ReBRAC]. Recently, expressive generative policies based on diffusion and flow matching [DDPM, FM] have been widely adopted for their expressivity over Gaussian policies. Common policy extraction strategies for these generative policies include reparameterized gradients [DQL, CAC, FQL], weighted regression [QGPO, EDP, QVPO, QIPO], and rejection sampling [SfBC, IDQL, AlignIQL]. In the offline-to-online setting, the offline-pretrained policy is further fine-tuned with online interactions [AWAC], with techniques such as balanced sampling [Off2On], high update-to-data ratios [RLPD], value calibration [CalQL], and more [Hybrid, ACA, EDIS, WSRL]. Our method uses the same algorithm for both offline and online training, simply adding online transitions to the offline dataset and applying none of the above specialized techniques.
**Action chunking.** Action chunking originated in imitation learning, where a policy predicts and executes a sequence of actions in an open-loop manner, improving robustness and capturing non-Markovian behavior [ACT, DP]. Recent RL methods have brought action chunking into actor-critic frameworks, where the critic evaluates whole action chunks and enables multi-step backups without off-policy bias. In the online setting with expert demonstrations, CQN-AS [CQN-AS] learns a multi-level factorized critic on action chunks, and AC3 [AC3] builds on a DDPG-style framework to predict continuous action chunks. In the offline or offline-to-online setting, Q-chunking [QC] runs RL at an action chunk level with a flow BC policy and rejection sampling. DQC [DQC] decouples the policy chunk size from the critic chunk size, with the policy predicting a shorter action chunk while retaining the value learning benefits of the chunked critic. MAC [MAC] combines an action-chunk dynamics model with rejection sampling from an expressive flow BC policy. DEAS [DEAS] leverages action sequences for training critics with detached value learning and a classification loss. CGQ [CGQ] regularizes a single-step critic toward a chunked critic. All of the above RL methods rely on a fixed chunk size across states and tasks. However, the optimal chunk size may vary by state, and any fixed choice forces a single trade-off between reactivity and temporal consistency. Our method brings adaptive, state-dependent chunk size selection to RL, driven by the critic's value estimate.
**Transformer-based Q-networks.** Transformers have become strong backbones for the Q-network in RL. Q-Transformer [QT] converts Q-function estimation into a discrete token sequence modeling problem, with per-dimension discretization treating each action dimension as a separate time step for autoregressive Q-value prediction. TQL [TQL] scales Transformer Q-learning through per-layer control of attention entropy that prevents attention collapse. While Q-Transformer and TQL apply Transformers to single-action Q-networks, recent work instead trains Transformer critics on action chunks for direct multi-step value evaluation. TOP-ERL [TOP-ERL] introduces a causal Transformer critic in episodic RL, with the policy outputting full trajectories via movement primitives. T-SAC [T-SAC] adapts a similar causal Transformer critic to step-based RL, conditioning the critic on short trajectory segments while keeping the actor single-step. SEAR [SEAR] combines a causal Transformer critic with multi-horizon targets and random replanning during data collection. CO-RFT [CO-RFT] applies a causal Transformer critic to fine-tune vision-language-action models with offline RL. None of these methods adapt the chunk size to the state. Our method also adopts a causal Transformer critic, but uniquely exploits its prefix-conditioned outputs at all prefix lengths to support state-dependent chunk size selection during both training and inference.
## 3 Preliminaries
**Problem formulation.** We consider an infinite-horizon Markov decision process (MDP) $\mathcal{M}=(\mathcal{S},\mathcal{A},T,r,\rho,\gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}\subseteq\mathbb{R}^{d}$ is the continuous action space of dimension $d$, $T(s'\mid s,a):\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the transition dynamics distribution, $r(s,a):\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the reward function, $\rho\in\Delta(\mathcal{S})$ is the initial state distribution, and $\gamma\in[0,1)$ is the discount factor. Here $\Delta(\mathcal{X})$ denotes the set of probability distributions over a space $\mathcal{X}$. We assume access to a prior offline dataset $\mathcal{D}$ consisting of transition rollouts $\{(s,a,s',r)\}$ collected from $\mathcal{M}$. The goal is to find a policy $\pi(a\mid s):\mathcal{S}\to\Delta(\mathcal{A})$ that maximizes the expected discounted return $J(\pi):=\mathbb{E}_{s_{t+1}\sim T(s_t,a_t),\,a_t\sim\pi(\cdot\mid s_t)}\big[\sum_{t=0}^{\infty}\gamma^{t}r(s_t,a_t)\big]$. In the *offline* setting, the policy is learned entirely from the fixed dataset $\mathcal{D}$ without environment interactions; in the *offline-to-online* setting, the offline-pretrained policy is further fine-tuned with online interactions.
**Chunk-based reinforcement learning.** Standard TD-based methods learn a state-action value function $Q_\phi(s,a)$ by minimizing the single-step Bellman error:

$$\mathcal{L}^{\text{TD}}(\phi)=\mathbb{E}_{(s_t,a_t,r_t,s_{t+1})\sim\mathcal{D},\,a_{t+1}\sim\pi(\cdot\mid s_{t+1})}\Big[\big(Q_\phi(s_t,a_t)-r_t-\gamma Q_{\bar\phi}(s_{t+1},a_{t+1})\big)^{2}\Big], \tag{1}$$

where $Q_{\bar\phi}$ is a target network with delayed parameters $\bar\phi$. Each single-step backup propagates value only one step backward, slowing learning in long-horizon tasks. A common strategy is the multi-step return, which replaces the single-step target with $\sum_{\tau=0}^{n-1}\gamma^{\tau}r_{t+\tau}+\gamma^{n}Q_{\bar\phi}(s_{t+n},a_{t+n})$ for some horizon $n\geq 1$ and allows for an $n$-fold speed-up in value propagation. However, the multi-step return introduces off-policy bias, since the discounted reward sum from the replay buffer may not reflect the expected rewards under the current policy when the intermediate actions $a_{t+1},\ldots,a_{t+n-1}$ are chosen by a different policy [TOP-ERL, QC].
Action chunking attains the value-propagation speedup of multi-step returns without their off-policy bias by extending RL to action sequences. An action chunk of length $H$ starting at time $t$ is $a_{t:t+H}:=(a_t,a_{t+1},\ldots,a_{t+H-1})\in\mathcal{A}^{H}$, and the corresponding $H$-step discounted reward is $r_t^{H}:=\sum_{\tau=0}^{H-1}\gamma^{\tau}r_{t+\tau}$. The chunked critic $Q_\phi(s_t,a_{t:t+H})$ is trained with:

$$\mathcal{L}^{\text{chunk}}(\phi)=\mathbb{E}_{(s_t,a_{t:t+H},r_t^{H},s_{t+H})\sim\mathcal{D}}\Big[\big(Q_\phi(s_t,a_{t:t+H})-r_t^{H}-\gamma^{H}Q_{\bar\phi}(s_{t+H},a_{t+H:t+2H})\big)^{2}\Big], \tag{2}$$

where $a_{t+H:t+2H}\sim\pi(\cdot\mid s_{t+H})$. Crucially, unlike the multi-step return where the single-action critic $Q(s_t,a_t)$ is backed up with rewards generated by potentially off-policy actions, the chunked critic $Q(s_t,a_{t:t+H})$ conditions on the exact action sequence used to obtain the multi-step rewards $r_t^{H}$, eliminating the off-policy bias [QC].
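Concretely, the chunked target in Equation (2) is a discounted reward sum plus a $\gamma^{H}$-weighted bootstrap. The following minimal sketch shows one way to compute it; the tensor shapes, the `gamma` default, and the function name are illustrative assumptions, not the authors' implementation.

```python
import torch

def chunked_td_target(rewards, next_q, gamma=0.99):
    """Sketch of the chunked TD target in Equation (2).

    rewards: (B, H) per-step rewards r_{t..t+H-1} for each sampled chunk.
    next_q:  (B,)   target-critic value Q(s_{t+H}, a_{t+H:t+2H}).
    Returns  (B,)   r_t^H + gamma^H * next_q, with
             r_t^H = sum_{tau<H} gamma^tau * r_{t+tau}.
    """
    B, H = rewards.shape
    discounts = gamma ** torch.arange(H, dtype=rewards.dtype)  # (H,)
    r_H = (rewards * discounts).sum(dim=1)                     # H-step discounted sum
    return r_H + (gamma ** H) * next_q

# The critic regresses Q_phi(s_t, a_{t:t+H}) onto this target with an MSE loss;
# because it conditions on the exact executed chunk, the intermediate rewards
# introduce no off-policy bias.
```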
**Flow policy.** Flow matching [FM, ReFlow, InterFlow] is a generative modeling technique that trains a velocity field to transform a noise distribution into a target data distribution. Given a target distribution $p(x)\in\Delta(\mathbb{R}^{d})$, flow matching fits a time-dependent velocity field $v_\theta(u,x):[0,1]\times\mathbb{R}^{d}\to\mathbb{R}^{d}$ whose corresponding flow $\psi_\theta(u,x)$, defined by the ODE $\frac{d}{du}\psi_\theta(u,x)=v_\theta(u,\psi_\theta(u,x))$, transforms $\mathcal{N}(0,I_d)$ at $u=0$ into $p(x)$ at $u=1$, by minimizing:

$$\mathcal{L}(\theta)=\mathbb{E}_{x_0\sim\mathcal{N}(0,I_d),\,x_1\sim p(x),\,u\sim\mathrm{Unif}([0,1])}\left[\|v_\theta(u,x_u)-(x_1-x_0)\|_2^2\right], \tag{3}$$

where $x_u:=(1-u)x_0+ux_1$ is the linear interpolation between $x_0$ and $x_1$. To use flow matching for policy learning, we train a state-conditioned velocity field $v_\theta(u,s,a_z):[0,1]\times\mathcal{S}\times\mathbb{R}^{d}\to\mathbb{R}^{d}$ with the behavior cloning objective:

$$\mathcal{L}(\theta)=\mathbb{E}_{s,a\sim\mathcal{D},\,z\sim\mathcal{N}(0,I_d),\,u\sim\mathrm{Unif}([0,1]),\,a_z=(1-u)z+ua}\left[\|v_\theta(u,s,a_z)-(a-z)\|_2^2\right]. \tag{4}$$

We denote the ODE solution at $u=1$ starting from noise $z$ as $\pi_\theta(s,z):=\psi_\theta(1,s,z)$, which maps a noise vector $z\sim\mathcal{N}(0,I_d)$ to an action $a=\pi_\theta(s,z)$. Note that $\pi_\theta(s,z)$ is a deterministic function of $z$, but the stochasticity of $z$ induces a conditional behavior distribution $\pi_\theta(\cdot\mid s)$. Compared to Gaussian policies, flow policies can model complex, multi-modal action distributions, making them particularly suited for offline RL where datasets often contain diverse behavior patterns [FQL, MAC].
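The behavior cloning objective in Equation (4) is short to implement. Below is a minimal sketch, assuming `v_theta` is a callable `(u, s, a_u) -> velocity`; all names and shapes are illustrative rather than taken from the paper's code.

```python
import torch

def flow_bc_loss(v_theta, states, actions):
    """Sketch of the flow-matching BC loss (Equation 4).

    v_theta: state-conditioned velocity field, (u, s, a_u) -> (B, d).
    states:  (B, s_dim) states sampled from the dataset.
    actions: (B, d)     the corresponding dataset actions.
    """
    z = torch.randn_like(actions)          # x_0 = z ~ N(0, I_d)
    u = torch.rand(actions.shape[0], 1)    # u ~ Unif([0, 1])
    a_u = (1.0 - u) * z + u * actions      # linear interpolation point
    target = actions - z                   # straight-line velocity (a - z)
    return ((v_theta(u, states, a_u) - target) ** 2).sum(dim=-1).mean()
```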
## 4 Method
ACSAC is an action-chunking actor-critic that adaptively selects the chunk size at each replanning state, rather than fixing one across all states as in prior methods. Throughout this section, $H$ is the *maximum chunk size*, and for $h\in[H]:=\{1,\ldots,H\}$ the *length-$h$ prefix* of a chunk $a_{t:t+H}=(a_t,\ldots,a_{t+H-1})$ is its first $h$ actions $a_{t:t+h}=(a_t,\ldots,a_{t+h-1})$. At each replanning state $s_t$, ACSAC samples $N$ length-$H$ candidate chunks $\{a^{(n)}_{t:t+H}\}_{n\in[N]}$ from a flow BC policy $\pi_\theta$, evaluates all $NH$ prefix-conditioned values $Q_\phi(s_t,a^{(n)}_{t:t+h})$ with a causal Transformer critic, and executes the prefix $a^{(n^\star)}_{t:t+h^\star}$ that achieves the joint argmax (Figure [2](https://arxiv.org/html/2605.11009#S4.F2)); the same extraction rule provides the bootstrap action prefix in the multi-step TD loss used to train $Q_\phi$. We describe the critic architecture in Section [4.1](https://arxiv.org/html/2605.11009#S4.SS1), the multi-step TD objective in Section [4.2](https://arxiv.org/html/2605.11009#S4.SS2), and the flow BC policy with joint argmax extraction in Section [4.3](https://arxiv.org/html/2605.11009#S4.SS3).
Figure 2: Adaptive policy extraction in ACSAC. At replanning state $s_t$, ACSAC samples $N$ length-$H$ chunks $\{a^{(n)}_{t:t+H}\}_{n\in[N]}$ from the flow BC policy $\pi_\theta(s_t,z)$ with $z^{(n)}\sim\mathcal{N}(0,I_{Hd})$, evaluates all prefix-conditioned values $Q_\phi(s_t,a^{(n)}_{t:t+h})$ for $(n,h)\in[N]\times[H]$, and executes $a^{(n^\star)}_{t:t+h^\star}$ where $(n^\star,h^\star)=\arg\max_{n\in[N],\,h\in[H]}Q_\phi(s_t,a^{(n)}_{t:t+h})$. The same extraction rule is used for bootstrap action sampling and deployment.

### 4.1 Causal Transformer Critic Architecture
ACSAC's joint argmax requires the critic to evaluate action chunks of different lengths $a_{t:t+h}$ ($h\in[H]$) from a single state $s_t$ and to produce values that are comparable across $h$. An MLP critic on action chunks consumes fixed-length inputs, so reusing it on shorter sub-prefixes incurs causal leakage [T-SAC]. A causal Transformer instead ingests $(s_t,a_{t:t+H})$ as a token sequence and outputs the $H$ prefix-conditioned values $\{Q_\phi(s_t,a_{t:t+h})\}_{h\in[H]}$ jointly, with a causal attention mask that restricts position $i$ to attend only to positions $j\leq i$.

This design has three consequences. First, $Q_\phi(s_t,a_{t:t+h})$ depends only on the prefix $a_{t:t+h}$ and not on future actions $a_{t+h:t+H}$, so each value is a valid estimate for executing exactly $h$ actions and the MLP-style causal leakage is eliminated [T-SAC]. Second, the shared backbone produces sequence-aware value estimates that capture the temporal structure within the chunk and enable fine-grained credit assignment across positions [TOP-ERL, T-SAC]. Third, because all $H$ values share a backbone and are jointly trained against per-horizon targets (Section [4.2](https://arxiv.org/html/2605.11009#S4.SS2)), they live on a common return scale, making the joint argmax in $\pi_\star$ meaningful across different prefix lengths. We give a formal argument for prefix consistency and cross-horizon comparability in Appendix [G](https://arxiv.org/html/2605.11009#A7).
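To make the architecture concrete, here is a minimal sketch of a prefix-consistent critic in PyTorch, using the pre-LayerNorm sublayers and the $n_{\mathrm{layer}}=2$, $n_{\mathrm{head}}=8$, $d_{\mathrm{model}}=128$ sizes reported in the appendix; the class name, embedding choices, and everything else are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class CausalChunkCritic(nn.Module):
    """Sketch: causal Transformer mapping (s_t, a_{t:t+H}) to H prefix Q-values."""

    def __init__(self, s_dim, a_dim, H, d_model=128, n_layer=2, n_head=8):
        super().__init__()
        self.state_embed = nn.Linear(s_dim, d_model)
        self.action_embed = nn.Linear(a_dim, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, H + 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_head, dim_feedforward=4 * d_model,
            norm_first=True, batch_first=True)   # pre-LayerNorm, as in the paper
        self.backbone = nn.TransformerEncoder(layer, n_layer)
        self.q_head = nn.Linear(d_model, 1)      # one scalar value per position

    def forward(self, state, actions):
        # state: (B, s_dim); actions: (B, H, a_dim)
        B, H, _ = actions.shape
        tokens = torch.cat(
            [self.state_embed(state).unsqueeze(1), self.action_embed(actions)], dim=1)
        tokens = tokens + self.pos_embed[:, : H + 1]
        # Additive causal mask: -inf strictly above the diagonal, so position i
        # attends only to positions j <= i and the h-th output cannot see
        # actions beyond the length-h prefix.
        mask = torch.triu(torch.full((H + 1, H + 1), float("-inf")), diagonal=1)
        hidden = self.backbone(tokens, mask=mask)
        # Drop the state token; the h-th action position yields Q(s_t, a_{t:t+h}).
        return self.q_head(hidden[:, 1:]).squeeze(-1)   # (B, H)
```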
### 4.2 Multi-Step TD Objective
We train the critic with a multi-step TD loss at every prefix length. For each $h\in[H]$, define the $h$-step return target

$$G_h(s_t,a_{t:t+h}):=\sum_{\tau=0}^{h-1}\gamma^{\tau}r_{t+\tau}+\gamma^{h}\,Q_{\bar\phi}\big(s_{t+h},\,\pi_\star(s_{t+h})\big), \tag{5}$$

which sums $h$ on-policy rewards from the dataset and a bootstrap value from the target critic $Q_{\bar\phi}$ at the next state $s_{t+h}$. The critic loss averages the squared error against $G_h$ across $h\in[H]$ in expectation over chunked transitions from the offline dataset or replay buffer:

$$\mathcal{L}(\phi)=\mathbb{E}_{(s_{t:t+H+1},\,a_{t:t+H},\,r_{t:t+H})\sim\mathcal{D}}\left[\frac{1}{H}\sum_{h=1}^{H}\big(Q_\phi(s_t,a_{t:t+h})-G_h(s_t,a_{t:t+h})\big)^{2}\right]. \tag{6}$$

Each $G_h$ propagates the bootstrap value at $s_{t+h}$ back $h$ steps to $Q_\phi(s_t,a_{t:t+h})$ in a single critic update [QC]. Averaging gradients across prefix lengths further reduces gradient variance while preserving sparse reward signals, as formalized in Appendix [G.5](https://arxiv.org/html/2605.11009#A7.SS5) [T-SAC, TOP-ERL].

For the bootstrap value $Q_{\bar\phi}(s_{t+h},\pi_\star(s_{t+h}))$, ACSAC samples $N$ candidate chunks from $\pi_\theta$ at the next state $s_{t+h}$. The selected prefix $\pi_\star(s_{t+h})$ maximizes $Q_{\bar\phi}(s_{t+h},a^{(n)}_{t+h:t+h+h'})$ over $(n,h')\in[N]\times[H]$. This generalizes EMaQ's expected-max Q operator [EMaQ], replacing the $\max$ over $N$ candidate actions with a joint $\max$ over $NH$ candidate prefixes. Appendix [G](https://arxiv.org/html/2605.11009#A7) shows that the resulting Bellman backup is a $\gamma$-contraction whose unique fixed point is the action-value function of $\pi_\star$.
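As a minimal sketch, the per-horizon targets $G_1,\ldots,G_H$ and the averaged loss of Equation (6) could be assembled as below, assuming the joint-max bootstrap values have already been computed with the target critic; names and shapes are illustrative.

```python
import torch

def multistep_td_targets(rewards, bootstrap_q, gamma=0.99):
    """Sketch of the per-horizon targets G_h in Equation (5).

    rewards:     (B, H) rewards r_{t..t+H-1} along the stored chunk.
    bootstrap_q: (B, H) for each h, the joint-max bootstrap value
                 max_{n,h'} Q_target(s_{t+h}, a^{(n)}_{t+h:t+h+h'}).
    Returns      (B, H) targets G_1, ..., G_H.
    """
    B, H = rewards.shape
    discounts = gamma ** torch.arange(H, dtype=rewards.dtype)     # (H,)
    # Cumulative discounted sums: entry h-1 holds sum_{tau<h} gamma^tau r_{t+tau}.
    reward_sums = torch.cumsum(rewards * discounts, dim=1)        # (B, H)
    boot_weights = gamma ** torch.arange(1, H + 1, dtype=rewards.dtype)
    return reward_sums + boot_weights * bootstrap_q

def critic_loss(q_values, targets):
    """Equation (6): mean squared error averaged over all H prefix lengths."""
    return ((q_values - targets.detach()) ** 2).mean()
```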
### 4.3 Adaptive Policy Extraction
We use a flow BC policy $\pi_\theta(s,z)$ to generate candidate action chunks of length $H$ from the offline dataset $\mathcal{D}$. $\pi_\theta$ is parameterized by a state-conditioned velocity field $v_\theta(u,s_t,a_z):[0,1]\times\mathcal{S}\times\mathbb{R}^{Hd}\to\mathbb{R}^{Hd}$ trained with the flow-matching loss

$$\mathcal{L}(\theta)=\mathbb{E}_{z\sim\mathcal{N}(0,I_{Hd}),\,(s_t,a_{t:t+H})\sim\mathcal{D},\,u\sim\mathrm{Unif}([0,1]),\,a_z=(1-u)z+ua_{t:t+H}}\left[\|v_\theta(u,s_t,a_z)-(a_{t:t+H}-z)\|_2^2\right]. \tag{7}$$

Fitting full-length action chunks lets $\pi_\theta$ capture non-Markovian behavior patterns in the offline data, supporting temporally coherent exploration in long-horizon tasks [QC].
Given $\pi_\theta$ and the trained critic $Q_\phi$, we extract ACSAC's policy $\pi_\star$ via rejection sampling, which implicitly enforces a behavior constraint with a closed-form bound on the KL divergence from $\pi_\theta$ [QC] and is generally robust to hyperparameters, unlike alternatives that require tuning a behavior regularization coefficient [MAC]. Prior action-chunking methods [QC, DQC, MAC] fix the chunk size $h$ and apply rejection sampling: with $n^\star=\arg\max_{n\in[N]}Q_\phi(s_t,a^{(n)}_{t:t+h})$, the policy outputs $\pi(s_t)=a^{(n^\star)}_{t:t+h}$. ACSAC extends this single-axis rejection sampling to a joint search over $NH$ candidates indexed by $(n,h)$:

$$(n^\star,h^\star)=\arg\max_{n\in[N],\,h\in[H]}Q_\phi\big(s_t,a^{(n)}_{t:t+h}\big),\qquad \pi_\star(s_t)=a^{(n^\star)}_{t:t+h^\star}, \tag{8}$$

where $\{a^{(n)}_{t:t+H}\}_{n\in[N]}$ are obtained by drawing $N$ noise vectors $z^{(n)}\sim\mathcal{N}(0,I_{Hd})$ and mapping them through the flow BC policy: $a^{(n)}_{t:t+H}=\pi_\theta(s_t,z^{(n)})$. The Transformer critic provides Q-values for all $NH$ pairs, and $(n^\star,h^\star)$ is selected by a single argmax.
Intuitively, the selected execution horizon $h^\star$ reflects the current phase of the policy rollout at state $s_t$. When the Q-values of longer prefixes decline relative to shorter ones, the agent should replan after fewer steps to maintain reactivity. Conversely, when Q-values stay high or increase with longer prefixes, the state admits a coherent long-horizon plan and the agent benefits from a longer prefix to maintain temporal consistency. The selected execution horizon $h^\star\in[H]$ thus varies per state, with shorter $h^\star$ at sensitive states and longer $h^\star$ at states admitting coherent long-horizon plans. Importantly, the joint argmax across prefix lengths relies on ACSAC's prefix-conditioned Q-values lying on a common return scale. We prove this cross-horizon comparability in Appendix [G.3](https://arxiv.org/html/2605.11009#A7.SS3) and verify it empirically in Section [5.3](https://arxiv.org/html/2605.11009#S5.SS3).
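In code, the extraction rule of Equation (8) amounts to one flattened argmax over an $N\times H$ value matrix. A minimal sketch follows, using the paper's default $N=4$, $H=5$ and otherwise illustrative interfaces for the critic and flow policy.

```python
import torch

@torch.no_grad()
def extract_adaptive_chunk(critic, flow_policy, state, N=4, H=5, a_dim=5):
    """Sketch of Equation (8): joint argmax over candidate index and prefix length.

    critic:      maps (state, chunks) with chunks (N, H, a_dim) to Q-values (N, H),
                 where entry (n, h-1) is Q(s_t, a^{(n)}_{t:t+h}).
    flow_policy: maps (state, z) with z (N, H * a_dim) to chunks (N, H, a_dim).
    """
    z = torch.randn(N, H * a_dim)                  # z^{(n)} ~ N(0, I_{Hd})
    chunks = flow_policy(state, z)                 # N candidate length-H chunks
    q = critic(state, chunks)                      # (N, H) prefix-conditioned values
    flat_idx = torch.argmax(q)                     # single argmax over all N*H entries
    n_star, h_star = divmod(flat_idx.item(), H)    # recover (n*, h*); h* is 0-indexed
    return chunks[n_star, : h_star + 1]            # execute the length-(h*+1) prefix
```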
We provide pseudocode for the full offline pretraining and online fine-tuning procedures in Algorithm [1](https://arxiv.org/html/2605.11009#alg1). Further implementation details are in Appendix [B](https://arxiv.org/html/2605.11009#A2).
**Algorithm 1: Adaptive Chunk Size Actor-Critic (ACSAC)**

**Require:** dataset $\mathcal{D}$, maximum chunk size $H$, rejection sampling size $N$, flow BC policy $\pi_\theta$, causal Transformer critic $Q_\phi$

*// Offline training loop*

**while** not converged **do**
- Sample chunked batch $\{(s_{t:t+H+1},a_{t:t+H},r_{t:t+H})\}\sim\mathcal{D}$.
- Update flow BC policy $\pi_\theta$ with the flow-matching loss in Equation [7](https://arxiv.org/html/2605.11009#S4.E7).
- Update causal Transformer critic $Q_\phi$ with the multi-step TD loss in Equation [6](https://arxiv.org/html/2605.11009#S4.E6).

*// Adaptive policy extraction from flow BC policy $\pi_\theta$ with rejection sampling*

**function** $\pi_\star(s_t)$:
- $z^{(n)}\sim\mathcal{N}(0,I_{Hd}),\ n\in[N]$
- $a^{(n)}_{t:t+H}=\pi_\theta(s_t,z^{(n)}),\ n\in[N]$
- $(n^\star,h^\star)\leftarrow\arg\max_{n\in[N],\,h\in[H]}Q_\phi(s_t,a^{(n)}_{t:t+h})$
- **return** $a^{(n^\star)}_{t:t+h^\star}$

*// Online fine-tuning with adaptive replanning*

Initialize $\mathcal{D}$ with offline data.

**for** every environment step $t$ **do**
- **if** the previously selected chunk has been fully executed **then** $a^\star_{t:t+h^\star}\leftarrow\pi_\star(s_t)$
- Act with $a^\star_t$ and receive $s_{t+1},r_t$.
- $\mathcal{D}\leftarrow\mathcal{D}\cup\{(s_t,a^\star_t,s_{t+1},r_t)\}$
- Update $\pi_\theta$ via the flow-matching loss in Equation [7](https://arxiv.org/html/2605.11009#S4.E7) using $\mathcal{D}$.
- Update $Q_\phi$ via the multi-step TD loss in Equation [6](https://arxiv.org/html/2605.11009#S4.E6) using $\mathcal{D}$.
## 5 Experiments
We evaluate ACSAC on long-horizon, sparse-reward robotic manipulation tasks from OGBench [OGBench]. Our experiments are designed to answer three questions. **Q1.** Does ACSAC improve offline-to-online RL performance over prior single-step, multi-step, and fixed chunk size methods? **Q2.** Does ACSAC's adaptive chunk size selection follow task phases, and are its prefix-conditioned Q-values calibrated and cross-horizon comparable? **Q3.** How do the design choices of ACSAC, including the maximum chunk size, the rejection sampling size, and adaptive chunk size selection itself, affect performance?
### 5.1 Experimental Setup
**Environments and datasets.** We evaluate ACSAC on OGBench manipulation tasks, which contain challenging long-horizon, sparse-reward robotic manipulation problems. We consider five domains with varying difficulties: *scene-sparse*, *puzzle-3x3-sparse*, *cube-double*, *cube-triple*, and *cube-quadruple*. Each domain contains five single-task variants, giving 25 tasks in total. We follow the QC dataset protocol for its five offline-to-online domains, including the 100M-transition dataset for *cube-quadruple* [QC]. These domains are well suited for evaluating adaptive action chunking because they require both long-range value propagation and reactive manipulation at precise task phases. Additional domain metadata is provided in Appendix [E.1](https://arxiv.org/html/2605.11009#A5.SS1).
**Baselines.** We compare ACSAC against single-step, multi-step, and fixed chunk size methods. The single-step baselines are IQL [IQL], ReBRAC [ReBRAC], FQL [FQL], and BFN [QC]. The multi-step baselines are FQL-n [QC] and BFN-n [QC], which use multi-step returns without learning either a chunked critic or a chunked policy. Finally, we compare against fixed chunk size methods, including QC [QC], QC-FQL [QC], and DEAS [DEAS]. Implementation details for all baselines are summarized in Appendix [B](https://arxiv.org/html/2605.11009#A2).
Table 1: Summary table for OGBench offline-to-online RL results. For each cell, we report the offline performance after 1M training steps and then the online performance after 1M additional online steps. The best method(s) for each column is highlighted in bold and color. ACSAC consistently outperforms all prior single-step, multi-step, and fixed chunk size action-chunking baselines at the end of both the offline training and the online training. See the full per-task results in Table [6](https://arxiv.org/html/2605.11009#A6.T6) (complete table) and Figure [6](https://arxiv.org/html/2605.11009#A6.F6) (individual training curves). All values are means over 4 seeds with 95% confidence intervals.
### 5.2 Main Results
We report the main offline-to-online RL results in Table [1](https://arxiv.org/html/2605.11009#S5.T1). ACSAC achieves the best overall performance across the five-domain suite, outperforming single-step, multi-step, and fixed chunk size action-chunking baselines in both the offline phase and the offline-to-online phase. The improvement is most visible on the long-horizon *cube-triple* and *scene-sparse* domains, where adaptive chunk size selection is beneficial when an episode alternates between coarse transport and precise manipulation.

Fixed chunk size methods such as QC, QC-FQL, and DEAS substantially improve over single-step and multi-step flow baselines on the long-horizon cube domains, which alone demonstrates the importance of action chunking. A fixed chunk size, however, imposes a single trade-off between reactivity and temporal consistency throughout the entire episode. ACSAC adaptively selects the execution horizon at each replanning state, retaining the fast value backups of long chunks while avoiding unnecessary open-loop execution in states that require precise feedback.

The main exception is *cube-quadruple*, where fixed chunk size methods remain highly competitive online. A plausible reason is that the default maximum chunk size is still short relative to the temporal structure of a four-object manipulation task, so adaptivity alone cannot cover a full behavior segment. Complete per-task results and training curves are in Appendix [F.2](https://arxiv.org/html/2605.11009#A6.SS2).
### 5.3 Qualitative and Quantitative Analyses

Figure 3: Distribution of chunk size decisions from ACSAC. Mean executed chunk size at each observation timestep on a representative *cube-double* pick-and-place task, averaged over 50 episodes of the online checkpoint.

**Distribution of chunk size decisions.** We visualize in Figure [3](https://arxiv.org/html/2605.11009#S5.F3) the mean executed chunk size at each observation timestep on a representative *cube-double* pick-and-place task, averaged over 50 episodes of the online checkpoint. The curve closely follows the semantic phases of the task: ACSAC commits to large chunks early ($t<5$, mean $\approx 4$) for the coarse approach toward the cube, intermediate chunks during grasp and transport ($5\leq t\leq 25$, mean $\approx 3$), and progressively smaller chunks toward precise placement, reaching mean $\approx 1$ around $t=33$ as the cube is aligned at the target. This adaptive pattern mirrors the qualitative finding of AAC for vision-language-action models: larger chunks enable fast, coarse movements while smaller chunks with high-frequency replanning ensure precise control during the critical grasping and placement phases. Unlike AAC, where chunk size is driven by predicted action entropy, in ACSAC it emerges directly from the learned prefix-conditioned Q-values.

Figure 4: Prefix-Q calibration. Binned predicted Q-value $\hat{Q}$ versus realized Monte-Carlo return $\hat{G}$ over 50 rollouts of the online checkpoint, for the deployed adaptive policy and five fixed-$h$ controls.

**Prefix-Q calibration and cross-horizon comparability.** We test whether ACSAC's prefix-conditioned Q-values $\hat{Q}$ are calibrated against realized returns and comparable across horizons. We collect 50 rollouts of the deployed adaptive policy and of five fixed-$h$ controls, each selecting only among $N$ candidates of length $h$. Along all rollouts we record $\hat{Q}$ and discounted Monte-Carlo returns $\hat{G}$, then plot the mean $\hat{Q}$ in equal-frequency bins of $\hat{G}$. Deviation from the diagonal $\hat{Q}=\hat{G}$ quantifies over- or under-estimation. As shown in Figure [4](https://arxiv.org/html/2605.11009#S5.F4), all six curves track the diagonal closely and largely coincide along the return range. This indicates that ACSAC's prefix Q-values are individually calibrated and mutually comparable across $h$. The joint $\arg\max$ over the $N\times H$ candidates is therefore a principled deployment rule.
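For reference, the equal-frequency binning behind this plot can be reproduced along the following lines; the function name and bin count are illustrative assumptions, not the paper's analysis code.

```python
import numpy as np

def calibration_curve(q_pred, g_realized, n_bins=10):
    """Sketch of the prefix-Q calibration plot: mean predicted Q in
    equal-frequency bins of the realized discounted Monte-Carlo return.

    q_pred:     (T,) predicted Q-values recorded along the rollouts.
    g_realized: (T,) matching discounted Monte-Carlo returns.
    Returns per-bin (mean G, mean Q); points near the diagonal indicate calibration.
    """
    order = np.argsort(g_realized)
    bins = np.array_split(order, n_bins)           # equal-frequency binning
    g_mean = np.array([g_realized[b].mean() for b in bins])
    q_mean = np.array([q_pred[b].mean() for b in bins])
    return g_mean, q_mean
```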
### 5.4 Ablation Study
We ablate the main design choices of ACSAC on the *cube-double* and *cube-triple* domains.



Figure 5: Ablation studies. *Top:* maximum chunk size $H$ sweep. *Middle:* rejection sampling size $N$ sweep. *Bottom:* same-architecture controls QT-BFN and QT-QC replacing the MLP critics in BFN and QC with ACSAC's causal Transformer critic. Curves aggregate five tasks per domain. The first 1M steps are offline and the next 1M steps are online.

In Figure [5](https://arxiv.org/html/2605.11009#S5.F5) (top), we sweep the maximum chunk size $H\in\{1,3,5,7\}$ around the default $H=5$. $H=1$ fails on both domains, confirming that single-step execution cannot provide the value propagation needed for these sparse-reward tasks. $H=3$ and $H=5$ perform comparably on both domains, with $H=3$ slightly ahead on *cube-triple*. Increasing to $H=7$ hurts *cube-triple*, consistent with the observation that overly long chunks make the behavior policy and critic harder to learn [QC, DQC, MAC]. We use $H=5$ in all other experiments to allow the adaptive policy a broader range of execution horizons.
In Figure [5](https://arxiv.org/html/2605.11009#S5.F5) (middle), we vary the rejection sampling size $N\in\{2,4,6,8\}$ around the default $N=4$. $N=2$ weakens value-based action selection, especially online on *cube-triple*, while increasing $N$ to 6 or 8 saturates with no consistent gain. We use $N=4$ as the best overall trade-off.
Finally, in Figure [5](https://arxiv.org/html/2605.11009#S5.F5) (bottom), we isolate the contribution of the causal Transformer critic with two same-architecture controls. QT-QC and QT-BFN replace the MLP critics of QC and BFN with ACSAC's causal Transformer critic, keeping every other component unchanged from QC. Both controls improve over the original baselines but do not close the gap to ACSAC, most clearly on *cube-triple*. This indicates that ACSAC's advantage is not merely a Transformer-critic effect. The multi-horizon TD objective in Equation [6](https://arxiv.org/html/2605.11009#S4.E6) and the joint $\arg\max$ over $(n,h)$ in Equation [8](https://arxiv.org/html/2605.11009#S4.E8) are essential.
## 6 Conclusion
We propose ACSAC, which enables adaptive chunk selection through cross-horizon calibrated prefix value learning. By learning prefix-conditioned Q-values on a shared discounted-return scale, ACSAC allows executable prefixes with different horizons to be compared consistently, thereby supporting adaptive replanning behavior without manual chunk-size tuning. A causal Transformer critic trained with a multi-step TD objective produces these prefix-conditioned values. A flow BC policy paired with rejection sampling over the joint $(n,h)$ axes yields the executed chunk. On long-horizon, sparse-reward OGBench manipulation tasks, ACSAC outperforms single-step, multi-step, and fixed chunk size baselines in both offline and offline-to-online settings. A promising future direction is to integrate state-dependent chunk sizes with vision-language-action models. Recent work shows that action-sequence critics scale to large VLAs in both simulation and real-world experiments [DEAS, CO-RFT], bringing RL closer to real-world applications.
## References
## Appendix A Limitations
We highlight three limitations of ACSAC and corresponding directions for future work. First, larger $H$ aggravates the multi-modality of behavior chunks [DEAS] and larger $N$ scales the per-step cost, so jointly scaling $(H,N)$ together with the Transformer critic capacity is a natural next step. Second, our evaluation focuses on long-horizon manipulation from OGBench and excludes navigation domains such as *antmaze* and *humanoidmaze*, where action-chunked methods are known to be less effective due to highly reactive control and fine-grained trajectory stitching [CGQ, MAC, DQC]. Our theoretical analysis also assumes deterministic transitions (Appendix [G](https://arxiv.org/html/2605.11009#A7)) [DQC], leaving stochastic environments as a promising extension. Third, we do not evaluate ACSAC on real-world robotic deployment or on large vision-language-action backbones, but recent work shows that chunked value learning scales to such settings [DEAS, CO-RFT], presenting a natural extension.
## Appendix B Implementation Details
#### Online fine-tuning.
For offline-to-online RL, we simply add online transitions to the dataset, without distinguishing them from the offline transitions. We continue to train each method with the same objective as in offline training, following FQL and QC.
#### Flow matching.
Following QC, each action-chunking baseline trains a flow BC policy on the dataset, parameterized by a state-conditioned velocity field $v_\theta$ trained with the flow-matching loss in Equation [7](https://arxiv.org/html/2605.11009#S4.E7). At inference, an action chunk is generated by Euler integration of $v_\theta$ over $F$ flow steps from $a^{0}=z\sim\mathcal{N}(0,I_{Hd})$. Single-action methods such as FQL and BFN train the velocity field on individual actions instead of action chunks.
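A minimal sketch of this Euler integration follows, assuming `v_theta` is the trained velocity field; the number of flow steps `F` shown here is an illustrative default, not the paper's setting.

```python
import torch

@torch.no_grad()
def sample_chunk(v_theta, state, H, a_dim, F=10):
    """Sketch: generate an action chunk by Euler integration of the flow ODE.

    Integrates da/du = v_theta(u, s, a) from u=0 to u=1 in F uniform steps,
    starting at a^0 = z ~ N(0, I_{Hd}).
    """
    a = torch.randn(H * a_dim)               # a^0 = z
    du = 1.0 / F
    for k in range(F):
        u = torch.tensor(k * du)
        a = a + du * v_theta(u, state, a)    # Euler step along the velocity field
    return a.view(H, a_dim)                  # the sampled length-H action chunk
```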
#### Causal Transformer.
ACSAC ingests $(s_t,a_t,a_{t+1},\ldots,a_{t+H-1})$ as a token sequence and outputs $H$ heads, $[Q_\phi(s_t,a_t),\,Q_\phi(s_t,a_{t:t+2}),\,\ldots,\,Q_\phi(s_t,a_{t:t+H})]$, in one forward pass. A causal attention mask ensures that the $h$-th head depends only on $a_{t:t+h}$. Pre-LayerNorm is applied before every attention and feed-forward sublayer.
#### Value learning.
Following standard practice, all methods train two Q-networks for stability. ACSAC takes the minimum of the two Q-values [TD3+BC] for both the bootstrap target and the policy extraction, while each baseline retains its original aggregation rule. ACSAC's bootstrap target uses the current online critic with the gradient stopped. SEEM [SEEM] shows that LayerNorm bounds the critic's NTK, allowing stable training even when online and target networks share parameters.
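As a one-line sketch of this clipped double-Q aggregation, with the gradient stop applied when the result feeds the bootstrap target; the function name and flag are illustrative.

```python
import torch

def min_double_q(q1, q2, for_target=True):
    """Sketch: elementwise minimum of two critics' prefix-conditioned values.

    q1, q2: (N, H) Q-values from the two Q-networks.
    """
    q = torch.minimum(q1, q2)
    # The bootstrap target uses the online critic with gradients stopped;
    # for policy extraction the minimum is used as-is.
    return q.detach() if for_target else q
```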
## Appendix C Baselines
#### FQL.
FQL [FQL] is a behavior regularization-based offline / offline-to-online method that distills a one-step noise-conditioned policy from a flow BC policy with a single-step MLP critic.
#### FQL-n.
FQL-n [QC] is a multi-step return variant of FQL that replaces the single-step TD target with a multi-step return at the chunked horizon.
#### QC-FQL and QC.
QC-FQL [QC] extends FQL to the chunked action space. QC [QC] replaces QC-FQL's distilled one-step policy with rejection sampling over the flow BC policy. We reproduce both with the official implementation released by QC.
#### BFN and BFN-n.
BFN [QC] runs QC's rejection sampling extraction in the original single-action space. BFN-n [QC] additionally uses a multi-step TD target.
#### DEAS.
DEAS [DEAS] extends action-chunked offline RL with detached value learning, distributional RL with fixed support, and dual discount factors for stable training. We reproduce DEAS with the official implementation released by DEAS, following its cube-task hyperparameters and CGQ's per-task entries for *scene-sparse* and *puzzle-3x3-sparse*.
#### QT-QC and QT-BFN.
Same-architecture controls that replace QC and BFN's MLP critics with ACSAC's causal Transformer critic, keeping every other component unchanged from QC.
## Appendix D Hyperparameters
### D.1 Shared hyperparameters
We report in Table [2](https://arxiv.org/html/2605.11009#A4.T2) the hyperparameters shared across all methods in our experiments.
Table 2: Shared hyperparameters across all methods.
### D.2 Task-specific hyperparameters
We report in Table [3](https://arxiv.org/html/2605.11009#A4.T3) the per-task hyperparameter choices of each method.
Table 3: Task-specific hyperparameters. The layout follows CGQ's all-method task-specific hyperparameter table. For ACSAC, $H$ is the maximum chunk size and $N$ is the rejection sampling size. The Transformer critic uses $n_{\mathrm{layer}}=2$ attention layers, $n_{\mathrm{head}}=8$ heads, and per-head dimension $d_{\mathrm{head}}=16$ across all tasks. For QC-FQL, $\alpha$ is the behavior regularization coefficient and $h$ is the chunk size. For QC, $h$ is the chunk size and $N$ is the rejection sampling size. For DEAS, $d$ and $u$ denote data-centric and universal support, respectively.
## Appendix E Experiment Details
### E.1 Environments, Tasks, and Datasets
**OGBench [OGBench].** We evaluate ACSAC on five long-horizon OGBench manipulation domains: *scene-sparse*, *puzzle-3x3-sparse*, *cube-double*, *cube-triple*, and *cube-quadruple*. Following QC, *scene-sparse* and *puzzle-3x3-sparse* sparsify the rewards of OGBench's *scene-play* and *puzzle-3x3-play* to $\{-1,0\}$, where $-1$ is given when the task is incomplete and $0$ when it is completed. Each domain contains five single-task variants, giving 25 OGBench tasks in total. These domains are particularly suitable for studying action chunking because successful behavior requires both long-range value propagation and temporally coherent action sequences. Dataset size, episode length, and action dimension for each domain are listed in Table [4](https://arxiv.org/html/2605.11009#A5.T4).
**scene-sparse.** This domain contains a drawer, a window, a cube, and two button locks that control whether the drawer and the window can be opened. Tasks require multi-stage manipulation, such as unlocking the scene, moving the drawer or window, placing the cube, and relocking the scene. The long sequence of necessary sub-behaviors makes this domain sensitive to slow value propagation and incoherent short-horizon exploration.

**puzzle-3x3-sparse.** This domain contains a $3\times 3$ grid of buttons. Pressing a button flips its own state and the states of adjacent buttons. The task is to reach a target color configuration. Because local actions can have coupled downstream effects, the domain tests whether a policy can execute coherent short plans while still replanning when the current prefix becomes unfavorable.

**cube-double/triple/quadruple.** These domains require a robot arm to move two, three, or four cubes to target locations. The reward depends on the number of cubes that remain incorrectly placed, and an episode terminates when all cubes are correctly placed. The *cube-triple* and *cube-quadruple* domains are especially challenging in the offline-to-online setting because solving them often requires efficient online exploration after offline pretraining.
Table 4: Domain metadata. Dataset size is measured in transitions. All OGBench manipulation domains use a five-dimensional action space corresponding to end-effector translation, yaw, and gripper opening.
### E.2 Evaluation Protocol
#### Offline-to-online evaluation.
The main offline-to-online RL results in Table [1](https://arxiv.org/html/2605.11009#S5.T1) and the per-task results in Table [6](https://arxiv.org/html/2605.11009#A6.T6) use this protocol. Following FQL and QC, agents are pretrained for 1M offline gradient steps and fine-tuned for 1M online environment steps, with success rates reported at 1M and 2M steps. All results are averaged over 4 random seeds, with 95% confidence intervals computed via 5000-sample stratified bootstrap resampling [QC].
#### Offline evaluation.
The offline RL results in Table [7](https://arxiv.org/html/2605.11009#A6.T7) use this protocol. Following OGBench and FQL, we train each method for 1M offline gradient steps, evaluate every 100K steps over 50 episodes, and report the average success rate across the last three evaluation epochs (800K, 900K, 1M).
#### Results from prior works.
For Table [1](https://arxiv.org/html/2605.11009#S5.T1), the IQL, ReBRAC, FQL, BFN, FQL-n, BFN-n, QC, and QC-FQL entries are taken from QC's released plot data ([https://github.com/ColinQiyangLi/qc/tree/main/plot_data](https://github.com/ColinQiyangLi/qc/tree/main/plot_data)). For Table [7](https://arxiv.org/html/2605.11009#A6.T7), the FQL, FQL-n, QC-FQL, DEAS, DQC, and CGQ entries are taken from CGQ. Otherwise, we implement the baselines in our codebase and evaluate them under our protocol.
## Appendix F Additional Experimental Results
### F.1 Computational Costs
We run all experiments on NVIDIA RTX 3090 GPUs. Table [5](https://arxiv.org/html/2605.11009#A6.T5) reports parameter counts and per-step runtimes on *cube-triple-task1*.
#### Parameter count.
DEAS is counted under its default *cube-triple* reproduction configuration [DEAS]. ACSAC has the smallest parameter count because its shallow Transformer critic is shared across all prefix lengths rather than instantiating a separate critic per chunk size.
#### Per-step runtime.
QC-FQL has comparable runtime to FQL and BFN for both offline and online training. QC is slower offline because it samples 32 actions per training example, while BFN samples 4. ACSAC shares QC's expected-max sampling backbone and adds only a shallow ($n_{\mathrm{layer}}=2$, $n_{\mathrm{embd}}=128$) Transformer, so its offline runtime is on par with QC. DEAS skips rejection sampling and is therefore substantially faster offline. For online training, QC and ACSAC are only 30–50% more expensive than FQL, BFN, and QC-FQL.
Table 5: Parameter count and per-step runtime on *cube-triple-task1*. Offline measures one agent training step. Online measures one agent training step plus one environment step. Parameter counts are measured on *cube-triple-task1* (46-dim observation, 5-dim action).
### F.2 Full Offline-to-Online RL Results for OGBench
Table [6](https://arxiv.org/html/2605.11009#A6.T6) reports per-task offline-to-online RL results aligned with the domain-level summary in Table [1](https://arxiv.org/html/2605.11009#S5.T1). The ACSAC and DEAS columns are reproduced under our unified protocol. Figure [6](https://arxiv.org/html/2605.11009#A6.F6) provides summary plots by domain (matching Table [1](https://arxiv.org/html/2605.11009#S5.T1)) followed by per-task curves (matching Table [6](https://arxiv.org/html/2605.11009#A6.T6)).
Table 6: Complete OGBench offline-to-online RL results by task. For each cell, we report the offline performance after 1M training steps and then the online performance after 1M additional online steps. The best method(s) for each column is highlighted in bold and color. All values are means over 4 seeds with 95% confidence intervals.

Figure 6: Complete OGBench offline-to-online RL results by task. Following QC's appendix convention, the figure first shows summary plots by domain (corresponding to Table [1](https://arxiv.org/html/2605.11009#S5.T1)) and then per-task curves (corresponding to Table [6](https://arxiv.org/html/2605.11009#A6.T6)). The first 1M steps correspond to offline training and the next 1M steps correspond to online fine-tuning. Results are averaged over 4 seeds and plotted with 95% confidence intervals.
### F.3 Offline RL Results for OGBench
Table [7](https://arxiv.org/html/2605.11009#A6.T7) reports an offline-only comparison on OGBench manipulation, complementary to the offline-to-online results in Table [1](https://arxiv.org/html/2605.11009#S5.T1). For tasks already evaluated in prior work, we use the reported numbers directly (entries without $\pm$). The FQL, FQL-n, QC-FQL, DEAS, DQC, and CGQ entries are taken from CGQ. The evaluation protocol is described in Appendix [E.2](https://arxiv.org/html/2605.11009#A5.SS2).
Table 7: Offline RL results on OGBench manipulation benchmarks. Following CGQ, we highlight results within 95% of the best performance in bold.
## Appendix G Theoretical Foundations
We analyze ACSAC as behavior-constrained variable-duration Bellman learning. A length-$H$ chunk drawn from the flow BC policy is treated as a proposal path. At each replanning state, ACSAC may execute any prefix of this path and replan afterwards. We establish four results. First, prefix truncation is deterministic post-processing, so it cannot increase the action-level mismatch between the flow BC policy and the behavior chunk distribution at the corresponding prefix length. Second, prefix-conditioned Q-values of different lengths are well-defined under prefix consistency, and they share the unit of total discounted return at the unrestricted variable-length Bellman optimum. Third, the critic loss in Equation [6](https://arxiv.org/html/2605.11009#S4.E6) approximates a $\gamma$-contractive Bellman backup whose unique fixed point is the action-value function of the deployed policy $\pi_\star$. Fourth, averaging per-horizon squared losses at the gradient level reduces update variance under the multi-step targets, while preserving the sparse reward signal in each per-horizon target.

Throughout this appendix we work in the deterministic setting and at the level of exact expectations. A short final paragraph in Section [G.5](https://arxiv.org/html/2605.11009#A7.SS5) sketches what changes in stochastic environments and under finite-sample approximation, and points to DQC and EMaQ for the heavier analysis.
### G.1 Setup and Notation
###### Assumption G.1 (Deterministic MDP for analysis).
For the proofs we consider the deterministic version of the MDP in Section [3](https://arxiv.org/html/2605.11009#S3), with $s_{t+1}=f(s_t,a_t)$ and bounded rewards $|r(s,a)|\leq R_{\max}$, $\gamma\in(0,1)$. We use the sup-norm $\|Q\|_\infty:=\sup_{s,a_{t:t+h}}|Q(s,a_{t:t+h})|$ on bounded $Q$-functions, and write the proofs for finite or discretized state-action spaces. The same contraction arguments extend to bounded continuous spaces when the displayed maxima exist; otherwise, maxima can be replaced by suprema. The OGBench tasks evaluated in Section [5](https://arxiv.org/html/2605.11009#S5) are deterministic.
#### Variable-length action chunks.
We use the body's chunk notation $a_{t:t+h}:=(a_t,a_{t+1},\ldots,a_{t+h-1})\in\mathcal{A}^{h}$ throughout this appendix, with $h\in[H]:=\{1,\ldots,H\}$ and $[N]:=\{1,\ldots,N\}$. We write

$$\mathcal{A}^{\leq H}:=\bigcup_{h=1}^{H}\mathcal{A}^{h} \tag{9}$$

for the set of executable prefixes, with the convention that an element carries its own length $h$. Under Assumption [G.1](https://arxiv.org/html/2605.11009#A7.Thmtheorem1), executing $a_{t:t+h}$ from $s_t$ produces the unique trajectory $s_{t+1}=f(s_t,a_t),\ldots,s_{t+h}=f(s_{t+h-1},a_{t+h-1})$, with open-loop chunk return

$$r_t^{(h)}:=\sum_{\tau=0}^{h-1}\gamma^{\tau}\,r(s_{t+\tau},a_{t+\tau}), \tag{10}$$

in agreement with the prefix reward used in the chunked TD loss of Equation [2](https://arxiv.org/html/2605.11009#S3.E2).
#### Behavior and proposal distributions.
Let $\pi_\beta^{H}(\cdot\mid s_t)\in\Delta(\mathcal{A}^{H})$ denote the true behavior chunk distribution under the data-collection policy of $\mathcal{D}$, and let $\pi_\theta^{H}(\cdot\mid s_t)\in\Delta(\mathcal{A}^{H})$ denote the full-length law of the flow BC policy of Section [4.3](https://arxiv.org/html/2605.11009#S4.SS3). For $h\in[H]$, $\pi_\theta^{h}(\cdot\mid s_t)$ denotes the distribution of the first $h$ actions when a full chunk is sampled from $\pi_\theta^{H}(\cdot\mid s_t)$,

$$\pi_\theta^{h}(a_{t:t+h}\mid s_t):=\int_{\mathcal{A}^{H-h}}\pi_\theta^{H}(a_{t:t+H}\mid s_t)\,da_{t+h:t+H}, \tag{11}$$

and $\pi_\beta^{h}$ is defined analogously from $\pi_\beta^{H}$. The integral becomes a sum in the discrete case.
#### Equal-value candidates.
When several candidates attain the same maximum value, we choose one using a fixed state-independent rule, such as the first candidate in the stored order. The conclusions below do not depend on how ties are resolved; the rule only makes the policy extraction in Definition [G.14](https://arxiv.org/html/2605.11009#A7.Thmtheorem14) below single-valued.
### G.2 Prefix Truncation Does Not Increase Behavior Mismatch
ACSAC's bootstrap target queries the critic at an adaptively chosen prefix of a behavior-generated chunk. A natural concern is whether shorter prefixes drift further from the behavior support than the full chunk does. We rule this out at the level of total variation distance, using only the elementary fact that marginalization is non-expansive in TV.
###### Lemma G.2 (Marginalization is TV non-expansive).
For any two distributions $\mu,\nu$ on $\mathcal{A}^H$ and any $h\in[H]$, let $\mu_h,\nu_h$ denote their length-$h$ prefix marginals as in Equation [11](https://arxiv.org/html/2605.11009#A7.E11). Then

$$D_{\mathrm{TV}}(\mu_h,\nu_h)\;\leq\;D_{\mathrm{TV}}(\mu,\nu). \tag{12}$$
###### Proof.
Using the convention $D_{\mathrm{TV}}(p,q)=\tfrac{1}{2}\sum_x|p(x)-q(x)|$ and the triangle inequality,

$$\sum_{a_{t:t+h}}\Bigl|\sum_{a_{t+h:t+H}}(\mu-\nu)(a_{t:t+H})\Bigr|\;\leq\;\sum_{a_{t:t+H}}\bigl|\mu(a_{t:t+H})-\nu(a_{t:t+H})\bigr|.$$

Dividing by $2$ gives Equation [12](https://arxiv.org/html/2605.11009#A7.E12). The continuous case is analogous, with sums replaced by integrals. ∎
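A quick numerical sanity check of the lemma (our illustration; the alphabet size and seed are arbitrary): marginalizing two random joint distributions never increases their TV distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    """Total variation distance, D_TV(p, q) = 0.5 * sum |p - q|."""
    return 0.5 * np.abs(p - q).sum()

# Two random joint distributions over A^2 with |A| = 5 (toy sizes).
A = 5
mu = rng.random((A, A)); mu /= mu.sum()
nu = rng.random((A, A)); nu /= nu.sum()

# Length-1 prefix marginals, i.e. Equation (11) in the discrete case.
mu1, nu1 = mu.sum(axis=1), nu.sum(axis=1)
assert tv(mu1, nu1) <= tv(mu, nu) + 1e-12
print(tv(mu1, nu1), "<=", tv(mu, nu))
```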
###### Theorem G.3 (Prefix truncation does not increase behavior mismatch).
Let $\delta_\theta(s_t):=D_{\mathrm{TV}}\bigl(\pi_\theta^H(\cdot\mid s_t),\pi_\beta^H(\cdot\mid s_t)\bigr)$ be the full-chunk behavior mismatch of the flow BC policy at state $s_t$. Then for every $h\in[H]$,

$$D_{\mathrm{TV}}\bigl(\pi_\theta^h(\cdot\mid s_t),\;\pi_\beta^h(\cdot\mid s_t)\bigr)\;\leq\;\delta_\theta(s_t). \tag{13}$$
###### Proof.
Apply Lemma [G.2](https://arxiv.org/html/2605.11009#A7.Thmtheorem2) with $\mu=\pi_\theta^H(\cdot\mid s_t)$ and $\nu=\pi_\beta^H(\cdot\mid s_t)$. ∎
### G.3 Prefix Values Are Well-Defined and Cross-Horizon Comparable
For ACSAC's joint argmax over $(n,h)$ to be meaningful, two conditions are needed. The network output at the $h$-th position must represent a well-defined Q-value of the length-$h$ prefix, and the resulting Q-values across prefix lengths must lie on a single comparable scale.
###### Definition G.6 (Prefix consistency).
A network $\hat{Q}:\mathcal{S}\times\mathcal{A}^H\to\mathbb{R}^H$ is *prefix-consistent* if its $h$-th output takes the same value on any two length-$H$ chunks that share the same first $h$ actions. Equivalently, $\hat{Q}^{(h)}(s,a_{t:t+H})$ depends only on $(s,a_{t:t+h})$ and not on the suffix $a_{t+h:t+H}$.
###### Proposition G.7 (Causal-masked Transformer is sufficient).
A decoder-only Transformer with causal self-attention and one output head per position satisfies prefix consistency.
###### Proof.
Index the input tokens as $(s,a_t,\ldots,a_{t+H-1})$ with action $a_{t+h-1}$ at position $h$. The causal mask restricts the hidden representation at position $h$ to attend only to positions $0,\ldots,h$, and induction over layers preserves this restriction. The $h$-th output head reads only this hidden representation, so its value depends only on $(s,a_{t:t+h})$. ∎
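The following PyTorch sketch (our illustration, not the paper's released architecture; all dimensions and layer counts are arbitrary) instantiates such a critic and checks Definition G.6 empirically: perturbing the suffix of a chunk leaves the prefix outputs unchanged.

```python
import torch
import torch.nn as nn

class CausalChunkCritic(nn.Module):
    """Toy causal-masked Transformer critic with one scalar output per position."""
    def __init__(self, state_dim=4, act_dim=2, d_model=32, H=5):
        super().__init__()
        self.H = H
        self.embed_s = nn.Linear(state_dim, d_model)
        self.embed_a = nn.Linear(act_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(H + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # shared scalar head per position

    def forward(self, s, chunk):                      # chunk: (B, H, act_dim)
        tok = torch.cat([self.embed_s(s)[:, None], self.embed_a(chunk)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(self.H + 1)
        hid = self.encoder(tok + self.pos, mask=mask)
        return self.head(hid[:, 1:]).squeeze(-1)      # (B, H): Q^(1), ..., Q^(H)

torch.manual_seed(0)
critic = CausalChunkCritic().eval()
s = torch.randn(1, 4)
c1 = torch.randn(1, 5, 2)
c2 = c1.clone(); c2[:, 3:] = torch.randn(1, 2, 2)     # change the suffix only
q1, q2 = critic(s, c1), critic(s, c2)
print(torch.allclose(q1[:, :3], q2[:, :3]))           # True: first 3 outputs match
```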
#### Variable-horizon Bellman semantics.
For any continuation policy $\pi$ and any state $s_t$, the variable-horizon prefix value is defined as

$$Q^\pi(s_t,a_{t:t+h})\;:=\;r_t^{(h)}\;+\;\gamma^h\,V^\pi(s_{t+h}), \tag{14}$$

where $V^\pi$ is the standard state-value function of $\pi$ in the original one-step MDP. This is the discounted return of committing to the prefix $a_{t:t+h}$ open-loop and then following $\pi$ from $s_{t+h}$. All prefix lengths therefore share the same quantity type, namely the total discounted return from $s_t$, with prefix consistency ensuring that each output of a prefix-consistent network estimates this quantity for a unique length-$h$ prefix.
###### Theorem G.9 (Prefix Q-values are cross-horizon comparable).
For any continuation policy $\pi$, $Q^\pi$ in Equation [14](https://arxiv.org/html/2605.11009#A7.E14) is the action-value function of the variable-horizon Bellman backup

$$(\mathcal{B}^\pi Q)(s_t,a_{t:t+h})\;:=\;r_t^{(h)}+\gamma^h\,V^\pi(s_{t+h}), \tag{15}$$

which is constant in $Q$ and therefore trivially has $Q^\pi$ as its unique fixed point. For a full chunk $a_{t:t+H}$ and any $h_1\leq h_2\in[H]$, writing the additional reward over the segment $[h_1,h_2)$ as

$$r_{t+h_1}^{(h_2-h_1)}\;:=\;\sum_{j=0}^{h_2-h_1-1}\gamma^j\,r(s_{t+h_1+j},a_{t+h_1+j}), \tag{16}$$

we obtain the difference identity

$$Q^\pi(s_t,a_{t:t+h_2})-Q^\pi(s_t,a_{t:t+h_1})\;=\;\gamma^{h_1}\Bigl(r_{t+h_1}^{(h_2-h_1)}+\gamma^{h_2-h_1}\,V^\pi(s_{t+h_2})-V^\pi(s_{t+h_1})\Bigr). \tag{17}$$
###### Proof.
Equation [14](https://arxiv.org/html/2605.11009#A7.E14) is the definition; substitute the two cases $h=h_1,h_2$ and factor $\gamma^{h_1}$ to obtain Equation [17](https://arxiv.org/html/2605.11009#A7.E17) after splitting $r_t^{(h_2)}-r_t^{(h_1)}=\gamma^{h_1}r_{t+h_1}^{(h_2-h_1)}$. ∎
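For readers who want the algebra explicit, substituting Equation (14) at $h=h_1$ and $h=h_2$ and splitting $r_t^{(h_2)}=r_t^{(h_1)}+\gamma^{h_1}r_{t+h_1}^{(h_2-h_1)}$ gives the three-line computation below; nothing beyond Equations (14) and (16) is used.

```latex
\begin{align*}
Q^{\pi}(s_t, a_{t:t+h_2}) - Q^{\pi}(s_t, a_{t:t+h_1})
  &= \bigl(r_t^{(h_2)} + \gamma^{h_2} V^{\pi}(s_{t+h_2})\bigr)
   - \bigl(r_t^{(h_1)} + \gamma^{h_1} V^{\pi}(s_{t+h_1})\bigr) \\
  &= \gamma^{h_1} r_{t+h_1}^{(h_2 - h_1)}
   + \gamma^{h_2} V^{\pi}(s_{t+h_2}) - \gamma^{h_1} V^{\pi}(s_{t+h_1}) \\
  &= \gamma^{h_1} \Bigl( r_{t+h_1}^{(h_2 - h_1)}
   + \gamma^{h_2 - h_1} V^{\pi}(s_{t+h_2}) - V^{\pi}(s_{t+h_1}) \Bigr).
\end{align*}
```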
### G.4 Expected-Prefix-Max Bellman Backup
We now show that the per-horizon target inside the critic loss in Equation [6](https://arxiv.org/html/2605.11009#S4.E6) is an unbiased Monte Carlo sample of an expected-prefix-max Bellman backup $\mathcal{B}_\theta^{N,H}$ that is a $\gamma$-contraction in sup-norm. Its unique fixed point is the action-value function of the deployed adaptive-prefix policy $\pi_\star$. The structure mirrors EMaQ [EMaQ], which samples $N$ behavior actions and backs up their max; ACSAC samples $N$ behavior chunks and backs up the max over all $NH$ executable prefixes.
###### Definition G.11 (ACSAC expected-prefix-max Bellman backup).
For bounded $Q:\mathcal{S}\times\mathcal{A}^{\leq H}\to\mathbb{R}$, the *ACSAC Bellman backup* is

$$(\mathcal{B}_\theta^{N,H}Q)(s_t,a_{t:t+h})\;:=\;r_t^{(h)}\;+\;\gamma^h\,\mathbb{E}_{\tilde{a}^{(1:N)}_{t+h:t+h+H}\sim\pi_\theta^H(\cdot\mid s_{t+h})}\!\left[\,\max_{n\in[N],\,k\in[H]}Q\!\left(s_{t+h},\,\tilde{a}^{(n)}_{t+h:t+h+k}\right)\right]. \tag{19}$$

The tilde distinguishes the $N$ i.i.d. proposal chunks at the bootstrap state $s_{t+h}$ from the chunk $a_{t:t+H}$ on the regression side, and the maximum ranges over the $NH$ candidate prefixes formed from these proposals.
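A minimal numpy sketch of the bracketed expectation in Equation (19); `propose_chunks` and `q_value` are hypothetical stand-ins for the proposal policy $\pi_\theta^H$ and the critic, not the paper's code.

```python
import numpy as np

def prefix_max_bootstrap(s_next, propose_chunks, q_value, N, H, n_mc=1024):
    """Monte Carlo estimate of E[ max_{n,k} Q(s', tilde-a^(n) prefix of length k) ]."""
    estimates = []
    for _ in range(n_mc):
        chunks = propose_chunks(s_next, N)        # N proposal chunks of length H
        best = max(q_value(s_next, chunks[n][:k]) # max over all NH prefixes
                   for n in range(N) for k in range(1, H + 1))
        estimates.append(best)
    return float(np.mean(estimates))
```

The full backup then adds $r_t^{(h)}$ and discounts by $\gamma^h$; with `n_mc = 1`, the sampled max is exactly the bootstrap part of the per-horizon target $\hat{G}_h$ in Equation (23) below.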
###### Lemma G.12 (Max non-expansiveness over a common index set).
For any finite index set $\mathcal{I}$ and any real-valued $g_1,g_2:\mathcal{I}\to\mathbb{R}$,

$$\Bigl|\max_{i\in\mathcal{I}}g_1(i)-\max_{i\in\mathcal{I}}g_2(i)\Bigr|\;\leq\;\max_{i\in\mathcal{I}}|g_1(i)-g_2(i)|. \tag{20}$$
###### Proof.
Let $i_1\in\arg\max_i g_1(i)$ and $i_2\in\arg\max_i g_2(i)$. Then $\max_i g_1(i)-\max_i g_2(i)\leq g_1(i_1)-g_2(i_1)\leq|g_1(i_1)-g_2(i_1)|\leq\max_i|g_1(i)-g_2(i)|$, and the symmetric direction yields the absolute value. This is a deterministic, set-theoretic inequality, so it holds even when the indexed values $g_1(i),g_2(i)$ are correlated across $i$, which is what allows EMaQ's contraction argument to transfer to ACSAC's $NH$ correlated prefix candidates [EMaQ]. ∎
###### Theorem G.13 (Contraction and unique fixed point).
The operator $\mathcal{B}_\theta^{N,H}$ in Equation [19](https://arxiv.org/html/2605.11009#A7.E19) satisfies

$$\|\mathcal{B}_\theta^{N,H}Q_1-\mathcal{B}_\theta^{N,H}Q_2\|_\infty\;\leq\;\gamma\,\|Q_1-Q_2\|_\infty \tag{21}$$

for all bounded $Q_1,Q_2:\mathcal{S}\times\mathcal{A}^{\leq H}\to\mathbb{R}$, so $\mathcal{B}_\theta^{N,H}$ has a unique fixed point in the space of bounded $Q$-functions.
###### Proof.
Fix $(s_t,a_{t:t+h})$ and let $s'=s_{t+h}$. The shared $r_t^{(h)}$ cancels, so

$$\bigl|(\mathcal{B}_\theta^{N,H}Q_1)(s_t,a_{t:t+h})-(\mathcal{B}_\theta^{N,H}Q_2)(s_t,a_{t:t+h})\bigr|\;=\;\gamma^h\,\Bigl|\mathbb{E}\bigl[\max_{n,k}Q_1(s',\cdot)\bigr]-\mathbb{E}\bigl[\max_{n,k}Q_2(s',\cdot)\bigr]\Bigr|.$$

By Jensen's inequality and Lemma [G.12](https://arxiv.org/html/2605.11009#A7.Thmtheorem12) applied pointwise to each realized proposal sample,

$$\Bigl|\mathbb{E}\bigl[\max Q_1\bigr]-\mathbb{E}\bigl[\max Q_2\bigr]\Bigr|\;\leq\;\mathbb{E}\Bigl[\max_{n,k}\bigl|Q_1(s',\cdot)-Q_2(s',\cdot)\bigr|\Bigr]\;\leq\;\|Q_1-Q_2\|_\infty.$$

Hence $|(\mathcal{B}_\theta^{N,H}Q_1)-(\mathcal{B}_\theta^{N,H}Q_2)|\leq\gamma^h\|Q_1-Q_2\|_\infty\leq\gamma\,\|Q_1-Q_2\|_\infty$ pointwise. A sup-norm contraction on the complete space of bounded $Q$-functions has a unique fixed point by the Banach fixed-point theorem; we denote it $Q_\theta^{N,H}$. ∎
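The contraction can also be checked mechanically on a toy tabular problem. The sketch below is our illustration with arbitrary sizes: $H=2$ and a degenerate proposal distribution of $N=3$ fixed chunks per state, so the expectation in Equation (19) is trivial.

```python
import numpy as np

rng = np.random.default_rng(1)

S, A, H, N, gamma = 6, 3, 2, 3, 0.9
f = rng.integers(S, size=(S, A))          # deterministic transitions s' = f(s, a)
r = rng.random((S, A))                    # bounded rewards in [0, 1)
props = rng.integers(A, size=(S, N, H))   # N fixed proposal chunks per state

def backup(Qs):
    """Apply B_theta^{N,H} once; Qs = (Q1, Q2) are the length-1/2 prefix tables."""
    Q1, Q2 = Qs
    boot = np.array([max(max(Q1[s, c[0]], Q2[s, c[0], c[1]]) for c in props[s])
                     for s in range(S)])   # max over the NH candidate prefixes
    B1, B2 = np.empty_like(Q1), np.empty_like(Q2)
    for s in range(S):
        for a1 in range(A):
            s1 = f[s, a1]
            B1[s, a1] = r[s, a1] + gamma * boot[s1]
            for a2 in range(A):
                B2[s, a1, a2] = (r[s, a1] + gamma * r[s1, a2]
                                 + gamma**2 * boot[f[s1, a2]])
    return B1, B2

def supdist(Qa, Qb):
    """Sup-norm distance over S x A^{<=2}."""
    return max(np.abs(x - y).max() for x, y in zip(Qa, Qb))

Qa = (rng.normal(size=(S, A)), rng.normal(size=(S, A, A)))
Qb = (rng.normal(size=(S, A)), rng.normal(size=(S, A, A)))
print(supdist(backup(Qa), backup(Qb)) <= gamma * supdist(Qa, Qb))  # True
```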
###### Definition G.14 (Critic-induced extraction policy).
For any bounded $Q$, the extraction policy $\pi_\star^Q$ acts at state $s_t$ as follows. Draw $a^{(1)}_{t:t+H},\ldots,a^{(N)}_{t:t+H}\stackrel{\mathrm{i.i.d.}}{\sim}\pi_\theta^H(\cdot\mid s_t)$, set

$$(n^\star,h^\star)\;=\;\arg\max_{n\in[N],\,h\in[H]}Q\!\left(s_t,\,a^{(n)}_{t:t+h}\right), \tag{22}$$

and execute the prefix $a^{(n^\star)}_{t:t+h^\star}$, treating it as one temporally extended action. We write $\pi_\star:=\pi_\star^{Q_\theta^{N,H}}$ for the idealized adaptive-prefix policy induced by the operator fixed point; the deployed implementation in Section [4.3](https://arxiv.org/html/2605.11009#S4.SS3) applies the same extraction rule to the learned critic $Q_\phi$.
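In code, the extraction rule is a single joint argmax over an $N\times H$ table of prefix values. The sketch below is our illustration; `policy.sample` and `critic` are hypothetical stand-ins for the flow BC proposal policy and a prefix-consistent critic that returns all $H$ prefix values per chunk in one forward pass.

```python
import torch

@torch.no_grad()
def extract_prefix(s, policy, critic, N):
    """Extraction rule of Equation (22): joint argmax over (n, h)."""
    chunks = policy.sample(s, N)                 # (N, H, act_dim) proposal chunks
    q = critic(s.expand(N, -1), chunks)          # (N, H) prefix Q-values
    flat = q.argmax().item()                     # first max wins ties, matching
    n_star, h_star = divmod(flat, q.shape[1])    # the fixed rule of Section G.1
    return chunks[n_star, : h_star + 1]          # one temporally extended action
```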
###### Theorem G.15 (Fixed-point identity).
$Q_\theta^{N,H}$ is the action-value function of $\pi_\star$ when each selected prefix is treated as one temporally extended action; in particular $Q_\theta^{N,H}=Q^{\pi_\star}$.
###### Proof.
Equation [22](https://arxiv.org/html/2605.11009#A7.E22) gives, for any draw of the $N$ proposals, $Q_\theta^{N,H}(s_t,a^{(n^\star)}_{t:t+h^\star})=\max_{n\in[N],\,k\in[H]}Q_\theta^{N,H}(s_t,a^{(n)}_{t:t+k})$. Taking the expectation over the proposal draws, the inner expected max in Equation [19](https://arxiv.org/html/2605.11009#A7.E19) equals $\mathbb{E}_{\pi_\star}[Q_\theta^{N,H}(s_t,\cdot)]$, so the fixed-point equation $Q_\theta^{N,H}=\mathcal{B}_\theta^{N,H}Q_\theta^{N,H}$ reads

$$Q_\theta^{N,H}(s_t,a_{t:t+h})\;=\;r_t^{(h)}+\gamma^h\,\mathbb{E}_{a'\sim\pi_\star(\cdot\mid s_{t+h})}\bigl[Q_\theta^{N,H}(s_{t+h},a')\bigr].$$

This is the variable-horizon Bellman equation for $\pi_\star$ and uniquely identifies its action-value function by the same contraction argument as Theorem [G.13](https://arxiv.org/html/2605.11009#A7.Thmtheorem13), with the deterministic selection $\pi_\star$ in place of the joint max. We may therefore write the fixed point as $Q^{\pi_\star}$ in the rest of the appendix. ∎
###### Proposition G.16 (Critic targets are Monte Carlo Bellman samples).
Fix any bounded $Q:\mathcal{S}\times\mathcal{A}^{\leq H}\to\mathbb{R}$ and define the per-horizon target

$$\hat{G}_h(Q)\;:=\;r_t^{(h)}\;+\;\gamma^h\max_{n\in[N],\,k\in[H]}Q\!\left(s_{t+h},\,\tilde{a}^{(n)}_{t+h:t+h+k}\right), \tag{23}$$

with $\tilde{a}^{(1)}_{t+h:t+h+H},\ldots,\tilde{a}^{(N)}_{t+h:t+h+H}\stackrel{\mathrm{i.i.d.}}{\sim}\pi_\theta^H(\cdot\mid s_{t+h})$. $\hat{G}_h(Q)$ is the explicit-max sample form of the body's bootstrap target $G_h$ in Equation [5](https://arxiv.org/html/2605.11009#S4.E5), with $\pi_\star$ replaced by the joint $\arg\max$ and the bootstrap critic treated as a free argument; the body's $G_h(s_t,a_{t:t+h})$ corresponds to $\hat{G}_h(Q_{\bar{\phi}})$ at this conditioning point. Under Assumption [G.1](https://arxiv.org/html/2605.11009#A7.Thmtheorem1),

$$\mathbb{E}\!\left[\hat{G}_h(Q)\,\middle|\,s_t,a_{t:t+h}\right]\;=\;(\mathcal{B}_\theta^{N,H}Q)(s_t,a_{t:t+h}), \tag{24}$$

and the conditional squared-error loss decomposes as

$$\mathbb{E}\!\left[\bigl(Q_\phi(s_t,a_{t:t+h})-\hat{G}_h(Q)\bigr)^2\,\middle|\,s_t,a_{t:t+h}\right]\;=\;\bigl(Q_\phi(s_t,a_{t:t+h})-(\mathcal{B}_\theta^{N,H}Q)(s_t,a_{t:t+h})\bigr)^2+\mathrm{Var}\!\left[\hat{G}_h(Q)\,\middle|\,s_t,a_{t:t+h}\right]. \tag{25}$$

Hence the empirical critic loss in Equation [6](https://arxiv.org/html/2605.11009#S4.E6) is a Monte Carlo regression toward the Bellman backup $\mathcal{B}_\theta^{N,H}Q$, not an exact squared Bellman residual for any single sampled target. In the tabular exact-expectation regime, fitted iteration with $\mathcal{B}_\theta^{N,H}$ converges to $Q_\theta^{N,H}$.
###### Proof.
Equation [24](https://arxiv.org/html/2605.11009#A7.E24) follows from Definition [G.11](https://arxiv.org/html/2605.11009#A7.Thmtheorem11), since under Assumption [G.1](https://arxiv.org/html/2605.11009#A7.Thmtheorem1) the rewards $\{r_{t+\tau}\}_{\tau<h}$ summed in $r_t^{(h)}$ are deterministic given $(s_t,a_{t:t+h})$, and the only randomness in $\hat{G}_h(Q)$ comes from the proposal sampling that defines $\mathcal{B}_\theta^{N,H}Q$. The decomposition in Equation [25](https://arxiv.org/html/2605.11009#A7.E25) is the standard bias-variance identity for the squared error of a deterministic prediction $Q_\phi(s_t,a_{t:t+h})$ against a stochastic target $\hat{G}_h(Q)$ with mean $(\mathcal{B}_\theta^{N,H}Q)(s_t,a_{t:t+h})$. ∎
### G.5 Per-Horizon Loss Averaging Stabilizes Updates
The multi-step targets $G_h$ in Equation [6](https://arxiv.org/html/2605.11009#S4.E6) accumulate rewards over $h$ steps, and the variance of this cumulative sum tends to grow with $h$. ACSAC averages all $H$ per-horizon squared losses at the gradient level, in line with T-SAC's gradient-level averaging [T-SAC] and TOP-ERL's multi-horizon Transformer supervision [TOP-ERL]. A standard variance-of-mean argument explains this design choice.
#### Cumulative-reward variance.
For the cumulative-reward part of $G_h$, expansion gives

$$\mathrm{Var}\!\left[r_t^{(h)}\right]\;=\;\sum_{i=0}^{h-1}\sum_{j=0}^{h-1}\gamma^{i+j}\,\mathrm{Cov}(r_{t+i},\,r_{t+j}), \tag{26}$$

which typically increases with $h$, especially when $\gamma$ is close to $1$ and per-step rewards are positively correlated.
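The growth is easy to see numerically. The toy snippet below (our illustration, arbitrary constants) draws positively equicorrelated per-step rewards and prints the variance of $r_t^{(h)}$ for $h=1,\ldots,H$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Equation (26) with equicorrelated rewards (rho = 0.5) and gamma near 1:
# Var[r_t^{(h)}] grows with the prefix length h.
H, gamma, rho, n_mc = 8, 0.99, 0.5, 100_000
cov = rho * np.ones((H, H)) + (1 - rho) * np.eye(H)
rewards = rng.standard_normal((n_mc, H)) @ np.linalg.cholesky(cov).T
prefix = np.cumsum(gamma ** np.arange(H) * rewards, axis=1)  # r_t^{(h)}, h = 1..H
print(np.round(prefix.var(axis=0), 2))  # monotonically increasing in h
```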
#### Per-horizon gradient.
Let $\delta_h(\phi):=Q_\phi(s_t,a_{t:t+h})-G_h$ denote the per-horizon Bellman residual, and let $g_h:=\delta_h(\phi)\,\nabla_\phi Q_\phi(s_t,a_{t:t+h})$ denote its contribution to $\nabla_\phi\mathcal{L}(\phi)$. The averaged loss $\frac{1}{H}\sum_h\delta_h^2$ in Equation [6](https://arxiv.org/html/2605.11009#S4.E6) produces (up to the conventional factor of $2$, treating the target $G_h$ as constant) the gradient $\bar{g}:=\frac{1}{H}\sum_{h=1}^{H}g_h$; we write $\tilde{g}_h:=g_h-\mathbb{E}[g_h]$ for its centered version.
###### Lemma G.18 (Variance reduction by averaging).
Suppose the centered per-horizon gradients satisfy $\mathbb{E}[\|\tilde{g}_h\|^2]\leq\sigma^2$ for every $h$ and $\mathbb{E}[\langle\tilde{g}_h,\tilde{g}_{h'}\rangle]\leq\rho\,\sigma^2$ for some $\rho\in[-1/(H-1),1]$ and every $h\neq h'$. Then the centered average $\bar{\tilde{g}}:=\frac{1}{H}\sum_h\tilde{g}_h$ satisfies

$$\mathrm{Var}(\bar{g})\;:=\;\mathbb{E}\!\left[\bigl\|\bar{\tilde{g}}\bigr\|^2\right]\;\leq\;\sigma^2\!\left[\rho+\frac{1-\rho}{H}\right]. \tag{27}$$

Whenever $\rho<1$, this is strictly less than $\sigma^2$, the variance bound for any single $g_h$.
###### Proof.
Expand $\|\bar{\tilde{g}}\|^2=H^{-2}\sum_{h,h'}\langle\tilde{g}_h,\tilde{g}_{h'}\rangle$ and take expectations. The diagonal terms contribute at most $H^{-2}\cdot H\sigma^2=\sigma^2/H$, and the off-diagonal terms contribute at most $H^{-2}\cdot H(H-1)\rho\sigma^2$; summing the two gives the right-hand side of Equation [27](https://arxiv.org/html/2605.11009#A7.E27). ∎
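A quick Monte Carlo check of the bound (our illustration, arbitrary toy constants): for equicorrelated Gaussian gradients the bound holds with equality, so the estimated variance of the average should match $\sigma^2[\rho+(1-\rho)/H]$ up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(2)

# H "gradients" with common second moment sigma^2 = 1 and pairwise correlation rho.
H, d, rho, n_mc = 8, 16, 0.3, 200_000
cov = rho * np.ones((H, H)) + (1 - rho) * np.eye(H)
L = np.linalg.cholesky(cov)
z = rng.standard_normal((n_mc, H, d)) / np.sqrt(d)   # E||z_h||^2 = 1 per horizon
g = np.einsum('hk,nkd->nhd', L, z)                   # E<g_h, g_h'> = cov[h, h']

var_single = np.mean(np.sum(g[:, 0] ** 2, axis=-1))  # ~ 1.0 = sigma^2
var_avg = np.mean(np.sum(g.mean(axis=1) ** 2, axis=-1))
print(var_single, var_avg, rho + (1 - rho) / H)      # var_avg ~ 0.3875 = bound
```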
#### Beyond the deterministic and exact-expectation case.
The four results above are statements about action-level mismatch, prefix Q-value semantics, and an operator-level Bellman backup in a deterministic MDP. Three further directions are listed here as informal pointers. First, $H=1$ recovers EMaQ over single actions [EMaQ], and the suboptimality of $Q_\theta^{N,H}$ relative to the unrestricted variable-horizon optimum is governed by the standard EMaQ-style proposal coverage analysis. Second, with finite critic and proposal approximation, the prefix-selection regret degrades by margin terms that scale linearly with the critic and proposal errors. Third, in stochastic MDPs the open-loop chunk return is no longer deterministic, and the resulting nominal-versus-actual gap is the open-loop consistency issue studied by DQC [DQC]; we therefore state the main proofs in the deterministic setting and treat the stochastic extension as a separate issue. None of these is required for the four results stated above.