Revisiting Adam for Streaming Reinforcement Learning
Summary
This paper revisits the Adam optimizer for streaming reinforcement learning, demonstrating that established methods like DQN and C51 perform well when properly tuned. The authors propose Adaptive Q(lambda), which combines eligibility traces with Adam's variance adaptation to surpass existing streaming RL methods on 55 Atari games.
# Revisiting Adam for Streaming Reinforcement Learning
Source: [https://arxiv.org/html/2605.06764](https://arxiv.org/html/2605.06764)
Florin Gogianu, Adrian Catalin Lutu, Razvan Pascanu
Keywords: streaming RL, Adam, optimisation, eligibility traces, distributional RL
Contribution(s)

1. We highlight the effectiveness of Adam and established objective functions in the streaming RL setting. Context: Recent literature suggests that the severe non-stationarity of streaming environments requires specialized algorithms. We contrast this by adapting DQN and C51 to the streaming setup, showing that when appropriately tuned, standard optimisation techniques can be surprisingly effective.
2. We establish variance-adjusted optimisation algorithms, together with carefully tuned $\varepsilon$-values, to be important components in streaming RL agents. Context: We provide mechanistic insights into the role of variance-adjusted updates and Adam's epsilon value. We bring evidence that large $\varepsilon$ values can act as a filter for sparse and noisy gradient components, allowing for more stable learning dynamics.
3. We propose Adaptive Q($\lambda$). Context: AQ($\lambda$) combines eligibility traces with the variance adaptation mechanism of Adam and a bounded error signal, resulting in an algorithm which improves upon the performance of existing streaming RL methods.
4. We perform a large empirical evaluation in the streaming RL setting, extensively benchmarking StreamQ alongside DQN and C51 across 55 Atari games and setting new performance expectations for these algorithms. Context: To provide a rigorous demonstration of our algorithmic contributions, the 55 Atari games were selected and ordered based on the performance correlation framework established by aitchison2023atari5.
###### Abstract
Learning from a sequence of interactions, as soon as observations are perceived and acted upon, without explicitly storing them, holds the promise of simpler, more efficient and adaptive algorithms. For over a decade, however, deep reinforcement learning walked the contrary path, augmenting agents with replay buffers or parallel sampling routines, in an effort to tame learning instability. Recently, this topic has been revisited by elsayed2024streamin, focusing on update computation through eligibility traces and modifications to the optimisation routine, resulting in the StreamQ algorithm. In this work we take a step back, investigating the efficacy of established updates, such as those implemented by DQN and C51, within this online setting. Not only do we find that they perform well, but through analysing how the optimisation algorithm generally, and Adam in particular, interacts with these updates, we contend that two properties are essential for robust performance: i) the derivative of the objective is to be bounded and ii) weight updates are variance-adjusted. Rigorous and exhaustive experimentation demonstrates that C51, which exhibits both characteristics, is competitive with StreamQ across a subset of 55 Atari games. Using these insights, we derive a variance-adjusted algorithm based on eligibility traces, termed Adaptive Q($\lambda$), which approaches double the human baseline on the same subset, surpassing existing methods by all performance metrics.
## 1 Introduction
Much of the performance gains in deep reinforcement learning (DRL) over the past decade can be traced back to scaling the amount of off-policy data the optimisation routine can use during training. Ever-increasing experience replay buffers or massively parallel sampling of the environment allow stochastic gradient descent algorithms originally developed for empirical risk minimisation to play to their strengths, at the cost of large memory, sample and algorithmic complexity. It became rather the rule that we tackled environments that fitted in 4 KB of ROM with agents consisting of 7 GB replay buffers and millions of parameters in an effort to meet the stochastic gradient descent requirements for stability (bowling2025talk).


Figure 1: IQM and mean human-normalised score on 55 Atari games. All agents learn solely from the current transition, without experience replay or target network. C51 and AQ($\lambda$) demonstrate the usefulness of bounded objectives and variance-adjusted optimisation.

In contrast, a few voices in the community have started to argue for developing computationally bounded agents, with a reduced complexity compared to that of the environment (javed2024bigWorld; bowling2025rethinking; lewandowski2025bigWorld). This means moving away from learning with batches sampled from pseudo-stationary buffers and towards learning from one interaction at a time, in a streaming fashion, without storing data or leveraging parallel environments.
Arguably, a major desideratum for this vision is the development of new RL objectives and optimisation algorithms that learn stably and efficiently purely from sequential data. Inroads have been opened by seijen2016trueOnlineTD; vanHasselt2014trueGTDlmbda; javed2024swiftTD, and recent results by elsayed2024streamin; elelimy2025gradient suggest a way forward even for historically difficult control benchmarks such as the Arcade Learning Environment (ALE). These recent breakthroughs in the online setting have been attributed to the resurgence of methods based on eligibility traces (sutton1988tdlmbda; sutton2018rlbook), such as those of vanHasselt2014trueGTDlmbda; mahmood2015wisTrace; white2016greedy; vanHasselt2021expected.
Indeed, eligibility traces have long held the promise of solving the central problem of propagating present or future rewards to past states and actions by updating value functions at every transition using multistep returns. Improved temporal credit assignment would in turn unlock lower sample complexity and help close the gap to batch-RL algorithms (we follow the terminology in elsayed2024streamin; batch-RL is not to be confused with offline-RL). elsayed2024streamin and elelimy2025gradient both highlight the importance of eligibility traces among the other algorithmic novelties they employ, such as sparse initialisation of weights, novel regularized objectives and optimisers, as well as liberal use of normalisation for observations, rewards and activations, in breaking the “streaming barrier” towards an efficient online learner.
It has been remarked, however, that eligibility trace methods bear more than a passing similarity to stochastic gradient descent with momentum (SGD-M) (polyak1964sgdm), as recently as in elsayed2024streamin. Furthermore, optimisation algorithms that also implement running averages of the derivative of the objective, such as Adaptive Moment Estimation (Adam) (kingma2015adam) and, to a lesser extent, Root Mean Square Propagation (RMSProp) (tieleman2012rmsprop), have been critical to training reinforcement learning (RL) agents with neural-network function approximation. In this light, we aim to revisit the following questions:
Q1: Are objectives and optimisers developed for batch reinforcement learning competitive in the online setting?
Deploying a careful and extensive empirical protocol, we find that objectives, update rules and optimisation routines commonly used in batch-RL yield surprisingly strong results once adapted to the streaming setting. These strong empirical findings should contribute on their own to shaping the discussion around the somewhat nascent field of streaming deep reinforcement learning. They hint towards the generality and robustness of TD(0) methods when coupled with Adam-style updates, and should establish stronger baselines for the development of new streaming algorithms.
Q2: What explains the performance of TD(0) methods and Adam-style updates?
Key for obtaining these results is the peculiar set of hyperparameters we settled on for Adam. Our top configurations invariably required a coefficient of the running average of the gradient close to 1.0 and a very large value for the numerical stability term. In section [4](https://arxiv.org/html/2605.06764#S4) we test several hypotheses that could explain this choice of values. Similarly, we notice a preference for algorithms that have a bounded derivative of the objective, as is the case for Categorical DQN (C51) and some Deep Q-Network (DQN) versions.
Finally, we distil these observations and our analysis into an algorithm based on eligibility traces, which we term Adaptive Q($\lambda$), that improves on existing methods. To conclude this section we summarise the contributions we bring:
- The unreasonable effectiveness of Adam. We show evidence that objectives and optimisers developed in the batch-RL setting are very strong performers in the streaming protocol once they are well tuned.
- Mechanistic insights. We identify and discuss several important properties for streaming RL: the variance-adjusted updates based on long gradient histories, the role of bounded derivatives of the objective, and the large effects of $\varepsilon$ on performance.
- Adaptive Q($\lambda$). We propose a new eligibility-trace update rule that improves on existing methods.
- Extensive benchmarking. We present a thorough empirical study, evaluating DQN and C51, which we adapt to the streaming setting, alongside StreamQ, on 55 Atari games.
## 2 Background
This work is set in the classical episodic reinforcement learning setting and, for convenience, follows the conventions defined by elsayed2024streamin, elelimy2025gradient and others: an *agent* interacting with an *environment* generates a time-indexed sequence $S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_T$ by following its state-conditioned action-generating behaviour, the policy $A_t \sim \pi(\cdot \mid S_t)$. Its objective is to maximise the discounted sum of future rewards, the return $G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$, with $\gamma$ controlling how much weight to put on the most immediate rewards as opposed to the ones received later in the trajectory. It does so by estimating the expected return of being in state $S_t = s$ and following the policy $\pi$, which we call the value function $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$. When the estimator of $v_\pi$ uses function approximation, we denote it $v(s, \bm{w})$, with $\bm{w}$ a parameter vector. Generally, we care not only about estimating the utility of a state, also known as the *value prediction* problem, but also about its counterpart, the *control* problem, with the corresponding action-value function $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ and its function approximation equivalent $q(s, a, \bm{w})$. Optimising the objective in the control problem is then a matter of finding the optimal policy $\pi^\star$ that maximises the state-action value function: $q_{\pi^\star}(s, a) = \max_\pi q_\pi(s, a)$.
#### Temporal difference learning\.
The straightforward approach to estimating the expected discounted sum of rewards is to wait till the end of the episode and compute it for every preceding state. This estimate of the return makes for a learning target towards which we can adjust the current value function: $\bm{w}_{t+1} \doteq \bm{w}_t + \eta \left(G_t - v(S_t, \bm{w}_t)\right) \nabla_{\bm{w}} v(S_t, \bm{w}_t)$. The downside of Monte-Carlo (MC) methods is that the policy remains unchanged until the end of the episode. One way towards more frequent updates is to compute a learning target based on the estimator we are training, evaluated at the very next state, and the reward we received by taking an action at the current step: $R_{t+1} + \gamma v(S_{t+1}, \bm{w})$. The quantity to minimise then becomes the temporal-difference (TD) error, $\delta \doteq R_{t+1} + \gamma v(S_{t+1}, \bm{w}) - v(S_t, \bm{w})$, of which we will make heavy use in this work. For completeness, the TD error relevant for the control problem is $\delta \doteq R_{t+1} + \gamma \max_a q(S_{t+1}, a, \bm{w}) - q(S_t, A_t, \bm{w})$. Because the target uses the prediction at the immediate next state $S_{t+1}$, we call it a TD(0) method. But we can bootstrap the learning target farther into the future, giving way to $n$-step methods, where the return is estimated by $G_{t:t+n} \doteq \sum_{k=0}^{n-1} \gamma^{k} R_{t+k+1} + \gamma^{n} v(S_{t+n}, \bm{w})$. Choosing $n$ allows for interpolating between the two main estimators so far: the one-step look-ahead in TD(0) and MC.
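As a minimal sketch of the learning targets defined above, the following Python functions compute the one-step TD error, its control counterpart and the $n$-step return from plain floats. The function and argument names are illustrative assumptions, not taken from the paper's code.

```python
def td_error(reward, gamma, v_next, v_curr):
    """One-step TD error: delta = R_{t+1} + gamma * v(S_{t+1}) - v(S_t)."""
    return reward + gamma * v_next - v_curr


def control_td_error(reward, gamma, q_next, q_curr):
    """TD error for control; q_next lists q(S_{t+1}, a) for every action a."""
    return reward + gamma * max(q_next) - q_curr


def n_step_return(rewards, gamma, v_bootstrap):
    """G_{t:t+n} for rewards = [R_{t+1}, ..., R_{t+n}], v_bootstrap = v(S_{t+n})."""
    n = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** n * v_bootstrap
```

With a single reward, `n_step_return` reduces to the TD(0) target; with all rewards up to the end of the episode and `v_bootstrap = 0`, it reduces to the MC return.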
#### On $\lambda$-returns.
The vast majority of batch-RL algorithms are squarely set in the *forward*-view we just described. For a given state the agent “looks” forward in time at the future rewards received by the current policy and decides how to update the value estimate based on them. This works out nicely for TD(0) and small values of $n$ for $n$-step return methods, but quickly becomes tricky to implement efficiently for other multistep methods, especially in conjunction with resampling strategies (daley2019reconciling). One such alternative is the $\lambda$-return, an estimator of the return that further balances the bias-variance tradeoff by computing a weighted average over all $n$-step returns along a trajectory, $G^{\lambda}_t \doteq (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t$. Setting $\lambda = 0$ recovers TD(0), while with $\lambda = 1$ the estimator becomes the MC return. This estimator is rarely encountered in pure value-based methods with neural network approximation because it requires recomputing each $n$-step return every time.
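The forward-view definition above can be sketched directly in Python. This is an illustrative implementation under the assumption that the whole trajectory from step $t$ to termination is available; the names `lambda_return`, `rewards` and `values` are hypothetical.

```python
def lambda_return(rewards, values, gamma, lam):
    """Forward-view lambda-return G^lambda_t.

    rewards = [R_{t+1}, ..., R_T]
    values  = [v(S_{t+1}), ..., v(S_{T-1})]  (one bootstrap value per n-step return)
    """
    horizon = len(rewards)                      # T - t
    assert len(values) == horizon - 1
    g_lambda = 0.0
    partial = 0.0                               # running discounted reward sum
    for n in range(1, horizon):
        partial += gamma ** (n - 1) * rewards[n - 1]
        g_n = partial + gamma ** n * values[n - 1]          # G_{t:t+n}
        g_lambda += (1 - lam) * lam ** (n - 1) * g_n
    mc_return = partial + gamma ** (horizon - 1) * rewards[-1]  # G_t
    return g_lambda + lam ** (horizon - 1) * mc_return
```

Calling it with `lam=0.0` returns the one-step target and with `lam=1.0` the MC return, matching the two limits stated in the text. Note that every call walks the full remaining trajectory, which is the cost the backward view avoids.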
#### Eligibility traces\.
$\lambda$-returns can however be efficiently implemented if we take on the *backward*-view, where each update to the value function depends on the current TD-error and some running statistic of past events. The key insight is to have a *trace* vector $\bm{z}_t$ that mirrors and is the size of $\bm{w}$. While $\bm{w}$ stores the long-term coefficients required for estimating $v_\pi(s)$, the role of $\bm{z}$ is to record whenever a component of $\bm{w}$ was sensitive in producing an estimate. This record of the parameter “activity” is decayed towards 0 by $\lambda$ and reset in terminal states. If the TD-error is non-zero, then $\bm{w}$ will be updated according to the values of the error and the corresponding components of the eligibility vector $\bm{z}$. Specifically, if we denote the sensitivity of the value function with respect to the weights $\bm{g}_t \doteq \nabla_{\bm{w}} v(S_t, \bm{w})$ and initialise $\bm{z}_0 \doteq \bm{0}$, then the update becomes:

$$
\begin{aligned}
\bm{z}_t &\doteq \gamma\lambda \bm{z}_{t-1} + \bm{g}_t && \text{(1)}\\
\bm{w}_{t+1} &\doteq \bm{w}_t + \eta \delta_t \bm{z}_t && \text{(2)}
\end{aligned}
$$

With linear function approximation the eligibility trace update in Eq. ([2](https://arxiv.org/html/2605.06764#S2.E2)) can be equivalent to those of forward-view algorithms implementing $\lambda$-returns (sutton1988tdlmbda; seijen2014trueOnlineTDlmbda).
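To make Eqs. (1) and (2) concrete, here is a minimal sketch of one backward-view step for a linear value function $v(s, \bm{w}) = \bm{w}^\top \bm{x}(s)$, for which $\bm{g}_t$ is simply the feature vector. The function name and the list-based representation are assumptions made for illustration; this is not the StreamQ or AQ($\lambda$) update.

```python
def td_lambda_step(w, z, x, x_next, reward, gamma, lam, eta, terminal=False):
    """One accumulating-trace TD(lambda) update with linear features.

    w, z    : weight and trace vectors (lists of floats, same length)
    x, x_next : feature vectors of S_t and S_{t+1}
    Returns the new (w, z) and the TD error delta.
    """
    v_curr = sum(wi * xi for wi, xi in zip(w, x))
    v_next = 0.0 if terminal else sum(wi * xi for wi, xi in zip(w, x_next))
    delta = reward + gamma * v_next - v_curr
    # Eq. (1): decay the trace and add the current gradient (g_t = x for linear v).
    z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]
    # Eq. (2): move every weight in proportion to its eligibility.
    w = [wi + eta * delta * zi for wi, zi in zip(w, z)]
    if terminal:
        z = [0.0] * len(z)                      # traces are reset at episode end
    return w, z, delta
```

Each step touches only the current transition and two vectors of the size of $\bm{w}$, which is what makes the backward view attractive for streaming learning.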
#### Objectives\.
Generally, in the forward-view, we set the TD-error objective for minimisation by formulating it as a mean squared error (MSE) function, $\mathcal{L}(\bm{w}) \doteq \delta^2$, and optimising it with semi-gradient descent. Very often in practice MSE is replaced by SmoothL1Loss, $L_\kappa(\delta) = \mathbb{1}_{|\delta| < \kappa} \left(0.5\, \delta^2 / \kappa\right) + \mathbb{1}_{|\delta| \geq \kappa} \left(|\delta| - 0.5\, \kappa\right)$. For $\kappa = 1$ it smoothly transitions from the squared loss to the mean absolute error as the magnitude of $\delta$ increases.
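The practical difference between the two objectives shows up in their derivatives with respect to $\delta$: the MSE derivative grows linearly with the error, while the SmoothL1Loss derivative saturates at $\pm 1$. The sketch below implements the formula above; the function names are illustrative.

```python
def mse_grad(delta):
    """d/d(delta) of delta**2: unbounded, grows linearly with the TD error."""
    return 2.0 * delta


def smooth_l1(delta, kappa=1.0):
    """SmoothL1Loss: quadratic for |delta| < kappa, linear otherwise."""
    if abs(delta) < kappa:
        return 0.5 * delta * delta / kappa
    return abs(delta) - 0.5 * kappa


def smooth_l1_grad(delta, kappa=1.0):
    """d/d(delta) of SmoothL1Loss: delta / kappa inside, sign(delta) outside."""
    if abs(delta) < kappa:
        return delta / kappa
    return 1.0 if delta > 0 else -1.0
```

For a TD error of 100, `mse_grad` returns 200 while `smooth_l1_grad` returns 1, so a single outlier target cannot dominate the update under the bounded objective.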
#### Variance\-adjusted optimisation methods\.
Most consequential to achieving the first strong results on challenging control problems using neural networks was the use of a new class of optimisation algorithms that divides the derivative of the objective function, or an estimate of its first moment, by an estimate of the second moment. An early example was the use of RMSprop (tieleman2012rmsprop) for successfully training DQN (mnih2015dqn). To our knowledge, bellemare2017distributional introduces the use of Adam in DQN-style methods, and hessel2017rainbow; ceron2021revisiting confirm its advantages over RMSprop on a variety of agents. In this work we refer to algorithms that use the second moment of the gradient to scale the update as *variance-adjusted methods*. Adam in particular approximates the first and second moments using $\beta$-normalised exponential moving averages. Ignoring bias-correction and letting $\bm{g}_t \doteq \nabla_{\bm{w}} \delta_t^2$, the update is:

$$
\begin{aligned}
\bm{m}_t &\doteq \beta_0 \bm{m}_{t-1} + (1-\beta_0)\, \bm{g}_t, & \bm{\rho} &\doteq \bm{m}_t / (\sqrt{\bm{v}_t} + \varepsilon) && \text{(3)}\\
\bm{v}_t &\doteq \beta_1 \bm{v}_{t-1} + (1-\beta_1)\, \bm{g}_t^2, & \bm{w}_t &\doteq \bm{w}_{t-1} - \eta \bm{\rho} && \text{(4)}
\end{aligned}
$$

Comparing to Eq. ([2](https://arxiv.org/html/2605.06764#S2.E2)), we notice that both compute a running statistic of the gradient with respect to $\bm{w}$. However, whereas $\bm{z}$ accumulates the derivative of $v(S_t, \bm{w})$ and is reset periodically, $\bm{m}$ is a long-running average of the derivative of the objective and is generally not reset. While both are running statistics, their roles and dynamics are different. The eligibility trace is intended to accumulate the signal and tends to increase in magnitude, dampen oscillations and speed up learning, not unlike SGD-M (polyak1964sgdm; rumelhart1986backprop), which was already pointed out by elsayed2024streamin. Adam's EMA instead approximates the expected value of the gradients and tends to stay at the same scale as the signal.
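Eqs. (3) and (4) translate almost line by line into code. The following is a minimal sketch of one Adam step without bias correction, as in the equations above, operating on plain Python lists; the function name and argument order are assumptions for illustration.

```python
import math


def adam_step(w, m, v, g, eta, beta0, beta1, eps):
    """One Adam update without bias correction, following Eqs. (3) and (4).

    w, m, v, g are lists of floats of equal length (weights, first-moment
    EMA, second-moment EMA, current gradient). Returns the new (w, m, v).
    """
    m = [beta0 * mi + (1.0 - beta0) * gi for mi, gi in zip(m, g)]
    v = [beta1 * vi + (1.0 - beta1) * gi * gi for vi, gi in zip(v, g)]
    rho = [mi / (math.sqrt(vi) + eps) for mi, vi in zip(m, v)]
    w = [wi - eta * ri for wi, ri in zip(w, rho)]
    return w, m, v
```

Unlike the trace $\bm{z}$ in Eq. (1), the averages `m` and `v` are never reset here, and the factors `(1 - beta0)` and `(1 - beta1)` keep them on the scale of the gradient rather than letting them accumulate.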
#### Distributional RL\.
Rather than estimating only the expected return, distributional reinforcement learning (bellemare2017distributional) models the full distribution of the return $G_t$, represented by the random variable $Z(s, a, \bm{w})$ such that $q(s, a, \bm{w}) \doteq \mathbb{E}[Z(s, a, \bm{w})]$. One way to approximate it is using a discrete set of $K$ fixed atoms $\{z_i\}_{i=1}^{K}$ spanning a specific range of possible returns, where the model estimates the probability mass $p_i(s, a, \bm{w}) \doteq \Pr\left(Z(s, a, \bm{w}) = z_i\right)$ associated with each atom. The learning objective then is to minimize the Kullback-Leibler divergence between the current distribution and a target distribution formed by the sample $R_{t+1} + \gamma Z(S_{t+1}, A^{\star}, \bm{w})$, where $A^{\star} \doteq \arg\max_a q(S_{t+1}, a, \bm{w})$, yielding the C51 algorithm.
## 3 Empirical setup
#### Adapting old algorithms to new setups\.
We closely follow the setup introduced by elsayed2024streamin, using the environment and normalisation wrappers provided by the authors, with no modification, for all the algorithms studied here. Our implementation of StreamQ closely follows the reference, and we were able to confirm that it performs on par with the original work in MinAtar and the eight Atari environments the authors originally evaluated it on.
We adapt DQN and C51 to the online setting by removing the target network and the replay buffer. Distributional algorithms such as C51 (bellemare2017distributional) often have a much larger final layer, of size $A \times K$, where $A$ is the number of actions and $K$ is the number of bins or quantiles. In order to compare estimators of the same capacity, in all our experiments we scale the layer before the output for non-distributional methods such that the number of weights is similar for all algorithms.
Table 1: Changes to DQN and C51 for the online setting.

For clarification, we note here that elsayed2024streamin also adapt DQN to the streaming setup by setting the replay buffer to a size of 1, along with an update frequency of 1, and call it DQN1. We will refer to this agent several times in this work. In comparison, our DQN implementation also removes the target network, and uses the same architecture and initialisation as StreamQ and the same environment wrappers.
#### Benchmarks and hyperparameter tuning sets\.
For our empirical study we settle for MinAtar (young2019minatar) and ALE (bellemare2013ale), two of the better established benchmarks in discrete control. For the MinAtar experiments we always train on all five games included in the benchmark. Although ceron2021revisiting establishes MinAtar as a good predictor of ALE performance, our experiments suggest significant differences in algorithm rankings in the online setting. Previous efforts in the online setting (elsayed2024streamin; elelimy2025gradient) evaluated their proposals on a limited subset of Atari games. While we were not able to train on the full 57-game set either, we take further steps to make sure our results are predictive for the entire benchmark. aitchison2023atari5 studies how predictive the performance on each Atari game is of the entire ALE benchmark, and further identifies the most predictive subset of 5 games, called Atari-5. We use this subset for all our hyper-parameter searches, with one change: we replace DoubleDunk, in which all agents flat-line, with the most predictive game not in the original 5-set, Amidar. We call this subset Atari-5∗. The 55 games in the main comparison are also selected based on aitchison2023atari5, in the order of their correlation to the aggregate performance.
#### Evaluation\.
We perform 7 training runs with different initialisations for each game in the main Atari experiment and use 9 seeds for the MinAtar experiments. Hyperparameter tuning was performed with 3 to 5 runs. When tuning on Atari-5∗ we only train for 12.5M steps, which amounts to 25% of the usual budget. In reporting performance we follow agarwal2021rliable in using stratified bootstrapping for confidence interval estimation and the interquartile mean (IQM) for aggregates, along with estimating the probability of improvement.
## 4 Are established objectives and optimisers effective in streaming RL?
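For readers unfamiliar with these aggregates, the sketch below shows one common way to compute them: the IQM as a 25% trimmed mean, and the probability of improvement as the normalised Mann-Whitney U-statistic on two sets of run scores. It is a simplified illustration, not the implementation used for the paper's results; the function names are hypothetical and the stratified bootstrap and per-game averaging are omitted.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of a non-empty list of scores (25% trimmed mean)."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))              # runs dropped at each end
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def probability_of_improvement(scores_x, scores_y):
    """Estimate P(X > Y) as U / (m * n), counting ties as one half."""
    wins = 0.0
    for x in scores_x:
        for y in scores_y:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(scores_x) * len(scores_y))
```

The IQM discards the top and bottom quarter of runs, so a single outlier game or seed cannot move the aggregate, while a probability of improvement of 0.5 means that neither algorithm is more likely to outscore the other.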
We developed our initial observations and intuitions on the MinAtar environments. A relatively sparse grid-search on a subset of games with DQN and C51 revealed two important findings that we kept confirming for the rest of the project. First, both algorithms gain in performance with increasingly large values of $\beta_0$, the coefficient that controls the “length” of the history of gradients in Adam's exponential moving average of the gradient. The default value of $\beta_0 = 0.9$ is used throughout batch-RL and defines a relatively small sliding window, which ensures the trace captures gradient information from the most recent several tens of steps. We settled instead on $\beta_0 = \beta_1 = 0.999$ for all our experiments. Second, we noticed a strong increase in performance with larger values of $\varepsilon$. We defer the discussion of this choice of hyperparameters to Sec. [5](https://arxiv.org/html/2605.06764#S5).
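The “length” of the gradient history can be made concrete with a small calculation. In an exponential moving average with coefficient $\beta$, the gradient from $k$ steps ago carries weight $(1-\beta)\beta^{k}$, so the $n$ most recent gradients together carry $1 - \beta^{n}$ of the total weight. The sketch below solves for the smallest such $n$; the function name and the 90% threshold are illustrative assumptions.

```python
import math


def steps_to_cover(beta, mass=0.9):
    """Smallest n such that the n most recent gradients hold `mass` of the
    total EMA weight, i.e. 1 - beta**n >= mass."""
    return math.ceil(math.log(1.0 - mass) / math.log(beta))


short_window = steps_to_cover(0.9)     # default beta_0 in batch-RL
long_window = steps_to_cover(0.999)    # value used in this work
```

Under this measure $\beta_0 = 0.9$ covers a few tens of steps, in line with the description above, whereas $\beta_0 = 0.999$ extends the window by roughly two orders of magnitude.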
Figure 2: Aggregated normalised score on five MinAtar environments. Classical RL objectives using Adam are strong performers.

Fig. [2](https://arxiv.org/html/2605.06764#S4.F2) illustrates the final results on the five MinAtar environments. Surprisingly, DQN and C51 both outperform StreamQ. Notable is the performance of DQN when compared to DQN1 from elsayed2024streamin, which always underperforms StreamQ, given that they share the optimiser and the objective function. We attribute this difference to our decision to level the field by allowing all methods to use the environment wrappers and the same estimator architecture, as well as to our specific choice of hyperparameters for Adam.
Moving on to the 55 Atari games selected as described above, the conclusions change to some degree and Figs. [1](https://arxiv.org/html/2605.06764#S1.F1) and [8](https://arxiv.org/html/2605.06764#A1.F8) paint a more nuanced view. Whereas DQN1 was failing catastrophically in the eight Atari games selected by the authors, ours showcases a decent performance, although not to the level of StreamQ. However, C51 manages to rise to the level of StreamQ and is given a 50% chance of improving over it by the Mann-Whitney U-statistic (mann1947test; agarwal2021rliable).
*Given these results, we must answer the original question in the affirmative. The classical objective functions developed for batch-RL, when coupled with a properly tuned Adam, demonstrate a reasonable performance in the two benchmarks we evaluated on.*
## 5 What drives the performance of established objectives with Adam?
Undoubtedly, one of the main reasons we are able to train these classic agents in the streaming setup in the first place is the set of normalisation techniques and weight initialisations introduced by elsayed2024streamin. The separate impact of reward, observation and activation normalisation and the sparse weight initialisation scheme have already been thoroughly ablated in prior work, and we refer the reader to it.
We turn, in what follows, to what we believe are some of the factors that make these classical algorithms perform well in the streaming setup when compared to StreamQ, which, in addition, employs eligibility traces and a novel optimisation algorithm that avoids updates that could overshoot the TD target, Overshooting-bounded Gradient Descent or ObGD.
Figure 3: On aggregate, DQN performs better with larger $\beta_0$ values and bounded objectives such as SmoothL1Loss. $\bullet$ denotes the IQM score over $\eta$ values given a value of $\varepsilon$. Each bar is the IQM of 5 games $\times$ 3 seeds $\times$ last 10 evaluations.

#### Bounded objectives.
The choice of objective function profoundly impacts optimisation stability in reinforcement learning, especially under the severe non-stationarity of the streaming setup. MSE is affected by unbounded gradients with respect to predictions. Consequently, when confronted with large temporal-difference (TD) errors, the linear derivative of MSE translates these errors into massive, destabilizing gradient steps. Recent results by palenicek2025xqc demonstrate that bounding gradient norms is critical for maintaining a stable effective learning rate under non-stationary targets and bootstrapping. Crucially, employing an objective function with strictly bounded gradients allows for the effective gradient update to be theoretically upper bounded. As shown by palenicek2025xqc, the cross-entropy loss implemented by C51 is representative of this type of objective with bounded derivative, and we believe much of the performance it demonstrates in Atari (Fig. [1](https://arxiv.org/html/2605.06764#S1.F1)) and MinAtar (Fig. [2](https://arxiv.org/html/2605.06764#S4.F2)) is because of this property. An increase in performance is also shown in Fig. [3](https://arxiv.org/html/2605.06764#S5.F3), when changing the objective function from MSE to SmoothL1Loss.
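One way to see why a cross-entropy objective has a bounded derivative: for a softmax output, the gradient of the cross-entropy with respect to each logit is the predicted probability minus the target probability, which always lies in $[-1, 1]$ regardless of how wrong the prediction is. The sketch below illustrates this standard identity; it is not the C51 loss as implemented in the paper, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(logits, target_probs):
    """Gradient of -sum_i target_i * log softmax(logits)_i w.r.t. the logits."""
    return [p - t for p, t in zip(softmax(logits), target_probs)]


# A maximally wrong prediction still yields per-logit gradients within [-1, 1].
worst_case = cross_entropy_grad([50.0, -50.0, 0.0], [0.0, 1.0, 0.0])
```

In contrast, the MSE derivative $2\delta$ has no such ceiling, which is the contrast drawn in the paragraph above.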
#### The longer the history of the gradients, the better\.
We explain in Sec. [2](https://arxiv.org/html/2605.06764#S2) that both Adam and eligibility traces include mechanisms for storing gradient statistics. Although different in intent and dynamics, Adam's $\bm{m}$, like the eligibility vector $\bm{z}$, accumulates gradient information with a window size determined by $\beta_0$. It perhaps came as no surprise that in our initial experiments with MinAtar we noticed a strong preference towards large $\beta_0$ values, beyond the usual $\beta_0 = 0.9$ found in the literature. In Fig. [3](https://arxiv.org/html/2605.06764#S5.F3), on the Atari-5∗ subset, we notice a similar trend, where the interquartile mean (IQM) over learning rates is higher for $\beta_0 = 0.999$. Similarly, in Fig. [4](https://arxiv.org/html/2605.06764#S5.F4), on the same Atari-5∗, we plot on the left the two $(\varepsilon, \eta)$ combinations that performed best for $\beta_0 = 0.9$ and observe that they perform worse than the ones with $\beta_0 = 0.999$.
#### Adam’s $\varepsilon$ as a step-size scaling factor.
The update in Eq. ([4](https://arxiv.org/html/2605.06764#S2.E4)) would suggest that the increased performance as $\varepsilon$ grows is an artefact of the interplay between the numerical stability term and the step-size $\eta$. Indeed, in the $\eta / (\sqrt{\bm{v}} + \varepsilon)$ relation, as $\varepsilon$ increases, the effective step size gets smaller, assuming $\sqrt{\bm{v}}$ constant. Note also that the updates scale higher with decreasing $\varepsilon$ as the value of $\sqrt{\bm{v}}$ decreases. From this we could argue that, as $\varepsilon$ increases, the size of the update decreases, with the possible effect of a more stable (and slower) optimisation process. However, a second intuition is that this scaling effect should be compensated by picking a different step size $\eta$.
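The compensation argument can be checked numerically for a single weight component. The sketch below computes the effective step size $\eta / (\sqrt{v} + \varepsilon)$ and the step size that would exactly undo a change in $\varepsilon$ for a given value of $\sqrt{v}$. Both helper names are hypothetical, and the scalar setting is a simplifying assumption.

```python
def effective_step(eta, sqrt_v, eps):
    """Per-component multiplier applied to m in Eq. (4)."""
    return eta / (sqrt_v + eps)


def compensating_eta(eta, sqrt_v, eps_old, eps_new):
    """Step size that keeps the effective step unchanged for this sqrt_v
    when eps_old is replaced by eps_new."""
    return eta * (sqrt_v + eps_new) / (sqrt_v + eps_old)


# Raising eps from 0 to 1 halves the step for a component with sqrt_v = 1 ...
halved = effective_step(1.0, 1.0, 1.0)
# ... and doubling eta restores it for that component.
eta_fixed = compensating_eta(1.0, 1.0, 0.0, 1.0)
restored = effective_step(eta_fixed, 1.0, 1.0)
# The same eta does not restore the step for a component with sqrt_v = 3.
mismatch = effective_step(eta_fixed, 3.0, 1.0) - effective_step(1.0, 3.0, 0.0)
```

A single rescaled $\eta$ matches the original effective step only for one value of $\sqrt{v}$, so components with different second-moment estimates are still treated differently once $\varepsilon$ changes.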
Figure 4: Performance increases with higher values of $\varepsilon$ and there is no obvious scaling of the step size $\eta$ that can compensate for it. $\bullet$ denotes the IQM score over $\eta$ values given a value of $\varepsilon$. Each bar is the IQM of 5 games $\times$ 3 seeds $\times$ last 10 evaluations.

To test whether a different step size can compensate for $\varepsilon$, we set up the following experiment. We train C51 agents on the Atari-5∗ subset for $12.5$M steps, amounting to $25\%$ of the standard training run. For five values of $\varepsilon\in\{10^{-5},\dots,10^{-1}\}$ we do a grid search over $\eta$ values. The bounded derivative of the C51 objective allows us to ignore possible stabilisation effects induced by $\varepsilon$ in the presence of outlier target values and further isolate the relative interplay between $\eta$ and $\varepsilon$, if it exists. For this hypothesis to hold fully, we would expect that, for each $\varepsilon$ value, some step size $\eta$ reaches the same performance. We observe in Fig. [4](https://arxiv.org/html/2605.06764#S5.F4) that there is indeed an optimal step size for each $\varepsilon$ value. However, the performance at this optimal step size increases with $\varepsilon$ instead of being constant, therefore invalidating our initial hypothesis. An even clearer picture of the requirement to carefully tune $\varepsilon$ is observed in Fig. [7](https://arxiv.org/html/2605.06764#A1.F7), the result of a grid search on MinAtar using Quantile Regression (QR), another distributional algorithm, which implements the Huber quantile loss, also an objective with a bounded derivative. Details are available in App. [A.2](https://arxiv.org/html/2605.06764#A1.SS2). We must conclude that the performance excess cannot be fully explained away by a proportional scaling of the step size and that an additional mechanism must be at play.
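One way to see why a single rescaled step size may not undo a change in $\varepsilon$ is to look at the per-component effective step $\eta/(\sqrt{v_i}+\varepsilon)$. The sketch below picks a new $\eta$ that exactly compensates the change of $\varepsilon$ for one value of $\sqrt{v}$ and then checks what happens to components with smaller $\sqrt{v}$. All numbers and helper names are illustrative assumptions, not values from the experiments.

```python
def effective_step(eta, sqrt_v, eps):
    """Per-component effective step size eta / (sqrt(v) + eps)."""
    return eta / (sqrt_v + eps)


def compensating_eta(eta, sqrt_v, eps_old, eps_new):
    """Step size that keeps the effective step unchanged for ONE value of sqrt_v."""
    return eta * (sqrt_v + eps_new) / (sqrt_v + eps_old)


# Hypothetical setting: raise eps from 1e-5 to 1e-1 and compensate for sqrt_v = 1.
eta, eps_old, eps_new = 1e-3, 1e-5, 1e-1
eta_new = compensating_eta(eta, sqrt_v=1.0, eps_old=eps_old, eps_new=eps_new)


def step_ratio(sqrt_v):
    """Effective step after the change divided by the effective step before it."""
    return effective_step(eta_new, sqrt_v, eps_new) / effective_step(eta, sqrt_v, eps_old)


for sqrt_v in (1.0, 1e-2, 1e-4):
    print(f"sqrt_v={sqrt_v:g}: ratio after/before = {step_ratio(sqrt_v):.4f}")
```

The ratio is one only for the component used to choose the new $\eta$; components with small $\sqrt{v}$ end up with much smaller steps, so $\varepsilon$ changes the relative scaling across components rather than only a global scale.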
#### Adam's $\varepsilon$ is an SNR filter.
RL has the particularity of having to deal with a long tail of features that are rarely encountered by the agent: entities that appear only sparsely or after bottleneck states, objects and tokens that are relevant only to certain parts of the game. It is justified to assume these features are correlated with sparsely or noisily updated components of the weight vector. Updating both the sparse and the dense or the noisy and the stable directions of the parameter space with the same scale is likely to produce estimation errors, cause instabilities, and result in the policy diverging. Ideally, an optimiser should therefore be able to adjust the step size of each component such that rare or noisy gradients result in conservative updates.


Figure 5: Large values for Adam's $\varepsilon$ allow for stable learning on problems which are noisy or sparse. The y-axis component receives a stable gradient signal in both cases, whereas on the x-axis the gradient can either change sign randomly (left) or be zero $95\%$ of the time (right).

We hypothesise that Adam manifests a seldom-discussed implicit mechanism for scaling the updates differently across gradient components with variable noise or sparsity characteristics. Consider the denominator of the update, $\sqrt{\bm{v}}+\varepsilon$, with $\bm{v}_t\approx\bm{g}^{2}$, and ignore the bias correction. For frequently updated or stable features, the elements $\sqrt{\bm{v}_i}, i\in\mathcal{D}_{\text{dense/stable}}$ dominate the denominator and Adam scales down the effective step size for the directions associated with these features. In contrast, for the components of $\bm{v}$ associated with sparse or noisy features, $\varepsilon$ dominates the scaling term. Larger values of $\varepsilon$ are effective in this regime for bounding the maximum step size.
We propose two small toy examples to illustrate this behaviour in Fig. [5](https://arxiv.org/html/2605.06764#S5.F5). Both optimisation problems depicted have just two parameters. The component on the $y$-axis always receives a constant gradient of $0.1\times w_{t-1}$ in both problems. In the left panel, the derivative with respect to the second weight $w^{\text{noisy}}$ (on the $x$-axis) has a similar form, but changes sign randomly at every step. In the right panel we formulate a sparse variation of this problem where the component on the $x$-axis is set to zero $95\%$ of the time. Therefore, in both problems the non-constant component of $\bm{v}$ will tend to be small, either because of noise or because of sparse updates.
Indeed, for small values of $\varepsilon$, Adam is overly sensitive to the sparse or noisy component: whenever a component of $\bm{v}$ is small, it amplifies the updates along the corresponding dimension, resulting in unstable optimisation. In contrast, large values of $\varepsilon$ that dominate the non-constant components of $\bm{v}$ allow for a filtering effect on the sparse or noisy dimension, resulting in stable and early convergence.
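The scaling argument can be checked numerically with a deliberately simplified setup. The sketch below does not reproduce the toy problems of Fig. 5: it feeds two fixed gradient streams (one dense, one non-zero on every 20th step, i.e. $5\%$ of the time) into Adam's second-moment EMA without bias correction, and compares the per-coordinate scale $1/(\sqrt{v}+\varepsilon)$. The gradient magnitude, the number of steps, the deterministic sparsity pattern and $\beta_1=0.999$ are all assumptions chosen for illustration.

```python
import math


def second_moment(grads, beta1=0.999):
    """EMA of squared gradients (Adam's v), without bias correction."""
    v = 0.0
    for g in grads:
        v = beta1 * v + (1.0 - beta1) * g * g
    return v


steps, magnitude = 20_000, 0.01  # hypothetical stream length and gradient size
dense = [magnitude] * steps
sparse = [magnitude if t % 20 == 0 else 0.0 for t in range(steps)]

sqrt_v_dense = math.sqrt(second_moment(dense))
sqrt_v_sparse = math.sqrt(second_moment(sparse))


def relative_scale(eps):
    """Ratio of 1/(sqrt(v)+eps) on the sparse coordinate to the dense one."""
    return (sqrt_v_dense + eps) / (sqrt_v_sparse + eps)


for eps in (1e-8, 1e-5, 1e-3, 1e-1):
    print(f"eps={eps:g}: sparse/dense scale ratio = {relative_scale(eps):.3f}")
```

With a tiny $\varepsilon$ the sparse coordinate receives a several times larger scale than the dense one, whereas a large $\varepsilon$ dominates both denominators and brings the two scales close together.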
## 6 Adaptive Q($\lambda$)
Previous sections highlight the importance of i) having a long memory of the sensitivity of the value function with respect to the decisions it produced, as in eligibility traces or Adam's EMA, ii) having an $\varepsilon$-modulated mechanism that adjusts the update based on the variance of this sensitivity, and iii) bounding the error signals, as is the case for C51 and SmoothL1Loss.
We combine these three ideas into Adaptive Q($\lambda$). Taking $Q(\lambda)$ (sutton1988tdlmbda) as a starting point, the algorithm uses a momentum-based eligibility trace $\bm{z}_t$, alongside an exponential moving average of the squared gradients of the state-action value function, $\bm{v}_t$. We use this estimate of the second-order moment of the gradient of the state-action value function to scale the eligibility trace before producing an update, similar to what Adam does. By normalizing the eligibility trace vector $\bm{z}_t$ with this variance estimate, we ensure that the magnitude of the updates remains stable across different parameters, even in highly non-stationary streaming environments.
We noticed, however, that this update had stability issues. Many of our training runs resulted in agents crashing because of large updates, despite the extensive normalisation techniques used in our setup. To further improve the convergence of our method, we simply clamp $\delta$ to $(-1,1)$, resulting in an algorithm that is stable across hyperparameters. Notice that this is exactly the derivative of the SmoothL1Loss.
We reset the trace whenever the agent takes exploratory actions and when the episode ends, similar to StreamQ and others in the literature. Based on empirical experimentation, we opted not to reset $\bm{v}$, as doing so yielded slightly worse results. Alg. [1](https://arxiv.org/html/2605.06764#alg1) compares and summarises the updates of Q($\lambda$), Adaptive Q($\lambda$) and DQN using Adam.
Algorithm 1: Simplified comparison of update rules. For all updates we require: $\delta\leftarrow r+\gamma\max_{a}q(s^{\prime},a,\bm{w})-q(s,a,\bm{w})$

$\triangleright$ $Q(\lambda)$ update

1: $\bm{g}\leftarrow\nabla_{\bm{w}}q(s,a,\bm{w})$

2: $\bm{z}\leftarrow\gamma\lambda\bm{z}+\bm{g}$

3: $\bm{w}\leftarrow\bm{w}+\eta\delta\bm{z}$

4: if ep. done or non-greedy then

5: $\quad\bm{z}\leftarrow\bm{0}$

6: end if

$\triangleright$ $AQ(\lambda)$ update

1: $\bm{g}\leftarrow\nabla_{\bm{w}}q(s,a,\bm{w})$

2: $\bm{z}\leftarrow\gamma\lambda\bm{z}+\bm{g}$

3: $\bm{v}\leftarrow\gamma\lambda\bm{v}+(1-\gamma\lambda)\bm{g}^{2}$

4: $\bm{\rho}\leftarrow\bm{z}/(\sqrt{\bm{v}}+\varepsilon)$

5: $\hat{\delta}\leftarrow\text{clip}\left(\delta,-1,1\right)$

6: $\bm{w}\leftarrow\bm{w}+\eta\hat{\delta}\bm{\rho}$

7: if ep. done or non-greedy then

8: $\quad\bm{z}\leftarrow\bm{0}$

9: end if

$\triangleright$ $Q$-learning update (Adam)

1: $\bm{g}\leftarrow\nabla_{\bm{w}}\delta^{2}$

2: $\bm{m}\leftarrow\beta_{0}\bm{m}+(1-\beta_{0})\bm{g}$

3: $\bm{v}\leftarrow\beta_{1}\bm{v}+(1-\beta_{1})\bm{g}^{2}$

4: $\bm{\rho}\leftarrow\bm{m}/(\sqrt{\bm{v}}+\varepsilon)$

5: $\bm{w}\leftarrow\bm{w}-\eta\bm{\rho}$
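The AQ($\lambda$) column of Alg. 1 can be sketched in plain Python for a linear value function, as below. This is an illustrative reading of the update rule, not the authors' implementation: the class name, the linear parameterisation $q(s,a,\bm{w})=\bm{w}_a^{\top}\bm{x}(s)$, the default hyperparameters and the handling of terminal transitions (no bootstrapping when `done`) are all assumptions.

```python
import math


class AdaptiveQLambda:
    """Sketch of the AQ(lambda) update for a linear q(s, a, w) = w[a] . x(s)."""

    def __init__(self, n_features, n_actions, eta=1e-3, gamma=0.99, lam=0.8, eps=1e-2):
        self.n_features, self.n_actions = n_features, n_actions
        self.eta, self.gamma, self.lam, self.eps = eta, gamma, lam, eps
        size = n_features * n_actions
        self.w = [0.0] * size  # weights
        self.z = [0.0] * size  # eligibility trace
        self.v = [0.0] * size  # EMA of squared gradients (never reset)

    def q(self, x, a):
        off = a * self.n_features
        return sum(self.w[off + i] * xi for i, xi in enumerate(x))

    def update(self, x, a, r, x_next, done, greedy):
        # TD error; skipping the bootstrap on terminal steps is an assumption.
        bootstrap = 0.0 if done else max(self.q(x_next, b) for b in range(self.n_actions))
        delta = r + self.gamma * bootstrap - self.q(x, a)
        delta_hat = max(-1.0, min(1.0, delta))  # clip(delta, -1, 1)

        # Gradient of a linear q w.r.t. w: x in the block of action a, zero elsewhere.
        g = [0.0] * len(self.w)
        off = a * self.n_features
        for i, xi in enumerate(x):
            g[off + i] = xi

        decay = self.gamma * self.lam
        for i in range(len(self.w)):
            self.z[i] = decay * self.z[i] + g[i]
            self.v[i] = decay * self.v[i] + (1.0 - decay) * g[i] ** 2
            rho = self.z[i] / (math.sqrt(self.v[i]) + self.eps)
            self.w[i] += self.eta * delta_hat * rho

        # Reset the trace at episode end or after an exploratory action; keep v.
        if done or not greedy:
            self.z = [0.0] * len(self.z)
        return delta
```

A caller would pass the feature vectors of the current and next state, the action taken, the reward, and two flags saying whether the episode ended and whether the action was greedy.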
AQ($\lambda$) delivers strong results. In MinAtar (Fig. [2](https://arxiv.org/html/2605.06764#S4.F2)) it largely surpasses StreamQ, both by the average normalised score and the robust median (IQM). The IQM on Atari (Fig. [1](https://arxiv.org/html/2605.06764#S1.F1)) is largely comparable between the top three algorithms, with AQ($\lambda$) having a small edge on StreamQ. The average normalised score of AQ($\lambda$), however, approaches three times the human baseline, in comparison to just two times the human baseline for C51 and StreamQ. Furthermore, the Mann-Whitney test indicates over $0.65$ probability of improvement over StreamQ.
In conclusion, we believe these results validate our initial hypotheses on the importance of variance-adjusted updates and bounded objectives.
### 6.1 Related work
AQ($\lambda$) draws heavily from the works of javed2024swiftTD; elsayed2024streamin; elelimy2025gradient, which sparked a renewed interest in the pure online RL setting. It shares many of its components with StreamQ. Being a $\lambda$-return method implemented with eligibility traces, it traces its origin to the works of sutton1988tdlmbda and mahmood2015wisTrace.
In reasoning about the role of constraining the size of the updates, we were further inspired by javed2024swiftTD and, specifically on the role of bounded objectives in batch-RL, by palenicek2025xqc and farebrother2024stop.
Among the recent advances in second-order optimisation for large language model training there is also the observation that adopting the variance-adjusting mechanism of Adam can still yield improvements (frans2025whatMatters), which also influenced our design decisions.
## 7 Conclusion
In this work, we revisited the streaming reinforcement learning protocol to investigate whether established batch-RL techniques could be competitive. Through extensive benchmarking across MinAtar and 55 Atari games, we answered this in the affirmative: when appropriately adapted, classic algorithms like DQN and C51 prove to be surprisingly effective.
We then isolated some of the properties that enable this performance. We found that the severe non-stationarity of the streaming setup requires objectives with bounded derivatives to prevent large TD errors from destabilizing learning. Furthermore, we highlighted the necessity of variance-adjusted updates featuring a long gradient history. Finally, we took steps toward understanding the role of Adam's $\varepsilon$ parameter, demonstrating that rather than acting as a simple step-size scalar, it functions as a Signal-to-Noise Ratio (SNR) filter that facilitates smooth convergence when processing sparse or noisy features.
Building upon these insights, we introduced Adaptive Q($\lambda$), a small intervention on Q($\lambda$) that combines variance adaptation with bounded updates to set a new high-performance baseline in the streaming regime.
## Appendix A
### A.1 Hyperparameters
Table [2](https://arxiv.org/html/2605.06764#A1.T2) outlines the complete set of hyperparameter configurations used for our empirical evaluation across the ALE suite. We detail both the globally shared settings, such as the exploration schedule and discount factor, as well as the specific optimizer and algorithmic parameters tuned for each individual method.
Table 2:Hyperparameter configurations for the evaluated algorithms on the ALE suite\.
### A.2 Additional MinAtar results
#### Quantile Regression DQN.
We briefly experimented in MinAtar with an alternate distributional algorithm. Quantile Regression DQN (QR-DQN) (dabney2018qr) models the distribution using $N$ atoms with fixed, uniform probabilities $1/N$, while learning their locations $\theta_{i}(s,a,\bm{w})$ instead. The practical objective in QR-DQN is then the minimisation of the quantile Huber loss. Although our initial MinAtar results with Quantile Regression were very promising, C51 demonstrated superior performance upon transitioning to the Atari-5∗ subset. We do, however, present some of these results here.
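For readers unfamiliar with the quantile Huber loss mentioned above, the following is a minimal sketch for a single transition, following the standard QR-DQN formulation with quantile midpoints $\hat{\tau}_i=(2i+1)/(2N)$. The function names, the plain-list representation and the threshold `kappa=1.0` are assumptions for illustration and are not taken from the experimental setup of this paper.

```python
def huber(u, kappa=1.0):
    """Huber loss: quadratic for |u| <= kappa, linear beyond it."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber_loss(theta, target_theta, kappa=1.0):
    """Quantile Huber loss between predicted and target atom locations.

    theta:        predicted locations theta_i for the N quantile midpoints
    target_theta: target locations (e.g. r + gamma * theta_j of the next state)
    Sums over predicted atoms and averages over target atoms.
    """
    n = len(theta)
    total = 0.0
    for i, th in enumerate(theta):
        tau = (2 * i + 1) / (2 * n)  # quantile midpoint
        for tt in target_theta:
            u = tt - th
            weight = abs(tau - (1.0 if u < 0 else 0.0))  # asymmetric quantile weight
            total += weight * huber(u, kappa)
    return total / len(target_theta)


print(quantile_huber_loss([0.0, 0.0], [1.0]))
```

Because the Huber term is linear for large errors, the derivative of this objective with respect to each $\theta_i$ stays bounded.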
Figure 6: MinAtar results separated by game. The solid lines represent the mean evaluation return averaged over 9 independent runs.

Figure 7: Performance increases with higher values of $\varepsilon$ and there is no obvious scaling of the step size $\eta$ that can compensate for it. $\bullet$ denotes the mean score given a value of $\varepsilon$.
### A.3 Full Atari results
Figure 8: Atari results separated by game. The solid lines represent the mean evaluation return averaged over 7 independent runs.
## References