Position: Deployed Reinforcement Learning should be Continual

arXiv cs.LG Papers

Summary

This position paper argues that deployed RL agents should never stop learning, as the train-then-fix paradigm inherently fails to address non-stationarity and distribution shift in real-world environments. The authors identify four sources of post-deployment non-stationarity and advocate for continual RL as the standard approach for deployed systems.

arXiv:2606.04029v1 Announce Type: new Abstract: Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-then-fix paradigm, where trained agents do not learn while interacting with the world until performance degrades and retraining becomes necessary. In this position paper, we argue that deploying an agent that is incapable of optimality, but receives an evaluative reward signal, is inherently a continual RL problem. We identify four sources of non-stationarity after deployment that necessitate never-ending learning, and highlight why the best deployed agents never stop adapting. We analyze successful examples of continual RL in the real world, and present the community with the advantages and measures to move away from the current train-then-fix paradigm.

Cached at: 06/05/26, 02:17 AM

# Position: Deployed Reinforcement Learning should be Continual
Source: [https://arxiv.org/html/2606.04029](https://arxiv.org/html/2606.04029)
###### Abstract

Reinforcement Learning \(RL\) has received increasing attention and adoption in real\-world use cases\. Most of these systems follow a train\-then\-fix paradigm, where trained agents do not learn while interacting with the world until performance degrades and retraining becomes necessary\. In this position paper, we argue that deploying an agent that is incapable of optimality, but receives an evaluative reward signal, is inherently a continual RL problem\. We identify four sources of non\-stationarity after deployment that necessitate never\-ending learning, and highlight why the best deployed agents never stop adapting\. We analyze successful examples of continual RL in the real world, and present the community with the advantages and measures to move away from the current train\-then\-fix paradigm\.

Reinforcement Learning, Continual Learning

## 1 Introduction

Reinforcement Learning \(RL\) is learning from interaction \(Sutton and Barto, [2018](https://arxiv.org/html/2606.04029#bib.bib68)\)\. Yet after RL agents are deployed in the real world, they typically stop learning\. Policies are trained offline \(through simulations, self\-play, expert demonstrations, or a combination of these\) and then frozen upon deployment while the world continues to change\. The environment’s complexity far exceeds what any finite training phase can capture, and retraining becomes necessary \(Dulac\-Arnold et al\., [2019](https://arxiv.org/html/2606.04029#bib.bib113)\)\. We call this the *train\-then\-fix* paradigm\.

This train\-then\-fix paradigm pervades the history of RL\. TD\-Gammon excelled at backgammon through self\-play, and was frozen when competing \(Tesauro, [1995](https://arxiv.org/html/2606.04029#bib.bib24)\)\. AlphaGo defeated world champion Lee Sedol \(Silver et al\., [2016](https://arxiv.org/html/2606.04029#bib.bib97)\), OpenAI Five defeated Dota 2 world champions \(Berner et al\., [2019](https://arxiv.org/html/2606.04029#bib.bib98)\), and GT Sophy outperformed professional Gran Turismo drivers \(Wurman et al\., [2022](https://arxiv.org/html/2606.04029#bib.bib129)\)\. Deep RL has even controlled stratospheric balloons \(Bellemare et al\., [2020](https://arxiv.org/html/2606.04029#bib.bib111)\) and tokamaks \(Degrave et al\., [2022](https://arxiv.org/html/2606.04029#bib.bib110)\)\. In each case, policies were trained extensively offline and held fixed after deployment\. Although these are landmark achievements for the field, they constrained the deployment problem to fit the train\-then\-fix paradigm\. Environments were either stationary or easy to simulate accurately, or the deployment was brief or localized enough that significant drifts from the human knowledge and data used during training did not emerge\. This set the stage for favoring extensive pre\-deployment training over learning continually from the observation stream after deployment\.

![Refer to caption](https://arxiv.org/html/2606.04029v1/x1.png)

Figure 1: The number of papers on arxiv\.org containing the words *continual reinforcement learning* in their title or abstract each year\.

Historically, we have been approximating continual learning problems as non\-continual ones\. Our methods would likely fail if deployed for extended periods, exposed to unforeseen distribution shifts, or asked to generalize beyond their training\. The train\-then\-fix paradigm does not remedy the continual learning problem; it only delays it\.

This is not a new observation \(Hamadanian et al\., [2022](https://arxiv.org/html/2606.04029#bib.bib118); Khetarpal et al\., [2022](https://arxiv.org/html/2606.04029#bib.bib42)\)\. The Big World Hypothesis \(Javed and Sutton, [2024](https://arxiv.org/html/2606.04029#bib.bib38)\) formalizes the insight that real\-world environments exceed any agent’s representational capacity\. In deployment, resource constraints compound this challenge\. Limited time, computation, and data may prevent agents from finding optimal policies even when they are representable in theory\.

The idea of blurring the line between training and deployment is not a new one either \(Ring, [1994](https://arxiv.org/html/2606.04029#bib.bib39); Thrun, [1998](https://arxiv.org/html/2606.04029#bib.bib125)\)\. It reflects a view of learning as adaptation rather than solving a fixed problem \(Barron et al\., [2015](https://arxiv.org/html/2606.04029#bib.bib75); Abel et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib76)\)\.

Many deployed RL agents continue to receive evaluative feedback \(via a reward signal\) after deployment: recommendation systems observe user engagement, ride\-sharing platforms track completed rides, and coding assistants measure suggestion acceptance rates\. When such evaluative feedback is available, and the deployment environment exceeds the agent’s representational capacity or the available time or compute resources, there is little justification for not leveraging this signal for continued adaptation\.

We call such settings *measurable* deployment: the agent is constrained in its capacity and resources, but an evaluative reward signal remains available after deployment\.

In this paper, we argue that *measurable deployment is a continual reinforcement learning problem*, and therefore continual learning *solutions* should be utilized in deployed models\. With theoretical foundations maturing, academic interest surging \([Figure 1](https://arxiv.org/html/2606.04029#S1.F1)\), and successful industry deployments emerging, it is timely to advocate for this paradigm shift\.

## 2 Background

### 2\.1 Continual Reinforcement Learning

Abel et al\. \([2023](https://arxiv.org/html/2606.04029#bib.bib22)\) define the continual reinforcement learning \(CRL\) problem as: “An RL problem is an instance of CRL if the best agents never stop learning”\. This definition stands in contrast to the deployed RL examples in [Section 1](https://arxiv.org/html/2606.04029#S1), where the agent searches for a policy, after which learning ceases and the fixed policy is deployed\.

The Problem with Traditional RL Foundations\. This norm of fixing the agent after a training phase comes in part from the mathematical formalization of the RL problem\. The traditional Markov Decision Process \(MDP; [Bellman](https://arxiv.org/html/2606.04029#bib.bib77), [1957](https://arxiv.org/html/2606.04029#bib.bib77); [Puterman](https://arxiv.org/html/2606.04029#bib.bib46), [2014](https://arxiv.org/html/2606.04029#bib.bib46)\) formalism of the agent\-environment interaction fails to capture the never\-ending nature of CRL\. An MDP is represented by the tuple $\langle \mathcal{S}, \mathcal{A}, P, R \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces\. On each discrete timestep $t$, the agent selects an action $A_t \in \mathcal{A}$ in state $S_t \in \mathcal{S}$, and the environment transitions to a new state $S_{t+1}$, emitting a scalar reward $R_{t+1}$\. The MDP objective defines optimality with respect to transition and reward functions, $P: \mathcal{S} \times \mathcal{A} \mapsto \Delta(\mathcal{S})$ and $R: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$, for which an optimal policy, $\pi^{\star}: \mathcal{S} \mapsto \Delta(\mathcal{A})$, exists as a fixed point of the Bellman optimality equation \(Puterman, [1990](https://arxiv.org/html/2606.04029#bib.bib45)\)\. This notion of converging to $\pi^{\star}$ implies a terminal point: once found, the policy is to be deployed indefinitely without further learning\. Such a framing treats learning as a means to an end rather than an ongoing process \(Abel et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib76)\)\. Additionally, the MDP assumptions of ergodicity, stationarity, and the ability to reset or revisit states rarely hold in deployment\.
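For concreteness, the fixed point referred to above can be written out. The following is a sketch under two assumptions that are not part of the tuple as stated: a discounted objective with discount factor $\gamma \in [0, 1)$, and a finite state space so that the sum is well defined. The symbol $v^{\star}$ (the optimal value function) is introduced here for illustration only.

$$
v^{\star}(s) = \max_{a \in \mathcal{A}} \Big[ R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, v^{\star}(s') \Big]
$$

Under these assumptions, any policy that places all of its probability on maximizing actions in every state is optimal, which is the sense in which the search for $\pi^{\star}$ has a terminal point.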

The History Process Formalism\. To address these limitations, recent works propose the *history process* as an alternative mathematical foundation \(Bowling et al\., [2023](https://arxiv.org/html/2606.04029#bib.bib34); Abel et al\., [2023](https://arxiv.org/html/2606.04029#bib.bib22)\)\. In this formalism, an environment $e$ is a function from finite\-length histories and actions to a distribution over a finite observation space, $e: \mathcal{H} \times \mathcal{A} \mapsto \Delta(\mathcal{O})$, where the observation space is denoted by $\mathcal{O}$, the action space by $\mathcal{A}$, and the history set $\mathcal{H} = \bigcup_{n=0}^{\infty} (\mathcal{A} \times \mathcal{O})^{n}$ is the space of all possible finite sequences of action\-observation pairs\. \(Footnote 1: These observations are not necessarily Markovian, allowing history processes to describe both MDPs and Partially Observable MDPs \(Monahan, [1982](https://arxiv.org/html/2606.04029#bib.bib78); Cassandra et al\., [1994](https://arxiv.org/html/2606.04029#bib.bib47)\)\.\) The reward function is over such pairs, $R: \mathcal{O} \times \mathcal{A} \mapsto \mathbb{R}$\. Following Elelimy et al\. \([2025](https://arxiv.org/html/2606.04029#bib.bib33)\), we define a policy $\pi: \mathcal{S} \mapsto \Delta(\mathcal{A})$ as a mapping from the agent’s state representation to a distribution over actions\. $S_t \in \mathcal{S}$ is now the agent’s compression of its history, not to be confused with state from the MDP formalism\. The set of all policies representable by the agent is denoted by $\Pi$\. The agent’s learning rule, $\sigma: \mathcal{H} \mapsto \Delta(\Pi)$, maps histories to a distribution over the policy set\.

The history process makes minimal assumptions about the environment\. Critically, it does not assume that agents can reset the environment or revisit previous histories\. Once a history $h_t \in \mathcal{H}$ has occurred, the agent cannot ever pass $h_t$ back into the environment again\. It can never exactly revisit the situations it was in before, and future interactions take the form $e(h_t \cdot h, a)$, where $\cdot$ denotes concatenation\. Note that these formalisms include both episodic and continuing settings, as both can be continual learning problems\.
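The definitions above can be made concrete with a small sketch. It assumes integer actions and observations and a made-up environment, `drifting_env`, whose observation distribution depends on the length of the history; none of these names come from the cited works. The point it illustrates is structural: the environment is queried with the whole history, and the history only ever grows.

```python
import random
from typing import Callable, Dict, Tuple

Action = int
Observation = int
# A history is a finite sequence of (action, observation) pairs.
History = Tuple[Tuple[Action, Observation], ...]
# An environment e: H x A -> Delta(O), returned here as {observation: probability}.
Environment = Callable[[History, Action], Dict[Observation, float]]


def drifting_env(history: History, action: Action) -> Dict[Observation, float]:
    """Hypothetical environment: the effect of action 1 weakens as the history grows."""
    p_one = 0.9 / (1.0 + 0.01 * len(history)) if action == 1 else 0.1
    return {1: p_one, 0: 1.0 - p_one}


def step(env: Environment, history: History, action: Action, rng: random.Random) -> History:
    """Sample an observation and return the extended history."""
    dist = env(history, action)
    observations = list(dist.keys())
    weights = list(dist.values())
    obs = rng.choices(observations, weights=weights, k=1)[0]
    # Concatenate the new pair; there is no operation that shortens or resets a history.
    return history + ((action, obs),)


rng = random.Random(0)
h: History = ()
for _ in range(5):
    h = step(drifting_env, h, action=1, rng=rng)
```

Because `drifting_env` receives the full history, the same action yields a different observation distribution at the start of the stream than it does later, even though the function itself never changes.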

### 2\.2 Continual Learning Problems vs\. Solutions

It is important to distinguish continual learning *problems* \(problem settings where never\-ending adaptation is useful\) from continual learning *solutions* or *algorithms* \(methods designed to address these problems\) \(Khetarpal et al\., [2022](https://arxiv.org/html/2606.04029#bib.bib42)\)\. Continual learning algorithms typically address challenges with function approximators within the agent, such as catastrophic forgetting \(McCloskey and Cohen, [1989](https://arxiv.org/html/2606.04029#bib.bib41)\), maintaining plasticity \(Dohare et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib43)\), and balancing stability with adaptability \(Mermillod et al\., [2013](https://arxiv.org/html/2606.04029#bib.bib44)\)\. However, the existence of these algorithmic challenges does not define what makes a problem a continual learning problem\. As Abel et al\. \([2023](https://arxiv.org/html/2606.04029#bib.bib22)\) emphasize, a problem is an instance of CRL based on the environmental characteristics that necessitate never\-ending learning, not on the algorithmic difficulties of implementing such learning\. We examine the characteristics of real\-world deployment that necessitate never\-ending learning in [Section 3](https://arxiv.org/html/2606.04029#S3)\.

### 2\.3 Deployment

Throughout this paper, we use the term *deployment* to refer to integrating a trained policy into its intended operating environment, where it actively makes decisions in the real world\. Deployment marks the transition from offline development to operational use, where the agent’s performance is truly tested and its value realized \(IBM, [2024](https://arxiv.org/html/2606.04029#bib.bib96)\)\.

## 3 Measurable Deployment is a Continual Reinforcement Learning Problem

Not *all* deployed systems require continual learning\. The checkers engine of Schaeffer et al\. \([2007](https://arxiv.org/html/2606.04029#bib.bib109)\) has provably solved the game, and learning will not improve its win rate\. Fixed policies suffice when the environment lies within the agent’s representational capacity and available resources\.

Many deployed RL systems operate in the big world regime, where the environment’s complexity exceeds the agent’s capacity and resources\. In such settings, an optimal policy may be unrepresentable, or may require more experience than the finite training phase provides\. When evaluative feedback remains available after deployment, it provides a means to narrow this gap: a reward signal that reveals how well the agent performs in the situations it encounters\. Not learning from this signal, and restricting agents to retraining based on human knowledge, leaves performance on the table, as envisioned in [Figure 2](https://arxiv.org/html/2606.04029#S3.F2)\. This is the *measurable* deployment setting, where the best agents overcome their representational or computational limitations by never ceasing to learn from this evaluative signal\. Deploying such agents to a big world with an evaluative reward signal is a CRL problem\.

Given the incompatibilities of the MDP formalism discussed in [Section 2\.1](https://arxiv.org/html/2606.04029#S2.SS1), we use the history process formalism to characterize why measurable deployment is a CRL problem\. The worlds that deployed agents find themselves in change through:

1. Action\-induced Non\-Stationarity: Each history $h_t$ instantiates a new $e(h_t \cdot h, a)$ for future interactions\. The agent’s policy shapes the distribution of future histories it encounters\. For example, a recommendation system \(agent\) that repeatedly suggests certain content changes the user’s preferences \(environment\)\. The distribution over observations for future interactions differs from the current one precisely because of past actions\. Other examples include generative models adapting to updated safety/alignment policies, or markets responding to automated trading strategies\. \(Footnote 2: This closely relates to the literature on *performative prediction*, which studies how predictions influence their targets \(Perdomo et al\., [2020](https://arxiv.org/html/2606.04029#bib.bib35)\)\. This phenomenon is relevant to deployment and merits further study in the applied RL community\.\)
2. Changes in Environment Dynamics: The environment can also change due to factors outside the agent’s control \(seasonal variations, hardware aging, market shifts, regulatory changes\)\. These changes can be periodic \(multi\-timescale patterns like diurnal or seasonal cycles\) or stochastic \(unpredictable shifts in dynamics\)\.
3. Evolving Goals: Under the reward hypothesis, an agent’s “goals and purposes” are expressed through the reward function \(Sutton, [2004](https://arxiv.org/html/2606.04029#bib.bib127); Littman, [2017](https://arxiv.org/html/2606.04029#bib.bib128)\)\. This function can change over time, altering what constitutes desirable behavior even when the underlying environment dynamics remain stable\. In the history process formalism, the reward function induces the agent’s preference relation over histories \(Bowling et al\., [2023](https://arxiv.org/html/2606.04029#bib.bib34)\)\. In deployment, this preference can evolve as stakeholder priorities shift, regulations change, safety constraints are updated, or new capabilities are added to deployed systems\. This challenge intensifies in multi\-objective settings, where agents must balance multiple competing objectives, as both the set of objectives and their relative importance can evolve during deployment in ways unforeseen during the training phase\. Unlike environmental or action\-induced non\-stationarity, evolving goals are often imposed by human designers or stakeholders, making them particularly relevant for real\-world deployment, where the goals and purposes we wish the agent to achieve are subject to change\.
4. Emergent Novelty: New action\-observation sequences can arise after deployment, lying outside the distribution of histories experienced in a finite training phase\. The Big World Hypothesis formalizes why this is inevitable: the world’s complexity is much richer than the representational capacity of any agent within it\. Recent work on computationally\-embedded agents shows how any agent embedded in its environment is implicitly constrained by its finite capacity, so there necessarily exist action\-observation sequences the agent cannot realize \(Lewandowski et al\., [2025](https://arxiv.org/html/2606.04029#bib.bib36)\)\. A dangerous form of emergent novelty is the *black swan event*: a rare, highly negative reward that agents may misperceive as impossible despite its non\-zero probability \(Taleb, [2008](https://arxiv.org/html/2606.04029#bib.bib126); Lee et al\., [2025](https://arxiv.org/html/2606.04029#bib.bib37)\)\. Deployed agents inevitably encounter histories for which their current policy is inadequate, whether due to genuine novelty beyond their capacity or a misperception of rare events\. CRL views such novelty not just as a failure mode to be detected and flagged, but as an inherent property of deployment under the Big World Hypothesis\.

These four characteristics demonstrate why agents in measurable deployment should never stop learning\. We now examine production systems that have successfully deployed continual RL, analyzing which challenges each faces and how continual adaptation enables their success\.

### 3\.1 Case Study I: Cursor Tab

Cursor Tab is a code completion system that predicts developers’ next actions across their codebase, handling over 400 million requests a day \(Jackson et al\., [2025](https://arxiv.org/html/2606.04029#bib.bib48)\)\. The system employs online RL for its code suggestion policy, $\pi_{\boldsymbol{\theta}}$, using a policy gradient method optimizing the expected single\-step reward $J(\boldsymbol{\theta}) = \mathbb{E}_{s \sim P,\, a \sim \pi_{\boldsymbol{\theta}}}[R(s, a)]$\. New models are rolled out frequently throughout the day to get “fresh” on\-policy rewards for $\widehat{\nabla_{\boldsymbol{\theta}} J}$ in the gradient ascent update\. This differs starkly from most large language model providers, who train on static datasets and release new models every few months\.
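As an illustration of the objective above, the following sketch performs gradient ascent on $J(\boldsymbol{\theta})$ with the score-function (REINFORCE) estimator for a tabular softmax policy. It is not Cursor's implementation, which the source does not describe at this level; the contexts, the `accept_reward` function, and all hyperparameters are hypothetical. Each update draws a fresh batch from the current policy, mirroring the on-policy requirement.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(theta, contexts, reward_fn, rng, step_size=0.5):
    """One gradient-ascent step on J(theta) from fresh on-policy samples.

    theta[s] holds the softmax logits of the policy in context s.
    """
    grad = [[0.0] * len(row) for row in theta]
    for s in contexts:
        probs = softmax(theta[s])
        a = rng.choices(range(len(probs)), weights=probs, k=1)[0]  # a ~ pi_theta(. | s)
        r = reward_fn(s, a)
        # Score-function estimator: r * grad log pi_theta(a | s),
        # which for a softmax policy is r * (one_hot(a) - probs).
        for b, p in enumerate(probs):
            grad[s][b] += r * ((1.0 if b == a else 0.0) - p)
    n = len(contexts)
    return [[t + step_size * g / n for t, g in zip(t_row, g_row)]
            for t_row, g_row in zip(theta, grad)]


def accept_reward(s, a):
    # Hypothetical reward: 1 if the suggestion is accepted (here, matches the context).
    return 1.0 if a == s else 0.0


rng = random.Random(0)
theta = [[0.0, 0.0, 0.0] for _ in range(3)]
for _ in range(400):
    batch = [rng.randrange(3) for _ in range(64)]  # s ~ P, resampled after every update
    theta = policy_gradient_step(theta, batch, accept_reward, rng)
```

Samples gathered under an earlier `theta` are never reused here, which is the property that forces the short deployment and data-collection loop discussed in this case study.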

This deployment illustrates multiple CRL challenges\. *Emergent novelty* is inherent: at 400 million daily requests across diverse codebases, the system continually encounters code patterns outside its training sets\. *Changes in environment dynamics* emerge as user preferences and libraries evolve\.

Cursor’s use of policy gradient methods illustrates how continual learning *solutions* impose their own constraints beyond those arising from the *problem* itself\. The policy gradient theorem requires on\-policy samples \(Sutton et al\., [2000](https://arxiv.org/html/2606.04029#bib.bib49)\)\. Once the policy parameters are updated, previous user interaction data becomes stale for computing accurate gradient estimates\. This is not an environmental characteristic necessitating continual learning, but rather a characteristic of the chosen solution \(akin to how catastrophic forgetting or plasticity loss are challenges with artificial neural networks, rather than characteristics of the problem\)\. Nevertheless, this solution\-level constraint creates operational requirements that demand tight deployment\-training loops, with Cursor Tab achieving 1\.5\-2 hour iteration cycles between deployment and data collection for subsequent updates\.

The algorithmic demands of this solution align well with the demands of the problem: the continuous deployment cycles required for on\-policy learning enable adaptation to both emergent novelty and shifting environment dynamics\. This synergy demonstrates how CRL solutions are not just viable, but advantageous in deployment\. Cursor Tab yielded a 21% reduction in suggestions shown while achieving a 28% higher acceptance rate \(Jackson et al\., [2025](https://arxiv.org/html/2606.04029#bib.bib48)\), exemplifying the benefit of deployed agents that never stop learning\.

### 3\.2 Case Study II: Lyft

Lyft’s ridesharing platform processes hundreds of millions of rider\-driver matches per year, requiring real\-time decisions about which driver to assign to each incoming ride request \(Azagirre et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib50)\)\. New matching approaches incorporate increasingly realistic features, yet typically operate with a fixed policy during deployment, with learning or calibration occurring offline or via periodic retraining \(Karp et al\., [1990](https://arxiv.org/html/2606.04029#bib.bib72); Bertsimas et al\., [2019](https://arxiv.org/html/2606.04029#bib.bib73); Yan et al\., [2020](https://arxiv.org/html/2606.04029#bib.bib74)\)\.

Lyft’s 2021 deployment was the first documented case of a ridesharing matching algorithm that used online RL to learn and update its policy after deployment using real\-time feedback from matching outcomes\. This deployment exemplified *action\-induced non\-stationarity* as its primary challenge\. Each matching decision directly altered the future environment\. Assigning a driver to a rider relocated that driver, changed the spatial distribution of driver supply, and altered future matching possibilities\.

*Environment dynamics change continually* as demand patterns vary by time of day, season, one\-off events, and market conditions that evolve across cities\. Out\-of\-distribution *novelty* can emerge from events like a popular concert, a global sporting event, or a huge conference\. These persistent non\-stationarities cannot be fully anticipated offline\.

The choice of online RL as a *solution* addresses these *problem* characteristics directly\. By continually updating value estimates from real\-time matching outcomes, the system adapts to both the non\-stationarity it induces through its own decisions and the external dynamics of the ride\-sharing market\. The authors note that “trusting an algorithm that can update itself is difficult,” highlighting the operational and organizational barriers to safely deploying CRL systems\. Extensive switchback experiments were therefore used to validate safety and performance prior to a global rollout\.

Lyft’s results demonstrate the practical benefits of continual RL in deployment\. Their online RL matching system enabled drivers to complete millions of additional rides per year, generating over $30M in additional annual revenue while also improving rider pickup times and driver utilization \(Azagirre et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib50)\)\. The authors emphasize that such across\-the\-board improvements are rare in large\-scale marketplace optimization, and attribute this to the system’s ability to adapt online rather than optimize for a fixed model\.

### 3\.3 Case Study III: The Sim\-to\-Real Gap

The challenge of transferring policies trained in simulation to the real world \(referred to as the *sim\-to\-real* gap\) provides compelling evidence for continual learning after deployment\. This gap is a direct consequence of the conventional train\-then\-fix paradigm\. Despite simulation’s advantages in sample efficiency, cost, and safety for initial training in domains like robotics, policies trained purely in simulation often fail when transferred to physical robots\.

This gap arises from multiple sources of distribution shift between the environment in the simulator and the environment in which the policy is deployed\. Parameters such as friction, mass, and inertia are difficult to model precisely, and real robots experience wear\-and\-tear over time that changes their dynamics\. Visual discrepancies arise in systems that rely on computer vision: differences between rendered images in simulation and real camera observations cause policies relying on visual input to perform poorly when deployed\. Temporal characteristics such as sensor sampling rates and latencies from sensing to actuation differ between simulation and reality\. The real world also introduces sensory noise, lighting variations, and unexpected perturbations absent from simulation, which increase the sim\-to\-real gap\.

Sim\-to\-real problems can manifest all four sources of non\-stationarity we identified: *environmental dynamics* differ between simulation and reality; *emergent novelties* never seen in simulation appear; *environment dynamics change* from hardware degradation; and even *evolving goals* can emerge as deployment reveals misalignments between simulated task specifications and real\-world objectives\.

Current sim\-to\-real methods acknowledge that fixed policies are insufficient\. Domain randomization, system identification, and domain adaptation all aim to create policies robust to the gap, but these approaches still assume the gap can be bridged during a training phase \(Zhao et al\., [2020](https://arxiv.org/html/2606.04029#bib.bib92)\)\. Even with sophisticated transfer techniques, policies still require real\-time data for reliable performance \(Akkaya et al\., [2019](https://arxiv.org/html/2606.04029#bib.bib94); Horváth et al\., [2022](https://arxiv.org/html/2606.04029#bib.bib93)\)\. Research in this area has led to more data\-efficient training algorithms \(real\-world grasping in 5,000 samples instead of 580,000 \(James et al\., [2019](https://arxiv.org/html/2606.04029#bib.bib95)\)\)\. While such 100\-fold improvements in sample efficiency are valuable for reducing training costs, they do not resolve the core underlying issue: agents cease to learn after deployment\. No amount of training data can allow for autonomy in the face of the big world’s observation stream\.

Table 1: The prevalence of CRL challenges across deployment case studies\. Primary: dominant driver of continual learning; Present: clearly manifests; Implicit: likely present but not in the foreground\.

![Refer to caption](https://arxiv.org/html/2606.04029v1/x2.png)

![Refer to caption](https://arxiv.org/html/2606.04029v1/x3.png)

Figure 2: The train\-then\-fix paradigm \(Learning Rule A\) versus our vision for a continual learner \(Learning Rule B\) in measurable deployment\. Left: A fixed policy degrades until retraining is triggered \(orange dashed line marks acceptable performance, dictated by human expertise\)\. Right: Two learning rules compared over extended deployment: periodic retraining \(blue, sawtooth\) versus a continual learner that decides when and how to update its policy, without human intervention \(pink\)\.

## 4 Continual Learners are the Proper Choice for Measurable Deployment

Under the history process formalism, an agent is a function mapping histories to action distributions, and its *learning rule* determines how this mapping evolves\. Learning can be viewed as a search over a policy set: at each history, the agent either continues searching or terminates by settling on a policy for all future interactions\. \(Footnote 3: There is flexibility in what counts as “settling” on a policy\. One can define an agent that stays $\epsilon$\-close to some $\pi \in \Pi$ as having stopped its search\. This is relevant for real\-world deployment, where there are acceptable ranges of competent performance\.\) An agent that terminates the search is a non\-continual learner\. An agent that continues searching indefinitely, adapting its policy with experience, is a continual learner\.

Abel et al\. \([2023](https://arxiv.org/html/2606.04029#bib.bib22)\) formalize this distinction through the notion of a *best agent*: one that maximizes a specified performance measure over the set of policies reachable by its learning rule\. A problem is an instance of CRL if and only if the best agent never terminates its search over the policy set\. Crucially, this definition depends on the choice of policy set and learning rule\. Changing either can switch a problem or a learner from continual to non\-continual, or vice\-versa\.

A Minimal Example\. Consider a parameterized policy, $\pi_{\boldsymbol{\theta}}$, implemented as a single\-layer, fully\-connected, artificial neural network, with 32 units and a scalar input and output\. We can consider the space of possible parameter configurations $\boldsymbol{\theta} \in \mathbb{R}^{64}$ as our policy set\. The learning rule would be the optimization algorithm we choose for switching between these parameter configurations, such as Stochastic Gradient Descent \(SGD\) with an annealing step\-size\. If the step\-size is set to zero eventually, the learning rule settles on one specific $\boldsymbol{\theta}_{\text{final}}$\. From this point, the agent stops searching and commits to using that policy $\pi_{\boldsymbol{\theta}_{\text{final}}}$ for all future actions, making it a non\-continual learner\. This is fine if a $\pi^{\star}$ truly does exist, and the policy set is rich enough to contain $\pi^{\star}$ within a reachable amount of compute, but that is often not the case in real\-world deployment\. If instead we meta\-learn the step\-size for SGD with IDBD \(Sutton, [1992](https://arxiv.org/html/2606.04029#bib.bib122)\), the step\-size never converges to zero, and the agent never settles on a $\boldsymbol{\theta}$\. This agent qualifies as a continual learner\.
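The contrast between the two learning rules can be sketched in a few lines. As a simplifying assumption, the example uses a linear predictor with two weights rather than the 32-unit network, since IDBD as given by Sutton (1992) is stated for linear least-mean-squares learning. The drifting target, the function names, and the hyperparameters are all illustrative.

```python
import math
import random


def make_stream(num_steps=2000, flip_every=200, seed=0):
    """Hypothetical drifting task: the target weights flip sign every flip_every steps."""
    rng = random.Random(seed)
    stream = []
    for t in range(num_steps):
        sign = -1.0 if (t // flip_every) % 2 else 1.0
        x = [rng.choice([-1.0, 1.0]), rng.choice([-1.0, 1.0])]
        stream.append((x, sign * (1.0 * x[0] - 0.5 * x[1])))
    return stream


def annealed_sgd(stream, num_features, alpha0=0.1):
    """LMS with step-size alpha0 / t, which decays toward zero."""
    w = [0.0] * num_features
    alpha = alpha0
    for t, (x, target) in enumerate(stream, start=1):
        alpha = alpha0 / t
        delta = target - sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + alpha * delta * xi for wi, xi in zip(w, x)]
    return w, alpha


def idbd(stream, num_features, meta_step=0.01, init_alpha=0.05):
    """IDBD (Sutton, 1992): LMS with one meta-learned step-size per weight."""
    w = [0.0] * num_features
    beta = [math.log(init_alpha)] * num_features  # log step-sizes
    h = [0.0] * num_features                      # decaying trace of recent weight updates
    for x, target in stream:
        delta = target - sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            beta[i] += meta_step * delta * xi * h[i]
            alpha_i = math.exp(beta[i])
            w[i] += alpha_i * delta * xi
            h[i] = h[i] * max(0.0, 1.0 - alpha_i * xi * xi) + alpha_i * delta * xi
    return w, [math.exp(b) for b in beta]


stream = make_stream()
final_target = [-1.0, 0.5]  # target weights during the last segment of the stream
w_sgd, last_alpha = annealed_sgd(stream, 2)
w_idbd, alphas = idbd(stream, 2)


def error(w):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, final_target)) ** 0.5
```

Here the annealed step-size has shrunk to `alpha0 / 2000` by the end of the stream, so the first learner has effectively settled, whereas the IDBD step-sizes are exponentials of the learned `beta` values and therefore remain strictly positive.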

Measurable Deployment through the Lens of CRL Agents\. The question for any deployment is first determining whether the optimal policy $\pi^{\star}$ is practically reachable within the agent’s policy set under available resources\.

If yes, as is the case with the checkers engine, the agent can terminate its search upon finding $\pi^{\star}$\. The problem is not an instance of CRL as the best learner can stop its search\.

If no \(as we argue is the case for most real\-world deployments\), then no practically reachable element of the policy set maximizes performance across all reachable histories, and the problem is a CRL problem\. In measurable deployment, the agent still receives an evaluative signal revealing how well it performs\. The best agents leverage this feedback through their learning rule to update their policy, improving performance over time and continuing to adapt as new histories are encountered\. This is precisely the CRL setting: the best agent cannot terminate its search, because no reachable policy is optimal\. The only path to better performance is to keep adapting, leveraging the evaluative signal\.

Two Learning Rules Compared\.[Figure2](https://arxiv.org/html/2606.04029#S3.F2)illustrates two responses to measurable deployment\. The left plot shows a single train\-then\-fix cycle: an agent searches for a high\-performing element of the policy set and pauses the search upon finding one, settling on a fixed policy\. Due to the non\-stationarities identified in[Section3](https://arxiv.org/html/2606.04029#S3), performance degrades over time until a minimum performance threshold set by human experts triggers another round of search for a high\-performing policy\. The right plot extends this over a longer deployment horizon\. This leads to a sawtooth pattern from repeated retraining\. Learning rule A treats deployment as a series of non\-continual problems\. The Rusting Pendulum environment in Appendix[A](https://arxiv.org/html/2606.04029#A1)empirically demonstrates this degradation: a train\-then\-fix policy fails as joint friction accumulates, but a continual learner maintains performance\.
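A toy simulation, unrelated to the Rusting Pendulum experiment in the appendix, can make the two curves concrete. It assumes a single scalar quantity that drifts linearly, a fixed estimate that is reset to the true value whenever its error crosses a threshold (learning rule A), and an estimate that is nudged toward the observed value at every step (learning rule B). All names and constants are hypothetical.

```python
def simulate(num_steps=1000, drift=0.01, threshold=1.0, step_size=0.5):
    """Compare threshold-triggered retraining (A) with per-step adaptation (B)."""
    true_value = 0.0
    fixed = 0.0      # rule A: frozen between retraining events
    continual = 0.0  # rule B: updated on every step
    errors_a, errors_b, retrains = [], [], 0
    for _ in range(num_steps):
        true_value += drift                      # the world drifts
        errors_a.append(abs(true_value - fixed))
        errors_b.append(abs(true_value - continual))
        if errors_a[-1] > threshold:             # performance fell below the acceptable bar
            fixed = true_value                   # idealized, instantaneous retraining
            retrains += 1
        continual += step_size * (true_value - continual)
    return errors_a, errors_b, retrains


errors_a, errors_b, retrains = simulate()
mean_a = sum(errors_a) / len(errors_a)
mean_b = sum(errors_b) / len(errors_b)
```

With these constants, the error of rule A climbs to the threshold and resets repeatedly, giving the sawtooth shape, while the error of rule B settles at a small, constant lag of roughly `drift / step_size`.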

In contrast, the agent using learning rule B never stops its search\. Recognizing that $\pi^{\star}$ is not in its policy set, the agent continually adapts, using the available evaluative signal to guide its ongoing search\.

Both A and B are learning rules \(maps from a history to a distribution over policies\), but only the latter embraces the CRL problem setting\. They differ in whether the agent designer accepts measurable deployment as inherently a CRL problem\.
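The contrast between the two learning rules can be sketched with a toy simulation\. The following is a minimal, hypothetical illustration and not the Rusting Pendulum experiment from the appendix: it assumes a single scalar environment parameter that drifts linearly, a performance score of one minus the gap between that parameter and the agent’s own, and illustrative values for the drift rate, threshold, and step\-size\. Rule A freezes its parameter and retrains only when the score falls below the threshold, while rule B takes a small corrective step at every time step\.

```python
def simulate(steps=2000, drift=0.001, threshold=0.8, step_size=0.1):
    """Compare rule A (train-then-fix) and rule B (continual) on a drifting target.

    All quantities are hypothetical: `target` stands in for the environment,
    and performance is defined as 1 - |target - agent parameter|.
    """
    target = 0.0    # environment parameter, drifts after deployment
    fixed = 0.0     # rule A: frozen between retraining rounds
    tracking = 0.0  # rule B: updated at every step
    scores_a, scores_b, retrains = [], [], 0

    for _ in range(steps):
        target += drift

        # Rule A: serve the frozen policy and retrain only once the score
        # drops below the human-set threshold.
        score_a = 1.0 - abs(target - fixed)
        scores_a.append(score_a)
        if score_a < threshold:
            fixed = target
            retrains += 1

        # Rule B: score the current policy, then take a small step toward
        # the target using the evaluative signal.
        scores_b.append(1.0 - abs(target - tracking))
        tracking += step_size * (target - tracking)

    return {
        "mean_a": sum(scores_a) / steps,
        "mean_b": sum(scores_b) / steps,
        "retrains": retrains,
    }


if __name__ == "__main__":
    print(simulate())
```

Under these assumptions, rule A produces the sawtooth \(its score decays to the threshold before each retraining round\), whereas rule B settles at a small, constant lag behind the drifting target\.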

We argue that designers facing measurable deployment should embrace the latter framing\. Abel et al\. \([2023](https://arxiv.org/html/2606.04029#bib.bib22)\) demonstrate this empirically in a controlled switching MDP where a continual $Q$\-learner with a fixed step\-size consistently outperformed one that converges\. Sutton et al\. \([2007](https://arxiv.org/html/2606.04029#bib.bib84)\) showed that tracking outperforms convergence even in a stationary environment when observations are non\-Markovian, suggesting that the case for continual adaptation is broader than just non\-stationary settings\. By accepting the continual nature of the problem and designing agents that never terminate their search, we enable them to leverage the evaluative feedback they receive, rather than forgoing performance between retraining cycles\.
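The tracking\-versus\-convergence argument can be illustrated with a much simpler stand\-in than the switching MDP of Abel et al\.: a two\-armed bandit whose arm means swap periodically\. The sketch below is a toy under stated assumptions, not a reproduction of either cited experiment\. The function name `run_bandit`, the switching period, the noise level, and the exploration rate are all illustrative choices\. Passing `step_size=None` gives a sample\-average learner whose step\-size decays as 1/n \(and so converges\), while a constant `step_size` gives a learner that keeps tracking\.

```python
import random


def run_bandit(step_size=None, steps=5000, switch_every=500, epsilon=0.1, seed=0):
    """Average reward of an epsilon-greedy learner on a switching two-armed bandit.

    step_size=None uses the sample-average update (step 1/n, a converging
    learner); a float uses a constant step-size (a tracking learner).
    """
    rng = random.Random(seed)
    means = [1.0, 0.0]   # hypothetical arm means; they swap periodically
    q = [0.0, 0.0]       # action-value estimates
    counts = [0, 0]
    total = 0.0

    for t in range(steps):
        if t > 0 and t % switch_every == 0:
            means.reverse()  # the environment switches

        if rng.random() < epsilon:
            arm = rng.randrange(2)
        else:
            arm = 0 if q[0] >= q[1] else 1

        reward = means[arm] + rng.gauss(0.0, 0.1)
        counts[arm] += 1
        alpha = step_size if step_size is not None else 1.0 / counts[arm]
        q[arm] += alpha * (reward - q[arm])
        total += reward

    return total / steps


if __name__ == "__main__":
    print("tracking:  ", run_bandit(step_size=0.1))
    print("converging:", run_bandit(step_size=None))
```

In this setup the sample\-average learner weights old and new rewards equally, so after a switch it needs on the order of as many new samples as it has old ones before its greedy choice changes\. The constant step\-size learner discounts old rewards geometrically and recovers within a few dozen steps\.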

## 5 Call to Action

Recognizing measurable deployment as a CRL problem is only a first step\. Translating this insight into practice requires coordinated effort from practitioners and researchers\. This section offers concrete recommendations for each group\.

### 5\.1 Recommendations for Practitioners

If your deployed system receives evaluative feedback, and operates in an environment too complex to fully characterize offline, then you are in measurable deployment\. The question is not whether to adapt, but how\.

#### Recognize your deployment regime\.

Before investing in continual learning infrastructure, assess whether your setting warrants it\. Ask: \(1\) Does my system receive ongoing evaluative signals after deployment? \(2\) Does performance degrade without intervention? \(3\) Is periodic retraining costly, slow, or insufficient? If all three hold, you are giving up performance by fixing your policy after training\.

#### Build feedback loops, not just pipelines\.

Traditional MLOps treats deployment as the end of learning: data flows in, a model is trained, and the artifact is served until staleness triggers retraining, leading to the sawtooth pattern in [Figure 2](https://arxiv.org/html/2606.04029#S3.F2)\. Continual deployment inverts this: the deployed model is the learning system, and production data is training data\. This requires infrastructure changes: logging interactions in formats amenable to online updates, maintaining low\-latency paths from feedback to model parameters, and treating model checkpoints as evolving rather than versioned artifacts\. Cursor Tab’s 1\.5–2 hour iteration cycles exemplify this tight coupling between deployment and learning\.

#### Validate continually, not just before deployment\.

Lyft’s switchback experiments \(alternating between policies in production to measure causal effects\) are a safety validation extending beyond pre\-deployment\. When policies update continually, so must evaluation\. Implement online monitoring for performance metrics, distributional drift, and safety constraints\. Maintain fallback policies that can be activated if the learner behaves unexpectedly\. The goal is not to eliminate risk, but to make adaptation safer than stagnation\.
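One way to combine online monitoring with a fallback policy is a thin guard around the learner\. The sketch below is a hypothetical illustration, not a description of any system cited here: the class name `FallbackGuard`, the rolling\-window size, and the reward threshold are assumptions, and policies are modeled as plain callables from observation to action\. The guard latches onto the fallback once the rolling mean reward drops below the threshold and stays there until an operator calls `reset`\.

```python
from collections import deque


class FallbackGuard:
    """Routes actions to a fallback policy when recent reward drops too low."""

    def __init__(self, learner, fallback, window=50, min_mean_reward=0.0):
        self.learner = learner            # continually updated policy (callable)
        self.fallback = fallback          # validated fixed policy (callable)
        self.rewards = deque(maxlen=window)
        self.min_mean_reward = min_mean_reward
        self.using_fallback = False

    def act(self, observation):
        policy = self.fallback if self.using_fallback else self.learner
        return policy(observation)

    def record(self, reward):
        """Log one reward and trip the guard if the rolling mean is too low."""
        self.rewards.append(reward)
        window_full = len(self.rewards) == self.rewards.maxlen
        if window_full and sum(self.rewards) / len(self.rewards) < self.min_mean_reward:
            self.using_fallback = True

    def reset(self):
        """Hand control back to the learner, e.g. after human review."""
        self.rewards.clear()
        self.using_fallback = False
```

A real deployment would likely monitor distributional drift and safety constraints as well\. The rolling mean reward is used here only because it is the simplest signal to show\.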

#### Introduce non\-stationarity deliberately\.

A practical way to stress\-test your system’s adaptability is with controlled non\-stationarities in development: perturb rewards, skew observations, or simulate concept drift\. This reveals brittleness before deployment and calibrates how aggressively your system should adapt\. If your system cannot handle synthetic non\-stationarity, it will struggle with the real world\.
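Such perturbations can be injected with a small environment wrapper\. The sketch below assumes a hypothetical environment interface with a `step(action)` method that returns a scalar observation and a scalar reward\. The names `DriftWrapper` and `ConstantEnv`, and both perturbation parameters, are illustrative and do not come from any particular library\. It adds a linearly growing bias to observations \(gradual sensor skew\) and flips the sign of the reward from a chosen step onward \(an abrupt change in what counts as good behavior\)\.

```python
class ConstantEnv:
    """Hypothetical stand-in environment: observation 0.0 and reward 1.0 every step."""

    def step(self, action):
        return 0.0, 1.0


class DriftWrapper:
    """Wraps an environment and injects controlled non-stationarity for stress tests."""

    def __init__(self, env, observation_bias_per_step=0.0, flip_reward_at=None):
        self.env = env
        self.observation_bias_per_step = observation_bias_per_step
        self.flip_reward_at = flip_reward_at
        self.t = 0  # number of steps taken so far

    def step(self, action):
        observation, reward = self.env.step(action)
        self.t += 1
        # Gradual skew: the observation bias grows linearly with time.
        observation = observation + self.observation_bias_per_step * self.t
        # Abrupt change: the reward sign flips from step `flip_reward_at` onward.
        if self.flip_reward_at is not None and self.t >= self.flip_reward_at:
            reward = -reward
        return observation, reward
```

Sweeping the drift rate and the flip time in development gives a rough picture of how quickly the learner recovers, which can inform how aggressive its step\-size should be\.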

### 5\.2 Recommendations for Researchers

A number of prior works have surveyed the challenges faced by continual learners and outlined promising directions for future research \(Parisi et al\., [2019](https://arxiv.org/html/2606.04029#bib.bib58); Delange et al\., [2021](https://arxiv.org/html/2606.04029#bib.bib114); Khetarpal et al\., [2022](https://arxiv.org/html/2606.04029#bib.bib42); Verwimp et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib117)\)\. Notably, the Alberta Plan proposed by Sutton et al\. \([2022](https://arxiv.org/html/2606.04029#bib.bib107)\) presents a long\-term research plan aimed at developing agents capable of sustained, open\-ended learning\. In addition, several technical studies have examined the challenges associated with deploying continual learners in real\-world environments \(Amodei et al\., [2016](https://arxiv.org/html/2606.04029#bib.bib115); Dulac\-Arnold et al\., [2019](https://arxiv.org/html/2606.04029#bib.bib113); Hamadanian et al\., [2022](https://arxiv.org/html/2606.04029#bib.bib118)\)\. This section focuses on a subset of challenges and recommendations that have been highlighted more recently, with the goal of drawing attention to issues that may otherwise be overlooked and encouraging further investigation by the research community\.

#### Study the performance of conventional methods implemented in continual settings\.

A natural recommendation is to reassess the behavior of widely adopted conventional methods when deployed in continual learning problems\. Many algorithms and optimization techniques that perform well in stationary settings may exhibit undesirable behavior under non\-stationarity\. Degris et al\. \([2024](https://arxiv.org/html/2606.04029#bib.bib102)\) showed that two commonly used optimizers, RMSProp \(Hinton et al\., [2012](https://arxiv.org/html/2606.04029#bib.bib104)\) and Adam \(Kingma and Ba, [2015](https://arxiv.org/html/2606.04029#bib.bib105)\), were unsuitable for step\-size adaptation in a simple continual learning scenario\. Despite this, these optimizers are frequently employed in continual and online settings \(Han et al\., [2022](https://arxiv.org/html/2606.04029#bib.bib108)\)\. CRL research should treat hyperparameter tuning as part of online deployment, not a hidden offline phase \(Hakhverdyan, [2024](https://arxiv.org/html/2606.04029#bib.bib9); Patterson et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib131)\), and evaluate methods that adapt hyperparameters during learning\. Notable candidates for this include Bayesian \(Parker\-Holder et al\., [2022](https://arxiv.org/html/2606.04029#bib.bib132)\) and meta\-gradient methods \(Sutton, [1992](https://arxiv.org/html/2606.04029#bib.bib122); Degris et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib102)\)\.

#### Rederive algorithm properties for history processes\.

The shift from the MDP formalism to the history process is not merely notational\. Standard convergence results for value\-based and policy gradient methods depend critically on the Markov property\. Without it, value functions defined on state representations are approximations of the true history\-conditioned values, and their theoretical guarantees dissolve \(Abel et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib76); Elelimy et al\., [2025](https://arxiv.org/html/2606.04029#bib.bib33)\)\. This is an under\-explored consequence of adopting more realistic CRL foundations\. Researchers should audit which properties of conventional algorithms survive the loss of Markovianity, and work to establish what weaker guarantees can be recovered \(convergence in a non\-stationary sense, regret bounds under history dependence, or stability conditions for online hyperparameter methods\)\.

#### Consider different evaluation metrics\.

Many CRL works evaluate agents using the expected average reward as the performance measure\. For this metric to be well\-defined and meaningful, the underlying environment should be at least weakly communicating, ensuring that long\-run averages are independent of the initial state \(Wan and Sutton, [2022](https://arxiv.org/html/2606.04029#bib.bib112)\)\. Many real\-world and production systems violate this assumption due to irreversible failures, shutdowns, or terminal user states that effectively partition the state space\. As a result, average\-reward\-based evaluation may obscure meaningful differences between agents\. We recommend that researchers consider more general performance measures that reduce implicit assumptions about environment structure and better reflect effective performance in continual deployment settings\. One such evaluation metric is deviation regret \(Elelimy et al\., [2025](https://arxiv.org/html/2606.04029#bib.bib33)\), which assesses agents based on the situations they encounter rather than solely on asymptotic reward accumulation\.
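A small worked example may help show how irreversible failure interacts with average reward\. The sketch below assumes a hypothetical two\-state process that is not taken from the paper: the system earns reward 1 per step while operating, fails irreversibly with a fixed per\-step probability, and earns 0 forever afterward\. The function name and all numbers are illustrative\. The code does not implement deviation regret; it only computes the expected per\-step reward in closed form, using the fact that the probability of still operating at step t is \(1 \- p\)^t\.

```python
def expected_average_reward(failure_prob, horizon, start_failed=False):
    """Expected per-step reward over `horizon` steps.

    Reward is 1 per step while operating and 0 after an irreversible failure,
    which occurs with probability `failure_prob` after each operating step.
    """
    if start_failed:
        return 0.0
    if failure_prob == 0.0:
        return 1.0
    survive = 1.0 - failure_prob
    # Sum over t = 0..horizon-1 of survive**t (a geometric series).
    expected_total = (1.0 - survive ** horizon) / failure_prob
    return expected_total / horizon


if __name__ == "__main__":
    for horizon in (100, 10**6):
        careful = expected_average_reward(0.001, horizon)
        reckless = expected_average_reward(0.1, horizon)
        print(horizon, careful, reckless)
```

Two things show up under these assumptions\. With no failure risk, the long\-run average is 1 from the operating state and 0 from the failed state, so it depends on the initial state\. With any positive failure probability, a careful agent and a reckless one both approach an average of 0 as the horizon grows, even though their performance over a short horizon differs by roughly an order of magnitude\.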

#### Take resource constraints into account\.

Finally, CRL research should explicitly account for finite computational and memory resources\. Under the Big World Hypothesis \(Section [3](https://arxiv.org/html/2606.04029#S3)\), environmental complexity exceeds agent capacity, shifting the objective from convergence to tracking targets over time \(Sutton et al\., [2007](https://arxiv.org/html/2606.04029#bib.bib84)\)\. Recent theoretical results formalize this intuition, showing that continual learning is necessary for resource\-constrained agents to sustain performance \(Kumar et al\., [2025](https://arxiv.org/html/2606.04029#bib.bib56)\)\. Practically, this motivates architectures that dynamically reallocate computation and memory, favoring efficient, approximate updates for real\-time adaptation \(Javed et al\., [2023](https://arxiv.org/html/2606.04029#bib.bib99); Lo et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib123); Vasan et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib124); Tamborski and Abel, [2025](https://arxiv.org/html/2606.04029#bib.bib100)\)\.

### 5\.3 Benchmarks

Benchmarks have always played an important role in fostering collaboration and advances in machine learning \(Martínez\-Plumed et al\., [2021](https://arxiv.org/html/2606.04029#bib.bib11)\)\. Examples range from CIFAR \(Krizhevsky, [2009](https://arxiv.org/html/2606.04029#bib.bib8)\) and ImageNet \(Krizhevsky et al\., [2012](https://arxiv.org/html/2606.04029#bib.bib10)\), which challenged and accelerated computer vision, to mathematics benchmarks that have been critical to the progress of reasoning in large language models \(Fang et al\., [2025](https://arxiv.org/html/2606.04029#bib.bib3)\)\.

The lack of such benchmarks in CRL has been one of the main barriers to research in this field \(Khetarpal et al\., [2022](https://arxiv.org/html/2606.04029#bib.bib42)\)\. The current CRL benchmarks add non\-stationarities to existing RL benchmarks\. This may be by changing the dynamics of the environment, switching between different tasks \(Switching Arcade Learning Environment; [Abbas et al\.](https://arxiv.org/html/2606.04029#bib.bib62), [2023](https://arxiv.org/html/2606.04029#bib.bib62)\), or changing the reward function over time \(Anand and Precup, [2023](https://arxiv.org/html/2606.04029#bib.bib7)\)\. There are also benchmarks inspired by robotics \(Wolczyk et al\., [2021](https://arxiv.org/html/2606.04029#bib.bib4)\), and games that introduce non\-stationarities gradually \(Mohamed et al\., [2026](https://arxiv.org/html/2606.04029#bib.bib6)\)\.

We suggest that benchmarks must be developed based on defined characteristics of CRL problems to be useful for the community\. In particular, we build on the proposed need for a big world simulator for continual learning \(Kumar et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib5)\), in which there are no diminishing returns to increasing the capacity of the agent, since any agent with finite capacity will need to learn and update forever in such an environment\. This is in line with deployment settings\.

## 6 Alternative Views

To stimulate productive discussion in the community, we present several alternative views to our thesis that deployed RL should be continual\. While these concerns merit consideration, we argue they can be addressed through careful system design, deployment best practices, and by reorienting research toward continual learning rather than convergence to fixed artifacts\. Much of our community’s research remains on algorithms that solve problems and stop learning, which leads to deployments lacking continual adaptation\.

#### Current approaches are already adaptive\.

One can argue that deployed systems already adapt continually through modern MLOps pipelines\. Periodic retraining and fine\-tuning are legitimate learning rules \(under [Section 2](https://arxiv.org/html/2606.04029#S2)’s definition\)\. Both curves in [Figure 2](https://arxiv.org/html/2606.04029#S3.F2) are valid continual learning rules: they differ not in whether adaptation occurs, but in how it is triggered\. When retraining is integrated as an internal, autonomously triggered rule, it can constitute genuine continual adaptation in the CRL setting\.

However, retraining is often part of an external MLOps pipeline in the form of human intervention/expertise, rather than the agent’s internal operation\. This limits how and when the agent updates its policy to human knowledge, rather than letting the agent discover a good time and direction for an update\.

In contrast, when re\-training is integrated as an internal learning rule within the agent’s autonomous operation, the agent gets to decide how and when it updates its policy based on what it has seen\. In domains that are understudied or unexplored by humans \(exploring extra\-terrestrial environments\), or where optimal performance is unbounded \(stock trading\), such external updates may not adequately address the demands of measurable deployment\.

#### Policies trained offline are generalizable enough\.

Zero\-shot and few\-shot generalization, and sim\-to\-real methods may provide favorable initializations at deployment by distilling prior knowledge so competent performance is reached faster \(Kirk et al\., [2023](https://arxiv.org/html/2606.04029#bib.bib2); Beck et al\., [2023](https://arxiv.org/html/2606.04029#bib.bib136); Iannotta et al\., [2025](https://arxiv.org/html/2606.04029#bib.bib138)\)\. These are worthwhile complements to continual learning from interaction, but not substitutes\. They accelerate early adaptation but are bounded by the dynamics seen during a finite training phase\. As the deployment horizon extends and the agent encounters the non\-stationarities from [Section 3](https://arxiv.org/html/2606.04029#S3), the best agents must continue adapting beyond what any prior can cover\. In\-context approaches \(Brown et al\., [2020](https://arxiv.org/html/2606.04029#bib.bib139)\) can in principle be viewed as a continual learning rule by treating context as part of a recurrent hidden state \(Akyürek et al\., [2023](https://arxiv.org/html/2606.04029#bib.bib140); Kang et al\., [2025](https://arxiv.org/html/2606.04029#bib.bib137)\), but whether finite context windows support credit assignment and exploration under measurable deployment remains an open question\.

#### Not all deployments need continual learning\.

While not all deployments require continual learning, we argue that continual learning is a requirement for measurable deployment\. As discussed in Section [3](https://arxiv.org/html/2606.04029#S3), real\-world deployments are often in the Big World regime and are subject to the four non\-stationarities from a variety of causes \(ranging from gradual wear and tear to sudden black swan events\) that cannot be fully anticipated by designers or represented within the agent\. In such a setting, the agent should leverage reward as evaluative feedback to adapt continually\.

For example, an industrial robot arm repeating the same task seems stationary, but mechanical wear and joint degradation accumulate, causing accuracy to drift\. What appears stationary is revealed as non\-stationary over extended deployment\.

#### Reward signals may be sparse, delayed, or unobservable after deployment\.

Our argument hinges on measurable deployment, but what if the evaluative signal cannot be obtained reliably? A Roomba in a factory has extensive instrumentation to test and reward all scenarios; in a home, feedback is sporadic at best\. A sensor that degrades may report misleading observations, poisoning any learning signal\.

When rewards are delayed beyond the horizon of useful credit assignment, an agent may still be able to learn from the differences in values of the situations it encountered\. When reward is absent entirely, there is nothing to learn from\. Monitored\-MDPs formalize this challenge, and agents train reward models to still learn with partial feedback \(Parisi et al\., [2024](https://arxiv.org/html/2606.04029#bib.bib120); Mohammedalamen and Bowling, [2026](https://arxiv.org/html/2606.04029#bib.bib119)\)\. Such monitors are an open problem in history processes, which do not have the same convergence guarantees as a Markovian monitor, and we acknowledge this limitation\. Our position applies specifically to measurable deployment, where an evaluative signal exists and can be trusted\. The concerning observation is that many deployed systems have access to reliable feedback \(click\-through rates, revenue, suggestion acceptance rates\), but still default to fixed policies\. We urge the community to reconsider such cases\.

#### Safety and alignment concerns favor fixed policies over continual learners\.

One might argue that deploying fixed policies that have been extensively validated offline is inherently safer than allowing agents to continue learning after deployment\. A fixed policy’s behavior can be exhaustively tested, its failure modes catalogued, and its performance bounds characterized before it is deployed\. However, fixed policies are not a viable safer alternative\. As established in [Section 3](https://arxiv.org/html/2606.04029#S3), the four sources of non\-stationarity ensure models cannot remain static\. Fixed policies either degrade unsafely or require human intervention, both carrying their own safety risks\. The wealth of safe RL literature reflects that adaptation is necessary for maintaining safety, not antithetical to it \(Moldovan and Abbeel, [2012](https://arxiv.org/html/2606.04029#bib.bib86); Alshiekh et al\., [2018](https://arxiv.org/html/2606.04029#bib.bib87); Kumar et al\., [2020](https://arxiv.org/html/2606.04029#bib.bib89); Thomas et al\., [2021](https://arxiv.org/html/2606.04029#bib.bib88); Skalse et al\., [2022](https://arxiv.org/html/2606.04029#bib.bib90); Moghimi and Ku, [2025](https://arxiv.org/html/2606.04029#bib.bib91)\)\. Recent work has even shown that agents can be trained to behave cautiously in the face of unseen observations \(Mohammedalamen et al\., [2021](https://arxiv.org/html/2606.04029#bib.bib121)\)\.

Safe AI is a field still in its infancy, and new safety constraints and legislation are frequently announced and updated\. Instead of retraining new models from scratch to adhere to them, continual learning may allow a learner to adapt to these regulatory updates\.

## 7 Conclusion

We have presented why deploying decision\-making agents that are incapable of optimality, but privileged with evaluative feedback in the form of a reward signal after deployment, is a continual reinforcement learning problem\. To largely ignore this signal, and restrict the agent to rely on human intervention or expertise to decide when and how the policy is updated, is to leave performance on the table indefinitely\. We hope this paper encourages practitioners to leverage the feedback their systems already receive, and researchers to prioritize continual learning as the problem setting for deployed RL\.

## Acknowledgements

The authors would like to thank Michael Bowling, A\. Rupam Mahmood, Khurram Javed, Gautham Vasan, Montaser Mohammedalamen, Diego Gomez, Abhishek Naik, Andrew Patterson, Shibhansh Dohare, and Matthew Schlegel for constructive discussions and feedback on an earlier draft of this work\. We thank the anonymous reviewers and area chair for their time, reviews, and suggestions that strengthened this paper\. Portions of this research are based on work supported by the Canadian AI Safety Institute Research Program at CIFAR, funded by the Government of Canada\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.

## References

- Z\. Abbas, R\. Zhao, J\. Modayil, A\. White, and M\. C\. Machado \(2023\) Loss of Plasticity in Continual Deep Reinforcement Learning\. In Conference on Lifelong Learning Agents\.
- D\. Abel, A\. Barreto, B\. Van Roy, D\. Precup, H\. P\. van Hasselt, and S\. Singh \(2023\) A Definition of Continual Reinforcement Learning\. In Neural Information Processing Systems\.
- D\. Abel, M\. K\. Ho, and A\. Harutyunyan \(2024\) Three Dogmas of Reinforcement Learning\. In Reinforcement Learning Journal\.
- I\. Akkaya, M\. Andrychowicz, M\. Chociej, M\. Litwin, B\. McGrew, A\. Petron, A\. Paino, M\. Plappert, G\. Powell, R\. Ribas, et al\. \(2019\) Solving Rubik’s Cube with a Robot Hand\. arXiv preprint 1910\.07113\.
- E\. Akyürek, D\. Schuurmans, J\. Andreas, T\. Ma, and D\. Zhou \(2023\) What learning algorithm is in\-context learning? Investigations with linear models\. In International Conference on Learning Representations\.
- M\. Alshiekh, R\. Bloem, R\. Ehlers, B\. Könighofer, S\. Niekum, and U\. Topcu \(2018\) Safe Reinforcement Learning via Shielding\. In AAAI Conference on Artificial Intelligence\.
- D\. Amodei, C\. Olah, J\. Steinhardt, P\. Christiano, J\. Schulman, and D\. Mané \(2016\) Concrete Problems in AI Safety\. arXiv preprint 1606\.06565\.
- N\. Anand and D\. Precup \(2023\) Prediction and Control in Continual Reinforcement Learning\. In Neural Information Processing Systems\.
- X\. Azagirre, A\. Balwally, G\. Candeli, N\. Chamandy, B\. Han, A\. King, H\. Lee, M\. Loncaric, S\. Martin, V\. Narasiman, et al\. \(2024\) A Better Match for Drivers and Riders: Reinforcement Learning at Lyft\. In Journal on Applied Analytics\.
- A\. B\. Barron, E\. A\. Hebets, T\. A\. Cleland, C\. L\. Fitzpatrick, M\. E\. Hauber, and J\. R\. Stevens \(2015\) Embracing Multiple Definitions of Learning\. In Trends in Neurosciences\.
- J\. Beck, R\. Vuorio, E\. Zheran Liu, Z\. Xiong, L\. Zintgraf, C\. Finn, and S\. Whiteson \(2023\) A Survey of Meta\-Reinforcement Learning\. arXiv e\-prints\.
- M\. G\. Bellemare, S\. Candido, P\. S\. Castro, J\. Gong, M\. C\. Machado, S\. Moitra, S\. S\. Ponda, and Z\. Wang \(2020\) Autonomous Navigation of Stratospheric Balloons using Reinforcement Learning\. In Nature\.
- R\. Bellman \(1957\) A Markovian Decision Process\. In Journal of Mathematics and Mechanics\.
- C\. Berner, G\. Brockman, B\. Chan, V\. Cheung, P\. Debiak, C\. Dennison, D\. Farhi, Q\. Fischer, S\. Hashme, C\. Hesse, et al\. \(2019\) Dota 2 with Large Scale Deep Reinforcement Learning\. arXiv preprint 1912\.06680\.
- D\. Bertsimas, P\. Jaillet, and S\. Martin \(2019\) Online Vehicle Routing: The Edge of Optimization in Large\-Scale Applications\. In Operations Research\.
- M\. Bowling, J\. D\. Martin, D\. Abel, and W\. Dabney \(2023\) Settling the Reward Hypothesis\. In International Conference on Machine Learning\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, et al\. \(2020\) Language Models are Few\-shot Learners\. In Neural Information Processing Systems\.
- A\. R\. Cassandra, L\. P\. Kaelbling, and M\. L\. Littman \(1994\) Acting Optimally in Partially Observable Stochastic Domains\. In AAAI Conference on Artificial Intelligence\.
- J\. Degrave, F\. Felici, J\. Buchli, M\. Neunert, B\. Tracey, F\. Carpanese, T\. Ewalds, R\. Hafner, A\. Abdolmaleki, D\. de Las Casas, et al\. \(2022\) Magnetic Control of Tokamak Plasmas through Deep Reinforcement Learning\. In Nature\.
- T\. Degris, K\. Javed, A\. Sharifnassab, Y\. Liu, and R\. Sutton \(2024\) Step\-size Optimization for Continual Learning\. arXiv preprint 2401\.17401\.
- M\. Delange, R\. Aljundi, M\. Masana, S\. Parisot, X\. Jia, A\. Leonardis, G\. Slabaugh, and T\. Tuytelaars \(2021\) A Continual Learning Survey: Defying forgetting in Classification Tasks\. In IEEE Transactions on Pattern Analysis and Machine Intelligence\.
- S\. Dohare, J\. F\. Hernandez\-Garcia, Q\. Lan, P\. Rahman, A\. R\. Mahmood, and R\. S\. Sutton \(2024\) Loss of Plasticity in Deep Continual Learning\. In Nature\.
- G\. Dulac\-Arnold, D\. Mankowitz, and T\. Hester \(2019\) Challenges of Real\-World Reinforcement Learning\. In International Conference on Machine Learning\.
- E\. Elelimy, D\. Szepesvari, M\. White, and M\. Bowling \(2025\) Rethinking the Foundations for Continual Reinforcement Learning\. Reinforcement Learning Journal\.
- M\. Fang, X\. Wan, F\. Lu, F\. Xing, and K\. Zou \(2025\) MathOdyssey: Benchmarking Mathematical Problem\-Solving Skills in Large Language Models Using Odyssey Math Data\. In Nature Scientific Data\.
- A\. Hakhverdyan \(2024\) Accounting for Hyperparameter Tuning in Online Reinforcement Learning\. Master’s Thesis, Department of Computing Science, University of Alberta\.
- P\. Hamadanian, M\. Schwarzkopf, and S\. Sen \(2022\) How Reinforcement Learning systems fail and what to do about it\. In EuroSys Workshop on Machine Learning and Systems\.
- B\. Han, H\. Lee, and S\. Martin \(2022\) Real\-Time Rideshare Driver Supply Values Using Online Reinforcement Learning\. In ACM Conference on Knowledge Discovery and Data Mining\.
- G\. Hinton, N\. Srivastava, and K\. Swersky \(2012\) RMSProp: Divide the gradient by a running average of its recent magnitude\. Coursera: Neural Networks for Machine Learning, Lecture 6e\.
- D\. Horváth, G\. Erdős, Z\. Istenes, T\. Horváth, and S\. Földi \(2022\) Object Detection using Sim2Real Domain Randomization for Robotic Applications\. IEEE Transactions on Robotics\.
- M\. Iannotta, Y\. Yang, J\. A\. Stork, E\. Schaffernicht, and T\. Stoyanov \(2025\) Can Context Bridge the Reality Gap? Sim\-to\-Real Transfer of Context\-Aware Policies\. arXiv preprint 2511\.04249\.
- IBM \(2024\) What Is Model Deployment? Accessed: 2026\-01\-11\. [Link](https://www.ibm.com/think/topics/model-deployment)
- J\. Jackson, P\. Kravtsov, and S\. Jain \(2025\) Improving Cursor Tab with online RL\. Cursor Blog \(Research\)\. Accessed: 2026\-01\-28\. [Link](https://cursor.com/blog/tab-rl)
- S\. James, P\. Wohlhart, M\. Kalakrishnan, D\. Kalashnikov, A\. Irpan, J\. Ibarz, S\. Levine, R\. Hadsell, and K\. Bousmalis \(2019\) Sim\-to\-Real via Sim\-to\-Sim: Data\-Efficient Robotic Grasping via Randomized\-to\-Canonical Adaptation Networks\. In IEEE Conference on Computer Vision and Pattern Recognition\.
- K\. Javed, H\. Shah, R\. S\. Sutton, and M\. White \(2023\) Scalable Real\-time Recurrent Learning using Columnar\-Constructive Networks\. In Journal of Machine Learning Research\.
- K\. Javed and R\. S\. Sutton \(2024\) The Big World Hypothesis and its ramifications for Artificial Intelligence\. In RLC Finding the Frame Workshop\.
- L\. Kang, F\. Wang, S\. Liu, H\. Chou, C\. Lin, and N\. Ding \(2025\) In\-Context Learning can Perform Continual Learning Like Humans\. arXiv preprint 2509\.22764\.
- R\. M\. Karp, U\. V\. Vazirani, and V\. V\. Vazirani \(1990\) An Optimal Algorithm for Online Bipartite Matching\. In ACM Symposium on Theory of Computing\.
- K\. Khetarpal, M\. Riemer, I\. Rish, and D\. Precup \(2022\) Towards Continual Reinforcement Learning: A Review and Perspectives\. In Journal of Artificial Intelligence Research\.
- D\. P\. Kingma and J\. Ba \(2015\) Adam: A Method for Stochastic Optimization\. In International Conference on Learning Representations\.
- R\. Kirk, A\. Zhang, E\. Grefenstette, and T\. Rocktäschel \(2023\) A survey of zero\-shot generalisation in deep reinforcement learning\. Journal of Artificial Intelligence Research\.
- A\. Krizhevsky, I\. Sutskever, and G\. E\. Hinton \(2012\) ImageNet Classification with Deep Convolutional Neural Networks\. In Neural Information Processing Systems\.
- A\. Krizhevsky \(2009\) Learning Multiple Layers of Features from Tiny Images\. Master’s Thesis, Department of Computer Science, University of Toronto\.
- A\. Kumar, A\. Zhou, G\. Tucker, and S\. Levine \(2020\) Conservative Q\-learning for Offline Reinforcement Learning\. In Neural Information Processing Systems\.
- S\. Kumar, H\. J\. Jeon, A\. Lewandowski, and B\. V\. Roy \(2024\) The Need for a Big World Simulator: A Scientific Challenge for Continual Learning\. In RLC Finding the Frame Workshop\.
- S\. Kumar, H\. Marklund, A\. Rao, Y\. Zhu, H\. J\. Jeon, Y\. Liu, and B\. V\. Roy \(2025\) Continual Learning as Computationally Constrained Reinforcement Learning\. In Foundational Trends in Machine Learning Research\.
- P\. Langley \(2000\) Crafting Papers on Machine Learning\. In International Conference on Machine Learning\.
- H\. Lee, C\. Park, D\. Abel, and M\. Jin \(2025\) A Black Swan Hypothesis: The Role of Human Irrationality in AI Safety\. In International Conference on Learning Representations\.
- A\. Lewandowski, A\. A\. Ramesh, E\. Meyer, D\. Schuurmans, and M\. C\. Machado \(2025\) The World Is Bigger: A Computationally\-Embedded Perspective on the Big World Hypothesis\. In Neural Information Processing Systems\.
- M\. L\. Littman \(2017\) The Reward Hypothesis\. Video presentation\. Accessed: 2026\-01\-27\. [Link](https://tinyurl.com/4z52r3fe)
- C\. Lo, K\. Roice, P\. M\. Panahi, S\. M\. Jordan, A\. White, G\. Mihucz, F\. Aminmansour, and M\. White \(2024\) Goal\-Space Planning with Subgoal Models\. Journal of Machine Learning Research\.
- F\. Martínez\-Plumed, P\. Barredo, S\. Heigeartaigh, and J\. Hernández\-Orallo \(2021\) Research Community Dynamics behind popular AI Benchmarks\. In Nature Machine Intelligence\.
- M\. McCloskey and N\. J\. Cohen \(1989\)Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem\.InPsychology of Learning and Motivation,Cited by:[§2\.2](https://arxiv.org/html/2606.04029#S2.SS2.p1.1)\.
- M\. Mermillod, A\. Bugaiska, and P\. Bonin \(2013\)The Stability\-Plasticity Dilemma: Investigating the Continuum from Catastrophic Forgetting to Age\-Limited Learning Effects\.InFrontiers in Psychology,Cited by:[§2\.2](https://arxiv.org/html/2606.04029#S2.SS2.p1.1)\.
- M\. Moghimi and H\. Ku \(2025\)Risk\-Sensitive Actor\-Critic with Static Spectral Risk Measures for Online and Offline Reinforcement Learning\.arXiv preprint 2507\.03900\.Cited by:[§6](https://arxiv.org/html/2606.04029#S6.SS0.SSS0.Px5.p1.1)\.
- M\. A\. Mohamed, K\. Nekhomiazh, V\. Vyas, M\. M\. Jose, A\. Patterson, and M\. C\. Machado \(2026\)The Cell Must Go On: Agar\.io for Continual Reinforcement Learning\.InReinforcement Learning Journal,Cited by:[§5\.3](https://arxiv.org/html/2606.04029#S5.SS3.p2.1)\.
- M\. Mohammedalamen and M\. Bowling \(2026\)Generalization in Monitored Markov Decision Processes\.InReinforcement Learning Journal,Cited by:[§6](https://arxiv.org/html/2606.04029#S6.SS0.SSS0.Px4.p2.1)\.
- M\. Mohammedalamen, D\. Morrill, A\. Sieusahai, Y\. Satsangi, and M\. Bowling \(2021\)Learning to be Cautious\.InTransactions on Machine Learning Research,Cited by:[§6](https://arxiv.org/html/2606.04029#S6.SS0.SSS0.Px5.p1.1)\.
- T\. M\. Moldovan and P\. Abbeel \(2012\)Safe Exploration in Markov Decision Processes\.InInternational Conference on Machine Learning,Cited by:[§6](https://arxiv.org/html/2606.04029#S6.SS0.SSS0.Px5.p1.1)\.
- G\. E\. Monahan \(1982\)State of the Art — A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms\.InManagement Science,Cited by:[footnote 1](https://arxiv.org/html/2606.04029#footnote1)\.
- G\. I\. Parisi, R\. Kemker, J\. L\. Part, C\. Kanan, and S\. Wermter \(2019\)Continual Lifelong Learning with Neural Networks: A Review\.InNeural Networks,Cited by:[§5\.2](https://arxiv.org/html/2606.04029#S5.SS2.p1.1)\.
- S\. Parisi, M\. Mohammedalamen, A\. Kazemipour, M\. E\. Taylor, and M\. Bowling \(2024\)Monitored Markov Decision Processes\.InInternational Conference on Autonomous Agents and Multiagent Systems,Cited by:[§6](https://arxiv.org/html/2606.04029#S6.SS0.SSS0.Px4.p2.1)\.
- J\. Parker\-Holder, R\. Rajan, X\. Song, A\. Biedenkapp, Y\. Miao, T\. Eimer, B\. Zhang, V\. Nguyen, R\. Calandra, A\. Faust,et al\.\(2022\)Automated Reinforcement Learning \(AutoRL\): A Survey and Open Problems\.InJournal of Artificial Intelligence Research,Cited by:[§5\.2](https://arxiv.org/html/2606.04029#S5.SS2.SSS0.Px1.p1.1)\.
- A\. Patterson, S\. Neumann, R\. Kumaraswamy, M\. White, and A\. White \(2024\)Cross\-environment Hyperparameter Tuning for Reinforcement Learning\.InReinforcement Learning Journal,Cited by:[§5\.2](https://arxiv.org/html/2606.04029#S5.SS2.SSS0.Px1.p1.1)\.
- J\. Perdomo, T\. Zrnic, C\. Mendler\-Dünner, and M\. Hardt \(2020\)Performative Prediction\.InInternational Conference on Machine Learning,Cited by:[footnote 2](https://arxiv.org/html/2606.04029#footnote2)\.
- M\. L\. Puterman \(1990\)Chapter 8: Markov Decision Processes\.InHandbooks in Operations Research and Management Science,Cited by:[§2\.1](https://arxiv.org/html/2606.04029#S2.SS1.p2.12)\.
- M\. L\. Puterman \(2014\)Markov Decision Processes: Discrete Stochastic Dynamic Programming\.John Wiley & Sons\.Cited by:[§2\.1](https://arxiv.org/html/2606.04029#S2.SS1.p2.12)\.
- M\. B\. Ring \(1994\)Continual learning in reinforcement environments\.Ph\.D\. Thesis,University of Texas at Austin\.Cited by:[§1](https://arxiv.org/html/2606.04029#S1.p5.1)\.
- G\. A\. Rummery and M\. Niranjan \(1994\)Online Q\-learning using Connectionist Systems\.University of Cambridge, Department of Engineering Cambridge, UK\.Cited by:[Appendix A](https://arxiv.org/html/2606.04029#A1.p5.3)\.
- J\. Schaeffer, N\. Burch, Y\. Bjornsson, A\. Kishimoto, M\. Muller, R\. Lake, P\. Lu, and S\. Sutphen \(2007\)Checkers is solved\.InScience,Cited by:[§3](https://arxiv.org/html/2606.04029#S3.p1.1)\.
- D\. Silver, A\. Huang, C\. J\. Maddison, A\. Guez, L\. Sifre, G\. Van Den Driessche, J\. Schrittwieser, I\. Antonoglou, V\. Panneershelvam, M\. Lanctot,et al\.\(2016\)Mastering the Game of Go with Deep Neural Networks and Tree Search\.InNature,Cited by:[§1](https://arxiv.org/html/2606.04029#S1.p2.1)\.
- J\. Skalse, N\. Howe, D\. Krasheninnikov, and D\. Krueger \(2022\)Defining and Characterizing Reward Gaming\.InNeural Information Processing Systems,Cited by:[§6](https://arxiv.org/html/2606.04029#S6.SS0.SSS0.Px5.p1.1)\.
- R\. S\. Sutton and A\. G\. Barto \(2018\)Reinforcement learning: an introduction\.MIT press\.Cited by:[§1](https://arxiv.org/html/2606.04029#S1.p1.1),[footnote 4](https://arxiv.org/html/2606.04029#footnote4)\.
- R\. S\. Sutton, M\. Bowling, and P\. M\. Pilarski \(2022\)The Alberta Plan for AI Research\.arXiv preprint 2208\.11173\.Cited by:[§5\.2](https://arxiv.org/html/2606.04029#S5.SS2.p1.1)\.
- R\. S\. Sutton, A\. Koop, and D\. Silver \(2007\)On the Role of Tracking in Stationary Environments\.InInternational Conference on Machine Learning,Cited by:[§4](https://arxiv.org/html/2606.04029#S4.p10.1),[§5\.2](https://arxiv.org/html/2606.04029#S5.SS2.SSS0.Px4.p1.1)\.
- R\. S\. Sutton, D\. McAllester, S\. Singh, and Y\. Mansour \(2000\)Policy Gradient Methods for Reinforcement Learning with Function Approximation\.InNeural Information Processing Systems,Cited by:[§3\.1](https://arxiv.org/html/2606.04029#S3.SS1.p3.1)\.
- R\. S\. Sutton \(1992\)Adapting Bias by Gradient Descent: An Incremental Version of Delta\-Bar\-Delta\.InAAAI Conference on Artificial Intelligence,Cited by:[§4](https://arxiv.org/html/2606.04029#S4.p3.7),[§5\.2](https://arxiv.org/html/2606.04029#S5.SS2.SSS0.Px1.p1.1)\.
- R\. S\. Sutton \(2004\)The Reward Hypothesis\.Note:Accessed: 2026\-01\-27External Links:[Link](http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html)Cited by:[item 3](https://arxiv.org/html/2606.04029#S3.I1.i3.p1.1)\.
- N\. N\. Taleb \(2008\)The Black Swan: The Impact of the Highly Improbable\.Penguin Books\.Cited by:[item 4](https://arxiv.org/html/2606.04029#S3.I1.i4.p1.1)\.
- M\. Tamborski and D\. Abel \(2025\)Memory Allocation in Resource\-Constrained Reinforcement Learning\.InMulti\-Disciplinary Conference on Reinforcement Learning and Decision Making,Cited by:[§5\.2](https://arxiv.org/html/2606.04029#S5.SS2.SSS0.Px4.p1.1)\.
- G\. Tesauro \(1995\)Temporal Difference Learning and TD\-Gammon\.InCommunications of the ACM,Cited by:[§1](https://arxiv.org/html/2606.04029#S1.p2.1)\.
- G\. Thomas, Y\. Luo, and T\. Ma \(2021\)Safe Reinforcement Learning by Imagining the Near Future\.InNeural Information Processing Systems,Cited by:[§6](https://arxiv.org/html/2606.04029#S6.SS0.SSS0.Px5.p1.1)\.
- S\. Thrun \(1998\)Chapter 8: Lifelong Learning Algorithms\.InLearning to Learn,Cited by:[§1](https://arxiv.org/html/2606.04029#S1.p5.1)\.
- M\. Towers, A\. Kwiatkowski, J\. Terry, J\. U\. Balis, G\. De Cola, T\. Deleu, M\. Goulão, A\. Kallinteris, M\. Krimmel, A\. KG,et al\.\(2024\)Gymnasium: A Standard Interface for Reinforcement Learning Environments\.InarXiv preprint 2407\.17032,Cited by:[Appendix A](https://arxiv.org/html/2606.04029#A1.p2.9)\.
- G\. Vasan, M\. Elsayed, S\. A\. Azimi, J\. He, F\. Shahriar, C\. Bellinger, M\. White, and R\. Mahmood \(2024\)Deep Policy Gradient methods without Batch Updates, Target Networks, or Replay Buffers\.InNeural Information Processing Systems,Cited by:[§5\.2](https://arxiv.org/html/2606.04029#S5.SS2.SSS0.Px4.p1.1)\.
- E\. Verwimp, R\. Aljundi, S\. Ben\-David, M\. Bethge, A\. Cossu, A\. Gepperth, T\. L\. Hayes, E\. Hüllermeier, C\. Kanan, D\. Kudithipudi, C\. H\. Lampert, M\. Mundt, R\. Pascanu, A\. Popescu, A\. S\. Tolias, J\. van de Weijer, B\. Liu, V\. Lomonaco, T\. Tuytelaars, and G\. M\. van de Ven \(2024\)Continual Learning: Applications and the Road Forward\.InTransactions on Machine Learning Research,Cited by:[§5\.2](https://arxiv.org/html/2606.04029#S5.SS2.p1.1)\.
- Y\. Wan and R\. S\. Sutton \(2022\)On Convergence of Average\-Reward Off\-Policy Control algorithms in Weakly Communicating MDPs\.arXiv preprint 2209\.15141\.Cited by:[§5\.2](https://arxiv.org/html/2606.04029#S5.SS2.SSS0.Px3.p1.1)\.
- M\. Wolczyk, M\. Zajac, R\. Pascanu, L\. Kucinski, and P\. Milos \(2021\)Continual World: A Robotic Benchmark For Continual Reinforcement Learning\.InNeural Information Processing Systems,Cited by:[§5\.3](https://arxiv.org/html/2606.04029#S5.SS3.p2.1)\.
- P\. R\. Wurman, S\. Barrett, K\. Kawamoto, J\. MacGlashan, K\. Subramanian, T\. J\. Walsh, R\. Capobianco, A\. Devlic, F\. Eckert, F\. Fuchs,et al\.\(2022\)Outracing Champion Gran Turismo drivers with Deep Reinforcement Learning\.Nature\.Cited by:[§1](https://arxiv.org/html/2606.04029#S1.p2.1)\.
- C\. Yan, H\. Zhu, N\. Korolko, and D\. Woodard \(2020\)Dynamic Pricing and Matching in Ride\-Hailing Platforms\.InNaval Research Logistics,Cited by:[§3\.2](https://arxiv.org/html/2606.04029#S3.SS2.p1.1)\.
- W\. Zhao, J\. P\. Queralta, and T\. Westerlund \(2020\)Sim\-to\-Real Transfer in Deep Reinforcement Learning for Robotics: A Survey\.InIEEE Symposium Series on Computational Intelligence,Cited by:[§3\.3](https://arxiv.org/html/2606.04029#S3.SS3.p4.1)\.

## Appendix A Rusting Pendulum

We will use a toy problem with wear\-and\-tear to demonstrate the advantage of continual adaptation over train\-then\-fix\.

To do so, we introduce the Rusting Pendulum environment, a variant of the original Pendulum classic control environment. As in the original Pendulum, the agent’s goal is to raise a rigid, uniform rod that is hinged on one end, from the downward to the upright position by applying continuous-valued torques as actions. At each timestep $t$, the agent observes the $x_t$ and $y_t$ coordinates of the rod along with its angular velocity $\dot{\theta}_t$. The reward, $r_{t+1}$, is negative at each timestep, and is closer to $0$ the more upright and stable the pendulum is. We introduce a non-stationarity by modifying the environment’s transition dynamics (Towers et al., [2024](https://arxiv.org/html/2606.04029#bib.bib134)) to include a damping term, $-b_t \dot{\theta}_t$. The damping coefficient $b_t$ grows in a noisily quadratic manner over time, as shown in [Figure 3](https://arxiv.org/html/2606.04029#A1.F3) (Gaussian noise $\sigma = 0.02$). This mimics rust that accumulates at the pendulum joint, so that actions from older policies lack the necessary torque to overcome the rust and raise the pendulum.
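As a rough illustration of the damping schedule described above, the following sketch generates a coefficient that grows quadratically with added Gaussian noise. Only $\sigma = 0.02$ comes from the text. The final magnitude `b_max`, the normalization of time, the independent per-step noise, and the clipping at zero are all assumptions made for illustration, and the function name is hypothetical.

```python
import numpy as np

def damping_schedule(num_steps, b_max=1.0, sigma=0.02, seed=0):
    """Noisy quadratic growth of the damping coefficient b_t.

    Assumptions (not from the paper): b_max, time normalized to [0, 1],
    independent noise at each step, and clipping so b_t stays non-negative.
    """
    rng = np.random.default_rng(seed)
    # Normalized time in [0, 1] so that the noiseless curve ends at b_max.
    t = np.arange(num_steps) / max(num_steps - 1, 1)
    b = b_max * t**2 + rng.normal(0.0, sigma, size=num_steps)
    return np.clip(b, 0.0, None)

b = damping_schedule(250_000)  # one value per environment step
```

Under these assumptions the coefficient starts near zero and rises toward `b_max` by the end of the run, which is the qualitative shape the text describes.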

![Refer to caption](https://arxiv.org/html/2606.04029v1/x4.png)

Figure 3: Growth of the damping coefficient over the experiment.

$$
\begin{aligned}
\dot{\theta}_{t+1} &= \dot{\theta}_t + \left(\frac{3g}{2l}\, y_t + \frac{3}{ml^2}\, A_t - b_t \dot{\theta}_t\right) \Delta t, \\
x_{t+1} &= \cos\left(\theta_t + \dot{\theta}_{t+1} \Delta t\right), \\
y_{t+1} &= \sin\left(\theta_t + \dot{\theta}_{t+1} \Delta t\right), \\
r_{t+1} &= -\left(\theta_t^2 + 0.1\, \dot{\theta}_t^2 + 0.001\, A_t^2\right),
\end{aligned}
$$

where $A_t \sim \pi_t$ is the torque applied at timestep $t$, and $\Delta t$ is the timestep duration.
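The update equations above can be sketched as a single transition function. This is a minimal sketch, not the authors’ implementation: the physical constants, the torque and speed clipping, and the wrapping of the angle before computing the reward are assumptions borrowed from the standard Gymnasium Pendulum defaults, and the function name is hypothetical. The damping coefficient `b` is passed in by the caller.

```python
import numpy as np

# Assumed constants, following the Gymnasium Pendulum defaults.
G, M, L, DT = 10.0, 1.0, 1.0, 0.05
MAX_TORQUE, MAX_SPEED = 2.0, 8.0

def rusting_pendulum_step(theta, theta_dot, action, b):
    """One Euler step of the damped dynamics.

    Returns (new_theta, new_theta_dot, observation, reward).
    """
    a = float(np.clip(action, -MAX_TORQUE, MAX_TORQUE))
    # Reward depends on the current state and action; the angle is
    # wrapped to [-pi, pi) first (an assumption taken from Gymnasium).
    wrapped = ((theta + np.pi) % (2 * np.pi)) - np.pi
    reward = -(wrapped**2 + 0.1 * theta_dot**2 + 0.001 * a**2)
    # Gravity term uses y_t = sin(theta_t); the last term is the rust.
    accel = 3 * G / (2 * L) * np.sin(theta) + 3.0 / (M * L**2) * a - b * theta_dot
    new_theta_dot = float(np.clip(theta_dot + accel * DT, -MAX_SPEED, MAX_SPEED))
    new_theta = theta + new_theta_dot * DT
    obs = np.array([np.cos(new_theta), np.sin(new_theta), new_theta_dot])
    return new_theta, new_theta_dot, obs, reward
```

Setting `b = 0` recovers the undamped update, while a larger `b` removes more angular velocity per step for the same torque.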

[Figure 4](https://arxiv.org/html/2606.04029#A1.F4) (left) shows two agents in the Pendulum environment. The solid curve is a Sarsa($\lambda$) agent (Rummery and Niranjan, [1994](https://arxiv.org/html/2606.04029#bib.bib135)) that continually updates its policy. (Footnote 4: The value function is linear in tile-coded features (Sutton and Barto, [2018](https://arxiv.org/html/2606.04029#bib.bib68)), and the action space is discretized into 7 equal bins. Both agents use a fixed step-size of $\alpha = 0.1$ and a trace decay parameter of $\lambda = 0.95$.) The dashed curve is the train-then-fix paradigm, where checkpoints of the solid curve’s policy are taken every 30k steps and run without updates. Both agents run for a single 250k-step episode with no resets. The colour of the dashed curve maps to the time of checkpointing from the solid curve. In Pendulum, the train-then-fix policy matches the continual learner’s performance as later checkpoints learn to recover from unfavorable states. In Rusting Pendulum ([Figure 4](https://arxiv.org/html/2606.04029#A1.F4), right), the dynamics shift faster than the checkpoints account for. A policy trained at low friction has not learned the larger torques needed to recover under rustier dynamics, and its performance degrades. The same trend is observed across runs ([Figure 5](https://arxiv.org/html/2606.04029#A1.F5)). Spending compute on tuning the learning rate may reduce this gap on a problem-to-problem basis.
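The difference between the two agents can be sketched as a Sarsa($\lambda$) update over linear features, with a flag that freezes the weights for a checkpointed policy. The step-size and trace decay come from the text. The discount `GAMMA`, the use of accumulating traces, the torque range, and the function name are assumptions for illustration, and the tile coder that would produce the feature vectors is left out.

```python
import numpy as np

ALPHA, LAMBDA = 0.1, 0.95              # step-size and trace decay from the text
GAMMA = 0.99                           # assumed discount; not stated in the text
TORQUES = np.linspace(-2.0, 2.0, 7)    # assumed: 7 evenly spaced torques in [-2, 2]

def sarsa_lambda_update(w, z, phi, a, r, phi_next, a_next, learning=True):
    """One Sarsa(lambda) step with accumulating eligibility traces.

    w, z: arrays of shape (n_actions, n_features) for weights and traces.
    phi, phi_next: feature vectors of the current and next observation.
    With learning=False the weights are returned unchanged, which mimics
    a train-then-fix checkpoint acting without updates.
    """
    if not learning:
        return w, z
    # TD error for the transition (phi, a, r, phi_next, a_next).
    delta = r + GAMMA * (w[a_next] @ phi_next) - (w[a] @ phi)
    # Decay all traces, then add the features of the action just taken.
    z = GAMMA * LAMBDA * z
    z[a] += phi
    w = w + ALPHA * delta * z
    return w, z
```

A continual learner would call this with `learning=True` at every step of the 250k-step run, while a checkpointed policy would act greedily on a copy of `w` saved at the checkpoint and call it with `learning=False`.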

![Refer to caption](https://arxiv.org/html/2606.04029v1/x5.png)

![Refer to caption](https://arxiv.org/html/2606.04029v1/x6.png)

Figure 4: Continual learning vs train-then-fix on two pendula. Left: In Pendulum, the train-then-fix policy (dashed) matches the continual learner after convergence, as the optimal policy does not change. Right: In Rusting Pendulum, the train-then-fix policy degrades as the dynamics drift away from those seen during training, while the continual learner adapts and maintains near-optimal performance. Colour encodes time progression. Each dashed vertical line corresponds to a checkpoint taken every 30k steps.

![Refer to caption](https://arxiv.org/html/2606.04029v1/x7.png)

![Refer to caption](https://arxiv.org/html/2606.04029v1/x8.png)

Figure 5: The experiment from [Figure 4](https://arxiv.org/html/2606.04029#A1.F4) repeated across 30 independent runs. Faint curves show individual runs, whereas the opaque curve is the mean performance. Confidence intervals did not accurately capture the distribution of failure modes.
