Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
Summary
This paper introduces Neural Co-state Policies, establishing a formal link between recurrent reinforcement learning hidden states and the Pontryagin minimum principle to improve interpretability and robustness.
# Structuring Hidden States in Recurrent Reinforcement Learning
Source: [https://arxiv.org/html/2605.05373](https://arxiv.org/html/2605.05373)
## Neural Co\-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
David Leeftink, Max Hinne, Marcel van Gerven. Department of Machine Learning and Neural Computation, Radboud University. david.leeftink@ru.nl, {max.hinne, marcel.vangerven}@donders.ru.nl
###### Abstract
A key capability of intelligent agents is operating under partial observability: reasoning and acting effectively despite missing or incomplete state observations\. While recurrent \(memory\-based\) policies learned via reinforcement learning address this by encoding history into latent state representations, their internal dynamics remain uninterpretable black boxes\. This paper establishes a formal link between these hidden states and the Pontryagin minimum principle \(PMP\) from optimal control\. We demonstrate that for standard recurrent architectures, latent representations map directly to PMP co\-states, which allows the readout layer to be interpreted as performing Hamiltonian minimization\. Because standard reward maximization does not naturally discover this alignment, we introduce a PMP\-derived co\-state loss to explicitly structure the internal dynamics\. Empirically, this approach matches or improves performance on partially observable DMControl tasks, and is robust against zero\-shot out\-of\-distribution sensor masking\. By framing recurrent networks as dynamic processes governed by the minimum principle, we provide a principled approach to designing robust continuous control policies\.
## 1 Introduction
A fundamental challenge for intelligent agents is operating effectively under partial observability. The state of real-world physical tasks is inherently obscured, often by noisy sensors, measurement delays, or missing data. Consequently, biological and artificial systems in continuous control tasks rarely have full access to the true state of their environment. Because a single instantaneous observation is typically insufficient to determine the underlying state of the system, an agent must learn to integrate a history of past events to infer what cannot be directly seen (Kaelbling et al., [1998](https://arxiv.org/html/2605.05373#bib.bib47)).
In deep reinforcement learning (RL), recurrent policies address partial observability by maintaining a hidden state that continuously accumulates information. While optimizing these networks purely for reward yields strong performance, their internal dynamics remain an uninterpretable black box (Wierstra et al., [2010](https://arxiv.org/html/2605.05373#bib.bib41); Hausknecht and Stone, [2015](https://arxiv.org/html/2605.05373#bib.bib48)). Without explicit structural constraints, recurrent policies are prone to memorizing fragile heuristics rather than learning a grounded representation of the control task, leaving them vulnerable to breaking down entirely under unfamiliar conditions.
To structure these internal dynamics, this paper establishes a formal link between recurrent policies and the Pontryagin minimum principle (PMP) (Pontryagin, [1987](https://arxiv.org/html/2605.05373#bib.bib8)). PMP states that optimal continuous control is governed by a Hamiltonian system, where co-state variables co-evolve with the environment to encode optimality conditions. As shown in Fig. [1](https://arxiv.org/html/2605.05373#S2.F1), we formalize this relationship by introducing neural co-state policies (NCPs), a framework that explicitly aligns neural memory updates with co-state dynamics. We demonstrate that standard architectures like continuous-time recurrent neural networks (CT-RNNs) (Beer, [1995](https://arxiv.org/html/2605.05373#bib.bib32)) and gated recurrent units (GRUs) (Chung et al., [2014](https://arxiv.org/html/2605.05373#bib.bib42)) inherently possess this exact mathematical structure. By mapping their latent memory directly to optimal co-states, the network’s final readout layer can be rigorously interpreted as performing Hamiltonian minimization.
While recurrent policies have the capacity for optimal Hamiltonian dynamics, standard reward maximization does not naturally converge to this alignment. To bridge this gap, we introduce an auxiliary co-state loss derived from the Hamilton-Jacobi-Bellman (HJB) equation (Bellman, [1966](https://arxiv.org/html/2605.05373#bib.bib50)). Because HJB theory establishes the theoretical co-state as the gradient of the value function (Vinter, [1986](https://arxiv.org/html/2605.05373#bib.bib49)), these targets can be dynamically extracted directly from the learned critic in standard actor-critic architectures. Supervising the actor with these targets explicitly structures the network’s hidden states to track the latent optimality conditions of the environment.
We evaluate our approach on partially observable continuous control tasks from the DeepMind Control Suite (DMControl; Tassa et al., [2018](https://arxiv.org/html/2605.05373#bib.bib62)). Empirically, applying the co-state loss to CT-RNN and GRU architectures matches or improves on the performance of recurrent baselines trained with proximal policy optimization (PPO) (Schulman et al., [2017](https://arxiv.org/html/2605.05373#bib.bib36)). We furthermore show that the learned, structured internal dynamics are robust to increased out-of-distribution sensor dropout.
Ultimately, this work bridges a critical gap between classical optimal control and deep RL by anchoring previously black\-box hidden states in the minimum principle’s mathematical structure\. By framing recurrent architectures as dynamic processes governed by this principle, we provide a theoretically aligned foundation to design robust continuous control policies\.
## 2 Background: Recurrent Policy Optimization
[Figure 1 diagram. Labelled components: closed-loop control between the state dynamics $\dot{x} = f(x) + g(x)u$, $y = \phi(x) + v$, with state $x(t)$ and observation $y(t)$; a recurrent network with hidden units $h_1, h_2, h_3, \dots, h_i, \dots, h_n$ and hidden state $h(t)$ evolving as $\dot{h} = -B_\theta(y) - F_\theta(y,u)h$ (neural co-states, $\dot{h} \leftrightarrow \dot{\lambda}^\star$); a Hamiltonian minimization readout $\operatorname*{argmin}_{u \in \mathcal{U}} \mathcal{H}$ producing $u^\star(t)$, with $\mathcal{H} = \mathcal{L}(x,u) + h^\top \dot{x}$ and level sets $\mathcal{H} > c$, $\mathcal{H} = c$, $\mathcal{H} < c$ over the control set $\mathcal{U}$ with axes $u_1$, $u_2$ and descent direction $-\nabla_u \mathcal{H}$; a critic $V(y)$ and the loss $\mathcal{L}_{\text{co-state}}$.]
Figure 1: Recurrent reinforcement learning via neural co-state policies. NCPs are regularized during training to mirror the optimality conditions implied by the minimum principle. By structuring the hidden states as representations of the underlying optimal control co-states, the read-out layer acts as a control-Hamiltonian minimizer.

### 2.1 Continuous control under partial observability
Many real\-world reinforcement learning \(RL\) domains, such as robotics and autonomous navigation, involve physical systems governed by continuous\-time dynamics where the agent only receives partial, noisy sensor readings\. We formalize this setting as a continuous dynamical system described by a differential equation and an observation equation:
$$\dot{x}(t) = f(x(t)) + g(x(t))\,u(t) \tag{2.1}$$

$$y(t) = \phi(x(t)) + v(t) \tag{2.2}$$

where $x \in \mathcal{X} \subseteq \mathbb{R}^{d_x}$ represents the true underlying system state, $u(t) \in \mathcal{U} \subseteq \mathbb{R}^{d_u}$ is the control input applied by the agent, and $y(t) \in \mathcal{O} \subseteq \mathbb{R}^{d_y}$ is the noise-corrupted observation with sensor noise $v(t)$. We assume the true drift dynamics $f$ and control-influence matrix $g$ are continuous and differentiable with respect to $x$. In the standard RL paradigm, these underlying transition dynamics are unknown to the agent, but they can be evaluated by executing actions and collecting observation trajectories over multiple learning episodes.
Finding the optimal control sequence $u(t)$ requires iteratively minimizing the cost objective:
$$J(u) = \mathbb{E}_v\left[\Phi(x(t_f)) + \int_{t_0}^{t_f} \Big(q(x(\tau)) + c(u(\tau))\Big)\,d\tau\right]. \tag{2.3}$$

Here, $\Phi(x)$ is the terminal cost, $q(x)$ is the state cost, and $c(u)$ is the control input cost. We assume $q$ and $c$ are differentiable and can be evaluated empirically, though their underlying analytical forms are not known a priori. Because the sequence of observations is stochastic, finding the optimal control requires minimizing this expected cost. If the goal is instead to maximize reward, one can maximize $-J(u)$, whose integrand is then the running reward function.
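To make Eqs. (2.1) to (2.3) concrete, the following minimal sketch simulates a scalar control-affine system with a forward-Euler step and accumulates a Riemann-sum estimate of the cost for a single noise realization. The drift, cost terms, noise level, step size, and the helper name `rollout` are illustrative assumptions and are not taken from the paper.

```python
import random

def rollout(policy, x0=1.0, t_f=5.0, dt=0.01, noise_std=0.1, seed=0):
    """Euler rollout of x' = f(x) + g(x) u with y = phi(x) + v.

    Returns (cost, final_state) for a single noise realization.
    All model terms below are toy assumptions for illustration.
    """
    rng = random.Random(seed)
    f = lambda x: -0.5 * x   # assumed drift dynamics
    g = lambda x: 1.0        # assumed control-influence term
    phi = lambda x: x        # assumed observation map
    q = lambda x: x * x      # assumed state cost
    c = lambda u: u * u      # assumed control cost
    x, cost = x0, 0.0
    for _ in range(int(round(t_f / dt))):
        y = phi(x) + rng.gauss(0.0, noise_std)  # noisy observation, Eq. (2.2)
        u = policy(y)                           # memory-less feedback on y only
        cost += (q(x) + c(u)) * dt              # running cost, Eq. (2.3)
        x += (f(x) + g(x) * u) * dt             # Euler step of Eq. (2.1)
    return cost, x  # the terminal cost Phi is taken to be zero here

cost_idle, _ = rollout(lambda y: 0.0)      # no control
cost_feedback, _ = rollout(lambda y: -y)   # proportional feedback on the observation
```

Under these toy choices, the proportional feedback `u = -y` incurs a lower cost than applying no control, even though it also reacts to the sensor noise.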
### 2.2 Policies with memory
To minimize this cost functional, consider a policy $\pi_\theta\colon \mathcal{O} \to \mathcal{U}$, such that $u_\theta(t) = \pi_\theta(y(t))$, where $\theta \in \Theta$ are the policy parameters. This is a memory-less controller that only considers the (observation of the) current state to decide what action to take. The control inputs generated by this policy can be said to be optimal if they minimize the cost objective:
$$\begin{split}\theta^\star &\coloneqq \operatorname*{argmin}_{\theta \in \Theta} J(\pi_\theta(x(t))) \\ \text{s.t.}\quad \dot{x}(t) &= f(x(t)) + g(x(t))\,\pi_\theta(y(t)) \quad \text{for } t \in [t_0, t_f] \\ x(t_0) &= x_0, \end{split} \tag{2.4}$$

where $x_0$ is the initial system state.
Because the true state $x(t)$ cannot be fully inferred from a single observation $y(t)$, this memory-less controller is fundamentally suboptimal. This necessitates a policy with memory, which maintains an internal latent state $h(t) \in \mathbb{R}^{d_h}$ that acts as a dynamical process coupled to the environment, where $d_h$ is the dimensionality of the hidden state. In continuous time, this takes the generic form:
$$\dot{h}(t) = \psi_\theta(y(t), h(t)), \tag{2.5}$$

$$u(t) = \pi_\theta(h(t)), \tag{2.6}$$

where $\psi_\theta$ determines the evolution of the latent state. Common architectures for such dynamic policies include the CT-RNN and the GRU (Wierstra et al., [2010](https://arxiv.org/html/2605.05373#bib.bib41)). However, in standard deep RL practice, these methods treat the hidden state $h(t)$ as an unstructured and uninterpretable black box.
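As a minimal sketch of Eqs. (2.5) and (2.6), the latent dynamics can be discretized with a forward-Euler step and driven by a stream of observations. The leaky-integrator choice for $\psi_\theta$, the linear readout, the time constant, and the helper name `simulate_memory_policy` are assumptions made purely for illustration.

```python
def simulate_memory_policy(ys, psi, readout, h0=0.0, dt=0.01):
    """Euler-discretize h' = psi(y, h), u = readout(h) over an observation stream."""
    h, controls = h0, []
    for y in ys:
        h = h + dt * psi(y, h)       # latent update, Eq. (2.5)
        controls.append(readout(h))  # control depends on memory only, Eq. (2.6)
    return h, controls

# Toy choice (assumption): a leaky integrator that low-pass filters observations.
tau = 0.5
psi = lambda y, h: (y - h) / tau
readout = lambda h: -h

h_final, controls = simulate_memory_policy([1.0] * 1000, psi, readout)
```

With a constant observation stream, the hidden state relaxes toward the observed value, so the control is determined by accumulated history instead of a single reading.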
## 3 Recurrent Policies as Optimal Dynamic Processes
In continuous control, optimizing policy parameters is fundamentally governed by the underlying structure of the stochastic optimal control problem in Eq. ([2.3](https://arxiv.org/html/2605.05373#S2.E3)). This structure is characterized by the necessary conditions of optimality, traditionally approached via the Hamilton-Jacobi-Bellman (HJB) equation or PMP (Pontryagin, [1987](https://arxiv.org/html/2605.05373#bib.bib8); Kirk, [2004](https://arxiv.org/html/2605.05373#bib.bib2); Bryson and Ho, [1975](https://arxiv.org/html/2605.05373#bib.bib23)). While HJB characterizes optimality globally across the state space, PMP directly describes optimality conditions along the trajectory. In what follows, we leverage PMP to demonstrate that the hidden state of a recurrent policy can be mathematically mapped onto the theoretical co-states of an optimal Hamiltonian system. Although the stochastic optimal control problem is properly addressed via the stochastic minimum principle, the internal dynamics of recurrent networks operate deterministically, so we ground our architectural mapping in the deterministic PMP.
### 3.1 Pontryagin Minimum Principle and Hamiltonian Systems
To study optimality over time, we consider the control\-Hamiltonian, which evaluates the rate of change of the optimal value over time along the trajectory:
$$\mathcal{H}(x,\lambda,u) \coloneqq \mathcal{L}(x,u) + \lambda^\top \dot{x} = q(x) + c(u) + \lambda^\top \bigl(f(x) + g(x)u\bigr). \tag{3.1}$$

Here, $\lambda \in \mathbb{R}^{d_x}$ represents the co-state (or adjoint) vector. Acting as a continuous-time Lagrange multiplier, the co-state tracks the sensitivity of the value function to infinitesimal perturbations in the current state, and is connected to the value function via $\lambda(t) = \nabla V^\star(x)$ (Vinter, [1986](https://arxiv.org/html/2605.05373#bib.bib49)). By differentiating the Hamiltonian w.r.t. the states and co-states, we obtain a coupled, continuous-time optimal boundary value problem. PMP states that an optimal control $u^\star(t)$ must minimize the Hamiltonian at all times $t \in [t_0, t_f]$.
###### Proposition 3.1 (Optimal Hamiltonian System).
Let $u^\star(t)$ be an optimal solution for the control problem. The optimal trajectory $x^\star(t)$ and its corresponding co-state $\lambda^\star(t)$ must satisfy the following two-point boundary value problem for $t \in [t_0, t_f]$:
$$\begin{split}\dot{x}^\star &= \nabla_\lambda \mathcal{H} = f(x^\star) + g(x^\star)u^\star \\ \dot{\lambda}^\star &= -\nabla_x \mathcal{H} = -\nabla_x q - \left(\nabla_x \dot{x}^\star\right)\lambda^\star \\ u^\star &= \operatorname*{argmin}_{u \in \mathcal{U}} \mathcal{H}(x^\star, \lambda^\star, u) \\ \text{s.t.}\quad & x^\star(t_0) = x_0 \quad \text{and} \quad \lambda^\star(t_f) = \nabla_x \Phi(x^\star(t_f)), \end{split} \tag{3.2}$$

where the notation $(\nabla_y f)_{i,j} \coloneqq \partial f_j / \partial y_i$ denotes Jacobians, such that the gradient of the scalar Hamiltonian yields the column vector $\nabla_x \mathcal{H} = \left(\frac{\partial \mathcal{H}}{\partial x_1}, \ldots, \frac{\partial \mathcal{H}}{\partial x_{d_x}}\right)^\top$.
Proof: The derivation follows standard extremization of the cost functional using variational calculus. A full derivation is provided in Appendix A.
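As a worked illustration of Proposition 3.1, consider a scalar linear-quadratic instance. The system $\dot{x} = ax + bu$, the costs $q(x) = q_0 x^2$ and $c(u) = r u^2$ with $r > 0$, the terminal cost $\Phi(x) = p_f x^2$, and the unconstrained control set $\mathcal{U} = \mathbb{R}$ are assumptions chosen for illustration; they are not taken from the paper.

```latex
\begin{aligned}
\mathcal{H}(x,\lambda,u) &= q_0 x^2 + r u^2 + \lambda\,(a x + b u), \\
\dot{x}^\star &= \partial_\lambda \mathcal{H} = a x^\star + b u^\star, \\
\dot{\lambda}^\star &= -\partial_x \mathcal{H} = -2 q_0 x^\star - a \lambda^\star, \\
\partial_u \mathcal{H} = 2 r u + b \lambda = 0 \;&\Rightarrow\; u^\star = -\frac{b}{2r}\,\lambda^\star, \\
x^\star(t_0) = x_0, &\qquad \lambda^\star(t_f) = 2 p_f\, x^\star(t_f).
\end{aligned}
```

Here $\nabla_x q = 2 q_0 x$ and $\nabla_x \dot{x} = a$, so the co-state equation matches the general form in Eq. (3.2), and the minimizer is unique because $\mathcal{H}$ is strictly convex in $u$ when $r > 0$.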
### 3.2 The Neural Co-state Policy
The PMP framework thus describes optimality conditions through two coupled continuous-time processes: the physical system states $x$, and the co-states $\lambda$, which dictate the optimal control inputs. These systems are inherently coupled: the co-states generate actions that drive the physical environment, while the evolution of the environment continuously updates the co-states via the Hamiltonian gradients.
The central premise of this work is that this coupled structure mirrors the abstraction of recurrent reinforcement learning. In partially observable RL, recurrent policies aim to control an environment $x$ by generating actions from an internal, latent dynamical process $h$. However, standard practice treats this hidden state as an uninterpretable black box. We propose the view that to achieve optimal control, the latent recurrent process $h(t)$ should act as a high-dimensional neural embedding of the theoretically optimal co-state $\lambda^\star(t)$.
This analogy provides a mathematical blueprint for designing recurrent architectures\. To formalize this connection, we introduce the neural co\-state policy \(NCP\) class that mimics the theoretical optimal Hamiltonian system through learned neural representations\.
###### Definition 3.2 (Neural Co-state Policy).
A neural co\-state policy \(NCP\) is a system defined by:
$$\begin{split}\dot{x} &= f(x) + g(x)u, \qquad y = \phi(x) + v, \\ \dot{h} &= -B_\theta(y) - F_\theta(y,u)h \\ u &= \operatorname*{argmin}_{u \in \mathcal{U}} \{c(u) + u^\top G_\theta(y)h\} \\ \text{s.t.}\quad & x(t_0) = x_0, \quad \text{and} \quad h(t_f) = \nabla_y \Phi(y(t_f)), \end{split} \tag{3.3}$$

where $h \in \mathbb{R}^{d_h}$ acts as the memory state of the network. The policy class separates various learnable components to represent the unknown environmental dynamics: $F_\theta$ models the state transition Jacobians $\nabla_x \dot{x}$, $G_\theta$ models the control-influence matrix $g(x)$, and $B_\theta$ models the state cost derivative $\nabla_x q(x)$. By defining the system strictly through its initial state $x(t_0)$ and its terminal hidden state $h(t_f)$, the NCP forms a two-point boundary value problem (TPBVP).
Assuming a standard continuous action space and a quadratic control penalty $c(u) = u^\top R u$ (where $R \succ 0$), solving $\nabla_u \mathcal{H} = 0$ yields a closed-form optimal control law for the NCP: $u^\star = -\frac{1}{2} R^{-1} G_\theta(y) h^\star$.
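The closed-form law can be checked numerically. The sketch below evaluates the $u$-dependent part of the NCP Hamiltonian, $u^\top R u + u^\top G_\theta(y) h$, and compares the closed-form minimizer against perturbed controls. The diagonal $R$, the numerical values, and the helper names are assumptions chosen to keep the example free of a matrix inverse.

```python
def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def control_objective(u, R_diag, G, h):
    """u-dependent part of the Hamiltonian: u^T R u + u^T G h, with R = diag(R_diag)."""
    Gh = matvec(G, h)
    quadratic = sum(r * ui * ui for r, ui in zip(R_diag, u))
    linear = sum(ui * gi for ui, gi in zip(u, Gh))
    return quadratic + linear

def optimal_control(R_diag, G, h):
    """Closed-form minimizer u* = -1/2 R^{-1} G h for a diagonal R."""
    return [-0.5 * gi / r for gi, r in zip(matvec(G, h), R_diag)]

R_diag = [1.0, 2.0]               # assumed positive control penalties
G = [[1.0, 0.0], [0.0, 2.0]]      # assumed static control-influence matrix
h = [2.0, 4.0]                    # assumed hidden (co-)state
u_star = optimal_control(R_diag, G, h)
best = control_objective(u_star, R_diag, G, h)
```

Because the objective is strictly convex in $u$ for positive penalties, any perturbation of `u_star` should yield a value no smaller than `best`.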
###### Proposition 3.3 (NCP Optimality Condition).
Consider the optimal Hamiltonian system from Proposition [3.1](https://arxiv.org/html/2605.05373#S3.Thmtheorem1) and the NCP system from Definition [3.2](https://arxiv.org/html/2605.05373#S3.Thmtheorem2). For $\theta^\star$ to yield optimal trajectories, it is a necessary condition for the NCP components to act as functional representations of the true Hamiltonian terms:
$$B_{\theta^\star}(y) \approx \nabla_x q(x), \qquad F_{\theta^\star}(y,u) \approx \nabla_x \dot{x}, \qquad G_{\theta^\star}(y) \approx g(x)^\top.$$

Consequently, the optimal latent state $h(t)$ must converge to a structural embedding of the true theoretical co-state $\lambda^\star(t)$.
This proposition establishes our theoretical goal: if the neural hidden state $h(t)$ can be explicitly regularized to functionally align with the true co-state $\lambda^\star(t)$, the policy naturally recovers the theoretically optimal control law through its learned latent projections. Figure [2](https://arxiv.org/html/2605.05373#S3.F2) makes this alignment explicit.
Optimal Hamiltonian system:

$$\begin{aligned}\dot{x}^\star &= f(x^\star) + g(x^\star)u^\star, \\ \dot{\lambda}^\star &= -\nabla_x q - \left(\nabla_x \dot{x}^\star\right)\lambda^\star, \\ u^\star &= \operatorname*{argmin}_{u \in \mathcal{U}} \{c(u) + u^\top g(x^\star)^\top \lambda^\star\}, \\ \text{s.t.}\quad & x^\star(t_0) = x_0, \quad \lambda^\star(t_f) = \nabla_x \Phi(x^\star(t_f)).\end{aligned}$$

Neural Co-state Policy:

$$\begin{aligned}\dot{x}^\star &= f(x^\star) + g(x^\star)u^\star, \quad y^\star = \phi(x^\star) + v, \\ \dot{h}^\star &= -B_\theta(y^\star) - F_\theta(y^\star, u^\star)h^\star, \\ u^\star &= \operatorname*{argmin}_{u \in \mathcal{U}} \{c(u) + u^\top G_\theta(y^\star)h^\star\}, \\ \text{s.t.}\quad & x^\star(t_0) = x_0, \quad h^\star(t_f) = \nabla_y \Phi(y^\star(t_f)).\end{aligned}$$
Figure 2: Structural parallel between the optimal Hamiltonian system (left) and the NCP (right): an optimal NCP system will learn a representation such that the hidden state $h^\star$ is a function approximator of the optimal co-state $\lambda^\star$ of the optimal Hamiltonian system. This results in a policy that follows an optimal dynamic control law via a latent continuous-time process.
### 3.3 The Control-Hamiltonian Structure of Recurrent Networks
The NCP framework does not necessarily demand a fundamentally new network architecture; rather, it provides a unifying mathematical lens through which to reinterpret existing models\. To demonstrate this, consider the optimal latent dynamics and corresponding actions defined by the NCP class:
$$\text{(NCP)}\quad \dot{h}^\star = -B_\theta(y) - F_\theta(y, u^\star)h^\star, \quad \text{and} \quad u^\star = -\frac{1}{2} R^{-1} G_\theta(y) h^\star. \tag{3.4}$$

We can examine standard architectures, such as CT-RNN and GRU, against these theoretical conditions. By isolating the core affine transformations that drive their hidden state updates (omitting time-constants and leak rates for structural clarity), we observe the following algebraic parallel:
$$\text{(CT-RNN)}\quad \dot{h} \propto \phi\bigl(W_{\text{in}} y + W_h h + b\bigr), \quad \text{and} \quad u = W_{\text{out}} h \tag{3.5}$$

$$\text{(GRU)}\quad \dot{h} \propto \phi\bigl(W_{\text{in}} y + W_h (r \odot h) + b\bigr), \quad \text{and} \quad u = W_{\text{out}} h \tag{3.6}$$

where $\phi(\cdot)$ here denotes the non-linear activation function and $r$ is the GRU reset gate (Chung et al., [2014](https://arxiv.org/html/2605.05373#bib.bib42)).
While these standard equations apply non\-linearities over the hidden state differential and include bias terms, their underlying affine structure inherently mirrors the coupled systems of PMP\. By treating the learned weights as state\-dependent matrix operators, we can extract a direct functional mapping to the NCP dynamics:
- The state-cost gradient ($B_\theta$): The input projection $W_{\text{in}} y$ processes the current observation, acting as the structural equivalent of the marginal state cost $-B_\theta(y)$. Notably, if the environment’s true state cost $q(x)$ is quadratic, its gradient is linear. This makes standard RNNs well-suited to capture optimal control in tasks with quadratic reward formulations.
- The dynamics Jacobian ($F_\theta$): The recurrent weight matrices $W_h$ serve the role of the dynamics Jacobian $-F_\theta$, capturing the state-to-state sensitivity of the system over time.
- The readout layer (Hamiltonian minimization): In the PMP framework, optimal control requires minimizing the Hamiltonian. If we approximate the control-influence mapping $G_\theta(y)$ as a static matrix $G_\theta$, the Hamiltonian minimization term $-\frac{1}{2} R^{-1} G_\theta$ collapses into a single constant matrix. The linear readout layer $u = W_{\text{out}} h$ can then be interpreted as a closed-form execution of Hamiltonian minimization.
For the recurrent matrices to correctly integrate the theoretical co-state, they must explicitly contain a representation of the local system dynamics (capturing the state derivatives via $F_\theta$ and the control influence via $G_\theta$). This requirement directly reflects the core premise of the good regulator theorem (Conant and Ross Ashby, [1970](https://arxiv.org/html/2605.05373#bib.bib44)) and the internal model principle (Francis and Wonham, [1976](https://arxiv.org/html/2605.05373#bib.bib45)) in the context of ordinary differential equations (ODEs): any strictly optimal controller must inherently contain a model of the system it regulates.
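The weight correspondence described above can be sketched in a linearized setting. The example below assumes static matrices, treats $B_\theta(y)$ as a linear map of $y$ (as for a quadratic state cost), drops the activation and bias, and uses a diagonal $R$; all names and values are hypothetical. Under these assumptions, one Euler step of the NCP dynamics coincides with an affine recurrent update whose weights are $W_{\text{in}} = -B$ and $W_h = -F$, and the readout is the constant matrix $W_{\text{out}} = -\frac{1}{2} R^{-1} G$.

```python
def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def negate(M):
    return [[-m for m in row] for row in M]

def ncp_step(h, y, B, F, dt):
    """Euler step of h' = -B y - F h (static, linearized B and F)."""
    return [hi - dt * (b + f) for hi, b, f in zip(h, matvec(B, y), matvec(F, h))]

def rnn_affine_step(h, y, W_in, W_h, dt):
    """Euler step of h' = W_in y + W_h h (activation and bias omitted)."""
    return [hi + dt * (a + r) for hi, a, r in zip(h, matvec(W_in, y), matvec(W_h, h))]

def readout_matrix(R_diag, G):
    """W_out = -1/2 R^{-1} G for a diagonal R."""
    return [[-0.5 * g / r for g in row] for r, row in zip(R_diag, G)]

B = [[1.0, 0.0], [0.5, 2.0]]    # assumed state-cost gradient map
F = [[0.3, -0.1], [0.0, 0.4]]   # assumed dynamics Jacobian
h, y = [0.2, -0.7], [1.0, 0.5]

h_ncp = ncp_step(h, y, B, F, dt=0.01)
h_rnn = rnn_affine_step(h, y, negate(B), negate(F), dt=0.01)
W_out = readout_matrix([1.0, 2.0], [[1.0, 0.0], [0.0, 2.0]])
u = matvec(W_out, [2.0, 4.0])
```

The two updates agree by construction, which is the algebraic content of the mapping; it says nothing about whether trained weights actually take these values.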
## 4 Structuring Internal Dynamics in Recurrent Policies
In standard model\-free RL, however, policies are optimized purely via scalar reward maximization\. While the recurrent architectures discussed in the previous section possess the structural capacity to represent neural co\-states, pure policy optimization provides no guarantee that their hidden states will actually converge to the optimal deterministic PMP co\-state representations during learning\. To bridge this optimization gap, we introduce a neural co\-state loss that aligns the hidden states of the network with the theoretical co\-states\. By leveraging standard actor\-critic methods, a co\-state loss is derived that is applicable to any recurrent policy that belongs to the NCP class\.
### 4.1 The Actor-Critic Optimization Framework
Standard recurrent RL operates via the actor-critic paradigm, utilizing an actor ($\pi_\theta$) to select actions and a critic ($V_\phi$) to estimate returns. Optimization commonly relies on PPO (Schulman et al., [2017](https://arxiv.org/html/2605.05373#bib.bib36)) alongside generalized advantage estimation (GAE) (Schulman et al., [2015](https://arxiv.org/html/2605.05373#bib.bib46)). PPO stabilizes updates by clipping the objective function to prevent destructively large parameter shifts, while GAE reduces policy gradient variance by exponentially smoothing the critic’s temporal difference errors.
To optimize the recurrent parameters over finite trajectories, gradients are computed recursively backwards through the unrolled network via back\-propagation through time \(BPTT\)\. The BPTT error propagation takes the standard form:
$$\delta_{t-1} = \delta_t \frac{\partial h_t}{\partial h_{t-1}} + \frac{\partial \mathcal{L}}{\partial h_{t-1}}, \tag{4.1}$$

where $\delta_t$ is the accumulated error gradient of the objective function with respect to the hidden state at time $t$. Mathematically, this recursive gradient chaining is the discrete-time computational equivalent of integrating the co-state (or adjoint) equation.
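The recursion in Eq. (4.1) can be verified on a toy problem. The sketch below assumes a scalar linear recurrence $h_t = a h_{t-1} + b y_t$ and a loss $\mathcal{L} = \sum_t \frac{1}{2} h_t^2$, both chosen only for illustration, runs the backward recursion, and compares the result with a central finite-difference estimate of $d\mathcal{L}/dh_0$.

```python
def forward(h0, ys, a, b):
    """Unroll h_t = a * h_{t-1} + b * y_t and return [h_0, ..., h_T]."""
    hs = [h0]
    for y in ys:
        hs.append(a * hs[-1] + b * y)
    return hs

def total_loss(h0, ys, a, b):
    """Toy objective: sum over time of 0.5 * h_t^2 (includes t = 0)."""
    return sum(0.5 * h * h for h in forward(h0, ys, a, b))

def bptt_delta0(h0, ys, a, b):
    """Backward recursion of Eq. (4.1) down to t = 0."""
    hs = forward(h0, ys, a, b)
    delta = hs[-1]                       # direct loss gradient at the final step
    for t in range(len(hs) - 1, 0, -1):
        # delta_{t-1} = delta_t * dh_t/dh_{t-1} + dL/dh_{t-1} (direct term)
        delta = delta * a + hs[t - 1]
    return delta

ys, a, b, h0 = [0.3, -0.2, 0.5], 0.9, 0.4, 0.1
delta0 = bptt_delta0(h0, ys, a, b)
eps = 1e-6
fd = (total_loss(h0 + eps, ys, a, b) - total_loss(h0 - eps, ys, a, b)) / (2 * eps)
```

Here `delta` plays the role of a discrete co-state: it is propagated backwards through the transposed state-transition sensitivity (the scalar `a`) and driven by the local cost gradient.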
Algorithm 1: Neural Co-state Optimization for Recurrent Networks for PPO

Initialize: actor parameters $\theta = \{E_\theta, \text{GRU}_\theta, W_{\text{out}}\}$, critic parameters $\phi = \{V_\phi\}$, batch size $M$, loss coefficients $c_1, c_2, c_3$.

for iteration $= 1, 2, \dots$ do

- Initialize hidden state $h_0$
- Run policy in environment for $T$ timesteps, collecting $(y_t, h_t, u_t, r_t)$:
  - Encode observation: $\tilde{y}_t = E_\theta(y_t)$
  - Integrate hidden state: $h_t = \text{GRU}_\theta(\tilde{y}_t, h_{t-1})$
  - Apply implicit minimum principle: $\mu_t = \operatorname*{argmin}_{u \in \mathcal{U}} \mathcal{H} = W_{\text{out}} h_t$
  - Sample action $u_t \sim \mathcal{N}(\mu_t, \sigma)$
- Compute advantage estimates $\hat{A}_t$ using critic $V_\phi(\tilde{y}_t, h_t)$
- Compute HJB co-state targets $\hat{\lambda}_t = \text{stop\_gradient}\left(\nabla_{\tilde{y}} V_\phi(\tilde{y}_t, h_t)\right)$
- Optimize joint PPO objective with $K$ epochs and minibatch size $M$:

$$\mathcal{L}_{\text{total}}(\theta, \phi) = \mathcal{L}_{\text{actor}} + c_1 \mathcal{L}_{\text{critic}} - c_2 \mathcal{L}_{\text{entropy}} + c_3 \mathcal{L}_{\text{co-state}}$$

where

$$\mathcal{L}_{\text{co-state}}(\theta) = \mathbb{E}_t\left[1 - \frac{h_t \cdot \hat{\lambda}_t}{\|h_t\|_2 \|\hat{\lambda}_t\|_2}\right]$$

end for
### 4.2 Extracting Co-state Targets from HJB
While standard actor-critic algorithms excel at credit assignment, they optimize recurrent parameters purely for expected return maximization, leaving the differential structure of the hidden state unconstrained. To align the hidden state with the theoretical PMP co-state, we bridge PMP with dynamic programming via the Hamilton-Jacobi-Bellman (HJB) equation. A foundational property linking these frameworks is that the optimal continuous-time co-state is equal to the spatial gradient of the optimal value function: $\lambda^\star(t) = \nabla V^\star(x(t))$.
In the actor-critic paradigm, the learned critic network $V_\phi(y_t)$ approximates this true value function. Using automatic differentiation, we can directly compute its gradient with respect to the encoded environment observation to obtain a functional co-state approximation: $\hat{\lambda}_t = \nabla_{\tilde{y}} V_\phi(\tilde{y}_t)$, where $\tilde{y}_t = E(y_t)$ is the encoded observation.
Because theoretical co\-states can take on arbitrarily large magnitudes while recurrent hidden states are typically bounded by non\-linear activations, minimizing the unnormalized Euclidean distance can cause instability\. Therefore, we regularize the directional alignment using the cosine distance:
$$\mathcal{L}_{\text{co-state}}(\theta) = \mathbb{E}_t\left[1 - \frac{h_t \cdot \hat{\lambda}_t}{\|h_t\|_2 \|\hat{\lambda}_t\|_2}\right] \tag{4.2}$$

Algorithm [1](https://arxiv.org/html/2605.05373#alg1) summarizes our approach to neural co-state optimization.
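A minimal sketch of the target extraction and the cosine loss in Eq. (4.2) follows. It is not the released implementation: central finite differences stand in for automatic differentiation, the critic is a hand-written toy function, the small constant `tiny` guarding against division by zero is an added assumption, and the hidden state is assumed to have the same dimension as the encoded observation so that the inner product is defined.

```python
import math

def grad_fd(V, y, eps=1e-5):
    """Central finite-difference gradient of a scalar critic V at y.

    Stands in for automatic differentiation; the result is a plain list of
    floats, so it is a constant target (the role of stop_gradient).
    """
    grad = []
    for i in range(len(y)):
        up, dn = list(y), list(y)
        up[i] += eps
        dn[i] -= eps
        grad.append((V(up) - V(dn)) / (2 * eps))
    return grad

def cosine_costate_loss(hs, lams, tiny=1e-8):
    """Batch mean of 1 - cos(h_t, lambda_t), as in Eq. (4.2)."""
    total = 0.0
    for h, lam in zip(hs, lams):
        dot = sum(a * b for a, b in zip(h, lam))
        norm = math.sqrt(sum(a * a for a in h)) * math.sqrt(sum(b * b for b in lam))
        total += 1.0 - dot / (norm + tiny)
    return total / len(hs)

critic = lambda y: y[0] ** 2 + 3.0 * y[1]   # toy critic (assumption)
target = grad_fd(critic, [1.0, 2.0])        # gradient of the toy critic at this point
aligned = cosine_costate_loss([[2.0, 3.0]], [target])
opposed = cosine_costate_loss([[1.0, 0.0]], [[-1.0, 0.0]])
```

The loss depends only on direction: it is near zero when the hidden state is parallel to the target, and approaches its maximum of 2 when the two point in opposite directions, regardless of their magnitudes.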
## 5 Experiments and Results
We empirically evaluate the NCP framework across 8 continuous control tasks from the DeepMind Control Suite (Tassa et al., [2018](https://arxiv.org/html/2605.05373#bib.bib62)), using the PPO framework. To explicitly test the memory capacity of the learned latent dynamics, we introduce partial observability via a stochastic sensor dropout mechanism. At each timestep $t$, the entire observation vector $y_t$ is multiplied by a scalar mask $m_t \sim \text{Bernoulli}(1-p)$, resulting in the corrupted observation $\tilde{y}_t = m_t y_t$, so that the full observation is zeroed out with probability $p$. This total sensor blackout requires the recurrent policy to integrate past interactions to handle sensor dropouts. All models are trained with a fixed masking probability ($p_{\text{train}} = 0.5$). Experimental details are described in Appendix [E](https://arxiv.org/html/2605.05373#A5), while the implementation code is available as open source at github.com/DavidLeeftink/neural-costate-policies.
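The dropout mechanism can be sketched in a few lines. The function name `mask_observation` and the use of Python's `random.Random` generator are assumptions for illustration; the released code may implement the mask differently.

```python
import random

def mask_observation(y, p, rng):
    """Zero the whole observation with probability p, i.e. m ~ Bernoulli(1 - p)."""
    m = 1.0 if rng.random() >= p else 0.0   # rng.random() lies in [0, 1)
    return [m * yi for yi in y], m

rng = random.Random(0)
# Count how often the observation survives at the training dropout rate p = 0.5.
kept = sum(mask_observation([0.3, -1.2], 0.5, rng)[1] for _ in range(10000))
keep_rate = kept / 10000
```

Because a single scalar mask multiplies the full vector, every sensor channel is lost simultaneously, which is what forces the policy to rely on its hidden state during blackout steps.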
Through this design, our experiments address three core questions: (1) Performance: does the co-state loss improve sample efficiency and expected returns? (2) Sensitivity: how sensitive is training stability to the co-state loss coefficient? (3) Robustness: do NCP hidden states generalize better in zero-shot evaluation under out-of-distribution (OOD) masking?
### 5.1 Continuous Control and Zero-Shot Robustness on the DeepMind Control Suite
Figure 3: DMControl tasks under partial observability. Learning curves for standard and NCP-regularized recurrent architectures (GRU and CT-RNN) evaluated under a sensor dropout regime. During training, the entire observation vector is zero-masked at each timestep with probability $p=0.5$. Solid lines denote the mean episode return across 10 independent random seeds, and shaded regions represent $\pm 1$ standard deviation. For visual clarity, curves are smoothed using a simple moving average over 30 logged checkpoints (equivalent to roughly $10^{6}$ environment steps).

Addressing (1) Performance, we benchmarked NC-GRU and NC-CTRNN against their unregularized counterparts (Figure [3](https://arxiv.org/html/2605.05373#S5.F3)). The impact of the PMP-derived co-state loss varies with task complexity. On simpler, lower-dimensional tasks (e.g., CartpoleSwingup, BallInCup), all models successfully solve the environment. However, the auxiliary co-state regularization makes initial convergence marginally slower for the NCP variants than for the unregularized baselines. On FingerTurnHard, all architectures struggle to achieve high returns, showing that dexterous manipulation under observation masking remains challenging. For this task, the co-state loss marginally weakens performance compared to the standard GRU. Lastly, on locomotion tasks (WalkerStand, WalkerWalk, WalkerRun, and CheetahRun), the NC-GRU model yields substantial improvements in both final asymptotic return and overall stability. Whereas the NC-GRU outperforms the standard GRU in most cases, the NC-CTRNN performs about as well as the original CTRNN policy, suggesting the co-state loss term is less effective for this architecture.
Analysis of the results suggests that the co-state priors regularize rhythmic locomotion well, whereas performance degrades in contact-heavy tasks like FingerTurnHard. One possible explanation is that the co-state targets are extracted from the partial derivative with respect to the noisy observation rather than the true state, which makes continuous-time co-state tracking harder.
### 5.2 The effect of the co-state loss coefficient
To address (2) Sensitivity, we compare the standard GRU model against the neural co-state GRU model with co-state coefficients of $c_{3}\in\{0.01,0.05,0.1\}$ on the WalkerStand task, while we use the Brax (Freeman et al., [2021](https://arxiv.org/html/2605.05373#bib.bib65)) parameter configurations for the remaining training coefficients. Figure [4](https://arxiv.org/html/2605.05373#S5.F4) shows that incorporating the co-state loss during training substantially improves the expected returns.
Unlike the standard GRU, which plateaus early, all co-state configurations improve performance, peaking at $c_{3}=0.05$ before degrading at $c_{3}=0.1$. Larger coefficients accelerate the reduction of the co-state alignment loss, ensuring tighter adherence to the optimal control prior. In contrast, the unregularized GRU fails to align naturally, maintaining a high loss near one throughout training despite improving expected returns.
To address (3) Robustness, we evaluate model resilience in a zero-shot setting by increasing observation masking from 50% to 75%. All models degrade under this out-of-distribution regime, and the NC models do not demonstrate inherent robustness to the increased masking frequency itself. However, the NC policies largely retain the higher median returns inherited from their superior training-time performance. This suggests that while co-state regularization significantly elevates the agent’s operating point, the performance floor in difficult settings directly reflects the gains achieved on the training distribution.
Figure 4: Ablation of the co-state penalty coefficient on the WalkerStand task. (Left) Expected returns during training under 50% observation masking. (Middle left) Co-state cosine similarity loss, demonstrating the structural alignment of the latent space over time. (Middle right) Final performance distributions evaluated within the training distribution. (Right) Out-of-distribution zero-shot robustness evaluated under 75% sensor masking. Solid lines denote the mean episode return across 10 independent random seeds, and shaded regions represent $\pm 1$ standard deviation. For visual clarity, curves are smoothed using a simple moving average over 30 logged checkpoints (equivalent to roughly $10^{6}$ environment steps). Boxplots aggregate 100 evaluations per seed across varying initial conditions, with white horizontal lines indicating the median.
## 6 Related work
Recurrent policies and hidden state representations. Moving beyond explicit belief-state tracking, modern deep RL encodes historical context as an unstructured latent state via RNNs. To stabilize long-horizon credit assignment, recent architectures introduce inductive biases, such as structured state space models (Gu et al., [2021](https://arxiv.org/html/2605.05373#bib.bib58)) or continuous-time oscillatory dynamics (Rusch and Mishra, [2020](https://arxiv.org/html/2605.05373#bib.bib59)). While these methods enforce a general dynamical prior, NCPs enforce an optimal control prior. Although earlier work has considered the relationship between optimal control and neural optimization (Liu et al., [2021](https://arxiv.org/html/2605.05373#bib.bib66); Bensoussan et al., [2023](https://arxiv.org/html/2605.05373#bib.bib67)), NCPs are unique in bridging neural memory and classical control theory by regularizing the latent space to track the necessary optimality conditions.
Physics-informed latent representations and internal forward models. This control-theoretic regularization aligns with physics-informed machine learning, where architectures like Hamiltonian neural networks (Greydanus et al., [2019](https://arxiv.org/html/2605.05373#bib.bib54)) and deep Lagrangian networks (Lutter et al., [2019](https://arxiv.org/html/2605.05373#bib.bib55)) embed physical conservation laws. Whereas these works primarily model passive environmental dynamics, NCPs explicitly structure the agent’s internal belief state to realize the internal model principle (Francis and Wonham, [1976](https://arxiv.org/html/2605.05373#bib.bib45)).
Pontryagin’s Minimum Principle in Deep RL. While the Hamilton-Jacobi-Bellman (HJB) equation forms the theoretical backbone of value-based RL, its trajectory-centric counterpart, PMP, has only recently gained traction. PMP in RL has been restricted to model-based paradigms: offline policy evaluation (Jin et al., [2020](https://arxiv.org/html/2605.05373#bib.bib9)), deterministic trajectory optimization (Gu et al., [2022](https://arxiv.org/html/2605.05373#bib.bib57); Eberhard et al., [2025](https://arxiv.org/html/2605.05373#bib.bib60)), and planning under uncertainty (Leeftink et al., [2025](https://arxiv.org/html/2605.05373#bib.bib33)). NCPs diverge from this lineage by operating model-free, requiring only a targeted co-state loss to ground closed-loop recurrent policies in optimal control theory.
## 7 Discussion
In this work, we established a formal mathematical equivalence between the hidden states of recurrent reinforcement learning policies and the theoretical co\-states of the Pontryagin minimum principle\. By conceptualizing standard architectures like CT\-RNNs and GRUs as dynamic processes governed by Hamiltonian minimization, we introduced the neural co\-state policy \(NCP\) framework\. Our results demonstrate that extracting PMP co\-state targets from the HJB value gradient provides a tractable auxiliary loss that successfully grounds internal memory in optimal control theory, matching or improving performance on partially observable continuous control tasks\.
Limitations and future work. While bridging a critical theoretical gap, NCPs derive the co-state targets via the critic’s spatial gradient. Because this gradient is obtained by automatic differentiation of a learned critic, these targets can be noisy, a known vulnerability in deterministic policy gradients (Lillicrap et al., [2016](https://arxiv.org/html/2605.05373#bib.bib64)). A conceptual limitation is that standard recurrent architectures require bounded non-linear activations for numerical stability, deviating from the unconstrained integration of theoretical PMP co-states.
This motivates three avenues for future research. First, the NCP equivalence framework can be extended to identify and map broader classes of memory architectures, such as coupled oscillator or long short-term memory (LSTM; Hochreiter and Schmidhuber, [1997](https://arxiv.org/html/2605.05373#bib.bib63)) models. Second, theoretically principled readout layers can be designed to explicitly solve time- and fuel-optimal tasks that require discontinuous bang-bang or bang-off-bang controls (App. [C](https://arxiv.org/html/2605.05373#A3)). Finally, integrating probabilistic value network ensembles could increase data efficiency, providing optimal dynamic representations under epistemic uncertainty (App. [D](https://arxiv.org/html/2605.05373#A4)).
Broader Impact. Beyond algorithmic control, our framework shares deep conceptual connections with biological motor control. The human brain operates as an exceptionally intricate recurrent policy to resolve partial observability in continuous motor tasks (Mastrogiuseppe and Ostojic, [2018](https://arxiv.org/html/2605.05373#bib.bib52); Tsay and Ivry, [2026](https://arxiv.org/html/2605.05373#bib.bib56)). Viewing the brain through the lens of NCPs could provide a rigorous theoretical framework for understanding how biological networks encode dynamic optimality. Furthermore, continuous control under partial observability is the defining challenge of physical robotics (Schneider et al., [2022](https://arxiv.org/html/2605.05373#bib.bib61)). Real-world tasks such as bipedal locomotion, dexterous manipulation, and autonomous navigation are inherently plagued by noisy sensors, temporary occlusions, and hardware latency. By grounding the memory state of the policy in the mathematics of optimal control, NCPs offer a principled path forward for robotic control. We discuss broader societal impact in Appendix [A](https://arxiv.org/html/2605.05373#A1).
Ultimately, this work demonstrates that previously black\-box neural memory can be anchored in the rigorous mathematics of the minimum principle, providing a new theoretical perspective on recurrent computation in continuous control\.
## Acknowledgements
This publication is part of the project ROBUST: Trustworthy AI\-based Systems for Sustainable Growth with project number KICH3\.LTP\.20\.006, which is \(partly\) financed by the Dutch Research Council \(NWO\), ASMPT, and the Dutch Ministry of Economic Affairs and Climate Policy \(EZK\) under the program LTP KIC 2020\-2023\. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors\.
## References
- R\. D\. Beer \(1995\)On the dynamics of small continuous\-time recurrent neural networks\.Adaptive Behavior3\(4\),pp\. 469–509\.Cited by:[§1](https://arxiv.org/html/2605.05373#S1.p3.1)\.
- R\. Bellman \(1966\)Dynamic programming\.Science153\(3731\),pp\. 34–37\.Cited by:[§1](https://arxiv.org/html/2605.05373#S1.p4.1)\.
- A\. Bensoussan, J\. Han, S\. C\. P\. Yam, and X\. Zhou \(2023\)Value\-gradient based formulation of optimal control problem and machine learning algorithm\.SIAM Journal on Numerical Analysis61\(2\),pp\. 973–994\.Cited by:[§6](https://arxiv.org/html/2605.05373#S6.p1.1)\.
- J\. Bradbury, R\. Frostig, P\. Hawkins, M\. J\. Johnson, Y\. Katariya, C\. Leary, D\. Maclaurin, G\. Necula, A\. Paszke, J\. VanderPlas, S\. Wanderman\-Milne, and Q\. Zhang \(2018\)JAX: composable transformations of Python\+NumPy programsExternal Links:[Link](http://github.com/jax-ml/jax)Cited by:[Appendix E](https://arxiv.org/html/2605.05373#A5.p1.1)\.
- A\. E\. Bryson and Y\. Ho \(1975\)Applied optimal control: optimization, estimation, and control\.Taylor & Francis\.Cited by:[§B\.2](https://arxiv.org/html/2605.05373#A2.SS2.p1.1),[§3](https://arxiv.org/html/2605.05373#S3.p1.1)\.
- J\. Chung, C\. Gulcehre, K\. Cho, and Y\. Bengio \(2014\)Empirical evaluation of gated recurrent neural networks on sequence modeling\.arXiv preprint arXiv:1412\.3555\.Cited by:[§1](https://arxiv.org/html/2605.05373#S1.p3.1),[§3\.3](https://arxiv.org/html/2605.05373#S3.SS3.p1.2)\.
- R\. C\. Conant and W\. Ross Ashby \(1970\)Every good regulator of a system must be a model of that system\.International Journal of Systems Science1\(2\),pp\. 89–97\.Cited by:[§3\.3](https://arxiv.org/html/2605.05373#S3.SS3.p3.2)\.
- O\. Eberhard, C\. Vernade, and M\. Muehlebach \(2025\)A pontryagin perspective on reinforcement learning\.InProceedings of the Seventh Annual Learning for Dynamics & Control Conference,Proceedings of Machine Learning Research, Vol\.283,pp\. 233–244\.Cited by:[§6](https://arxiv.org/html/2605.05373#S6.p3.1)\.
- B\. A\. Francis and W\. M\. Wonham \(1976\)The internal model principle of control theory\.Automatica12\(5\),pp\. 457–465\.Cited by:[§3\.3](https://arxiv.org/html/2605.05373#S3.SS3.p3.2),[§6](https://arxiv.org/html/2605.05373#S6.p2.1)\.
- C\. D\. Freeman, E\. Frey, A\. Raichuk, S\. Girgin, I\. Mordatch, and O\. Bachem \(2021\)Brax–a differentiable physics engine for large scale rigid body simulation\.arXiv preprint arXiv:2106\.13281\.Cited by:[§E\.4](https://arxiv.org/html/2605.05373#A5.SS4.p1.1),[Appendix E](https://arxiv.org/html/2605.05373#A5.p1.1),[§5\.2](https://arxiv.org/html/2605.05373#S5.SS2.p1.1)\.
- S\. Greydanus, M\. Dzamba, and J\. Yosinski \(2019\)Hamiltonian neural networks\.Advances in neural information processing systems32\.Cited by:[§6](https://arxiv.org/html/2605.05373#S6.p2.1)\.
- A\. Gu, K\. Goel, and C\. Ré \(2021\)Efficiently modeling long sequences with structured state spaces\.arXiv preprint arXiv:2111\.00396\.Cited by:[§6](https://arxiv.org/html/2605.05373#S6.p1.1)\.
- C\. Gu, H\. Xiong, and Y\. Chen \(2022\)Pontryagin optimal control via neural networks\.arXiv preprint arXiv:2212\.14566\.Cited by:[§6](https://arxiv.org/html/2605.05373#S6.p3.1)\.
- M\. J\. Hausknecht and P\. Stone \(2015\)Deep recurrent Q\-learning for partially observable MDPs\.InAAAI fall symposia,Vol\.45,pp\. 141\.Cited by:[§1](https://arxiv.org/html/2605.05373#S1.p2.1)\.
- S\. Hochreiter and J\. Schmidhuber \(1997\)Long short\-term memory\.Neural computation9\(8\),pp\. 1735–1780\.Cited by:[§7](https://arxiv.org/html/2605.05373#S7.p3.1)\.
- S\. Huang, R\. F\. J\. Dossa, A\. Raffin, A\. Kanervisto, and W\. Wang \(2022\)The 37 implementation details of proximal policy optimization\.The ICLR Blog Track 2023\.Cited by:[§E\.2](https://arxiv.org/html/2605.05373#A5.SS2.p1.1)\.
- W\. Jin, Z\. Wang, Z\. Yang, and S\. Mou \(2020\)Pontryagin differentiable programming: an end\-to\-end learning and control framework\.Advances in Neural Information Processing Systems33,pp\. 7979–7992\.Cited by:[§6](https://arxiv.org/html/2605.05373#S6.p3.1)\.
- L\. P\. Kaelbling, M\. L\. Littman, and A\. R\. Cassandra \(1998\)Planning and acting in partially observable stochastic domains\.Artificial intelligence101\(1\-2\),pp\. 99–134\.Cited by:[§1](https://arxiv.org/html/2605.05373#S1.p1.1)\.
- D\. E\. Kirk \(2004\)Optimal control theory: an introduction\.Courier Corporation\.Cited by:[§B\.2](https://arxiv.org/html/2605.05373#A2.SS2.p1.1),[§B\.2](https://arxiv.org/html/2605.05373#A2.SS2.p4.10),[§B\.2](https://arxiv.org/html/2605.05373#A2.SS2.p4.11),[§3](https://arxiv.org/html/2605.05373#S3.p1.1)\.
- D\. Leeftink, C\. Yildiz, S\. Ridderbusch, M\. Hinne, and M\. Van Gerven \(2025\)Optimal control of probabilistic dynamics models via mean Hamiltonian minimization\.In2025 IEEE 64th Conference on Decision and Control \(CDC\),Vol\.,pp\. 4146–4153\.External Links:[Document](https://dx.doi.org/10.1109/CDC57313.2025.11312001)Cited by:[Appendix D](https://arxiv.org/html/2605.05373#A4.p3.1),[§6](https://arxiv.org/html/2605.05373#S6.p3.1)\.
- D\. Liberzon \(2011\)Calculus of variations and optimal control theory: a concise introduction\.Cited by:[§B\.2](https://arxiv.org/html/2605.05373#A2.SS2.p1.1)\.
- T\. P\. Lillicrap, J\. J\. Hunt, A\. Pritzel, N\. Heess, T\. Erez, Y\. Tassa, D\. Silver, and D\. Wierstra \(2016\)Continuous control with deep reinforcement learning\.In4th International Conference on Learning Representations, ICLR 2016,San Juan, Puerto Rico\.Cited by:[§7](https://arxiv.org/html/2605.05373#S7.p2.1)\.
- G\. Liu, T\. Chen, and E\. Theodorou \(2021\)Dynamic game theoretic neural optimizer\.InInternational Conference on Machine Learning,pp\. 6759–6769\.Cited by:[§6](https://arxiv.org/html/2605.05373#S6.p1.1)\.
- M\. Lutter, C\. Ritter, and J\. Peters \(2019\)Deep lagrangian networks: using physics as model prior for deep learning\.arXiv preprint arXiv:1907\.04490\.Cited by:[§6](https://arxiv.org/html/2605.05373#S6.p2.1)\.
- F\. Mastrogiuseppe and S\. Ostojic \(2018\)Linking connectivity, dynamics, and computations in low\-rank recurrent neural networks\.Neuron99\(3\),pp\. 609–623\.Cited by:[§7](https://arxiv.org/html/2605.05373#S7.p4.1)\.
- L\. S\. Pontryagin \(1987\)Mathematical theory of optimal processes\.1st edition,Routledge\.External Links:[Document](https://dx.doi.org/10.1201/9780203749319)Cited by:[§B\.1](https://arxiv.org/html/2605.05373#A2.SS1.p1.6),[§B\.2](https://arxiv.org/html/2605.05373#A2.SS2.p1.1),[§1](https://arxiv.org/html/2605.05373#S1.p3.1),[§3](https://arxiv.org/html/2605.05373#S3.p1.1)\.
- T\. K\. Rusch and S\. Mishra \(2020\)Coupled oscillatory recurrent neural network \(CORNN\): an accurate and \(gradient\) stable architecture for learning long time dependencies\.arXiv preprint arXiv:2010\.00951\.Cited by:[§6](https://arxiv.org/html/2605.05373#S6.p1.1)\.
- T\. Schneider, B\. Belousov, H\. Abdulsamad, and J\. Peters \(2022\)Active inference for robotic manipulation\.arXiv preprint arXiv:2206\.10313\.Cited by:[§7](https://arxiv.org/html/2605.05373#S7.p4.1)\.
- J\. Schulman, P\. Moritz, S\. Levine, M\. Jordan, and P\. Abbeel \(2015\)High\-dimensional continuous control using generalized advantage estimation\.arXiv preprint arXiv:1506\.02438\.Cited by:[§4\.1](https://arxiv.org/html/2605.05373#S4.SS1.p1.2)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§1](https://arxiv.org/html/2605.05373#S1.p5.1),[§4\.1](https://arxiv.org/html/2605.05373#S4.SS1.p1.2)\.
- Y\. Tassa, Y\. Doron, A\. Muldal, T\. Erez, Y\. Li, D\. de Las Casas, D\. Budden, A\. Abdolmaleki, J\. Merel, A\. Lefrancq, T\. Lillicrap, and M\. Riedmiller \(2018\)DeepMind Control Suite\.arXiv e\-prints,pp\. arXiv:1801\.00690\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.1801.00690),1801\.00690Cited by:[§1](https://arxiv.org/html/2605.05373#S1.p5.1),[§5](https://arxiv.org/html/2605.05373#S5.p1.5)\.
- J\. S\. Tsay and R\. B\. Ivry \(2026\)Cerebellar contributions to action and cognition: prediction, timescale, and continuity\.Proceedings of the National Academy of Sciences123\(15\),pp\. e2524258123\.Cited by:[§7](https://arxiv.org/html/2605.05373#S7.p4.1)\.
- R\. Vinter \(1986\)Is the costate variable the state derivative of the value function?\.In1986 25th IEEE Conference on Decision and Control,pp\. 1988–1989\.Cited by:[§1](https://arxiv.org/html/2605.05373#S1.p4.1),[§3\.1](https://arxiv.org/html/2605.05373#S3.SS1.p1.4)\.
- D\. Wierstra, A\. Förster, J\. Peters, and J\. Schmidhuber \(2010\)Recurrent policy gradients\.Logic Journal of IGPL18\(5\),pp\. 620–634\.Cited by:[§1](https://arxiv.org/html/2605.05373#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.05373#S2.SS2.p2.6)\.
## Appendix
## Appendix A Impact Statement
This work provides a foundation for reinforcement learning under partial observability. The proposed theoretical link makes previously black-box policies more interpretable and improves their performance. Practically, controllers capable of sustaining stable behavior under partial observations are vital for safety-critical autonomous systems, such as robotics and medical devices.
Naturally, advancements in continuous control carry dual-use implications. Algorithms capable of tolerating sensor failure are equally applicable to military robotics and autonomous weapons. Although our current focus is strictly on theoretical algorithmic design and simulated benchmarks, translating these highly resilient RL agents to the real world will require careful, domain-specific oversight to prevent misuse.
## Appendix B Optimal Hamiltonian System
### B.1 Additional background on co-states and the control-Hamiltonian
Repeating the main text for completeness, the control\-Hamiltonian expresses how the cost\-to\-go changes over time with the dynamics:
$$\mathcal{H}(x,\lambda,u)\coloneqq\mathcal{L}(x,u)+\lambda^{\top}\dot{x}=q(x)+c(u)+\lambda^{\top}\bigl(f(x)+g(x)u\bigr). \tag{B.1}$$

Here, $\lambda\in\mathbb{R}^{d_{x}}$ are referred to as co-states, and encode the sensitivity of the value of a state with respect to the dynamics. These follow from solving the constrained optimization problem in Eq. [2.4](https://arxiv.org/html/2605.05373#S2.E4) using the Lagrange multiplier function $\lambda(t)$ for $t\in[t_{0},t_{f}]$. By defining the differential function of the co-states, we obtain a coupled ODE:

$$\begin{aligned}\dot{x}(t)&=\nabla_{\lambda}\mathcal{H}(x,\lambda,u)=f(x(t))+g(x(t))u, &&\text{(B.2)}\\ \dot{\lambda}(t)&=-\nabla_{x}\mathcal{H}(x,\lambda,u)=-\nabla_{x}q-\left(\nabla_{x}\dot{x}\right)\lambda(t) &&\text{(B.3)}\end{aligned}$$

for $t\in[t_{0},t_{f}]$, where the notation $(\nabla_{y}f)_{i,j}\coloneqq\partial f_{j}/\partial y_{i}$ denotes Jacobians, in this case $\nabla_{x}\mathcal{H}=\bigl(\frac{\partial\mathcal{H}}{\partial x_{1}},\ldots,\frac{\partial\mathcal{H}}{\partial x_{d_{x}}}\bigr)^{\top}$. The interaction dynamics between the states and co-states form a coupled dynamical system, which we refer to as the Hamiltonian system. Since the co-states incorporate the cost function, this representation lends itself well to the inclusion of a notion of optimality. The minimum principle by Pontryagin ([1987](https://arxiv.org/html/2605.05373#bib.bib8)), for historical reasons also referred to as the maximum principle, states:

$$u^{\star}=\operatorname*{argmin}_{u\in\mathcal{U}}\mathcal{H}(x^{\star},\lambda^{\star},u) \tag{B.4}$$

This tells us that if the optimal control $u^{\star}(t)$ is applied, producing corresponding unique optimal paths $x^{\star}(t)$ and $\lambda^{\star}(t)$, the Hamiltonian function is minimized at all time points $t\in[t_{0},t_{f}]$. This is a necessary but not a sufficient condition of optimality, implying the following optimal system:
### B.2 Derivation of the optimal Hamiltonian system
In this section, we provide the derivation of the necessary optimality conditions for the deterministic continuous-time control problem, drawing on standard results from the calculus of variations (Bryson and Ho, [1975](https://arxiv.org/html/2605.05373#bib.bib23); Kirk, [2004](https://arxiv.org/html/2605.05373#bib.bib2)). For brevity, we consider an unbounded control set $\mathcal{U}=\mathbb{R}^{d_{u}}$, which we treat in continuous time by setting the Gâteaux derivative of the cost functional to zero. For bounded control sets, the proof is significantly more involved and relies on needle variations, originally proposed by Pontryagin ([1987](https://arxiv.org/html/2605.05373#bib.bib8)). We refer the reader to Chapter 4 in Liberzon ([2011](https://arxiv.org/html/2605.05373#bib.bib43)) for the needle-variation proof.
Let $u^{\star}(t)$ be an optimal control that minimizes the cost functional $J(u)$ over the set of admissible controls $\mathcal{U}$. Let the optimal trajectory be $x^{\star}(t)$, with corresponding co-states $\lambda^{\star}(t)$, and let the optimal Hamiltonian be evaluated as $\mathcal{H}^{\star}\coloneqq\mathcal{H}(x^{\star}(t),\lambda^{\star}(t),u^{\star}(t))$. A necessary condition for optimality is that the gradient of the Hamiltonian with respect to the control vanishes.
Proof. To enforce the dynamic equality constraint $\dot{x}=f(x)+g(x)u$, we introduce the Lagrange multiplier function $\lambda(t)$ (the co-state) that ensures this constraint for all $t\in[t_{0},t_{f}]$. Because the constraint residual $f(x)+g(x)u-\dot{x}$ is zero along any valid trajectory, adding it to the integral does not change the value of $J(u)$:

$$J(u)=\Phi(x(t_{f}))+\int_{t_{0}}^{t_{f}}\left[q(x)+c(u)+\lambda(t)^{\top}\Big(f(x)+g(x)u-\dot{x}\Big)\right]dt. \tag{B.5}$$

By substituting the definition of the control-Hamiltonian, $\mathcal{H}(x,\lambda,u)=q(x)+c(u)+\lambda^{\top}(f(x)+g(x)u)$, we can rewrite the objective as:

$$J(u)=\Phi(x(t_{f}))+\int_{t_{0}}^{t_{f}}\left[\mathcal{H}(x,\lambda,u)-\lambda(t)^{\top}\dot{x}\right]dt. \tag{B.6}$$
We consider the Gâteaux derivative of this cost functional:
$$\delta J(u)=\lim_{\epsilon\to 0}\frac{J(u+\epsilon\delta u)-J(u)}{\epsilon}, \tag{B.7}$$

where $\delta u(t)$ is an arbitrary control variation defined over $[t_{0},t_{f}]$. The first variation can be obtained by isolating the first-order terms around the optimal solution $u^{\star}(t)$:

$$J(u^{\star}+\epsilon\delta u)-J(u^{\star})\approx\delta J|_{u^{\star}}(\delta u)\,\epsilon, \tag{B.8}$$

where higher-order terms are neglected as $\epsilon\to 0$. Since, by definition, $u^{\star}(t)$ is a minimum for $J$, its first variation must be zero. Note that a variation in the control $\delta u(t)$ naturally induces a corresponding variation in the state trajectory, denoted as $\delta x(t)$. A well-known result in optimal control theory (Kirk, [2004](https://arxiv.org/html/2605.05373#bib.bib2), Eq. 5.1.13) states that to successfully absorb the variations in the state trajectory $\delta x(t)$, the co-state multiplier $\lambda(t)$ must be chosen to satisfy the differential equation:

$$\dot{\lambda}^{\star}(t)=-\nabla_{x}\mathcal{H}(x^{\star},\lambda^{\star},u^{\star}), \tag{B.9}$$

with the terminal transversality condition $\lambda^{\star}(t_{f})=\nabla_{x}\Phi(x^{\star}(t_{f}))$. With the state variations eliminated by this specific choice of optimal co-state dynamics, the remaining first variation of the cost with respect to the control is given by (Kirk, [2004](https://arxiv.org/html/2605.05373#bib.bib2), Eq. 5.1.14):

$$\delta J|_{u^{\star}}(\delta u)=\int_{t_{0}}^{t_{f}}\left[\nabla_{u}\mathcal{H}^{\star}\right]^{\top}\delta u(t)\,dt. \tag{B.10}$$

By the fundamental lemma of the calculus of variations, since $\delta u(t)$ is an arbitrary variation, the integral can only equal zero if the integrand itself vanishes identically. It follows that:

$$\nabla_{u}\mathcal{H}^{\star}=0,\quad\text{for all }t\in[t_{0},t_{f}]. \tag{B.11}$$

For unconstrained control sets, this stationary condition defines the optimal control law. For bounded, closed control sets $\mathcal{U}$, this local stationary condition is generalized by the PMP to the global minimization of the Hamiltonian across the action space:

$$u^{\star}(t)=\operatorname*{argmin}_{u\in\mathcal{U}}\mathcal{H}(x^{\star}(t),\lambda^{\star}(t),u). \tag{B.12}$$

This establishes the optimality condition for the control sequence.
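As a worked instance of the stationarity condition, consider the quadratic control cost $c(u)=u^{\top}Ru$ used in Appendix C. The derivation below additionally assumes that $R$ is symmetric positive definite, which is not stated in the text but is needed for $R^{-1}$ to exist and for the stationary point to be a minimum.

```latex
\begin{aligned}
\mathcal{H}(x,\lambda,u) &= q(x) + u^{\top} R u + \lambda^{\top}\bigl(f(x) + g(x)u\bigr), \\
\nabla_{u}\mathcal{H} &= 2Ru + g(x)^{\top}\lambda = 0
\;\;\Longrightarrow\;\;
u^{\star} = -\tfrac{1}{2}\, R^{-1} g(x)^{\top}\lambda^{\star}, \\
\nabla_{u}^{2}\mathcal{H} &= 2R \succ 0 .
\end{aligned}
```

The positive definite Hessian confirms that the Hamiltonian is strictly convex in $u$, so in this case the local stationary condition and the global minimization over $u$ select the same control.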
## Appendix C Readout Layer Design via Optimal Control Principles
Figure 5 panels: (1) Quadratic Control, $\mathcal{L}(x,u)=q(x)+u^{\top}Ru$; (2) Time/State (Bang-Bang), $\mathcal{L}(x,u)=q(x)$ or $1$; (3) Fuel (Bang-Off-Bang), $\mathcal{L}(x,u)=q(x)+\|u\|$, with a "Coast" interval. Each panel plots $u^{*}(t)$ against $t$ between $u_{\min}$ and $u_{\max}$.

Figure 5: Hamiltonian policies under varying cost functions: energy-optimal (1), time-optimal (2), and fuel-optimal (3). These can produce either smooth optimal control functions or discontinuous bang-bang controls. All of them leverage the same co-states and can thus serve as an 'actor' equivalent to the one that leverages the latent neural co-state.

In standard deep RL actor architectures, the readout layer (or policy head) that maps the final hidden state to an action is typically chosen heuristically. However, because the NCP class explicitly aims to align the hidden state with the theoretical Pontryagin co-state ($h\approx\lambda^{\star}$), the optimal readout layer is no longer arbitrary but instead has to minimize the control-Hamiltonian function with respect to the specific running cost function of the environment.
By defining the optimal switching function as $\sigma(t)=-g(x,t)^{\top}\lambda(t)$ and assuming symmetric actuator limits $\mathcal{U}=[-u_{\max},u_{\max}]$ for notational clarity, we can structurally design the policy head to match different classes of optimal control problems (illustrated in Fig. [5](https://arxiv.org/html/2605.05373#A3.F5)):
1\. Quadratic Control \(Smooth Action\):As explored in the main text, environments with a quadratic penalty on control effort,ℒ\(x,u\)=q\(x\)\+u⊤Ru\\mathcal\{L\}\(x,u\)=q\(x\)\+u^\{\\top\}Ruand control\-affine system dynamicsx˙=f\(x\)\+g\(x\)u\\dot\{x\}=f\(x\)\+g\(x\)u, yield a Hamiltonian that is convex with respect touu\. Setting the gradient to zero results in the optimal control law:u⋆=R−1σu^\{\\star\}=R^\{\-1\}\\sigma\. In our framework, this corresponds to a standard linear readout layer without bounding activation functions\.
2. Time-Optimal and State-Only Costs (Bang-Bang Control): For many robotics tasks (e.g., minimum-time navigation), the cost consists only of state-dependent terms, meaning control effort is not penalized: $\mathcal{L}(x,u) = q(x)$. Because the Hamiltonian is linear with respect to $u$, and physical actuators possess strict limits $[u_{\min}, u_{\max}]$, the minimum principle dictates that the optimal solution lies exclusively on the boundaries. This results in bang-bang control, where the agent switches instantaneously between maximum and minimum effort based on the sign of the switching function:

$$u^{\star} = \begin{cases} \operatorname{sign}(\sigma)\,u_{\max} & \text{if } |\sigma| > 0 \\ 0 & \text{if } \sigma = 0 \end{cases}$$

In the NCP framework, a bang-bang optimal controller can be enforced simply by applying a $\operatorname{sign}(\cdot)$ activation function to the policy head.
3. Fuel-Optimal Costs (Bang-Off-Bang Control): If the environment penalizes the absolute magnitude of the control effort (e.g., conserving fuel in aerospace applications), an $L_1$ penalty is introduced: $\mathcal{L}(x,u) = q(x) + \|u\|$. The optimal control law develops a deadzone where it is optimal to coast (apply no control effort) when the sensitivity is low:

$$u^{\star} = \begin{cases} \operatorname{sign}(\sigma)\,u_{\max} & \text{if } |\sigma| > 1 \\ 0 & \text{if } |\sigma| < 1. \end{cases}$$
While the empirical evaluations in this work focus on the continuous quadratic case, the interpretation of recurrent hidden states as Pontryagin co\-states allows one to design readout layers specific to the properties of the cost or reward function of the environment while leaving the underlying recurrent co\-state dynamics unchanged\.
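The three policy heads described above can be sketched in JAX as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the switching function $\sigma$ is taken as a given input, and the boundary case $|\sigma| = 1$ of the bang-off-bang law (left open in the text) is assigned to coasting here as an assumption.

```python
import jax.numpy as jnp


def quadratic_head(sigma, R):
    """Smooth readout u* = R^{-1} sigma (linear, no bounding activation)."""
    return jnp.linalg.solve(R, sigma)


def bang_bang_head(sigma, u_max):
    """u* = sign(sigma) * u_max; jnp.sign(0) = 0 covers the sigma = 0 case."""
    return jnp.sign(sigma) * u_max


def bang_off_bang_head(sigma, u_max):
    """u* = sign(sigma) * u_max if |sigma| > 1, otherwise coast (u* = 0).

    Assumption: |sigma| == 1 is treated as coasting.
    """
    return jnp.where(jnp.abs(sigma) > 1.0, jnp.sign(sigma) * u_max, 0.0)


# Hypothetical switching-function values for a 5-dimensional control input.
sigma = jnp.array([-2.0, -0.5, 0.0, 0.5, 2.0])
R = 2.0 * jnp.eye(5)  # illustrative positive-definite control cost

u_quad = quadratic_head(sigma, R)
u_bb = bang_bang_head(sigma, u_max=1.0)
u_bob = bang_off_bang_head(sigma, u_max=1.0)
```

All three heads consume the same $\sigma$, so swapping the cost class only changes the readout while the recurrent co-state dynamics stay untouched.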
## Appendix D Epistemic Uncertainty and Mean Hamiltonian Minimization
An agent attempting to make an intelligent trade-off between exploration and exploitation must focus its exploration on *epistemic* uncertainty, associated with what the agent does not know yet. Unlike aleatoric uncertainty, which represents inherent and irreducible randomness, epistemic uncertainty can be reduced by gathering more data. In optimal control, epistemic uncertainty is often modeled as parametric uncertainty, obtained through Bayesian inference over the environment's dynamics. We can formulate this by considering a policy that aims to minimize the expected cost over a posterior distribution of plausible dynamic models:
$$\begin{aligned}
\pi^{\star} &\coloneqq \operatorname*{argmin}_{\pi\in\Pi}\, \mathbb{E}_{\zeta\sim p(\zeta\mid\mathcal{D})}\left[J(\pi_{\theta}(x_{\zeta}(t)))\right] \\
\text{s.t.}\quad \dot{x}_{\zeta}(t) &= f_{\zeta}\bigl(x_{\zeta}(t),u(t),t\bigr),\quad \text{for } t\in[t_{0},t_{f}] \\
x_{\zeta}(t_{0}) &= x_{0}.
\end{aligned} \tag{D.1}$$

where $J(\theta)$ denotes the deterministic cost functional evaluated under dynamics parameterized by $\zeta$, and $p(\zeta\mid\mathcal{D})$ is the posterior distribution given past interaction data $\mathcal{D}$.
To solve this probabilistically robust optimization problem, we can draw a finite set of $m$ samples $\zeta_{i}\sim p(\zeta\mid\mathcal{D})$. These samples form an ensemble of deterministic dynamical systems driven by a shared control input $u(t)$:
$$\begin{aligned}
\dot{x}_{\zeta_{i}}(t) &= f_{\zeta_{i}}\bigl(x_{\zeta_{i}}(t),u(t),t\bigr), && \text{(D.2)} \\
\dot{\lambda}_{\zeta_{i}}(t) &= -\nabla_{x}\mathcal{H}\bigl(x_{\zeta_{i}}(t),\lambda_{\zeta_{i}}(t),u(t),t\bigr), && \text{(D.3)} \\
\text{s.t.}\quad \lambda_{\zeta_{i}}(t_{f}) &= \nabla_{x}\Phi\bigl(x_{\zeta_{i}}(t_{f})\bigr). && \text{(D.4)}
\end{aligned}$$
Ifu⋆\(t\)u^\{\\star\}\(t\)is an optimal control sequence for this ensemble, it must satisfy the Mean Hamiltonian Minimization principle\(Leeftinket al\.,[2025](https://arxiv.org/html/2605.05373#bib.bib33)\):
$$u^{\star}(t) = \operatorname*{argmin}_{u\in\mathcal{U}}\, \mathbb{E}_{\zeta\sim p(\zeta\mid\mathcal{D})}\left[\mathcal{H}\left(x_{\zeta}(t),\lambda_{\zeta}(t),u,t\right)\right]. \tag{D.5}$$
Implications for Neural Co-state Policies. While the NCP framework introduced in this work involves optimizing a deterministic point estimate, Eq. ([D.5](https://arxiv.org/html/2605.05373#A4.E5)) provides a theoretical foundation for a *Bayesian* NCP. Because the theoretical co-state is linked to the value function gradient, $\lambda^{\star} = \nabla_{x}V^{\star}$, this uncertainty can be captured by an ensemble of $m$ critics. By maintaining parallel hidden states $h_{\zeta_{i}}(t)$ across these critics, each representing a unique co-state hypothesis, the actor can take exploratory actions by minimizing the mean of the ensemble Hamiltonians:
$$u^{\star}(t) \approx \operatorname*{argmin}_{u\in\mathcal{U}}\, \frac{1}{m}\sum_{i=1}^{m}\left\{c(u) + u^{\top}G_{\theta}(y)\,h_{\zeta_{i}}(t)\right\}. \tag{D.6}$$

We leave the empirical validation of this Bayesian ensemble approach to future work.
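To make Eq. (D.6) concrete, the sketch below works through one special case under stated assumptions: a quadratic control cost $c(u) = u^{\top}Ru$ and an unconstrained action set. In that case the mean Hamiltonian reduces to $u^{\top}Ru + u^{\top}G\bar{h}$ with $\bar{h} = \frac{1}{m}\sum_i h_{\zeta_i}$, and setting the gradient $2Ru + G\bar{h}$ to zero gives $u^{\star} = -\tfrac{1}{2}R^{-1}G\bar{h}$. The shapes, the random stand-ins for $G_{\theta}(y)$ and the ensemble hidden states, and all function names are hypothetical.

```python
import jax
import jax.numpy as jnp


def mean_hamiltonian(u, R, G, h_ens):
    """Mean over ensemble members of c(u) + u^T G h_i, with c(u) = u^T R u."""
    per_member = u @ R @ u + h_ens @ (G.T @ u)  # shape (m,)
    return per_member.mean()


def mean_hamiltonian_argmin(R, G, h_ens):
    """Closed-form minimizer for the quadratic, unconstrained case."""
    h_bar = h_ens.mean(axis=0)
    return -0.5 * jnp.linalg.solve(R, G @ h_bar)


# Hypothetical sizes: m ensemble members, hidden size H, action size A.
m, H, A = 8, 16, 3
k_g, k_h = jax.random.split(jax.random.PRNGKey(0))
R = jnp.diag(jnp.array([1.0, 2.0, 3.0]))   # illustrative positive-definite cost
G = jax.random.normal(k_g, (A, H))         # stand-in for G_theta(y)
h_ens = jax.random.normal(k_h, (m, H))     # stand-in for the hidden states h_{zeta_i}

u_star = mean_hamiltonian_argmin(R, G, h_ens)
# The gradient of the mean Hamiltonian should vanish at the minimizer.
grad_at_opt = jax.grad(mean_hamiltonian)(u_star, R, G, h_ens)
```

Because the Hamiltonian is linear in the co-state, averaging the ensemble Hamiltonians in this special case is the same as applying the single-model readout to the averaged hidden state.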
## Appendix E Implementation and training details
Here we provide an overview of the algorithmic components, architectural choices, and hyperparameters used in our implementation to ensure reproducibility. Our systems and default hyperparameter configurations follow standard continuous control setups, drawing heavily from the Brax environments (Freeman et al., [2021](https://arxiv.org/html/2605.05373#bib.bib65)). The full implementation is written in JAX (Bradbury et al., [2018](https://arxiv.org/html/2605.05373#bib.bib68)), leveraging `jax.vmap` for vectorized environment rollouts and `jax.lax.scan` for efficient BPTT.
### E.1 Network Architecture
Both the NC-GRU and NC-CTRNN share a common actor-critic blueprint. To ensure stable learning, the actor and critic use a shared encoder. The two variants differ only in their recurrent core:
- Observation Encoder: At each timestep, the raw observation $y_{t}$ is dynamically normalized (see Appendix [E.2](https://arxiv.org/html/2605.05373#A5.SS2)) and passed through a linear encoder followed by a hyperbolic tangent activation: $x_{t} = \tanh(W_{\text{enc}}y_{t} + b_{\text{enc}})$.
- Recurrent Cores:
  - GRU: We use a standard GRU cell mapping the encoded observation and previous hidden state to the next state: $h_{t} = \text{GRU}(x_{t}, h_{t-1})$.
  - CT-RNN: The continuous-time variant employs leaky integration with a learnable time constant $\alpha = \sigma(\text{log\_alpha})$, where `log_alpha` is initialized to 0 (giving a default leak rate of 0.5). The update is defined as: $h_{t} = (1-\alpha)h_{t-1} + \alpha\tanh(W_{\text{in}}x_{t} + W_{\text{rec}}h_{t-1})$.
- Actor-Critic Heads: The actor outputs the mean of a Gaussian distribution. The log standard deviation ($\log\sigma$) is maintained as a *state-independent* learnable array initialized to zero.
- Code-Level Initialization: All linear and recurrent weights use orthogonal initialization. Hidden layers and the critic output use a scale of $1.0$, while the actor output layer uses a scale of $0.01$ to ensure the initial action distribution is tightly centered around zero, preventing extreme actions early in training. All biases are initialized to 0.
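The encoder and CT-RNN core described in the list above can be sketched in JAX as follows. This is a minimal single-trajectory illustration under assumptions, not the authors' code: the observation size, parameter names, and helper functions are hypothetical, and batching over environments (e.g. with `jax.vmap`) is omitted.

```python
import jax
import jax.numpy as jnp

OBS_DIM, HIDDEN = 5, 128  # OBS_DIM is hypothetical; 128 is the hidden size in Table 1


def init_params(key):
    """Orthogonal weights (scale 1.0), zero biases, log_alpha = 0."""
    k_enc, k_in, k_rec = jax.random.split(key, 3)
    ortho = jax.nn.initializers.orthogonal(scale=1.0)
    return {
        "W_enc": ortho(k_enc, (HIDDEN, OBS_DIM)),
        "b_enc": jnp.zeros(HIDDEN),
        "W_in": ortho(k_in, (HIDDEN, HIDDEN)),
        "W_rec": ortho(k_rec, (HIDDEN, HIDDEN)),
        "log_alpha": jnp.zeros(()),  # sigmoid(0) = 0.5 leak rate
    }


def ctrnn_step(params, h_prev, y_t):
    """One leaky-integration update on an (already normalized) observation."""
    x_t = jnp.tanh(params["W_enc"] @ y_t + params["b_enc"])
    alpha = jax.nn.sigmoid(params["log_alpha"])
    pre_act = params["W_in"] @ x_t + params["W_rec"] @ h_prev
    h_t = (1.0 - alpha) * h_prev + alpha * jnp.tanh(pre_act)
    return h_t, h_t  # (carry, per-step output) as expected by lax.scan


def unroll(params, h0, ys):
    """Unroll the cell over a (T, OBS_DIM) observation sequence."""
    return jax.lax.scan(lambda h, y: ctrnn_step(params, h, y), h0, ys)


key_p, key_y = jax.random.split(jax.random.PRNGKey(0))
params = init_params(key_p)
ys = jax.random.normal(key_y, (20, OBS_DIM))  # stand-in observation sequence
h_final, hs = unroll(params, jnp.zeros(HIDDEN), ys)
```

Writing the recurrence as a `jax.lax.scan` body keeps the whole unroll differentiable, which is what BPTT over the hidden state requires.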
### E.2 Algorithmic Components
Our optimization relies on PPO, integrating several standard mechanisms and code-level optimizations for stabilized training described in Huang et al. ([2022](https://arxiv.org/html/2605.05373#bib.bib69)):
- Observation and Reward Standardization: We maintain running statistics (mean and variance) for observations. Raw observations are normalized dynamically and strictly clipped to $[-10, 10]$ before being passed to the networks. Additionally, raw rewards (negative costs) are scaled dynamically by dividing them by the standard deviation of the running discounted returns, $r^{\text{scaled}}_{t} = r^{\text{raw}}_{t} / \sqrt{\operatorname{Var}(R) + \epsilon}$ with $\epsilon = 10^{-8}$, and are subsequently clipped to $[-10, 10]$.
- Advantage Estimation and Minibatch Normalization: Generalized Advantage Estimation (GAE) is computed backwards through time over the unrolled trajectories. The resulting advantages are normalized (subtracting the mean and dividing by the standard deviation) at the minibatch level during the PPO update epoch, rather than globally over the entire rollout buffer.
- Sequential Minibatches: Our implementation constructs minibatches by fetching sequential trajectory segments to maintain the temporal dependencies required for BPTT.
- Action Clipping: During environmental interaction, actions sampled from the actor's distribution $a_{t} \sim \mathcal{N}(\mu_{t}, \sigma^{2})$ are clipped to the valid environmental bounds $[u_{\min}, u_{\max}]$ before being applied to the physics step. The unclipped actions and corresponding log probabilities are stored for the PPO update.
- Clipped Surrogate Objective & Value Loss: The policy is updated using the standard clipped PPO objective with a clipping parameter $\epsilon_{\text{clip}}$. The value loss also employs clipping to constrain the value function update.
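Three of the components listed above (backward GAE, minibatch-level advantage normalization, and the clipped surrogate) can be sketched in JAX as follows. This is an illustrative sketch, not the authors' implementation: the function names, the `dones` convention (1.0 when the episode ends after step $t$), and the stabilizing constant in the normalization are assumptions; $\gamma$, $\lambda$, and $\epsilon_{\text{clip}}$ default to the values in Table 1.

```python
import jax
import jax.numpy as jnp


def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Backward GAE over one trajectory; all array inputs have shape (T,)."""
    next_values = jnp.concatenate([values[1:], jnp.array([last_value])])

    def step(next_adv, inputs):
        r, v, v_next, d = inputs
        nonterminal = 1.0 - d
        delta = r + gamma * v_next * nonterminal - v
        adv = delta + gamma * lam * nonterminal * next_adv
        return adv, adv

    # reverse=True runs from t = T-1 down to 0; outputs keep the input order.
    _, advantages = jax.lax.scan(
        step, jnp.zeros(()), (rewards, values, next_values, dones), reverse=True
    )
    return advantages, advantages + values  # advantages, value targets


def normalize_minibatch(adv, eps=1e-8):
    """Normalize advantages within a single minibatch (eps is an assumption)."""
    return (adv - adv.mean()) / (adv.std() + eps)


def clipped_surrogate(logp_new, logp_old, adv, clip_eps=0.2):
    """Standard PPO clipped policy loss (to be minimized)."""
    ratio = jnp.exp(logp_new - logp_old)
    clipped = jnp.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -jnp.minimum(ratio * adv, clipped * adv).mean()


# Toy trajectory: unit rewards, zero value estimates, no terminations.
rewards, values, dones = jnp.ones(4), jnp.zeros(4), jnp.zeros(4)
advantages, returns = compute_gae(rewards, values, 0.0, dones)
adv_norm = normalize_minibatch(advantages)
```

In a recurrent setting, each minibatch passed to `normalize_minibatch` would consist of contiguous trajectory segments, so that the hidden state can be unrolled through them in order.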
### E.3 Co-state loss implementation
Following Algorithm [1](https://arxiv.org/html/2605.05373#alg1), the target co-state $\lambda_{\text{target}}$ is computed and isolated from the computational graph via the `stop_gradient` operator. The PMP loss is computed as the cosine distance between the recurrent hidden state $h_{t}$ and the target co-state $\lambda_{\text{target}}$, numerically stabilized by $\epsilon = 10^{-8}$.
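A minimal JAX sketch of such a cosine-distance loss is given below. The function name is hypothetical, and two details are assumptions rather than facts from the text: that $\epsilon$ is added to the product of the norms in the denominator, and that $h_t$ and $\lambda_{\text{target}}$ share the same dimensionality. How $\lambda_{\text{target}}$ itself is obtained (Algorithm 1) is not reproduced here.

```python
import jax
import jax.numpy as jnp


def costate_loss(h, lam_target, eps=1e-8):
    """Mean cosine distance between hidden states and target co-states.

    h, lam_target: arrays of shape (batch, dim). Gradients flow only into h.
    """
    lam_target = jax.lax.stop_gradient(lam_target)  # detach the target
    dot = jnp.sum(h * lam_target, axis=-1)
    norms = jnp.linalg.norm(h, axis=-1) * jnp.linalg.norm(lam_target, axis=-1)
    cosine = dot / (norms + eps)  # assumption: eps stabilizes the denominator
    return jnp.mean(1.0 - cosine)


# Toy batch: aligned, opposed, and partially aligned pairs.
h = jnp.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
lam = jnp.array([[2.0, 0.0], [0.0, -1.0], [1.0, 0.0]])

loss = costate_loss(h, lam)
grad_h = jax.grad(costate_loss, argnums=0)(h, lam)
grad_lam = jax.grad(costate_loss, argnums=1)(h, lam)  # zero due to stop_gradient
```

Because cosine distance ignores magnitude, a loss of this form constrains only the direction of the hidden state relative to the target co-state, not its scale.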
### E.4 Hyperparameters
The hyperparameter configurations for our systems are detailed in Table [1](https://arxiv.org/html/2605.05373#A5.T1). These parameters are based on the standard setup described in Freeman et al. ([2021](https://arxiv.org/html/2605.05373#bib.bib65)) to ensure fair evaluation, specifically adapted for BPTT sequence lengths and continuous control demands. We explicitly use the Adam optimizer with an epsilon parameter of $10^{-5}$, which significantly stabilizes updates compared to standard defaults.
Table 1: PPO and Network Hyperparameters

| Category | Hyperparameter | Value |
| --- | --- | --- |
| Architecture | Hidden Layer Size | 128 |
| | Initialization Scale | 1.0 (General), 0.01 (Actor Out) |
| | Initialization Type | Orthogonal |
| Optimization | Optimizer | Adam (via Optax) |
| | Learning Rate | $2.5\times 10^{-4}$ (Annealed linearly) |
| | Epsilon (Adam) | $10^{-5}$ |
| | Max Gradient Norm | 0.5 |
| PPO Details | Rollout Batch Size | 32 |
| | Timesteps | 60m (100m for WalkerRun) |
| | Minibatches | 4 |
| | PPO Epochs | 4 |
| | Discount Factor ($\gamma$) | 0.99 |
| | GAE Parameter ($\lambda$) | 0.95 |
| | Clipping Epsilon ($\epsilon_{\text{clip}}$) | 0.2 |
| | Value Loss Coef ($c_{\text{vf}}$) | 0.5 |
| | Entropy Coef ($c_{\text{ent}}$) | 0.01 |
| | Co-state Coef ($c_{\text{costate}}$) | 0.05 (Ablated in experiments) |
### E.5 Hardware
Our experiments were conducted using NVIDIA RTX 2080 Ti and Quadro RTX 6000 GPUs, paired with Intel Xeon E5-2650 and AMD EPYC 7302 CPUs. Depending on the complexity of the environment, training a single seed for 60 million timesteps took between 1 hour (Cartpole) and 6 hours (Walker) at the batch size used.