Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations
Summary
This paper investigates memory-efficient meta-reinforcement learning architectures for adaptive safety-critical control in adversarial spacecraft proximity operations, finding that state space models like Mamba with PPO achieve superior task completion, safety, and fuel savings compared to LSTM and GRU.
# Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations
Source: [https://arxiv.org/html/2606.17414](https://arxiv.org/html/2606.17414)
Alejandro Posadas-Nava; Richard Linares (Associate Professor and Rockwell International Career Development Professor, AeroAstro, MIT, 77 Massachusetts Avenue, Cambridge, MA 02139-4307); Minduli Wijayatunga (Assistant Professor, Department of Aerospace Engineering, University of Illinois, Urbana-Champaign, 104 S Wright St, Urbana, IL 61801)
###### Abstract
Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a control method for nonlinear systems with actuation constraints by constructing a forward-invariant safe set. Previous work has shown that learning the class-$\mathcal{K}$ functions that define the ICCBF recursion via meta-reinforcement learning (meta-RL) yields a robust, non-greedy approach to safety-critical control in RPO. This paper extends that framework by investigating the performance of three recurrent network architectures (Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Selective State Space Model (Mamba)) and two training algorithms (Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC)) to identify the best setup for tuning ICCBF class-$\mathcal{K}$ functions via meta-RL. In addition to cooperative test cases, performance is evaluated in the presence of adversarial behavior, where the target spacecraft behaves in a way that worsens the safety of the chaser spacecraft. Results indicate that state space models such as Mamba, when used with PPO, achieve superior task completion, safety, and fuel savings compared to other architectures across all cooperative and uncooperative scenarios tested.
## 1 Introduction
Autonomous rendezvous and proximity operations \(RPO\) increasingly require a spacecraft to maneuver in close proximity to objects whose properties are uncertain and, in some settings, actively uncooperative\. This requires the spacecraft controls to meet safety guarantees under bounded thrust and significant model uncertainty, conserve propellant, and execute within the limited computational budget available on flight hardware\.
Classical guidance and control \(G&C\) architectures that separate trajectory planning from safety enforcement provide only weak guarantees under these coupled constraints\. These approaches are rooted in optimal control and mathematical programming\[[6](https://arxiv.org/html/2606.17414#bib.bib24)\]\. Indirect methods derived from Pontryagin’s minimum principle yield fuel\-optimal solutions with high precision, though their convergence is notoriously sensitive to initialization and problem scaling\[[27](https://arxiv.org/html/2606.17414#bib.bib18)\]\. Direct methods based on convex and successive convex programming trade some optimality for reliability and speed\[[18](https://arxiv.org/html/2606.17414#bib.bib25),[19](https://arxiv.org/html/2606.17414#bib.bib26),[5](https://arxiv.org/html/2606.17414#bib.bib20)\], and have matured into computationally tractable pipelines for trajectory design and guidance in debris removal and servicing missions\[[25](https://arxiv.org/html/2606.17414#bib.bib19)\], including end\-to\-end close\-range rendezvous frameworks validated on hardware testbeds\[[28](https://arxiv.org/html/2606.17414#bib.bib23)\]\. Receding\-horizon schemes such as model predictive control \(MPC\) further embed constraint handling within online re\-optimization\[[24](https://arxiv.org/html/2606.17414#bib.bib27)\]\. However, the guarantees these methods provide are certified only with respect to the assumed model\. Their optimality and constraint satisfaction can degrade under unmodeled dynamics, navigation error, and parameter uncertainty, and robustness must be recovered indirectly through conservative margins, disturbance bounding, or repeated replanning\. Moreover, these formulations offer no principled mechanism to infer hidden physical parameters or anticipate non\-cooperative target behavior from the observation history\.
In contrast, reinforcement learning \(RL\) offers a complementary set of strengths\. By training closed\-loop policies over distributions of dynamics, disturbances, and sensing conditions, RL produces controllers that adapt online to conditions unseen at design time while requiring only inexpensive forward inference onboard\. It has consequently been applied across the spacecraft G&C spectrum, from six\-degree\-of\-freedom planetary landing\[[16](https://arxiv.org/html/2606.17414#bib.bib28)\]and robust interplanetary trajectory design\[[31](https://arxiv.org/html/2606.17414#bib.bib29)\]to autonomous guidance for proximity operations\[[12](https://arxiv.org/html/2606.17414#bib.bib30)\]\. In the RPO context specifically, RL\-based guidance has been shown to maintain target observability, safety margins, and low fuel consumption under angles\-only navigation in far\-range rendezvous\[[26](https://arxiv.org/html/2606.17414#bib.bib21)\]\. Meta\-RL extends this further: by equipping the policy with memory, the agent implicitly performs system identification within an episode, adapting to hidden parameters and time\-varying environments without explicit estimators\[[13](https://arxiv.org/html/2606.17414#bib.bib31),[14](https://arxiv.org/html/2606.17414#bib.bib32)\]\. However, RL by itself provides no formal guarantees\. Constraints are typically encoded as reward penalties, so satisfaction is achieved only in expectation and only on the training distribution; a single constraint violation during a close\-approach maneuver can be mission\-ending\. 
This shortcoming has spurred safe\-RL mechanisms such as shielding\[[2](https://arxiv.org/html/2606.17414#bib.bib33)\]and run\-time assurance for spacecraft docking and inspection\[[11](https://arxiv.org/html/2606.17414#bib.bib34),[22](https://arxiv.org/html/2606.17414#bib.bib35)\], and more broadly motivates hybrid architectures in which a learned policy is paired with a certificate\-bearing safety mechanism that retains hard guarantees regardless of what the policy outputs\[[30](https://arxiv.org/html/2606.17414#bib.bib22)\]\.
Control Barrier Functions (CBFs) offer such a mechanism, enforcing forward invariance of a prescribed safe set through a real-time quadratic program (QP) that minimally alters a nominal command\[[3](https://arxiv.org/html/2606.17414#bib.bib11)\]. Input-Constrained CBFs (ICCBFs) extend this guarantee to systems with bounded actuation by recursively composing the barrier with a hierarchy of class-$\mathcal{K}$ functions to construct an input-admissible inner safe set\[[1](https://arxiv.org/html/2606.17414#bib.bib1)\]. While well suited to address thrust limits and position-only constraints in RPO, conventional ICCBFs use a fixed class-$\mathcal{K}$ function hierarchy. This renders the filter conservative and myopic, as it can shrink the feasible set unnecessarily, expend excess fuel, and behave sub-optimally near constraint boundaries.
Recent work has addressed this limitation by parameterizing and *learning* the class-$\mathcal{K}$ hierarchy. Alongside data-driven approaches that represent barrier and certificate functions with neural networks\[[10](https://arxiv.org/html/2606.17414#bib.bib36)\], a first line of work demonstrated that RL can tune non-greedy ICCBF parameterizations within a unified two-stage framework, recovering fuel efficiency without sacrificing the safety certificate\[[30](https://arxiv.org/html/2606.17414#bib.bib22)\]. This was subsequently generalized through meta-reinforcement learning (meta-RL), in which a recurrent policy is trained over distributions of hidden physical parameters and disturbances to shape the full inner safe set online\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\]. That study established that a learned, memory-based parameterization reduces conservatism and fuel consumption relative to fixed ICCBFs while preserving safety, with a recurrent Long Short-Term Memory (LSTM) policy proving especially effective in the more complex, partially observed inspection task. This paper extends that framework by systematically investigating the design space of the learned safety filter to identify the best setup for tuning the ICCBF class-$\mathcal{K}$ functions via meta-RL. It investigates:
1. Sequence-modeling architecture. Three recurrent network architectures (the LSTM used in prior work, the lighter Gated Recurrent Unit (GRU)\[[8](https://arxiv.org/html/2606.17414#bib.bib8)\], and the selective state-space model (Mamba)\[[9](https://arxiv.org/html/2606.17414#bib.bib7)\] with linear-time inference) are benchmarked against one another to determine which best balances safety, fuel efficiency, and task completion within onboard computational limits, demonstrating that this choice is a decisive rather than incidental design decision.
2. Training algorithm. On-policy Proximal Policy Optimization (PPO)\[[20](https://arxiv.org/html/2606.17414#bib.bib9)\] is compared against the off-policy, entropy-regularized Soft Actor-Critic (SAC)\[[17](https://arxiv.org/html/2606.17414#bib.bib10)\] under identical conditions to characterize the fuel–safety trade-off that each induces.
3. Adversarial robustness. In addition to cooperative test cases, adversarial docking and adversarial inspection scenarios are introduced, in which the target spacecraft deliberately maneuvers to worsen the safety of the chaser or deny sensor coverage, and the robustness of each architecture–algorithm configuration is assessed.

All combinations are validated through Monte Carlo studies on one-dimensional cruise control, two-dimensional docking, and three-dimensional inspection under distributions of hidden parameters and of state and thrust uncertainties. For docking and inspection, adversarial cases are also investigated.
## 2 Theoretical Background
ICCBFs are a mathematical framework for constructing input-admissible inner safe sets. They account for actuation limits by restricting the safe state space to a smaller inner safe set. This guarantees that for every state within this inner set, there exists a feasible control command that respects physical thrust constraints while keeping the system and its future states safe. While traditional ICCBFs have fixed class-$\mathcal{K}$ functions, the hierarchy of class-$\mathcal{K}$ functions can be learned using RL, as done by Wijayatunga et al.\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\].
The system dynamics are governed by the control-affine system

$$\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x}) + \mathbf{g}(\mathbf{x})\,\mathbf{u}, \tag{1}$$

where $\mathbf{f}$ and $\mathbf{g}$ are sufficiently smooth, $\mathbf{x} \in \mathcal{X} \subset \mathbb{R}^{n}$, and $\mathbf{u} \in \mathcal{U} \subset \mathbb{R}^{m}$. Here, $\mathbf{x}$ is the system state, $\mathbf{f}(\mathbf{x})$ the natural drift, $\mathbf{g}(\mathbf{x})$ the control effectiveness (the degree to which the control influences the evolution of the state), and $\mathbf{u}$ the control input.

Because the goal of ICCBFs is to keep the system safe, a safety function $h(\mathbf{x})$ is defined that maps the state to a safety score, and the effect of the environment dynamics and control inputs on that score is predicted over time. Lie derivatives describe the rate of change of a function along the system trajectory, enabling prediction of the change in the safety score under both the natural drift (when no control is applied) and the applied control. The Lie derivatives of $h(\mathbf{x})$ with respect to $\mathbf{f}(\mathbf{x})$ and $\mathbf{g}(\mathbf{x})$ are

$$L_{\mathbf{f}} h(\mathbf{x}) = \nabla h(\mathbf{x}) \cdot \mathbf{f}(\mathbf{x}), \qquad L_{\mathbf{g}} h(\mathbf{x}) = \nabla h(\mathbf{x}) \cdot \mathbf{g}(\mathbf{x}). \tag{2}$$

Using Eq. ([2](https://arxiv.org/html/2606.17414#S2.E2)), the rate of change of the safety function is

$$\dot{h}(\mathbf{x}) = L_{\mathbf{f}} h(\mathbf{x}) + L_{\mathbf{g}} h(\mathbf{x})\,\mathbf{u}. \tag{3}$$
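As a rough illustration of Eqs. (2) and (3), the sketch below evaluates the two Lie derivatives numerically for the cruise-control model used later in the paper. The constants `M`, `V0`, and `G0`, the finite-difference gradient, and all function names are assumptions made for this example only; they are not values or code from the paper.

```python
# Illustrative constants (assumed for this sketch, not taken from the text).
M, V0, G0 = 1650.0, 13.89, 9.81


def drag(v):
    """Resistive drag F(v) of the cruise-control benchmark."""
    return 0.1 + 5.0 * v + 0.25 * v * v


def f(x):
    """Natural drift of the state x = [d, v]."""
    d, v = x
    return [V0 - v, -drag(v) / M]


def g(x):
    """Control effectiveness for a scalar input."""
    return [0.0, G0]


def h(x):
    """Safety function h(x) = d - 1.8 v."""
    d, v = x
    return d - 1.8 * v


def grad(fun, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    out = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        out.append((fun(xp) - fun(xm)) / (2.0 * eps))
    return out


def lie(fun, field, x):
    """Lie derivative of `fun` along `field`: grad(fun)(x) . field(x)."""
    return sum(gi * fi for gi, fi in zip(grad(fun, x), field(x)))


def h_dot(x, u):
    """Rate of change of h under a scalar control u, as in Eq. (3)."""
    return lie(h, f, x) + lie(h, g, x) * u
```

For this linear $h$, the control term reduces to $L_{\mathbf{g}} h = -1.8\,g_0$, so a positive (accelerating) input always decreases the safety score, which matches the intuition that speeding up erodes the headway margin.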
### 2.1 Input-Constrained CBFs
The safe set, represented by a continuously differentiable function $h(\mathbf{x})$, does not inherently account for physical actuation limits. Under classical Control Barrier Function (CBF) theory, forward invariance of this safe set is guaranteed if the control input $\mathbf{u}$ satisfies

$$L_{\mathbf{f}} h(\mathbf{x}) + L_{\mathbf{g}} h(\mathbf{x})\,\mathbf{u} \geq -\alpha(h(\mathbf{x})), \tag{4}$$

where $\alpha$ is a class-$\mathcal{K}$ function. However, if a spacecraft lies on the boundary of the safe set while traveling with high tangential momentum, the control input required to satisfy this classical condition may exceed the physical thrust limits of the system. The dynamics would then prevent the system from remaining safe at subsequent timesteps. The ICCBF framework therefore strengthens standard CBF theory by constructing a stricter condition, $b_N(\mathbf{x})$, that prevents the system from approaching the boundary of the original safe set with excessive momentum. Beginning from the foundational safety rule $b_0(\mathbf{x}) = h(\mathbf{x})$, input awareness is introduced by computing a recursive sequence of functions that evaluates the current condition against the system dynamics and the input constraints $\mathcal{U}$, using a hierarchy of class-$\mathcal{K}$ functions $\{\alpha_i\}_{i=0}^{N-1}$:

$$b_{i+1}(\mathbf{x}) = \inf_{\mathbf{u} \in \mathcal{U}} \left[ L_{\mathbf{f}} b_i(\mathbf{x}) + L_{\mathbf{g}} b_i(\mathbf{x})\,\mathbf{u} + \alpha_i(b_i(\mathbf{x})) \right]. \tag{5}$$

The recursion is repeated, generating progressively stricter functions $b_1, b_2, \dots, b_N$, until it yields a subset that is intrinsically forward-invariant under the bounded inputs. This final, converged function defines the operational inner safe set, and forward invariance is enforced by any locally Lipschitz feedback satisfying

$$L_{\mathbf{f}} b_N(\mathbf{x}) + L_{\mathbf{g}} b_N(\mathbf{x})\,\mathbf{u} \geq -\alpha_N(b_N(\mathbf{x})). \tag{6}$$

It is this learned hierarchy $\{\alpha_i\}_{i=0}^{N}$ that the meta-RL policy parameterizes.
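To make the recursion in Eq. (5) concrete, the following sketch builds $b_1$ and $b_2$ for a scalar input bounded by $|u| \leq u_{\max}$, in which case the infimum of $L_{\mathbf{g}} b_i\,u$ is simply $-|L_{\mathbf{g}} b_i|\,u_{\max}$. It assumes linear class-$\mathcal{K}$ functions $\alpha_i(r) = \theta_i r$, the cruise-control dynamics used later in the paper, finite-difference gradients, and placeholder constants; none of these numerical choices come from the paper.

```python
# Illustrative constants (assumed for this sketch, not taken from the text).
M, V0, G0, U_MAX = 1650.0, 13.89, 9.81, 0.25


def f(x):
    """Natural drift of the cruise-control state x = [d, v]."""
    d, v = x
    drag = 0.1 + 5.0 * v + 0.25 * v * v
    return [V0 - v, -drag / M]


def g(x):
    """Control effectiveness for a scalar input."""
    return [0.0, G0]


def h(x):
    """Base safety function; the recursion starts from b_0 = h."""
    d, v = x
    return d - 1.8 * v


def grad(fun, x, eps=1e-4):
    """Central finite-difference gradient of a scalar function."""
    out = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        out.append((fun(xp) - fun(xm)) / (2.0 * eps))
    return out


def next_barrier(b, theta, u_max=U_MAX):
    """Return b_{i+1}(x) = inf_{|u| <= u_max} [L_f b + L_g b * u + theta * b]."""
    def b_next(x):
        db = grad(b, x)
        lf = sum(p * q for p, q in zip(db, f(x)))
        lg = sum(p * q for p, q in zip(db, g(x)))
        # For a scalar input in [-u_max, u_max], inf of lg * u is -|lg| * u_max.
        return lf - abs(lg) * u_max + theta * b(x)
    return b_next


# The theta values are arbitrary placeholders, not learned parameters.
b1 = next_barrier(h, theta=4.0)
b2 = next_barrier(b1, theta=7.0)
```

Each call wraps the previous barrier, so the inner safe set is the region where $b_0, b_1, \dots, b_N$ are all non-negative. Nested finite differences lose accuracy quickly, so a practical implementation would more likely use symbolic or automatic differentiation.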
### 2.2 Time-Sampled Execution
The system dynamics evolve in continuous time, as in Eq. ([1](https://arxiv.org/html/2606.17414#S2.E1)). Sensing and control updates, however, are typically executed at discrete instants on digital hardware. Intuitively, this is analogous to a person navigating a maze who can open their eyes only briefly once every few seconds: during the intervals when their eyes are closed, no guarantee can be made that they will not collide with a wall.

Let $t_k = kT$ denote the sampling instants, where $T > 0$ is the constant sampling time step. Under a time-sampled implementation with a zero-order hold (ZOH), the control input is updated only at $t_k$ and held constant between updates, i.e.,

$$\mathbf{u}(t) = \mathbf{u}_k, \quad t \in [t_k, t_{k+1}), \quad \mathbf{u}_k \in \mathcal{U}, \tag{7}$$

where $\mathbf{u}_k$ is computed from the sampled state $\mathbf{x}_k \triangleq \mathbf{x}(t_k)$. The resulting trajectory segment on the interval $[t_k, t_{k+1})$ is the solution to the continuous dynamics with the initial condition $\mathbf{x}(t_k) = \mathbf{x}_k$ and the constant input $\mathbf{u}_k$. The state update can then be written in integral form:

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \int_{0}^{T} \left( \mathbf{f}(\mathbf{x}(t_k + \tau)) + \mathbf{g}(\mathbf{x}(t_k + \tau))\,\mathbf{u}_k \right) d\tau. \tag{8}$$

Note that enforcing a continuous-time CBF exclusively at the sampling instants does not guarantee that the system will stay in the safe set during the inter-sample period $(t_k, t_{k+1})$ unless inter-sample effects are explicitly bounded or incorporated into the condition\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\]. In this work, a control margin is used to accomplish this.
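The integral update in Eq. (8) can be approximated numerically by holding the input constant and integrating over one sample period. The sketch below uses a fixed-step fourth-order Runge-Kutta scheme with a scalar input; the integrator choice, the `substeps` parameter, and the function name `rk4_zoh` are assumptions made for illustration, not the propagation method used in the paper.

```python
def rk4_zoh(f, g, x, u, T, substeps=10):
    """Propagate xdot = f(x) + g(x) * u over one sample period T with u held constant.

    Returns the list of intermediate states, ending with x(t_k + T).
    """
    def xdot(s):
        return [fi + gi * u for fi, gi in zip(f(s), g(s))]

    def shifted(s, k, c):
        return [si + c * ki for si, ki in zip(s, k)]

    dt = T / substeps
    traj = [list(x)]
    for _ in range(substeps):
        s = traj[-1]
        k1 = xdot(s)
        k2 = xdot(shifted(s, k1, dt / 2.0))
        k3 = xdot(shifted(s, k2, dt / 2.0))
        k4 = xdot(shifted(s, k3, dt))
        traj.append([
            si + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for si, a, b, c, d in zip(s, k1, k2, k3, k4)
        ])
    return traj
```

Returning the intermediate states makes it possible to evaluate the safety function between samples, for instance `min(h(s) for s in traj)`, which is one simple way to observe an inter-sample violation that a check at $t_k$ and $t_{k+1}$ alone would miss.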
### 2.3 Meta-Reinforcement Learning
Reinforcement learning (RL) treats sequential decision-making as an agent interacting with an environment and refining its behavior through trial and error\[[21](https://arxiv.org/html/2606.17414#bib.bib5)\]. Meta-RL generalizes this from learning a single policy for a single task $\mathcal{M}$ to learning an *adaptation mechanism* over a distribution of related tasks $\mathcal{M} \sim p(\boldsymbol{\mathcal{M}})$, where each $\mathcal{M}$ is drawn from a task family $\boldsymbol{\mathcal{M}}$\[[4](https://arxiv.org/html/2606.17414#bib.bib4)\]. Tasks within the family differ in parameters such as mass, thrust limits, or disturbance levels, so the behavior that is optimal for a given instance depends on properties that are not fully revealed by an instantaneous observation.

A policy $\pi_{\theta}$ can take the form of either a feed-forward mapping or a recurrent dynamical system. In the feed-forward case, a multilayer perceptron (MLP) maps the current observation directly to an action, $\boldsymbol{a}_k = \pi_{\theta}(\boldsymbol{o}_k)$, producing a memoryless decision rule. In the recurrent case, an internal hidden state augments the mapping, $\boldsymbol{a}_k = \pi_{\theta}(\boldsymbol{o}_k, \boldsymbol{s}_k)$ with $\boldsymbol{s}_{k+1} = \phi_{\theta}(\boldsymbol{s}_k, \boldsymbol{o}_k, \boldsymbol{a}_k)$, where $\boldsymbol{s}_k$ is the hidden state and $\phi_{\theta}$ is its update function, allowing the policy to accumulate information across timesteps. This memory enables within-episode inference of the hidden task parameters and the corresponding adaptation, which makes recurrent policies well suited to meta-RL\[[15](https://arxiv.org/html/2606.17414#bib.bib3)\].
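The difference between the memoryless and recurrent mappings can be seen in a toy example with scalar observations, actions, and hidden state. The weights and `tanh` nonlinearities below are arbitrary placeholders chosen for illustration; they are not the LSTM, GRU, or Mamba modules studied in the paper.

```python
import math


def feedforward_policy(o):
    """Memoryless rule a_k = pi(o_k)."""
    return math.tanh(0.5 * o)


def recurrent_policy(o, s):
    """Rule with memory a_k = pi(o_k, s_k)."""
    return math.tanh(0.5 * o + 0.8 * s)


def update_state(s, o, a):
    """Hidden-state update s_{k+1} = phi(s_k, o_k, a_k)."""
    return math.tanh(0.9 * s + 0.3 * o + 0.1 * a)


def rollout(observations, s0=0.0):
    """Apply the recurrent policy along an observation sequence."""
    s, actions = s0, []
    for o in observations:
        a = recurrent_policy(o, s)
        s = update_state(s, o, a)
        actions.append(a)
    return actions
```

Calling `rollout([1.0, 0.0])` and `rollout([-1.0, 0.0])` gives second actions of opposite sign even though the second observation is identical, whereas `feedforward_policy(0.0)` returns the same value regardless of history. This dependence on history is what lets a recurrent policy act on inferred hidden parameters.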
## 3 Methodology
Given the system in Eq. ([1](https://arxiv.org/html/2606.17414#S2.E1)), the goal of this work is to extend the study of the meta-RL ICCBF framework of Wijayatunga et al.\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\]. This framework consists of three parts:
1. The ICCBF recursion, which creates a smaller subset of the safe set.
2. A time-sampled margin, which maintains forward invariance despite discrete control updates.
3. An agent able to output dynamic, learned class-$\mathcal{K}$ parameters $\theta_{i,k}$, where $\alpha_i(s_k) = \theta_{i,k}$.
A single convex quadratic program \(QP\) computes the control at each time step\. This study considers agents that combine two RL algorithms \(PPO and SAC\) with three recurrent network architectures \(LSTM, GRU, and Mamba\), with performance measured through fuel savings and the safety score\.
### 3.1 Quadratic Program Formulation
At runtime, the safe set is enforced by solving a convex QP at each sampling instant $k$. The QP modifies the nominal control command to ensure constraint satisfaction while respecting the actuation limits $\mathbf{u}_k \in \mathcal{U}$. When the controller's only active objective is to maintain safety, the baseline QP minimizes the total control effort $\|\mathbf{u}_k\|^{2}$ subject to the time-sampled ICCBF constraint\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\], which incorporates both the learned terminal class-$\mathcal{K}$ gain $\theta_{N,k}$ and the DA-computed inter-sample margin $\hat{\nu}_k(T, \mathbf{x}_k)$, defined below.

Enforcing a continuous-time ICCBF only at the sampling instants does not guarantee safety during the inter-sample interval $(t_k, t_{k+1})$ under a zero-order hold\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\]. To recover strict forward invariance for all $t \geq 0$, a control margin $\hat{\nu}_k(T, \mathbf{x}_k)$ is added to the discrete ICCBF inequality, bounding the worst-case evolution of the barrier over one sample step. Computing this margin exactly requires the local Lipschitz constants of the barrier terms and the maximum bounds of the dynamics, which in turn demand numerically intensive grid-search maximization at every step\[[7](https://arxiv.org/html/2606.17414#bib.bib17)\]. Following the formulation of Wijayatunga et al.\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\], Differential Algebra (DA) is instead used to compute conservative interval enclosures of the required Lipschitz constants and dynamics bounds over a local hyper-rectangle around $\mathbf{x}_k$, yielding an efficiently computed upper bound $\hat{\nu}_k(T, \mathbf{x}_k)$. Replacing the exact margin with this bound allows the time-sampled safety filter to be resolved with a single lightweight convex QP per control cycle. The complete derivation of this margin is given by Wijayatunga et al.\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\].

For goal-directed tasks, a relaxed CLF constraint is added to the QP to encode convergence to the target without compromising safety. The resulting formulation is

$$
\begin{aligned}
\mathbf{u}_k^{*} = \arg\min_{\mathbf{u}_k \in \mathbb{R}^{m},\ \epsilon_k \geq 0} \quad & \tfrac{1}{2}\|\mathbf{u}_k\|^{2} + p\,\epsilon_k^{2} \\
\text{s.t.} \quad & L_{\mathbf{f}} b_N(\mathbf{x}_k) + L_{\mathbf{g}} b_N(\mathbf{x}_k)\,\mathbf{u}_k + \theta_{N,k}\, b_N(\mathbf{x}_k) \geq \hat{\nu}_k(T, \mathbf{x}_k), \\
& L_{\mathbf{f}} V(\mathbf{x}_k) + L_{\mathbf{g}} V(\mathbf{x}_k)\,\mathbf{u}_k \leq -c_{V,k} V(\mathbf{x}_k) + \epsilon_k, \\
& \mathbf{u}_k \in \mathcal{U},
\end{aligned}
\tag{9}
$$

where $p > 0$ weights the CLF slack variable $\epsilon_k$. Because the terminal ICCBF condition is affine in the control input, the optimization is a strictly convex QP, ensuring the computational efficiency required for high-rate flight execution\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\].
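For intuition, the safety-only version of this QP (no CLF constraint or slack) with a scalar input can be solved in closed form, because a single affine inequality plus a box constraint leaves an interval of admissible inputs and the minimum-effort choice is the point of that interval closest to zero. The sketch below implements only that special case; the function name and argument names are hypothetical, and the multi-input problem in Eq. (9) would need a general QP solver.

```python
def safety_qp_scalar(lf_b, lg_b, theta, b_val, margin, u_max):
    """Minimize 0.5 * u**2 subject to
         lf_b + lg_b * u + theta * b_val >= margin   and   |u| <= u_max.

    Returns the optimal scalar input, or None if the constraints are infeasible.
    """
    lo, hi = -u_max, u_max
    resid = margin - lf_b - theta * b_val  # constraint becomes lg_b * u >= resid
    if lg_b > 0.0:
        lo = max(lo, resid / lg_b)
    elif lg_b < 0.0:
        hi = min(hi, resid / lg_b)
    elif resid > 0.0:
        return None  # the input cannot influence the barrier and it is violated
    if lo > hi:
        return None  # the thrust limit is too tight to satisfy the barrier condition
    # Project the unconstrained minimizer u = 0 onto the admissible interval.
    return min(max(0.0, lo), hi)
```

A `None` return corresponds to the infeasible-QP case that the reward function later penalizes through the failure flag.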
### 3.2 Meta-RL Formulation
The parameter-tuning problem is posed as an episodic, discrete-time control task. At each sampling instant $t_k$, the agent observes the system state and outputs the ICCBF class-$\mathcal{K}$ parameters $\boldsymbol{\theta}_k$ (and, where a CLF is present, the gain $c_{V,k}$), which together shape the geometry of the inner safe set enforced by the QP.
#### 3.2.1 Network Architecture and Sequence Modeling
Because the optimal class-$\mathcal{K}$ parameters depend on physical properties that are not observable from a single state, the policy must integrate the trajectory history to infer the hidden dynamics. Both the actor and critic therefore use separate feature extractors followed by an independent recurrent sequence module. To assess how the choice of temporal-inference mechanism affects performance, independent agents are trained with three sequence-modeling network architectures: an LSTM, a GRU, and a Mamba (selective state-space) module. For each scenario, one policy is trained per backbone with each of the two algorithms (PPO and SAC), giving six training runs per scenario. The PPO and SAC update procedures are detailed in Appendix A.
#### 3.2.2 Domain Randomization and Uncertainty
To train a policy that generalizes across the task family, each episode randomizes the initial state, the hidden physical parameters, and the sensing and actuation noise\. These distributions are detailed below\.
#### 3.2.3 Initial State and Hidden Parameter Distributions
Each training episode begins by drawing an initial condition together with a set of environmental parameters, such as the deputy mass, the maximum thrust $u_{\max}$, and the keep-in/keep-out zone sizes. The meta-RL agent is thereby exposed to a distribution of task instances rather than a single nominal scenario, which improves robustness and limits overfitting. The initial state $\mathbf{x}_0$ is drawn from a bounded admissible set $\mathcal{X}_0 \subseteq \mathcal{S}$, and a vector of hidden parameters $\boldsymbol{p} \in \mathbb{R}^{n_p}$ is sampled from a hyper-rectangle centered on the nominal values $\bar{\boldsymbol{p}}$. Each component is sampled independently as

$$p_i \sim \mathrm{Uniform}\!\left( \bigl[ (1 - \delta_i)\,\bar{p}_i,\ (1 + \delta_i)\,\bar{p}_i \bigr] \right), \quad i = 1, \dots, n_p, \tag{10}$$

where $\bar{p}_i$ is the nominal value of the $i$th parameter and $\delta_i > 0$ sets its relative variation. The per-case values of $\delta_i$ are listed in Table [1](https://arxiv.org/html/2606.17414#S3.T1). Within each Monte Carlo dataset, the parameter samples are held fixed across all controllers, so that performance differences reflect only the ICCBF tuning mechanism.
Table 1: Hidden parameter variations per episode (nominal value with relative deviation $\delta_i$). Adapted from Ref.\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\].
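The sampling rule in Eq. (10) can be sketched as follows. The parameter names, nominal values, and relative variations below are hypothetical placeholders rather than the entries of Table 1, and the per-sample seeding is just one possible way to keep a Monte Carlo dataset fixed across controllers.

```python
import random


def sample_hidden_params(nominal, delta, rng):
    """Draw each parameter uniformly from [(1 - d) * p_bar, (1 + d) * p_bar]."""
    return {
        name: rng.uniform((1.0 - delta[name]) * p_bar, (1.0 + delta[name]) * p_bar)
        for name, p_bar in nominal.items()
    }


# Hypothetical nominal values and relative variations, for illustration only.
nominal = {"mass": 100.0, "u_max": 0.5}
delta = {"mass": 0.2, "u_max": 0.1}

# Seeding each draw by its sample index reproduces the same task instances,
# so every controller can be evaluated on an identical dataset.
dataset_a = [sample_hidden_params(nominal, delta, random.Random(i)) for i in range(3)]
dataset_b = [sample_hidden_params(nominal, delta, random.Random(i)) for i in range(3)]
```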
#### 3.2.4 State Errors
Sensing errors are modeled as additive zero-mean Gaussian noise on the true state. At each step $k$, the state presented to the agent is

$$\boldsymbol{x}_k^{E} = \boldsymbol{x}_k + \boldsymbol{\epsilon}_{\boldsymbol{x}_k}, \qquad \boldsymbol{\epsilon}_{\boldsymbol{x}_k} \sim \mathcal{N}\!\left( \boldsymbol{0},\ \boldsymbol{\sigma}_x^{2} \boldsymbol{I} \right), \tag{11}$$

where $\boldsymbol{\sigma}_x = [\boldsymbol{\sigma}_r, \boldsymbol{\sigma}_v]^{\top}$ collects the position- and velocity-error standard deviations listed in Table [2](https://arxiv.org/html/2606.17414#S3.T2).
#### 3.2.5 Thrust Errors
Actuation errors are introduced by perturbing the magnitude and direction of the commanded control $\boldsymbol{u}_k^{\star}$ before propagation. With $\boldsymbol{u}_k^{\star} = [u_x, u_y, u_z]^{\top}$ expressed in the control frame, the magnitude $u_k = \|\boldsymbol{u}_k^{\star}\|_2$ and the out-of-plane and in-plane angles $(\beta, \gamma)$ are perturbed independently, and the executed command is reconstructed and saturated to $u_{\max}$:

$$u_k = \|\boldsymbol{u}_k^{\star}\|_2, \qquad \beta = \sin^{-1}\!\left( \frac{u_z}{u_k} \right), \qquad \gamma = \tan^{-1}\!\left( \frac{u_x}{u_y} \right), \tag{12}$$

$$u_k^{E} = u_k \left( 1 + \delta_u \right), \qquad \beta^{E} = \beta + \delta_{\beta}, \qquad \gamma^{E} = \gamma + \delta_{\gamma}, \tag{13}$$

$$\boldsymbol{u}_k = u_k^{E} \begin{bmatrix} \cos\beta^{E}\,\sin\gamma^{E} \\ \cos\beta^{E}\,\cos\gamma^{E} \\ \sin\beta^{E} \end{bmatrix}, \qquad \boldsymbol{u}_k \leftarrow \frac{u_{\max}}{\max(u_{\max}, \|\boldsymbol{u}_k\|_2)}\,\boldsymbol{u}_k, \tag{14}$$

where $\delta_u \sim \mathcal{N}(0, \sigma_u^{2})$, $\delta_{\beta} \sim \mathcal{N}(0, \sigma_{\beta}^{2})$, and $\delta_{\gamma} \sim \mathcal{N}(0, \sigma_{\gamma}^{2})$. The corresponding standard deviations are given in Table [2](https://arxiv.org/html/2606.17414#S3.T2).

Table 2: Standard deviations of the thrust and state uncertainties applied.
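The thrust-error model of Eqs. (12) to (14) can be sketched in a few lines. Two choices here are assumptions of this example rather than statements from the paper: `atan2(u_x, u_y)` is used as a quadrant-aware form of $\tan^{-1}(u_x/u_y)$, and a zero command is returned unchanged because its direction angles are undefined.

```python
import math


def perturb_thrust(u_cmd, sig_u, sig_beta, sig_gamma, u_max, rng):
    """Perturb magnitude and direction of a 3D command, then saturate to u_max.

    `rng` is any object with a gauss(mu, sigma) method, e.g. random.Random.
    """
    ux, uy, uz = u_cmd
    mag = math.sqrt(ux * ux + uy * uy + uz * uz)
    if mag == 0.0:
        return (0.0, 0.0, 0.0)  # assumption: no thrust commanded, no error applied

    # Eq. (12): magnitude and direction angles (ratio clamped against round-off).
    beta = math.asin(max(-1.0, min(1.0, uz / mag)))
    gamma = math.atan2(ux, uy)

    # Eq. (13): independent Gaussian perturbations.
    mag_e = mag * (1.0 + rng.gauss(0.0, sig_u))
    beta_e = beta + rng.gauss(0.0, sig_beta)
    gamma_e = gamma + rng.gauss(0.0, sig_gamma)

    # Eq. (14): reconstruct the executed command and saturate its norm.
    u = (
        mag_e * math.cos(beta_e) * math.sin(gamma_e),
        mag_e * math.cos(beta_e) * math.cos(gamma_e),
        mag_e * math.sin(beta_e),
    )
    norm = math.sqrt(sum(c * c for c in u))
    scale = u_max / max(u_max, norm)
    return tuple(scale * c for c in u)
```

With all standard deviations set to zero and a command inside the thrust limit, the function returns the command unchanged up to round-off, which is a quick consistency check on the angle convention.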
#### 3.2.6 Observation
At each step, the environment supplies an observation $\boldsymbol{S}_k$ carrying enough information to estimate the value of a candidate action. For numerical stability and consistent feature scaling, the observation is min–max normalized to the interval $[-1, 1]$:

$$\boldsymbol{S}_k^{*} = 2 \left( \frac{\boldsymbol{S}_k - \boldsymbol{S}_{\min}}{\boldsymbol{S}_{\max} - \boldsymbol{S}_{\min}} \right) - 1, \tag{15}$$

where $\boldsymbol{S}_{\min}$ and $\boldsymbol{S}_{\max}$ bound the components of $\boldsymbol{S}$. For the cruise-control and docking cases, the observation is the noisy state, $\boldsymbol{S}_k = \boldsymbol{x}_k^{E}$. For inspection it is augmented to $\boldsymbol{S}_k = [\boldsymbol{x}_k^{E},\ \theta_S,\ n_{\text{insp}},\ \hat{\boldsymbol{d}}]^{\top}$, where $\theta_S$ is the Sun angle, $n_{\text{insp}}$ the number of inspected points, and $\hat{\boldsymbol{d}}$ a unit vector toward the largest cluster of remaining uninspected points, obtained by K-means clustering\[[23](https://arxiv.org/html/2606.17414#bib.bib6)\].
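Equation (15) is applied component-wise, as in the short sketch below. The optional `clip` flag is an assumption added for illustration, since a noisy state can fall slightly outside the nominal bounds; the paper does not state whether clipping is applied.

```python
def normalize(obs, lo, hi, clip=False):
    """Min-max normalize each observation component to [-1, 1] as in Eq. (15)."""
    out = [2.0 * (s - a) / (b - a) - 1.0 for s, a, b in zip(obs, lo, hi)]
    if clip:
        # Assumed option: keep out-of-range (e.g. noisy) components inside [-1, 1].
        out = [max(-1.0, min(1.0, v)) for v in out]
    return out
```

For example, `normalize([0.0, 5.0, 10.0], [0.0, 0.0, 0.0], [10.0, 10.0, 10.0])` maps the lower bound, midpoint, and upper bound to `-1.0`, `0.0`, and `1.0` respectively.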
#### 3\.2\.7Reward Function
The agent’s objective is to minimize control effort (fuel) while strictly avoiding task failure and maintaining safety\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\]. The per-step reward penalizes the magnitude of the commanded nominal control $\|\mathbf{u}_k^{*}\|_2$, promoting fuel-efficient trajectories, together with a failure flag and a safety-violation term:

$$r_k = -w_u \|\mathbf{u}_k^{*}\|_2 - w_{\text{fail}} P_{\text{fail}} - w_h \max\left(0,\, -h_{k+1}\right), \tag{16}$$

where $w_u$, $w_{\text{fail}}$, and $w_h$ are positive weights, $h_{k+1}$ is the safety value at the next state, and the failure flag terminates the episode when the QP solver becomes infeasible or a safety constraint is violated,

$$P_{\text{fail}} = \begin{cases} 1, & \text{QP infeasible or solver failure}, \\ 0, & \text{otherwise}. \end{cases} \tag{17}$$

For tasks that use a CLF to encode goal-reaching, a terminal penalty is added at the episode horizon $t_f$ to penalize failure to converge to the target:

$$R_k = \begin{cases} r_k, & t_k < t_f, \\ r_k - w_V P_V, & t_k \geq t_f, \end{cases} \tag{18}$$

where $w_V > 0$ and the terminal penalty activates only if the minimum CLF value over the episode, $\min_j V(\mathbf{x}_j)$, fails to fall below a convergence threshold $\rho_V$:

$$P_V = \begin{cases} \min_j V(\mathbf{x}_j), & \min_j V(\mathbf{x}_j) > \rho_V, \\ 0, & \text{otherwise}. \end{cases} \tag{19}$$
For the inspection scenario, which has no CLF, the per\-step reward in Eq\. \([16](https://arxiv.org/html/2606.17414#S3.E16)\) is augmented with a positive reward proportional to the number of newly inspected surface points, encouraging coverage alongside fuel economy\.
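As a minimal sketch of how Eqs. (16) to (19) combine, the following Python functions evaluate the per-step reward, the terminal CLF penalty, and the total reward. The weight values, the `RewardWeights` container, and all function names are illustrative assumptions, not the values or code used in the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Placeholder weights (assumed for illustration; the paper does not list them here).
    w_u: float = 1.0      # weight on nominal control magnitude
    w_fail: float = 100.0  # weight on the failure flag P_fail
    w_h: float = 10.0     # weight on the safety-violation term
    w_V: float = 50.0     # weight on the terminal CLF penalty


def step_reward(u_nominal, h_next, qp_failed, w):
    """Per-step reward r_k of Eq. (16)."""
    u_norm = math.sqrt(sum(c * c for c in u_nominal))  # ||u_k^*||_2
    p_fail = 1.0 if qp_failed else 0.0                 # Eq. (17)
    return -w.w_u * u_norm - w.w_fail * p_fail - w.w_h * max(0.0, -h_next)


def terminal_penalty(clf_values, rho_V):
    """P_V of Eq. (19): the minimum CLF value if it never fell below rho_V, else 0."""
    v_min = min(clf_values)
    return v_min if v_min > rho_V else 0.0


def total_reward(r_k, t_k, t_f, clf_values, rho_V, w):
    """R_k of Eq. (18): the terminal penalty is applied only once t_k >= t_f."""
    if t_k < t_f:
        return r_k
    return r_k - w.w_V * terminal_penalty(clf_values, rho_V)
```

Because the penalty in Eq. (19) uses the minimum CLF value over the whole episode, a trajectory that briefly reaches the convergence threshold is not penalized even if it later drifts away.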
## 4Results
### 4\.1Test Cases
#### 4\.1\.1Cooperative Scenarios
The three cooperative test cases, a one-dimensional cruise-control problem, a two-dimensional docking problem, and a three-dimensional inspection problem, are adopted directly from the meta-RL ICCBF study of Wijayatunga et al.\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\], which in turn draws the cruise-control and docking problems from\[[1](https://arxiv.org/html/2606.17414#bib.bib1)\] and the inspection problem from\[[23](https://arxiv.org/html/2606.17414#bib.bib6)\]. They are summarized below for completeness; the full dynamics, safety constraints, and CLFs are given in\[[29](https://arxiv.org/html/2606.17414#bib.bib16)\].
- •Cruise Control. A didactic, non-RPO benchmark in which a deputy follows a lead vehicle moving at a constant speed $v_0$. With state $\mathbf{x} = [d,\ v]^{\top}$ comprising the headway distance $d$ and the deputy velocity $v$, the dynamics are

  $$\begin{bmatrix} \dot{d} \\ \dot{v} \end{bmatrix} = \begin{bmatrix} v_0 - v \\ -\tfrac{F(v)}{m} \end{bmatrix} + \begin{bmatrix} 0 \\ g_0 \end{bmatrix} u, \qquad \mathcal{U} = \{u : |u| \leq u_{\max}\}, \tag{20}$$

  where $F(v) = 0.1 + 5v + 0.25v^{2}$ is a resistive drag, $m$ is the deputy mass, and $g_0$ is a control-scaling constant. Collision avoidance defines the safe set through the CBF

  $$h(\mathbf{x}) = d - 1.8\,v \geq 0, \tag{21}$$

  and a CLF drives the deputy toward the speed limit $v_{\max}$:

  $$V(\mathbf{x}) = (v - v_{\max})^{2}. \tag{22}$$
- •Docking with a Cooperative Target. A planar rendezvous in which a deputy must reach the rotating docking port of a chief while remaining within a line-of-sight cone (Figure [1](https://arxiv.org/html/2606.17414#S4.F1)). The state is $\mathbf{x} = [p_x,\ p_y,\ v_x,\ v_y,\ \phi]^{\top}$, where $(p_x, p_y)$ and $(v_x, v_y)$ are the chaser’s relative position and velocity and $\phi$ is the docking-port orientation, evolving as $\dot{\phi} = \omega$. The relative translational motion follows the Clohessy–Wiltshire (CW) dynamics with thrust acceleration as the control input,

  $$\begin{aligned} \dot{p}_x &= v_x, \qquad \dot{p}_y = v_y, \qquad \dot{\phi} = \omega, \\ \dot{v}_x &= 3n^{2} p_x + 2n v_y + \tfrac{u_x}{m}, \qquad \dot{v}_y = -2n v_x + \tfrac{u_y}{m}, \end{aligned} \tag{23}$$

  where $n$ is the chief’s mean motion and $\|\mathbf{u}\| \leq u_{\max}$. The line-of-sight requirement defines the CBF

  $$h(\mathbf{x}) = \frac{\bar{\mathbf{r}}_{cp} \cdot \hat{\mathbf{e}}}{\|\bar{\mathbf{r}}_{cp}\|} - \cos\gamma \geq 0, \tag{24}$$

  with cone half-angle $\gamma$, cone axis $\hat{\mathbf{e}} = [\cos\phi,\ \sin\phi]^{\top}$, port offset $\rho$, and $\bar{\mathbf{r}}_{cp} = [p_x - \rho\cos\phi,\ p_y - \rho\sin\phi]^{\top}$ the vector from the port to the chaser. Convergence to the port is encoded by the CLF

  $$V(\mathbf{x}) = \left(v_x + \frac{p_x - \rho\cos\phi}{10}\right)^{2} + \left(v_y + \frac{p_y - \rho\sin\phi}{10}\right)^{2}. \tag{25}$$

  Figure 1: Spacecraft docking problem.
- •Inspecting a Cooperative Target. A three-dimensional problem in which a deputy circumnavigates a spherical chief to inspect discretized points on its surface (Figure [2](https://arxiv.org/html/2606.17414#S4.F2)). With state $\mathbf{x} = [\mathbf{r}^{\top},\ \mathbf{v}^{\top}]^{\top} \in \mathbb{R}^{6}$, where $\mathbf{r} = [x,\ y,\ z]^{\top}$ and $\mathbf{v}$ are the relative position and velocity in the chief-centered LVLH frame, the motion follows the CW equations

  $$\ddot{x} = 3n^{2}x + 2n\dot{y} + \tfrac{u_x}{m}, \qquad \ddot{y} = -2n\dot{x} + \tfrac{u_y}{m}, \qquad \ddot{z} = -n^{2}z + \tfrac{u_z}{m}, \tag{26}$$

  with $\|\mathbf{u}\|_{\infty} \leq u_{\max}$. Three constraints define the safe set, namely a keep-out zone (KOZ), a keep-in zone (KIZ), and a sensor Sun-exclusion condition:

  $$h_1(\mathbf{x}) = \|\mathbf{r}\| - (R_C + R_D) \geq 0, \tag{27}$$

  $$h_2(\mathbf{x}) = R_{\max} - \|\mathbf{r}\| \geq 0, \tag{28}$$

  $$h_3(\mathbf{x}) = \theta_b - \alpha_{\mathrm{FOV}}/2 \geq 0, \tag{29}$$

  where $R_C$ and $R_D$ are the chief and deputy radii, $R_{\max}$ is the KIZ radius, $\theta_b$ is the sensor boresight–Sun angle, and $\alpha_{\mathrm{FOV}}$ is the sensor field of view. Because surface coverage cannot be expressed as a CLF, a learned nominal control policy replaces the CLF term and the ICCBF acts as a supervisory safety filter.
Figure 2: Spacecraft inspection problem. $r_s$ and $r_b$ denote the vector that points from the Sun to the target and the vector that points from the chaser to the target, respectively. $\alpha_{FOV}$ denotes the field of view of the chaser, and $\theta_b$ is the angle between the chaser’s sensor boresight and the Sun.
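To make the docking formulation concrete, the sketch below evaluates the planar CW right-hand side of Eq. (23), the line-of-sight CBF of Eq. (24), and the CLF of Eq. (25) in plain Python. The function names and the list-based state layout `[px, py, vx, vy, phi]` are assumptions made for illustration; the numerical values of $n$, $m$, $\rho$, and $\gamma$ are left to the caller because they are not restated in this section.

```python
import math


def cw_planar_derivative(state, u, n, m, omega):
    """Right-hand side of Eq. (23) for state = [px, py, vx, vy, phi]."""
    px, py, vx, vy, _phi = state
    ux, uy = u
    return [
        vx,                                        # d(px)/dt
        vy,                                        # d(py)/dt
        3.0 * n**2 * px + 2.0 * n * vy + ux / m,   # d(vx)/dt
        -2.0 * n * vx + uy / m,                    # d(vy)/dt
        omega,                                     # d(phi)/dt
    ]


def los_cbf(state, rho, gamma):
    """Line-of-sight CBF of Eq. (24); non-negative inside the cone.

    Undefined when the chaser sits exactly on the port (zero-length vector).
    """
    px, py, _vx, _vy, phi = state
    ex, ey = math.cos(phi), math.sin(phi)      # cone axis e_hat
    rx, ry = px - rho * ex, py - rho * ey      # port-to-chaser vector
    return (rx * ex + ry * ey) / math.hypot(rx, ry) - math.cos(gamma)


def docking_clf(state, rho):
    """CLF of Eq. (25)."""
    px, py, vx, vy, phi = state
    ex_term = vx + (px - rho * math.cos(phi)) / 10.0
    ey_term = vy + (py - rho * math.sin(phi)) / 10.0
    return ex_term**2 + ey_term**2
```

A chaser directly on the cone axis gives the largest possible barrier value, $1 - \cos\gamma$, and the CLF vanishes when the velocity equals minus one tenth of the port-relative position.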
#### 4\.1\.2Adversarial Scenarios
In addition to cooperative docking and inspection scenarios, in which the target remains passive, this section explores the same test cases under adversarial target behavior\.
- •Docking with an Adversarial Target. The target acts as an adversary by choosing $\omega_{k+1} = \dot{\phi}_{k+1}$ to minimize $h(\bm{x})$. After the $k$th step, it evaluates $h$ under a small perturbation $\Delta\omega$ to the current rotation rate:

  $$h_0^{+} = h_0\left(p_{x,k+1},\, p_{y,k+1},\, v_{x,k+1},\, v_{y,k+1},\, \phi_{k+1} + (\omega_k + \Delta\omega)T\right), \tag{30}$$

  $$h_0^{-} = h_0\left(p_{x,k+1},\, p_{y,k+1},\, v_{x,k+1},\, v_{y,k+1},\, \phi_{k+1} + (\omega_k - \Delta\omega)T\right), \tag{31}$$

  where $T$ is the simulation timestep. The target then updates its rotation rate in the direction that reduces safety:

  $$\omega_{k+1} = \begin{cases} \omega_k - \Delta\omega_{\max} & \text{if } h_0^{-} < h_0^{+}, \\ \omega_k + \Delta\omega_{\max} & \text{if } h_0^{-} \geq h_0^{+}. \end{cases} \tag{32}$$

  It then clips the result to the admissible bounds:

  $$\omega_{k+1} = \text{clip}\left(\omega_{k+1},\, \omega_{\min},\, \omega_{\max}\right). \tag{33}$$

  We use $[\omega_{\min}, \omega_{\max}] = [0,\ 0.7\ \text{deg/s}]$ and $\Delta\omega = \Delta\omega_{\max} = 0.02\ \text{deg/s}$. The chaser’s ICCBF controller receives only a stale estimate of $\omega_k$ that lags by one time-step.
- •Inspecting an Adversarial Target. The chief applies a small impulsive maneuver at each step that increases the relative separation, modeled as a disturbance on the deputy’s discrete CW dynamics directed radially outward:

  $$\mathbf{x}_{k+1} = \Phi(T)\mathbf{x}_k + \Gamma(T)\mathbf{u}_{\text{dep},k} - \Gamma(T)\mathbf{d}_k, \qquad \mathbf{d}_k = -\Delta v_k\, \frac{\mathbf{r}_k}{\|\mathbf{r}_k\|}, \tag{34}$$

  where $\Phi(T)$ and $\Gamma(T)$ are the CW state-transition and input matrices, $\mathbf{u}_{\text{dep},k}$ is the deputy control, and $\mathbf{r}_k = [p_x, p_y, p_z]^{\top}$ is the deputy position relative to the chief, so that $\mathbf{d}_k$ pushes the deputy away from the target. At episode reset, a total maneuver budget $\Delta v_{\text{total}} \sim \mathcal{U}(0.5,\ 5.0)~\text{m/s}$ is sampled and spread evenly across the $N$-step horizon, $\Delta v_k = \Delta v_{\text{total}} / N$.
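The adversarial rotation-rate rule of Eqs. (30) to (33) can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the function names are hypothetical, the barrier is passed in as a callable, and the conversion from deg/s to radians before perturbing $\phi$ is an assumption about units.

```python
import math


def los_cbf(state, rho, gamma):
    """Line-of-sight CBF of Eq. (24), used here as the barrier h_0."""
    px, py, _vx, _vy, phi = state
    ex, ey = math.cos(phi), math.sin(phi)
    rx, ry = px - rho * ex, py - rho * ey
    return (rx * ex + ry * ey) / math.hypot(rx, ry) - math.cos(gamma)


def adversarial_omega_update(h_fn, next_state, omega_k, T,
                             d_omega=0.02, d_omega_max=0.02,
                             omega_min=0.0, omega_max=0.7):
    """One step of the adversary: probe h under +/- d_omega, move toward lower h.

    Rates are in deg/s (as in the text); phi is assumed to be in radians.
    """
    px, py, vx, vy, phi = next_state
    phi_plus = phi + math.radians(omega_k + d_omega) * T    # argument of Eq. (30)
    phi_minus = phi + math.radians(omega_k - d_omega) * T   # argument of Eq. (31)
    h_plus = h_fn([px, py, vx, vy, phi_plus])
    h_minus = h_fn([px, py, vx, vy, phi_minus])
    if h_minus < h_plus:                                    # Eq. (32)
        omega_next = omega_k - d_omega_max
    else:
        omega_next = omega_k + d_omega_max
    return min(max(omega_next, omega_min), omega_max)       # Eq. (33)


# Example barrier with assumed geometry: 1 m port offset, 30 deg cone half-angle.
h_fn = lambda s: los_cbf(s, rho=1.0, gamma=math.radians(30.0))
omega_next = adversarial_omega_update(h_fn, [10.0, 1.0, 0.0, 0.0, 0.0], 0.3, T=1.0)
```

With the chaser slightly to the positive-$y$ side of the cone axis, slowing the rotation keeps the axis further from the chaser, so the rule decreases $\omega$; the mirrored geometry increases it.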
### 4\.2Experimental Setup
Training was conducted over a maximum of 100,000 episodes per run, distributed across 28 to 32 parallel environments, accumulating approximately 20 million timesteps. The recurrent network architectures were trained with Backpropagation Through Time (BPTT). To mitigate recurrent-state initialization error during updates, a burn-in period (20 to 100 steps, depending on the domain) was used to warm up the hidden states before gradients were accumulated. The primary hyperparameters are summarized below.
- •PPO: learning rate $\approx 3 \times 10^{-4}$ (constant or decaying by domain), discount factor $\gamma = 0.995$, GAE $\lambda = 0.95$, clip parameter $\epsilon = 0.2$, and entropy coefficient $c_{\text{ent}} = 0.01$.
- •SAC: learning rate $\approx 3 \times 10^{-4}$, target smoothing coefficient $\tau = 0.005$, automatic target-entropy tuning (floor 0.05), and a replay buffer of 50,000 to 100,000 transitions.
- •Networks: the MLP feature extractors used 2 to 3 hidden layers with 64 to 256 nodes per layer. For a fair comparison, the hidden-state dimension ($d_{\text{model}}$) of the LSTM, GRU, and Mamba2 modules was matched (32 to 256 units, depending on domain complexity).
### 4\.3Evaluation Methodology
To assess performance, safety, and efficiency, the evaluation phase was decoupled from training\. For each scenario, a fixed bank of 500 pre\-generated initial conditions and environment\-parameter randomizations \(e\.g\., mass and inertia\) was established\. Using a static bank guarantees that every policy is evaluated on the exact same sequence of test cases regardless of training algorithm or recurrent backbone, enabling a direct comparison\. During evaluation, the policies were executed deterministically: for PPO, the action\-distribution means were used; for SAC, the exploration noise was disabled\. This ensures that performance differences arise solely from the learned representations\. Each controller was assessed using the following three metrics\.
1. Safety: the fraction of episodes in which the state remained within the safe set $h(\mathbf{x}, t) \geq 0$ for the entire horizon (or until the goal was reached).
2. Fuel / control effort: the accumulated control effort over the trajectory, reported as total $\Delta v$ or the integral of the control magnitude $\int \|\mathbf{u}(t)\|\, dt$.
3. Task performance: domain-specific success criteria, including time of flight (TOF) to the target state, docking-angle error, and the final percentage of the target surface inspected.
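The first two metrics can be computed from logged trajectories roughly as follows. This is a sketch under stated assumptions: each episode is assumed to provide a list of barrier values and a list of control vectors sampled at a fixed step `dt`, the integral is approximated with a rectangle rule, and the function names are hypothetical.

```python
import math


def safety_rate(h_histories):
    """Fraction of episodes whose barrier value stayed non-negative throughout."""
    safe = sum(1 for h in h_histories if all(v >= 0.0 for v in h))
    return safe / len(h_histories)


def control_effort(u_history, dt):
    """Rectangle-rule approximation of the integral of ||u(t)|| over one episode."""
    return sum(math.sqrt(sum(c * c for c in u)) for u in u_history) * dt


# Three toy episodes: the second one violates h >= 0 at its last step.
rate = safety_rate([[1.0, 0.5], [0.2, -0.1], [0.0, 0.3]])
effort = control_effort([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]], dt=0.5)
```

Evaluating every policy on the same fixed bank of initial conditions means these statistics can be compared configuration by configuration without additional sampling noise from the test set.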
Figures [3](https://arxiv.org/html/2606.17414#S4.F3) and [4](https://arxiv.org/html/2606.17414#S4.F4) summarize the results across the three cooperative scenarios and the two adversarial scenarios, respectively, and Table [3](https://arxiv.org/html/2606.17414#S4.T3) reports the corresponding statistics. Detailed results follow.
Figure 3:Performance of all setups across the three cooperative mission scenarios\. Triangle: Cruise Control, Circle: Docking, Square: Inspection\.
Figure 4:Performance of all setups across the two adversarial mission scenarios\. Circle: Docking, Square: Inspection\.
Table 3: Performance summary across the cooperative and adversarial scenarios. Fuel is reported as mean $\pm$ std with quartiles $[Q_1, Q_2, Q_3]$ (N·s for Cruise Control, m/s otherwise). “Safe” is the fraction of safe episodes; “Compl.” is the docking-success rate (Docking) or mean surface coverage (Inspection). Best value per column within each task in bold.

| Task | Case | $\mu \pm \sigma$ | $[Q_1,\, Q_2,\, Q_3]$ | Safe (%) | Compl. (%) |
| --- | --- | --- | --- | --- | --- |
| Cruise Control | LSTM+PPO | 4.02 ± 0.440 | [3.99, 4.07, 4.14] | 98.8 | – |
| | LSTM+SAC | 3.53 ± 0.497 | [3.29, 3.58, 3.84] | 98.9 | – |
| | GRU+PPO | 3.29 ± 0.514 | [3.04, 3.38, 3.63] | 98.8 | – |
| | GRU+SAC | 3.98 ± 0.416 | [3.93, 4.03, 4.12] | **99.0** | – |
| | Mamba2+PPO | **2.62 ± 0.355** | **[2.47, 2.60, 2.79]** | 98.9 | – |
| | Mamba2+SAC | 3.90 ± 0.449 | [3.81, 3.97, 4.09] | 98.8 | – |
| Cooperative Docking | LSTM+PPO | 7.64 ± 2.97 | [5.02, 7.15, 10.6] | **98.3** | 0.00 |
| | LSTM+SAC | **5.23 ± 1.05** | **[4.53, 5.09, 5.97]** | 98.2 | 0.00 |
| | GRU+PPO | 6.90 ± 2.47 | [4.90, 6.25, 8.81] | **98.3** | 0.00 |
| | GRU+SAC | 7.39 ± 1.63 | [6.67, 7.08, 7.85] | 4.80 | 2.20 |
| | Mamba2+PPO | 6.26 ± 1.70 | [5.14, 6.02, 7.17] | 97.9 | **95.0** |
| | Mamba2+SAC | 6.06 ± 1.88 | [4.74, 5.21, 7.73] | **98.3** | 48.6 |
| Cooperative Inspection | LSTM+PPO | 8.31 ± 4.30 | [6.02, 7.04, 8.80] | 94.4 | 97.7 |
| | LSTM+SAC | 20.1 ± 11.0 | [12.5, 16.6, 22.4] | 6.40 | 76.5 |
| | GRU+PPO | 9.69 ± 4.24 | [6.50, 7.69, 13.9] | 67.8 | 98.1 |
| | GRU+SAC | 20.7 ± 21.9 | [8.48, 9.64, 25.2] | 0.00 | 37.4 |
| | Mamba2+PPO | **6.59 ± 1.38** | **[5.71, 6.55, 7.58]** | **99.4** | **99.6** |
| | Mamba2+SAC | 9.20 ± 6.78 | [7.78, 8.40, 9.06] | 0.00 | 27.7 |
| Adversarial Docking | LSTM+PPO | 6.77 ± 2.92 | [4.58, 6.27, 8.27] | **95.6** | 95.1 |
| | GRU+PPO | 7.24 ± 2.68 | [5.24, 7.05, 9.74] | 88.3 | 74.3 |
| | Mamba2+PPO | **5.18 ± 1.87** | **[3.47, 5.39, 6.82]** | 95.4 | **95.2** |
| Adversarial Inspection | LSTM+PPO | 7.79 ± 2.80 | [6.17, 7.09, 8.38] | 95.6 | 98.4 |
| | GRU+PPO | 9.69 ± 4.24 | [6.50, 7.69, 13.9] | 67.8 | 98.1 |
| | Mamba2+PPO | **5.29 ± 1.37** | **[4.32, 5.22, 6.07]** | **99.0** | **99.4** |
### 4\.4Cruise Control
As shown in Table [3](https://arxiv.org/html/2606.17414#S4.T3), the Cruise Control task is well within the capabilities of all evaluated models. Every combination achieves a safety rate of roughly 99% (98.8% to 99.0%). However, Mamba2+PPO distinguishes itself with the lowest mean fuel usage. Figure [5](https://arxiv.org/html/2606.17414#S4.F5) shows the PPO-trained results, with phase-space trajectories in $(d, v)$ colored by total thrust (top), the CBF value $h(t)$ against the $h = 0$ safety boundary (middle), and the Lyapunov function $V(t)$ (bottom). The figure shows that Mamba2+PPO undercuts LSTM+PPO and GRU+PPO on fuel. All architectures fail to fully satisfy the CLF, with $V(t)$ plateauing near 100–150 rather than converging to zero. This behavior is a consequence of the reward structure prioritizing fuel over the CLF, which leads the agent to reduce its velocity during its approach. In future work, a higher weighting for the CLF satisfaction condition in the RL reward will be explored to avoid this. Figure [6](https://arxiv.org/html/2606.17414#S4.F6) shows the analogous results for the three SAC-trained network architectures; their median fuel consumption (3.58 to 4.03 N·s) is considerably higher than that of Mamba2+PPO (2.60 N·s).
Figure 5: PPO cruise control trajectories

Figure 6: SAC cruise control trajectories
### 4\.5Docking with a Cooperative Target
As seen in Table [3](https://arxiv.org/html/2606.17414#S4.T3), LSTM+PPO, LSTM+SAC, and GRU+PPO achieve safety rates among the highest recorded (98.2% to 98.3%) but a 0.0% dock rate, suggesting that the agents failed to learn how to reach the docking target within the given training time. Conversely, Mamba2+PPO achieves a 95.0% successful dock rate while maintaining a 97.9% safety rate. The lowest median fuel consumption is observed for LSTM+SAC (5.09 m/s), but at a docking success rate of 0%. Mamba2+SAC offers the second-lowest median fuel consumption (5.21 m/s) with a much higher docking success rate (48.6%), while Mamba2+PPO (6.02 m/s) attains the highest docking success rate. Figure [7](https://arxiv.org/html/2606.17414#S4.F7) evaluates the spacecraft docking scenario under nominal conditions for three PPO-trained architectures: LSTM+PPO, GRU+PPO, and Mamba2+PPO. The top row shows planar $XY$ position trajectories of the chaser spacecraft, overlaid with a rotating docking-cone obstacle at its start and end orientations to illustrate the time-varying geometry of the safety constraint. The middle row plots the CBF value $h(t)$ over time, where $h < 0$ indicates a constraint violation, with the $h = 0$ boundary clearly marked. The bottom row shows the Lyapunov function $V(t)$. Analogous to the PPO docking group, Figure [8](https://arxiv.org/html/2606.17414#S4.F8) evaluates the standard spacecraft docking scenario for three SAC-trained architectures.
Figure 7: PPO docking trajectories

Figure 8: SAC docking trajectories
### 4\.6Inspecting a Cooperative Target
The Inspection scenario pushes the agents further, requiring them to balance goal-oriented navigation against fuel efficiency and strict safety constraints. Figures [9](https://arxiv.org/html/2606.17414#S4.F9) and [10](https://arxiv.org/html/2606.17414#S4.F10) show the PPO- and SAC-trained network architectures, with one architecture per column and rows giving the $XY$ trajectories over the KIZ annulus, the KOZ barrier $h_{\mathrm{KOZ}}(t) = \|\mathbf{r}_b\| - (R_C + R_D)$, the KIZ barrier $h_{\mathrm{KIZ}}(t) = R_{\max} - \|\mathbf{r}_b\|$, the Sun-exclusion barrier $h_{\mathrm{SUN}}(t)$, and the cumulative inspection coverage. Table [3](https://arxiv.org/html/2606.17414#S4.T3) reveals a safety collapse for SAC: both GRU+SAC and Mamba2+SAC record a 0.0% safety rate, and LSTM+SAC reaches only 6.4%. Among the PPO variants, GRU+PPO reaches high coverage (98.11 points) but inefficiently, consuming far more fuel (Thrust $\mu = 1042.96$) at only 67.8% safety. Mamba2+PPO is again the best pairing, achieving near-complete coverage (99.55% median) and the highest safety (99.4%) with a stable, efficient thrust profile ($\sigma = 115.44$).
Figure 9: PPO inspection trajectories

Figure 10: SAC inspection trajectories
### 4\.7Docking with an Adversarial Target
To assess robustness under unpredictable, actively evasive conditions, the models were evaluated on adversarial variants of the docking and inspection scenarios; given PPO’s superior cooperative performance, only the PPO variants were carried forward\.
For adversarial docking, the GRU backbone degrades sharply, falling to an 88.3% safety rate and a 74.3% docking-success rate. LSTM+PPO and Mamba2+PPO remain robust, with comparable docking success ($\sim 95\%$) and safety ($\sim 95.5\%$). Mamba2+PPO is the most fuel-efficient of the three, reaching the target with a lower mean $\Delta v$ than LSTM+PPO (5.18 vs. 6.77 m/s) and a lower standard deviation (1.87 vs. 2.92 m/s), indicating a more efficient and stable policy learned from fewer interactions. Figure [11](https://arxiv.org/html/2606.17414#S4.F11) shows the adversarial docking trajectories for the three PPO network architectures, with the layout mirroring the cooperative docking figure.
Figure 11:PPO adversarial docking trajectories
### 4\.8Inspecting an Adversarial Target
Table [3](https://arxiv.org/html/2606.17414#S4.T3) demonstrates that all three models maintain excellent coverage against the adversarial target, inspecting roughly 98–99% of the target points, but they rely on markedly different control strategies to achieve this. The RNN-based models expend considerably more fuel countering the adversary’s evasive maneuvers (7.79 and 9.69 m/s mean $\Delta v$ for LSTM+PPO and GRU+PPO, respectively), and GRU+PPO does so while sacrificing safety (67.8% safe episodes). Mamba2+PPO, by contrast, dominates on every metric. It requires the least control effort (5.29 m/s) with the narrowest distribution ($\sigma = 1.37$ m/s), while simultaneously achieving the highest safety rate (99.0% vs. 95.6% for LSTM+PPO) and the highest coverage (99.4%). This across-the-board advantage highlights the superior capacity of selective state space models for handling complex, multi-objective control problems, even in hostile environments. Figure [12](https://arxiv.org/html/2606.17414#S4.F12) presents results for the adversarial inspection scenario evaluated on the three PPO-trained architectures. The figure layout mirrors the standard inspection figure.
Figure 12:PPO adversarial inspection trajectories
## 5Discussion
Three observations emerge from these results. First, the performance gap between configurations scales with task difficulty. In Cruise Control all pairings are separated only by fuel, but in docking only Mamba2 learns to complete the task, in inspection the SAC variants collapse, and under adversarial behavior Mamba2+PPO consumes roughly 23–45% less fuel (mean $\Delta v$ in Table 3) than the RNN-based policies while matching or exceeding their safety and coverage. This illustrates that architecture and algorithm choices must be validated according to the difficulty of the problem. Second, Mamba2’s advantage is plausibly rooted in its linear, input-gated hidden-state recurrence, which, unlike the saturating nonlinear updates of the LSTM and GRU, mitigates the vanishing-gradient problem over long horizons and allows the policy to infer hidden task parameters and adversarial behavior from the full observation history. Third, the systematic underperformance of SAC reflects a mismatch between entropy-regularized, off-policy learning and the thin feasible sets of safety-critical control: the maximum-entropy objective injects stochasticity that repeatedly drives the agent across tight constraint boundaries, while replayed transitions generated under earlier class-$\mathcal{K}$ parameterizations leave the value estimates stale with respect to the current safety filter. On-policy methods such as PPO that learn only from data generated by the current filter can avoid this problem.
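The relative fuel savings under adversarial behavior can be recomputed from the mean $\Delta v$ values in Table 3. The short sketch below does this arithmetic; the dictionary layout and function name are illustrative assumptions, while the numbers are copied from the table.

```python
def fuel_saving_percent(mamba_mean, baseline_mean):
    """Relative reduction in mean delta-v of Mamba2+PPO versus a baseline, in percent."""
    return 100.0 * (1.0 - mamba_mean / baseline_mean)


# Mean delta-v in m/s from Table 3 (adversarial scenarios, PPO variants).
adversarial_means = {
    "docking": {"Mamba2+PPO": 5.18, "LSTM+PPO": 6.77, "GRU+PPO": 7.24},
    "inspection": {"Mamba2+PPO": 5.29, "LSTM+PPO": 7.79, "GRU+PPO": 9.69},
}

savings = {
    (task, baseline): fuel_saving_percent(means["Mamba2+PPO"], value)
    for task, means in adversarial_means.items()
    for baseline, value in means.items()
    if baseline != "Mamba2+PPO"
}

for (task, baseline), pct in sorted(savings.items()):
    print(f"{task:10s} vs {baseline}: {pct:.1f}% less fuel")
```

The gap is narrower for docking (about 23% to 28%) than for inspection (about 32% to 45%), which is consistent with the observation that the advantage grows with task difficulty.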
## CONCLUSION
This paper extended a previously developed meta-RL framework for tuning the class-$\mathcal{K}$ functions of ICCBFs by benchmarking three recurrent network architectures (LSTM, GRU, and Mamba2) and two training algorithms (PPO and SAC) across cruise control, docking, and inspection scenarios, including adversarial variants in which the target actively degrades the chaser’s safety or sensor coverage. Monte Carlo evaluation under hidden-parameter, state, and thrust uncertainty showed that the choice of sequence model and training algorithm is decisive rather than incidental, and that the performance gap between configurations widens with task complexity. On-policy PPO consistently outperformed off-policy SAC, whose entropy-driven exploration and stale replayed transitions led to safety collapse in the docking and inspection tasks. Among the network architectures, Mamba2 paired with PPO provided the best overall balance: it was the only configuration to achieve a high docking success rate (95.0%) while preserving safety (97.9%), attained near-complete inspection coverage (99.6%) at the highest safety rate (99.4%), and retained this advantage under adversarial behavior, where it matched or exceeded the safety and task completion of the RNN-based alternatives at roughly 23–45% lower mean fuel consumption. These results indicate that selective state-space models are well suited to learned safety filters for proximity operations. Future work will include adapting off-policy methods to the non-stationary safe set induced by the learned filter, and validating the framework on flight-representative hardware.
## 6Acknowledgment
Research was sponsored by the Department of the Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750\-19\-2\-1000\. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Department of the Air Force or the U\.S\. Government\. The U\.S\. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein\.
## Use of Artificial Intelligence
AI tools were used in the preparation of this manuscript to improve the clarity of the technical writing\. All scientific content, including the problem formulation, theoretical development, interpretation of results, and conclusions, is the sole intellectual contribution of the authors\. The authors reviewed and verified all AI\-generated text, and take full responsibility for the accuracy and integrity of the work\.
## Appendix: A
Algorithm 1: Recurrent Meta-RL via PPO (On-Policy)

Require: Task $\mathcal{E}$; recurrent actor $\pi_\theta$, critic $V_\phi$; rollout length $T$, burn-in $K$, epochs $E$, clip $\epsilon$, GAE $(\gamma, \lambda)$.

- Initialise $\theta,\, \phi$; context $\mathcal{C} \leftarrow \mathbf{0}^{K \times N}$
- for each PPO update do
  - // Phase 1: Rollout
  - $\bm{h}^{a},\, \bm{h}^{c} \leftarrow \mathbf{0}$
  - for $t = 1$ to $T$ do
    - $a_t,\, \log\pi_t,\, \bm{h}^{a} \leftarrow \pi_\theta(\bm{s}_t,\, \bm{h}^{a})$
    - $v_t,\, \bm{h}^{c} \leftarrow V_\phi(\bm{s}_t,\, \bm{h}^{c})$
    - $\bm{s}_{t+1},\, R_t,\, d_t \leftarrow \mathcal{E}.\textsc{Step}\bigl(\textsc{ICCBF-QP}(a_t, \bm{s}_t)\bigr)$
    - if $d_t^{(i)} = \textbf{true}$ then $\bm{h}^{a}_{[:,i]},\, \bm{h}^{c}_{[:,i]} \leftarrow \mathbf{0}$
    - end if
    - Store $(\bm{s}_t, a_t, R_t, d_t, v_t, \log\pi_t)$ in $\mathcal{B}$
  - end for
  - // Phase 2: GAE
  - $v_{T+1} \leftarrow V_\phi(\bm{s}_{T+1}, \bm{h}^{c})$; $\hat{A}_{T+1} \leftarrow 0$
  - for $t = T$ down to $1$ do
    - $\hat{A}_t \leftarrow (R_t + \gamma v_{t+1} - v_t) + \gamma\lambda\, \hat{A}_{t+1}$
  - end for
  - // Phase 3: Update with Burn-In
  - $\mathcal{C} \leftarrow \mathcal{B}.\textsc{obs}_{T-K:T}$ $\triangleright$ save tail as burn-in context for next update
  - for epoch $e = 1$ to $E$ do
    - for each minibatch $\mathcal{M} \sim \mathcal{B}$ do
      - Run $\pi_\theta, V_\phi$ over $[\mathcal{C},\, \mathcal{M}]$ from $\bm{h} = \mathbf{0}$; discard first $K$ outputs
      - $\hat{A} \leftarrow (\hat{A} - \mu_{\hat{A}}) / \sigma_{\hat{A}}$
      - $r \leftarrow \exp(\log\pi^{\prime}_t - \log\pi_t)$
      - $\mathcal{L} \leftarrow -\mathbb{E}\bigl[\min(r\hat{A},\; \operatorname{clip}(r, 1 \pm \epsilon)\hat{A})\bigr] + c_V \mathcal{L}_V - c_{\mathcal{H}} \mathcal{H}$
      - Update $\theta, \phi$ via $\nabla\mathcal{L}$; clip $\|\nabla\|_2 \leq \delta_g$
      - if $\widehat{\mathrm{KL}} > \delta_{\mathrm{KL}}$ then break
      - end if
    - end for
  - end for
- end for
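Phase 2 of Algorithm 1 and the clipped surrogate in Phase 3 can be sketched in plain Python as below. The function names are hypothetical. The optional `dones` mask, which stops bootstrapping across episode boundaries, is an assumption added for illustration; Algorithm 1 as listed handles episode ends only by resetting the hidden states, and passing `dones=None` reproduces its recursion exactly.

```python
def gae_advantages(rewards, values, last_value, gamma=0.995, lam=0.95, dones=None):
    """Backward GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}.

    rewards, values: lists of length T; last_value is v_{T+1}.
    dones (optional, assumed extension): if dones[t] is True, the bootstrap
    value and the running advantage are masked out at step t.
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_adv = 0.0            # A_{T+1} = 0
    next_value = last_value   # v_{T+1}
    for t in reversed(range(T)):
        mask = 1.0 if dones is None or not dones[t] else 0.0
        delta = rewards[t] + gamma * next_value * mask - values[t]
        next_adv = delta + gamma * lam * mask * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

The default `gamma`, `lam`, and `eps` follow the PPO hyperparameters reported in the experimental setup; the loss in Algorithm 1 is the negative mean of `clipped_surrogate` over a minibatch, plus the value and entropy terms.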
Algorithm 2Recurrent Meta\-RL via SAC \(Off\-Policy\)Require:Taskℰ\\mathcal\{E\}; recurrent actorπθ\\pi\_\{\\theta\}, twin criticsQϕ1,Qϕ2Q\_\{\\phi\_\{1\}\},Q\_\{\\phi\_\{2\}\}, target criticsQ¯ϕ1,Q¯ϕ2\\bar\{Q\}\_\{\\phi\_\{1\}\},\\bar\{Q\}\_\{\\phi\_\{2\}\}; chunk lengthLL, burn\-inKK, Polyakτ\\tau, entropy coefα\\alpha\.
1:Initialise
θ,ϕ1:2,ϕ¯1:2←ϕ1:2\\theta,\\,\\phi\_\{1:2\},\\,\\bar\{\\phi\}\_\{1:2\}\\leftarrow\\phi\_\{1:2\},
logα\\log\\alpha; replay buffer
𝒟\\mathcal\{D\}of capacity
MM
2:
𝒉←𝟎\\bm\{h\}\\leftarrow\\mathbf\{0\}; collect
NstartN\_\{\\mathrm\{start\}\}random transitions into
𝒟\\mathcal\{D\}
3:foreach environment stepdo
4:// — Collection —
5:
$a_t,\, \bm{h} \leftarrow \pi_\theta(\bm{s}_t, \bm{h})$
6: $\bm{s}_{t+1}, R_t, d_t \leftarrow \mathcal{E}.\textsc{Step}\bigl(\textsc{ICCBF-QP}(a_t, \bm{s}_t)\bigr)$
7: if $d_t = \textbf{true}$ then $\bm{h} \leftarrow \mathbf{0}$
8: end if
9: Store $(\bm{s}_t, a_t, R_t, d_t)$ in chunk collector; flush complete chunks of length $K{+}L$ to $\mathcal{D}$
10: if step $\bmod$ train_freq $= 0$ and $|\mathcal{D}| \geq$ batch_size then
11: for gradient_steps iterations do
12: // Burn-In Split
13: Sample batch of sequences $(K{+}L)$ from $\mathcal{D}$
14: $\mathcal{C} \leftarrow \text{seq}_{:K}$; $(\bm{s}, a, R, \bm{s}', d) \leftarrow \text{seq}_{K:}$ $\triangleright$ first $K$ steps warm up the RNN; loss on last $L$
15: // Critic Update
16: $a', \log\pi' \leftarrow \pi_\theta(\bm{s}', d,\; \mathcal{C}^{\,+1})$ $\triangleright$ context shifted by 1 for next-obs
17: $y \leftarrow R + \gamma(1-d)\bigl[\min_i \bar{Q}_{\bar{\phi}_i}(\bm{s}', a') - \alpha\log\pi'\bigr]$
18: $\mathcal{L}_Q \leftarrow \sum_{i=1}^{2} \mathrm{MSE}\bigl(Q_{\phi_i}(\bm{s}, a,\; \mathcal{C}),\; y\bigr)$
19: Update $\phi_{1:2}$ via $\nabla_\phi \mathcal{L}_Q$; clip $\|\nabla\|_2 \leq \delta_g$
20: // Actor Update
21: $\tilde{a}, \log\pi \leftarrow \pi_\theta(\bm{s}, d,\; \mathcal{C})$
22: $\mathcal{L}_\pi \leftarrow \mathbb{E}\bigl[\alpha\log\pi - \min_i Q_{\phi_i}(\bm{s}, \tilde{a},\; \mathcal{C})\bigr]$
23: Update $\theta$ via $\nabla_\theta \mathcal{L}_\pi$; clip $\|\nabla\|_2 \leq \delta_g$
24: // Entropy Coefficient Update
25: $\mathcal{L}_\alpha \leftarrow \mathbb{E}\bigl[-\alpha(\log\pi + \bar{\mathcal{H}})\bigr]$; update $\log\alpha$ via $\nabla\mathcal{L}_\alpha$; clamp $\alpha \geq \alpha_{\min}$
26: // Polyak Update
27: $\bar{\phi}_i \leftarrow \tau\phi_i + (1{-}\tau)\bar{\phi}_i$ $\triangleright$ soft update target critics
28: end for
29: end if
30: end for
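
The arithmetic in the burn-in split, critic target, entropy-coefficient loss, and Polyak update can be checked numerically with a minimal NumPy sketch. It is an illustration under stated assumptions, not the implementation used in this work: the function names (`burn_in_split`, `soft_td_target`, `alpha_loss`, `polyak_update`) are hypothetical, the network outputs are replaced by plain arrays, and $\bar{\mathcal{H}}$ is assumed to denote the usual SAC target entropy.

```python
import numpy as np


def burn_in_split(seq, K):
    """Split a (K+L)-step sequence along the time axis (axis 0).

    The first K steps form the burn-in context used to warm up the recurrent
    state; the remaining L steps are the ones the losses are computed on.
    """
    return seq[:K], seq[K:]


def soft_td_target(R, d, q1_next, q2_next, logp_next, gamma, alpha):
    """y = R + gamma * (1 - d) * [min_i Qbar_i(s', a') - alpha * log pi']."""
    min_q = np.minimum(q1_next, q2_next)  # clipped double-Q over the two target critics
    return R + gamma * (1.0 - d) * (min_q - alpha * logp_next)


def alpha_loss(log_alpha, logp, target_entropy):
    """L_alpha = E[-alpha * (log pi + Hbar)], with alpha = exp(log_alpha).

    Assumption: Hbar is the target entropy, as in standard SAC.
    """
    alpha = np.exp(log_alpha)
    return float(np.mean(-alpha * (logp + target_entropy)))


def polyak_update(target_params, online_params, tau):
    """phibar_i <- tau * phi_i + (1 - tau) * phibar_i, per parameter array."""
    return [tau * p + (1.0 - tau) * pb
            for p, pb in zip(online_params, target_params)]


if __name__ == "__main__":
    # Toy sequence with K = 2 burn-in steps and L = 3 training steps.
    seq = np.arange(10, dtype=float).reshape(5, 2)
    context, train = burn_in_split(seq, K=2)
    print(context.shape, train.shape)

    # Toy values standing in for critic and policy outputs on the last L steps.
    R = np.array([1.0, 0.5, -1.0])
    d = np.array([0.0, 0.0, 1.0])  # the last transition is terminal
    y = soft_td_target(
        R, d,
        q1_next=np.array([2.0, 3.0, 4.0]),
        q2_next=np.array([2.5, 1.0, 5.0]),
        logp_next=np.array([-1.0, -2.0, -0.5]),
        gamma=0.5, alpha=0.5,
    )
    print(y)

    # One soft update of a single target parameter array.
    print(polyak_update([np.zeros(2)], [np.ones(2)], tau=0.25))
```

On the terminal transition the factor $(1-d)$ removes the bootstrap term, so the target reduces to the reward alone. This mirrors the hidden-state reset $\bm{h} \leftarrow \mathbf{0}$ applied at episode boundaries during data collection.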
## References
- [1] D. R. Agrawal and D. Panagou (2021-12). Safe control synthesis via input constrained control barrier functions. In 2021 60th IEEE Conference on Decision and Control (CDC), pp. 6113–6118. External Links: [Link](http://dx.doi.org/10.1109/CDC45484.2021.9682938), [Document](https://dx.doi.org/10.1109/cdc45484.2021.9682938)
- [2] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu (2018). Safe reinforcement learning via shielding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
- [3] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada (2019). Control barrier functions: theory and applications. External Links: 1903.11199, [Link](https://arxiv.org/abs/1903.11199)
- [4] J. Beck, R. Vuorio, E. Z. Liu, Z. Xiong, L. Zintgraf, C. Finn, and S. Whiteson (2025). A tutorial on meta-reinforcement learning. Foundations and Trends in Artificial Intelligence Series, Now Publishers. External Links: ISBN 9781638285403, [Link](https://books.google.lk/books?id=halH0QEACAAJ)
- [5] N. Bernardini, M. C. Wijayatunga, N. Baresi, and R. Armellin (2023). State-dependent trust region for successive convex optimization of spacecraft trajectories. In 33rd AAS/AIAA Space Flight Mechanics Meeting, Austin, TX.
- [6] J. T. Betts (1998). Survey of numerical methods for trajectory optimization. Journal of Guidance, Control, and Dynamics 21(2), pp. 193–207. External Links: [Document](https://dx.doi.org/10.2514/2.4231)
- [7] J. Breeden, K. Garg, and D. Panagou (2022). Control barrier functions in sampled-data systems. IEEE Control Systems Letters 6, pp. 367–372. External Links: ISSN 2475-1456, [Link](http://dx.doi.org/10.1109/LCSYS.2021.3076127), [Document](https://dx.doi.org/10.1109/lcsys.2021.3076127)
- [8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. External Links: 1412.3555, [Link](https://arxiv.org/abs/1412.3555)
- [9] T. Dao and A. Gu (2024). Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. External Links: 2405.21060, [Link](https://arxiv.org/abs/2405.21060)
- [10] C. Dawson, S. Gao, and C. Fan (2023). Safe control with learned certificates: a survey of neural Lyapunov, barrier, and contraction methods for robotics and control. IEEE Transactions on Robotics 39(3), pp. 1749–1767. External Links: [Document](https://dx.doi.org/10.1109/TRO.2022.3232542)
- [11] K. Dunlap, M. Mote, K. Delsing, and K. L. Hobbs (2023). Run time assured reinforcement learning for safe satellite docking. Journal of Aerospace Information Systems 20(1), pp. 25–36. External Links: [Document](https://dx.doi.org/10.2514/1.I011126)
- [12] L. Federici, B. Benedikter, and A. Zavoli (2021). Deep learning techniques for autonomous spacecraft guidance during proximity operations. Journal of Spacecraft and Rockets 58(6), pp. 1774–1785. External Links: [Document](https://dx.doi.org/10.2514/1.A35076)
- [13] L. Federici, A. Scorsoglio, A. Zavoli, and R. Furfaro (2022). Meta-reinforcement learning for adaptive spacecraft guidance during finite-thrust rendezvous missions. Acta Astronautica 201, pp. 129–141. External Links: [Document](https://dx.doi.org/10.1016/j.actaastro.2022.08.047)
- [14] G. Fereoli, H. Schaub, and P. Di Lizia (2025). Meta-reinforcement learning for spacecraft proximity operations guidance and control in cislunar space. Journal of Spacecraft and Rockets 62(3), pp. 706–718. External Links: [Document](https://dx.doi.org/10.2514/1.A36100)
- [15] B. Gaudet, R. Linares, and R. Furfaro (2020). Adaptive guidance and integrated navigation with reinforcement meta-learning. Acta Astronautica 169, pp. 180–190. External Links: ISSN 0094-5765, [Document](https://dx.doi.org/10.1016/j.actaastro.2020.01.007)
- [16] B. Gaudet, R. Linares, and R. Furfaro (2020). Deep reinforcement learning for six degree-of-freedom planetary landing. Advances in Space Research 65(7), pp. 1723–1741. External Links: [Document](https://dx.doi.org/10.1016/j.asr.2019.12.030)
- [17] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. External Links: 1801.01290, [Link](https://arxiv.org/abs/1801.01290)
- [18] P. Lu and X. Liu (2013). Autonomous trajectory planning for rendezvous and proximity operations by conic optimization. Journal of Guidance, Control, and Dynamics 36(2), pp. 375–389. External Links: [Document](https://dx.doi.org/10.2514/1.58436)
- [19] D. Malyuta, T. P. Reynolds, M. Szmuk, T. Lew, R. Bonalli, M. Pavone, and B. Açıkmeşe (2022). Convex optimization for trajectory generation: a tutorial on generating dynamically feasible trajectories reliably and efficiently. IEEE Control Systems Magazine 42(5), pp. 40–113. External Links: [Document](https://dx.doi.org/10.1109/MCS.2022.3187542)
- [20] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. External Links: 1707.06347, [Link](https://arxiv.org/abs/1707.06347)
- [21] R. S. Sutton and A. G. Barto (2018). Reinforcement learning: an introduction. Second edition, The MIT Press. External Links: [Link](http://incompleteideas.net/book/the-book-2nd.html)
- [22] D. van Wijk, K. Dunlap, M. Majji, and K. L. Hobbs (2024). Safe spacecraft inspection via deep reinforcement learning and discrete control barrier functions. Journal of Aerospace Information Systems 21(12), pp. 996–1013.
- [23] D. Van Wijk, K. Dunlap, M. Majji, and K. Hobbs (2024). Safe spacecraft inspection via deep reinforcement learning and discrete control barrier functions. Journal of Aerospace Information Systems 21(12), pp. 996–1013. External Links: [Document](https://dx.doi.org/10.2514/1.I011391)
- [24] A. Weiss, M. Baldwin, R. S. Erwin, and I. Kolmanovsky (2015). Model predictive control for spacecraft rendezvous and docking: strategies for handling constraints and case studies. IEEE Transactions on Control Systems Technology 23(4), pp. 1638–1647. External Links: [Document](https://dx.doi.org/10.1109/TCST.2014.2379639)
- [25] M. C. Wijayatunga, R. Armellin, H. Holt, L. Pirovano, and A. A. Lidtke (2023). Design and guidance of a multi-active debris removal mission. Astrodynamics 7(4), pp. 383–399. External Links: [Document](https://dx.doi.org/10.1007/s42064-023-0159-3)
- [26] M. C. Wijayatunga, R. Armellin, and H. Holt (2025). Robust trajectory design and guidance for far-range rendezvous using reinforcement learning with safety and observability considerations. Aerospace Science and Technology 159, pp. 109996. External Links: [Document](https://dx.doi.org/10.1016/j.ast.2025.109996)
- [27] M. C. Wijayatunga, R. Armellin, and L. Pirovano (2023). Exploiting scaling constants to facilitate the convergence of indirect trajectory optimization methods. Journal of Guidance, Control, and Dynamics 46(5), pp. 958–969. External Links: [Document](https://dx.doi.org/10.2514/1.G007091)
- [28] M. C. Wijayatunga, J. Guinane, N. D. Wallace, and X. Wu (2026). An autonomous, end-to-end, convex-based framework for close-range rendezvous trajectory design and guidance with hardware testbed validation. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2602.12421)
- [29] M. C. Wijayatunga, R. Linares, and R. Armellin (2026). Meta-reinforcement learning for robust and non-greedy control barrier functions in spacecraft proximity operations. External Links: 2602.07335, [Link](https://arxiv.org/abs/2602.07335)
- [30] M. Wijayatunga, N. Wallace, S. Sukkarieh, and R. Armellin (2026). Learning safety-guaranteed, non-greedy control barrier functions using reinforcement learning. External Links: 2602.00366
- [31] A. Zavoli and L. Federici (2021). Reinforcement learning for robust trajectory design of interplanetary missions. Journal of Guidance, Control, and Dynamics 44(8), pp. 1440–1453. External Links: [Document](https://dx.doi.org/10.2514/1.G005794)