Deep Reinforcement Learning for Spacecraft Attitude Control During Atmospheric Re-Entry

arXiv cs.LG Papers

Summary

This paper explores deep reinforcement learning for attitude control of spacecraft during hypersonic re-entry. It demonstrates that state-of-the-art RL and hybrid controllers can outperform traditional PID controllers with gain scheduling, especially when dynamics randomization is used to improve robustness and generalization.

arXiv:2606.31291v1 Announce Type: new Abstract: Deep reinforcement learning has the potential to solve attitude control problems more adaptively, precisely, and robustly by handling nonlinear dynamics, uncertainties, and failure cases more effectively than traditional attitude control approaches. We explore reinforcement learning (RL) for attitude control in spacecraft re-entry. An industry-standard proportional-integral-derivative controller with gain scheduling serves as a strong baseline for model-free RL and hybrid controllers that combine these two approaches. We formalize the application in the RL framework to apply continuous, off-policy RL. State-of-the-art RL achieves comparable performance to traditional control approaches in this domain. However, its out-of-distribution generalization is not sufficient. Hence, we use dynamics randomization to introduce challenging task variations during training and enforce generalization in a predefined operational envelope. Finally, we assess the best obtained RL-based controllers with application-specific metrics to show superior performance in comparison to traditional controllers in the operational envelope, that is, hybrid controllers are able to track the angle of attack better and are more robust under variations of mass, inertia tensor, and flap actuator bandwidth.

Cached at: 07/01/26, 05:35 AM

# Deep Reinforcement Learning for Spacecraft Attitude Control During Atmospheric Re-Entry
Source: [https://arxiv.org/html/2606.31291](https://arxiv.org/html/2606.31291)

Alexander Fabisch, Melvin Laux, Mariela De Lucas Álvarez, Edoardo Caroselli, Julian Theis

Keywords: Attitude Control, Spacecraft Re-Entry, Continuous Model-Free Reinforcement Learning, Task Scheduling, Out-of-Distribution Generalization


Contribution(s)

1. We apply state-of-the-art deep reinforcement learning algorithms for continuous control to the problem of attitude control during hypersonic re-entry of a spacecraft. This is a challenging tracking problem that requires gain scheduling for changing atmospheric conditions. We design a simple and effective reward function for attitude control and find that MR.Q excels without task-specific tuning.
   Context: We apply existing reinforcement learning algorithms (e.g., MR.Q (fujimoto2025_mrq)). Prior work with older algorithms, less thorough evaluation, or more complicated reward functions in the application domain exists (e.g., elkins_adaptive_2020).
2. Although the best learned controller performs better than the industry-standard baseline controller under nominal conditions, we explore hybrid control architectures that combine the baseline controller with reinforcement learning and compare these against pure reinforcement learning to enhance out-of-distribution generalization.
   Context: liu_attitude_2022 propose a similar hybrid controller that integrates reinforcement learning in a traditional controller. In comparison to this work, we use state-of-the-art algorithms and explicitly investigate generalization and robustness of learned controllers.
3. To improve robustness and generalization explicitly, we compare dynamics randomization and task scheduling approaches for reinforcement learning in this attitude control problem and determine the best training strategy. We find that the resulting policies considerably improve robustness in comparison to the baseline controller.
   Context: Uniform dynamics randomization has been used successfully for various applications in robotics (Antonova2017; Peng2018; Tan2018; OpenAI2020). Task scheduling approaches have been proposed in the context of deep reinforcement learning (Cho2024) and contextual policy search (Fabisch2014).

###### Abstract

Deep reinforcement learning has the potential to solve attitude control problems more adaptively, precisely, and robustly by handling nonlinear dynamics, uncertainties, and failure cases more effectively than traditional attitude control approaches\. We explore reinforcement learning \(RL\) for attitude control in spacecraft re\-entry\. An industry\-standard proportional\-integral\-derivative controller with gain scheduling serves as a strong baseline for model\-free RL and hybrid controllers that combine these two approaches\. We formalize the application in the RL framework to apply continuous, off\-policy RL\. State\-of\-the\-art RL achieves comparable performance to traditional control approaches in this domain\. However, its out\-of\-distribution generalization is not sufficient\. Hence, we use dynamics randomization to introduce challenging task variations during training and enforce generalization in a predefined operational envelope\. Finally, we assess the best obtained RL\-based controllers with application\-specific metrics to show superior performance in comparison to traditional controllers in the operational envelope, that is, hybrid controllers are able to track the angle of attack better and are more robust under variations of mass, inertia tensor, and flap actuator bandwidth\.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.31291v1/output-onlinepngtools-2.png)

[Figure 1 drawing: the vehicle with body-fixed axes $x_b$, $y_b$, $z_b$, velocity vector $\mathbf{V}$, lift $\mathbf{L}$, drag $\mathbf{D}$, weight $m\mathbf{g}$, and the angles $\alpha$, $\beta$, and $\mu$.]

Figure 1: Sketch of the hypersonic re-entry vehicle.

Deep reinforcement learning (RL) for continuous control (e.g., lillicrap_continuous_2019) has the potential to handle nonlinear dynamics and failure cases more easily and to be more robust against and adaptable to unforeseen conditions than other attitude control approaches (see, e.g., elkins_autonomous_2020; bernini_reinforcement_2024; liu_attitude_2022). However, the introduction of RL into a safety-critical attitude control system requires a fundamental shift in verification and validation (V&V) for space-grade software. V&V for guidance, navigation and control traditionally relies on deterministic and formally verifiable control laws, e.g., proportional–integral–derivative (PID) controllers. For these, the system's behavior is auditable through formal analysis and exhaustive simulation of well-defined corner cases. Core challenges with deep RL are the out-of-distribution generalization and opacity of the learned policy. It is a black box, making it infeasible to formally prove its stability and performance across the entire state space and parametric envelope. While explainable artificial intelligence advances transparency in the decision-making process of deep RL (Li2025; Goel2025; Luss2023), its integration into safety-critical control applications is not fully established. We propose the use of hybrid control designs (e.g., residual RL, Johannink2019) to mitigate black-box concerns.

We specifically discuss attitude control for a hypersonic re-entry vehicle, as illustrated in Figure [1](https://arxiv.org/html/2606.31291#S1.F1). Such spacecraft return personnel and cargo from space and are therefore a key technology for future human spaceflight, planetary exploration, and an extraterrestrial resource extraction economy, e.g., lunar mining. We pursue two goals: on the one hand, to replace classical control engineering with RL and thereby reduce the expert's effort; on the other hand, to combine PID control with RL-based control in a hybrid controller that handles challenging changes to the environment dynamics and failure cases more effectively. We compare three types of controllers. (1) Baseline controller: the previous solution, a PID controller with gain scheduling. (2) Only RL: RL without any additional controller, but with the model of the spacecraft to generate kinematically feasible reference values. (3) Hybrid controller: RL modifies the baseline controller. The hybrid controller is our primary candidate for a flight-ready system. This architecture is inherently safer because the PID controller acts as a verified baseline. The RL agent's authority can be explicitly bounded by a supervisor logic that monitors the system's state and the RL agent's output, disengaging the RL agent when the vehicle approaches the edge of its validated safe domain. We aim to provide statistical evidence that the controller is reliable and safe within a strictly defined operational envelope.
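The hybrid architecture with a supervisor can be illustrated with a small sketch. It is not the authors' implementation: the additive (residual) combination of the two commands, the authority limit `max_residual`, and the safe-domain check on the tracking error are assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass


def clip(value: float, low: float, high: float) -> float:
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass
class Supervisor:
    """Hypothetical supervisor that bounds or disengages the RL correction."""
    max_residual: float   # assumed authority limit on the RL correction
    max_abs_error: float  # assumed edge of the validated safe domain

    def rl_enabled(self, tracking_error: float) -> bool:
        # RL stays engaged only while the tracking error is inside the safe domain.
        return abs(tracking_error) <= self.max_abs_error

    def combine(self, u_pid: float, u_rl: float, tracking_error: float) -> float:
        if not self.rl_enabled(tracking_error):
            return u_pid  # fall back to the verified baseline command
        # Inside the safe domain, add a bounded RL correction to the PID command.
        return u_pid + clip(u_rl, -self.max_residual, self.max_residual)


sup = Supervisor(max_residual=0.05, max_abs_error=0.2)
inside = sup.combine(u_pid=0.10, u_rl=0.30, tracking_error=0.01)   # RL bounded
outside = sup.combine(u_pid=0.10, u_rl=0.30, tracking_error=0.50)  # RL disengaged
```

With this structure, the worst-case deviation from the baseline command is bounded by `max_residual`, which is the property that makes such an architecture easier to argue about than a pure RL policy.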

Our contributions are the following\. \(1\) We apply state\-of\-the\-art deep reinforcement learning algorithms for continuous control to the problem of attitude control during hypersonic re\-entry of a spacecraft\. This is a challenging tracking problem that requires gain scheduling for changing atmospheric conditions\. We design a simple and effective reward function for attitude control and find that MR\.Q excels without task\-specific tuning\. \(2\) Although the best learned controller performs better than the industry\-standard baseline controller under nominal conditions, we explore hybrid control architectures that combine the baseline controller with reinforcement learning and compare these against pure reinforcement learning to enhance out\-of\-distribution generalization\. \(3\) To improve robustness and generalization explicitly, we compare dynamics randomization and task scheduling approaches for reinforcement learning in this attitude control problem and determine the best training strategy\. We find that the resulting policies considerably improve robustness in comparison to the baseline controller\.

## 2 Background: Spacecraft Re-Entry and Reinforcement Learning

### 2.1 Spacecraft Re-Entry Problem

The present paper considers flight path control for a lifting-body vehicle with low lift-to-drag ratio during hypersonic re-entry (see Figure [1](https://arxiv.org/html/2606.31291#S1.F1)). We focus on the endo-atmospheric part of the flight trajectory and disregard the preceding deorbit maneuver. Specifically, we focus on the descent from the upper mesosphere down to the troposphere (altitude 93 km to 10 km). The vehicle speed at the start is 7378 m/s and decreases to approximately 150 m/s at the end, when a parachute can be deployed for landing. Atmospheric flight creates aerodynamic forces and moments that act on the vehicle. The main objective is to control the aerodynamic forces such that they result in a desired translational motion of the re-entry vehicle. In addition, the magnitude of the forces must remain limited to not endanger structural integrity of the vehicle. These objectives are addressed by tracking a specific angle of attack and bank angle. The angle of attack $\alpha$ describes the pitch attitude of the vehicle with respect to its direction of flight, while the bank angle $\mu$ describes the roll attitude of the vehicle with respect to this direction. Furthermore, the nose of the vehicle needs to be aligned with its direction of flight, i.e., the sideslip angle $\beta$ must remain small. Figure [1](https://arxiv.org/html/2606.31291#S1.F1) illustrates these angles. Attitude control must minimize the errors between the commanded and actual angles over the course of the trajectory. The vehicle in this study has two flaps at its tail that deflect to produce aerodynamic moments about the body-fixed axes $x_b$ and $y_b$, and thrusters arranged in a configuration to exert a moment about the axis $z_b$. The actions of attitude controllers therefore manipulate deflection commands to the aerodynamic control surfaces ($\delta_{\text{e,cmd}}$ for symmetric and $\delta_{\text{a,cmd}}$ for antisymmetric deflection) and a thruster command $\boldsymbol{\tau} = \left[\tau_x, \tau_y, \tau_z\right]^T$. The observations encompass command values for angle of attack ($\alpha_{\text{cmd}}$) and bank angle ($\mu_{\text{cmd}}$) as well as estimates of the actual angles $\alpha$, $\beta$, $\mu$, and body-fixed angular rates from the navigation filter.

### 2.2 Simulation Model

In order to train and evaluate the controllers, we rely on a simulation that encompasses the vehicle dynamics, actuator and sensor lags, atmosphere, as well as guidance, navigation, and control algorithms that serve as a baseline for this study. The models are briefly described next with additional details given in Appendix [D](https://arxiv.org/html/2606.31291#A4). The step size of the simulator is 1/140 s and it accounts for multi-rate subsystems and computational delay. For example, control algorithms are executed at 14 Hz.
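The two rates quoted above imply ten simulator steps per control update. The following sketch shows one way such a multi-rate loop could be organized. The zero-order hold between control updates, the toy plant, and the toy controller are assumptions for illustration, and the computational delay mentioned in the text is not modeled.

```python
from fractions import Fraction

SIM_STEP = Fraction(1, 140)  # simulator step size in seconds (from the text)
CONTROL_RATE_HZ = 14         # control algorithm rate in Hz (from the text)
STEPS_PER_CONTROL = int(1 / (SIM_STEP * CONTROL_RATE_HZ))  # 10 simulator steps


def run(n_sim_steps, controller, plant_step, state):
    """Advance the plant at the simulator rate and the controller at its own rate."""
    command = 0.0
    control_updates = 0
    for k in range(n_sim_steps):
        if k % STEPS_PER_CONTROL == 0:
            # The controller runs only on every tenth simulator step.
            command = controller(state)
            control_updates += 1
        # Between updates the last command is held (zero-order hold, an assumption).
        state = plant_step(state, command, float(SIM_STEP))
    return state, control_updates


# Toy first-order plant and proportional controller, purely illustrative.
final_state, updates = run(
    n_sim_steps=140,  # one second of simulated time
    controller=lambda x: -2.0 * x,
    plant_step=lambda x, u, dt: x + dt * u,
    state=1.0,
)
```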

A six-degree-of-freedom rigid-body flight dynamics model for a rotating, spherical earth is the core of the simulation (see, e.g., stevens2003aircraft; stengel2022flight). In addition to the notation from Figure [1](https://arxiv.org/html/2606.31291#S1.F1), the aerodynamic side force $Y$ (perpendicular to aerodynamic lift $L$ and drag $D$), and the vehicle mass $m$ are needed in the evolution of the flight path, defined in terms of velocity $V$, flight path angle $\gamma$ (vertical direction relative to the local horizontal), and course angle $\chi$ (horizontal direction relative to local north) as

$$
\begin{bmatrix}\dot{V}\\ \dot{\gamma}\\ \dot{\chi}\end{bmatrix}
=\frac{1}{m}
\begin{bmatrix}
-D-mg\sin\gamma\\
\left(L\cos\mu-Y\sin\mu-mg\cos\gamma\right)/V\\
\left(L\sin\mu+Y\cos\mu\right)/V
\end{bmatrix}.
\tag{1}
$$

The change of the aerodynamic angles $\alpha$, $\beta$, and $\mu$ is

$$
\begin{bmatrix}\dot{\mu}\\ \dot{\alpha}\\ \dot{\beta}\end{bmatrix}
=
\begin{bmatrix}
(\omega_x\cos\alpha+\omega_z\sin\alpha)/\cos\beta-\dot{\chi}\sin\gamma+\tan\beta\,(\dot{\gamma}\cos\mu+\dot{\chi}\cos\gamma\sin\mu)\\
\omega_y-\tan\beta\,(\omega_x\cos\alpha+\omega_z\sin\alpha)-(\dot{\gamma}\cos\mu+\dot{\chi}\cos\gamma\sin\mu)/\cos\beta\\
\omega_x\sin\alpha-\omega_z\cos\alpha+\dot{\chi}\cos\gamma\cos\mu-\dot{\gamma}\sin\mu
\end{bmatrix},
$$

where we denote angular rates about the body-fixed axes by $\boldsymbol{\omega}=[\omega_x,\omega_y,\omega_z]^T$. The rotational dynamics are described by $\dot{\boldsymbol{\omega}}=\boldsymbol{I}^{-1}\left(\boldsymbol{M}-\boldsymbol{\omega}\times\boldsymbol{I}\,\boldsymbol{\omega}\right)$, where $\boldsymbol{M}\in\mathbb{R}^3$ denotes the moments resulting from aerodynamics and thruster usage and $\boldsymbol{I}$ is the inertia tensor with respect to the center of mass. The aerodynamic forces and moments $D$, $L$, $Y$, $\boldsymbol{M}$ are complicated to model accurately and are, in general, nonlinear functions of the vehicle state and environmental conditions. We employ an industry-grade high-fidelity aerodynamics model that represents these dependencies in terms of tabulated coefficients that depend on the flight conditions. This is a standard approach in flight dynamics modeling (see, e.g., schmidt2011modern; stengel2022flight). For example, the lift is modeled as $L=\bar{q}\,S\,C_L(M\!a,\alpha,\beta,\boldsymbol{\omega},\boldsymbol{\delta})$. Here, $\bar{q}=\frac{1}{2}\rho V_a^2$ is the dynamic pressure, resulting from the velocity of the vehicle relative to air $V_a$ (i.e., $V$ corrected for wind) with the local air density $\rho$, and $S$ is the effective surface area of the vehicle. The coefficient $C_L$ depends on the Mach number $M\!a$ (the ratio of flight speed and local speed of sound) and the flap deflections $\boldsymbol{\delta}=[\delta_{\text{e}},\delta_{\text{a}}]^T$ in addition to the state variables $\alpha$, $\beta$, and $\boldsymbol{\omega}$. The simulation encompasses $M\!a\in[0.5,26.8]$ and $\bar{q}\in[50,5825]\,\mathrm{Pa}$, leading to considerable variations in the aerodynamic influence. The international standard atmosphere model is used to determine environmental conditions such as the density of air and speed of sound.
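To make the structure of these equations easier to follow, the sketch below evaluates the right-hand side of Equation (1), the aerodynamic angle rates, and the dynamic pressure for given inputs. It mirrors the formulas as stated in the text. The function names and the numerical values in the example are illustrative assumptions, and the coefficient $C_L$ is passed in as a plain number rather than looked up in a table.

```python
import math


def dynamic_pressure(rho, v_air):
    """q_bar = 0.5 * rho * V_a^2."""
    return 0.5 * rho * v_air**2


def lift(q_bar, area, c_l):
    """L = q_bar * S * C_L; in the paper C_L comes from tabulated coefficients."""
    return q_bar * area * c_l


def flight_path_rates(V, gamma, mu, L, D, Y, m, g):
    """Right-hand side of Equation (1): returns (V_dot, gamma_dot, chi_dot)."""
    v_dot = (-D - m * g * math.sin(gamma)) / m
    gamma_dot = (L * math.cos(mu) - Y * math.sin(mu) - m * g * math.cos(gamma)) / (m * V)
    chi_dot = (L * math.sin(mu) + Y * math.cos(mu)) / (m * V)
    return v_dot, gamma_dot, chi_dot


def aero_angle_rates(omega, alpha, beta, mu, gamma, gamma_dot, chi_dot):
    """Rates (mu_dot, alpha_dot, beta_dot) of the aerodynamic angles."""
    wx, wy, wz = omega
    roll_term = wx * math.cos(alpha) + wz * math.sin(alpha)
    path_term = gamma_dot * math.cos(mu) + chi_dot * math.cos(gamma) * math.sin(mu)
    mu_dot = (roll_term / math.cos(beta) - chi_dot * math.sin(gamma)
              + math.tan(beta) * path_term)
    alpha_dot = wy - math.tan(beta) * roll_term - path_term / math.cos(beta)
    beta_dot = (wx * math.sin(alpha) - wz * math.cos(alpha)
                + chi_dot * math.cos(gamma) * math.cos(mu) - gamma_dot * math.sin(mu))
    return mu_dot, alpha_dot, beta_dot


# Illustrative check: level flight without bank, lift balancing weight.
m, g, V = 1000.0, 9.81, 100.0
v_dot, gamma_dot, chi_dot = flight_path_rates(
    V=V, gamma=0.0, mu=0.0, L=m * g, D=500.0, Y=0.0, m=m, g=g)
mu_dot, alpha_dot, beta_dot = aero_angle_rates(
    omega=(0.0, 0.1, 0.0), alpha=0.3, beta=0.0, mu=0.0,
    gamma=0.0, gamma_dot=gamma_dot, chi_dot=chi_dot)
```

In this balanced case only drag changes the speed, the flight path stays level, and a pure pitch rate $\omega_y$ translates directly into a change of the angle of attack.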

Trajectory generation is a complex, mission-specific endeavor that requires precalculation and predictor-corrector algorithms (lu_predictor_2008; vernis_accurate_2011; speng_robust_2011). We consider a generic atmospheric re-entry mission with a simplified flight path guidance loop. One objective for the controller is to keep the sideslip angle small, $\beta\approx 0$, which implies $Y\approx 0$. For this case, Equation ([1](https://arxiv.org/html/2606.31291#S2.E1)) simplifies and provides insight into the principal mechanics behind guidance for re-entry. The gravitational force ($mg$) accelerates the vehicle during descent ($\gamma<0$), while the aerodynamic drag $D$ decelerates the vehicle. The change in $\gamma$ depends on the term $L\cos\mu$, which represents the upward component of the aerodynamic lift $L$ that opposes the gravitational force ($mg\cos\gamma$). Hence, the vertical direction of flight can be controlled by changing the magnitude of the bank angle $\mu$. However, the equation for $\chi$ reveals that this will also change the horizontal direction due to the sideward lift component $L\sin\mu$. We employ a simple guidance strategy: (1) The angle of attack determines the magnitude of lift and drag. The command value $\alpha_{\text{cmd}}$ issued by the guidance is a setpoint that ensures structural integrity of the vehicle, while resulting in sufficient deceleration along the trajectory. (2) The magnitude of the bank angle command $\mu_{\text{cmd}}$ is calculated by a feedback loop to track a predefined flight path angle $\gamma$, i.e., to offset the gravity component $mg\cos\gamma$ in Equation ([1](https://arxiv.org/html/2606.31291#S2.E1)). (3) Bank angle reversal is triggered, i.e., the sign of the command value $\mu_{\text{cmd}}$ is switched, whenever a specified heading angle deviation from the reference is exceeded. This strategy leads to sufficiently rich variations in the reference commands for the present study.
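Steps (2) and (3) of this guidance strategy can be sketched as a small stateful function. The feedback law, the gains, the limits, and the rule that a reversal only happens when the current bank direction increases the heading deviation are all assumptions for illustration. The text only states that a feedback loop sets the magnitude and that the sign flips when a heading deviation threshold is exceeded.

```python
from dataclasses import dataclass


@dataclass
class BankGuidance:
    """Hypothetical guidance loop for the bank angle command."""
    k_gamma: float           # assumed gain of the flight-path feedback
    mu_nominal: float        # assumed nominal bank magnitude in rad
    mu_max: float            # assumed upper limit on the bank magnitude in rad
    heading_deadband: float  # heading deviation that triggers a reversal, in rad
    sign: float = 1.0        # current sign of the bank command

    def command(self, gamma: float, gamma_ref: float, heading_error: float) -> float:
        # Step 2: a path steeper than the reference (gamma < gamma_ref) calls for
        # less bank, which increases the upward lift component L*cos(mu).
        magnitude = self.mu_nominal + self.k_gamma * (gamma - gamma_ref)
        magnitude = max(0.0, min(self.mu_max, magnitude))
        # Step 3: reverse the bank when the heading deviation exceeds the deadband
        # and the current bank direction would increase it further (assumption).
        if abs(heading_error) > self.heading_deadband and heading_error * self.sign > 0.0:
            self.sign = -self.sign
        return self.sign * magnitude


guidance = BankGuidance(k_gamma=2.0, mu_nominal=0.8, mu_max=1.4, heading_deadband=0.1)
c1 = guidance.command(gamma=-0.1, gamma_ref=-0.1, heading_error=0.0)  # no reversal
c2 = guidance.command(gamma=-0.1, gamma_ref=-0.1, heading_error=0.2)  # reversal
c3 = guidance.command(gamma=-0.1, gamma_ref=-0.1, heading_error=0.2)  # sign is kept
c4 = guidance.command(gamma=-0.2, gamma_ref=-0.1, heading_error=0.0)  # too steep
```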

### 2.3 Baseline Controller

We focus on the atmospheric control mode, which is more complex than exo-atmospheric attitude control. To provide a challenging baseline for benchmarking, a state-of-the-art control algorithm is included in the simulation. This baseline consists of separate control laws for the longitudinal motion ($\alpha\to\alpha_{\text{cmd}}$ tracking) and for the lateral-directional motion ($\mu\to\mu_{\text{cmd}}$ tracking and $\beta\to 0$ regulation). Longitudinal control uses a gain-scheduled proportional-integral-derivative (PID) law in combination with an inverse model $f$ of the flight dynamics in feedforward:

$$
\begin{aligned}
\delta_{\text{e}}(t) ={}& f(M\!a(t),\bar{q}(t),\alpha_{\text{cmd}}(t)) + k_p(M\!a(t),\bar{q}(t))\cdot(\alpha_{\text{cmd}}(t)-\alpha(t))\\
&+ k_i(M\!a(t),\bar{q}(t))\cdot\int_0^t \left(\alpha_{\text{cmd}}(s)-\alpha(s)\right)ds - k_d(M\!a(t),\bar{q}(t))\cdot\dot{\alpha}(t)
\end{aligned}
$$

This is a deterministic, time-varying control policy with continuous changes of the controller gains based on the current estimates of the Mach number $M\!a(t)$ and dynamic pressure $\bar{q}(t)$. Similarly, lateral-directional control laws are gain-scheduled PID for bank angle tracking and proportional-derivative (PD) control for sideslip angle regulation. The lateral-directional control laws involve crossfeeds and hence they are multivariable. The controller aims to minimize the error between the commanded and the estimated values of the angle of attack, bank angle, and sideslip angle. The gains for the control laws were selected based on the well-established pole-placement technique that seeks to match a closed-loop response with desired characteristics specified by a low-order transfer function reference model (see, e.g., enns1994dynamic; stevens2003aircraft). Designs were performed for 21 linear snapshot models obtained along the considered nominal flight trajectory. The control laws are implemented in terms of state feedback gain schedules with linear interpolation in between design grid points. The scheduling variables are calculated based on estimated Mach number and dynamic pressure (similar to ganet_ARD_2008).
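The longitudinal law with interpolated gain schedules can be sketched as follows. To keep the example short, it schedules the gains over a single scalar variable and uses a rectangle rule for the integral. Both are simplifying assumptions, as are the class name and the numerical values. The baseline described above schedules on Mach number and dynamic pressure over 21 design points.

```python
from bisect import bisect_right


def interp(x, grid, values):
    """Piecewise-linear interpolation, held constant outside the grid."""
    if x <= grid[0]:
        return values[0]
    if x >= grid[-1]:
        return values[-1]
    i = bisect_right(grid, x) - 1
    w = (x - grid[i]) / (grid[i + 1] - grid[i])
    return (1.0 - w) * values[i] + w * values[i + 1]


class ScheduledPID:
    """Gain-scheduled PID for angle-of-attack tracking (illustrative sketch)."""

    def __init__(self, grid, kp, ki, kd, dt):
        self.grid, self.kp, self.ki, self.kd, self.dt = grid, kp, ki, kd, dt
        self.integral = 0.0

    def update(self, sched, alpha_cmd, alpha, alpha_dot, feedforward=0.0):
        error = alpha_cmd - alpha
        self.integral += error * self.dt  # rectangle-rule integral of the error
        # The derivative term acts on alpha_dot itself, as in the control law above.
        return (feedforward
                + interp(sched, self.grid, self.kp) * error
                + interp(sched, self.grid, self.ki) * self.integral
                - interp(sched, self.grid, self.kd) * alpha_dot)


# Two hypothetical design points with made-up gains; dt matches the 14 Hz control rate.
pid = ScheduledPID(grid=[1.0, 5.0], kp=[2.0, 4.0], ki=[0.0, 0.0], kd=[1.0, 1.0], dt=1 / 14)
u = pid.update(sched=3.0, alpha_cmd=0.5, alpha=0.25, alpha_dot=0.1)
```

Halfway between the two design points the proportional gain is the average of the two scheduled values, so the command in this example is $3.0\cdot 0.25 - 1.0\cdot 0.1 = 0.65$.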

### 2.4 Reinforcement Learning for Continuous Control and Generalization of Policies

Deep Deterministic Policy Gradients (DDPG, lillicrap_continuous_2019) enabled using neural networks for continuous model-free RL. Since then, many off-policy RL algorithms for continuous control have been influenced by Soft Actor Critic (SAC, haarnoja_soft_2018) and DDPG's extension TD3 (fujimoto18a_td3). TD3 adds clipped double Q-learning, target policy smoothing, and delayed policy updates to DDPG. SAC is similar to TD3, but uses a stochastic policy in the maximum entropy RL framework. TD3 was extended to TD7 (Fujimoto2023_td7) and MR.Q (fujimoto2025_mrq). TD7 adds a state-action encoder, loss-adjusted prioritized experience replay (LAP, Fujimoto2020_lap), and checkpointing. Similarly, MR.Q uses model-based representation learning and LAP. SAC was extended to DroQ (hiraoka2022_droq), CrossQ (Bhatt2024_crossq), BRO (Nauman2024_bro), and SimbaV2 (lee2025_simbav2). Across various benchmarks for continuous control, model-free RL (Nauman2024_bro; fujimoto2025_mrq; lee2025_simbav2) currently seems to outperform model-based RL, i.e., DreamerV3 (Hafner2025_dreamerv3) and TD-MPC2 (hansen2024tdmpc).
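Two of the TD3 ingredients named above, target policy smoothing and clipped double Q-learning, both enter the computation of the critic's learning target. The sketch below shows that computation for a scalar action with plain Python callables standing in for the target networks. The hyperparameter defaults are commonly used values and should be read as assumptions. Delayed policy updates are not shown.

```python
import random


def clip(x, low, high):
    """Limit x to the closed interval [low, high]."""
    return max(low, min(high, x))


def td3_target(reward, done, next_state, target_policy, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               a_low=-1.0, a_high=1.0, rng=random):
    """Critic target of TD3 for one transition with a scalar action."""
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_policy(next_state) + noise, a_low, a_high)
    # Clipped double Q-learning: use the smaller of the two target critic values.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    # No bootstrapping from terminal transitions (done = 1).
    return reward + gamma * (1.0 - done) * q_min


# Stand-in target networks, purely illustrative.
policy = lambda s: 0.5
q1 = lambda s, a: 1.0 + a
q2 = lambda s, a: 2.0
y = td3_target(1.0, 0.0, None, policy, q1, q2, noise_std=0.0)
y_terminal = td3_target(1.0, 1.0, None, policy, q1, q2, noise_std=0.0)
```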

To obtain robust policies and enable sim\-to\-real transfer, dynamics randomization has been used in robotics\(Antonova2017;Peng2018;Tan2018;OpenAI2020\)\. In its simplest form, dynamics randomization varies parameters of the dynamics in the simulation randomly during training without informing the RL agent\. The agent has to learn a policy that is robust under these unknown variations while only relying on observations\.

We have a smoothly varying parametrization of the dynamic properties over which we want to generalize, e.g., actuation characteristics. Hence, we adopt the formalism of Contextual Markov Decision Processes (CMDPs, Hallak2015Contextual), in which the variation of tasks is described as a tuple $\left(\mathcal{C}, \mathcal{S}, \mathcal{A}, \mathcal{M}(c)\right)$, with context space $\mathcal{C}$, state space $\mathcal{S}$, action space $\mathcal{A}$, and a function $\mathcal{M}(c) = \left(\mathcal{S}, \mathcal{A}, P_c(s'|s,a), R_c(r|s,a,s'), \mu_c(s_0)\right)$ mapping contexts $c \in \mathcal{C}$ to individual Markov Decision Processes (MDPs) with context-dependent state transition probabilities $P_c$, reward function $R_c$, and initial state distribution $\mu_c$. In our case, the agent only has access to observations that depend on context and previous states, $o_t = \phi(c, s_t, s_{t-1}, \ldots)$, and it is supposed to select actions $a \in \mathcal{A}$. Our goal is to find a policy $\pi$ that maximizes the objective function $\frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} J(\mathcal{M}(c), \pi)$, where $J(\mathcal{M}, \pi)$ is the expected return of policy $\pi$ in task $\mathcal{M}$. CMDPs allow us to define distinct sets of contexts for training and testing to evaluate zero-shot generalization (Kirk2023).

Going one step further, we want to select task variations intelligently during training to improve sample efficiency and final performance. Task scheduling approaches have been proposed for multi-task RL (Sharma2018AMT; Cho2024) and contextual RL (Fabisch2014). Fabisch2014 frame the task selection problem as a non-stationary multi-armed bandit problem and solve it via discounted upper-confidence bound (D-UCB, Kocsis2006; Garivier2011). In this framework, they compare task selection strategies based on the episodic return: easiest tasks first (best reward), hardest task first, and focus on tasks with largest improvement in returns (e.g., monotonic progress). We call this active multi-task training (AMT). Scheduled multi-task training (SMT, Cho2024) focuses on solving hard tasks first.
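To make the bandit view of task selection concrete, the following Python sketch shows one common form of discounted UCB: per-task statistics decay by a factor `gamma` in every round, and the task with the highest discounted mean plus exploration bonus is selected. The class name `DiscountedUCB`, the default values of `gamma` and `xi`, and the assumption that bandit rewards are scaled to `[0, reward_bound]` are illustrative choices, not the settings used in the cited works or in our experiments.

```python
import math


class DiscountedUCB:
    """Non-stationary bandit over tasks with exponentially discounted statistics."""

    def __init__(self, n_tasks, gamma=0.95, xi=0.6, reward_bound=1.0):
        self.gamma, self.xi, self.bound = gamma, xi, reward_bound
        self.counts = [0.0] * n_tasks  # discounted number of selections
        self.sums = [0.0] * n_tasks    # discounted sum of bandit rewards

    def select(self):
        # Try every task once before relying on the confidence bound.
        for task, count in enumerate(self.counts):
            if count == 0.0:
                return task
        log_total = max(0.0, math.log(sum(self.counts)))

        def score(task):
            mean = self.sums[task] / self.counts[task]
            bonus = 2.0 * self.bound * math.sqrt(self.xi * log_total / self.counts[task])
            return mean + bonus

        return max(range(len(self.counts)), key=score)

    def update(self, task, reward):
        # Discount all statistics, then add the newest observation.
        self.counts = [self.gamma * c for c in self.counts]
        self.sums = [self.gamma * s for s in self.sums]
        self.counts[task] += 1.0
        self.sums[task] += reward
```

The bandit reward determines the strategy: passing a normalized episodic return prefers easy tasks, while passing its negation or an estimate of recent improvement would prefer hard tasks or tasks with learning progress, respectively.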

##### Action space (see also Appendix [A](https://arxiv.org/html/2606.31291#A1))

The vehicle uses its flaps to control pitch and roll motion, and thrusters for yaw motion. We compare several architectures for attitude control. The simplest learned controller (Only RL) directly outputs yaw thrust and changes of the flap angles. The controller described in Section [2.3](https://arxiv.org/html/2606.31291#S2.SS3) serves as a baseline for evaluation, but also as a building block in two hybrid controllers. In additive hybrid control, the output of the RL policy network is added to the output of the baseline controller (also known as residual RL, Johannink2019). A more advanced hybrid control approach uses the policy for gain scheduling in the baseline controller, i.e., we compute twelve continuous gain factors $f_{\text{gain}}$ to scale gains in the baseline control laws. We represent these multipliers in decibels, in accordance with classical control convention, i.e., we transform each policy output by $f_{\text{gain}}(x) = 10^{x/20}$ with $x \in \left[-6, 6\right]$, such that a policy output of $x = 0$ results in a factor of 1. The maximum is $f_{\text{gain}}(6) \approx 2$, and the minimum is $f_{\text{gain}}(-6) \approx 0.5$.
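The decibel mapping can be checked numerically with a short Python sketch. The function name `gain_factor` and the clipping of out-of-range inputs are assumptions made for illustration.

```python
def gain_factor(x, limit_db=6.0):
    """Map a policy output in decibels to a multiplicative gain factor."""
    x = max(-limit_db, min(limit_db, x))  # assumed clipping to [-6, 6] dB
    return 10.0 ** (x / 20.0)


# Twelve policy outputs scale twelve baseline gains (values are hypothetical).
baseline_gains = [1.5] * 12
policy_outputs = [0.0] * 12
scaled_gains = [k * gain_factor(x) for k, x in zip(baseline_gains, policy_outputs)]
```

With all policy outputs at 0 dB, the scaled gains equal the baseline gains, so the hybrid controller starts from the behavior of the baseline.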

##### Observation space (see also Appendix [B](https://arxiv.org/html/2606.31291#A2))

The observable state of the vehicle includes the aerodynamic angles ($\alpha, \beta, \mu$) and angular velocities $\boldsymbol{\omega}$ (see Section [2](https://arxiv.org/html/2606.31291#S2)). In addition, we add the commanded aerodynamic angles ($\alpha_{\text{cmd},t}, \beta_{\text{cmd},t}, \mu_{\text{cmd},t}$), the previously commanded flap deflections ($\delta_{e,\text{cmd},t-1}, \delta_{a,\text{cmd},t-1}$), and, for hybrid control, also the current baseline control command to the observation space. To mimic the structure of a conventional PID error-feedback controller, we add for each aerodynamic angle the current error, its derivative, and its integral to the observation space. This improves the ability of a feedforward policy network in this partially observable MDP. The dynamics of the environment are influenced by air density and speed of the vehicle, which determine the dynamic pressure. We measure the altitude, velocity, and dynamic pressure and provide them as observations to the controller. Observations are normalized to the range $\left[-1, 1\right]$.

##### Reward function (see also Appendix [C](https://arxiv.org/html/2606.31291#A3))

Our main objective is to minimize the attitude cost

$$
c(t) = \left(\frac{\alpha_{\text{cmd}}(t) - \alpha(t)}{\alpha_{\text{range}}}\right)^{2} + \left(\frac{\beta_{\text{cmd}}(t) - \beta(t)}{\beta_{\text{range}}}\right)^{2} + \left(\frac{\mu_{\text{cmd}}(t) - \mu(t)}{\mu_{\text{range}}}\right)^{2},
$$

which we define based on commands and estimates of the aerodynamic angles (in radians). The individual ranges allow us to define a desired range for each error. This approach is known as *Bryson's rule* in classical linear quadratic optimal control. We focus the attitude cost on the angle of attack $\alpha$ with $\alpha_{\text{range}} = 2^{\circ}$, $\beta_{\text{range}} = 10^{\circ}$, $\mu_{\text{range}} = 10^{\circ}$.

However, taking the negative attitude cost as a reward has the drawback that the reward is always negative, which makes early termination of an episode by running into failure states appealing\. We solve this problem by defining a strictly positive reward

$$
\text{attitude reward}(t) = \max\left(0.001, \exp(-w_c \cdot c(t))\right) \in \left[0.001, 1\right]
$$

with weight $w_c = 2$. Since the attitude reward is at least 0.001, it discourages the agent from terminating early and shifts focus to the main objective.

We combine the attitude reward with a control cost to form the reward function in each step $t'$ as $r_{t'} = \text{attitude reward}(t) - \text{control cost}(t)$, where the control cost is the sum of thruster and flap costs

$$
\text{control cost}(t) = w_{\tau_z} \left(\frac{\tau_z(t)}{\tau_{z,\max}}\right)^{2} + w_{\delta_e} \left(\frac{\Delta\delta_{e,\text{cmd}}(t)}{\Delta\delta_{e,\max}}\right)^{2} + w_{\delta_a} \left(\frac{\Delta\delta_{a,\text{cmd}}(t)}{\Delta\delta_{a,\max}}\right)^{2},
$$

with weights $w_{\tau_z} = 1$ and $w_{\delta_e} = w_{\delta_a} = 0.05$. The difference between successive flap commands $\Delta\delta_{\circ,\text{cmd}}(t) = \delta_{\circ,\text{cmd}}(t) - \delta_{\circ,\text{cmd}}(t-1)$ has a maximum of $\Delta\delta_{\circ,\max} = \Delta t \cdot \dot{\delta}_{\circ,\max} = \frac{1}{14} \cdot 15^{\circ} \approx 1.07^{\circ}$. The yaw torque $\tau_z(t)$ generated by the thrusters is limited by $\tau_{z,\max} = 300\,\mathrm{N}$.
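The reward terms defined above can be combined in a few lines of Python. This is a sketch for checking the numbers, not the training code: the function names are hypothetical, angle errors are expected in radians, and the torque limit is used as the plain number 300 as stated in the text.

```python
import math

ALPHA_RANGE = math.radians(2.0)
BETA_RANGE = math.radians(10.0)
MU_RANGE = math.radians(10.0)
W_C = 2.0                                 # weight of the attitude cost
W_TAU, W_FLAP = 1.0, 0.05                 # control cost weights
TAU_MAX = 300.0                           # yaw torque limit
D_DELTA_MAX = math.radians(15.0 / 14.0)   # maximum flap command change per step


def attitude_cost(e_alpha, e_beta, e_mu):
    return ((e_alpha / ALPHA_RANGE) ** 2
            + (e_beta / BETA_RANGE) ** 2
            + (e_mu / MU_RANGE) ** 2)


def attitude_reward(cost):
    # Strictly positive, so early termination never pays off.
    return max(0.001, math.exp(-W_C * cost))


def control_cost(tau_z, d_delta_e, d_delta_a):
    return (W_TAU * (tau_z / TAU_MAX) ** 2
            + W_FLAP * (d_delta_e / D_DELTA_MAX) ** 2
            + W_FLAP * (d_delta_a / D_DELTA_MAX) ** 2)


def reward(e_alpha, e_beta, e_mu, tau_z, d_delta_e, d_delta_a):
    cost = attitude_cost(e_alpha, e_beta, e_mu)
    return attitude_reward(cost) - control_cost(tau_z, d_delta_e, d_delta_a)
```

An angle-of-attack error equal to its range of 2° alone yields an attitude cost of 1 and therefore an attitude reward of $\exp(-2) \approx 0.135$, whereas saturating all three actuators costs $1 + 0.05 + 0.05 = 1.1$.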

##### Context space (see also Appendix [D](https://arxiv.org/html/2606.31291#A4))

For dynamics randomization, we parametrize the simulation with the mass of the vehicle $m_0 \in \left[1312, 1968\right]\,\mathrm{kg}$, fractions of the principal moments of inertia $\boldsymbol{f} \in \left[0.9, 1.1\right]^{3}$, misalignment of the principal axis by means of rotation $\boldsymbol{\omega} \in \{\omega = \theta\hat{\omega} \mid \hat{\omega} \in S^{2}, \theta \in \left[-10^{\circ}, 10^{\circ}\right]\} \subset \mathbb{R}^{3}$, and flap actuator bandwidth $\omega_0 \in \left[12, 30\right]\,\mathrm{rad/s}$. Hence, the parameter space is $\mathcal{C} \subset \mathbb{R}^{8}$. Nominal conditions correspond to $c_{\text{nominal}} = \left(1640, 1, 1, 1, 0, 0, 0, 30\right)^{T}$.
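A context vector from this parameter space could be drawn as in the following Python sketch. It assumes independent uniform sampling of each quantity, a rotation axis drawn uniformly from the unit sphere by normalizing a Gaussian vector, and rotation components stored in radians; none of these details are fixed by the description above, and `sample_context` is a hypothetical name.

```python
import math
import random


def sample_context(rng):
    """Draw one 8-dimensional context: mass, 3 inertia factors, rotation vector, bandwidth."""
    mass = rng.uniform(1312.0, 1968.0)                        # kg
    inertia_factors = [rng.uniform(0.9, 1.1) for _ in range(3)]
    axis = [rng.gauss(0.0, 1.0) for _ in range(3)]            # direction on the sphere
    norm = math.sqrt(sum(a * a for a in axis))
    theta = math.radians(rng.uniform(-10.0, 10.0))            # misalignment angle
    rotation = [theta * a / norm for a in axis]
    bandwidth = rng.uniform(12.0, 30.0)                       # rad/s
    return [mass, *inertia_factors, *rotation, bandwidth]


context = sample_context(random.Random(0))
```

Drawing a fresh context at the start of every episode corresponds to plain dynamics randomization, while drawing a fixed set once yields discrete tasks for scheduling.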

## 4Experiments

In our experiments, we want to answer the following research questions:

1. Can we apply deep RL algorithms to spacecraft attitude control? (Section [4.1.1](https://arxiv.org/html/2606.31291#S4.SS1.SSS1))
2. What is the best achievable performance with pure RL? (Section [4.1.1](https://arxiv.org/html/2606.31291#S4.SS1.SSS1))
3. Which control architecture works best? (Section [4.1.2](https://arxiv.org/html/2606.31291#S4.SS1.SSS2))
4. How well do controllers generalize to out-of-distribution conditions? (Section [4.1.3](https://arxiv.org/html/2606.31291#S4.SS1.SSS3))
5. Can we enforce generalization with domain randomization or task scheduling? (Section [4.2.1](https://arxiv.org/html/2606.31291#S4.SS2.SSS1))
6. Which is the most robust controller? (Section [4.2.2](https://arxiv.org/html/2606.31291#S4.SS2.SSS2))

##### Evaluation of RL algorithms:

Following Agarwal2021\_statistical\_precipe, we analyze the performance of RL algorithms with sample-efficiency curves using the interquartile mean (IQM) and bootstrapped 95% confidence intervals of the cumulative episode reward, i.e., undiscounted return. For direct comparison of policies, we compute the probability of improvement.
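For readers unfamiliar with these statistics, the following Python sketch shows a simplified version of each. It uses a plain percentile bootstrap over a flat list of returns, whereas the cited methodology uses stratified bootstrapping over runs and tasks; all function names are hypothetical.

```python
import random


def iqm(values):
    """Interquartile mean: mean of the middle 50% of the sorted values."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval of the IQM."""
    rng = random.Random(seed)
    stats = sorted(
        iqm([rng.choice(values) for _ in values]) for _ in range(n_resamples)
    )
    lower = stats[int((1.0 - level) / 2.0 * n_resamples)]
    upper = stats[int((1.0 + level) / 2.0 * n_resamples) - 1]
    return lower, upper


def probability_of_improvement(returns_x, returns_y):
    """Estimate P(X > Y) over all pairs of runs; ties count as one half."""
    wins = sum((x > y) + 0.5 * (x == y) for x in returns_x for y in returns_y)
    return wins / (len(returns_x) * len(returns_y))
```

The IQM discards the lowest and highest quarter of the runs, which makes it less sensitive to single diverged runs than the mean while using more data than the median.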

##### Application\-specific performance metrics:

Key metrics for attitude control are the absolute errors of the angle of attack $|e_\alpha(t)| = |\alpha_{\text{cmd}}(t) - \alpha(t)|$, sideslip angle $|e_\beta(t)| = |\beta_{\text{cmd}}(t) - \beta(t)|$, and bank angle $|e_\mu(t)| = |\mu_{\text{cmd}}(t) - \mu(t)|$. We prioritize the angle of attack to maintain structural and thermal integrity of the vehicle and create sufficient drag to decelerate. We want to avoid large errors, while small errors over a longer period of time are allowed. Hence, we analyze the error distribution and report percentiles. As secondary metrics, we report the control costs in all three action components. For multiple test contexts, we furthermore report the success rate, which is the percentage of contexts in which the controller was able to follow the trajectory completely. Episodes terminate when any angle is outside of its safe domain, which is defined based on the aerodynamic angles: $\alpha \in \left[0^{\circ}, 60^{\circ}\right]$, $\beta \in \left[-20^{\circ}, 20^{\circ}\right]$, and $\mu \in \left[-90^{\circ}, 90^{\circ}\right]$.
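These metrics are simple to compute from logged trajectories, as the following Python sketch indicates. The helper names are hypothetical, angles are in degrees, and percentiles use linear interpolation between order statistics, which is one of several common conventions and may differ from the one used for the reported numbers.

```python
import math

SAFE_DOMAIN_DEG = {"alpha": (0.0, 60.0), "beta": (-20.0, 20.0), "mu": (-90.0, 90.0)}


def in_safe_domain(alpha, beta, mu):
    """Episodes terminate as soon as this returns False."""
    angles = {"alpha": alpha, "beta": beta, "mu": mu}
    return all(lo <= angles[k] <= hi for k, (lo, hi) in SAFE_DOMAIN_DEG.items())


def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def error_report(commanded, measured):
    """Median and upper percentiles of the absolute tracking error."""
    errors = [abs(c - m) for c, m in zip(commanded, measured)]
    return {q: percentile(errors, q) for q in (50, 90, 95, 98)}


def success_rate(completed):
    """Percentage of test contexts in which the full trajectory was followed."""
    return 100.0 * sum(completed) / len(completed)
```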

### 4\.1Training Under Nominal Conditions

We start episodes with constant initial attitude, initial mass, inertia, and flap actuator bandwidth, but randomize parameters of the trajectory to prevent overfitting the trajectory (see Appendix [D](https://arxiv.org/html/2606.31291#A4) for details). We analyze which control architecture and RL algorithm perform best and what the maximum achievable performance is. Since we do not vary dynamics during training, we can test out-of-distribution generalization of the learned controllers.

#### 4\.1\.1RL Algorithm Selection

![Refer to caption](https://arxiv.org/html/2606.31291v1/x1.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x2.png)

Figure 2: MR.Q surpasses the baseline controller. Learning curves with interquartile means (IQM) and 95% bootstrap confidence intervals for 10 training seeds (Agarwal2021\_statistical\_precipe).

We initially selected SAC and TD3 to explore the feasibility of RL-based control. We tried CrossQ as an extension of SAC as well as TD7 and MR.Q as extensions of TD3. In this study, we focus on TD3-based algorithms as they showed consistent behavior in preliminary experiments without task-specific tuning. Hence, we evaluate only SAC, TD3, TD7, and MR.Q. Hyperparameters are listed in Appendix [E](https://arxiv.org/html/2606.31291#A5).

![Refer to caption](https://arxiv.org/html/2606.31291v1/x3.png)

(a) Probability of improvement (Agarwal2021\_statistical\_precipe).

![Refer to caption](https://arxiv.org/html/2606.31291v1/x4.png)

(b) Error distribution (kernel density estimation) for angle of attack. Density is plotted on a logarithmic scale to highlight the tails. The peak of the baseline controller around 0 is more pronounced on a linear scale.

Figure 3: Comparison under nominal conditions.

With the learning curves in Figure [2](https://arxiv.org/html/2606.31291#S4.F2), we evaluate the performance and sample efficiency of RL algorithms and control architectures during training. We evaluate combinations of learning algorithm and control architecture with 10 random seeds for training and 10 evaluation episodes without exploration noise. For comparison, we plot the performance of no control and the baseline controller averaged over 10 evaluation episodes. Individual learning curves per training run are in Appendix [M](https://arxiv.org/html/2606.31291#A13).

MR.Q excels without task-specific tuning. The main reasons for this are the multi-step return, which reduces bias in the target value for the critic loss, and the model-based representation (detailed analysis in Appendix [G](https://arxiv.org/html/2606.31291#A7)). Most algorithms perform better in the hybrid control architecture than their non-hybrid counterpart, with the exception of MR.Q. For the non-obvious cases, we compare the best obtained policies per run with the probability of improvement and a 95% confidence interval in Figure [3(a)](https://arxiv.org/html/2606.31291#S4.F3.sf1). We attribute MR.Q's advantage over the baseline controller to the fact that MR.Q learns to continuously adapt to environment conditions, which is more efficient than the gain scheduling of the baseline. In particular, the error distribution for our primary control objective, the angle of attack, is narrower for the learned controller than for the baseline (see Figure [3(b)](https://arxiv.org/html/2606.31291#S4.F3.sf2); see also Appendix [J](https://arxiv.org/html/2606.31291#A10) for obtained trajectories and control commands). However, the baseline has a higher concentration of errors close to 0. Hence, MR.Q works without task-specific tuning and is able to achieve a higher return than the baseline controller by tracking the angle of attack more accurately.

#### 4\.1\.2Comparison of Control Architectures

![Refer to caption](https://arxiv.org/html/2606.31291v1/x5.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x6.png)

Figure 4: Comparisons of control architectures.

With MR.Q, we compare control architectures (see Figure [4](https://arxiv.org/html/2606.31291#S4.F4)). Our simplest RL-based controller maps from observations to flap changes and thruster commands. We compare it to additive and gain-scheduling hybrid control. In both hybrid architectures, the baseline controller's command is in the observation space.

For maximum performance, pure RL or additive hybrid control seem to be the best options under nominal evaluation conditions. Although the best possible performance of the gain-scheduling hybrid controller is lower than that of the two other architectures, the bounded influence of the policy makes learning considerably more stable across runs and over time, and the performance is still better than the baseline.

#### 4\.1\.3Out\-Of\-Distribution Generalization of Policies Trained Under Nominal Conditions

![Refer to caption](https://arxiv.org/html/2606.31291v1/x7.png)


![Refer to caption](https://arxiv.org/html/2606.31291v1/x8.png)

(a) Performance evaluated under variations of $\boldsymbol{I}$. On the left, we scale the principal moments of inertia along three axes (from top to bottom: x-, y-, z-axis). On the right, we rotate $\boldsymbol{I}$ about the three basis vectors.

![Refer to caption](https://arxiv.org/html/2606.31291v1/x9.png)

(b) Performance with respect to changes of the initial attitude without training for it. Performance is measured at the initial attitude during training $+\,\epsilon$ with $\epsilon \in \left[-10, -5, -2, -1, 0, 1, 2, 5, 10\right]$ degrees per component.

![Refer to caption](https://arxiv.org/html/2606.31291v1/x10.png)

(c) Performance evaluated under variations of flap actuator bandwidth.

Figure 5: Generalization of policies trained under nominal conditions. The solid lines show median performance over training runs and shaded areas the $\left[5, 95\right]$-percentile interval.

To analyze generalization of the standard RL training paradigm with random exploration in action space, we test robustness with respect to variations that were not present during training. We compare the undiscounted return of an episode for the best obtained policies per run (see Figure [5](https://arxiv.org/html/2606.31291#S4.F5)).

Inertia tensor (Figure [5(a)](https://arxiv.org/html/2606.31291#S4.F5.sf1)): The inertia tensor $\boldsymbol{I}$ has six degrees of freedom. To systematically vary it, we multiply the principal moments of inertia $I_{xx}, I_{yy}, I_{zz}$ by factors from $\left[0.7, 1.3\right]$ and apply rotations about the basis vectors with angles from $\left[-30^{\circ}, 30^{\circ}\right]$. Rotation of $\boldsymbol{I}$ about the z-axis has the most impact and the baseline controller is more robust than the learned controllers to it.
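One way to realize such a variation is to scale the diagonal tensor of principal moments and then apply a similarity transform $\boldsymbol{R}\,\boldsymbol{I}\,\boldsymbol{R}^{T}$ with a rotation matrix $\boldsymbol{R}$. The Python sketch below illustrates this under the assumption that scaling is applied before rotation; the function names and the moments in the usage note are hypothetical.

```python
import math


def rotation_matrix(axis, angle_deg):
    """Rotation about one of the basis vectors."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    if axis == "x":
        return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
    if axis == "y":
        return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    if axis == "z":
        return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    raise ValueError(f"unknown axis: {axis}")


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def vary_inertia(principal_moments, factors, axis, angle_deg):
    """Scale the principal moments, then rotate the tensor: R * diag * R^T."""
    scaled = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        scaled[i][i] = principal_moments[i] * factors[i]
    r = rotation_matrix(axis, angle_deg)
    return matmul(matmul(r, scaled), transpose(r))
```

For instance, `vary_inertia([100.0, 200.0, 300.0], [1.0, 1.0, 1.0], "z", 90.0)` swaps the first two principal moments. The rotation leaves the trace unchanged and keeps the tensor symmetric, but introduces products of inertia that couple the axes.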

Initial attitude (Figure [5(b)](https://arxiv.org/html/2606.31291#S4.F5.sf2)): We vary the initial values of the aerodynamic angles. Although policies trained on constant initial attitude outperform the baseline under nominal conditions, they do not generalize well. Even for a small envelope around nominal conditions, we cannot guarantee stability without explicit testing, which demonstrates that action noise is not enough to learn policies that are robust against deviations from nominal conditions. However, the additive hybrid controller's performance deteriorates less drastically in most out-of-distribution cases than pure RL, and the gain-scheduling hybrid controller is almost as reliable as the baseline.

Flap actuator bandwidth (Figure [5(c)](https://arxiv.org/html/2606.31291#S4.F5.sf3)): Similarly, we test generalization to variations of the flap actuator bandwidth. Lower values make the control problem harder and simulate failures of the actuation. Nominal conditions correspond to $30\,\mathrm{rad/s}$. Hence, all of the tested values are out of distribution. The baseline controller's performance drastically drops for actuator bandwidths below $14\,\mathrm{rad/s}$. Interestingly, completely learned controllers are consistently more robust to lower values. We suspect that random exploration in action space during training makes the resulting policy robust to this scenario, although the exact stable range cannot be guaranteed. Both hybrid control architectures show more variation, but their median performance is similar to the baseline.

The bounded influence of the gain\-scheduling hybrid controller makes its behavior more similar to the baseline with respect to variations of the initial conditions\. However, the stability of learned controllers cannot be guaranteed in the same range as for the baseline controllers when we do not explicitly enforce the operational envelope during training\. Although MR\.Q is more robust to actuation failures than the baseline controller, it is less stable with respect to variations of the inertia tensor and considerably less stable under variations of the initial attitude than the baseline controller\.

### 4\.2Training Under Dynamics Randomization

We randomize not just the trajectories, but also the initial attitude to ensure that the RL\-based controllers generalize to these variations\. Furthermore, we systematically vary the following dynamics parameters: initial mass, inertia tensor, and flap actuator bandwidth \(see Appendix[D](https://arxiv.org/html/2606.31291#A4)for details\)\. Each combination of dynamics parameters defines a context\. We compare task scheduling approaches to select contexts during training and evaluate generalization of the obtained controllers\.

To test generalization and robustness of controllers, we uniformly sample a test set $\mathcal{C}_{\text{test}}$ of 100 contexts and measure the obtained return in each test context. We select the best controllers per training procedure for this analysis based on the minimum return in a smaller test set of 30 contexts from regularly saved checkpoints per run (checkpoint interval: 200k time steps).
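The checkpoint selection rule amounts to a max-min criterion, which the following Python sketch illustrates. The function name and the data layout (a dictionary from training step to the list of returns in the validation contexts) are assumptions for illustration.

```python
def select_checkpoint(returns_per_checkpoint):
    """Pick the checkpoint whose worst-case return over the contexts is highest."""
    return max(
        returns_per_checkpoint,
        key=lambda step: min(returns_per_checkpoint[step]),
    )


# Hypothetical returns of two checkpoints in three validation contexts.
example = {200_000: [1.0, 5.0, 9.0], 400_000: [4.0, 4.0, 4.0]}
best_step = select_checkpoint(example)
```

In this example the later checkpoint is selected, although the earlier one has a higher mean return, because its worst context is handled better.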

#### 4\.2\.1Comparison of Task Scheduling Methods for Dynamics Randomization

We sample $|\mathcal{C}_{\text{train}}| = 50$ context vectors that define individual tasks. We compare SMT, which prefers hard tasks, and active task selection with the best-reward (AMT-B, prefers easy tasks) and monotonic progress (AMT-M, prefers tasks with high expected learning progress) strategies. Hyperparameters are listed in Appendix [F](https://arxiv.org/html/2606.31291#A6). We use continuous uniform random sampling of contexts (dynamics randomization, DR) and round robin (RR) as baselines for task scheduling.

![Refer to caption](https://arxiv.org/html/2606.31291v1/x11.png)

Figure 6: Learning curves of three training runs with DR and pure RL with MR.Q. Lines indicate median performance over 30 test contexts and shaded areas show the full range of returns.

Task scheduling methods do not perform better than dynamics randomization or round robin. Since the results are inconclusive, we report them in Appendix [L](https://arxiv.org/html/2606.31291#A12). Among task selection strategies, focusing on hard tasks in the beginning (SMT) is the worst strategy. Focusing on the easiest tasks first (AMT-B) and focusing on tasks with the maximum expected learning progress (AMT-M) work equally well and are comparable to the baselines DR and RR with no clear advantage.

Training with task scheduling is difficult for two reasons: (1) training on the full range of flap actuator bandwidths creates some problems that are considerably harder to solve, and (2) as we use a replay buffer for each task, infrequently selected tasks might contain transitions of low-performing policies that are still selected relatively often, since we first sample the task uniformly and then the samples within the task for the batch that is required in the network updates. We mitigate the first issue by sampling flap actuator bandwidths from $\left[25, 30\right]\,\mathrm{rad/s}$, but still evaluate on the full range. The second problem causes instabilities after about 4M steps, which we avoid by stopping after 5M steps. A more detailed analysis is required to understand the problem and find solutions.
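The second issue can be reproduced with a toy example in Python. The sketch assumes that one task is drawn uniformly per batch and that all transitions of the batch come from that task's buffer; the class name `PerTaskReplay` and the transition format are hypothetical.

```python
import random


class PerTaskReplay:
    """One replay buffer per task; batches are drawn task-first."""

    def __init__(self, n_tasks, seed=0):
        self.buffers = [[] for _ in range(n_tasks)]
        self.rng = random.Random(seed)

    def add(self, task, transition):
        self.buffers[task].append(transition)

    def sample(self, batch_size):
        # Uniform over tasks with data, regardless of how much data each holds.
        candidates = [buffer for buffer in self.buffers if buffer]
        task_buffer = self.rng.choice(candidates)
        return [self.rng.choice(task_buffer) for _ in range(batch_size)]


replay = PerTaskReplay(n_tasks=2)
for i in range(100):
    replay.add(0, ("fresh", i))   # frequently selected task
replay.add(1, ("stale", 0))       # rarely selected task with one old transition
batches = [replay.sample(8) for _ in range(2000)]
stale_batches = sum(batch[0][0] == "stale" for batch in batches)
```

Although the rarely selected task contributes only one of 101 stored transitions, it supplies about half of the batches, so outdated data from early, low-performing policies keeps a large weight in the updates.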

#### 4\.2\.2Generalization and Robustness of Best Controllers

Table 1: Evaluation of best controllers with domain-specific metrics. Absolute errors are given in degrees. Bold marks the best values and italics mark values better than the baseline (shown as cell background colors in the original). Appendix [K](https://arxiv.org/html/2606.31291#A11) contains more details on error distributions and Appendix [L](https://arxiv.org/html/2606.31291#A12) contains all results with control costs.

| Training | Seed | Success | $\lvert e_\alpha\rvert$ Median | $\lvert e_\alpha\rvert$ 90-Perc. | $\lvert e_\alpha\rvert$ 95-Perc. | $\lvert e_\alpha\rvert$ 98-Perc. | $\lvert e_\beta\rvert$ Median | $\lvert e_\beta\rvert$ 90-Perc. | $\lvert e_\beta\rvert$ 95-Perc. | $\lvert e_\beta\rvert$ 98-Perc. | $\lvert e_\mu\rvert$ Median | $\lvert e_\mu\rvert$ 90-Perc. | $\lvert e_\mu\rvert$ 95-Perc. | $\lvert e_\mu\rvert$ 98-Perc. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | - | 79% | **0.05** | 0.46 | 0.67 | 2.96 | **0.02** | **0.20** | **0.36** | **0.57** | **0.06** | 0.72 | 1.43 | 2.35 |
| Add. Hyb. | 0 | *93%* | **0.05** | **0.23** | **0.34** | **0.59** | 0.07 | 0.25 | 0.41 | 0.65 | 0.17 | **0.42** | **0.59** | **0.87** |
| Add. Hyb. | 1 | *93%* | 0.07 | *0.37* | *0.56* | *0.83* | 0.07 | 0.30 | 0.53 | 0.85 | 0.15 | *0.60* | *0.84* | *1.11* |
| Add. Hyb. | 2 | *94%* | 0.07 | *0.33* | *0.47* | *0.71* | 0.11 | 0.39 | 0.60 | 0.84 | 0.17 | *0.60* | *0.84* | *1.19* |
| Only RL | 0 | **100%** | 0.12 | 0.47 | *0.63* | *0.86* | 0.09 | 0.45 | 0.88 | 1.87 | 0.23 | 1.14 | 1.63 | *2.19* |
| Only RL | 1 | *99%* | 0.10 | *0.45* | *0.62* | *0.84* | 0.10 | 0.53 | 1.06 | 2.03 | 0.34 | 1.31 | 1.82 | 2.57 |
| Only RL | 2 | *98%* | 0.10 | *0.39* | *0.55* | *0.80* | 0.09 | 0.49 | 0.95 | 1.76 | 0.25 | *0.69* | *0.98* | *1.49* |

Since task scheduling approaches do not provide any benefit in comparison to continuous uniform dynamics randomization (DR), we analyze more closely the performance of the best obtained controllers with DR. We compare DR for pure RL and additive hybrid control in Table [1](https://arxiv.org/html/2606.31291#S4.T1) with domain-specific evaluation metrics. We focus on the error distribution of the aerodynamic angles and success rate. See Appendix [L](https://arxiv.org/html/2606.31291#A12) for a detailed analysis including the control costs. Exemplary training curves are displayed in Figure [6](https://arxiv.org/html/2606.31291#S4.F6).

In comparison to the baseline controller, which fails particularly often at low flap actuator bandwidths and large rotations of the inertia tensor, both additive hybrid control and pure RL improve the success rate considerably. The additive hybrid controller's success rate seems to be limited by the baseline, although it obtains lower 90th, 95th, and 98th error percentiles in each aerodynamic angle. For the angle of attack, both architectures reduce the 90th, 95th, and 98th error percentiles in comparison to the baseline. The additive hybrid controller is even better at tracking the bank angle than the baseline in these percentiles. Improving these metrics comes at the cost of more expensive control, i.e., thruster usage and flap angle changes (see Appendix [L](https://arxiv.org/html/2606.31291#A12)).

## 5Related Work

RL has been applied to attitude control in various ways for vehicles inside (zhen\_deep\_2020; bernini\_few\_2021; liu\_attitude\_2022; rosa\_deep\_2023; bernini\_reinforcement\_2024; bohn\_data\-efficient\_2024; AI4GNC\_City2024) and outside of an atmosphere (su\_deep\_2019; vedant\_reinforcement\_2019; elkins\_autonomous\_2020; elkins\_adaptive\_2020; elkins\_bridging\_2021; mahfouz\_reinforcement\_2022; liu\_neural\_2022; xiao\_fixed\-time\_2023; Barrenechea2023Hoppa). Q-learning (liu\_neural\_2022), PPO (vedant\_reinforcement\_2019; elkins\_autonomous\_2020; zhen\_deep\_2020; mahfouz\_reinforcement\_2022), DDPG (su\_deep\_2019; rosa\_deep\_2023), TD3 (elkins\_adaptive\_2020; elkins\_bridging\_2021; liu\_attitude\_2022), and SAC (bernini\_reinforcement\_2024; bohn\_data\-efficient\_2024) were used in these applications. Most studies replace traditional control approaches by RL. However, Barrenechea2023Hoppa propose a residual RL architecture for online adaptation and liu\_attitude\_2022 embed a policy for anti-disturbance control in a hybrid controller for hypersonic re-entry. Furthermore, liu\_attitude\_2022 extend TD3 to EVE-TD3, which shares similarities with MR.Q: (1) a multi-step return (over a horizon of 30 steps) to improve the estimation of the target value for the critic, and (2) the policy network has a bottleneck (6–18 nodes) to force the policy to extract high-level features.

According to elkins\_adaptive\_2020, robustness of attitude control can be defined with respect to (1) tumble: nonzero initial angular velocity, (2) single, impulsive disturbance torque, (3) constant disturbance torque, and (4) different inertia tensors $\boldsymbol{I}$. RL can be robust against variations of $\boldsymbol{I}$ by adapting to it online (e.g., in takeover maneuvers, liu\_neural\_2022) or by handling variation of it as noise (vedant\_reinforcement\_2019). Robustness can also be defined in terms of handling disturbances caused by the application of unknown external torques on the spacecraft (xiao\_fixed\-time\_2023; elkins\_adaptive\_2020) or actuator faults (xiao\_fixed\-time\_2023). Sources of disturbances might be (1) fault caused by bias (e.g., less available torque), (2) fault caused by failure to respond to control signals (e.g., motor fault), or (3) disturbance of environment: uncertainties in air resistance, gravity, solar pressure, etc. (su\_deep\_2019). bernini\_reinforcement\_2024 measure robustness with respect to wind gusts in quadcopter control, which is similar to a single, impulsive disturbance force. Similarly, rosa\_deep\_2023 verified a deep RL algorithm for guidance and control of a reusable launch vehicle in the landing phase and tested it under various wind conditions.

## 6Conclusion

The MR.Q algorithm excels without task-specific tuning in the challenging attitude control problem of a hypersonic re-entry vehicle (see Section [4.1.1](https://arxiv.org/html/2606.31291#S4.SS1.SSS1)). Specifically, its model-based representation and multi-step returns are pivotal. Pure MR.Q can surpass the performance of the baseline controller and obtain performance similar to that of hybrid controllers under nominal conditions. However, training under nominal conditions does not yield sufficient generalization, although hybrid control architectures often work better under out-of-distribution settings than pure RL (see Section [4.1.3](https://arxiv.org/html/2606.31291#S4.SS1.SSS3)). Hence, we enforce generalization with dynamics randomization and find that continuous uniform sampling of dynamics parameters is sufficient, i.e., task scheduling approaches do not perform better (see Section [4.2.1](https://arxiv.org/html/2606.31291#S4.SS2.SSS1)). Surprisingly, MR.Q is able to generalize over each condition that we varied during training. With dynamics randomization, obtained controllers are considerably more robust than the baseline with respect to uncertainties in dynamics parameters (see Section [4.2.2](https://arxiv.org/html/2606.31291#S4.SS2.SSS2)).

The additive hybrid controller is the most promising controller. It generalizes better than pure MR.Q in the out-of-distribution experiments with training under nominal conditions (see Section [4.1.3](https://arxiv.org/html/2606.31291#S4.SS1.SSS3)). It improves the robustness in comparison to the baseline by reducing the probability of larger errors of the angle of attack at the cost of using the thrusters and changing the flap angles more (see Section [4.2.2](https://arxiv.org/html/2606.31291#S4.SS2.SSS2)). The lower bound of the additive hybrid controller's performance is the baseline controller's performance if we deactivate the learned component in out-of-distribution settings.

To ensure stability of hybrid controllers, out\-of\-distribution conditions must be detected to disable the learned component and fall back to the formally verified baseline\. This is straightforward to implement through outlier detection with data from the replay buffer\. To enable real\-time execution of policies in deployment, we plan to run them on Field Programmable Gate Arrays \(FPGAs\)\.
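A minimal sketch of such a fallback gate is given below. It assumes a per-feature z-score test against statistics computed from observations stored in the replay buffer. The z-score test, the threshold of 4 standard deviations, and all function names are illustrative assumptions and not the detection method of this work.

```python
import math


def fit_statistics(observations):
    """Per-feature mean and standard deviation of replay-buffer observations."""
    n = len(observations)
    dim = len(observations[0])
    means = [sum(o[i] for o in observations) / n for i in range(dim)]
    stds = [
        math.sqrt(sum((o[i] - means[i]) ** 2 for o in observations) / n)
        for i in range(dim)
    ]
    return means, stds


def is_out_of_distribution(obs, means, stds, threshold=4.0, eps=1e-8):
    """Flag an observation if any feature deviates by more than `threshold` std devs."""
    return any(
        abs(x - m) / (s + eps) > threshold for x, m, s in zip(obs, means, stds)
    )


def additive_hybrid_command(obs, baseline_cmd, learned_cmd, means, stds):
    """Add the learned correction only for in-distribution observations."""
    if is_out_of_distribution(obs, means, stds):
        return list(baseline_cmd)  # fall back to the baseline controller alone
    return [b + l for b, l in zip(baseline_cmd, learned_cmd)]
```

With this structure, the command in flagged states is exactly the baseline command, which matches the lower-bound argument made for the additive hybrid controller.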

Being able to adapt to unforeseen conditions online would be a major advantage of reinforcement learning for attitude control (see, e.g., Barrenechea2023Hoppa). The gain-scheduling controller, which exhibits less variance across training steps and runs (see Section [4.1.2](https://arxiv.org/html/2606.31291#S4.SS1.SSS2)), is our preferred candidate for online adaptation in a real scenario. However, the analysis is out of scope for this study. More experiments are needed to evaluate stability and performance of the online adaptation process.

#### Acknowledgments

This work was funded by the European Space Agency under the GSTP programme, activity GT1I-602SA "Artificial Intelligence techniques for spacecraft attitude control and estimation" (project acronym: AI4AOCS), contract number 4000145154/24/NL/MGu, led by Airbus Defence and Space GmbH. We thank Octavio Arriaga for feedback on our implementation of MR.Q, Shubham Vyas for feedback on the manuscript, and Arthur de Freitas Precht for technical support for the interface to the simulation software. This work was partially supported by the German Federal Ministry of Research, Technology and Space (BMFTR) under the Robotics Institute Germany (RIG).

## References

Supplementary Materials

*The following content was not necessarily subject to peer review\.*

## Appendix A Action Space

| Action | Description | Unit | Limits |
|---|---|---|---|
| **Control Commands (Only in Pure RL or Additive Hybrid Control Mode)** | | | |
| $\Delta\delta_{e,\text{cmd}}(t)$ | Change ($\delta_{e,\text{cmd}}(t)-\delta_{e,\text{cmd}}(t-1)$) of symmetric deflection (both flaps trailing edge down). | rad | $\left[-\frac{15}{14}^{\circ},\frac{15}{14}^{\circ}\right]$ |
| $\Delta\delta_{a,\text{cmd}}(t)$ | Change of antisymmetric deflection of both flaps. Right flap is deflected trailing edge down, left flap is deflected trailing edge up. | rad | $\left[-\frac{15}{14}^{\circ},\frac{15}{14}^{\circ}\right]$ |
| $\tau_{z}(t)$ | The available four thrusters are fired in a group that produces a (yawing) torque about the body-fixed vertical (z) axis. The command is a torque demand that is converted to thruster opening times by low-level functions. | N | $\left[-300,300\right]$ |
| **PID Gains (Only in Gain-Scheduling Hybrid Control Mode)** | | | |
| $f_{k_{p,\alpha}}$ | Factor for proportional gain of angle of attack error. | decibel | $\left[-6,6\right]$ |
| $f_{k_{i,\alpha}}$ | Factor for integral gain of angle of attack error. | decibel | $\left[-6,6\right]$ |
| $f_{k_{d,\alpha}}$ | Factor for derivative gain of angle of attack error. | decibel | $\left[-6,6\right]$ |
| $f_{k_{p,\beta,\beta}}$ | Factor for proportional gain of yaw angle error to thruster command. | decibel | $\left[-6,6\right]$ |
| $f_{k_{d,\beta,\beta}}$ | Factor for derivative gain of yaw angle error to thruster command. | decibel | $\left[-6,6\right]$ |
| $f_{k_{p,\beta,\mu}}$ | Factor for proportional gain of bank angle error to thruster command. | decibel | $\left[-6,6\right]$ |
| $f_{k_{d,\beta,\mu}}$ | Factor for derivative gain of bank angle error to thruster command. | decibel | $\left[-6,6\right]$ |
| $f_{k_{p,\mu,\mu}}$ | Factor for proportional gain of bank angle error to flap command. | decibel | $\left[-6,6\right]$ |
| $f_{k_{i,\mu,\mu}}$ | Factor for integral gain of bank angle error to flap command. | decibel | $\left[-6,6\right]$ |
| $f_{k_{d,\mu,\mu}}$ | Factor for derivative gain of bank angle error to flap command. | decibel | $\left[-6,6\right]$ |
| $f_{k_{p,\mu,\beta}}$ | Factor for proportional gain of yaw angle error to flap command. | decibel | $\left[-6,6\right]$ |
| $f_{k_{d,\mu,\beta}}$ | Factor for derivative gain of yaw angle error to flap command. | decibel | $\left[-6,6\right]$ |
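
The following sketch illustrates how the two kinds of actions in the table could be decoded. It rests on several assumptions that are not stated in the table: the flap increments are interpreted as $\frac{15}{14}$ degrees converted to radians, the accumulated flap command is clipped to $\pm\frac{\pi}{2}$ as listed for the last flap command in the observation space, and the decibel factors follow the amplitude convention $10^{f/20}$. All function names are hypothetical.

```python
import math

DELTA_LIMIT_RAD = math.radians(15.0 / 14.0)  # per-step increment limit (assumed in degrees)
FLAP_LIMIT_RAD = math.pi / 2.0  # limit of the accumulated flap command (assumed)


def clip(x, low, high):
    return max(low, min(high, x))


def integrate_flap_command(previous_cmd, delta_cmd):
    """Accumulate an incremental flap action into an absolute flap command."""
    delta = clip(delta_cmd, -DELTA_LIMIT_RAD, DELTA_LIMIT_RAD)
    return clip(previous_cmd + delta, -FLAP_LIMIT_RAD, FLAP_LIMIT_RAD)


def db_to_gain_factor(factor_db):
    """Convert a gain factor in decibel to a linear multiplier (amplitude convention)."""
    factor_db = clip(factor_db, -6.0, 6.0)
    return 10.0 ** (factor_db / 20.0)


def scheduled_gain(nominal_gain, factor_db):
    """Scale a nominal PID gain with the factor selected by the policy."""
    return nominal_gain * db_to_gain_factor(factor_db)
```

Under the amplitude convention, the range of $[-6, 6]$ decibel corresponds to scaling a gain by roughly a factor between 0.5 and 2. With a control frequency of 14 Hz, an increment limit of $\frac{15}{14}$ degrees per step would correspond to a flap rate limit of 15 degrees per second.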
## Appendix B Observation Space

| Observation | Description | Unit | Limits |
|---|---|---|---|
| **Environment Conditions** | | | |
| Altitude | Altitude above the WGS84 geoid. | m | $\left[0,150\text{k}\right]$ |
| Mach number $M\!a$ | Velocity as ratio of absolute speed and local speed of sound. | unitless | $\left[0,35\right]$ |
| Dynamic pressure $\bar{q}$ | Dynamic pressure depends on air speed $V_a$ and air density $\rho$ through $0.5V_a^{2}\rho$. | Pa | $\left[0,10\text{k}\right]$ |
| **Commanded Aerodynamic Angles** | The commands are generated using a model of the desired response characteristics (second-order transfer function) in the setpoint generator. Tracking the aerodynamic angles is the primary control objective. | | |
| $\alpha_{\text{cmd}}(t)$ (angle of attack) | Requested angle between the direction of flight relative to air and the body-fixed frame in the vertical plane. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| $\beta_{\text{cmd}}(t)$ (sideslip angle) | Requested angle between the direction of flight relative to air and the body-fixed frame in the horizontal plane. The command can be assumed to be zero all of the time. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| $\mu_{\text{cmd}}(t)$ (bank angle) | Requested angle between the direction of lift and the vertical plane around the direction of flight relative to air. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| **Measured Angles and Angular Rates** | | | |
| $\alpha(t)$ (angle of attack) | Measured angle of attack. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| $\beta(t)$ (sideslip angle) | Measured angle of sideslip. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| $\mu(t)$ (bank angle) | Measured bank angle. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| $\alpha(t-1)$ | Last measured angle of attack. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| $\beta(t-1)$ | Last measured angle of sideslip. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| $\mu(t-1)$ | Last measured bank angle. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| $\dot{\phi}(t)$ (roll rate) | Angular velocity about the forward (x) axis of the vehicle. | rad/s | $\left[-10,10\right]$ |
| $\dot{\theta}(t)$ (pitch rate) | Rate of change of the pitch angle. Corresponds to angular velocity about the intermediate y' axis in a standard Euler angle zyx-rotation sequence. | rad/s | $\left[-10,10\right]$ |
| $\dot{\psi}(t)$ (yaw rate) | Rate of change of the yaw angle. Corresponds to angular velocity about the geodetic z axis in a standard Euler angle zyx-rotation sequence. | rad/s | $\left[-10,10\right]$ |
| **Last Commands** | | | |
| $\delta_{\text{e},\text{cmd}}(t-1)$ | Flap command from last step. See actions for details. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| $\delta_{\text{a},\text{cmd}}(t-1)$ | Flap command from last step. See actions for details. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| **Baseline Control Commands (Only in Hybrid Control Mode)** | | | |
| $\delta_{\text{e},\text{basecmd}}(t)$ | Flap command of baseline controller. See actions for details. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| $\delta_{\text{a},\text{basecmd}}(t)$ | Flap command of baseline controller. See actions for details. | rad | $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$ |
| $\tau_{\text{z},\text{basecmd}}(t)$ | Thruster command of baseline controller. See actions for details. | N | $\left[-300,300\right]$ |
| **PID Error Components** | | | |
| $e_{\alpha}(t)$ | Proportional error for angle of attack. | rad | $\left[-\pi,\pi\right]$ |
| $e_{\beta}(t)$ | Proportional error for sideslip angle. | rad | $\left[-\pi,\pi\right]$ |
| $e_{\mu}(t)$ | Proportional error for bank angle. | rad | $\left[-\pi,\pi\right]$ |
| $\int_{0}^{t}e_{\alpha}(\tau)d\tau$ | Integral error for angle of attack. | rad s | |
| $\int_{0}^{t}e_{\beta}(\tau)d\tau$ | Integral error for sideslip angle. | rad s | |
| $\int_{0}^{t}e_{\mu}(\tau)d\tau$ | Integral error for bank angle. | rad s | |
| $\frac{de_{\alpha}(t)}{dt}$ | Derivative error for angle of attack. | rad/s | |
| $\frac{de_{\beta}(t)}{dt}$ | Derivative error for sideslip angle. | rad/s | |
| $\frac{de_{\mu}(t)}{dt}$ | Derivative error for bank angle. | rad/s | |
## Appendix C Design of Reward Function

### C.1 Sensitivity Analysis of Reward Function

![Refer to caption](https://arxiv.org/html/2606.31291v1/x12.png) (a) Influence of flap cost weight on control cost and absolute error of aerodynamic angles.
![Refer to caption](https://arxiv.org/html/2606.31291v1/x13.png) (b) Influence of thruster cost weight on control cost and absolute error of aerodynamic angles. In the gray region, the policy was not able to follow the complete trajectory.

Figure 7: Sensitivity analysis of the weights in the reward function.

For various configurations of the control cost weights (flap cost weights $w_{\delta_{e}}=w_{\delta_{a}}$ and thruster cost weight $w_{\tau_{z}}$), we report the interquartile mean of absolute errors and control costs of the obtained controllers over ten randomly sampled evaluation trajectories after 2.4M training steps with one training run of MR.Q in Figure [7](https://arxiv.org/html/2606.31291#A3.F7).

We see that increasing the flap cost weights $w_{\delta_{e}}=w_{\delta_{a}}$ has the desired effect of reducing the flap control cost without sacrificing tracking performance around a sweet spot. Good policies with low control costs are obtained in the range of $\left[0.03,0.2\right]$. However, a weight of $0.2$ considerably reduces sample efficiency in comparison to lower values, indicating that it makes the learning task harder. Usually about 250k to 600k training steps are required to follow the trajectory completely during training. With $w_{\delta_{e}}=w_{\delta_{a}}=0.2$, about 1M training steps were required. For a weight of $0.5$, 2.2M training steps were required. Although this analysis was done after the experiments under nominal conditions, $w_{\delta_{e}}=w_{\delta_{a}}=0.05$ seems to be a good choice in hindsight as it balances sample efficiency, control cost, and control performance.

Similarly, increasing the thruster cost weight has the desired effect of reducing control cost. However, the errors of the aerodynamic angles increase at the same time. The best possible weight is about $w_{\tau_{z}}=1$, as increasing the weight further makes the learning problem so hard that it is not possible to follow the whole trajectory after 2.4M training steps.

### C.2 Monitoring Training

Since the maximum value of $r_{t}$ is 1, the maximum value of the expected return approximated by the value function is $\frac{1}{1-\gamma}$ under the discounted infinite horizon model, e.g., for $\gamma=0.99$ the maximum return is 100 if we assume no control costs and that the attitude is tracked perfectly. Hence, we are able to directly track the performance of the policy in comparison to the optimum during training in algorithms based on DDPG (i.e., TD3, TD7, MR.Q), since the loss of the policy is the negative expected return approximated by the value function. The best possible value would be -100 under the condition of perfect approximation.
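
The bound can be checked numerically with a few lines of Python. The function names are hypothetical helpers for this illustration, and the ratio in `fraction_of_optimum` is only meaningful if the value function approximates the return well.

```python
def max_discounted_return(gamma, r_max=1.0):
    """Upper bound of the infinite-horizon discounted return for rewards <= r_max."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return r_max / (1.0 - gamma)


def truncated_return(gamma, steps, r=1.0):
    """Discounted return of a constant reward r over a finite number of steps."""
    return sum(r * gamma**t for t in range(steps))


def fraction_of_optimum(policy_loss, gamma):
    """The policy loss is the negative estimated return, so compare -loss to the bound."""
    return -policy_loss / max_discounted_return(gamma)
```

For example, a policy loss of -50 with $\gamma=0.99$ would indicate that the critic estimates about half of the best possible return.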

### C.3 Considered Reward Functions

In non\-systematic, preliminary experiments we explored a wide range of reward functions and components of reward functions\. We tried the following:

- Negative attitude cost: sum of squared differences between commanded and measured aerodynamic angles.
- Adding a penalty for thruster forces.
- Adding a penalty for flap angle changes.
- Adding a penalty for the derivative of the aerodynamic angles. We do not use this component anymore because it is often required to change the aerodynamic angles considerably to track the commanded trajectory.
- Adding an extra penalty for large deviations from commanded angles (unsafe state penalty).
- Terminating an episode when the orientation is out of the safe range and adding a large penalty of -1000 or -8000 (stop penalty).
- Adding a reward for each step in which the episode is not aborted (healthy reward). This results in significantly longer simulations, although sometimes at the cost of less accurate tracking of the orientation.
- Replacing the angle cost by a forward reward, which rewards moving closer to the commanded attitude and is inspired by the reward function of the MuJoCo environments from Gymnasium.
- Initially, we set the weight of the control cost components to 0.001 and then increased it to 0.01, 0.1, 0.5, and 1. Higher weights lead to less switching between flap angles, less extensive use of thrusters, and generally less oscillation of commands. The current individual weights are a good compromise between tracking performance and avoiding excessive overcompensation of errors.
- We tried to remove the healthy reward and use a terminal cost instead that rewards low altitude of the vehicle when the episode terminates. This did not lead to better results, even when evaluating policies obtained with the previous reward under the new reward.
- We tried to remove the healthy reward and use an attitude term that is always positive and becomes more positive when the attitude cost is lower. We call this the attitude reward, and it is included in the final version of the reward function. The idea behind this is to substitute the healthy reward and the attitude cost component, since it is always better to survive longer and it is better to have a low attitude cost. The approach was successful, so we do not use any attitude cost, terminal cost, or stop penalty anymore.
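
To make the final structure concrete, the sketch below combines an always-positive attitude reward with control cost penalties. The exponential shape of the attitude reward, the quadratic control costs, the torque normalization, and the default weights are assumptions chosen for illustration only. The text above only states that the attitude reward is always positive, grows as the attitude cost shrinks, and that the maximum reward per step is 1.

```python
import math


def attitude_cost(commanded, measured):
    """Sum of squared differences between commanded and measured aerodynamic angles."""
    return sum((c - m) ** 2 for c, m in zip(commanded, measured))


def reward(commanded, measured, flap_deltas, thruster_torque,
           w_flap=0.05, w_thruster=1.0, torque_limit=300.0):
    """Hypothetical per-step reward: attitude reward minus weighted control costs."""
    # Attitude reward in (0, 1]; it equals 1 only for perfect tracking.
    # The exponential shape is an assumption of this sketch.
    attitude_reward = math.exp(-attitude_cost(commanded, measured))
    # Penalty for changing the flap commands (assumed quadratic).
    flap_cost = w_flap * sum(d ** 2 for d in flap_deltas)
    # Penalty for thruster usage, normalized by the command limit (assumed).
    thruster_cost = w_thruster * (thruster_torque / torque_limit) ** 2
    return attitude_reward - flap_cost - thruster_cost
```

Because the attitude reward is positive in every step, surviving longer always adds to the return, which is the role that the healthy reward played before.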

## Appendix D Simulation Details

### D.1 Common Configuration

##### Initial conditions:

The vehicle starts at the outer atmosphere of Earth with a fixed altitude of about 93 km and a fixed velocity of approximately Mach 26.75. The dynamic pressure is low in this state, at approximately 54 Pa.

##### Flight path:

To avoid overfitting to a specific flight path, we randomize the trajectory given by the guidance system in each episode by setting two parameters $\gamma$ and $\Delta\chi_{\max}$ (see Table [5](https://arxiv.org/html/2606.31291#A4.T5)). The flight path angle $\gamma$ determines the desired speed of descent. The maximum deviation from the flight course $\Delta\chi_{\max}$ controls bank angle reversals. These reversals are necessary because the bank angle of the vehicle influences not just the speed of descent, but also the direction of the flight. In order to follow a straight path, the bank angle has to be reversed when $\Delta\chi$ exceeds the given threshold. This has the effect of modifying the environment dynamics and reward function from the perspective of the agent because it modifies the commanded aerodynamic angles that are part of the state space as well as the PID error components that are part of the observation space.
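
The reversal rule can be illustrated with a toy example. The sampling ranges are taken from Table 5, whereas the sign convention, the constant turn rate, and the function names are simplifying assumptions. The actual guidance system is not described at this level of detail.

```python
import random


def sample_trajectory_parameters(rng):
    """Draw the per-episode guidance parameters from the ranges in Table 5 (degrees)."""
    flight_path_angle = rng.uniform(-1.1, -0.9)
    max_course_deviation = rng.uniform(1.5, 5.0)
    return flight_path_angle, max_course_deviation


def update_bank_sign(bank_sign, course_deviation, max_course_deviation):
    """Reverse the bank direction once the course deviation leaves the corridor."""
    if course_deviation > max_course_deviation:
        return -1.0
    if course_deviation < -max_course_deviation:
        return 1.0
    return bank_sign


def simulate_course(max_course_deviation, steps=2000, turn_rate=0.05):
    """Toy kinematics: the course deviation drifts in the direction of the bank sign."""
    deviation, sign, reversals = 0.0, 1.0, 0
    history = []
    for _ in range(steps):
        new_sign = update_bank_sign(sign, deviation, max_course_deviation)
        reversals += new_sign != sign  # count every change of the bank direction
        sign = new_sign
        deviation += sign * turn_rate
        history.append(deviation)
    return history, reversals
```

In this toy model, a smaller $\Delta\chi_{\max}$ leads to more frequent reversals, so the sampled parameter changes how often the commanded bank angle flips within an episode.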

Table 5: Configuration of simulation.

| Variable | Nominal | Distribution (Nominal) | Distribution (Dynamics Randomization) |
|---|---|---|---|
| **Trajectory Parameters** | | | |
| $\gamma$ | $-1.0^{\circ}$ | $\mathcal{U}(-1.1^{\circ},-0.9^{\circ})$ | $\mathcal{U}(-1.1^{\circ},-0.9^{\circ})$ |
| $\Delta\chi_{\max}$ | $3.25^{\circ}$ | $\mathcal{U}(1.5^{\circ},5^{\circ})$ | $\mathcal{U}(1.5^{\circ},5^{\circ})$ |
| **Initial Attitude** | | | |
| $\alpha(0)$ | $45.024^{\circ}$ | constant | $\mathcal{U}(\alpha_{0,\text{nominal}}-5^{\circ},\alpha_{0,\text{nominal}}+5^{\circ})$ |
| $\beta(0)$ | $0.046^{\circ}$ | constant | $\mathcal{U}(\beta_{0,\text{nominal}}-5^{\circ},\beta_{0,\text{nominal}}+5^{\circ})$ |
| $\mu(0)$ | $61.141^{\circ}$ | constant | $\mathcal{U}(\mu_{0,\text{nominal}}-5^{\circ},\mu_{0,\text{nominal}}+5^{\circ})$ |
| **Inertia** | | | |
| $m_{0}$ | 1640 kg | constant | context, $m_{0}\in\left[1312,1968\right]$ kg |
| $\boldsymbol{I}$ | $\text{diag}(492, 2247, 2358)\ \text{kg}\cdot\text{m}^{2}$ | constant | context, see main text |
| **Actuation** | | | |
| $\omega_{0}$ | 30 rad/s | constant | context, $\omega_{0}\in\left[12,30\right]$ rad/s |

### D.2 Nominal Conditions

Under nominal conditions, the initial attitude, the vehicle's mass, the vehicle's inertia tensor, and the actuation are constant as summarized in Table [5](https://arxiv.org/html/2606.31291#A4.T5).

### D.3 Dynamics Randomization

In comparison to nominal conditions, we additionally randomize the initial attitude for each episode according to Table [5](https://arxiv.org/html/2606.31291#A4.T5), and the context vector contains the vehicle's mass, inertia tensor, and flap actuator bandwidth.

To model uncertainty in the actuation of the flaps, we modify the flap actuator bandwidth $\omega_{0}$. Lower values make controlling the vehicle harder. The baseline controller becomes unstable for values under 14 rad/s.

The initial mass is varied in a predefined range according to Table [5](https://arxiv.org/html/2606.31291#A4.T5). The vehicle's mass defines its gravitational force and, hence, influences the evolution of the flight path. The flight path then defines the commanded aerodynamic angles, as it is required to increase or decrease the lift to compensate for the change in mass.
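
A minimal sketch of how the scalar conditions for one episode could be drawn is shown below. The ranges follow Table 5. Uniform sampling of mass and bandwidth, the dictionary layout, and the function name are assumptions of this sketch, and the inertia tensor is omitted here.

```python
import random

NOMINAL_ATTITUDE_DEG = {"alpha": 45.024, "beta": 0.046, "mu": 61.141}
NOMINAL_MASS_KG = 1640.0
MASS_RANGE_KG = (1312.0, 1968.0)  # nominal mass -20 % / +20 %
BANDWIDTH_RANGE_RAD_S = (12.0, 30.0)
BASELINE_STABILITY_LIMIT_RAD_S = 14.0  # the baseline becomes unstable below this value


def sample_episode_conditions(rng):
    """Sample initial attitude and scalar context parameters for one episode."""
    attitude = {
        name: rng.uniform(value - 5.0, value + 5.0)
        for name, value in NOMINAL_ATTITUDE_DEG.items()
    }
    mass = rng.uniform(*MASS_RANGE_KG)
    bandwidth = rng.uniform(*BANDWIDTH_RANGE_RAD_S)
    return {
        "initial_attitude_deg": attitude,
        "context": {"mass_kg": mass, "flap_bandwidth_rad_s": bandwidth},
        # Flag episodes in which the baseline controller alone is expected to fail.
        "baseline_expected_unstable": bandwidth < BASELINE_STABILITY_LIMIT_RAD_S,
    }
```

Note that the bandwidth range extends below the stability limit of the baseline controller, so some sampled episodes cannot be solved by the baseline alone.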

To generate physically plausible inertia tensors, we define perturbations of the nominal inertia tensor with the following procedure. Since the nominal inertia tensor is diagonal, we apply a perturbation directly to the principal moments and then rotate the inertia tensor. To modify the principal moments, we use a vector $\boldsymbol{a}\in\left[-0.1,0.1\right]^{3}$, and to modify the rotation, we use a rotation vector $\boldsymbol{\omega}=\theta\hat{\boldsymbol{\omega}}\in\mathbb{R}^{3}$ with an angle $\theta\in\left[-10^{\circ},10^{\circ}\right]$ and axis $\hat{\boldsymbol{\omega}}\in S^{2}\subset\mathbb{R}^{3}$. With the exponential map $\text{Exp}:\mathbb{R}^{3}\rightarrow SO(3)$ (Sola2018), we compute the rotation matrix from the rotation vector so that the nominal inertia is perturbed with

$$\boldsymbol{I}=\text{Exp}(\boldsymbol{\omega})\cdot\text{diag}(\boldsymbol{1}_{3}+\boldsymbol{a})\cdot\boldsymbol{I}_{\text{nominal}}\cdot\text{Exp}(\boldsymbol{\omega})^{T},$$

where $\text{diag}(\boldsymbol{x})$ transforms a vector $\boldsymbol{x}\in\mathbb{R}^{3}$ to a diagonal matrix. We generate a random set of rotation axes by sampling from a 3D standard normal distribution $\mathcal{N}(\boldsymbol{0}_{3},\boldsymbol{I}_{3\times 3})$ and normalizing the vector.
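
The procedure can be sketched in plain Python as follows. The exponential map is written out with Rodrigues' formula, and the perturbation vector $\boldsymbol{a}$ and the angle $\theta$ are sampled uniformly, which is an assumption since only their ranges are given above. The helper names are hypothetical.

```python
import math
import random

NOMINAL_MOMENTS = (492.0, 2247.0, 2358.0)  # kg m^2, diagonal of the nominal tensor


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def exp_map(axis, angle):
    """Rodrigues' formula: rotation matrix for a unit axis and an angle in radians."""
    x, y, z = axis
    k = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]  # cross-product matrix
    k2 = matmul(k, k)
    s, c = math.sin(angle), 1.0 - math.cos(angle)
    return [[(i == j) + s * k[i][j] + c * k2[i][j] for j in range(3)]
            for i in range(3)]


def sample_inertia(rng):
    """Sample a perturbed inertia tensor around the nominal diagonal tensor."""
    # Relative perturbation of each principal moment, a in [-0.1, 0.1]^3.
    a = [rng.uniform(-0.1, 0.1) for _ in range(3)]
    # Random axis: normalized sample from a 3D standard normal distribution.
    v = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    axis = [vi / norm for vi in v]
    angle = math.radians(rng.uniform(-10.0, 10.0))
    # diag(1 + a) * I_nominal is again diagonal because I_nominal is diagonal.
    scaled = [[(1.0 + a[i]) * NOMINAL_MOMENTS[i] if i == j else 0.0
               for j in range(3)] for i in range(3)]
    r = exp_map(axis, angle)
    return matmul(matmul(r, scaled), transpose(r))
```

Since the rotation is a similarity transform, the sampled tensor stays symmetric and keeps the scaled principal moments as eigenvalues, which is why the result remains physically plausible as long as the scaled moments are positive.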

## Appendix E Hyperparameters of RL Algorithms

| Hyperparameter | MR.Q | TD7 | TD3 | SAC |
|---|---|---|---|---|
| **Common** | | | | |
| Discount factor $\gamma$ | 0.99 | 0.99 | 0.99 | 0.99 |
| Replay buffer capacity | 2M | 2M | 2M | 2M |
| Mini-batch size | 256 | 256 | 256 | 256 |
| Target update frequency $T_{\text{target}}$ | 250 | 250 | - | - |
| Target update rate $\tau$ | - | - | $5\cdot 10^{-3}$ | $5\cdot 10^{-3}$ |
| **MR.Q** | | | | |
| Dynamics loss weight $\lambda_{\text{Dynamics}}$ | 1 | - | - | - |
| Reward loss weight $\lambda_{\text{Reward}}$ | 0.1 | - | - | - |
| Terminal loss weight $\lambda_{\text{Terminal}}$ | 0.1 | - | - | - |
| Activation loss weight $\lambda_{\text{pre-activ}}$ | 1e-5 | - | - | - |
| Encoder horizon $H_{\text{Enc}}$ | 5 | - | - | - |
| Multi-step returns horizon $H_{Q}$ | 3 | - | - | - |
| **TD3** | | | | |
| Target policy noise $\sigma$ | $\mathcal{N}(0,0.2^{2})$ | $\mathcal{N}(0,0.2^{2})$ | $\mathcal{N}(0,0.2^{2})$ | - |
| Target policy noise clipping $c$ | $(-0.3,0.3)$ | $(-0.5,0.5)$ | $(-0.5,0.5)$ | - |
| Initial random exploration steps | 10k | 25k | 25k | 5k |
| Exploration noise | $\mathcal{N}(0,0.2^{2})$ | $\mathcal{N}(0,0.1^{2})$ | $\mathcal{N}(0,0.2^{2})$ | - |
| Policy delay | 1 | 2 | 2 | 1 |
| **LAP** | | | | |
| Probability smoothing $\alpha$ | 0.4 | 0.4 | - | - |
| Minimum priority | 1 | 1 | - | - |
| **Value Network** | | | | |
| Optimizer | AdamW (loshchilov2018decoupled) | Adam (Kingma2014\_adam) | Adam | Adam |
| Learning rate | 3e-4 | 3e-4 | 3e-4 | 1e-3 |
| Hidden dim | 512 | 256 | 256 | 256 |
| Activation function | ELU (clevert2015fast) | ELU | ReLU | ReLU |
| Weight initialization | Xavier uniform (Glorot2010understanding) | LeCun normal (Klambauer2017) | LeCun normal | LeCun normal |
| Bias initialization | 0 | 0 | 0 | 0 |
| Gradient clip norm | 20 | - | - | - |
| **Policy Network** | | | | |
| Optimizer | AdamW | Adam | Adam | Adam |
| Learning rate | 3e-4 | 3e-4 | 3e-4 | 3e-4 |
| Hidden dim | 512 | 256 | 256 | 256 |
| Activation function | ReLU (Glorot2011\_relu) | ReLU | ReLU | Swish (Ramachandran2017) |
| Weight initialization | Xavier uniform | LeCun normal | LeCun normal | LeCun normal |
| Bias initialization | 0 | 0 | 0 | 0 |
| **Encoder Network** | | | | |
| Optimizer | AdamW | Adam | - | - |
| Learning rate | 1e-4 | 3e-4 | - | - |
| Weight decay | 1e-4 | - | - | - |
| $\mathbf{z}_{s}$ dim | 512 | 256 | - | - |
| $\mathbf{z}_{sa}$ dim | 512 | 256 | - | - |
| $\mathbf{z}_{a}$ dim | 256 | 256 | - | - |
| Hidden dim | 512 | 256 | - | - |
| Activation function | ELU | ELU | - | - |
| Weight initialization | Xavier uniform | LeCun normal | - | - |
| Bias initialization | 0 | 0 | - | - |
| Reward bins | 65 | - | - | - |
| Reward range | $[-10,10]$ (effective: $[-22\text{k},22\text{k}]$) | - | - | - |
| **SAC** | | | | |
| Entropy optimizer | - | - | - | Adam |
| Entropy learning rate | - | - | - | 1e-3 |

While the MR.Q algorithm is a theoretically sound approach (fujimoto2025\_mrq), we had to modify it slightly in comparison to its original implementation. In the original implementation, the dynamics loss compares the latent state computed from unrolled dynamics $z_{sa}^{T}W_{p}\in\mathbb{R}$ to the latent target state obtained by applying the state encoder $f_{\omega}$ to a state $s^{\prime}$ from the replay buffer, $f_{\omega}(s^{\prime})=z_{s^{\prime}}$. The state encoder applies layer normalization and ELU and hence limits the output to $z_{s^{\prime}}\in\mathbb{R}_{\geq-1}$. Initially, we had this as a bug in our implementation of MR.Q, but we found that removing ELU from the last layer of the state encoder enhances performance. Hence, our implementation applies layer normalization without ELU after the last layer of the state encoder, which resembles the behavior of the state encoder in TD7 (Fujimoto2023\_td7), in which AvgL1Norm is supposed to protect from monotonic growth of the features.
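
The effect of the final ELU on the range of the latent state can be reproduced with a small example. The sketch below uses layer normalization without learnable scale and offset, which is a simplification, and the function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def elu(x):
    """ELU with alpha = 1: identity for positive inputs, exp(x) - 1 otherwise."""
    return [xi if xi > 0.0 else math.expm1(xi) for xi in x]


def encode_original(pre_activation):
    """Layer normalization followed by ELU: every feature is greater than -1."""
    return elu(layer_norm(pre_activation))


def encode_modified(pre_activation):
    """Layer normalization only, so features below -1 remain representable."""
    return layer_norm(pre_activation)
```

With ELU as the last operation, the target latent state can never contain values of -1 or below, whereas a linear prediction of the unrolled dynamics is unbounded. Dropping the ELU removes this mismatch between prediction and target ranges.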

## Appendix F Hyperparameters of Task Scheduling Algorithms

For task scheduling, we sample a discrete set of contexts and create one replay buffer per context. For each update, we first randomly sample a replay buffer and then a batch of samples from this replay buffer. This ensures that we continue training on each task even if the task is not selected for data collection anymore.
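
A minimal sketch of this two-stage sampling is shown below. The class and function names are hypothetical, the buffers are plain Python lists with first-in-first-out replacement, and the buffer is chosen uniformly at random, which is an assumption of this sketch. The capacity rule follows the replay buffer size per task listed in the hyperparameter table of this appendix.

```python
import random


def buffer_capacity_per_task(num_tasks, total=2_000_000, minimum=200_000):
    """Replay buffer size per task: max(200k, 2M / |T|)."""
    return max(minimum, total // num_tasks)


class MultiTaskReplay:
    """One replay buffer per context (task)."""

    def __init__(self, num_tasks, capacity):
        self.capacity = capacity
        self.buffers = [[] for _ in range(num_tasks)]

    def add(self, task, transition):
        buf = self.buffers[task]
        if len(buf) >= self.capacity:
            buf.pop(0)  # drop the oldest transition (simple FIFO for illustration)
        buf.append(transition)

    def sample(self, batch_size, rng):
        """Pick one non-empty buffer uniformly, then a batch from that buffer only."""
        candidates = [b for b in self.buffers if b]
        buf = rng.choice(candidates)
        return [rng.choice(buf) for _ in range(batch_size)]
```

For 50 tasks, `buffer_capacity_per_task(50)` evaluates to 200k transitions per task, since 2M divided by 50 is only 40k and the lower limit applies.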

In comparison to the original SMT (Cho2024), we do not perform any network resets, which do not seem to be necessary in MR.Q. Furthermore, we do not learn a task encoding.

| Hyperparameter | Active MT | SMT |
|---|---|---|
| **Common** | | |
| RL algorithm | MR.Q | MR.Q |
| Number of tasks $\lvert\mathcal{T}\rvert$ | 50 | 50 |
| Training steps | 15M | 15M |
| Scheduling interval | 1 episode | 1 episode |
| Replay buffer size per task | $\max\left(200\text{k},\frac{2\text{M}}{\lvert\mathcal{T}\rvert}\right)$ | $\max\left(200\text{k},\frac{2\text{M}}{\lvert\mathcal{T}\rvert}\right)$ |
| **D-UCB** | | |
| Upper bound for task selection reward $r_{\max}$ | 10.8k | - |
| Discount factor $\gamma_{\mathcal{T}}$ | 0.95 | - |
| Padding function strength $\xi$ | 1e-4 | - |
| **SMT** | | |
| Stage 1 budget $B_{1}$ | - | 12.75M |
| Stage 2 budget $B_{2}$ | - | 2.25M |
| $\kappa$ | - | 0.7 |
| $K$ | - | 8 |
| Threshold for unsolved tasks $m$ | - | $-100$ |
| Threshold for solved tasks $M$ | - | 10k |
| Reset interval | - | No resets |

## Appendix G Ablation Studies

We perform ablation studies to find out why MR.Q works in this application. To evaluate the effect of algorithm components and hyperparameters in the MR.Q algorithm, we train for 2.4M steps in the single task setting with one seed, and measure the accumulated reward over ten evaluation episodes. Although the results of these studies are not statistically robust with only one training seed, we expect the conclusions to be correct, since MR.Q is a robust algorithm with mostly consistent behavior across random seeds and the results are unambiguous.

### G.1 Effect of Policy Activation Penalty

| Metric (IQM $\pm$ interquartile std. dev.) | $\lambda_{\text{pre-activ}}=0$ | $10^{-6}$ | $10^{-5}$ (default) | $10^{-4}$ |
|---|---|---|---|---|
| Return | 10587 $\pm$ 166 | 10467 $\pm$ 163 | 10492 $\pm$ 181 | 10570 $\pm$ 193 |
| Flap Roll Cmd. [rad] | 1.6e-5 $\pm$ 3.1e-5 | 1.6e-5 $\pm$ 10e-5 | 2.0e-5 $\pm$ 5.0e-5 | 2.1e-5 $\pm$ 6.0e-5 |
| Flap Pitch Cmd. [rad] | 0.8e-6 $\pm$ 2.5e-5 | 1.4e-5 $\pm$ 8.7e-5 | 1.2e-5 $\pm$ 3.7e-5 | 1.0e-5 $\pm$ 4.7e-5 |
| Thruster Cmd. [N] | 2.97 $\pm$ 3.03 | 4.31 $\pm$ 3.6 | 5.29 $\pm$ 4.69 | 4.60 $\pm$ 4.42 |

Similarly to the control cost of the reward function, we expected that the activation loss of the policy forces the policy to reduce the magnitude of control commands. Other algorithms do not have this component and learn policies that often result in bang-bang control. However, setting $\lambda_{\text{pre-activ}}=0$ does not decrease the performance. The activation penalty is not contributing to the considerable performance improvement of MR.Q in comparison to TD3 and TD7.

### G.2 Effect of Model-Based Representation Learning

| Encoder loss | $\lambda_{\text{Dynamics}}$ | $\lambda_{\text{Reward}}$ | $\lambda_{\text{Terminal}}$ | Return (IQM $\pm$ interquartile std. dev.) |
|---|---|---|---|---|
| No loss | 0 | 0 | 0 | 2987 $\pm$ 277 |
| No reward loss | 1 | 0 | 0.1 | 10455 $\pm$ 212 |
| No dynamics loss | 0 | 0.1 | 0.1 | 10630 $\pm$ 193 |
| Default | 1 | 0.1 | 0.1 | 10492 $\pm$ 181 |

The model-based representation learned by the encoder is the main difference of MR.Q in comparison to TD3 and TD7. Deactivating the loss for the encoder completely, i.e., using random features, results in a performance similar to TD3 and TD7, which confirms that the model-based representation of MR.Q is important for the application. Interestingly, deactivating either the dynamics loss or the reward loss has a negligible effect on the final performance. Note that these results are specific to this application, in which a useful and meaningful reward can be computed in every step. The behavior might be different for environments with sparse reward.

### G.3 Effect of Horizon for Multi-Step Returns in Critic Loss

| Multi-step returns horizon $H_{Q}$ | 1 | 3 (default) | 5 | 10 |
|---|---|---|---|---|
| Return (IQM) | 3184 | 10492 | 9250 | 4483 |

As EVE-TD3 (liu\_attitude\_2022) uses $H_{Q}=30$, we are interested in the effect of the hyperparameter in this application with MR.Q. The horizon controls the bias-variance trade-off in the target value for the action-value function (see also Schulman2016 for a discussion in the context of advantage function estimation). The results demonstrate that the multi-step return target for the critic is a main reason for the good performance of MR.Q in this application. Furthermore, MR.Q's default value of $H_{Q}=3$ is also the best of the tested values in this application. Considering the control frequencies of 14 Hz in our controller and 50 Hz in the controller of liu\_attitude\_2022 (control step size of 20 ms), we use a horizon of ca. 0.214 s and liu\_attitude\_2022 use a horizon of 0.6 s in simulation time, which is in the same order of magnitude.
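
For reference, a generic multi-step target and the conversion of the horizon into simulation time can be written as follows. This is a textbook formulation with hypothetical function names. It ignores episode termination within the horizon and is not taken from the MR.Q implementation.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of H_Q rewards plus the discounted bootstrap value."""
    target = bootstrap_value
    # Work backwards: target_t = r_t + gamma * target_{t+1}.
    for r in reversed(rewards):
        target = r + gamma * target
    return target


def horizon_in_seconds(horizon_steps, control_frequency_hz):
    """Length of the multi-step horizon in simulation time."""
    return horizon_steps / control_frequency_hz
```

With these helpers, `horizon_in_seconds(3, 14.0)` gives approximately 0.214 s and `horizon_in_seconds(30, 50.0)` gives 0.6 s, which reproduces the comparison made above.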

## Appendix H Code

Although we cannot make the code for the simulation and experiments publicly available, the source code for all reinforcement learning algorithms used in our experiments is available at [https://github.com/mlaux1/rl-blox](https://github.com/mlaux1/rl-blox) (version 0.5.7, JAX version 0.7.0, flax version 0.11.1).

For statistical comparison of RL algorithms, we rely on the open-source library rliable of Agarwal2021\_statistical\_precipe, available at [https://github.com/google-research/rliable](https://github.com/google-research/rliable).

## Appendix I Computation Time

One run with 10 million time steps in the single task setting takes about 48 h with MR.Q and TD7, 24 h with TD3, and 30 h with SAC. In total, we analyzed 10 training seeds and two control architectures (pure RL and additive hybrid) for each algorithm, resulting in $(48+48+24+30)\times 10\times 2=3000$ hours of wall-clock training time to generate the learning curves. However, the runs could be parallelized. In addition, we used another script to run the performance evaluation of the stored checkpoints.

## Appendix J Application-Specific Performance Analysis Under Nominal Conditions

### J.1 Tracking Performance Over Full Trajectory

![Refer to caption](https://arxiv.org/html/2606.31291v1/x14.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x15.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x16.png)

Figure 8: Comparison of commanded and measured aerodynamic angles for pure RL controller based on MR.Q trained under nominal conditions and baseline controller.

### J.2 Tracking Performance at Critical Points

![Refer to caption](https://arxiv.org/html/2606.31291v1/x17.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x18.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x19.png)

Figure 9: Comparison of commanded and measured aerodynamic angles for the pure RL controller based on MR.Q trained under nominal conditions and the baseline controller, zoomed in at critical points of the trajectory.

### J.3 Commands

![Refer to caption](https://arxiv.org/html/2606.31291v1/x20.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x21.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x22.png)

Figure 10: Control commands for the pure RL controller based on MR.Q trained under nominal conditions and the baseline controller.

### J.4 Error Distribution

![Refer to caption](https://arxiv.org/html/2606.31291v1/x23.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x24.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x25.png)

Figure 11: Distributions of aerodynamic angle errors. The results are obtained for the best policy and 10 evaluation seeds. A kernel density estimation was applied to the sampled errors. Note that density is plotted on a logarithmic scale to highlight the tails. The peak of the baseline controller around 0 is more pronounced on a linear scale.

### J.5 Distribution of Control Cost

![Refer to caption](https://arxiv.org/html/2606.31291v1/x26.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x27.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x28.png)

Figure 12: Distributions of control commands. The results are obtained for the best policy and 10 evaluation seeds. A kernel density estimation was applied to the sampled control commands. Note that density is plotted on a logarithmic scale to highlight the tails. The peak of the baseline controller around 0 is more pronounced on a linear scale.

## Appendix K Error Percentiles After Training With Dynamics Randomization

![Refer to caption](https://arxiv.org/html/2606.31291v1/x29.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x30.png)

![Refer to caption](https://arxiv.org/html/2606.31291v1/x31.png)

Figure 13: Distribution of aerodynamic angle errors evaluated over 100 test contexts. The overlapping bars correspond to the following percentiles of the distribution: 1, 2, 5, 10, 20, 80, 90, 95, 98, 99.

## Appendix L Domain-Specific Performance Analysis

Table 10: Evaluation of robustness of controllers with domain-specific metrics. Bold indicates best values and italics indicate values better than the baseline (only if success rate is > 0%). Appendix [K](https://arxiv.org/html/2606.31291#A11) contains more details on the error distributions.

Part 1: $\lvert e_\alpha \rvert$ [deg], $\lvert e_\beta \rvert$ [deg], and $\lvert e_\mu \rvert$ [deg].

| Training | Seed | Success | $\lvert e_\alpha \rvert$ Median | $\lvert e_\alpha \rvert$ 90-Perc. | $\lvert e_\alpha \rvert$ 95-Perc. | $\lvert e_\alpha \rvert$ 98-Perc. | $\lvert e_\beta \rvert$ Median | $\lvert e_\beta \rvert$ 90-Perc. | $\lvert e_\beta \rvert$ 95-Perc. | $\lvert e_\beta \rvert$ 98-Perc. | $\lvert e_\mu \rvert$ Median | $\lvert e_\mu \rvert$ 90-Perc. | $\lvert e_\mu \rvert$ 95-Perc. | $\lvert e_\mu \rvert$ 98-Perc. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | - | 79% | **0.05** | 0.46 | 0.67 | 2.96 | **0.02** | **0.20** | **0.36** | **0.57** | **0.06** | 0.72 | 1.43 | 2.35 |
| DR (Add.) | 0 | *93%* | **0.05** | **0.23** | **0.34** | *0.59* | 0.07 | 0.25 | 0.41 | 0.65 | 0.17 | **0.42** | **0.59** | **0.87** |
| DR (Add.) | 1 | *93%* | 0.07 | *0.37* | *0.56* | *0.83* | 0.07 | 0.30 | 0.53 | 0.85 | 0.15 | *0.60* | *0.84* | *1.11* |
| DR (Add.) | 2 | *94%* | 0.07 | *0.33* | *0.47* | *0.71* | 0.11 | 0.39 | 0.60 | 0.84 | 0.17 | *0.60* | *0.84* | *1.19* |
| DR | 0 | **100%** | 0.12 | 0.47 | *0.63* | *0.86* | 0.09 | 0.45 | 0.88 | 1.87 | 0.23 | 1.14 | 1.63 | *2.19* |
| DR | 1 | *99%* | 0.10 | *0.45* | *0.62* | *0.84* | 0.10 | 0.53 | 1.06 | 2.03 | 0.34 | 1.31 | 1.82 | 2.57 |
| DR | 2 | *98%* | 0.10 | *0.39* | *0.55* | *0.80* | 0.09 | 0.49 | 0.95 | 1.76 | 0.25 | *0.69* | *0.98* | *1.49* |
| RR | 0 | *98%* | 0.15 | *0.45* | *0.58* | *0.75* | 0.20 | 0.68 | 0.96 | 1.42 | 0.30 | 1.21 | 1.52 | *1.98* |
| RR | 1 | *96%* | 0.25 | 0.71 | 0.82 | *0.95* | 0.26 | 0.97 | 1.21 | 1.70 | 0.57 | 1.37 | 1.78 | 2.36 |
| RR | 2 | *95%* | 0.10 | *0.33* | *0.43* | **0.58** | 0.15 | 0.46 | 0.73 | 1.27 | 0.26 | 0.87 | *1.11* | *1.35* |
| SMT | 0 | 0% | 0.05 | 0.27 | 0.43 | 0.67 | 0.23 | 0.76 | 1.24 | 1.80 | 0.40 | 0.94 | 1.48 | 2.57 |
| SMT | 1 | 0% | 0.08 | 0.43 | 0.62 | 1.00 | 0.18 | 1.07 | 1.79 | 2.90 | 0.28 | 2.04 | 2.90 | 3.62 |
| SMT | 2 | 0% | 0.17 | 0.56 | 0.90 | 1.36 | 0.23 | 1.13 | 1.78 | 2.96 | 0.38 | 1.60 | 2.28 | 3.52 |
| AMT-M | 0 | *94%* | 0.15 | *0.45* | *0.59* | *0.76* | 0.23 | 0.84 | 1.27 | 1.85 | 0.49 | 1.43 | 1.89 | 2.73 |
| AMT-M | 1 | **100%** | 0.17 | 0.57 | 0.75 | *1.01* | 0.21 | 0.75 | 1.10 | 1.59 | 0.40 | 1.14 | 1.64 | *2.20* |
| AMT-M | 2 | *94%* | 0.15 | 0.49 | 0.67 | *0.96* | 0.19 | 0.92 | 1.35 | 1.99 | 0.52 | 1.41 | 2.08 | 2.90 |
| AMT-B | 0 | *96%* | 0.08 | *0.32* | *0.43* | *0.59* | 0.13 | 0.46 | 0.78 | 1.39 | 0.33 | 0.85 | *1.04* | *1.41* |
| AMT-B | 1 | *96%* | 0.11 | *0.38* | *0.51* | *0.74* | 0.17 | 0.72 | 1.10 | 1.69 | 0.30 | 1.17 | 1.71 | *2.27* |
| AMT-B | 2 | *99%* | 0.10 | *0.37* | *0.50* | *0.66* | 0.20 | 0.62 | 0.91 | 1.41 | 0.36 | 1.19 | 1.51 | *1.95* |

Part 2: $\lvert \Delta\delta_{e,max} \rvert$ [deg], $\lvert \Delta\delta_{a,max} \rvert$ [deg], and $\lvert \tau_z \rvert$ [N].

| Training | Seed | Success | $\lvert \Delta\delta_{e,max} \rvert$ Median | $\lvert \Delta\delta_{e,max} \rvert$ 90-Perc. | $\lvert \Delta\delta_{e,max} \rvert$ 95-Perc. | $\lvert \Delta\delta_{e,max} \rvert$ 98-Perc. | $\lvert \Delta\delta_{a,max} \rvert$ Median | $\lvert \Delta\delta_{a,max} \rvert$ 90-Perc. | $\lvert \Delta\delta_{a,max} \rvert$ 95-Perc. | $\lvert \Delta\delta_{a,max} \rvert$ 98-Perc. | $\lvert \tau_z \rvert$ Median | $\lvert \tau_z \rvert$ 90-Perc. | $\lvert \tau_z \rvert$ 95-Perc. | $\lvert \tau_z \rvert$ 98-Perc. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | - | 79% | **0.00** | **0.01** | **0.01** | **0.06** | **0.00** | **0.01** | **0.03** | **0.06** | **0.31** | **16.11** | 71.70 | 98.1 |
| DR (Add.) | 0 | *93%* | 0.00 | 0.10 | 0.21 | 0.37 | 0.00 | 0.07 | 0.12 | 0.21 | 5.49 | 18.25 | *53.15* | *89.0* |
| DR (Add.) | 1 | *93%* | 0.02 | 0.26 | 0.43 | 0.69 | 0.01 | 0.13 | 0.22 | 0.38 | 8.46 | 25.73 | *50.76* | *88.2* |
| DR (Add.) | 2 | *94%* | 0.00 | 0.21 | 0.38 | 0.66 | 0.00 | 0.15 | 0.26 | 0.43 | 6.24 | 31.23 | *60.88* | *93.0* |
| DR | 0 | **100%** | 0.02 | 0.43 | 0.56 | 0.71 | 0.01 | 0.27 | 0.42 | 0.63 | 6.15 | 23.85 | **33.44** | **51.1** |
| DR | 1 | *99%* | 0.01 | 0.39 | 0.60 | 0.84 | 0.00 | 0.21 | 0.34 | 0.54 | 6.28 | 30.23 | *49.73* | *72.8* |
| DR | 2 | *98%* | 0.02 | 0.48 | 0.66 | 0.86 | 0.01 | 0.26 | 0.46 | 0.88 | 4.95 | 23.49 | *39.33* | *67.0* |
| RR | 0 | *98%* | 0.09 | 0.56 | 0.68 | 0.83 | 0.04 | 0.31 | 0.46 | 0.65 | 10.95 | 37.90 | *57.59* | 100.6 |
| RR | 1 | *96%* | 0.13 | 0.66 | 0.79 | 0.92 | 0.09 | 0.50 | 0.64 | 0.78 | 15.56 | 56.70 | 75.46 | 107.0 |
| RR | 2 | *95%* | 0.06 | 0.55 | 0.75 | 0.95 | 0.04 | 0.36 | 0.48 | 0.63 | 10.26 | 37.39 | *55.20* | *85.7* |
| SMT | 0 | 0% | 0.04 | 0.85 | 1.01 | 1.06 | 0.02 | 0.32 | 0.54 | 0.93 | 3.31 | 53.46 | 76.66 | 126.8 |
| SMT | 1 | 0% | 0.08 | 0.70 | 0.84 | 0.96 | 0.03 | 0.47 | 0.71 | 1.06 | 3.23 | 47.38 | 78.06 | 127.8 |
| SMT | 2 | 0% | 0.16 | 0.80 | 0.90 | 0.98 | 0.14 | 1.06 | 1.07 | 1.07 | 17.23 | 69.22 | 102.72 | 152.5 |
| AMT-M | 0 | *94%* | 0.03 | 0.35 | 0.55 | 0.82 | 0.03 | 0.37 | 0.53 | 0.72 | 15.22 | 47.83 | *71.20* | 115.6 |
| AMT-M | 1 | **100%** | 0.05 | 0.47 | 0.60 | 0.76 | 0.04 | 0.33 | 0.45 | 0.61 | 13.24 | 54.50 | 74.81 | 103.8 |
| AMT-M | 2 | *94%* | 0.21 | 0.77 | 0.91 | 1.03 | 0.14 | 0.67 | 0.91 | 1.06 | 14.74 | 64.18 | 90.91 | 139.2 |
| AMT-B | 0 | *96%* | 0.01 | 0.30 | 0.43 | 0.58 | 0.01 | 0.25 | 0.40 | 0.62 | 7.69 | 30.31 | *49.38* | *89.6* |
| AMT-B | 1 | *96%* | 0.12 | 0.65 | 0.79 | 0.95 | 0.08 | 0.42 | 0.56 | 0.77 | 10.37 | 47.69 | *71.30* | 107.5 |
| AMT-B | 2 | *99%* | 0.11 | 0.62 | 0.78 | 0.96 | 0.05 | 0.34 | 0.50 | 0.72 | 16.36 | 51.86 | *69.21* | 100.8 |

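
Table 10 summarizes each metric by its median and its 90th, 95th, and 98th percentiles. As a minimal sketch of how such a summary can be computed from sampled errors, the following code uses linear interpolation between order statistics. The interpolation rule, the function names, and the synthetic error samples are assumptions for illustration; the evaluation code of the paper is not available in this section.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("empty sample")
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize_abs_error(errors_deg, levels=(50, 90, 95, 98)):
    """Median and upper percentiles of the absolute error, keyed by percentile level."""
    abs_errors = [abs(e) for e in errors_deg]
    return {q: percentile(abs_errors, q) for q in levels}


# Synthetic signed errors with absolute values 0.00, 0.01, ..., 1.00 (illustration only).
errors = [(-1) ** i * i / 100.0 for i in range(101)]
summary = summarize_abs_error(errors)
print(summary)
```

Reporting upper percentiles in addition to the median exposes rare but large deviations that the median alone would hide.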
## Appendix M Individual Learning Curves Under Nominal Conditions

Plots throughout the paper show the IQM of the returns, which removes outliers, e.g., failed runs. Here we plot individual learning curves for training under nominal conditions without outlier filtering (experiments from Section [4.1.1](https://arxiv.org/html/2606.31291#S4.SS1.SSS1)). For each training run, we average the performance over ten evaluation episodes. For comparison, we plot the performance of the baseline controller, no control, and the IQM of all training runs. Individual learning curves for MR.Q indicate that the training process is slightly unstable, but all runs surpass the baseline controller at some point.
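
To illustrate why the IQM hides failed runs while individual curves reveal them, the sketch below compares the plain mean with the interquartile mean, that is, the mean of the middle 50% of the scores. It is a self-contained illustration that does not call rliable; the function name and the return values are hypothetical.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of the scores (25% trimmed from each end)."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("empty sample")
    cut = int(0.25 * n)  # number of values dropped from each end
    kept = ordered[cut:n - cut]
    return sum(kept) / len(kept)


# Hypothetical final returns of eight training runs; the last run failed.
returns = [9800.0, 10100.0, 10300.0, 10400.0, 10500.0, 10600.0, 10700.0, 200.0]

mean_return = sum(returns) / len(returns)
iqm_return = interquartile_mean(returns)
print(mean_return, iqm_return)
```

In this example the single failed run pulls the mean down considerably, whereas the IQM discards it together with the other extreme values.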

![Refer to caption](https://arxiv.org/html/2606.31291v1/x32.png)

Figure 14: Individual learning curves of MR.Q (Only RL).

![Refer to caption](https://arxiv.org/html/2606.31291v1/x33.png)

Figure 15: Individual learning curves of TD7 (Only RL).

![Refer to caption](https://arxiv.org/html/2606.31291v1/x34.png)

Figure 16: Individual learning curves of TD3 (Only RL).

![Refer to caption](https://arxiv.org/html/2606.31291v1/x35.png)

Figure 17: Individual learning curves of SAC (Only RL).

![Refer to caption](https://arxiv.org/html/2606.31291v1/x36.png)

Figure 18: Individual learning curves of MR.Q (Additive Hybrid).

![Refer to caption](https://arxiv.org/html/2606.31291v1/x37.png)

Figure 19: Individual learning curves of TD7 (Additive Hybrid).

![Refer to caption](https://arxiv.org/html/2606.31291v1/x38.png)

Figure 20: Individual learning curves of TD3 (Additive Hybrid).

![Refer to caption](https://arxiv.org/html/2606.31291v1/x39.png)

Figure 21: Individual learning curves of SAC (Additive Hybrid).

![Refer to caption](https://arxiv.org/html/2606.31291v1/x40.png)

Figure 22: Individual learning curves of MR.Q (Gain-Scheduling Hybrid).
