Temporal Attention for Adaptive Control of Euler-Lagrange Systems with Unobservable Memory

arXiv cs.LG Papers

Summary

This paper proposes a meta-control architecture using temporal self-attention for adaptive control of Euler-Lagrange systems with unobservable memory states. It demonstrates improved tracking performance over baseline methods on a 2-DOF manipulator while identifying failure modes in long-memory regimes.

arXiv:2605.06877v1 Announce Type: new Abstract: Adaptive control of Euler-Lagrange systems is challenging when friction is governed by a finite-horizon internal state that is not directly observable from joint measurements. In this setting, the measured closed-loop state is no longer Markovian, and standard certainty-equivalence adaptive laws may lose their convergence guarantees. The paper proposes a meta-control architecture in which the gains of a computed-torque controller are generated by a self-attention block processing a short window of recent motion history. The number of attention heads is selected before policy training through a surrogate analysis of the autocovariance of the memory-state gradient along the temporal window. This surrogate is based on a temporal adaptation of an incremental rank-tracking framework previously developed by the authors. The selected head count is then fixed and used as an architectural hyperparameter in a reinforcement-learning stage, where the policy is trained under a shielded admissibility constraint. The approach is tested on a 2-DOF manipulator with nonlinear friction and variable payload. In the short and matched memory regimes, the single-layer attention-only meta-controller outperforms a deeper Transformer baseline, with tracking-error reductions of 12 and 19 percentage points, respectively. The reported effect sizes are large, with d approximately -1.1 and -2.1, and Mann-Whitney p < 0.05 in both cases. In the long memory regime, however, the advantage disappears. Four out of ten training runs show either divergence or payload-invariant policy collapse, revealing a weakness in the static Phase-1 head-count prescription. This motivates moving rank-tracking inside the reinforcement-learning loop, allowing attention heads to be pruned or grown at runtime instead of fixed before training.

# Temporal Attention for Adaptive Control of Euler–Lagrange Systems with Unobservable Memory
Source: [https://arxiv.org/html/2605.06877](https://arxiv.org/html/2605.06877)
(April 2026)

###### Abstract

Adaptive control of Euler–Lagrange systems becomes delicate when the friction dynamics are driven by an internal state that decays over a finite horizon but is not directly observable from joint measurements. In such a regime the closed-loop response is no longer Markovian in the measured state, and standard certainty-equivalence adaptive laws lose their convergence guarantees.

The present paper proposes a meta-control architecture in which the gains of a computed-torque law are produced by a self-attention block that reads a short window of recent motion history. The head count of the attention block is selected prior to policy training, by a surrogate analysis of the auto-covariance of the memory-state gradient along its temporal window; the surrogate is a temporal adaptation of an incremental rank-tracking framework developed in earlier work by the authors. The resulting head count is passed as a fixed architectural hyperparameter to a reinforcement-learning stage, where the policy is trained under a shielded admissibility constraint inherited from a companion paper.

On a 2-DOF (two-degree-of-freedom) manipulator with nonlinear friction and variable payload, a single-layer attention-only meta-controller at the selected head count delivers a statistically significant advantage over a deeper Transformer baseline at the short and matched memory regimes: tracking-error reductions of 12 and 19 percentage points respectively, with large effect sizes ($d \approx -1.1$ and $-2.1$) and Mann–Whitney $p < 0.05$ in both cases. At the long memory regime the advantage vanishes on a larger sample and four of ten training runs exhibit either divergence or payload-invariant policy collapse, identifying a failure mode specific to the interface between the static Phase-1 head-count prescription and the reinforcement-learning optimisation. The failure-mode analysis motivates a follow-up in which the rank-tracking dynamics are moved inside the reinforcement-learning loop, with runtime pruning and growth of attention heads replacing the static Phase-1/Phase-2 separation.

Keywords: adaptive control; Euler–Lagrange systems; friction; reinforcement learning; self-attention; neural architecture search; Lyapunov safety.

## 1 Introduction

Rigid-body manipulators tracking a reference trajectory under velocity-dependent friction constitute one of the oldest testbeds of adaptive control. When the friction can be written as a known regressor times an unknown parameter vector, the passivity-based adaptive law of Slotine and Li [[17](https://arxiv.org/html/2605.06877#bib.bib1)] delivers asymptotic tracking by combining a certainty-equivalence feed-forward with a gradient-style update on the parameter estimate. The argument turns on two structural properties: linearity of the dynamics in the unknown parameters, and positive realness of a certain closed-loop operator. Both are preserved for viscous and Coulomb friction. Neither survives when a pre-sliding internal state is introduced, as in the Stribeck and LuGre models [[4](https://arxiv.org/html/2605.06877#bib.bib4), [5](https://arxiv.org/html/2605.06877#bib.bib3)]: the state carries its own dynamics, decays over a finite horizon, and is not recoverable instantaneously from the joint kinematics. The closed-loop system is then no longer Markovian in the measured state and the standard adaptive construction no longer applies.

A companion paper by the present authors [[7](https://arxiv.org/html/2605.06877#bib.bib20)] addresses the same regime through safe residual reinforcement learning (RL). A computed-torque baseline is augmented by a Soft Actor-Critic (SAC) policy [[12](https://arxiv.org/html/2605.06877#bib.bib10)] that outputs a bounded torque correction, while a learned Lyapunov function enforces a decrease condition on every control step by closed-form projection. The construction is empirically effective but structurally limits the reinforcement-learning agent in two respects. First, the action space is the torque vector, which is higher-dimensional than the parametric controller it augments. Second, the agent observes only the instantaneous state: the temporal structure of the unobservable memory is not exposed to the policy.

Both limitations are addressed here by a single design choice: the reinforcement-learning policy is lifted to the level of the controller parameters, and its input is extended to a windowed history of recent motion. The policy is therefore not a torque but a mapping from history to gains; the controller it produces is memoryful through the window, even though its functional form remains the computed-torque law. The architecture of this mapping is the subject of the paper.

The central architectural question is the number of attention heads that should process the history window. Too few heads and the representation cannot resolve the temporal structure of the memory; too many and the reinforcement-learning optimisation operates in a higher-dimensional parameter space with no corresponding benefit. The proposed answer constructs a surrogate covariance operator from the gradient of the memory state along its temporal window, verifies that its effective rank is a compressed-sensing upper bound on the representational capacity required, and uses the resulting head count as a fixed architectural hyperparameter in the reinforcement-learning stage. The surrogate analysis itself is a temporal adaptation of the incremental rank-tracking framework of [[8](https://arxiv.org/html/2605.06877#bib.bib18)] and is developed in Section [6](https://arxiv.org/html/2605.06877#S6). The reinforcement-learning stage and its shielded admissibility machinery are inherited, without modification, from [[7](https://arxiv.org/html/2605.06877#bib.bib20)]. The separation between the two stages follows the search-then-retrain pattern common in neural architecture search (NAS) [[13](https://arxiv.org/html/2605.06877#bib.bib11), [16](https://arxiv.org/html/2605.06877#bib.bib12), [11](https://arxiv.org/html/2605.06877#bib.bib13)]; the adoption of that pattern is justified in Section [6.3](https://arxiv.org/html/2605.06877#S6.SS3).

##### Contributions.

The paper contributes three elements. First, a quantitative lower bound on the tracking error achievable by any meta-controller whose input is Markovian in the measured state. The bound scales with the steady-state variance of the unobservable memory and vanishes as the window length exceeds the memory horizon (Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12)). The bound is informal in the sense that its hypotheses, rather than being structural, reflect the regularity assumed of the cost and of the memory dynamics; the assumptions are made explicit alongside the statement. Second, a temporal residual operator whose effective rank upper-bounds the attention head count required to represent the memory (Proposition [5.15](https://arxiv.org/html/2605.06877#S5.Thmtheorem15)). The operator is constructed from samples of the gradient of the memory along the window and is independent of the reinforcement-learning objective; its effective rank is non-monotonic in the memory horizon, peaking when the horizon matches the window-time product (Corollary [5.16](https://arxiv.org/html/2605.06877#S5.Thmtheorem16)). Third, an experimental evaluation on a 2-DOF manipulator with Stribeck friction. With the head count fixed at the surrogate value, a single-layer attention-only meta-controller improves tracking error by 12–19 percentage points over a deeper Transformer baseline at the short and matched memory regimes, with large effect sizes ($d \approx -1.1$ and $d \approx -2.1$) and Mann–Whitney $p < 0.05$ in both cases. The advantage is not attributable to window-size tuning: a regime-matched Transformer baseline with window equal to that of the INCRT-1L winner performs indistinguishably from the original Transformer baseline, and INCRT-1L retains a large advantage over it ($d = -2.02$, $p_U = 0.008$).
At the long memory regime the advantage vanishes on an $n = 10$ sample and four of ten INCRT-1L training runs exhibit either divergence or a payload-invariant collapse, against at most one of ten for either Transformer variant; this failure-mode asymmetry identifies a limitation of the static Phase-1/Phase-2 pipeline at long horizons. The ablation that accompanies the main comparison identifies two further empirical effects: depth without a feed-forward non-linearity causes training to diverge, and the optimal window size is not monotone in the memory horizon; both are consistent with the rank-compression analysis of [[9](https://arxiv.org/html/2605.06877#bib.bib19)].
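The comparisons above rest on Cohen's $d$ and the Mann–Whitney $U$ test over per-run tracking errors. A minimal sketch of how such statistics are computed, on synthetic placeholder data rather than the paper's runs:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation (negative when a < b)."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1)
                      + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

# Synthetic per-run mean tracking errors (placeholders, not the paper's data):
rng = np.random.default_rng(0)
incrt = rng.normal(0.08, 0.01, size=10)        # proposed meta-controller
transformer = rng.normal(0.10, 0.01, size=10)  # deeper Transformer baseline

d = cohens_d(incrt, transformer)  # negative: lower error for the proposed method
u_stat, p_u = mannwhitneyu(incrt, transformer, alternative="two-sided")
print(f"d = {d:.2f}, Mann-Whitney p = {p_u:.4f}")
```

The Mann–Whitney test is the appropriate rank-based complement here because per-run errors from diverging training runs are far from normally distributed.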

##### What is not claimed.

The proofs of the two theoretical results are stated under explicit regularity assumptions in Section [5](https://arxiv.org/html/2605.06877#S5) and worked out in Appendix [A](https://arxiv.org/html/2605.06877#A1); they are not structural theorems in the sense of holding under minimal hypotheses. The parameter-efficiency of the proposed meta-controller relative to the Transformer baseline is not uniform: it is substantial at the shortest memory horizon, narrows at the intermediate horizon, and reverses at the longest. The advantage over the Transformer baseline at the long memory horizon, suggested by an $n = 5$ pilot study, does not persist in an $n = 10$ replication: the vanishing gap and the failure-mode cluster identified in Section [7.5](https://arxiv.org/html/2605.06877#S7.SS5) constitute the most informative single outcome of the study with respect to the limits of the approach. The long-horizon regime is therefore the setting in which the proposed method is least effective and in which the follow-up outlined in Section [8](https://arxiv.org/html/2605.06877#S8) is most likely to yield improvements.

##### Organisation.

Section [2](https://arxiv.org/html/2605.06877#S2) reviews the literatures at whose intersection the paper sits. Section [3](https://arxiv.org/html/2605.06877#S3) formalises the class of systems addressed and fixes notation. Section [4](https://arxiv.org/html/2605.06877#S4) describes the parameter-level meta-controller and its training loop. Section [5](https://arxiv.org/html/2605.06877#S5) presents the two theoretical results and their assumptions. Section [6](https://arxiv.org/html/2605.06877#S6) develops the surrogate head-count analysis and the separation between the architecture-selection and policy-training stages. Section [7](https://arxiv.org/html/2605.06877#S7) reports the experiments. Section [8](https://arxiv.org/html/2605.06877#S8) discusses the findings, the limitations, and the companion follow-up agenda. Proofs appear in Appendix [A](https://arxiv.org/html/2605.06877#A1) and the experimental protocol in Appendix [B](https://arxiv.org/html/2605.06877#A2).

## 2 Related work

This work intersects four literatures: adaptive control of Euler–Lagrange systems with friction, residual reinforcement learning, safe RL with Lyapunov guarantees, and sequence-model architectures applied to control tasks. The meta-controller we introduce borrows from each but sits in an underexplored region of their intersection. We review each line in turn, close with a short discussion of the relationship with neural architecture search for RL, and finish with a positioning table that summarises the differences with the closest prior works.

### 2.1 Adaptive control with friction

Classical adaptive control of Euler–Lagrange systems with velocity-dependent friction has a mature literature [[17](https://arxiv.org/html/2605.06877#bib.bib1)]. The dominant paradigm combines a computed-torque feedforward term with a parameter-estimation loop that updates friction coefficients online, usually under some persistency-of-excitation condition [[5](https://arxiv.org/html/2605.06877#bib.bib3), [4](https://arxiv.org/html/2605.06877#bib.bib4)]. The LuGre friction model [[5](https://arxiv.org/html/2605.06877#bib.bib3)] introduces a one-state internal variable to capture pre-sliding dynamics and stick–slip phenomena, and is the closest classical analogue of the unobservable memory state $z(t)$ treated here. Stribeck-curve friction [[4](https://arxiv.org/html/2605.06877#bib.bib4)] adds a velocity-dependent envelope that makes the memory dynamics nonlinear in $\dot q$. Observer-based friction compensation schemes, notably the family of dual-observer designs, reconstruct the internal state via an auxiliary dynamical system tuned to match the friction memory horizon.
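For concreteness, the standard single-state LuGre formulation (quoted from the friction literature rather than from this paper, with $v = \dot q$, bristle stiffness $\sigma_0$, micro-damping $\sigma_1$, and viscous coefficient $\sigma_2$) reads

$$\dot z = v - \frac{\sigma_0 \lvert v \rvert}{g(v)}\, z, \qquad g(v) = F_c + (F_s - F_c)\, e^{-(v/v_s)^2}, \qquad F = \sigma_0 z + \sigma_1 \dot z + \sigma_2 v,$$

where $g(v)$ is the Stribeck envelope interpolating between Coulomb level $F_c$ and static level $F_s$. The internal bristle state $z$ plays exactly the role of the unobservable memory variable in this paper.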

These methods provide strong stability guarantees in their regime of validity but require either an accurate parametric model of friction (LuGre) or a sufficiently rich excitation signal. Our approach replaces the parametric estimator with a learned windowed map from recent motion to controller gains, which does not require friction modelling but forfeits the closed-form convergence guarantees of adaptive control. The shielded admissibility framework of Section [5](https://arxiv.org/html/2605.06877#S5) partially recovers these guarantees by enforcing a Lyapunov-decrease constraint at the level of the meta-controller output.

### 2.2 Residual reinforcement learning

Residual reinforcement learning — using RL to learn a corrective term on top of a model-based baseline controller — was introduced in earlier work and has seen extensive application in robotic manipulation. The paradigm is attractive because the baseline controller handles the bulk of the dynamics and the RL policy handles only the unmodeled residual, giving sample-efficient training and safety during exploration. The parent work [[7](https://arxiv.org/html/2605.06877#bib.bib20)] adopts this paradigm with a computed-torque baseline and a Lyapunov-shielded SAC policy, and addresses the scalability issue that emerges at high-dimensional manipulators (shield activation reaching $\sim 70\%$).

Our work extends the residual RL paradigm in two directions. First, the RL policy operates at the meta-controller level (controller *parameters*) rather than the torque level, which is a more compact action space and which restructures the admissibility constraint into a convex polytope (Lemma [5.2](https://arxiv.org/html/2605.06877#S5.Thmtheorem2)). Second, the meta-controller consumes a windowed history rather than the instantaneous state, reflecting the unobservable-memory structure of the problem (Definition [5.10](https://arxiv.org/html/2605.06877#S5.Thmtheorem10)).

### 2.3 Safe RL with Lyapunov guarantees

Several strands of safe RL enforce stability or constraint satisfaction by coupling RL with classical control theory. Constrained Markov Decision Processes (CMDPs) [[2](https://arxiv.org/html/2605.06877#bib.bib5)] introduce a dual-multiplier formulation for constraints on cumulative cost, leading to the Lagrangian SAC variants used in this work [[1](https://arxiv.org/html/2605.06877#bib.bib6), [19](https://arxiv.org/html/2605.06877#bib.bib7)]. Control Barrier Functions (CBFs) [[3](https://arxiv.org/html/2605.06877#bib.bib9)] enforce a forward-invariant safe set via a quadratic program on the action, and have also been combined with RL. Lyapunov-based RL [[6](https://arxiv.org/html/2605.06877#bib.bib8)] constrains actions to preserve a prespecified (or learned) Lyapunov decrease condition — the family to which the parent work [[7](https://arxiv.org/html/2605.06877#bib.bib20)] and the present paper belong.

The specific contribution of the parent work is a learned Lyapunov function $V_\psi$ jointly trained with the shield and the policy. We inherit this object directly. What we add is the observation that the admissibility set derived from $V_\psi$, originally formulated at the torque level, becomes a convex polytope when lifted to the controller-parameter level under the affine-feedforward restriction (Remark [3.1](https://arxiv.org/html/2605.06877#S3.Thmtheorem1)). This restructuring gives Theorem [5.6](https://arxiv.org/html/2605.06877#S5.Thmtheorem6) and Proposition [5.8](https://arxiv.org/html/2605.06877#S5.Thmtheorem8), which together explain and resolve the shield-activation anomaly of the parent framework.

### 2.4 Meta-learning and learned optimisers

Meta-learning views learning itself as the target of optimisation, with an outer loop that adjusts parameters of an inner learning process. MAML and Reptile train initialisations that adapt rapidly to new tasks. Learned optimisers parameterise update rules with neural networks. In RL specifically, meta-RL frameworks learn policies that adapt online across tasks drawn from a fixed distribution. The meta-controller of the present paper could be viewed as a specialised meta-RL agent that outputs controller gains rather than actions; the meta-MDP structure of Section [4.4](https://arxiv.org/html/2605.06877#S4.SS4) is a compact instance of this view.

The main distinction from the meta-learning literature is scope: we are not attempting to adapt across a family of tasks with different dynamics; we are adapting across values of a single unobservable parameter ($z(t)$) within one physical task. This narrower scope permits the strong stability guarantees of Section [5](https://arxiv.org/html/2605.06877#S5) that generic meta-RL cannot provide.

### 2.5 Transformers and attention in robotics and control

Transformer-based architectures have entered robotics through three main routes. Decision Transformer and Trajectory Transformer recast offline RL as sequence modeling, predicting returns and actions from context trajectories. RT-1 and RT-2 scale this paradigm to large-scale manipulation demonstrations. Time-series forecasting with causal Transformers, popularised by Informer, TFT, and PatchTST, provides the architectural templates we adopt here: a causal multi-head attention layer over a history window, followed by a feature extractor.

The closest work in spirit uses a Transformer to extract features from recent sensor history for a model-predictive controller. Our approach differs in three ways: the attention block is used to parameterise a computed-torque controller rather than to predict actions directly; the training loop is reinforcement learning with a Lyapunov shield rather than supervised learning from demonstrations; and the architecture's head count is determined by a principled Phase 1 procedure (Section [6](https://arxiv.org/html/2605.06877#S6)) rather than fixed by convention.

The algebraic analysis of attention in [[9](https://arxiv.org/html/2605.06877#bib.bib19)] clarifies the interpretation of several design choices that appear in our ablation. In particular, the distinction between rank-preservation and anti-confinement [[9](https://arxiv.org/html/2605.06877#bib.bib19), Cor. 5.4] accounts for why a residual connection around multi-head attention materially changes the feature extractor's output range — an observation that will return in the experimental section.

### 2.6 Recurrent alternatives and sequence models

The choice of attention over recurrent sequence models (LSTM, GRU) is not self-evident for control tasks. Recurrent architectures were the dominant sequence model in early deep-RL work. They compress an arbitrarily long history into a fixed-size hidden state, which is attractive for continuous control because it decouples memory from computation per step.

The main theoretical argument against recurrent compression in the setting of this paper is coarse measurability: a recurrent policy is a $\sigma$-measurable function of the hidden-state trajectory, which is strictly coarser than the full history window accessible to an attention-based policy. Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12) (ii) formalises this argument and predicts that recurrent compression cannot close the Markovian gap at finite capacity. This theoretical expectation is tested empirically in Section [7](https://arxiv.org/html/2605.06877#S7), where both a small and a large GRU underperform even the smallest memoryless MLP across all values of $\tau_z$.

### 2.7 Neural architecture search for RL

Differentiable architecture search [[13](https://arxiv.org/html/2605.06877#bib.bib11), [16](https://arxiv.org/html/2605.06877#bib.bib12)] has been applied to RL primarily in the image-observation regime [[15](https://arxiv.org/html/2605.06877#bib.bib14), [11](https://arxiv.org/html/2605.06877#bib.bib13)], where the convolutional backbone dominates the parameter count. The search-then-retrain pattern is the default in this literature: a surrogate search task produces a discrete architecture, which is then trained from scratch on the target RL task.

Our use of INCRT in Section [6](https://arxiv.org/html/2605.06877#S6) follows the search-then-retrain pattern but with two distinctions. First, the surrogate task is specific to the temporal-memory structure of the system (reconstruction of $z(t)$ from history), not a generic reward-prediction proxy as in most differentiable NAS. Second, the output of the search is a single scalar (the head count $K^\star$) rather than a full architectural graph; this reflects the constrained hypothesis class we adopt (attention over a fixed-size window with fixed total capacity), and avoids the known instability of DARTS [[13](https://arxiv.org/html/2605.06877#bib.bib11)] on small search tasks. The INCRT procedure itself is a principled growth-pruning rule rooted in a compressed-sensing bound [[8](https://arxiv.org/html/2605.06877#bib.bib18)], rather than an empirical gradient-based search, and carries a formal convergence guarantee.
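The INCRT procedure itself is not reproduced in this section, but the idea of reading a head count off an effective rank can be sketched with one common scalar surrogate, the participation ratio of the covariance eigenvalues. The function name, the synthetic gradient samples, and the ceiling rule below are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def effective_rank(samples):
    """Participation ratio (sum lambda)^2 / sum lambda^2 of the sample-covariance
    eigenvalues: a common scalar surrogate for how many directions the data
    occupies. `samples` holds one gradient sample per row."""
    cov = np.cov(samples, rowvar=False)
    eig = np.linalg.eigvalsh(cov)
    eig = np.clip(eig, 0.0, None)  # guard tiny negative round-off
    return eig.sum() ** 2 / (eig ** 2).sum()

# Hypothetical gradient-of-memory samples along a window of W = 8 steps,
# constructed to span only two temporal directions:
rng = np.random.default_rng(1)
low_rank = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 8))

er = effective_rank(low_rank)      # bounded above by the true rank (here 2)
K_star = int(np.ceil(er))          # read a head count off the surrogate
print(er, K_star)
```

For genuinely low-rank gradient statistics the participation ratio stays near the subspace dimension, while isotropic samples drive it toward the full window width; a thresholded or ceiled version of this scalar is one plausible reading of how a single head count could be extracted before policy training.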

### 2.8 Positioning

Table [1](https://arxiv.org/html/2605.06877#S2.T1) summarises the differences between the proposed meta-controller and the closest existing works along four criteria: the level at which the RL policy operates (torque, action, or controller parameters), whether the formulation includes a formal Lyapunov-based stability guarantee, whether the architecture is determined by a principled procedure, and whether the sequence model exploits a windowed temporal history.

Table 1: Positioning of the proposed method relative to related works. *Policy level*: torque-level (T), action-level (A, discrete or continuous), or controller-parameter level (P). *Lyapunov*: the formulation includes a formal stability result built on a (learned or designed) Lyapunov function. *Arch. determination*: the architecture is selected by a principled procedure (rather than by engineering heuristic). *Windowed temporal*: the policy consumes a fixed-size history window of observations, not a single timestep nor an entire trajectory. CBF-RL stands for RL with control barrier functions; Transformer-MPC stands for Transformer-based model predictive control.

To our knowledge the conjunction of all four properties — parameter-level RL, Lyapunov-shielded training, principled architecture determination, and windowed temporal attention — does not appear elsewhere in the literature.

## 3 Problem setup

This section formalises the class of systems addressed by the paper and fixes the notation used in Sections [4](https://arxiv.org/html/2605.06877#S4)–[7](https://arxiv.org/html/2605.06877#S7).

### 3.1 Euler–Lagrange systems with unobservable memory

We consider rigid-body manipulators with $n$ degrees of freedom described by the Euler–Lagrange equation

$$M(q)\,\ddot q + C(q,\dot q)\,\dot q + G(q) + F(q,\dot q,z) = \tau, \qquad (1)$$
where $q \in \mathbb{R}^n$ are generalised coordinates, $M(q) \succ 0$ is the mass matrix, $C(q,\dot q)\,\dot q$ collects Coriolis and centripetal terms, $G(q)$ is the gravity vector, and $\tau \in \mathbb{R}^n$ is the joint torque. The vector $F(q,\dot q,z)$ models non-conservative effects — in particular, friction — and depends on an internal state $z \in \mathbb{R}^{n_z}$ that is not directly measured. The evolution of $z$ is governed by its own dynamics

$$\dot z = \zeta(q,\dot q,z), \qquad (2)$$
which is not a function of $\tau$ directly but of the motion of the arm. In accordance with Definition [5.10](https://arxiv.org/html/2605.06877#S5.Thmtheorem10), we require that (i) $z$ be not instantaneously recoverable from $(q,\dot q)$ and (ii) the decay time-constant of $\zeta$, the *memory horizon* $H_z = \|\partial\zeta/\partial z\|^{-1}$, be finite.

##### Running example: Stribeck friction.

Throughout Sections [4](https://arxiv.org/html/2605.06877#S4)–[7](https://arxiv.org/html/2605.06877#S7) we take $F(q,\dot q,z)$ to be the Stribeck friction model

$$F_s(\dot q, z) = F_c + \bigl(F_s^{\max} - F_c\bigr)\,\exp\!\bigl(-(\dot q/v_s)^2\bigr)\,\mathrm{sign}(\dot q) + \sigma\dot q + z, \qquad (3)$$
with internal state dynamics

$$\dot z = -z/\tau_z + \lambda_z\,\dot q. \qquad (4)$$
The parameter $\tau_z$ sets the memory horizon of the friction system and is the primary sweep axis of the experimental evaluation. The Stribeck construction is a canonical surrogate for a broader class of unobservable-memory phenomena in manipulation, including joint elasticity, soft-contact hysteresis, and damper dynamics.
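The memory dynamics (4) can be integrated directly to see the role of $\tau_z$. A minimal forward-Euler sketch, with illustrative parameter values rather than the paper's:

```python
# Forward-Euler simulation of the memory state z' = -z/tau_z + lambda_z * qdot
# (Eq. 4). Parameter values are illustrative, not taken from the paper.
tau_z, lam_z, dt = 0.05, 1.0, 0.001   # memory horizon [s], drive gain, step [s]
steps = int(5 * tau_z / dt)           # simulate five memory horizons

z = 1.0        # perturbed initial memory state
qdot = 0.0     # arm at rest, so z should relax exponentially to zero
traj = []
for _ in range(steps):
    z += dt * (-z / tau_z + lam_z * qdot)
    traj.append(z)

# After ~5 tau_z the perturbation has decayed to roughly e^{-5} of its
# initial value, i.e. the disturbance is "forgotten".
print(traj[-1])
```

With $\dot q \neq 0$ the same loop shows the second role of (4): sustained motion continually re-excites $z$, which is why the state cannot be treated as a fixed disturbance.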

##### Intuition: memory horizon, window size, and the matched regime.

The parameter $\tau_z$ admits a simple physical reading: it is the time constant over which a perturbation to the internal friction state decays. At small $\tau_z$, a disturbance in $z$ is forgotten within a few control steps and the closed-loop system is nearly Markovian in $(q,\dot q)$; an instantaneous controller suffices. At large $\tau_z$, a disturbance persists over many control steps and a controller that ignores history tracks a lagged ghost of its own past actions. The design response is to give the controller a window of recent observations, of length $W$ control steps. Whether that window is adequate depends on the ratio between the memory horizon $\tau_z$ and the window time $W\Delta t$: when the two are of the same order, the window covers one relaxation time of the memory and the window-based controller has access to the full history of the disturbance; when $W\Delta t$ is smaller, the window covers only a fraction of the memory and information is lost; when $W\Delta t$ is much larger, the window spans many decorrelated memory cycles and the additional tokens are redundant. The intermediate case $\tau_z \approx W\Delta t$ is the *matched regime*, and is the regime at which the theoretical analysis of Section [5](https://arxiv.org/html/2605.06877#S5) is most nearly tight and at which the experiments of Section [7](https://arxiv.org/html/2605.06877#S7) show the clearest advantage of the proposed meta-controller.
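The three regimes above reduce to one dimensionless ratio. A small sketch of that classification; the factor-of-3 thresholds are an illustrative assumption, since the paper does not state its regime boundaries here:

```python
# Classify the memory regime by the ratio between the memory horizon tau_z
# and the window time W*dt. The factor-of-3 thresholds are illustrative.
def memory_regime(tau_z, W, dt):
    ratio = tau_z / (W * dt)
    if ratio < 1 / 3:
        return "short"    # memory is forgotten well inside the window
    if ratio <= 3:
        return "matched"  # window covers ~one relaxation time of the memory
    return "long"         # window sees only a fraction of the memory

# tau_z = 0.05 s against a 50-step window at 1 kHz: W*dt = 0.05 s.
print(memory_regime(tau_z=0.05, W=50, dt=0.001))  # -> "matched"
```

The same function makes the experimental sweep axis explicit: holding $W$ and $\Delta t$ fixed while sweeping $\tau_z$ moves the system through all three regimes.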

### 3.2 Task and cost

Given a reference trajectory $q_d(t)$, the control task is to track $q_d$ under unknown memory state $z(t)$ and unknown static parameters (notably the payload $p \in [p_{\min}, p_{\max}]$ that modifies the effective mass in $M$). The tracking error is $e = q_d - q$; the velocity error $\dot e = \dot q_d - \dot q$. Following the conventions of the parent work [[7](https://arxiv.org/html/2605.06877#bib.bib20)], we consider the extended state

$$x = (q, \dot q, e, \dot e, s) \in \mathbb{R}^{5n}, \qquad (5)$$
where $s$ collects auxiliary variables such as the sliding surface and filtered references. The cost functional is

$$\mathcal{J}(\pi) = \mathbb{E}_{p,\,z(\cdot),\,q_d(\cdot)}\!\left[\int_0^T \ell\bigl(e(t), \dot e(t)\bigr)\,dt\right], \qquad (6)$$
with $\ell$ strongly convex in its arguments (e.g., quadratic tracking error with velocity regularisation). The expectation is over the task distribution — random payload, random reference, random initial condition — and over the realisation of the memory state $z(t)$ conditional on the motion.
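A discretised instance of (6) with the quadratic stage cost mentioned as an example makes the objective concrete; the weights below are illustrative assumptions:

```python
import numpy as np

def tracking_cost(e, e_dot, dt, w_e=1.0, w_v=0.1):
    """Discretised instance of cost (6) with the quadratic stage cost
    l(e, e_dot) = w_e*|e|^2 + w_v*|e_dot|^2. Weights are illustrative.
    `e`, `e_dot` are (T, n) arrays of per-step joint errors."""
    stage = w_e * np.sum(e**2, axis=1) + w_v * np.sum(e_dot**2, axis=1)
    return dt * stage.sum()  # rectangle-rule approximation of the integral

# Constant position error of 0.1 rad on both joints over 1 s at 1 kHz:
T, n, dt = 1000, 2, 0.001
e = np.full((T, n), 0.1)
e_dot = np.zeros((T, n))
print(tracking_cost(e, e_dot, dt))  # ~0.02 = 1.0 * (2 joints * 0.01) * 1 s
```

In the full objective this per-episode quantity is then averaged over payload, reference, initial condition, and the memory-state realisation, which is what the expectation in (6) denotes.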

### 3.3 Controller template

The torque applied to the manipulator follows the parameterised computed-torque structure

$$\tau(t) = \mathrm{CT}\bigl(q, \dot q, q_d;\, K_d(t), \Lambda(t)\bigr) + \phi_{\mathrm{ff}}\bigl(q, \dot q;\, \eta(t)\bigr), \qquad (7)$$
where $\mathrm{CT}$ is a standard computed-torque law with gain matrices $K_d$ and $\Lambda$, and $\phi_{\mathrm{ff}}$ is a feed-forward compensation with weights $\eta$. The parameters $\theta_{\mathrm{ctrl}}(t) = (K_d(t), \Lambda(t), \eta(t))$ are collected into a vector $\theta_{\mathrm{ctrl}} \in \mathcal{P}$, where $\mathcal{P}$ is a compact box fixing the operating ranges of each gain and weight.
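A minimal sketch of one textbook computed-torque convention for the $\mathrm{CT}$ term in (7); the exact gain placement used by the paper is not specified at this point, so the form below is an assumption, and the toy dynamics are purely illustrative:

```python
import numpy as np

def computed_torque(M, C, G, q, q_dot, q_d, qd_dot, qd_ddot, Kd, Lam, ff=None):
    """One common computed-torque convention (an assumption, not necessarily
    the paper's exact CT law): inverse-dynamics cancellation around a
    PD-stabilised reference acceleration.

    M, C, G : callables returning M(q), C(q, q_dot), G(q).
    Kd, Lam : the (time-varying) gain matrices produced by the meta-controller.
    ff      : optional feed-forward term phi_ff, e.g. friction compensation.
    """
    e = q_d - q
    e_dot = qd_dot - q_dot
    v = qd_ddot + Kd @ e_dot + Lam @ e            # stabilised reference accel.
    tau = M(q) @ v + C(q, q_dot) @ q_dot + G(q)   # inverse-dynamics terms
    return tau if ff is None else tau + ff

# Toy 2-DOF check with decoupled unit dynamics (illustrative only):
M = lambda q: np.eye(2)
C = lambda q, qd: np.zeros((2, 2))
G = lambda q: np.zeros(2)
tau = computed_torque(M, C, G,
                      q=np.zeros(2), q_dot=np.zeros(2),
                      q_d=np.array([0.1, 0.0]), qd_dot=np.zeros(2),
                      qd_ddot=np.zeros(2),
                      Kd=10 * np.eye(2), Lam=25 * np.eye(2))
print(tau)  # pure position-error feedback here: [2.5, 0.]
```

The meta-controller of Section 4 does not change this functional form; it only supplies the arguments `Kd`, `Lam`, and the feed-forward weights at each step.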

### 3.4 Assumptions

The theoretical results of Section [5](https://arxiv.org/html/2605.06877#S5) and the experimental protocol of Section [7](https://arxiv.org/html/2605.06877#S7) rely on the following standing assumptions.

###### Assumption 3.2 (Regularity of the dynamics).

$M$, $C$, $G$ are smooth in $(q,\dot{q})$ on the operating set; $F$ is continuous in $(q,\dot{q},z)$ and Lipschitz in $z$; $\zeta$ is Lipschitz and generates a contractive flow with rate $1/H_z$.

###### Assumption 3.3 (Reference regularity).

$q_d\in C^2([0,T])$ with uniformly bounded derivatives.

###### Assumption 3.4 (Parameter compactness).

The parameter set $\mathcal{P}$ is a compact box $[K_d^{\min},K_d^{\max}]\times[\Lambda^{\min},\Lambda^{\max}]\times[\eta^{\min},\eta^{\max}]^{\dim\eta}$.

Assumptions [3.2](https://arxiv.org/html/2605.06877#S3.Thmtheorem2)–[3.4](https://arxiv.org/html/2605.06877#S3.Thmtheorem4) are inherited from [[7](https://arxiv.org/html/2605.06877#bib.bib20)] and are standard in the computed-torque literature [[17](https://arxiv.org/html/2605.06877#bib.bib1), [18](https://arxiv.org/html/2605.06877#bib.bib2)].

## 4 Meta-controller formulation

We now specify the structure of the learned meta-controller $G_\theta$ that produces the parameter trajectory $\theta_{\mathrm{ctrl}}(t)$ in ([7](https://arxiv.org/html/2605.06877#S3.E7)). The formulation extends the shielded safe-RL framework of [[7](https://arxiv.org/html/2605.06877#bib.bib20)] from action-level to parameter-level control, and specifies the inputs, outputs, and training loop of the meta-controller.

### 4.1 Context token

At each control step $t_k=k\Delta t$, the meta-controller consumes a *context token* $c(t_k)\in\mathbb{R}^{W\times d_c}$ formed by stacking $W$ past step-observations

$$c(t_k)=\bigl(o(t_{k-W+1}),o(t_{k-W+2}),\ldots,o(t_k)\bigr),\tag{8}$$

where each step-observation $o(t)$ contains

$$o(t)=\bigl(q(t),\dot{q}(t),q_d(t),\dot{q}_d(t),\hat{p}(t),\hat{\mu}(t),t/T\bigr)\in\mathbb{R}^{d_c}.\tag{9}$$

Here $\hat{p}$ is a noisy estimate of the payload, $\hat{\mu}$ is a friction-regime estimate (fixed at $0.2$ in the present setting), and $t/T$ is a phase indicator. For the experiments of Section [7](https://arxiv.org/html/2605.06877#S7), $n=2$ (two-link manipulator), giving $d_c=4n+3=11$; the total context dimension is $W\times 11$.
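To make the token layout concrete, the stacking of Eqs. (8)–(9) can be sketched as follows. This is an illustrative sketch, not the authors' code; the buffer contents (zero joint states, unit payload estimate) are placeholder assumptions.

```python
import numpy as np

# Illustrative sketch (not the authors' code) of the context token of
# Eqs. (8)-(9). Dimensions follow the paper: n = 2 joints, d_c = 4n + 3 = 11.
n, W, T, dt = 2, 20, 10.0, 0.01

def step_observation(q, qdot, qd, qd_dot, p_hat, mu_hat, t):
    """One row o(t): joints, reference, payload/friction estimates, phase t/T."""
    return np.concatenate([q, qdot, qd, qd_dot, [p_hat], [mu_hat], [t / T]])

# Placeholder buffer: zero motion, unit payload estimate, mu_hat fixed at 0.2.
buffer = [step_observation(np.zeros(n), np.zeros(n), np.zeros(n), np.zeros(n),
                           1.0, 0.2, k * dt) for k in range(W)]
c = np.stack(buffer)  # context token c(t_k), shape (W, d_c) = (20, 11)
assert c.shape == (W, 4 * n + 3)
```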

### 4.2 Window selection

The window size $W$ is the only hyperparameter of the context token that is not fixed by the dynamics. We discuss its selection explicitly because it interacts non-trivially with both the memory horizon $\tau_z$ and the meta-controller architecture.

##### Lower bound from Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12).

Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12)(iii) establishes that a windowed meta-controller can close the Markovian optimality gap only when $W\geq H_z/\Delta t$, i.e. when the window covers at least one memory horizon of the unobservable state. For the Stribeck model ([3](https://arxiv.org/html/2605.06877#S3.E3))–([4](https://arxiv.org/html/2605.06877#S3.E4)), $H_z=\tau_z$, so the condition becomes $W\geq\tau_z/\Delta t$. At the sampling rate $\Delta t=10$ ms used in Section [7](https://arxiv.org/html/2605.06877#S7), this gives $W\geq 100$ for $\tau_z=1$ s, $W\geq 200$ for $\tau_z=2$ s, and $W\geq 500$ for $\tau_z=5$ s.
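The quoted thresholds follow directly from the bound; a minimal check, assuming the stated $\Delta t=10$ ms:

```python
import math

# Minimal check of the window lower bound W >= tau_z / dt at dt = 10 ms,
# reproducing the thresholds quoted above.
dt = 0.01
bounds = {tau_z: math.ceil(tau_z / dt) for tau_z in (1.0, 2.0, 5.0)}
print(bounds)  # {1.0: 100, 2.0: 200, 5.0: 500}
```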

##### Computational trade-off.

The window size controls the cost of attention: the forward pass of a causal multi-head block on a length-$W$ sequence is $\mathcal{O}(W^2 d_{\mathrm{model}})$. Doubling $W$ quadruples the attention cost, and for practical RL training budgets (tens of thousands of environment steps per episode), windows above $W=100$ contribute significantly to training time.

##### Our choice.

We evaluate $W\in\{20,50,100\}$ in the main experiments of Section [7](https://arxiv.org/html/2605.06877#S7). The smallest value, $W=20$, is retained both for continuity with [[7](https://arxiv.org/html/2605.06877#bib.bib20)] and to expose the regime in which the theoretical lower bound is strictly violated (for $\tau_z\geq 0.2$ s). The intermediate value, $W=50$, respects the lower bound for $\tau_z\leq 0.5$ s and captures roughly a quarter of the memory horizon at $\tau_z=2$ s. The largest value, $W=100$, respects the lower bound only for $\tau_z\leq 1$ s; at $\tau_z=5$ s it covers only $20\%$ of the memory horizon, which places that regime in the "partial coverage" domain where Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12)(iii) does not guarantee closing the gap. This imperfect coverage is deliberate: it tests the robustness of the attention-based meta-controller when the theoretical condition is only partially satisfied. The window ablation in Section [7](https://arxiv.org/html/2605.06877#S7) reports the empirical effect of $W$ at each $\tau_z$.

### 4.3 Parameter-valued output

The meta-controller is a map

$$G_\theta:\mathbb{R}^{W\times d_c}\longrightarrow\mathcal{P},\tag{10}$$

parameterised by $\theta\in\mathbb{R}^{d_\theta}$, that takes a context token $c(t)$ and returns the parameter tuple $\theta_{\mathrm{ctrl}}(t)\in\mathcal{P}$. The architecture of $G_\theta$ is an attention-based feature extractor (detailed in Section [6](https://arxiv.org/html/2605.06877#S6)) followed by an MLP head that maps the last-token embedding to a raw action $a\in\mathbb{R}^{\dim\mathcal{P}}$. The raw action is then passed through a squashing function to land in $\mathcal{P}$:

$$K_d(t)=K_d^{\min}+\sigma(a_K)\odot(K_d^{\max}-K_d^{\min}),\tag{11}$$
$$\Lambda(t)=\Lambda^{\min}+\sigma(a_\Lambda)\odot(\Lambda^{\max}-\Lambda^{\min}),\tag{12}$$
$$\eta(t)=\eta^{\max}\odot\tanh(a_\eta),\tag{13}$$

where $\sigma$ denotes the sigmoid. The structure ensures $\theta_{\mathrm{ctrl}}(t)\in\mathcal{P}$ by construction, so the compactness assumption of Section [3.4](https://arxiv.org/html/2605.06877#S3.SS4) is satisfied without further projection.
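A minimal sketch of the squashing maps (11)–(13); the box bounds here are placeholder assumptions, not values from the paper:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Placeholder box bounds (assumptions for illustration, not the paper's values).
Kd_min, Kd_max = np.array([5.0, 5.0]), np.array([50.0, 50.0])
Lam_min, Lam_max = np.array([1.0, 1.0]), np.array([10.0, 10.0])
eta_max = np.array([2.0, 2.0])

def squash(a_K, a_Lam, a_eta):
    """Map an unbounded raw action onto the compact box P, Eqs. (11)-(13)."""
    Kd = Kd_min + sigmoid(a_K) * (Kd_max - Kd_min)        # Eq. (11)
    Lam = Lam_min + sigmoid(a_Lam) * (Lam_max - Lam_min)  # Eq. (12)
    eta = eta_max * np.tanh(a_eta)                        # Eq. (13)
    return Kd, Lam, eta

# Even extreme raw actions stay inside the box: no runtime projection needed.
Kd, Lam, eta = squash(np.array([100.0, -100.0]), np.zeros(2), np.array([-50.0, 50.0]))
assert np.all((Kd_min <= Kd) & (Kd <= Kd_max)) and np.all(np.abs(eta) <= eta_max)
```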

### 4.4 SAC training loop

The meta-controller is trained via Soft Actor-Critic (SAC) [[12](https://arxiv.org/html/2605.06877#bib.bib10)] on a single-step meta-MDP where:

- the *state* at macro-step $k$ is the context token $c(t_k)$;
- the *action* at macro-step $k$ is $\theta_{\mathrm{ctrl}}(t_k)=G_\theta(c(t_k))$;
- the *reward* at macro-step $k$ is the negative one-step tracking error $-\|e(t_{k+1})\|^2$, with $e(t_{k+1})$ computed by rolling out the closed-loop dynamics of Section [3](https://arxiv.org/html/2605.06877#S3) for one control period under the action $\theta_{\mathrm{ctrl}}(t_k)$;
- the episode terminates at $t_K=T$ (fixed horizon) or on numerical divergence of the dynamics.

SAC is selected because its entropy regularisation is well matched to the non-stationarity of the meta-MDP: the distribution of $z(t)$ conditional on the history changes across episodes and across payloads. Parameter-level TRPO and PPO were explored in pilot experiments and produced less stable training on this task.
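The meta-MDP interface above can be sketched as a minimal environment skeleton. This is illustrative only; `rollout_one_period` is a hypothetical stand-in for the closed-loop rollout of Section 3, here returning zeros so the sketch is self-contained.

```python
import numpy as np

# Skeleton of the single-step meta-MDP described above (illustrative, not the
# authors' code). `rollout_one_period` is a hypothetical placeholder for the
# closed-loop rollout of Section 3.
def rollout_one_period(theta_ctrl, dt):
    """Placeholder dynamics: returns tracking error, next observation, divergence flag."""
    return np.zeros(2), np.zeros(11), False

class MetaMDP:
    def __init__(self, W=20, d_c=11, T=10.0, dt=0.01):
        self.W, self.T, self.dt, self.k = W, T, dt, 0
        self.context = np.zeros((W, d_c))            # state: context token c(t_k)

    def step(self, theta_ctrl):
        e_next, o_next, diverged = rollout_one_period(theta_ctrl, self.dt)
        reward = -float(np.sum(e_next ** 2))         # negative one-step tracking error
        self.context = np.vstack([self.context[1:], o_next])  # slide the window by one
        self.k += 1
        done = diverged or self.k * self.dt >= self.T         # fixed horizon or blow-up
        return self.context, reward, done

env = MetaMDP()
state, reward, done = env.step(np.zeros(6))          # action: a parameter tuple in P
assert state.shape == (20, 11) and reward == 0.0 and not done
```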

##### Shielded admissibility.

The admissibility constraint of Section [5](https://arxiv.org/html/2605.06877#S5) is enforced via the Lagrangian reformulation

$$\min_\theta\,\mathbb{E}\left[\int_0^T\ell(e,\dot{e})\,dt\right]+\beta\,\mathbb{E}\bigl[\mathrm{dist}\bigl(G_\theta(c(t)),\,\Pi_{\mathrm{adm}}(x(t))\bigr)\bigr],\tag{14}$$

with the dual multiplier $\beta$ tuned by the standard Lagrangian update rule of [[2](https://arxiv.org/html/2605.06877#bib.bib5)]. At runtime, a projection onto $\Pi_{\mathrm{adm}}(x(t))$ is applied as a final safety net (Theorem [5.6](https://arxiv.org/html/2605.06877#S5.Thmtheorem6)(b)); at convergence of the Lagrangian, the projection is inactive (Proposition [5.8](https://arxiv.org/html/2605.06877#S5.Thmtheorem8)).

### 4.5 Summary of architectural degrees of freedom

The meta-controller thus has the following architectural parameters:

- the window $W$ (Section [4.2](https://arxiv.org/html/2605.06877#S4.SS2));
- the attention head count $K$ and per-head dimension $d_k$, jointly determining the model dimension $d_{\mathrm{model}}=K\cdot d_k$;
- the number of stacked attention layers $L$;
- the presence of a per-token feed-forward block after each attention layer;
- the width of the subsequent MLP head (fixed at $[64,64]$ throughout).

Of these, $K$ is fixed from Phase 1 INCRT as described in Section [6](https://arxiv.org/html/2605.06877#S6). The remaining four hyperparameters $(W,d_k,L,\mathrm{FFN})$ are ablated in Section [7](https://arxiv.org/html/2605.06877#S7) to characterise the contribution of each to tracking performance across the memory-horizon range $\tau_z\in\{1,2,5\}$ s.

## 5 Theoretical Framework

This section develops the four results that underpin the proposed meta-controller. Sections [5.1](https://arxiv.org/html/2605.06877#S5.SS1)–[5.3](https://arxiv.org/html/2605.06877#S5.SS3) extend the shielded safe-RL framework of the parent work [[7](https://arxiv.org/html/2605.06877#bib.bib20)] from action-level to parameter-level control. Section [5.4](https://arxiv.org/html/2605.06877#S5.SS4) establishes a quantitative lower bound on the tracking error attainable by any Markovian meta-controller, thereby motivating the use of windowed temporal attention. Section [5.5](https://arxiv.org/html/2605.06877#S5.SS5) specialises the INCRT head-count analysis of [[8](https://arxiv.org/html/2605.06877#bib.bib18)] to the temporal residual operator relevant for meta-control, and closes the theoretical loop by showing that the window size $W$ and the head count $K$ can be independently selected from task-specific quantities.

### 5.1 Admissible parameter set

Let $V_\psi:\mathbb{R}^{5n}\to\mathbb{R}_{\geq 0}$ be the learned Lyapunov function inherited from [[7](https://arxiv.org/html/2605.06877#bib.bib20)]: structured quadratic, positive-definite outside the target, and with $L_\psi$-Lipschitz gradient $\nabla V_\psi$. Fix a desired decay rate $\alpha>0$.

###### Definition 5.1 (Admissible parameter set).

For every state $x=(q,\dot{q},e,\dot{e},s)\in\mathbb{R}^{5n}$,

$$\Pi_{\mathrm{adm}}(x):=\bigl\{\theta_{\mathrm{ctrl}}\in\mathcal{P}:\dot{V}_\psi(x;\theta_{\mathrm{ctrl}})+\alpha V_\psi(x)\leq 0\bigr\},\tag{15}$$

where $\dot{V}_\psi(x;\theta_{\mathrm{ctrl}})$ denotes the instantaneous Lyapunov rate along the closed-loop trajectory induced by the computed-torque (CT) law with parameters $\theta_{\mathrm{ctrl}}=(K_d,\Lambda,\eta)$, and $\mathcal{P}$ is a compact box in $\mathbb{R}^m$ enforcing the operating ranges of each gain and feed-forward weight.

Writing $\tau=\mathrm{CT}(\theta_{\mathrm{ctrl}})+\phi_{\mathrm{ff}}(\eta)$ and using the affine-in-$\tau$ structure of the shield inequality derived in [[7](https://arxiv.org/html/2605.06877#bib.bib20), Sec. IV],

$$b(x)^\top\tau\;\leq\;c(x),\qquad b(x):=\nabla_{\dot{q}}V_\psi\cdot g(x),\qquad c(x):=-\alpha V_\psi(x)-\nabla V_\psi\cdot f(x),$$

the CT contribution expands as

$$b(x)^\top\mathrm{CT}(q,\dot{q},q_d;K_d,\Lambda)=K_d^\top A_1(x)+\Lambda^\top A_2(x)+a_0(x),\tag{16}$$
$$b(x)^\top\phi_{\mathrm{ff}}(q,\dot{q};\eta)=\eta^\top B(x)+b_0(x),\tag{17}$$

with $A_1$, $A_2$, $B$ state-dependent vectors and $a_0$, $b_0$ state-dependent scalars.

###### Lemma 5.2 (Convexity of $\Pi_{\mathrm{adm}}(x)$).

Fix $x\in\mathbb{R}^{5n}$. If the feed-forward map $\phi_{\mathrm{ff}}(q,\dot{q};\eta)$ is affine in $\eta$, i.e. $\phi_{\mathrm{ff}}(q,\dot{q};\eta)=\Phi(q,\dot{q})\,\eta$ for a state-dependent feature matrix $\Phi$, then $\Pi_{\mathrm{adm}}(x)$ is a compact convex polytope: the intersection of the box $\mathcal{P}$ with the single affine half-space

$$K_d^\top A_1(x)+\Lambda^\top A_2(x)+\eta^\top B(x)+\bigl(a_0(x)+b_0(x)\bigr)\;\leq\;c(x).\tag{18}$$

###### Proof.

The constraint defining $\Pi_{\mathrm{adm}}(x)$ is affine in $\theta_{\mathrm{ctrl}}=(K_d,\Lambda,\eta)$ by ([16](https://arxiv.org/html/2605.06877#S5.E16)) and ([17](https://arxiv.org/html/2605.06877#S5.E17)); the intersection of a half-space with the compact convex box $\mathcal{P}$ is convex. ∎
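A membership test for the admissible set of Lemma 5.2, i.e. the box $\mathcal{P}$ intersected with the half-space (18), can be sketched as follows. All numerical values for $A_1,A_2,B,a_0,b_0,c$ and the box bounds are hypothetical.

```python
import numpy as np

# Illustrative membership test for Pi_adm(x) as characterised by Lemma 5.2:
# the box P intersected with the affine half-space of Eq. (18). All numerical
# values (A1, A2, B, a0, b0, c, box bounds) are hypothetical.
def in_admissible_set(Kd, Lam, eta, box, A1, A2, B, a0, b0, c):
    lo, hi = box
    theta = np.concatenate([Kd, Lam, eta])           # theta_ctrl = (Kd, Lambda, eta)
    in_box = np.all((lo <= theta) & (theta <= hi))   # box constraint P
    half_space = Kd @ A1 + Lam @ A2 + eta @ B + a0 + b0 <= c  # Eq. (18)
    return bool(in_box and half_space)

box = (np.zeros(6), np.full(6, 10.0))
ok = in_admissible_set(np.array([1.0, 1.0]), np.array([2.0, 2.0]), np.array([0.0, 0.0]),
                       box, A1=np.array([1.0, 0.0]), A2=np.array([0.0, 1.0]),
                       B=np.zeros(2), a0=0.0, b0=0.0, c=5.0)
assert ok  # 1*1 + 2*1 = 3 <= 5 and theta lies inside the box
```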

###### Assumption 5.4 (Non-emptiness).

For every $x$ in the operating set $\mathcal{X}\subset\mathbb{R}^{5n}$, $\Pi_{\mathrm{adm}}(x)\cap\mathcal{P}\neq\emptyset$.

This is the meta-level analogue of the drift-decay condition on the control-degeneracy set $\mathcal{Z}$ in [[7](https://arxiv.org/html/2605.06877#bib.bib20)]. It asserts that at every state there exists at least one gain choice compatible with the prescribed Lyapunov decrease.

### 5.2 Stability under parameter modulation

Let $G_\theta:\mathbb{R}^{n_c}\to\mathcal{P}$ be the learned meta-controller mapping a context token $c(t)\in\mathbb{R}^{n_c}$ to a parameter tuple $\theta_{\mathrm{ctrl}}(t)$.

###### Assumption 5.5 (Bounded context velocity).

The context token $c(t)$ is piecewise-$C^1$ and satisfies $\|\dot{c}(t)\|\leq V_c$ for all $t\geq 0$.

By the chain rule,

$$\|\dot{\theta}_{\mathrm{ctrl}}(t)\|\;\leq\;L_{G_\theta}\|\dot{c}(t)\|\;\leq\;L_{G_\theta}V_c,\tag{19}$$

where $L_{G_\theta}$ is the Lipschitz constant of $G_\theta$ on a compact set containing the context trajectory.

###### Theorem 5.6 (Stability under parameter modulation).

Let $G_\theta$ be $L_{G_\theta}$-Lipschitz, and let Assumptions [5.4](https://arxiv.org/html/2605.06877#S5.Thmtheorem4) and [5.5](https://arxiv.org/html/2605.06877#S5.Thmtheorem5) hold. Suppose either

1. (a) $G_\theta(c(t))\in\Pi_{\mathrm{adm}}(x(t))$ for all $t\geq 0$, or
2. (b) a runtime projection $\theta_{\mathrm{ctrl}}(t)=\Pi_{\Pi_{\mathrm{adm}}(x(t))}\bigl(G_\theta(c(t))\bigr)$ is applied.

Then the closed-loop system satisfies $\dot{V}_\psi(x(t))+\alpha V_\psi(x(t))\leq 0$ for all $t$, hence $V_\psi(x(t))\leq V_\psi(x(0))\,e^{-\alpha t}$, and the tracking error $\|e(t)\|$ decays exponentially at rate $\alpha/2$.

###### Proof.

Under (a), the definition of $\Pi_{\mathrm{adm}}$ directly gives $\dot{V}_\psi+\alpha V_\psi\leq 0$. Under (b), the projection onto $\Pi_{\mathrm{adm}}(x(t))$ is well-defined (by Lemma [5.2](https://arxiv.org/html/2605.06877#S5.Thmtheorem2)) and feasible (by Assumption [5.4](https://arxiv.org/html/2605.06877#S5.Thmtheorem4)), and preserves the inequality by construction. In either case, $V_\psi(t)\leq V_\psi(0)\,e^{-\alpha t}$. Since $V_\psi$ is quadratic in $e$ and bounded below by $\lambda_{\min}\|e\|^2$ for some $\lambda_{\min}>0$, the tracking error decays at rate $\alpha/2$. ∎

### 5.3 Shield activation at the optimum

A distinctive finding of the 7-DOF scalability study in [[7](https://arxiv.org/html/2605.06877#bib.bib20)] was shield activation reaching $\sim 70\%$ when the policy and shield were trained as adversaries. The meta-controller formulation corrects this.

###### Proposition 5.8 (Vanishing shield activation at the optimum).

Let $\widehat{G}_\theta$ be the optimum of the constrained Lagrangian objective

$$\min_\theta\;\mathbb{E}_{\tau\sim\pi_\theta}\left[\int_0^T\ell(q,q_d)\,dt\right]\quad\text{subject to}\quad G_\theta(c(t))\in\Pi_{\mathrm{adm}}(x(t))\;\text{almost surely}.\tag{20}$$

Then the shield activation fraction at $\widehat{G}_\theta$ is zero. In a finite-training regime, the shield activation fraction is bounded by

$$P_{\mathrm{shield}}\;\leq\;C\cdot\mathbb{E}\bigl[\mathrm{dist}\bigl(G_\theta(c),\,\Pi_{\mathrm{adm}}(x)\bigr)\bigr],\tag{21}$$

which vanishes as the Lagrangian training converges.

###### Proof.

The shield fires if and only if the policy output is infeasible. Under Slater's condition, which holds by Assumption [5.4](https://arxiv.org/html/2605.06877#S5.Thmtheorem4), strong duality gives almost-sure constraint satisfaction at the optimum. In a finite-training regime, the feasibility gap is dominated by the expected distance of the policy output from the feasible set, yielding ([21](https://arxiv.org/html/2605.06877#S5.E21)); $C$ depends on the Lipschitz constants of $b(x)$ and $c(x)$ in Lemma [5.2](https://arxiv.org/html/2605.06877#S5.Thmtheorem2). ∎

### 5.4 The Markovian optimality gap

We now establish the core result motivating the use of windowed temporal attention.

###### Definition 5.10 (System with unobservable memory).

An Euler–Lagrange system has *unobservable memory* if there exists an internal state $z(t)\in\mathbb{R}^{n_z}$ with dynamics $\dot{z}=\zeta(q,\dot{q},z)$ such that:

1. (i) $z(t)$ is not directly measured by the sensor suite;
2. (ii) the steady-state value of $z$ given $(q,\dot{q})$ is not a function of $(q,\dot{q})$ alone, but depends on the history $\{q(s),\dot{q}(s):s\leq t\}$;
3. (iii) the memory horizon $H_z:=\|\partial\zeta/\partial z\|^{-1}$ is finite.

###### Example 5.11 (Stribeck friction).

The Stribeck friction model $F_s(\dot{q},z)$ with $\dot{z}=-z/\tau_z+\lambda_z\dot{q}$ satisfies Definition [5.10](https://arxiv.org/html/2605.06877#S5.Thmtheorem10) with memory horizon $H_z=\tau_z$.
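A quick numerical illustration of the example, assuming placeholder values $\tau_z=1$ s and $\lambda_z=1$: after the excitation stops, the forward-Euler iterate of $\dot{z}=-z/\tau_z+\lambda_z\dot{q}$ decays as $e^{-t/\tau_z}$, which is what makes $\tau_z$ the memory horizon.

```python
import numpy as np

# Numerical illustration of Example 5.11 with placeholder values tau_z = 1 s and
# lambda_z = 1: integrate zdot = -z/tau_z + lambda_z*qdot by forward Euler,
# drive the joint for one second, then release and watch z forget its past.
tau_z, lam_z, dt = 1.0, 1.0, 1e-3
z, peak = 0.0, 0.0
for k in range(int(5 * tau_z / dt)):
    qdot = 1.0 if k * dt < 1.0 else 0.0    # excitation on for the first second only
    z += dt * (-z / tau_z + lam_z * qdot)  # forward-Euler step of the memory state
    peak = max(peak, z)

# Four time constants after release, z has decayed to about exp(-4) of its peak,
# confirming the finite memory horizon H_z = tau_z.
ratio = z / peak
assert abs(ratio - np.exp(-4)) < 1e-3
```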

Denote by $\theta^\star_{\mathrm{ctrl}}(t)$ the time-optimal parameter trajectory, i.e. the one that minimises expected tracking error over a realisation of the task distribution, and by $\ell^\star$ the associated expected cost. For a meta-controller $G$, let $\ell(G)$ denote its expected cost.

The next result quantifies, under regularity assumptions on the cost, the price paid by a Markovian meta-controller that ignores the unobservable memory. The result is stated in *informal* form: the hypotheses express the regularity required for the argument to go through and are not minimal in the structural sense. A more general formulation, removing the strong-convexity assumption or weakening the observability assumption on $z(t)$, would require substantially more machinery than is developed here and is left to future work.

###### Proposition 5.12 (Markovian optimality gap, informal).

Let the system satisfy Definition [5.10](https://arxiv.org/html/2605.06877#S5.Thmtheorem10) with memory horizon $H_z$, and suppose the following regularity conditions hold:

1. (R1) the instantaneous cost $\ell(\theta_{\mathrm{ctrl}};q,\dot{q},z)$ is twice continuously differentiable and strongly convex in $\theta_{\mathrm{ctrl}}$ with modulus $\mu>0$;
2. (R2) the sensitivity $\kappa=\|\partial^2\ell/\partial z\,\partial\theta_{\mathrm{ctrl}}\|$ is bounded and positive on a neighbourhood of the reference trajectory;
3. (R3) the conditional variance $\sigma_z^2=\mathbb{E}[\mathrm{Var}(z(t)\mid q(t),\dot{q}(t))]$ is finite and strictly positive.

Then:

1. (i) The time-optimal parameter trajectory depends on the full history, $\theta^\star_{\mathrm{ctrl}}(t)=\theta^\star_{\mathrm{ctrl}}\bigl(\{q(s),\dot{q}(s),z(s):s\leq t\}\bigr)$ (22), and is not a function of $(q(t),\dot{q}(t))$ alone.
2. (ii) Any Markovian meta-controller $G^{\mathrm{Mk}}:(q,\dot{q})\mapsto\mathcal{P}$ incurs an expected excess cost bounded below by $\mathbb{E}[\ell(G^{\mathrm{Mk}})]-\ell^\star\geq c_1\,\sigma_z^2$ (23), where $\sigma_z^2=\mathbb{E}[\mathrm{Var}(z(t)\mid q(t),\dot{q}(t))]$ is the conditional variance of $z$ given the current state, and $c_1=\mu\,\kappa^2/2$ with $\kappa$ the sensitivity of the cost to $z$.
3. (iii) A windowed meta-controller $G^W:(q,\dot{q},\text{history of length }W)\mapsto\mathcal{P}$ with $W\geq H_z$ and sufficient representational capacity can reduce the excess cost to $o(\sigma_z^2)$.

###### Proof sketch.

Part (i) follows from the hidden-state identifiability principle for systems with unobservable memory: since $z(t)$ enters the optimal control law through the state equation and $z(t)$ is not $\sigma(q,\dot{q})$-measurable by Definition [5.10](https://arxiv.org/html/2605.06877#S5.Thmtheorem10)(ii), neither is the optimiser. Part (ii) is an application of the information-bottleneck argument: the output of $G^{\mathrm{Mk}}$ is $\sigma(q,\dot{q})$-measurable, which is strictly coarser than $\sigma(q,\dot{q},z)$. Strong convexity of the cost with modulus $\mu$, combined with the conditional-variance identity $\mathrm{Var}(\theta^\star\mid q,\dot{q})\geq\kappa^2\sigma_z^2$ (where $\kappa=\|\partial\theta^\star/\partial z\|$), yields ([23](https://arxiv.org/html/2605.06877#S5.E23)). Part (iii) follows because the joint process $\{(q(s),\dot{q}(s)):s\in[t-W,t]\}$ is sufficient for $z(t)$ whenever $W\geq H_z$: in particular, the identifiability map $\{q(s),\dot{q}(s)\}_{s\in[t-W,t]}\mapsto z(t)$ is measurable, and a universal approximator with attention over the window can implement it. The full argument is given in Appendix [A.1](https://arxiv.org/html/2605.06877#A1.SS1). ∎

###### Corollary 5.14 (Empirical implication).

Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12) predicts a regime-dependent ordering of meta-controllers: as $\tau_z\to 0$, a memoryless policy is optimal ($\sigma_z^2\to 0$, so the Markovian gap vanishes); at moderate $\tau_z$, memoryless policies degrade at rate $c_1\,\sigma_z^2$; and for $\tau_z\lesssim W\cdot\Delta t$, windowed attention recovers optimal performance. The crossover point is operationally testable and is reported in Section [7](https://arxiv.org/html/2605.06877#S7).

### 5.5 Temporal head-count bound

Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12)(iii) establishes that a windowed meta-controller with sufficient capacity closes the Markovian gap. What remains is to quantify "sufficient capacity".

The INCRT framework of [[8](https://arxiv.org/html/2605.06877#bib.bib18)] provides a principled determination of the required number of attention heads through a residual-matrix construction. We briefly recall the relevant elements before stating the adaptation to the temporal setting.

##### INCRT residual operator.

Given a training signal represented as a matrix $Y\in\mathbb{R}^{W\times d_{\mathrm{task}}}$, INCRT constructs a residual operator $A_{\mathrm{res}}\in\mathbb{R}^{d\times d}$ whose spectrum determines the head count required to represent $Y$ [[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 7]. Under three structural properties of $A_{\mathrm{res}}$ (symmetry, rank-one deflation, and incoherence of the deflated directions), Theorem 7 of the INCRT paper gives a compressed-sensing upper bound on the head count necessary to reconstruct the signal up to an $\epsilon$-residual.

##### Temporal adaptation.

In the meta-control setting, the relevant residual operator acts on the *temporal* domain of the history window rather than on the feature domain. Specifically, define

$$A_{\mathrm{res}}^{\mathrm{temp}}=\mathbb{E}\left[\nabla_{\mathrm{hist}}z(t)\,\nabla_{\mathrm{hist}}z(t)^\top\right]\in\mathbb{R}^{W\times W},\tag{25}$$

the autocovariance matrix of the gradient of $z(t)$ with respect to the history of lookbacks.
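A sketch of how $A_{\mathrm{res}}^{\mathrm{temp}}$ and its stable rank might be estimated from sampled history-gradients; the exponential gradient profile is an assumption motivated by the linear memory state of Example 5.11, not the paper's estimator.

```python
import numpy as np

# Sketch of estimating A_res^temp of Eq. (25) and its stable rank
# r_eff = ||A||_F^2 / ||A||_2^2. The gradient samples are synthetic stand-ins:
# for the linear memory state of Example 5.11, dz(t)/dqdot at lag j*dt is
# proportional to exp(-j*dt/tau_z); random per-trajectory scales mimic the
# task distribution. This is not the paper's estimator.
rng = np.random.default_rng(0)
W, dt, tau_z, n_samples = 50, 0.01, 0.2, 2000

lags = np.arange(W) * dt
grads = rng.lognormal(size=(n_samples, 1)) * np.exp(-lags / tau_z)  # (n_samples, W)

A_temp = grads.T @ grads / n_samples         # empirical E[g g^T], shape (W, W)
sv = np.linalg.svd(A_temp, compute_uv=False)
r_eff = float((sv ** 2).sum() / sv[0] ** 2)  # stable rank
# A single exponential mode gives a rank-one operator, so r_eff is close to 1;
# richer excitation (more separable temporal modes) raises it.
assert r_eff < 1.0 + 1e-6
```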

The next result imports a compressed-sensing bound from the static INCRT framework to the temporal operator. It is stated in *informal* form: the three structural hypotheses (symmetry, rank-one deflation, and incoherence) are exactly the hypotheses of [[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 7], and are verified for $A_{\mathrm{res}}^{\mathrm{temp}}$ under the regularity conditions (R1)–(R3) and the mild additional assumption that the autocorrelation of $\dot{q}$ does not vanish identically in the window. A full proof with minimal hypotheses is beyond the scope of the present paper.

###### Proposition 5.15 (Temporal head-count bound, informal).

Under the regularity conditions of Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12) and the assumption that the autocorrelation of $\dot{q}$ on the window interval $[-W\Delta t,\,0]$ is strictly positive at lag zero and does not vanish identically at any other lag, the operator $A_{\mathrm{res}}^{\mathrm{temp}}$ defined by ([25](https://arxiv.org/html/2605.06877#S5.E25)) satisfies the three structural properties (symmetry, rank-one deflation, and incoherence) required by [[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 7]. Consequently, the optimal head count for the temporal attention operator is upper-bounded by

$$K^\star(\tau_z)\;\leq\;C\cdot r_{\mathrm{eff}}\bigl(A_{\mathrm{res}}^{\mathrm{temp}}(\tau_z)\bigr),\tag{26}$$

where $r_{\mathrm{eff}}(\cdot)$ is the effective rank (stable rank) and $C$ is a universal constant inherited from [[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 7].

###### Proof sketch.

Symmetry of $A_{\mathrm{res}}^{\mathrm{temp}}$ is immediate from its definition as a covariance matrix. Rank-one deflation follows from the fact that the gradient of $z(t)$ with respect to the history is dominated by a single scalar multiplier at each lag (the coefficient $\lambda_z$ of the $\dot{q}$ term in $\dot{z}$). Incoherence of the deflated directions follows from the non-degenerate autocorrelation of $\dot{q}$ away from zero lag. The three properties are verified in detail in Appendix [A.3](https://arxiv.org/html/2605.06877#A1.SS3), and the compressed-sensing bound of [[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 7] applies directly. ∎

###### Corollary 5.16 (Non-monotonic $K^\star(\tau_z)$).

The effective rank $r_{\mathrm{eff}}(A_{\mathrm{res}}^{\mathrm{temp}}(\tau_z))$ is non-monotonic in $\tau_z$: it increases with $\tau_z$ in the regime $\tau_z\lesssim W\cdot\Delta t$ (more of the memory horizon fits inside the window, so more temporal modes can be separated) and decreases with $\tau_z$ in the regime $\tau_z\gtrsim W\cdot\Delta t$ (the window saturates, and only low-frequency modes remain distinguishable). The maximum of $r_{\mathrm{eff}}$ occurs near $\tau_z\approx W\cdot\Delta t/2$. The proof is given in Appendix [A.4](https://arxiv.org/html/2605.06877#A1.SS4).

##### Design implication.

Propositions [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12) and [5.15](https://arxiv.org/html/2605.06877#S5.Thmtheorem15) give two independent ingredients for selecting the meta-controller architecture:

- the *window size* $W$ is selected as $W\geq H_z/\Delta t$, so that the window covers the memory horizon (Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12)(iii));
- the *head count* $K$ is selected as $K\leq C\cdot r_{\mathrm{eff}}(A_{\mathrm{res}}^{\mathrm{temp}})$, which is non-monotonic in $\tau_z$ with a peak near $\tau_z\approx W\cdot\Delta t/2$ (Proposition [5.15](https://arxiv.org/html/2605.06877#S5.Thmtheorem15) and Corollary [5.16](https://arxiv.org/html/2605.06877#S5.Thmtheorem16)).

The two selections are decoupled: $W$ is determined by the memory horizon of the system, $K$ by the rank structure of the temporal residual operator. Section [6](https://arxiv.org/html/2605.06877#S6) operationalises the selection of $K$ via the Phase 1 INCRT procedure, and Section [7](https://arxiv.org/html/2605.06877#S7) validates the decoupling empirically across a grid of $(K,d_k,L,W)$ combinations.

### 5.6 Summary of the theoretical roadmap

The four results established in this section assemble into a single design prescription. Lemma [5.2](https://arxiv.org/html/2605.06877#S5.Thmtheorem2) ensures that the admissible parameter set is a convex polytope and that the runtime projection of Theorem [5.6](https://arxiv.org/html/2605.06877#S5.Thmtheorem6)(b) is well-defined. Theorem [5.6](https://arxiv.org/html/2605.06877#S5.Thmtheorem6) guarantees closed-loop stability under any Lipschitz meta-controller whose output lies in (or is projected onto) the admissible set. Proposition [5.8](https://arxiv.org/html/2605.06877#S5.Thmtheorem8) predicts that shield activation vanishes at the optimum of the constrained Lagrangian, resolving the $\sim 70\%$ activation anomaly of the parent framework. Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12) establishes the lower bound $c_1\sigma_z^2$ on the Markovian-policy error and the upper bound $o(\sigma_z^2)$ on the windowed-policy error, with $\sigma_z^2$ growing monotonically in $\tau_z$. Proposition [5.15](https://arxiv.org/html/2605.06877#S5.Thmtheorem15) provides the head-count bound for the temporal attention operator, and Corollary [5.16](https://arxiv.org/html/2605.06877#S5.Thmtheorem16) predicts a non-monotonic $K^\star(\tau_z)$ with a peak near the window-to-horizon matched regime. The experimental section tests each prediction in turn.

## 6 INCRT for architecture determination

Proposition [5.15](https://arxiv.org/html/2605.06877#S5.Thmtheorem15) leaves unanswered a practical question: given a system with memory horizon $\tau_z$, how is the head-count bound $K^\star(\tau_z)$ computed so as to feed into the design of the Phase 2 meta-controller? The answer developed in this section decomposes into two stages, rendered schematically in Figure [1](https://arxiv.org/html/2605.06877#S6.F1). In the first stage, run offline and on a CPU, the trajectory simulator produces short rollouts at the target horizon; a temporal residual operator is assembled from these rollouts, its effective rank is estimated by the incremental rank-tracking procedure of [[8](https://arxiv.org/html/2605.06877#bib.bib18)], and a single integer $K^\star$ is returned. In the second stage, run on a GPU for the reinforcement-learning training, the integer $K^\star$ is used to bound a short grid search over the head count of the attention block; the best head count on the grid is retained and the corresponding policy is trained to convergence under the shielded admissibility constraint of Section [4.4](https://arxiv.org/html/2605.06877#S4.SS4). The first stage is the architecture-selection phase and takes seconds to minutes; the second stage is the policy-training phase and takes hours. Architecture selection and policy optimisation are therefore decoupled, and the end product is a meta-controller with no runtime growing or pruning machinery. The separation follows the search-then-retrain pattern common in neural architecture search [[13](https://arxiv.org/html/2605.06877#bib.bib11), [16](https://arxiv.org/html/2605.06877#bib.bib12), [15](https://arxiv.org/html/2605.06877#bib.bib14)].

![Refer to caption](https://arxiv.org/html/2605.06877v1/x1.png)

Figure 1: Two-phase architecture-selection pipeline. Phase 1 returns a head-count bound $K^\star$ from an offline analysis of the memory structure; Phase 2 runs a constrained grid search over $K$ and trains the selected policy.

### 6.1 The INCRT framework

INCRT [[8](https://arxiv.org/html/2605.06877#bib.bib18)] is an incremental construction of a sparse signal decomposition based on a residual matrix that evolves across growth and pruning iterations. We summarise only the elements needed in the sequel; a full development is available in the reference.

##### Residual matrix.

Given a training signal $Y \in \mathbb{R}^{n \times d}$ and a candidate rank-$K$ decomposition $Y \approx \sum_{k=1}^{K} u_k v_k^\top$ with $u_k \in \mathbb{R}^n$, $v_k \in \mathbb{R}^d$, the *residual matrix* is

$$A_{\mathrm{res}}^{(K)} = YY^\top - \sum_{k=1}^{K} u_k v_k^\top \cdot \bigl(u_k v_k^\top\bigr)^\top. \qquad (27)$$

The effective rank $r_{\mathrm{eff}}(A_{\mathrm{res}}^{(K)})$ measures the residual spectral mass not yet explained by the first $K$ components.
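As a concrete illustration, the residual matrix of Eq. (27) and the effective-rank functional can be sketched in a few lines of numpy. The sketch uses the definition $r_{\mathrm{eff}}(A) = \mathrm{tr}(A)^2 / \|A\|_F^2$ reported with Table 2; the rank-one factors `U`, `V` are assumed to be given as lists of vectors.

```python
import numpy as np

def effective_rank(A):
    # r_eff(A) = tr(A)^2 / ||A||_F^2, the functional reported with Table 2.
    return np.trace(A) ** 2 / np.linalg.norm(A, "fro") ** 2

def residual_matrix(Y, U, V):
    # Eq. (27): A_res^(K) = Y Y^T - sum_k (u_k v_k^T)(u_k v_k^T)^T.
    A = Y @ Y.T
    for u, v in zip(U, V):
        C = np.outer(u, v)
        A = A - C @ C.T
    return A
```

Sanity checks match the intuition behind the definition: for an exactly rank-one signal $Y = u v^\top$, the $K = 1$ residual vanishes, and $r_{\mathrm{eff}}$ of an identity matrix equals its dimension (a perfectly flat spectrum).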

##### Growth signal.

At each iteration $t$, INCRT proposes a candidate new direction $(u_{K+1}, v_{K+1})$ obtained from the leading eigenvector of the current residual. The *growth signal* quantifies the reduction in effective rank that would result from adding the candidate:

$$g_{K+1}^{(t)} = r_{\mathrm{eff}}\bigl(A_{\mathrm{res}}^{(K)}\bigr) - r_{\mathrm{eff}}\bigl(A_{\mathrm{res}}^{(K+1)}\bigr). \qquad (28)$$

When $g_{K+1}^{(t)}$ exceeds a prescribed threshold $\gamma_{\mathrm{add}}$, the direction is accepted and $K \leftarrow K+1$.

##### Pruning.

Symmetrically, if the contribution of a previously admitted direction falls below a threshold $\gamma_{\mathrm{prune}}$, the direction is removed and $K \leftarrow K-1$. Growth and pruning are complementary operations on a single homeostatic loop.

##### Bidirectional gate.

The gate mechanism of [[8](https://arxiv.org/html/2605.06877#bib.bib18), Sec. 4] ensures numerical stability of the growth–pruning alternation: directions that oscillate across the growth and pruning thresholds are stabilised by an averaging filter whose time constant is governed by the NTK-alignment rate at the current $K$.

##### Homeostatic convergence.

[[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 6] establishes that the joint dynamics converge to a fixed-point rank $K^\star$ characterised by the effective rank of $A_{\mathrm{res}}$: at $K = K^\star$, no candidate growth signal exceeds $\gamma_{\mathrm{add}}$ and no admitted direction's contribution falls below $\gamma_{\mathrm{prune}}$.

##### Compressed-sensing bound.

[[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 7] bounds $K^\star$ from above by a compressed-sensing inequality applied to $A_{\mathrm{res}}$ under three structural hypotheses: symmetry, rank-one deflation, and incoherence of the deflated directions. When these hypotheses hold,

$$K^\star \;\leq\; C \cdot r_{\mathrm{eff}}\bigl(A_{\mathrm{res}}\bigr), \qquad (29)$$

where $C$ is a universal constant. This is the inequality we exploit in the temporal adaptation below.

### 6.2 Temporal residual operator

The residual matrix of INCRT is constructed over the *feature domain* of the training signal $Y$. For meta-control with unobservable memory, however, the quantity to be recovered is not a feature-space latent but a function of the *temporal history* of observations: specifically, $z(t)$ as a map of the past $W$ observation steps. The natural adaptation is therefore to construct a residual operator over the time domain rather than the feature domain.

##### Design choices.

Three candidate constructions were considered:

(v1) *Feature-space residual.* Direct application of the standard INCRT construction with $Y$ taken as the observation feature matrix over a batch of histories. This yields $K^\star$ approximately constant (between 3 and 5) across the range $\tau_z \in [0.2, 5]$ s: the feature dimension does not encode the memory structure of the task.

(v2) *Multi-horizon prediction target.* $Y$ is replaced with a stack of future-horizon predictions of the observation, indexed by horizon. This exposes some temporal structure, but $K^\star(\tau_z)$ remains flat at $\approx 5$ throughout the range: the future-horizon stack dilutes the memory dependence.

(v3) *Temporal residual operator.* $Y$ is replaced with a history matrix indexed by temporal lag. The residual operator becomes $A_{\mathrm{res}}^{\mathrm{temp}} \in \mathbb{R}^{W \times W}$, and $K^\star(\tau_z)$ varies non-monotonically with $\tau_z$, as predicted by Corollary [5.16](https://arxiv.org/html/2605.06877#S5.Thmtheorem16).

Only the third construction, developed in detail below, exhibits the expected regime dependence; the first two are not pursued further.

##### Construction of $A_{\mathrm{res}}^{\mathrm{temp}}$.

Let $\mathcal{H}_W(t) = (q(t), q(t-1), \ldots, q(t-W+1), \dot{q}(t), \ldots, \dot{q}(t-W+1))$ be the history window at time $t$. Define the temporal gradient of the memory state:

$$\nabla_{\mathrm{hist}}\, z(t) \;:=\; \frac{\partial z(t)}{\partial \mathcal{H}_W(t)} \;\in\; \mathbb{R}^{W}. \qquad (30)$$

The temporal residual operator is the auto-covariance of this gradient, evaluated over a training set of trajectories sampled at the target $\tau_z$:

$$A_{\mathrm{res}}^{\mathrm{temp}}(\tau_z) \;=\; \mathbb{E}\left[\nabla_{\mathrm{hist}}\, z(t)\, \nabla_{\mathrm{hist}}\, z(t)^\top\right]. \qquad (31)$$

By Proposition [5.15](https://arxiv.org/html/2605.06877#S5.Thmtheorem15) the three structural hypotheses of [[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 7] are satisfied, so the INCRT growth and pruning dynamics apply to $A_{\mathrm{res}}^{\mathrm{temp}}$ without modification, and the resulting $K^\star(\tau_z)$ is compressed-sensing-bounded by the effective rank of the operator.

### 6.3 Phase 1 / Phase 2 separation

We now describe the architectural role of INCRT in the present paper, which is different from, and more conservative than, its use in the original JMLR work.

##### Phase 1: architecture search.

The INCRT dynamics run on the surrogate temporal-representation task defined by $A_{\mathrm{res}}^{\mathrm{temp}}$. The task is supervised and computationally inexpensive compared to SAC training: it requires only the ability to simulate $z(t)$ from the system dynamics and to sample histories $\mathcal{H}_W(t)$. The output of Phase 1 is the converged head count $K^\star(\tau_z)$ for each $\tau_z$ of interest.

##### Phase 2: fixed-architecture policy training.

The Phase 2 meta-controller is a standard attention-based feature extractor with $K^\star$ heads, *fixed* for the duration of the SAC training. There is no growth or pruning during Phase 2: $K^\star$ is an architectural hyperparameter, not a dynamic variable.

##### Why search-then-retrain.

Three reasons motivate this separation:

1. (i) *Optimisation disentanglement.* Running INCRT dynamics concurrently with SAC training would require the SAC critic to track a non-stationary policy-parameter dimension, a known source of training instabilities in actor-critic RL even when the architecture change is slow.
2. (ii) *Deployment.* A fixed-architecture Phase 2 yields a meta-controller with a specified parameter count, latency profile, and memory footprint, all determined at design time.
3. (iii) *Empirical testability.* The surrogate-optimal $K^\star$ may or may not coincide with the RL-optimal head count in Phase 2; separating the two stages makes this difference empirically tractable (Section [7](https://arxiv.org/html/2605.06877#S7)).

This pattern aligns with the practice of differentiable neural architecture search [[13](https://arxiv.org/html/2605.06877#bib.bib11), [16](https://arxiv.org/html/2605.06877#bib.bib12)]: a search stage on a surrogate task produces a discrete architecture, which is then trained from scratch on the target task.

##### Scope note on INCRT usage.

The present paper uses INCRT exclusively in its architecture-search role. The broader homeostatic-dynamics interpretation of [[8](https://arxiv.org/html/2605.06877#bib.bib18)], including its connection to representational completeness in deep architectures [[9](https://arxiv.org/html/2605.06877#bib.bib19)], is not invoked here. An extension that runs INCRT directly inside Phase 2 (effectively a dynamic-architecture meta-controller) is a natural follow-up indicated in Section [8](https://arxiv.org/html/2605.06877#S8).

### 6.4 Phase 1 surrogate protocol

Algorithm [1](https://arxiv.org/html/2605.06877#alg1) summarises the Phase 1 procedure used in the present work.

Algorithm 1: Phase 1 INCRT architecture search for temporal meta-control

Require: memory horizon $\tau_z$; window $W$; growth/pruning thresholds $\gamma_{\mathrm{add}}, \gamma_{\mathrm{prune}}$; trajectory simulator; sample size $N$.
Ensure: converged head count $K^\star(\tau_z)$.

1. Sample $N$ trajectories of the Euler–Lagrange system at horizon $\tau_z$; compute $\mathcal{H}_W(t)$ and $z(t)$ along each trajectory.
2. Form the temporal residual operator $\widehat{A}_{\mathrm{res}}^{\mathrm{temp}}(\tau_z)$ as the empirical auto-covariance of $\nabla_{\mathrm{hist}} z$ over the $N$ samples.
3. Initialise $K \leftarrow 1$; $U \leftarrow (u_1)$ with $u_1$ the leading eigenvector of $\widehat{A}_{\mathrm{res}}^{\mathrm{temp}}$; residual $R \leftarrow \widehat{A}_{\mathrm{res}}^{\mathrm{temp}}$.
4. repeat
5. &nbsp;&nbsp;*Growth step.* Propose $u_{K+1}$ as the leading eigenvector of $R$; compute the growth signal $g \leftarrow r_{\mathrm{eff}}(R) - r_{\mathrm{eff}}(R - u_{K+1} u_{K+1}^\top)$.
6. &nbsp;&nbsp;if $g > \gamma_{\mathrm{add}}$ then $K \leftarrow K+1$; $U \leftarrow (U, u_{K+1})$; $R \leftarrow R - u_K u_K^\top$. end if
7. &nbsp;&nbsp;*Pruning step.* For each $k \in \{1, \ldots, K\}$, compute $p_k \leftarrow u_k^\top R u_k / \|R\|_F$.
8. &nbsp;&nbsp;if $\min_k p_k < \gamma_{\mathrm{prune}}$ then drop the least-contributing direction from $U$; $K \leftarrow K-1$; recompute $R$. end if
9. &nbsp;&nbsp;*Gate update.* Apply the bidirectional gate of [[8](https://arxiv.org/html/2605.06877#bib.bib18), Sec. 4] to smooth the growth/pruning decisions against oscillation.
10. until $K$ remains unchanged for $n_{\mathrm{stable}}$ iterations (homeostatic convergence of [[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 6])
11. return $K^\star(\tau_z) \leftarrow K$.

##### Implementation notes.

In the experiments of Section [7](https://arxiv.org/html/2605.06877#S7), $W = 20$, $N = 2048$, $\gamma_{\mathrm{add}} = 0.05$, $\gamma_{\mathrm{prune}} = 0.01$, and $n_{\mathrm{stable}} = 20$. The trajectory simulator is the same Euler–Lagrange model used in Phase 2, with a perturbed Stribeck friction to generate the $z(t)$ training signal (protocol in Appendix [B](https://arxiv.org/html/2605.06877#A2)).
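The growth loop of Algorithm 1 can be rendered as a short numpy sketch. It is a deliberate simplification: the pruning step and the bidirectional gate of [8, Sec. 4] are omitted, directions are scaled by the square root of their eigenvalue so that deflation removes the corresponding spectral mass, and the stopping rule reduces to the growth signal falling below $\gamma_{\mathrm{add}}$.

```python
import numpy as np

def r_eff(A):
    # Effective rank tr(A)^2 / ||A||_F^2, guarded at the zero matrix.
    f2 = np.linalg.norm(A, "fro") ** 2
    return 0.0 if f2 < 1e-12 else np.trace(A) ** 2 / f2

def incrt_head_count(A, gamma_add=0.05, max_iter=50):
    # Growth-only sketch of Algorithm 1 on a symmetric PSD operator A.
    def leading_dir(R):
        vals, vecs = np.linalg.eigh(R)
        return np.sqrt(max(vals[-1], 0.0)) * vecs[:, -1]

    u = leading_dir(A)
    R = A - np.outer(u, u)          # deflate the first admitted direction
    K = 1
    for _ in range(max_iter):
        u = leading_dir(R)
        g = r_eff(R) - r_eff(R - np.outer(u, u))   # growth signal, Eq. (28)
        if g <= gamma_add:          # homeostatic stop: no growth left
            break
        K += 1
        R = R - np.outer(u, u)
    return K
```

On a spectrum with a few comparable eigenvalues the loop admits one direction per dominant mode and stops once the residual spectrum is too flat, or too small, to move the effective rank.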

### 6.5 Phase 1 results: $K^\star(\tau_z)$

Algorithm [1](https://arxiv.org/html/2605.06877#alg1) was run for five values of the memory horizon, $\tau_z \in \{1.0, 2.0, 3.0, 4.0, 5.0\}$ s. Table [2](https://arxiv.org/html/2605.06877#S6.T2) reports the converged head count $K^\star$ together with the effective rank $r_{\mathrm{eff}}$ of the temporal residual operator at convergence.

Table 2: Phase 1 INCRT results. $K^\star$ is the converged head count from Algorithm [1](https://arxiv.org/html/2605.06877#alg1); $r_{\mathrm{eff}} = \mathrm{tr}(A_{\mathrm{res}}^{\mathrm{temp}})^2 / \|A_{\mathrm{res}}^{\mathrm{temp}}\|_F^2$; relative lag means $\tau_z / (W \Delta t)$.

##### Non-monotonic pattern.

The observed $K^\star(\tau_z)$ is non-monotonic, with a sharp peak at $\tau_z = 2$ s, corresponding to the matched regime $\tau_z \approx W \Delta t \cdot 10$. This matches the prediction of Corollary [5.16](https://arxiv.org/html/2605.06877#S5.Thmtheorem16) scaled by the nonlinearity factor introduced by the Stribeck envelope (Appendix [A.4](https://arxiv.org/html/2605.06877#A1.SS4)). At $\tau_z = 1$ s (relative lag 5), the window contains a strictly shorter portion of a rapidly decaying memory, and only $K^\star = 8$ principal directions are identified. At $\tau_z \geq 3$ s (relative lag $\geq 15$), the leading temporal mode dominates the spectrum and $K^\star$ plateaus below the peak.

##### Passing $K^\star$ to Phase 2.

The values in Table [2](https://arxiv.org/html/2605.06877#S6.T2) define the head counts of the Phase 2 meta-controller at the corresponding memory horizons. In combination with the window $W = 20$ (chosen to satisfy the $W \geq H_z / \Delta t$ condition of Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12)(iii) at the worst case $\tau_z = 5$ s, albeit only marginally), $K^\star$ specifies the attention partition exactly. The remaining design degrees of freedom (per-head dimension $d_k$, number of stacked layers $L$, presence of a feed-forward non-linearity, and the window size $W$ itself) are not fixed by the Phase 1 procedure; they are the subject of the ablation in Section [7](https://arxiv.org/html/2605.06877#S7).

##### Relation to the algebraic framework of [[9](https://arxiv.org/html/2605.06877#bib.bib19)].

The scalar $K^\star$ returned by Algorithm [1](https://arxiv.org/html/2605.06877#alg1) quantifies the representational capacity required in the attention mechanism for the given temporal task; it does not specify the non-linear capacity required downstream of the attention. The latter is a separate quantity, called *non-linear complexity* in [[9](https://arxiv.org/html/2605.06877#bib.bib19), §8], and is determined in Phase 2 via the presence (or absence) of a per-token feed-forward layer and its width. The experiments of Section [7](https://arxiv.org/html/2605.06877#S7) therefore include an explicit $(L, \mathrm{FFN}, W)$ ablation on top of the $K^\star$ choice from Phase 1.

## 7 Experimental results

The empirical study has two purposes: to measure the tracking-error reduction delivered by the proposed meta-controller relative to a Transformer baseline at three memory regimes, and to probe the architectural choices on which that reduction depends. The study is conducted on the 2-DOF manipulator with Stribeck friction introduced in Section [3.1](https://arxiv.org/html/2605.06877#S3.SS1); the payload is sampled uniformly from $[0, 1.5]$ kg at every episode reset. The reference trajectory is a sinusoid in each joint, with periods incommensurate within the episode horizon of $T = 5$ s.

### 7.1 Protocol

All meta-controllers are trained with SAC [[12](https://arxiv.org/html/2605.06877#bib.bib10)] for 50,000 environment steps under the shielded admissibility constraint described in Section [4.4](https://arxiv.org/html/2605.06877#S4.SS4). The simulator integrates the Euler–Lagrange dynamics with a Runge–Kutta step of $\Delta t = 10$ ms, giving $H = 500$ control steps per episode. Evaluation at convergence is performed at five payload levels $p \in \{0, 0.375, 0.75, 1.125, 1.5\}$ kg, with 20 rollouts per level. Non-architectural hyperparameters are held constant across all architectures and regimes and are documented in Appendix [B](https://arxiv.org/html/2605.06877#A2). The training code and per-seed result files are released with the paper.

The baseline against which each architecture is compared is a computed-torque controller with fixed gains $K_d = 30$, $\Lambda = 5$ and no meta-controller. The metric of primary interest is the relative reduction of tracking RMSE (root-mean-square error), $\Delta\% := (\mathrm{RMSE}_{\mathrm{meta}} - \mathrm{RMSE}_{\mathrm{base}}) / \mathrm{RMSE}_{\mathrm{base}}$; a negative value indicates that the learned meta-controller outperforms the fixed-gain reference. The baseline RMSE is $0.132 \pm 0.001$ at every memory regime; it is insensitive to $\tau_z$ because the fixed-gain law does not exploit the memory.
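For concreteness, the metric reduces to two small helpers (pure Python; the input to the RMSE is the per-step tracking-error sequence of one evaluation rollout):

```python
import math

def tracking_rmse(errors):
    # Root-mean-square tracking error over one evaluation rollout.
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def delta_pct(rmse_meta, rmse_base):
    # Delta% := (RMSE_meta - RMSE_base) / RMSE_base, in percent;
    # negative values mean the meta-controller beats the fixed-gain law.
    return 100.0 * (rmse_meta - rmse_base) / rmse_base
```

A meta-controller that halves the baseline RMSE thus reports $\Delta\% = -50$.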

Two architectures are compared. The first is INCRT-1L, a single-layer pre-LN self-attention block with a residual connection, no feed-forward sub-layer, and head count $K$ drawn from the interval $[\lceil K^\star/2 \rceil, K^\star]$ suggested by the Phase 1 analysis of Section [6.5](https://arxiv.org/html/2605.06877#S6.SS5); the per-head dimension is $d_k = 16$ throughout. The window $W$ is selected per regime from the ablation reported in Section [7.4](https://arxiv.org/html/2605.06877#S7.SS4). The second is a Transformer baseline in the configuration inherited from the companion paper [[7](https://arxiv.org/html/2605.06877#bib.bib20)]: two stacked TransformerEncoderLayer blocks with GELU feed-forward, $K = 4$ heads, $d_k = 16$, and $W = 20$ fixed across regimes. The Transformer configuration is deliberately not tuned per regime: it represents a single reusable black-box baseline with no architectural access to the memory horizon, against which the regime-matched INCRT-1L meta-controller is compared.
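A minimal numpy sketch of the INCRT-1L block structure (one pre-LN multi-head self-attention layer over the history window, a residual connection, and no feed-forward sub-layer) is given below. The weight shapes, the absence of biases, and the random initialisation are illustrative assumptions, not the released implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Pre-LN normalisation over the feature axis.
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def incrt_1l_block(H, Wq, Wk, Wv, Wo, K, dk):
    # H: (W, d_model) history tokens with d_model = K * dk (cf. Table 3).
    X = layer_norm(H)
    heads = []
    for h in range(K):
        s = slice(h * dk, (h + 1) * dk)
        Q, Kh, V = X @ Wq[:, s], X @ Wk[:, s], X @ Wv[:, s]
        A = softmax(Q @ Kh.T / np.sqrt(dk))    # (W, W) temporal attention
        heads.append(A @ V)
    return H + np.concatenate(heads, axis=-1) @ Wo   # residual, no FFN

rng = np.random.default_rng(0)
K, dk, W = 4, 16, 20                  # short-regime configuration
d = K * dk
Wq, Wk_, Wv, Wo = (0.1 * rng.standard_normal((d, d)) for _ in range(4))
out = incrt_1l_block(rng.standard_normal((W, d)), Wq, Wk_, Wv, Wo, K, dk)
```

The block maps a $(W, d_{\text{model}})$ window to a tensor of the same shape; the SAC policy head then consumes these features.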

Each of the six cells defined by the two architectures and the three memory regimes $\tau_z \in \{1, 2, 5\}$ s is trained five times with seeds $\{42, 43, 44, 45, 46\}$, yielding thirty training runs. The protocol was fixed *a priori*; no seed was removed from any reported statistic.

##### Statistical tests.

The reported statistical comparisons use the Mann–Whitney $U$ test [[14](https://arxiv.org/html/2605.06877#bib.bib16)] as the primary test, with Welch's $t$ test [[20](https://arxiv.org/html/2605.06877#bib.bib17)] reported for robustness to distributional assumptions. Effect sizes use Cohen's $d$ with pooled variance [[10](https://arxiv.org/html/2605.06877#bib.bib15)]; values $|d| \geq 0.5$ are conventionally medium, $|d| \geq 0.8$ large. The non-parametric test is preferred in view of the small sample size per cell and the presence of an identifiable outlier discussed in Section [7.5](https://arxiv.org/html/2605.06877#S7.SS5).
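The two headline statistics are small enough to state inline. The sketch below implements pooled-variance Cohen's $d$ and the Mann–Whitney $U$ statistic in pure Python, counting pairs where the first sample is smaller and ties at one half; in practice the $p$-values would come from a standard implementation such as `scipy.stats.mannwhitneyu`.

```python
import math

def cohens_d(x, y):
    # Pooled-variance Cohen's d; |d| >= 0.8 is conventionally large.
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / sp

def mann_whitney_u(x, y):
    # U statistic: number of pairs (a, b) with a < b, ties counted 1/2.
    return sum(1.0 if a < b else 0.5 if a == b else 0.0
               for a in x for b in y)
```

Applied to the per-seed $\Delta\%$ samples, a negative $d$ means the first sample (INCRT-1L) lies below the second, i.e. a larger error reduction.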

### 7.2 Head count: from Phase 1 to Phase 2

The head counts used in the main comparison are reported in Table [3](https://arxiv.org/html/2605.06877#S7.T3), together with the Phase 1 surrogate values $K^\star$ of Table [2](https://arxiv.org/html/2605.06877#S6.T2) and the $(K, d_k)$ combinations retained after the Phase 2 ablation.

Table 3: Head counts: Phase 1 surrogate value $K^\star$ (from Table [2](https://arxiv.org/html/2605.06877#S6.T2)), Phase 2 ablation range $[\lceil K^\star/2 \rceil, K^\star]$, and the retained Phase 2 value used in Table [4](https://arxiv.org/html/2605.06877#S7.T4). $d_{\text{model}} = K \cdot d_k$ with $d_k = 16$.

| $\tau_z$ (s) | $K^\star$ (Phase 1) | Phase-2 range | Retained $K$ (Phase 2) |
| --- | --- | --- | --- |
| 1 | 8 | $\{4, \ldots, 8\}$ | 4 |
| 2 | 14 | $\{7, \ldots, 14\}$ | 7 |
| 5 | 11 | $\{6, \ldots, 11\}$ | 8 |

The interaction between Phase 1 and Phase 2 is the following. The surrogate rank $K^\star(\tau_z)$ is an upper bound on the number of attention heads that can carry distinct temporal information through the window, assuming the three structural hypotheses of Proposition [5.15](https://arxiv.org/html/2605.06877#S5.Thmtheorem15). It does not, however, prescribe a specific head count for policy training; it bounds the representational dimension from above, leaving open how that dimension is partitioned between $K$ and $d_k$ and how far the retained value of $K$ falls below the upper bound. Phase 2 therefore runs a constrained grid search over $K \in \{\lceil K^\star/2 \rceil, \ldots, K^\star\}$ at fixed $d_k = 16$, selecting the value that minimises single-seed tracking RMSE on a held-out payload sweep. The retained value at $\tau_z = 2$ s is $K = 7$, i.e. the lower end of the range; at $\tau_z = 5$ s the retained value is $K = 8$, near the middle; at $\tau_z = 1$ s it is $K = 4$, likewise at the lower end.

The claim of the paper is therefore not that Phase 1 produces the reinforcement-learning-optimal head count, but that Phase 1 produces a tight upper bound that is compatible with a much smaller Phase 2 search budget than an unconstrained grid. On the 2-DOF task, a Phase 2 grid that spans $K \in \{1, \ldots, 16\}$ at a single regime requires sixteen training runs; the Phase 1-constrained grid used here requires between four and eight runs per regime. The retained $K$ is reported alongside $K^\star$ in every table below, so that the separation between theoretical prediction and empirical selection is explicit.
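The constrained search range is trivial to compute but worth making explicit; a one-line helper (hypothetical naming) reproduces the Phase-2 ranges of Table 3:

```python
import math

def phase2_grid(k_star):
    # Constrained Phase 2 head-count grid {ceil(K*/2), ..., K*}.
    return list(range(math.ceil(k_star / 2), k_star + 1))

# Ranges implied by the Phase 1 values of Table 2 / Table 3:
# K* = 8  -> {4,...,8};  K* = 14 -> {7,...,14};  K* = 11 -> {6,...,11}.
```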

### 7.3 Main comparison

Table [4](https://arxiv.org/html/2605.06877#S7.T4) reports the per-seed mean and sample standard deviation of $\Delta\%$ across the seeds for each architecture–regime pair, together with the parameter count, the effect size, and the $p$-value. Sample sizes are $n = 10$ at $\tau_z \in \{1, 5\}$ s (five original seeds plus five extension seeds) and $n = 5$ at $\tau_z = 2$ s. The forest plot of Figure [2](https://arxiv.org/html/2605.06877#S7.F2) renders the same data with $95\%$ confidence intervals.

Table 4: Main comparison; $50 + 10 = 60$ training runs. $n$: seeds per cell; $W$: window size; $K$: attention head count; $p_U$: Mann–Whitney one-sided (INCRT-1L $<$ Transformer); $p_W$: Welch two-sided; $d$: Cohen's pooled-variance effect size. Boldface marks significance at $\alpha = 0.05$. Transformer-tuned is a regime-matched baseline with $W = 50$ at both regimes, introduced to isolate the effect of window size from the architectural choice.

| $\tau_z$ | Architecture | $n$ | $W$ | Params | $\Delta\%$ (mean $\pm$ s.d.) | $d$ | $p_U$ | $p_W$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 s | INCRT-1L | 10 | 20 | 114k | $-51.34 \pm 8.11$ | $-1.10$ | **0.013** | **0.027** |
| | Transformer | 10 | 20 | 265k | $-39.27 \pm 13.29$ | — | — | — |
| 2 s | INCRT-1L | 5 | 50 | 241k | $-54.82 \pm 7.94$ | — | — | — |
| | Transformer | 5 | 20 | 265k | $-36.29 \pm 9.93$ | $-2.06$ | **0.008** | **0.012** |
| | Transformer-tuned | 5 | 50 | 271k | $-35.73 \pm 10.71$ | $-2.02$ | **0.008** | **0.014** |
| 5 s | INCRT-1L | 10 | 20 | 282k | $-32.19 \pm 31.83$ | — | — | — |
| | Transformer | 10 | 20 | 265k | $-29.30 \pm 20.47$ | $-0.11$ | 0.214 | 0.813 |
| | Transformer-tuned | 5 | 50 | 271k | $-30.29 \pm 15.86$ | $-0.07$ | 0.257 | 0.880 |

![Refer to caption](https://arxiv.org/html/2605.06877v1/x2.png)

Figure 2: Mean $\Delta\%$ with $95\%$ confidence interval, per architecture and regime. Sample sizes: $n = 10$ at $\tau_z \in \{1, 5\}$ s, $n = 5$ at $\tau_z = 2$ s. Significance at $\alpha = 0.05$ is reached at $\tau_z \in \{1, 2\}$ s; at $\tau_z = 5$ s the effect size is negligible and the test does not reject.

The experimental study delivers three results of differing strength.

At $\tau_z = 1$ s the sample size of $n = 10$ is sufficient to reject the null at $\alpha = 0.05$ under both tests. INCRT-1L attains $\Delta\% = -51.34\%$ against the Transformer's $-39.27\%$, a gap of 12.07 percentage points, with $p_U = 0.013$, $p_W = 0.027$ and $d = -1.10$. The effect size is large. INCRT-1L achieves this advantage with $43\%$ of the Transformer's parameter count.

At $\tau_z = 2$ s, the regime at which the memory horizon is of the same order as the window-time product $W \Delta t$ at the retained window $W = 50$, INCRT-1L attains $\Delta\% = -54.82\%$ against $-36.29\%$, a gap of 18.53 percentage points, with $p_U = 0.008$, $p_W = 0.012$, $d = -2.06$. The effect size is very large; the $p$-value is below $\alpha = 0.01$ under both tests. The statistical evidence is strongest at this regime and rests on $n = 5$ seeds per cell: the extension to $n = 10$ was deferred at this regime because the $n = 5$ result is already conclusive. The parameter count is within $9\%$ of the Transformer baseline.

At $\tau_z = 5$ s the outcome changes qualitatively with the enlarged sample. With $n = 10$, the INCRT-1L mean is $-32.19\%$ against the Transformer's $-29.30\%$, a gap of only 2.89 percentage points, with $p_U = 0.214$, $p_W = 0.813$ and $d = -0.11$. Neither test rejects the null, and the effect size is negligible. The gap between INCRT-1L and the Transformer therefore closes at the long memory horizon once the sample is enlarged from $n = 5$ to $n = 10$. The explanation, developed in Section [7.5](https://arxiv.org/html/2605.06877#S7.SS5), is that the additional five seeds reveal a structural failure mode of INCRT-1L at this regime: four of the ten runs exhibit either divergence or payload-invariant collapse, with flat per-payload RMSE near 0.13–0.17 rad and no discernible improvement over the fixed-gain baseline. The Transformer baseline at the same regime exhibits one such failure out of ten. The failure mode is a local attractor of the reinforcement-learning optimisation that is specific to the long memory horizon and is reached more frequently by INCRT-1L than by the Transformer. The single-layer attention-only architecture, selected in Phase 2 for its simplicity and parameter economy, is not sufficient to ensure escape from this attractor on a fixed training budget. This finding motivates the runtime-adaptive follow-up discussed in Section [8](https://arxiv.org/html/2605.06877#S8).

The empirical picture at the three regimes is therefore as follows. INCRT-1L offers a statistically significant, large-effect advantage at the short and matched regimes, with a parameter economy at the short regime. At the long regime it offers no advantage over the Transformer on an $n = 10$ sample, and exhibits a training pathology that appears in a significant fraction of runs. The combination is compatible with the rank-based analysis of Section [5](https://arxiv.org/html/2605.06877#S5): the head-count bound $K^\star$ is a representational bound on the static architecture, and its prescription is most useful when the reinforcement-learning optimisation can actually exploit the bounded capacity, which, at $\tau_z = 5$ s, it does not reliably do.

##### Is the advantage due to window tuning?

A potential confound in the comparison above is that INCRT-1L uses a larger window ($W = 50$) at $\tau_z = 2$ s than the Transformer baseline ($W = 20$), and the advantage could in principle be attributable to window size rather than to the architectural choice. To isolate the two effects, a regime-tuned Transformer baseline is reported in Table [4](https://arxiv.org/html/2605.06877#S7.T4) under the label *Transformer-tuned*: a two-layer Transformer with feed-forward sub-layer and $K = 4$ heads, identical to the original Transformer baseline except for the window, which is set to $W = 50$ at both $\tau_z = 2$ and $\tau_z = 5$ s. The outcome is unambiguous. At $\tau_z = 2$ s the Transformer-tuned mean is $-35.73\%$, statistically indistinguishable from the $W = 20$ baseline at $-36.29\%$ (two-sided Welch $p = 0.93$); the gap from INCRT-1L shrinks by less than half a percentage point and remains significant, with Cohen's $d = -2.02$, $p_U = 0.008$, $p_W = 0.014$. At $\tau_z = 5$ s the same conclusion holds: the Transformer-tuned mean is $-30.29\%$ against the $W = 20$ baseline's $-29.30\%$, again statistically indistinguishable ($p = 0.92$). The window size therefore does not explain the advantage of INCRT-1L at the matched regime: the advantage rests on the architectural structure, namely the single-layer attention block with head count $K = 7$ prescribed by the Phase 1 analysis and no feed-forward sub-layer.

### 7.4 Payload profile and ablation

Figure [3](https://arxiv.org/html/2605.06877#S7.F3) reports tracking RMSE as a function of payload at each regime. At $\tau_z \in \{1, 2\}$ s the qualitative shape is consistent with Table [4](https://arxiv.org/html/2605.06877#S7.T4): RMSE grows monotonically with payload for both architectures, the INCRT-1L curve lies systematically below the Transformer curve, and the fixed-gain baseline (dashed) sits above both. At $\tau_z = 5$ s the INCRT-1L curve has visibly wider error bars than at the other two regimes: the four runs identified in Section [7.5](https://arxiv.org/html/2605.06877#S7.SS5) as divergent or collapsed contribute the upper tail of the confidence band, and the INCRT-1L mean curve consequently approaches the Transformer curve across the full payload range.

![Refer to caption](https://arxiv.org/html/2605.06877v1/x3.png)

Figure 3: Per-payload tracking RMSE, mean $\pm$ one standard deviation across all seeds ($n=10$ at $\tau_z=1,5$ s; $n=5$ at $\tau_z=2$ s). The widened INCRT-1L error band at $\tau_z=5$ s reflects the failure-mode runs analysed in Section [7.5](https://arxiv.org/html/2605.06877#S7.SS5).

The ablation over depth, feed-forward capacity, and window size is reported in Table [5](https://arxiv.org/html/2605.06877#S7.T5) and rendered in Figure [4](https://arxiv.org/html/2605.06877#S7.F4). The ablation is single-seed, run with seed 42. Three regularities support the choices made in the multi-seed study.

*Depth without a non-linearity diverges.* At $\tau_z=2$ s, the $L=2$ attention-only configuration with $W=20$ diverges during training, yielding $\Delta\%=+37.9$ ($37.9\%$ *worse* than the fixed-gain baseline). At $\tau_z=2$ s and $L=3$ the divergence worsens, reaching $\Delta\%=+79.8$ at $W=100$. Adding a feed-forward sub-layer ($L=2$-FFN) rescues training but still loses between 7 and 14 percentage points to the single-layer block at both regimes. The pattern is consistent with the rank-compression mechanism analysed in [[9](https://arxiv.org/html/2605.06877#bib.bib19)]: stacking attention blocks without an intervening non-linearity compresses the number of independent directions available downstream of the attention.

*The optimal window is non-monotone in the memory horizon.* At $\tau_z=2$ s the best window is $W=50$; at $\tau_z=5$ s it is $W=20$. Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12)(iii) would *guarantee* closing the Markov gap only with $W\geq 500$ at $\tau_z=5$ s. The retained configuration therefore operates in the partial-coverage regime, and the long-horizon result in Table [4](https://arxiv.org/html/2605.06877#S7.T4) should be read accordingly. The most plausible explanation for this inversion is that at long horizons the quadratic cost of attention, against a fixed 50,000-step training budget, tips the balance towards shorter windows with sharper positional signal; the phenomenon is not predicted by the window condition of Section [5](https://arxiv.org/html/2605.06877#S5) and represents a genuine gap between the sufficient condition and the observed empirical optimum.

*The regime-tuned Transformer baseline.* The comparison in Table [4](https://arxiv.org/html/2605.06877#S7.T4) is supplemented by a Transformer-tuned baseline with $W=50$ at both $\tau_z=2$ and $\tau_z=5$ s, which matches the window of the INCRT-1L winner at the matched regime. As reported above, the tuned baseline is statistically indistinguishable from the $W=20$ baseline at either regime ($p\geq 0.92$ under Welch), and INCRT-1L retains its large advantage over it at $\tau_z=2$ s ($d=-2.02$, $p_U=0.008$). Window size therefore does not explain the gap; the interpretation of the INCRT-1L advantage at the matched regime as architectural (single-layer, attention-only, $K=7$ heads) is consistent with the data.
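The seed-level statistical protocol used throughout these comparisons (two-sided Welch $t$-test, Mann-Whitney $U$, Cohen's $d$ with pooled standard deviation) can be sketched as follows. The per-seed $\Delta\%$ arrays below are illustrative placeholders, not the paper's data, and `compare_runs` is a hypothetical helper name.

```python
# Sketch of the seed-level comparison reported in Table 4: Welch t-test,
# Mann-Whitney U, and Cohen's d. Per-seed Delta% values are placeholders.
import numpy as np
from scipy import stats

def compare_runs(delta_a, delta_b):
    """Compare two per-seed Delta% samples (more negative = better tracking)."""
    delta_a, delta_b = np.asarray(delta_a, float), np.asarray(delta_b, float)
    # Welch's t-test: unequal variances, possibly unequal sample sizes.
    _, p_welch = stats.ttest_ind(delta_a, delta_b, equal_var=False)
    # Mann-Whitney U: non-parametric check of the same null.
    _, p_u = stats.mannwhitneyu(delta_a, delta_b, alternative="two-sided")
    # Cohen's d with pooled standard deviation.
    na, nb = len(delta_a), len(delta_b)
    pooled = np.sqrt(((na - 1) * delta_a.var(ddof=1)
                      + (nb - 1) * delta_b.var(ddof=1)) / (na + nb - 2))
    d = (delta_a.mean() - delta_b.mean()) / pooled
    return {"p_welch": float(p_welch), "p_U": float(p_u), "cohen_d": float(d)}

incrt = [-55.1, -53.8, -56.0, -54.2, -52.9]   # placeholder per-seed Delta%
transf = [-36.3, -35.0, -37.1, -36.8, -35.5]  # placeholder per-seed Delta%
print(compare_runs(incrt, transf))
```

With fully separated samples of five seeds each, the exact two-sided Mann-Whitney $p$ is $2/\binom{10}{5}\approx 0.008$, matching the smallest $p_U$ values reported in the tables.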

Table 5: Single-seed $(L,\text{FFN},W)$ ablation at $\tau_z\in\{2,5\}$ s, $d_k=16$, $K=K^\star$. Starred rows are the Phase-2 winners of Table [4](https://arxiv.org/html/2605.06877#S7.T4); daggered rows diverged during training. Missing cells at $W=100$, $\tau_z=5$ s were not recovered after a compute time-out.

![Refer to caption](https://arxiv.org/html/2605.06877v1/x4.png)

Figure 4: $(L,\text{FFN},W)$ ablation heatmap. Green: better than baseline; red: worse. Solid border: Phase-2 winner. Daggered cells diverged during training.

### 7.5 Failure-mode analysis at long memory horizon

The vanishing gap between INCRT-1L and Transformer at $\tau_z=5$ s is, with an $n=10$ sample, the central empirical finding of the study and deserves a dedicated analysis. The raw per-seed $\Delta\%$ distribution at $\tau_z=5$ s is reproduced in Figure [5](https://arxiv.org/html/2605.06877#S7.F5). Of the ten INCRT-1L runs at this regime, five track well ($\Delta\%\in[-64.7,-51.4]$), two track weakly ($\Delta\%\in[-35.7,-34.3]$), two diverge during training ($\Delta\%=+15.0$ and $+23.3$), and one collapses to a payload-invariant policy ($\Delta\%=-6.5$). The diverged and collapsed runs, four of the ten, are the *failure-mode runs*. The Transformer baseline at the same regime has one diverged run and no collapsed runs; the regime-tuned Transformer ($W=50$) has no diverged and no collapsed runs out of five. The difference in failure rate, 4/10 for INCRT-1L against 1/10 and 0/5 for the two Transformer variants, is the structural asymmetry that explains the closure of the gap at this regime and attributes the pathology specifically to the single-layer attention-only block.

![Refer to caption](https://arxiv.org/html/2605.06877v1/x5.png)

Figure 5: Per-seed $\Delta\%$ distribution. Crosses: runs that diverged during training; diamonds: run that collapsed to a payload-invariant policy; circles: runs that trained normally. The INCRT-1L box at $\tau_z=5$ s contains four markers of failure-mode type.

The failure mode admits a clean empirical signature. In a trained run, tracking RMSE grows monotonically with payload, typically by 0.04–0.10 rad over the payload sweep. In a failure-mode run, RMSE is nearly flat in payload: the four INCRT-1L pathological runs have RMSE ranges across payloads of 0.010, 0.030, 0.025 and 0.048 rad respectively, against an average range of 0.045 rad for the six healthy runs at the same regime. The flat profile is the fingerprint of a policy that is approximately invariant to the payload context: the attention heads responsible for modulating the controller gains in response to recent motion have been effectively bypassed, and the policy reduces to a function of the trajectory reference alone. Under the payload distribution used at training, the optimal payload-invariant policy has an expected tracking error close to the fixed-gain baseline, so the failure-mode runs yield $\Delta\%$ near zero (collapsed seed 45: $-6.5\%$; diverged seeds 47 and 49: $+15.0\%$ and $+23.3\%$).
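The flat-in-payload signature can be turned into a simple run classifier. A minimal sketch, assuming per-payload RMSE arrays in radians; the 0.035 rad range threshold and the sample arrays are our assumptions, not values from the paper:

```python
# Flag a run as payload-invariant when its RMSE range across the payload
# sweep falls below a threshold. Threshold and data are illustrative.
import numpy as np

def payload_invariant(rmse_per_payload, range_threshold=0.035):
    """True if tracking RMSE is nearly flat in payload (units: rad)."""
    r = np.asarray(rmse_per_payload, float)
    return float(r.max() - r.min()) < range_threshold

healthy = [0.050, 0.062, 0.075, 0.089, 0.098]    # range ~0.048 rad
collapsed = [0.071, 0.074, 0.070, 0.076, 0.073]  # range ~0.006 rad
print(payload_invariant(healthy), payload_invariant(collapsed))  # False True
```

Note that one of the four pathological runs reported above has a range of 0.048 rad, comparable to the healthy average, so a single fixed threshold would not separate all runs; the paper's classification also uses divergence during training.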

A mechanism consistent with this signature is as follows. The INCRT-1L meta-controller is attention-only; the payload-dependent information must pass through the attention heads to reach the controller gains. If during early training the gradient norms of the payload-sensitive attention heads fall below the noise floor of the reinforcement-learning updates, those heads cease to receive informative gradient signal and their values freeze near initialisation. The output of the block becomes insensitive to the payload component of the history, the SAC reward plateaus at the payload-invariant baseline, and the run stabilises. The same mechanism does not appear at $\tau_z\in\{1,2\}$ s because the memory structure there is shallower: the payload-sensitive information is carried by a smaller number of temporal correlations and is correspondingly harder to silence by vanishing gradient alone.
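A minimal sketch of the diagnostic implied by this mechanism: an exponential-moving-average tracker of per-head gradient norms that flags heads which stay below a noise floor for a sustained number of updates. The class name, floor, patience, and EMA constants are our assumptions, not the paper's implementation.

```python
# Track per-head gradient norms and flag heads silenced below a noise floor.
import numpy as np

class HeadGradientMonitor:
    def __init__(self, n_heads, noise_floor=1e-4, patience=100, ema=0.99):
        self.norms = np.zeros(n_heads)       # EMA of per-head gradient norms
        self.below = np.zeros(n_heads, int)  # consecutive updates below floor
        self.noise_floor, self.patience, self.ema = noise_floor, patience, ema

    def update(self, per_head_grads):
        """per_head_grads: one gradient array per attention head.
        Returns indices of heads flagged as silent."""
        for k, g in enumerate(per_head_grads):
            n = float(np.linalg.norm(g))
            self.norms[k] = self.ema * self.norms[k] + (1 - self.ema) * n
            self.below[k] = self.below[k] + 1 if self.norms[k] < self.noise_floor else 0
        return np.where(self.below >= self.patience)[0]
```

Fed with the per-head gradients of each reinforcement-learning update, the monitor would return the indices of heads whose signal has fallen into the regime described above; the runtime-adaptive follow-up would then prune and reinitialise them.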

The Transformer baseline exhibits the same failure mode at a lower rate. Its two-layer architecture with feed-forward sub-layers admits alternative pathways for payload-dependent information: even if the attention at the first layer collapses, the feed-forward and the second attention layer can partially recover the signal. The single-layer attention-only INCRT-1L block has no such redundancy. This observation does not imply that depth with FFN is superior in general (the ablation of Section [7.4](https://arxiv.org/html/2605.06877#S7.SS4) shows that depth without FFN is catastrophic and that even depth with FFN underperforms single-layer attention-only at $\tau_z=2$ s), but it does imply that the single-layer architecture carries a specific failure risk at long memory horizons that is absent in the deeper baseline.

##### Implications for the static Phase-1/Phase-2 pipeline.

The failure-mode analysis locates the limitation of the present approach at the interface between Phase 1 and Phase 2: Phase 1 prescribes a head count $K^\star$ sufficient to represent the memory, but the static Phase-2 architecture that uses $K^\star$ does not guarantee that the reinforcement-learning optimisation will actually exploit all $K$ heads. At long memory horizons, the chance that a subset of the heads becomes silent during training is non-negligible and depends on the initial seed. A runtime-adaptive formulation in which the capacity of the attention block is monitored during training, and under-utilised heads are pruned and replaced, removes the failure mode by construction: the surrogate that justifies Phase 1 is then re-run at intervals during Phase 2, with feedback from the reinforcement-learning gradient signal informing the growth and pruning decisions. The follow-up outlined in Section [8](https://arxiv.org/html/2605.06877#S8) develops this formulation.

### 7.6 Other limitations

Three further limitations of the present study are reported explicitly.

*Statistical power at the matched regime.* The $\tau_z=2$ s result is based on $n=5$ seeds. The effect size is very large and both tests reject the null well below $\alpha=0.01$, so the conclusion is not power-limited. A replication at $n=10$ is nonetheless planned, together with extensions to intermediate regimes $\tau_z\in\{1.5,3\}$ s. The total training cost of the sixty runs reported in Table [4](https://arxiv.org/html/2605.06877#S7.T4) (thirty original runs, twenty seed-extension runs, ten regime-tuned runs) is approximately 25 GPU-hours on a single NVIDIA A100.

*Scope of the baseline suite.* Two Transformer variants are considered in Table [4](https://arxiv.org/html/2605.06877#S7.T4): the original $W=20$ configuration and the regime-tuned $W=50$ configuration. Additional baselines, such as a non-attention history model (e.g. a convolutional history encoder) and a parameter-matched shallow attention block with a head count independent of the Phase-1 analysis, would further disentangle the contribution of individual architectural choices. Such extensions are left to future work.

*Scope of the friction model.* Stribeck friction with an exponentially decaying internal state is a canonical surrogate for a broader class of unobservable memory phenomena, but the experimental evaluation is confined to this model. Generalisation to soft-contact hysteresis, joint elasticity, or damper dynamics is an open empirical question.

## 8 Discussion

The experimental study of Section [7](https://arxiv.org/html/2605.06877#S7) establishes two statistically significant regimes of advantage and one regime of failure. At the short memory horizon $\tau_z=1$ s the single-layer INCRT-1L meta-controller outperforms the two-layer Transformer baseline by 12.1 percentage points of $\Delta\%$, with $p_U=0.013$ and Cohen's $d=-1.10$ on $n=10$ seeds, and does so with $43\%$ of the baseline's parameter count. At the matched memory horizon $\tau_z=2$ s the advantage is larger: 18.5 percentage points, $p_U=0.008$, $d=-2.06$ on $n=5$ seeds, with parameter count within $9\%$ of the baseline. Both results hold under both the non-parametric and the parametric test at $\alpha=0.05$, with large effect sizes. At the long memory horizon $\tau_z=5$ s the advantage vanishes on $n=10$: the gap shrinks to 2.9 percentage points and the effect size is negligible ($d=-0.11$). The driver of this reversal is a failure-mode cluster specific to INCRT-1L at long horizons, identified in Section [7.5](https://arxiv.org/html/2605.06877#S7.SS5): four of the ten runs diverge or collapse to a payload-invariant policy, against one of the ten Transformer runs at the same regime.

The two significant results are obtained at the regimes at which the theoretical analysis of Section [5](https://arxiv.org/html/2605.06877#S5) is most nearly operational. At the short horizon the window-time product $W\Delta t=0.2$ s is of the same order as the memory horizon $\tau_z=1$ s, and the sufficient condition of Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12)(iii) is satisfied with slack. At the matched horizon the retained $W=50$, $\Delta t=10$ ms give $W\Delta t=0.5$ s against $\tau_z=2$ s; the window is formally insufficient by a factor of four, but the ablation shows it to be empirically optimal. The head-count bound $K^\star$ from Phase 1 peaks at the matched horizon ($K^\star=14$) and is lower at the shorter and longer horizons ($K^\star=8$ and $11$); the Phase-2 retained values, at the lower end of the $[K^\star/2,K^\star]$ interval, use between 4 and 8 heads. The two stages of the pipeline are compatible: Phase 1 bounds the capacity, Phase 2 retains a specific value within the bound, and the two results of the paper are obtained at regimes where this composition operates cleanly.

The ablation of Section [7.4](https://arxiv.org/html/2605.06877#S7.SS4) yields two observations of broader interest than the head-count question. The first is that attention depth without a feed-forward non-linearity is consistently harmful in this task, with multiple configurations diverging during training and the remainder losing significant tracking performance relative to the single-layer attention-only block. The pattern is compatible with the rank-compression mechanism analysed in [[9](https://arxiv.org/html/2605.06877#bib.bib19)]: stacking attention blocks without a rank-recovery non-linearity reduces the number of independent directions available downstream of the attention. For the present task, where the attention carries temporal information that must be preserved, the degradation is severe enough to make training unstable. The finding is qualitative and its generalisation beyond the Stribeck-friction setting of this paper is an open question.

The second observation concerns the window size. The optimal window is non-monotone in the memory horizon: $W=50$ at $\tau_z=2$ s, but $W=20$ at both $\tau_z=1$ s and $\tau_z=5$ s. The inversion at long horizons is not predicted by the theory, which requires $W$ proportional to $\tau_z$ to close the Markov gap. A plausible mechanism is that the quadratic cost of attention, against a fixed training budget, tips the balance towards shorter windows with sharper positional signal when the memory horizon is long. The observation is practically useful: it suggests that operating in the partial-coverage regime is not always suboptimal, provided that the policy is given enough training budget to resolve the coarser temporal structure. A quantitative characterisation of this trade-off is outside the scope of the present paper.

The failure-mode cluster observed at $\tau_z=5$ s is the most informative single outcome of the study. As detailed in Section [7.5](https://arxiv.org/html/2605.06877#S7.SS5), four of the ten INCRT-1L runs at this regime either diverge or collapse to a payload-invariant policy, against one of the ten Transformer runs. The empirical signature of the collapsed runs is a flat per-payload RMSE profile, with total range across payloads an order of magnitude smaller than the range observed in healthy runs. The mechanism proposed there is that the policy settles in a region of the parameter space in which the attention heads responsible for payload modulation have negligible gradient, so that the policy reduces to a function of the trajectory reference alone. Because INCRT-1L is attention-only, the payload signal has no alternative pathway through the block, and the collapse is terminal; the two-layer Transformer with FFN retains alternative pathways through the feed-forward sub-layer and its second attention layer, and collapses at a lower rate. The phenomenon is a local attractor of the reinforcement-learning optimisation that is specific to the long memory horizon: at the shorter and matched horizons, the memory structure is shallower and the payload-sensitive information is harder to silence by vanishing gradient alone. A direct test of this mechanism is available in principle through inspection of the attention-head gradient norms on a failed run; a more substantive remedy belongs to the follow-up in which the attention capacity is adjusted during training rather than fixed by Phase 1.

##### Follow-up agenda.

The natural extension of the present work moves the rank-tracking dynamics of [[8](https://arxiv.org/html/2605.06877#bib.bib18)] inside the reinforcement-learning loop, replacing the fixed Phase-2 architecture with a meta-controller whose head count evolves during closed-loop operation. Three motivations support this extension. First, the memory horizon $\tau_z$ may itself be non-stationary in realistic deployment, because friction characteristics change with wear, temperature, or lubrication; the optimal head count is then time-varying and a fixed Phase-1 value is a biased point estimate. Second, the failure mode of Section [7.5](https://arxiv.org/html/2605.06877#S7.SS5) is consistent with a subset of heads becoming effectively silent during training; a runtime growth signal that detects under-utilisation and a pruning signal that detects redundancy provide a structural feedback loop absent from the present static formulation. Third, a runtime-adaptive scheme allocates capacity only where the growth signal justifies it, with consequent savings in both training time and deployment footprint.

The follow-up requires a new experimental design in which $\tau_z$ is made non-stationary within episodes, because the static-$\tau_z$ experiments of the present paper are not informative about the runtime-adaptive setting. It also requires resolution of the actor-critic non-stationarity issue that motivated the search-then-retrain separation of Section [6.3](https://arxiv.org/html/2605.06877#S6.SS3): running the rank-tracking dynamics concurrently with reinforcement-learning training is known to produce instabilities in actor-critic reinforcement learning, because the critic must track a non-stationary policy-parameter dimension. Whether the structured rank-tracking of [[8](https://arxiv.org/html/2605.06877#bib.bib18)], with its bidirectional gate, admits a co-training schedule that preserves critic stability is the central open question of the follow-up.

## 9 Conclusions

Three conclusions are warranted by the study. First, for control tasks in which the friction dynamics carry an internal state that is observable only through its effect on past motion, the architectural question "how many attention heads are needed" admits an answer that is separable from the question "how should the policy be trained". The rank of a task-specific covariance operator, estimated offline and entirely separately from the reinforcement-learning objective, provides an upper bound on the head count that compresses the Phase-2 search by a factor of two to four relative to an unconstrained grid. The advantage of this separation is organisational as much as theoretical: Phase 1 runs on a CPU in minutes and its output is a small integer, which cuts a month of reinforcement-learning training down to a week.
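The Phase-1 idea summarised here, a head-count upper bound read off the numerical rank of an offline-estimated covariance operator, can be sketched as follows. The energy threshold, the helper name, and the surrogate data in the usage example are our assumptions, not the paper's procedure:

```python
# Numerical rank of a sample covariance as a head-count upper bound:
# the smallest k whose top-k eigenvalues capture a given energy fraction.
import numpy as np

def head_count_bound(samples, energy=0.99):
    """samples: (n_samples, dim) array. Returns the numerical rank."""
    X = samples - samples.mean(axis=0)
    cov = X.T @ X / (len(X) - 1)                 # sample covariance
    w = np.sort(np.linalg.eigvalsh(cov))[::-1]   # eigenvalues, descending
    frac = np.cumsum(w) / w.sum()                # cumulative energy fraction
    return int(np.searchsorted(frac, energy) + 1)

# Surrogate data with exactly three independent directions embedded in 6-D.
t = np.linspace(0, 2 * np.pi, 400, endpoint=False)
S = np.column_stack([np.sin(t), np.cos(t), np.sin(2 * t)])
print(head_count_bound(np.hstack([S, S])))  # 3
```

The actual Phase-1 surrogate operates on the autocovariance of the memory-state gradient along the temporal window rather than on raw samples, but the rank-extraction step has this shape.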

Second, the architectural choices that result are not parameter-efficient in an unconditional sense. At the shortest memory horizon the head-count analysis produces a meta-controller with less than half the parameters of the Transformer baseline; at the longest, the head count it prescribes exceeds that of the baseline. The parameter economy of the proposed method is a consequence of the memory structure of the task, not of the method itself, and it inverts as the memory horizon grows. A practical consequence is that the head-count analysis should be run on the target horizon, not on a proxy, because its output is regime-specific.

Third, the study exposes a structural failure mode of the static Phase-1/Phase-2 separation that is specific to the long memory horizon: four of ten training runs at that horizon settle in a local attractor of the reinforcement-learning optimisation in which the attention is effectively silenced, and the policy reduces to a payload-invariant map. The failure rate is asymmetric between architectures, 4/10 for single-layer attention-only against 1/10 for the two-layer Transformer with feed-forward, and identifies the single-layer architecture as the element of the proposed pipeline most vulnerable at long horizons. The finding has two implications. For the present static architecture, it argues for a larger sample size at the long horizon, with better control of initial conditions, and for an alternative parameter initialisation that biases against the attention-silencing attractor. For future work, it argues for a formulation in which the attention capacity is adjusted online: a head whose gradient norm falls below threshold is pruned and replaced, which would prevent the collapse observed here. The runtime formulation is also independently motivated by the non-stationarity of $\tau_z$ in realistic deployment, where friction characteristics vary with wear, temperature, and lubrication.

The companion follow-up under preparation pursues this runtime formulation on a task in which $\tau_z$ varies within episodes. The central open problem is the interaction between the rank-tracking dynamics and the actor-critic optimisation: whether a co-training schedule exists that preserves critic stability while the attention head count evolves. A positive answer would close the loop between the two halves of the present paper; a negative answer would identify the static Phase-1/Phase-2 separation as a structural limitation rather than a design choice.

## Acknowledgements

The authors thank the MIRPALab and LTI laboratory members for discussions on the interplay between adaptive control, safe reinforcement learning, and attention-based meta-control.

## Conflict of Interest Statement

The authors declare no conflict of interest.

## Data Availability Statement

The per-seed JSON result files and the aggregation and statistical-analysis scripts used to produce Table [4](https://arxiv.org/html/2605.06877#S7.T4) and Figure [3](https://arxiv.org/html/2605.06877#S7.F3) will be made available in a public repository upon acceptance of the manuscript.

## Appendix A Proofs of Section [5](https://arxiv.org/html/2605.06877#S5) results

This appendix collects the proofs and derivations referenced in Section [5](https://arxiv.org/html/2605.06877#S5). Sections [A.1](https://arxiv.org/html/2605.06877#A1.SS1) and [A.2](https://arxiv.org/html/2605.06877#A1.SS2) treat the Markovian optimality gap and the $\sigma_z^2$-scaling law respectively. Section [A.3](https://arxiv.org/html/2605.06877#A1.SS3) verifies the three structural properties of the temporal residual operator required by [[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 7]. Section [A.4](https://arxiv.org/html/2605.06877#A1.SS4) establishes the non-monotonicity of $K^\star(\tau_z)$.

### A.1 Proof of Proposition [5.12](https://arxiv.org/html/2605.06877#S5.Thmtheorem12)

We prove the three parts in turn.

#### A.1.1 Part (i): history-dependence of the optimal parameter

Under Definition [5.10](https://arxiv.org/html/2605.06877#S5.Thmtheorem10), the state of the controlled system is $(q,\dot q,z)$. The optimal feedback is the minimiser of the expected cost-to-go conditional on the full state:

$$\theta^\star_{\mathrm{ctrl}}(t)=\operatorname*{arg\,min}_{\theta\in\mathcal{P}}\;\mathbb{E}\left[\int_t^T \ell\bigl(q(s),q_d(s)\bigr)\,ds\;\Big|\;q(t),\dot q(t),z(t),\theta\right].\tag{32}$$

By the strong convexity of $\ell$ in $\theta$, the minimiser is unique and differentiable in the conditioning random variables. In particular, $\theta^\star_{\mathrm{ctrl}}(t)$ depends on $z(t)$. By Definition [5.10](https://arxiv.org/html/2605.06877#S5.Thmtheorem10)(ii), $z(t)$ is not $\sigma(q(t),\dot q(t))$-measurable: equivalently, there exist sample paths $(q_1,\dot q_1)$ and $(q_2,\dot q_2)$ that agree at time $t$ up to measure-zero events but produce different values of $z(t)$, forced by different histories. Hence $\theta^\star_{\mathrm{ctrl}}(t)$ is not a function of $(q(t),\dot q(t))$ alone, completing part (i).

#### A.1.2 Part (ii): lower bound for Markovian policies

Let $G^{\mathrm{Mk}}(q,\dot q)$ be any fixed measurable function of the instantaneous state. Its expected cost decomposes as

$$\begin{aligned}\mathbb{E}\bigl[\ell(G^{\mathrm{Mk}})\bigr]&=\mathbb{E}\left[\ell\bigl(G^{\mathrm{Mk}}(q,\dot q),\,z\bigr)\right]&(33)\\&=\mathbb{E}_{q,\dot q}\left[\mathbb{E}_{z\mid q,\dot q}\bigl[\ell\bigl(G^{\mathrm{Mk}}(q,\dot q),z\bigr)\bigr]\right].&(34)\end{aligned}$$

Fix $(q,\dot q)$. Let $\bar z=\mathbb{E}[z\mid q,\dot q]$ and let $\theta^\star(q,\dot q,z)=\operatorname*{arg\,min}_\theta\,\ell(\theta,z)$ be the pointwise optimiser in $\theta$ for every value of $z$. By strong convexity of $\ell$ with modulus $\mu>0$,

$$\ell(\theta,z)\;\geq\;\ell\bigl(\theta^\star(q,\dot q,z),z\bigr)+\frac{\mu}{2}\bigl\|\theta-\theta^\star(q,\dot q,z)\bigr\|^2\tag{35}$$

for all $\theta\in\mathcal{P}$ and all $z$. Taking $\theta=G^{\mathrm{Mk}}(q,\dot q)$ (which does not depend on $z$) and taking the conditional expectation over $z\mid(q,\dot q)$:

$$\mathbb{E}_z\left[\ell(G^{\mathrm{Mk}},z)\right]\;\geq\;\mathbb{E}_z\left[\ell(\theta^\star,z)\right]+\frac{\mu}{2}\,\mathbb{E}_z\left[\bigl\|G^{\mathrm{Mk}}-\theta^\star(q,\dot q,z)\bigr\|^2\right].\tag{36}$$

The rightmost term is a conditional second moment. Using $\mathbb{E}_z\|\theta^\star-\bar\theta^\star\|^2\leq\mathbb{E}_z\|G^{\mathrm{Mk}}-\theta^\star\|^2$, where $\bar\theta^\star:=\mathbb{E}_z[\theta^\star(q,\dot q,z)]$ is the best Markovian response, we obtain

$$\mathbb{E}_z\left[\bigl\|G^{\mathrm{Mk}}-\theta^\star(q,\dot q,z)\bigr\|^2\right]\;\geq\;\mathrm{Var}\bigl(\theta^\star(q,\dot q,z)\,\big|\,q,\dot q\bigr).\tag{37}$$

By the implicit function theorem applied to the first-order optimality condition of $\theta^\star$, there exists $\kappa>0$ such that

$$\mathrm{Var}\bigl(\theta^\star(q,\dot q,z)\,\big|\,q,\dot q\bigr)\;\geq\;\kappa^2\,\mathrm{Var}(z\mid q,\dot q).\tag{38}$$

The constant $\kappa$ measures the sensitivity of the pointwise optimiser to $z$ and is strictly positive whenever $z$ enters the cost through a non-degenerate direction, which is the case for friction-dependent meta-control by construction. Combining ([36](https://arxiv.org/html/2605.06877#A1.E36))–([38](https://arxiv.org/html/2605.06877#A1.E38)) and taking the outer expectation over $(q,\dot q)$:

$$\mathbb{E}\bigl[\ell(G^{\mathrm{Mk}})\bigr]-\ell^\star\;\geq\;\frac{\mu\kappa^2}{2}\cdot\mathbb{E}\left[\mathrm{Var}(z\mid q,\dot q)\right]\;=\;c_1\sigma_z^2,\tag{39}$$

with $c_1=\mu\kappa^2/2$. This is ([23](https://arxiv.org/html/2605.06877#S5.E23)).

#### A.1.3 Part (iii): windowed meta-controller achieves $o(\sigma_z^2)$

Let $\mathcal{H}_W(t):=\{q(s),\dot q(s):s\in[t-W,t]\}$ be the history window of length $W$. We claim that, when $W\geq H_z$, there exists a measurable map $\Psi:\mathcal{H}_W\to\mathbb{R}^{n_z}$ with the property $\mathbb{E}[(\Psi(\mathcal{H}_W(t))-z(t))^2]\to 0$ as $W\to\infty$. Fixing $W$ sufficiently large, $\Psi$ recovers $z(t)$ up to an error of order $e^{-W/H_z}$, and this error enters the meta-controller excess cost with the same multiplicative constant as in part (ii). Hence, for a windowed meta-controller $G^W$ which composes $\Psi$ with $\theta^\star$, the excess cost is bounded by $c_1\,e^{-2W/H_z}\sigma_z^2=o(\sigma_z^2)$ as $W\to\infty$.

The existence of $\Psi$ follows from the linear (or linearisable) dynamics $\dot z=\zeta(q,\dot q,z)$: the variation-of-constants formula expresses $z(t)$ as a convolution integral of $\dot q$ over the past, and truncating the integral at depth $W$ introduces an exponentially decaying error. A universal approximator (e.g., a causal Transformer with attention over the window) can realise $\Psi$ to any desired approximation error, by Cybenko-type arguments adapted to sequence models. This completes part (iii).
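The truncation argument can be checked numerically. The sketch below discretises illustrative memory dynamics $\dot z=-z/H_z+\dot q$ under constant excitation and compares the full recursion against a window-truncated convolution; the reconstruction error contracts by a factor of $e^{-1}$ per unit of $W/H_z$, as the $e^{-W/H_z}$ bound predicts. All constants are assumptions for illustration.

```python
# Window-truncated variation-of-constants reconstruction of the memory state.
import numpy as np

def truncation_error(W, H_z=1.0, dt=1e-3, T=20.0):
    """Error of reconstructing z(T) from a window of length W of qdot."""
    n = int(T / dt)
    qdot = np.ones(n)                 # constant excitation: deterministic check
    a = np.exp(-dt / H_z)             # exact one-step decay of the memory state
    z = 0.0
    for k in range(n):                # full recursion for the true z(T)
        z = a * z + dt * qdot[k]
    w = int(W / dt)
    kernel = dt * a ** np.arange(w)   # exp(-s/H_z) weights, truncated at depth W
    z_hat = float(kernel @ qdot[n - w:][::-1])  # newest sample weighted most
    return abs(z_hat - z)

e1, e2 = truncation_error(1.0), truncation_error(2.0)
print(e1, e2, e2 / e1)  # the ratio is close to exp(-1)
```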

### A.2 Derivation of the $\sigma_z^2$ scaling law (Remark [5.13](https://arxiv.org/html/2605.06877#S5.Thmtheorem13))

We derive the closed-form expression for $\sigma_z^2$ under the assumptions that $z$ follows the linear dynamics $\dot z=-z/\tau_z+\lambda_z\dot q$ and that $\dot q$ is a zero-mean second-order stationary process with autocorrelation $\rho_{\dot q}(\tau)$ and variance $\sigma_{\dot q}^2=\mathbb{E}[\dot q^2]$.

The variation-of-constants formula gives

$$z(t)=\lambda_z\int_{-\infty}^{t}e^{-(t-s)/\tau_z}\,\dot q(s)\,ds.\tag{40}$$

Stationarity in the steady regime implies

$$\mathrm{Var}(z(t))=\lambda_z^2\int_{-\infty}^{t}\!\!\int_{-\infty}^{t}e^{-(t-s_1)/\tau_z}\,e^{-(t-s_2)/\tau_z}\,\mathbb{E}[\dot q(s_1)\dot q(s_2)]\,ds_1\,ds_2.\tag{41}$$

Substituting $\mathbb{E}[\dot q(s_1)\dot q(s_2)]=\sigma_{\dot q}^2\rho_{\dot q}(s_1-s_2)$ and changing variables $u=t-s_1$, $v=t-s_2$:

$$\mathrm{Var}(z(t))=\lambda_z^2\sigma_{\dot{q}}^2\int_{0}^{\infty}\!\!\int_{0}^{\infty}e^{-u/\tau_z}e^{-v/\tau_z}\,\rho_{\dot{q}}(u-v)\,du\,dv. \qquad (42)$$
The conditional variance $\sigma_z^2=\mathbb{E}[\mathrm{Var}(z\mid q,\dot{q})]$ differs from $\mathrm{Var}(z)$ only by the information that $(q,\dot{q})$ carry about $z$ at the present time. For the linear-Gaussian case, the ratio $\sigma_z^2/\mathrm{Var}(z)$ is $(1-\rho_{\dot{q}z}^2(0))$, where $\rho_{\dot{q}z}(0)$ is the normalised covariance between $\dot{q}(t)$ and $z(t)$. A direct computation from ([40](https://arxiv.org/html/2605.06877#A1.E40)) gives

$$\mathrm{Cov}(\dot{q}(t),z(t))=\lambda_z\int_{0}^{\infty}e^{-u/\tau_z}\,\sigma_{\dot{q}}^2\,\rho_{\dot{q}}(u)\,du, \qquad (43)$$

and combining with ([42](https://arxiv.org/html/2605.06877#A1.E42)) after simplification yields

$$\sigma_z^2(\tau_z)=\lambda_z^2\sigma_{\dot{q}}^2\cdot\frac{\tau_z}{2}\cdot\bigl(1-\rho_{\dot{q}}(\tau_z)\bigr), \qquad (44)$$

where $\rho_{\dot{q}}(\tau_z)$ is the autocorrelation of $\dot{q}$ at the lag equal to the memory horizon. Two limits are of interest:

1. (i) As $\tau_z\to 0$, $\rho_{\dot{q}}(\tau_z)\to 1$ and the factor $\tau_z(1-\rho_{\dot{q}}(\tau_z))$ vanishes (Taylor expansion of $\rho$), so $\sigma_z^2\to 0$: a memoryless policy becomes optimal.
2. (ii) As $\tau_z\to\infty$, $\rho_{\dot{q}}(\tau_z)\to 0$ and the factor converges to $\tau_z/2$. The conditional variance grows linearly in $\tau_z$, so the Markovian gap diverges: no memoryless policy can achieve bounded excess cost.

For the task considered in Section [7](https://arxiv.org/html/2605.06877#S7), $\dot{q}$ is a sinusoid with the reference-trajectory period $2\pi$: in this case $\rho_{\dot{q}}$ oscillates and ([44](https://arxiv.org/html/2605.06877#A1.E44)) gives a predicted scaling in $\tau_z$ consistent with the multi-seed comparison reported in Table [4](https://arxiv.org/html/2605.06877#S7.T4).
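The two limits can be sanity-checked by evaluating (44) directly. The snippet below assumes an illustrative exponentially correlated excitation, $\rho_{\dot{q}}(\tau)=e^{-|\tau|/\tau_q}$ (the task in Section 7 uses a sinusoidal $\dot{q}$ instead), with placeholder constants:

```python
import numpy as np

# Direct evaluation of the two limits of (44) under an assumed
# exponential autocorrelation rho(t) = exp(-|t|/tau_q).
# lam, sigma_q, tau_q are illustrative placeholder constants.
lam, sigma_q, tau_q = 1.0, 1.0, 0.3

def sigma_z_sq(tau_z):
    rho = np.exp(-tau_z / tau_q)          # autocorrelation at lag tau_z
    return lam**2 * sigma_q**2 * (tau_z / 2) * (1 - rho)

# (i) tau_z -> 0: the factor tau_z * (1 - rho(tau_z)) vanishes
assert sigma_z_sq(1e-4) < 1e-7
# (ii) tau_z >> tau_q: linear growth with slope lam^2 * sigma_q^2 / 2
taus = np.array([10.0, 20.0, 40.0])
vals = np.array([sigma_z_sq(t) for t in taus])
assert np.allclose(vals / taus, 0.5, rtol=1e-3)
```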

### A.3 Structural properties of $A_{\mathrm{res}}^{\mathrm{temp}}$ (Proposition [5.15](https://arxiv.org/html/2605.06877#S5.Thmtheorem15))

We verify in turn the three properties of $A_{\mathrm{res}}^{\mathrm{temp}}$ required by [[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 7]: symmetry, rank-one deflation, and incoherence of the deflated directions.

##### Symmetry.

By its definition ([25](https://arxiv.org/html/2605.06877#S5.E25)) as an expected outer product, $A_{\mathrm{res}}^{\mathrm{temp}}$ is symmetric positive semi-definite: $A_{\mathrm{res}}^{\mathrm{temp}}=A_{\mathrm{res}}^{\mathrm{temp}\,\top}$ and $\lambda_i(A_{\mathrm{res}}^{\mathrm{temp}})\geq 0$ for all $i$.

##### Rank-one deflation.

Differentiating the variation-of-constants formula ([40](https://arxiv.org/html/2605.06877#A1.E40)) with respect to $\dot{q}(t-\ell)$ (the history entry at lag $\ell\in[0,W]$):

$$\frac{\partial z(t)}{\partial\dot{q}(t-\ell)}=\lambda_z\,e^{-\ell/\tau_z}. \qquad (45)$$

This is a scalar multiplier depending only on $\ell$ and $\tau_z$, not on $\dot{q}$ itself. Under the linear-Stribeck model, the history gradient $\nabla_{\mathrm{hist}}z(t)\in\mathbb{R}^{W}$ is therefore a fixed exponential-decay vector $v(\tau_z):=\lambda_z\bigl(e^{-\ell_1/\tau_z},\ldots,e^{-\ell_W/\tau_z}\bigr)$, up to sampling noise. Hence

$$A_{\mathrm{res}}^{\mathrm{temp}}=v(\tau_z)\,v(\tau_z)^{\top}+\Sigma_{\epsilon}, \qquad (46)$$

where $\Sigma_{\epsilon}$ captures second-order corrections (nonlinearity of the Stribeck term, measurement noise, non-stationarity of $\dot{q}$). The leading term is rank-one, satisfying the deflation hypothesis.
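The deflation structure (46) can be illustrated with a synthetic matrix: a rank-one exponential-decay term plus a small random PSD perturbation standing in for $\Sigma_{\epsilon}$. Constants are illustrative:

```python
import numpy as np

# Synthetic illustration of the deflation structure (46): a rank-one
# term v v^T plus a small random PSD residual standing in for Sigma_eps.
# W, dt, tau_z, lam are illustrative values, not taken from the paper.
rng = np.random.default_rng(1)
W, dt, tau_z, lam = 20, 0.01, 0.1, 1.0
lags = np.arange(1, W + 1) * dt
v = lam * np.exp(-lags / tau_z)               # exponential-decay gradient

B = rng.standard_normal((W, W))
sigma_eps = 1e-3 * (B @ B.T)                  # small PSD second-order term
A = np.outer(v, v) + sigma_eps

eigvals, eigvecs = np.linalg.eigh(A)          # ascending eigenvalues
top_val, top_vec = eigvals[-1], eigvecs[:, -1]
assert abs(top_val - v @ v) / (v @ v) < 0.1   # lambda_1 ~ ||v||^2
assert abs(top_vec @ (v / np.linalg.norm(v))) > 0.99   # aligned with v
```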

##### Incoherence of the deflated directions.

After deflating the rank-one principal direction $v(\tau_z)$, the residual $\Sigma_{\epsilon}$ must have eigenvectors incoherent with the canonical basis (i.e., the temporal-lag basis) to satisfy the compressed-sensing hypothesis. This follows from two facts: (i) the nonlinear corrections to ([45](https://arxiv.org/html/2605.06877#A1.E45)) (the Stribeck envelope $F_s$ depends nonlinearly on $\dot{q}$) produce gradient entries that are smoothly distributed across lags, with no concentration on any single lag; (ii) stationarity of $\dot{q}$ and the smoothness of its autocorrelation imply that the cross-lag covariances of $\nabla_{\mathrm{hist}}z$ are bounded away from alignment with any single lag direction. Together these facts give incoherence with constant $\mu_c<1/\sqrt{W}$ in the restricted-isometry sense, which is sufficient for the compressed-sensing bound of [[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 7].

The three properties imply that [[8](https://arxiv.org/html/2605.06877#bib.bib18), Thm. 7] applies to $A_{\mathrm{res}}^{\mathrm{temp}}$, yielding ([26](https://arxiv.org/html/2605.06877#S5.E26)). ∎

### A.4 Non-monotonicity of $K^{\star}(\tau_z)$ (Corollary [5.16](https://arxiv.org/html/2605.06877#S5.Thmtheorem16))

We show that the effective rank $r_{\mathrm{eff}}(A_{\mathrm{res}}^{\mathrm{temp}}(\tau_z))$ is a non-monotonic function of $\tau_z$, with a peak in the window-to-horizon matched regime. The analysis proceeds through the explicit form ([46](https://arxiv.org/html/2605.06877#A1.E46)).

##### Effective-rank formula.

For a symmetric positive semi-definite matrix $M$ with eigenvalues $\lambda_1\geq\cdots\geq\lambda_W\geq 0$, the effective rank (stable rank) is

$$r_{\mathrm{eff}}(M)=\frac{\bigl(\sum_i\lambda_i\bigr)^2}{\sum_i\lambda_i^2}=\frac{\mathrm{tr}(M)^2}{\|M\|_F^2}. \qquad (47)$$
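A minimal implementation of (47), with two sanity checks (the identity matrix has full effective rank, a rank-one outer product has effective rank one):

```python
import numpy as np

# Minimal implementation of the effective (stable) rank (47):
# r_eff(M) = tr(M)^2 / ||M||_F^2 for a symmetric PSD matrix M.
def effective_rank(M):
    return np.trace(M) ** 2 / np.linalg.norm(M, "fro") ** 2

assert np.isclose(effective_rank(np.eye(5)), 5.0)                 # r_eff = W
assert np.isclose(effective_rank(np.outer(np.ones(5), np.ones(5))), 1.0)
```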

##### Spectrum of $A_{\mathrm{res}}^{\mathrm{temp}}$.

From ([46](https://arxiv.org/html/2605.06877#A1.E46)), $\mathrm{tr}(A_{\mathrm{res}}^{\mathrm{temp}})=\|v(\tau_z)\|^2+\mathrm{tr}(\Sigma_{\epsilon})$, and the leading eigenvalue is $\lambda_1\approx\|v(\tau_z)\|^2$ when $\Sigma_{\epsilon}$ is small. Writing the exponential-decay vector $v(\tau_z)$ explicitly with $\ell_k=k\Delta t$:

$$\|v(\tau_z)\|^2=\lambda_z^2\sum_{k=1}^{W}e^{-2k\Delta t/\tau_z}=\lambda_z^2\cdot\frac{e^{-2\Delta t/\tau_z}\bigl(1-e^{-2W\Delta t/\tau_z}\bigr)}{1-e^{-2\Delta t/\tau_z}}. \qquad (48)$$

This quantity interpolates between two regimes:

$$\tau_z\ll W\Delta t:\quad\|v(\tau_z)\|^2\sim\lambda_z^2\,\tau_z/(2\Delta t), \qquad (49)$$
$$\tau_z\gg W\Delta t:\quad\|v(\tau_z)\|^2\sim\lambda_z^2\,W. \qquad (50)$$
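The closed form (48) and the two asymptotic regimes (49)-(50) can be verified numerically; the constants below are illustrative, chosen so that both regimes are visible ($W\Delta t = 20$ s):

```python
import numpy as np

# Check of the geometric-sum identity (48) and the limits (49)-(50);
# lam, dt, W are illustrative values (W*dt = 20 s here).
lam, dt, W = 1.0, 0.01, 2000

def v_norm_sq(tau_z):
    k = np.arange(1, W + 1)
    return lam**2 * np.sum(np.exp(-2 * k * dt / tau_z))

def closed_form(tau_z):
    r = np.exp(-2 * dt / tau_z)
    return lam**2 * r * (1 - r**W) / (1 - r)

assert np.isclose(v_norm_sq(0.05), closed_form(0.05))                  # (48)
assert np.isclose(v_norm_sq(0.5), lam**2 * 0.5 / (2 * dt), rtol=5e-2)  # (49)
assert np.isclose(v_norm_sq(1e4), lam**2 * W, rtol=1e-2)               # (50)
```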

##### Qualitative non-monotonicity.

Two effects compete as $\tau_z$ grows:

1. (i) More of the memory horizon fits inside the window, enriching the temporal structure captured by $A_{\mathrm{res}}^{\mathrm{temp}}$. This increases the *number* of significant eigenvalues.
2. (ii) The leading eigenvalue $\lambda_1\propto\|v(\tau_z)\|^2$ saturates at $\lambda_z^2 W$, so the spectrum becomes dominated by the rank-one leading direction. This pulls $r_{\mathrm{eff}}$ back towards $1$.

The first effect dominates when $\tau_z\lesssim W\Delta t$ and the second when $\tau_z\gtrsim W\Delta t$, placing the peak of $r_{\mathrm{eff}}$ in the neighbourhood of $\tau_z^{\star}\sim W\Delta t$. The exact location depends on the nonlinear Stribeck corrections in $\Sigma_{\epsilon}$, which shift $\tau_z^{\star}$ within the interval $[W\Delta t/10,\,W\Delta t]$. For the experimental setting of Section [7](https://arxiv.org/html/2605.06877#S7) ($W=20$, $\Delta t=0.01$, giving $W\Delta t=0.2$ s), the peak of the empirical $K^{\star}(\tau_z)$ falls within this interval scaled by the nonlinearity factor; the observed maximum at $\tau_z=2$ s is consistent with the Stribeck envelope's approximate saturation at roughly ten times the linear memory horizon. ∎
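The saturation effect (ii) can be reproduced with a toy spectrum. The sketch below models $\Sigma_{\epsilon}$ as a fixed isotropic residual $\epsilon I$ (an assumption; the enrichment effect (i) lives in the Stribeck-dependent structure of $\Sigma_{\epsilon}$ and is not modelled here), and shows $r_{\mathrm{eff}}$ collapsing towards 1 once $\tau_z \gg W\Delta t$:

```python
import numpy as np

# Toy demonstration of effect (ii): rank-one term v v^T of (46) plus a
# fixed isotropic residual eps*I standing in for Sigma_eps.  Once
# tau_z >> W*dt, ||v||^2 ~ lam^2 W dominates and r_eff collapses to ~1.
# (Effect (i), which sets the peak location, is not modelled here.)
W, dt, lam, eps = 20, 0.01, 1.0, 1e-3

def r_eff(tau_z):
    lags = np.arange(1, W + 1) * dt
    v = lam * np.exp(-lags / tau_z)
    A = np.outer(v, v) + eps * np.eye(W)
    return np.trace(A) ** 2 / np.linalg.norm(A, "fro") ** 2

assert r_eff(1e-3) > 15.0     # rank-one term negligible: r_eff ~ W
assert r_eff(10.0) < 1.5      # rank-one term dominant: r_eff ~ 1
```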

## Appendix B Experimental details

This appendix documents the protocol items not fully specified in Section [7](https://arxiv.org/html/2605.06877#S7).

### B.1 Simulator and task

The 2-DOF manipulator dynamics follow the Euler–Lagrange model of Section [3](https://arxiv.org/html/2605.06877#S3) with the Stribeck friction parametrisation of equations ([3](https://arxiv.org/html/2605.06877#S3.E3))–([4](https://arxiv.org/html/2605.06877#S3.E4)). Integration is performed with an explicit Runge–Kutta step at $\Delta t=10$ ms; the control period coincides with the integration step. Each episode has a fixed horizon $T=5$ s, corresponding to $H=500$ control steps, after which the episode terminates. The payload $p\in[0,1.5]$ kg is sampled uniformly at reset and modifies the effective mass matrix $M(q,p)$ multiplicatively.
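The timing and reset protocol can be sketched as follows. This is illustrative scaffolding, not the released simulator: the state layout, the reset distribution details, and the choice of RK4 as the explicit Runge–Kutta variant are assumptions consistent with the text.

```python
import numpy as np

# Schematic episode scaffolding (illustrative, not the released code):
# dt = 10 ms, horizon 5 s -> H = 500 control steps, payload drawn
# uniformly from [0, 1.5] kg at reset.
dt, T_horizon = 0.01, 5.0
H = int(round(T_horizon / dt))               # 500 control steps

rng = np.random.default_rng(0)

def reset():
    payload = rng.uniform(0.0, 1.5)          # kg, uniform at reset
    state = np.zeros(4)                      # assumed (q1, q2, q1_dot, q2_dot)
    return state, payload

def rk4_step(f, x, u, dt):
    """One explicit RK4 step; the control u is held over the step."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

assert H == 500
```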

### B.2 Baseline controller

The fixed-gain reference is a computed-torque law with $K_d=30$ (diagonal, both joints) and $\Lambda=5$ (sliding-surface coefficient), without feed-forward friction compensation. The baseline RMSE at each memory horizon, averaged over the five payload levels and computed on a fresh evaluation set of 20 rollouts per level, is

$$\text{RMSE}_{\text{base}}(\tau_z=1)=0.1317,\quad\text{RMSE}_{\text{base}}(\tau_z=2)=0.1317,\quad\text{RMSE}_{\text{base}}(\tau_z=5)=0.1311.$$

These values are the denominators of the $\Delta\%$ metric reported throughout Section [7](https://arxiv.org/html/2605.06877#S7).
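For concreteness, a sketch of the $\Delta\%$ metric as we read it from the text (a signed relative RMSE change against the baseline at the same memory horizon, negative meaning improvement); the policy RMSE used in the example is hypothetical:

```python
# Delta% as read from B.2 (an assumption about the exact definition):
# signed relative RMSE change against the fixed-gain baseline at the
# same memory horizon; negative values indicate improvement.
def delta_pct(rmse_policy, rmse_base):
    return 100.0 * (rmse_policy - rmse_base) / rmse_base

d = delta_pct(0.105, 0.1317)       # hypothetical policy vs. tau_z = 1 baseline
assert d < 0                       # improvement over the baseline
```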

### B.3 Meta-controller training

Training uses SAC [[12](https://arxiv.org/html/2605.06877#bib.bib10)] as implemented in `stable-baselines3` v2.x, with the shielded Lagrangian formulation of Section [4.4](https://arxiv.org/html/2605.06877#S4.SS4). Each training run is allocated 50,000 environment steps. Non-architectural hyperparameters are held constant across all architectures and regimes: learning rate $3\cdot 10^{-4}$, replay-buffer size $10^{5}$, batch size 256, entropy coefficient learned with the standard auto-tuning schedule, target smoothing $\tau=0.005$, discount $\gamma=0.99$. The Lagrangian multiplier $\beta$ is updated at every gradient step with step size $10^{-3}$.
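The non-architectural hyperparameters above, collected as they would be passed to the `stable-baselines3` `SAC` constructor (keyword names follow the SB3 API; the environment, policy, and shielded-Lagrangian wrapper are omitted here):

```python
# B.3 hyperparameters as SAC keyword arguments (names follow the
# stable-baselines3 API; env/policy wiring and the shielded Lagrangian
# wrapper are omitted).
sac_kwargs = dict(
    learning_rate=3e-4,       # 3 * 10^-4
    buffer_size=100_000,      # replay buffer 10^5
    batch_size=256,
    ent_coef="auto",          # learned entropy coefficient (auto-tuning)
    tau=0.005,                # target smoothing
    gamma=0.99,               # discount
)
total_steps = 50_000          # environment steps per training run
# model = SAC("MlpPolicy", env, **sac_kwargs); model.learn(total_steps)
```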

### B.4 Evaluation

Evaluation at convergence is performed on a grid of five payload levels $p\in\{0,\,0.375,\,0.75,\,1.125,\,1.5\}$ kg. For each payload, 20 rollouts are generated with fresh initial conditions sampled from the reset distribution, and the per-rollout RMSE is averaged. The per-payload standard deviations shown in Figure [3](https://arxiv.org/html/2605.06877#S7.F3) are the between-rollout standard deviations within a seed. The between-seed standard deviations reported in Table [4](https://arxiv.org/html/2605.06877#S7.T4) are computed over the per-seed means of $\Delta\%$.
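The two aggregation levels can be sketched as follows; the RMSE array below is a random placeholder with the shapes implied by the protocol (5 seeds × 5 payloads × 20 rollouts), and the $\Delta\%$ computation assumes the relative-change reading of the metric:

```python
import numpy as np

# Aggregation sketch for B.4: between-rollout std within a seed
# (Figure-3-style error bars) vs. between-seed std of per-seed Delta%
# means (Table-4-style spread).  The RMSE array is a random placeholder.
rng = np.random.default_rng(0)
n_seeds, n_payloads, n_rollouts = 5, 5, 20
rmse = rng.normal(0.11, 0.01, size=(n_seeds, n_payloads, n_rollouts))

between_rollout_std = rmse.std(axis=2, ddof=1)      # shape (seeds, payloads)

rmse_base = 0.1317                                  # baseline denominator
per_seed_delta = 100.0 * (rmse.mean(axis=(1, 2)) - rmse_base) / rmse_base
between_seed_std = per_seed_delta.std(ddof=1)       # scalar, over seeds

assert between_rollout_std.shape == (n_seeds, n_payloads)
```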

### B.5 Seeds and training budget

The main comparison uses seeds $\{42,\ldots,46\}$ at all three regimes and additionally $\{47,\ldots,51\}$ at $\tau_z\in\{1,5\}$ s, for a total of fifty training runs. No seed was excluded from any reported statistic. Total wall-clock training time across the fifty runs was approximately 21 GPU-hours, distributed between a Colab A100 and a Kaggle T4/A100-class instance, with per-run duration ranging from 20 to 32 minutes depending on architecture and window size.

### B.6 Reproducibility

Per-seed result files are released as supplementary material. Each file contains, for the corresponding run, the architecture name, the parameter count, the per-payload RMSE with its between-rollout standard deviation, the baseline RMSE, and the final value of $\Delta\%$. The filename encodes the architecture, the value of $\tau_z$, and the seed. The aggregation script that produces Table [4](https://arxiv.org/html/2605.06877#S7.T4) and Figures [3](https://arxiv.org/html/2605.06877#S7.F3)–[5](https://arxiv.org/html/2605.06877#S7.F5) from these files is released alongside.
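Since the exact filename scheme is not specified, a hypothetical parser illustrating the encoding (architecture, $\tau_z$, seed); the pattern and file extension are assumptions:

```python
import re

# Hypothetical filename parser for the per-seed result files: B.6 says
# the name encodes architecture, tau_z, and seed, but the exact scheme
# is not given, so this regex and the .json extension are assumptions.
PATTERN = re.compile(
    r"(?P<arch>[A-Za-z0-9\-]+)_tau(?P<tau_z>[\d.]+)_seed(?P<seed>\d+)\.json"
)

def parse_result_filename(name):
    m = PATTERN.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognised result filename: {name}")
    return m["arch"], float(m["tau_z"]), int(m["seed"])

arch, tau_z, seed = parse_result_filename("attn-meta_tau2.0_seed42.json")
assert (arch, tau_z, seed) == ("attn-meta", 2.0, 42)
```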

## References

- [1] J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017) Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), Vol. 70, pp. 22–31.
- [2] E. Altman (1999) Constrained Markov Decision Processes. Chapman & Hall/CRC, Boca Raton, FL.
- [3] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada (2019) Control barrier functions: theory and applications. In 18th European Control Conference (ECC), pp. 3420–3431. [DOI](https://dx.doi.org/10.23919/ECC.2019.8796030)
- [4] B. Armstrong-Hélouvry, P. Dupont, and C. Canudas de Wit (1994) A survey of models, analysis tools and compensation methods for the control of machines with friction. Automatica 30(7), pp. 1083–1138. [DOI](https://dx.doi.org/10.1016/0005-1098%2894%2990209-7)
- [5] C. Canudas de Wit, H. Olsson, K. J. Åström, and P. Lischinsky (1995) A new model for control of systems with friction. IEEE Transactions on Automatic Control 40(3), pp. 419–425. [DOI](https://dx.doi.org/10.1109/9.376053)
- [6] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh (2018) A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 31, pp. 8103–8112.
- [7] G. Cirrincione and A. Fagiolini (2026) Learned Lyapunov certificates for safe residual reinforcement learning on Euler–Lagrange systems. Manuscript submitted to the International Journal of Robust and Nonlinear Control.
- [8] G. Cirrincione (2026) INCRT: an incremental transformer that determines its own architecture. arXiv preprint [arXiv:2604.10703](https://arxiv.org/abs/2604.10703).
- [9] G. Cirrincione (2026) Rank, channel destruction, and symmetry breaking in transformer architectures. Manuscript submitted to IEEE Transactions on Neural Networks and Learning Systems.
- [10] J. Cohen (1988) Statistical Power Analysis for the Behavioral Sciences. 2nd edition, Lawrence Erlbaum Associates, Hillsdale, NJ.
- [11] T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: a survey. Journal of Machine Learning Research 20(55), pp. 1–21.
- [12] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1861–1870.
- [13] H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In 7th International Conference on Learning Representations (ICLR).
- [14] H. B. Mann and D. R. Whitney (1947) On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18(1), pp. 50–60. [DOI](https://dx.doi.org/10.1214/aoms/1177730491)
- [15] Y. Miao, X. Song, J. D. Co-Reyes, D. Peng, S. Yue, E. Brevdo, and A. Faust (2022) Differentiable architecture search for reinforcement learning. In Proceedings of the First International Conference on Automated Machine Learning (AutoML), PMLR Vol. 188, pp. 20/1–20/17.
- [16] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 4095–4104.
- [17] J. E. Slotine and W. Li (1987) On the adaptive control of robot manipulators. International Journal of Robotics Research 6(3), pp. 49–59. [DOI](https://dx.doi.org/10.1177/027836498700600303)
- [18] M. W. Spong, S. Hutchinson, and M. Vidyasagar (2020) Robot Modeling and Control. 2nd edition, Wiley, Hoboken, NJ.
- [19] A. Stooke, J. Achiam, and P. Abbeel (2020) Responsive safety in reinforcement learning by PID Lagrangian methods. In Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 9133–9143.
- [20] B. L. Welch (1947) The generalization of "Student's" problem when several different population variances are involved. Biometrika 34(1–2), pp. 28–35. [DOI](https://dx.doi.org/10.1093/biomet/34.1-2.28)
