Parallel-in-Time Training of Recurrent Neural Networks for Dynamical Systems Reconstruction

arXiv cs.LG Papers

Summary

This paper investigates parallel-in-time algorithms for training recurrent neural networks in dynamical systems reconstruction, proposing GTF-DEER that enables stable learning over long sequences and improves reconstruction accuracy.

arXiv:2605.12683v1 Announce Type: new Abstract: Reconstructing nonlinear dynamical systems (DS) from data (DSR) is a fundamental challenge in science and engineering, but it inherently relies on sequential models. Recent breakthroughs for sequential models have produced algorithms that parallelize computation along sequence length $T$, achieving logarithmic time complexity, $\mathcal{O}(\log T)$. Since sequence lengths have been practically limited due to the linear runtime complexity $\mathcal{O}(T)$ of classical backpropagation through time, this opens new avenues for DSR. This paper studies two prominent classes of parallel-in-time algorithms for this task, both of which leverage parallel associative scans as their core computational primitive. The first class comprises models with linear yet non-autonomous dynamics and a nonlinear readout, such as modern State Space Models (SSMs), while the second consists of general nonlinear models which can be parallelized using the DEER framework. We find that the linear training-time recurrence of the first class of models imposes limitations that often hinder learning of accurate nonlinear dynamics. To address this, we augment DEER with Generalized Teacher Forcing (GTF), a novel variant within the more general nonlinear framework that ensures stable and effective learning of nonlinear dynamics across arbitrary sequence lengths. Using GTF-DEER, we investigate the benefits of training on extremely long sequences ($T>10^4$) for DSR. Our results show that access to such long trajectories significantly improves DSR if the data features long time scales. This work establishes GTF-DEER as a robust tool for data-driven discovery and underscores the largely untapped potential of long-sequence learning in modeling complex DS.

# Parallel-in-Time Training of Recurrent Neural Networks for Dynamical Systems Reconstruction
Source: [https://arxiv.org/html/2605.12683](https://arxiv.org/html/2605.12683)
Florian Hess¹,², Florian Götz¹,³, Daniel Durstewitz¹,²,⁴

¹Dept. of Theoretical Neuroscience, Central Institute of Mental Health, Mannheim, Germany · ²Faculty of Physics and Astronomy, Heidelberg University, Germany · ³Faculty of Mathematics and Computer Science, Heidelberg University, Germany · ⁴Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Germany

###### Abstract

Reconstructing nonlinear dynamical systems (DS) from data (DSR) is a fundamental challenge in science and engineering, but it inherently relies on *sequential* models. Recent breakthroughs for sequential models have produced algorithms that parallelize computation along sequence length $T$, achieving logarithmic time complexity, $\mathcal{O}(\log T)$. Since sequence lengths have been practically limited due to the linear runtime complexity $\mathcal{O}(T)$ of classical backpropagation through time, this opens new avenues for DSR. This paper studies two prominent classes of parallel-in-time algorithms for this task, both of which leverage *parallel associative scans* as their core computational primitive. The first class comprises models with linear yet non-autonomous dynamics and a nonlinear readout, such as modern State Space Models (SSMs), while the second consists of general nonlinear models which can be parallelized using the DEER framework. We find that the linear training-time recurrence of the first class of models imposes limitations that often hinder learning of accurate nonlinear dynamics. To address this, we augment DEER with Generalized Teacher Forcing (GTF), a novel variant within the more general nonlinear framework that ensures stable and effective learning of nonlinear dynamics across arbitrary sequence lengths. Using GTF-DEER, we investigate the benefits of training on extremely long sequences ($T>10^4$) for DSR. Our results show that access to such long trajectories significantly improves DSR if the data features long time scales. This work establishes GTF-DEER as a robust tool for data-driven discovery and underscores the largely untapped potential of long-sequence learning in modeling complex DS.

## 1 Introduction

Understanding and predicting the behavior of complex nonlinear systems, from neural circuits and climate dynamics to fluid flows and ecological networks, is a central aim across the natural and engineering sciences (BhallaIyengar1999Emergent; Kalnay2003AMDAP; tziperman-97; izhikevich_dynamical_2007). A particularly ambitious goal is to *reconstruct* the underlying dynamical system (DS) directly from observed time series, a problem known as dynamical systems reconstruction (DSR). Beyond short-term forecasting, DSR requires that the learned model faithfully reproduces the long-term statistical and geometric properties of the true system, such as attractor geometry, power spectra, and Lyapunov exponents.

Central to the definition of a DS is the flow operator, which provides a recursive rule for how the DS evolves in time (strogatz2024nonlinear; katok_introduction_1995). DSR methods that approximate it are inherently recursive (goring_domain_2024; mikhaeil_difficulty_2022), such as RNNs, which are commonly trained by backpropagation through time (BPTT; werbos1990backpropagation). While several well-documented pathologies of BPTT, most notably exploding gradients under chaotic dynamics (mikhaeil_difficulty_2022), can be successfully mitigated by control-theoretic training algorithms such as sparse and generalized teacher forcing (STF/GTF; mikhaeil_difficulty_2022; hess_generalized_2023), the computational cost of BPTT remains fundamentally linear in the sequence length $T$, rendering training on problems with long intrinsic timescales prohibitively expensive. Indeed, DSR applications have historically been confined to modest sequence lengths to keep training tractable (brenner_tractable_2022; hess_generalized_2023; vlachas2024learning), leaving open the question of whether learning from substantially longer sequences offers any benefit for reconstruction quality.

Recent advances in parallel-in-time sequence modeling offer a path forward, and we examine two common paradigms with respect to their performance in DSR. Linear training-time recurrences with a nonlinear readout, as instantiated by modern SSMs, admit trivial parallelization via linear scans and avoid chaos-induced exploding gradients by construction (mikhaeil_difficulty_2022; orvieto2023resurrecting; zucchet2024recurrent). By exposing a *duality* between these linear SSMs and nonlinear RNNs, we show, however, that this convenience comes at a cost: the common diagonal parameterization of the linear recurrence used during training imposes structural limitations that often prevent the model from learning accurate nonlinear dynamics. Moreover, its training suffers from exposure bias, degrading autoregressive roll-outs at test time (bengio2015scheduled; ranzato2016sequence). The second paradigm, general nonlinear RNNs parallelized via DEER (lim2024parallelizing), is in principle better suited to DSR, but naive application fails on chaotic data because the Jacobian products driving DEER's Newton updates diverge whenever the underlying dynamics exhibit positive Lyapunov exponents (mikhaeil_difficulty_2022; gonzalez2025predictability).

Our main methodological contribution is to resolve this tension by combining DEER with Generalized Teacher Forcing (GTF) (hess_generalized_2023). The resulting algorithm, GTF-DEER, inherits the long-sequence scalability of DEER while retaining GTF's chaos-taming properties: GTF turns the model into a stable DS during training, such that GTF-DEER enjoys the average-case $\mathcal{O}((\log T)^2)$ scaling established in gonzalez2025predictability, even when the underlying system is chaotic. Empirically, GTF-DEER delivers speedups of up to $870\times$ over sequential training while matching or improving reconstruction quality. Equipped with GTF-DEER, we then revisit the central empirical question: *does training on much longer sequences actually improve DSR?* Leveraging the ability to train stably on trajectories of length $T>10^4$, we find that access to long trajectories allows the DSR model to efficiently capture long time scales if present in the data, yielding substantial gains in long-term statistics which an equivalent linear SSM-based model cannot match. Taken together, our results establish GTF-DEER as a direct replacement for sequential training while highlighting long-sequence training as a largely untapped lever for DSR.

## 2 Related work

##### Dynamical systems reconstruction

Data-driven methods for DSR fall into two broad categories. The first approximates the *vector field* that underlies the observed dynamics, assuming the process is governed by differential equations. Sparse Identification of Nonlinear Dynamics (SINDy) and its variants (brunton_discovering_2016; brunton_data-driven_2019; champion_data-driven_2019; messenger_weak_2021; cortiella_sparse_2021) are particularly popular in the physical sciences and enjoy quick training through least-squares regression, but rely on pre-defined function libraries and struggle with noisy, non-stationary and partially observed empirical data. Neural ODE/PDE methods (chen_neural_2018; karlsson_modelling_2019; alvarez_dynode_2020; ko_homotopy-based_2023; aka2025balanced) offer universal approximation capabilities and can be augmented with physical priors (raissi_physics-informed_2019; li_physics-informed_2022), but are difficult to train in practice. The second and larger class of models directly approximates the *flow operator* through black-box universal approximators based on neural networks, including neural operators (li2020fourier; lu_learning_2021), Koopman operators (lusch_deep_2018; otto_linearly-recurrent_2019; brunton_modern_2021; naiman_koopman_2021; azencot_forecasting_2020; wang_koopman_2022) and hybrids thereof, reservoir computers (pathak_using_2017; pathak_model-free_2018; verzelli2021learn; platt2023constraining; patel_using_2023; gauthier_next_2021), and RNNs trained through BPTT (vlachas_data-driven_2018; vlachas2024learning; brenner_tractable_2022; hess_generalized_2023; rusch_long_2022; brenner_almost_2024), often accompanied by specialized control techniques that ease optimization and address exploding gradients under chaos (mikhaeil_difficulty_2022; brenner_tractable_2022; hess_generalized_2023; sagtekin2025error). The latter approach in particular achieves SOTA performance on a wide range of benchmark systems and performs well even on challenging empirical systems (hess_generalized_2023; volkmann2024scalable; brenner_integrating_2024), which has led to its adoption as a backbone in foundation models for DSR (hemmer2025true).

##### Efficient sequence modeling

A central challenge in sequence modeling is the stable capture and retrieval of long-term dependencies (bengio_learning_1994; hochreiter_lstm_97; gu_efficiently_2022; gu2024mamba; orvieto2023resurrecting; zucchet2024recurrent). One practical bottleneck when using autoregressive models such as SSMs or RNNs is the $\mathcal{O}(T)$ runtime of BPTT (werbos1990backpropagation). Recent developments in the field of sequence models address this issue by parallelizing the inherently sequential operation on parallel accelerators, such as GPUs or TPUs, using *parallel associative scans* (blelloch1990prefix; martin2018parallelizing; smith_simplified_2023). Parallel scans evaluate *linear* recurrences, often used as the core block of modern SSMs (orvieto2023resurrecting; smith_simplified_2023; gu2024mamba), in $\mathcal{O}(\log T)$ time. Furthermore, lim2024parallelizing introduced the DEER framework, which parallelizes general *nonlinear* sequence models by reformulating the forward pass as a fixed-point iteration problem, where each iteration can be solved using a parallel scan. While the worst-case runtime complexity is $\mathcal{O}(T\log T)$ (gonzalez2024towards), the average case observed in practice for models exhibiting contracting dynamics scales as $\mathcal{O}((\log T)^2)$ (gonzalez2025predictability), sparking a renaissance for nonlinear RNNs in long-term sequence modeling. Although DEER demonstrates promising performance in various ML problems, its applicability to the field of DSR remains impractical due to guaranteed worst-case scaling when evaluating chaotic RNNs.

## 3 Theoretical background

### 3.1 Dynamical systems reconstruction with autoregressive models

Given observed time series data $\bm{X}\in\mathbb{R}^{T_{\mathrm{obs}}\times N}$ originating from some underlying physical process, DSR seeks to learn a generative model that is able to both perform accurate short-term predictions and reproduce the long-term behavior of the observed system. To tackle this task, we consider parameterized state space models of the form

$$\bm{z}_t = F_{\bm{\theta}}(\bm{z}_{t-1}, \bm{x}_{t-1}, \bm{s}_t), \qquad \hat{\bm{x}}_t = G_{\bm{\psi}}(\bm{z}_t), \tag{1}$$

where $\bm{\theta}$ and $\bm{\psi}$ denote parameter vectors, $\bm{z}_t$ is an $M$-dimensional state vector, $\hat{\bm{x}}_t$ are $N$-dimensional predicted observations, $\bm{x}_{t-1}$ is an optional teacher signal used during training, and $\bm{s}_t$ are $K$-dimensional optional external inputs. $F_{\bm{\theta}}$ is a discrete-time universal DS modeling a latent process which is coupled to the observations (data) through $G_{\bm{\psi}}$. The aim of training is to learn $\{\bm{\theta},\bm{\psi}\}$ such that the SSM approximates the flow operator of the underlying DS, and after training we have

$$\bm{x}_t \approx G_{\bm{\psi}}\bigl(F_{\bm{\theta}}(\dots F_{\bm{\theta}}(F_{\bm{\theta}}(\bm{z}_0,\bm{s}_1),\bm{s}_2)\dots,\bm{s}_t)\bigr) \eqqcolon G_{\bm{\psi}}\bigl(F_{\bm{\theta}}^{\circ t}(\bm{z}_0,\bm{s}_{1:t})\bigr). \tag{2}$$

In this work, we will investigate two general parameterizations of Eq. ([1](https://arxiv.org/html/2605.12683#S3.E1)).
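For concreteness, a minimal sketch of the sequential roll-out in Eq. (2) is given below in JAX-style Python; `step_fn` and `obs_fn` stand in for $F_{\bm{\theta}}$ and $G_{\bm{\psi}}$, and all names are illustrative assumptions rather than the paper's implementation.

```python
import jax
import jax.numpy as jnp

def sequential_rollout(step_fn, obs_fn, z0, s):
    """Sequential evaluation of Eq. (2): apply F_theta t times from z_0 and map
    each latent state through G_psi. This is the O(T) baseline that the
    parallel-in-time algorithms discussed below replace."""
    def body(z, s_t):
        z_next = step_fn(z, s_t)        # z_t = F_theta(z_{t-1}, s_t)
        return z_next, obs_fn(z_next)   # x_hat_t = G_psi(z_t)

    _, x_hat = jax.lax.scan(body, z0, s)   # s: (T, K) external inputs s_{1:T}
    return x_hat                           # (T, N) predictions x_hat_{1:T}
```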

#### 3.1.1 Linear training-time recurrences

In one setting, $F_{\bm{\theta}}$ is strictly *linear*, which shifts the burden of capturing nonlinearities in the data to a nonlinear observation function $G_{\bm{\psi}}$ which feeds back into the system at test time,

$$\bm{z}_t = \bm{A}\bm{z}_{t-1} + \bm{U}\bm{x}_{t-1} + \bm{C}\bm{s}_t + \bm{h}, \qquad \hat{\bm{x}}_t = \bm{B}\,\phi(\bm{V}\bm{z}_t + \bm{b}), \tag{3}$$

where $\bm{A}\in\mathbb{R}^{M\times M}$ (often chosen to be diagonal), $\bm{U}\in\mathbb{R}^{M\times N}$, $\bm{C}\in\mathbb{R}^{M\times K}$, $\bm{h}\in\mathbb{R}^{M}$, $\bm{B}\in\mathbb{R}^{N\times L}$, $\bm{V}\in\mathbb{R}^{L\times M}$, $\bm{b}\in\mathbb{R}^{L}$, and $\phi$ is a nonlinear function such as the $\mathrm{ReLU}(\bm{z})=\max(0,\bm{z})$. This is essentially the architectural setup of modern SSMs, where linear recurrences are followed by nonlinear, point-wise sequence transformations (orvieto2023resurrecting; orvieto2024universality; smith_simplified_2023; gu2024mamba), often implemented by MLPs. In our case, the MLP is a simple one-hidden-layer neural network with hidden layer size $L$. In the following, we will refer to models defined by Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3)) as 'LSSM'. A crucial insight is that while the recurrence in Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3)) constitutes a linear, *non-autonomous* DS during training, the model can produce nonlinear dynamics during evaluation by replacing the teacher signals $\bm{x}_{t-1}$ with predictions $\hat{\bm{x}}_{t-1}$. Indeed, during trajectory generation *after* training, the recursion of Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3)) turns into

$$\bm{z}_t = \bm{A}\bm{z}_{t-1} + \overline{\bm{W}}\,\phi(\bm{V}\bm{z}_{t-1} + \bm{b}) + \bm{C}\bm{s}_t + \bm{h}, \tag{4}$$

where $\bm{U}\bm{B} =: \overline{\bm{W}}\in\mathbb{R}^{M\times L}$ with $\operatorname{rank}(\overline{\bm{W}})\leq\min(N,M,L)$. This allows SSMs to exhibit inherently nonlinear traits such as chaotic dynamics and multistability during autoregressive generation. While the major feature of Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3)) is the fact that the forward pass can be calculated in logarithmic time using a parallel scan, another insight is that linear recurrences do not suffer from chaos-induced exploding gradients (mikhaeil_difficulty_2022), and hence remedy associated training instabilities by design (mikhaeil_difficulty_2022; orvieto2023resurrecting; zucchet2024recurrent); see Appx. [A](https://arxiv.org/html/2605.12683#A1) for details. Training of LSSMs makes use of a variant of teacher forcing originating from the sequence modeling field, where ground-truth data from the previous time step $\bm{x}_{t-1}$ is fed through a dedicated input layer ($\bm{U}$) (bengio2015scheduled; Goodfellow-et-al-2016); a generic training algorithm is provided in Alg. [1](https://arxiv.org/html/2605.12683#alg1). This type of training suffers from *exposure bias*, where the absence of forcing signals during evaluation (Eq. ([4](https://arxiv.org/html/2605.12683#S3.E4))) significantly degrades autoregressive roll-outs (ranzato2016sequence). While methods such as scheduled sampling address this issue (bengio2015scheduled; vlachas2024learning), they destroy linearity and therefore efficient parallelization of the recurrence in Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3)) during training; see Appx. [C](https://arxiv.org/html/2605.12683#A3) for details.
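Because the training-time recurrence of Eq. (3) is affine in the latent state, its forward pass can be written as a composition of per-step affine maps and evaluated with a parallel associative scan. The sketch below illustrates this for the teacher-forced recurrence, assuming a diagonal $\bm{A}$ and illustrative function names; it is not the authors' implementation.

```python
import jax
import jax.numpy as jnp

def lssm_teacher_forced_states(A_diag, U, h, x, z0):
    """Teacher-forced linear recurrence z_t = A z_{t-1} + U x_{t-1} + h of Eq. (3),
    evaluated in O(log T) depth with an associative scan over affine maps.
    A_diag: (M,) diagonal of A; U: (M, N); h: (M,); x: (T, N) teacher signals
    x_0 .. x_{T-1}; z0: (M,). Sketch only (no external inputs s_t)."""
    T, M = x.shape[0], A_diag.shape[0]
    a = jnp.broadcast_to(A_diag, (T, M))       # per-step multiplicative part
    b = x @ U.T + h                            # per-step additive part U x_{t-1} + h
    b = b.at[0].add(A_diag * z0)               # fold the given initial state into step 1

    def compose(left, right):
        # (a2, b2) o (a1, b1) = (a2*a1, a2*b1 + b2), elementwise since A is diagonal
        a1, b1 = left
        a2, b2 = right
        return a2 * a1, a2 * b1 + b2

    _, z = jax.lax.associative_scan(compose, (a, b))
    return z                                   # (T, M) latent states z_1 .. z_T
```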

#### 3.1.2 Non-linear training-time recurrences

To compare training algorithms, we will consider a second parameterization of Eq. ([1](https://arxiv.org/html/2605.12683#S3.E1)), which uses a nonlinear RNN as the latent model, coupled to the observations through a simple linear mapping:

$$\bm{z}_t = \bm{A}\bm{z}_{t-1} + \bm{W}\phi(\bm{V}\bm{z}_{t-1} + \bm{b}) + \bm{C}\bm{s}_t + \bm{h}, \qquad \hat{\bm{x}}_t = \bm{B}\bm{z}_t. \tag{5}$$

With $\phi(\cdot)=\mathrm{ReLU}(\cdot)$, the latent model is known as a shallow piecewise-linear RNN (shPLRNN; hess_generalized_2023), an established RNN architecture designed for DSR. Indeed, up to the constraint of a low-rank connectivity matrix, which can be manually enforced in the latter parameterization, the recurrences of Eqs. ([4](https://arxiv.org/html/2605.12683#S3.E4)) and ([5](https://arxiv.org/html/2605.12683#S3.E5)) are *mathematically equivalent*. This is important as it allows us to *directly compare training algorithms and parameterizations, unconfounded* by any architectural differences at test time, studying optimized models with parameters $\bm{\theta}_{\mathrm{RNN}}\cup\bm{\psi}_{\mathrm{RNN}}=\{\bm{A},\bm{W},\bm{V},\bm{b},\bm{C},\bm{h},\bm{B}\}$ and $\bm{\theta}_{\mathrm{LSSM}}\cup\bm{\psi}_{\mathrm{LSSM}}=\{\bm{A},\overline{\bm{W}},\bm{V},\bm{b},\bm{C},\bm{h},\bm{B}\}$. All results established below hold in the presence of external inputs $\bm{s}$, but we will drop them from notation for brevity.
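For reference, a single shPLRNN transition of Eq. (5) can be written compactly as below; the parameter-dictionary layout is an assumption made purely for illustration.

```python
import jax.numpy as jnp

def shplrnn_step(params, z, s=None):
    """One step of the shPLRNN latent model in Eq. (5):
    z_t = A z_{t-1} + W relu(V z_{t-1} + b) + C s_t + h."""
    z_next = (params["A"] @ z
              + params["W"] @ jnp.maximum(params["V"] @ z + params["b"], 0.0)
              + params["h"])
    if s is not None:
        z_next = z_next + params["C"] @ s
    return z_next

def observe(params, z):
    """Linear observation model of Eq. (5): x_hat_t = B z_t."""
    return params["B"] @ z
```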

Due to the nonlinearity of the latent model in Eq. ([5](https://arxiv.org/html/2605.12683#S3.E5)), the forward pass is evaluated sequentially and cannot be naively parallelized. Hence, a common method to train for DSR is to generate trajectories of length $T$ and then compare them to the data $\bm{x}_{1:T}$. To avoid training instabilities caused by, for example, exploding gradients, training is accompanied by control-theoretic forcing methods such as Generalized Teacher Forcing (GTF; doya_bifurcations_1992; hess_generalized_2023). GTF alters the forward pass of the model during training by linearly interpolating between the latent state and a teacher signal before application of the sequence model:

$$\bm{z}_t = F_{\bm{\theta}}\circ\delta_{\alpha}(\bm{z}_{t-1},\overline{\bm{z}}_{t-1}) = F_{\bm{\theta}}\bigl((1-\alpha)\bm{z}_{t-1} + \alpha\overline{\bm{z}}_{t-1}\bigr) = F_{\bm{\theta}}(\tilde{\bm{z}}_{t-1}), \tag{6}$$

where $0\leq\alpha\leq 1$ is the forcing strength, $\overline{\bm{z}}$ denotes the teacher signal, which is generally computed by inversion of the observation model or output layer, i.e. $\overline{\bm{z}}_t = G_{\bm{\psi}}^{-1}(\bm{x}_t)$, and $\tilde{\bm{z}}$ is the forced state. During training, this leads to a decomposition of the Jacobian as

$$\bm{J}_{F\circ\delta}(\bm{z}_{t-1}) = (1-\alpha)\,\bm{J}_F(\tilde{\bm{z}}_{t-1}), \tag{7}$$

where $\bm{J}_F(\bm{z})\coloneqq\frac{\partial F(\bm{z})}{\partial\bm{z}}$. The forcing parameter $\alpha\in[0,1]$ controls the norm of the Jacobians during BPTT and can be optimally and adaptively adjusted in training to mitigate exploding gradients (hess_generalized_2023).

For linear observation models (Eq. ([5](https://arxiv.org/html/2605.12683#S3.E5))) and the common case where the RNN has more units than there are observed dynamical variables ($M>N$), sagtekin2025error introduced a correction to GTF ([6](https://arxiv.org/html/2605.12683#S3.E6)) which includes the row-space projector of $\bm{B}$ in the forcing equation:

$$\delta_{\alpha,\bm{B}}(\bm{z}_t,\bm{x}_t) = (\bm{I}-\alpha\bm{B}^{+}\bm{B})\bm{z}_t + \alpha\bm{B}^{+}\bm{x}_t, \tag{8}$$

where $\bm{B}^{+}$ denotes the pseudo-inverse of $\bm{B}$, such that Eq. ([7](https://arxiv.org/html/2605.12683#S3.E7)) becomes

$$\bm{J}_{F\circ\delta_{\alpha,\bm{B}}}(\bm{z}_{t-1}) = \bm{J}_F(\tilde{\bm{z}}_{t-1})\,\bm{P}_{\alpha}, \tag{9}$$

where $\bm{P}_{\alpha} := \bm{I}-\alpha\bm{B}^{+}\bm{B}$. While this correction does not preserve the mitigation of exploding gradients in general, it dampens Jacobian singular values along directions corresponding to expansion in observation space. Indeed, we find empirically that Jacobian damping still holds in practice in this case, which we attribute to mixing of directions through the Jacobian product. Note that for $M\leq N$ and assuming full column rank, $\bm{B}^{+}\bm{B}=\bm{I}$ and vanilla GTF is recovered. For a detailed overview of the training protocol under GTF, see Alg. [2](https://arxiv.org/html/2605.12683#alg2).
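The following sketch illustrates the projector-based forcing of Eq. (8) and numerically checks the Jacobian decomposition of Eq. (9); the function names and the use of automatic differentiation are illustrative assumptions, not the paper's code.

```python
import jax
import jax.numpy as jnp

def gtf_force(z, x, B, alpha):
    """delta_{alpha,B} of Eq. (8): interpolate z toward the teacher B^+ x along the
    row space of B. For M <= N with full column rank, B^+ B = I and this reduces
    to plain GTF (Eq. 6) with teacher z_bar = B^+ x."""
    B_pinv = jnp.linalg.pinv(B)
    return z - alpha * (B_pinv @ (B @ z)) + alpha * (B_pinv @ x)

def check_jacobian_decomposition(step_fn, z, x, B, alpha):
    """Numerically verify Eq. (9): J_{F o delta_{alpha,B}}(z) = J_F(z_tilde) P_alpha."""
    M = z.shape[0]
    P_alpha = jnp.eye(M) - alpha * (jnp.linalg.pinv(B) @ B)
    J_forced = jax.jacfwd(lambda zz: step_fn(gtf_force(zz, x, B, alpha)))(z)
    J_decomp = jax.jacfwd(step_fn)(gtf_force(z, x, B, alpha)) @ P_alpha
    return jnp.allclose(J_forced, J_decomp, atol=1e-5)
```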

#### 3.1.3 Parallel-in-time algorithms for nonlinear sequence models

Training nonlinear flow operators on long sequences sequentially is prohibitive due to the linear $\mathcal{O}(T)$ scaling of BPTT in sequence length. To make training feasible, we will make use of the recently proposed DEER algorithm (lim2024parallelizing; gonzalez2024towards). Given an initial condition $\bm{z}_0$, let $\bm{z}_{1:T}$ be a series of candidate latent states and define the residual vector $\bm{r}(\bm{z}_{1:T}) := [\bm{z}_1 - F(\bm{z}_0),\dots,\bm{z}_T - F(\bm{z}_{T-1})]\in\mathbb{R}^{MT}$. The fixed point $\bm{z}_{1:T}^{\ast}=F(\bm{z}_{0:T-1}^{\ast})$ is the only solution with zero residual, i.e. $\bm{r}(\bm{z}_{1:T}^{\ast})=\bm{0}$. Conventionally, the forward pass is generated sequentially by iterating $F$ starting from $\bm{z}_0$. DEER turns the roll-out of the map $F$ into a root-finding problem for the residual, which is solved using Newton's method. Starting from an initial guess $\bm{z}_{1:T}^{(0)}$, where the superscript indicates the current Newton iteration, the true trace $\bm{z}_{1:T}^{\ast}$ is approximated by iteratively solving the update equation

$$\bm{z}_{1:T}^{(i+1)} = \bm{z}_{1:T}^{(i)} - \bigl(\bm{J}_{\bm{r}}(\bm{z}_{1:T})\bigr)^{-1}\bm{r}\bigl(\bm{z}_{1:T}^{(i)}\bigr), \tag{10}$$

where $\bm{J}_{\bm{r}} := \frac{\partial\bm{r}}{\partial\bm{z}_{1:T}}$. Inverting the Jacobian matrix explicitly is infeasible for large $MT$. Instead, it is more practical to multiply both sides of the equation by the Jacobian and exploit its block-bidiagonal structure (cf. Eq. ([40](https://arxiv.org/html/2605.12683#A4.E40)) in Appx. [D](https://arxiv.org/html/2605.12683#A4)) to obtain a recursive formula for $\Delta\bm{z}_t^{(i+1)} := \bm{z}_t^{(i+1)} - \bm{z}_t^{(i)}$:

$$\Delta\bm{z}_t^{(i+1)} = \bm{J}_F\bigl(\bm{z}_{t-1}^{(i)}\bigr)\Delta\bm{z}_{t-1}^{(i+1)} - \bigl[\bm{r}(\bm{z}_{1:T}^{(i)})\bigr]_t. \tag{11}$$

Eq. ([11](https://arxiv.org/html/2605.12683#S3.E11)) is linear in $\Delta\bm{z}$ and hence each Newton iteration can be solved in $\mathcal{O}(\log T)$ time by a parallel associative scan. Due to its usage of the full Jacobian and hence full matrix products, increasing the dimensionality of the state of the sequence model can become a bottleneck. To address this, gonzalez2024towards suggested the use of quasi-Newton methods which replace the full Jacobian in Eq. ([11](https://arxiv.org/html/2605.12683#S3.E11)) by its diagonal $\mathrm{diag}(\bm{J}_F)\in\mathbb{R}^{M}$, reducing the *work* $\mathcal{W}$ of each Newton iteration from $\mathcal{W}_{\mathrm{full}}=\mathcal{O}(M^3 T)$ to $\mathcal{W}_{\mathrm{diag}}=\mathcal{O}(MT)$.
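A single DEER Newton update, i.e. Eqs. (10)-(11) solved via a parallel associative scan over affine maps, can be sketched as follows; this is a minimal illustration with full Jacobians, and the function names are hypothetical.

```python
import jax
import jax.numpy as jnp

def deer_newton_iteration(step_fn, z_guess, z0):
    """One DEER Newton update for a nonlinear recurrence z_t = F(z_{t-1}).
    The correction Delta z_t = J_F(z_{t-1}) Delta z_{t-1} - r_t (Eq. 11) is a
    *linear* recurrence, so it is evaluated with a parallel associative scan."""
    z_prev = jnp.concatenate([z0[None], z_guess[:-1]], axis=0)   # z_0 .. z_{T-1}
    residual = z_guess - jax.vmap(step_fn)(z_prev)               # r_t = z_t - F(z_{t-1})
    J = jax.vmap(jax.jacfwd(step_fn))(z_prev)                    # J_F(z_{t-1}), (T, M, M)

    def compose(left, right):
        # Affine maps d -> J d + b composed left-to-right (cf. Eq. 27 in Appx. C).
        J1, b1 = left
        J2, b2 = right
        return J2 @ J1, jnp.einsum('...ij,...j->...i', J2, b1) + b2

    # Delta z_0 = 0, so scanning the per-step maps (J_t, -r_t) yields Delta z_t directly.
    _, delta = jax.lax.associative_scan(compose, (J, -residual))
    return z_guess + delta                                       # z^{(i+1)}_{1:T}
```

Iterating this update until the residual vanishes recovers the sequential roll-out; quasi-DEER would replace `J` by its diagonal, reducing the per-iteration work as described above.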

While DEER and its variants can be directly used to train sequence models on data from nonlinear DS, the methods suffer from Jacobian divergence when tasked to generate orbits from systems with (on average) unstable dynamics, i.e. where $\lVert\bm{J}_F(\bm{z}_t)\rVert > 1$ for most $t$ (gonzalez2025predictability). This divergence is inevitable when training on chaotic systems (mikhaeil_difficulty_2022). Since most if not all natural complex DS are chaotic (Sivakumar2004ChaosGeophysics; durstewitz_dynamical_2007; GovindanECGchaos; Turchin92Ecology; FieldKorosNoyes1972), this needs to be amended.

## 4 Methods

### 4.1 DSR from long sequences

To enable stable training from long sequences of chaotic dynamics, we propose to combine GTF and DEER, which we coin GTF-DEER. By simply replacing the RNN map $F_{\bm{\theta}}$ with the teacher-forced variant ([6](https://arxiv.org/html/2605.12683#S3.E6)), the residual becomes the vector with entries

$$\bigl[\bm{r}_{\alpha}(\bm{z}_{1:T},\bm{x}_{1:T})\bigr]_t = \bm{z}_t - F_{\bm{\theta}}\bigl(\delta_{\alpha,\bm{B}}(\bm{z}_{t-1},\bm{x}_{t-1})\bigr) \tag{12}$$

for $t=2\dots T$, and $\bigl[\bm{r}_{\alpha}(\bm{z}_{1:T},\bm{x}_{1:T})\bigr]_1 = \bm{z}_1 - F_{\bm{\theta}}(\bm{z}_0)$ since $\bm{z}_0$ is given. Hence the linear recursion uses the decomposed Jacobians ([9](https://arxiv.org/html/2605.12683#S3.E9)):

$$\Delta\bm{z}_t^{(i+1)} = \bigl[\bm{J}_F(\tilde{\bm{z}}_{t-1}^{(i)})\,\bm{P}_{\alpha}\bigr]\Delta\bm{z}_{t-1}^{(i+1)} - \bigl[\bm{r}_{\alpha}\bigl(\bm{z}_{1:T}^{(i)},\bm{x}_{1:T}\bigr)\bigr]_t. \tag{13}$$

Similar to sequential GTF, Jacobian divergence caused by chaotic dynamics is tamed by $\bm{P}_{\alpha}$, which for $M\leq N$ strictly turns a DS with diverging state space directions into a contracting one, improving DEER convergence (gonzalez2025predictability). To demonstrate this formally, we need the following definitions.

###### Definition 1.

The *largest Lyapunov exponent* (LLE) of an orbit of the iterated map $F$ starting in $\bm{z}_0$ is

$$\lambda(\bm{z}_0) = \lim_{T\to\infty}\frac{1}{T}\log\left\lVert\bm{J}_T\bm{J}_{T-1}\dots\bm{J}_1\right\rVert_2, \tag{14}$$

where $\bm{J}_t := \bm{J}(\bm{z}_{t-1}) := \frac{\partial F(\bm{z}_{t-1})}{\partial\bm{z}_{t-1}}$.

We say the iterated map $F$ is *divergent* in $\bm{z}_0$ if $\lambda(\bm{z}_0)>0$. When $\lambda(\bm{z}_0)<0$, $F$ is *contracting*.
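For intuition, the LLE of Definition 1 can be estimated numerically by propagating a single tangent vector along an orbit and accumulating its log-expansion (a standard one-vector estimate; the sketch below and its names are illustrative assumptions, not part of the paper).

```python
import jax
import jax.numpy as jnp

def largest_lyapunov_exponent(step_fn, z0, T=10_000):
    """Numerical estimate of the LLE in Eq. (14) for an iterated map F:
    track a unit tangent vector under the Jacobians J_t and average the
    logarithmic growth rates over a long orbit."""
    def body(carry, _):
        z, v, acc = carry
        v = jax.jacfwd(step_fn)(z) @ v          # push the tangent vector forward
        norm = jnp.linalg.norm(v)
        return (step_fn(z), v / norm, acc + jnp.log(norm)), None

    v0 = jnp.ones_like(z0) / jnp.sqrt(z0.shape[0])
    (_, _, acc), _ = jax.lax.scan(body, (z0, v0, 0.0), None, length=T)
    return acc / T                               # > 0: divergent, < 0: contracting
```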

###### Proposition 1.

Let an SSM be given by a latent model $F:\mathbb{R}^M\to\mathbb{R}^M$ and a linear observation model $G:\mathbb{R}^M\to\mathbb{R}^N,\ \bm{z}\mapsto\bm{B}\bm{z}$, where $M\leq N$ and $\bm{B}$ is assumed to have full rank. Let the supremal Jacobian norm satisfy $\tilde{\sigma}_{\max} := \sup\left\{\lVert\bm{J}(\bm{z})\rVert_2 = \sigma_{\max}(\bm{J}(\bm{z})) \mid \bm{z}\in\mathbb{R}^M\right\} > 1$. Choose the GTF forcing strength $\alpha\in(\alpha^{*},1]$, where $\alpha^{*} := 1-\tilde{\sigma}_{\max}^{-1}$ (hess_generalized_2023), and define the effective contraction rate

$$\rho := (1-\alpha)\,\tilde{\sigma}_{\max} < 1. \tag{15}$$

Then the forced system $F\circ\delta_{\alpha,\bm{B}}$ is globally contracting with rate $\rho$. Its LLE satisfies

$$\lambda_{\mathrm{GTF}} \leq \log\rho < 0. \tag{16}$$

Proposition [1](https://arxiv.org/html/2605.12683#Thmproposition1) says that, given a suitable $\alpha$, training a nonlinear sequence model using GTF-DEER leads to guaranteed convergence of the forward pass regardless of the dynamics that underlie the data, enabling stable and efficient parallel-in-time training for DSR. We validate this theoretical finding in Sect. [5.1](https://arxiv.org/html/2605.12683#S5.SS1) and investigate the effect of $M>N$ on GTF-DEER convergence empirically. Moreover, since we directly want to match forced model trajectories to targets given by the data, we can initialize the Newton iterations of DEER using the forcing signals, i.e. $\bm{z}_{1:T}^{(0)} = \overline{\bm{z}}_{1:T}$, which speeds up convergence over the typical initialization by zeros (lim2024parallelizing; gonzalez2024towards). We provide the training routine using GTF-DEER in Alg. [3](https://arxiv.org/html/2605.12683#alg3).
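Putting the pieces together, a minimal sketch of the GTF-DEER forward pass, combining the forced residual of Eq. (12), the damped recursion of Eq. (13), and the $\bm{B}^{+}\bm{x}$ initialization, might look as follows. For brevity, every step (including $t=1$) is forced, a fixed iteration count replaces the convergence check of Alg. 3, and all names are illustrative assumptions rather than the paper's implementation.

```python
import jax
import jax.numpy as jnp

def gtf_deer_forward(step_fn, B, x, z0, alpha, n_iter=8):
    """GTF-DEER forward pass (sketch): DEER's Newton iterations applied to the
    GTF-forced map, each solved by a parallel scan over the damped recursion
    Delta z_t = [J_F(z_tilde_{t-1}) P_alpha] Delta z_{t-1} - r_t (Eq. 13)."""
    M = z0.shape[0]
    B_pinv = jnp.linalg.pinv(B)
    P_alpha = jnp.eye(M) - alpha * (B_pinv @ B)

    def force(z_prev, x_prev):                      # delta_{alpha,B} of Eq. (8)
        return P_alpha @ z_prev + alpha * (B_pinv @ x_prev)

    def compose(left, right):                       # associative op for affine maps
        J1, d1 = left
        J2, d2 = right
        return J2 @ J1, jnp.einsum('...ij,...j->...i', J2, d1) + d2

    z = x @ B_pinv.T                                # z^{(0)}_{1:T} = B^+ x_{1:T}
    x_prev = jnp.concatenate([x[:1], x[:-1]], axis=0)   # teacher signals x_{t-1}
    for _ in range(n_iter):
        z_prev = jnp.concatenate([z0[None], z[:-1]], axis=0)
        z_tilde = jax.vmap(force)(z_prev, x_prev)
        residual = z - jax.vmap(step_fn)(z_tilde)   # r_alpha of Eq. (12)
        J = jax.vmap(jax.jacfwd(step_fn))(z_tilde) @ P_alpha   # J_F(z_tilde) P_alpha
        _, delta = jax.lax.associative_scan(compose, (J, -residual))
        z = z + delta
    return z
```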

### 4.2 Capturing long-term dependencies

##### Regularization

To effectively capture long-term dependencies, sequence models must maintain a stable flow of information over extended temporal horizons (bengio_learning_1994; hochreiter_lstm_97; gu_efficiently_2022; zucchet2024recurrent). Crucially, for maintaining connections among states across time we must not only avoid chaotic divergence, but also prevent error signals propagated backward from decaying too fast. In this work we therefore make use of Manifold Attractor Regularization (MAR; schmidt_identifying_2021) to ameliorate diminishing gradients by equipping the sequence model with a latent subspace that adaptively captures slow timescales in the data. In Appx. [E](https://arxiv.org/html/2605.12683#A5), we review the MAR contribution to the loss function and show how this regularization leads to stable error propagation by imposing a block-diagonal identity structure in the model Jacobian $\bm{J}_F$.

##### Latent state warm-up

To provide the RNN with sufficient dynamical context during training, we employ latent state warm-up (vlachas_backpropagation_2020; schiller2026tuning). During training we sample sequences of length $T$, where the first $T_w$ steps are used for warm-up. The warm-up consists of forcing the sequence model with data $\bm{x}_{1:T_w}$, either by feeding them through $\bm{U}$ for the LSSM (Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3))) or by GTF in the case of the nonlinear RNN (Eqs. ([6](https://arxiv.org/html/2605.12683#S3.E6)) and ([8](https://arxiv.org/html/2605.12683#S3.E8))) with $\alpha=1$. This yields a latent state $\bm{z}_{T_w}$ which holds a compressed history of the warm-up signal, which is then used to predict the remaining $T-T_w$ time steps that contribute to the loss function given by the mean squared error $\mathrm{MSE}(\bm{x}_{T_w+1:T},\hat{\bm{x}}_{T_w+1:T})$.

Crucially, the forward pass of the entire sequence of length $T$, *including* the warm-up, can be computed by a single call to GTF-DEER, enabling long contexts during training (cf. Alg. [3](https://arxiv.org/html/2605.12683#alg3)).
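In code, excluding the warm-up window from the objective amounts to a simple mask over the time axis; a minimal sketch under the assumptions above:

```python
import jax.numpy as jnp

def warmup_masked_mse(x, x_hat, T_w):
    """MSE over the prediction window only: the first T_w warm-up steps are
    forced (alpha = 1) and excluded from the loss, as described in Sect. 4.2."""
    return jnp.mean((x[T_w:] - x_hat[T_w:]) ** 2)
```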

##### Performance measures

To evaluate DSR performance we use an established long-term measure which computes the KL divergence between multivariate state distributions in observation space, $D_{\mathrm{stsp}}$ (koppe_identifying_2019; hess_generalized_2023). For partially observed systems, we first perform a delay embedding (takens_detecting_1981; sauer_embedology_1991) of ground-truth and generated trajectories and compute $D_{\mathrm{stsp}}$ on these embeddings. We will make delay embeddings explicit in notation by writing the measure as $D_{\mathrm{stsp}}^{\mathrm{DE}}$. We also report the $128$-step root-mean-square error (RMSE) as a measure of short- to medium-length prediction accuracy. For mathematical and experimental details on evaluation measures, see Appx. [G](https://arxiv.org/html/2605.12683#A7).
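As a rough illustration of the idea behind $D_{\mathrm{stsp}}$ (the exact measure follows koppe_identifying_2019 and hess_generalized_2023 and is detailed in Appx. G), one can compare binned occupation histograms of ground-truth and generated trajectories; the binning and smoothing choices below are illustrative assumptions.

```python
import numpy as np

def d_stsp_binned(x_true, x_gen, n_bins=30, eps=1e-10):
    """Sketch of a binned state-space divergence: KL between occupation
    histograms of true and generated trajectories in observation space."""
    lo = np.minimum(x_true.min(0), x_gen.min(0))
    hi = np.maximum(x_true.max(0), x_gen.max(0))
    edges = [np.linspace(lo[d], hi[d], n_bins + 1) for d in range(x_true.shape[1])]
    p, _ = np.histogramdd(x_true, bins=edges)
    q, _ = np.histogramdd(x_gen, bins=edges)
    p = p.ravel() + eps
    q = q.ravel() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))
```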

## 5 Results

We will first evaluate the computational efficiency of GTF-DEER by performing runtime analyses. We will then investigate the benefit of long-sequence training for DSR through a set of ablation experiments and comparisons to state-of-the-art sequence models. Specifically, we 1) ask whether long sequences improve DSR and 2) reveal and explain the shortcomings of SSMs for DSR.

### 5.1 Runtime convergence and performance of GTF-DEER

![Refer to caption](https://arxiv.org/html/2605.12683v1/x1.png)

Figure 1: A: GTF-DEER scales favorably in sequence length, but the GPU quickly saturates for larger problems. All analyses were performed on an NVIDIA RTX 6000 Blackwell (96GB) GPU. Note the logarithmic scaling of the y-axis. B: The forcing parameter $\alpha$ controls Jacobian norms and hence reduces the number of Newton iterations needed for convergence of the GTF-DEER forward pass. Initializing the model trajectory using estimated latents through $\bm{z}_{1:T}^{(0)}=\bm{B}^{+}\bm{x}_{1:T}$ ('P-inv') improves GTF-DEER convergence over naive initializations for $N=M$; however, the effect vanishes for $M>N$.

We evaluate the efficacy of GTF-DEER by performing runtime comparisons to the sequential baseline as well as ablations on several parameters of the GTF-DEER algorithm. Fig. [1](https://arxiv.org/html/2605.12683#S5.F1)A shows median runtimes for combined forward+backward passes for both sequential and parallel evaluation using GTF-DEER while training to reconstruct dynamics of the chaotic Lorenz-63 attractor ($N=3$). While the sequential approach exhibits linear scaling throughout, GTF-DEER enables sublinear scaling with sequence length, leading to a speedup of up to $\mathbf{870\times}$ over sequential evaluation (see Fig. [1](https://arxiv.org/html/2605.12683#S5.F1)A, setting $M=4$, $T=32{,}768$). However, for large problems, i.e. large $B\times T\times M$, GPU memory bandwidth quickly saturates, leading to near linear to constant scaling, a common bottleneck in parallel scan implementations on GPU devices (merrill2016single; gu2024mamba). Nevertheless, for all settings, using GTF-DEER outperforms naive sequential training in terms of raw runtime.

Fig. [1](https://arxiv.org/html/2605.12683#S5.F1)B shows the convergence of the DEER forward pass under increasing values of $\alpha\in[0,1]$. For GTF-DEER, $\alpha$ not only controls Jacobian norms during the backward pass, addressing the exploding gradient problem in the face of chaos (mikhaeil_difficulty_2022; hess_generalized_2023), but also controls convergence of the linear recurrence in Eq. ([13](https://arxiv.org/html/2605.12683#S4.E13)). If $\alpha$ is too small, the Jacobians will diverge in chaotic settings, and hence the forward pass becomes numerically unstable, such that a large number of iterations is needed to converge. Increasing $\alpha$ pushes Jacobian norms below $1$, such that the forced model constitutes a stable DS, leading to convergence of GTF-DEER (gonzalez2025predictability). In fact, for sufficiently large $\alpha$, the forward pass even converges optimally in just $2$ Newton iterations for $M\leq N$! (That 2 instead of just 1 Newton iterations are required is due to the fact that one additional iteration is needed to verify that the previous iteration has indeed converged.) For $M>N$, only a subspace of the latent space of the RNN is forced, such that GTF-DEER needs more iterations to converge to the true latent dynamics.

Finally, the different curves in Fig. [1](https://arxiv.org/html/2605.12683#S5.F1)B show the benefit of readily available forcing targets: Initializing the latent trajectory with estimated forcing signals obtained by pseudo-inversion of the observation model drastically reduces the number of Newton iterations for a given $\alpha$ and $M\leq N$. The benefit vanishes when moving to the overdetermined case of $M>N$. This is expected, as the RNN dynamics will veer off the estimated forcing signals since GTF is only applied to the zero-error manifold (sagtekin2025error). For more details on the experimental setup underlying Fig. [1](https://arxiv.org/html/2605.12683#S5.F1), we refer to Appx. [H.1](https://arxiv.org/html/2605.12683#A8.SS1).

### 5.2 GTF-DEER ablations

An important observation is that using a diagonal approximation of the Jacobians in Eq. ([13](https://arxiv.org/html/2605.12683#S4.E13)) for the forward pass (quasi-DEER; gonzalez2024towards) affects gradients during the backward pass, as the same diagonal Jacobians will be used (see Appx. [D](https://arxiv.org/html/2605.12683#A4) for more details). In Fig. [2](https://arxiv.org/html/2605.12683#S5.F2), we test how Jacobian diagonalization affects GTF-DEER convergence and reconstruction quality. We show loss curves and Newton iterations of shPLRNNs ($M=5$, $L=50$) trained on the Lorenz-63 system under two settings: In the 'FO' setting, the shPLRNN is tasked to reconstruct the dynamics from the fully observed system ($N=3$), while in the more challenging 'PO' setting only the $x$-component is observed ($N=1$). We then consider training with diagonalized Jacobians ('quasi') as well as full Jacobians, i.e. standard GTF-DEER, while keeping all other training-related hyperparameters the same. We find that while full-Jacobian training handled both settings with minimal Newton iterations and provided good reconstruction quality ('PO': $D^{\mathrm{DE}}_{\mathrm{stsp}}=(7.5\pm 6.2)\cdot 10^{-3}$, 'FO': $D_{\mathrm{stsp}}=(8.7\pm 3.8)\cdot 10^{-3}$), diagonal approximations needed $\approx 100\times$ more Newton iterations for the forward pass to converge, failed to yield convergent loss curves in the more challenging 'PO' setting, and fell behind in reconstruction performance across both settings ('PO + quasi': $D^{\mathrm{DE}}_{\mathrm{stsp}}=(7\pm 7)$, where $14/20$ runs diverged, 'FO + quasi': $D_{\mathrm{stsp}}=(4.4\pm 3.5)\cdot 10^{-2}$). This result highlights the role of the Jacobians during the backward pass: Since the diagonal approximations of the Jacobians are used to compute the gradient, they lose information on temporal mixing effects. This prevents the DSR model from learning a faithful embedding of the partially observed dynamics, which is also reflected in the erratic loss curve in Fig. [2](https://arxiv.org/html/2605.12683#S5.F2)A. Thus, while full Jacobians in theory increase the time demand of each Newton step, this is more than compensated for by the manifold fewer iterations needed and the more stable training.

![Refer to caption](https://arxiv.org/html/2605.12683v1/x2.png)

Figure 2: Training dynamics and reconstruction quality of a shPLRNN ($M=5$, $L=50$) on the Lorenz-63 system under fully observed (FO, $N=3$) and partially observed (PO, $N=1$) conditions. A: Training loss curves. B: Newton iterations per training step. C: State-space divergence $D_{\mathrm{stsp}}$ (lower is better); red crosses mark runs where trajectory divergence during generation produced NaNs. D: Wall-clock runtime per training step. All reported values are median $\pm$ MAD across $20$ runs.
### 5.3 Reconstructing dynamical systems with long-term dependencies

To demonstrate the advantage of processing long sequences, we generated data from 1) a Lorenz-96 system ($N=6$) (lorenz_predictability_1996) augmented with a sinusoidal forcing term of period $15{,}000$ discrete time steps, thereby introducing an explicit long time scale into the dynamics, and 2) a bursting neuron biophysical model (durstewitz_dynamical_2007) which exhibits long inter-burst intervals of $>10^4$ time steps (see Appx. [F](https://arxiv.org/html/2605.12683#A6) for further details on data generation). We trained shPLRNNs with $M=10$ for the Lorenz-96 system and $M=6$ for the bursting neuron, where both settings share $L=128$, to reconstruct the underlying dynamics while systematically varying the training sequence length. To control for the fact that longer sequences expose the model to more data per gradient update, we held the product $B\cdot T = 2^{15} = 32{,}768$ fixed. For full experimental details see Appx. [H](https://arxiv.org/html/2605.12683#A8).

Figure [3](https://arxiv.org/html/2605.12683#S5.F3)A reports the long-term reconstruction measure $D_{\mathrm{stsp}}^{\mathrm{DE}}$ as a function of sequence length. For both systems, the measure improves with increasing sequence length, confirming that the model benefits substantially from training specifically on longer contexts, while keeping the *total* amount of training data constant. The $\mathrm{RMSE}(128)$, however, already saturates for shorter sequence lengths, as expected by design of this short-term measure, with mean $\pm$ SEM over all sequence lengths and runs given by $\mathrm{RMSE}(128) = 0.27\pm 0.02$ for the Lorenz-96 and $0.047\pm 0.003$ for the bursting neuron system, respectively (see also Fig. [A3](https://arxiv.org/html/2605.12683#A9.F3)). Training at such sequence lengths becomes tractable only through the favorable scaling of GTF-DEER: whereas training on sequences of length $32{,}768$ required on average ${\approx}\,10$ minutes for the Lorenz-96 setting on a single NVIDIA RTX 6000 Blackwell GPU, sequential GTF (hess_generalized_2023) would require approximately $34$ hours.

![Refer to caption](https://arxiv.org/html/2605.12683v1/x3.png)

Figure 3: A: Long-term measure $D_{\mathrm{stsp}}^{\mathrm{DE}}$ as a function of sequence length for shPLRNNs trained on the forced Lorenz-96 and bursting neuron system. B: $D_{\mathrm{stsp}}^{\mathrm{DE}}$ evaluated on the forced Lorenz-96 system for different models (see Fig. [A2](https://arxiv.org/html/2605.12683#A9.F2) for a qualitative comparison). Note that even Mamba-2 cannot match the performance of the $r=7$ shPLRNN trained with GTF-DEER, even when allowed many more parameters. C: Example reconstructions for long-sequence GTF-DEER (top) vs. a model trained only on standard-length sequences (bottom) for the bursting neuron (left) and forced Lorenz-96 (right).
### 5.4 Linear vs. nonlinear training-time recurrences

Finally, we compare the DSR performance of LSSMs (Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3))) to that of nonlinear RNNs (Eq. ([5](https://arxiv.org/html/2605.12683#S3.E5))), trained by GTF-DEER, on the forced Lorenz-96 system. For comparability, we fix common hyperparameters, i.e. both models use $M=10$, $L=128$, $T=81{,}920$ and $B=1$. To investigate the effect of the low-rank constraint in the LSSM (Eq. ([4](https://arxiv.org/html/2605.12683#S3.E4))), we also train a low-rank version of the shPLRNN, where $\bm{W} := \bm{W}_L\cdot\bm{W}_R$ with $\bm{W}_L\in\mathbb{R}^{M\times r}$ and $\bm{W}_R\in\mathbb{R}^{r\times L}$. To account for MAR applied to the shPLRNN, we add a similar regularization term to the training of the LSSM; see Appx. [E](https://arxiv.org/html/2605.12683#A5) for details. Figure [3](https://arxiv.org/html/2605.12683#S5.F3)B summarizes the results of this ablation quantitatively through $D^{\mathrm{DE}}_{\mathrm{stsp}}$, while qualitative example reconstructions are provided in Fig. [A2](https://arxiv.org/html/2605.12683#A9.F2). The LSSMs, as well as the shPLRNN with its rank limited to the number of observed variables ($r=N=6$), all fall behind in reconstructing the limiting dynamics of the forced Lorenz-96 system, failing to capture the influence of the sinusoidal forcing pattern. Simply increasing the rank by just $1$ ($r=7$) equips the shPLRNN with the expressivity needed to capture the latent forcing dynamics, hence significantly improving reconstruction quality. We also trained a Mamba-2-based model (dao2024transformers) with a comparable parameter count on this task (see Appx. [H](https://arxiv.org/html/2605.12683#A8) for details). Even though Mamba-2 avoids the low-rank constraint through its selection mechanism, where recurrence parameters are parameterized by the data, $\bm{A}\rightarrow\bm{A}(\bm{x}_t)$ and $\bm{U}\rightarrow\bm{U}(\bm{x}_t)$, it performs much worse than the $r=7$ shPLRNN trained by GTF-DEER and requires more trainable parameters (see Mamba-2-7k vs. Mamba-2-17k), while at the same time exhibiting much higher variance in training outcome. The latter observation highlights another important feature of the training algorithm: GTF-DEER improves autoregressive roll-outs at test time, since GTF reduces exposure bias by keeping $\alpha$, i.e. the forcing strength, minimal. Applying similar strategies for addressing exposure bias to linear training-time recurrences destroys their amenability to efficient parallelization (see Appx. [C](https://arxiv.org/html/2605.12683#A3)).

## 6 Discussion

We introduced GTF-DEER, a parallel-in-time training algorithm that combines the long-sequence scalability of DEER with the chaos-taming properties of Generalized Teacher Forcing (GTF). Through forcing, GTF-DEER turns the RNN during training into a stable DS, improving empirical runtime scaling from the worst case of $\mathcal{O}(T\log T)$ under unstable dynamics to the average case of $\mathcal{O}((\log T)^2)$ for contractive RNNs (gonzalez2025predictability). Leveraging fast training from arbitrary sequence lengths, we provided what is to our knowledge the first systematic evidence that DSR benefits from training on such long sequences when the data carry slow timescales. Furthermore, we showed that feeding predictions of linear SSMs (LSSMs) with nonlinear read-out back into the recurrence during autoregressive generation imposes a low-rank constraint that hinders the LSSM from inferring unobserved dynamical variables in the data. Even when SSMs ameliorate this problem by introducing gating or selection mechanisms (gu2024mamba; dao2024transformers), DSR quality suffers due to the inherent exposure bias of conventional teacher forcing (bengio2015scheduled; ranzato2016sequence), which degrades autoregressive roll-outs during testing. For modern SSMs, this problem cannot be resolved without trading the linear recurrence for a (locally) nonlinear one, such that parallelization by parallel scan and similar algorithms breaks down (cf. Appx. [C](https://arxiv.org/html/2605.12683#A3); blelloch1990prefix; gu_efficiently_2022; smith_simplified_2023; gu2024mamba; dao2024transformers).

##### Limitations

GTF-DEER inherits DEER's cubic work in the latent dimension $M$ per Newton iteration, $\mathcal{W}_{\mathrm{full}}=\mathcal{O}(M^3 T)$, and the diagonal (quasi-DEER) approximation that would alleviate this degrades gradients in partially observed settings substantially (Sect. [5.2](https://arxiv.org/html/2605.12683#S5.SS2)). This sets practical limits on the latent dimension $M$ (see Fig. [1](https://arxiv.org/html/2605.12683#S5.F1)A). Finally, Proposition [1](https://arxiv.org/html/2605.12683#Thmproposition1) strictly holds for $M\leq N$, while for $M>N$ we only find empirical evidence that GTF-DEER's convergence also holds for the settings and problems considered in this work.

## Acknowledgments and Disclosure of Funding

This work was supported by individual grants Du 354/15-1 (project no. 502196519) and Du 354/18-1 (project no. 567025973) from the German Research Foundation (DFG), and by the German Ministry for Research, Astronautics, and Technology (BMFTR) through NAILIt ("Neuro-Inspired AI for Learning & Inference in Non-Stationary Environments"), grant number 01GQ2509A.

## References

## Appendix

## Appendix A Absence of chaos-induced exploding gradients in linear SSMs

The gradient propagation properties of linear SSMs are well established in prior work (mikhaeil_difficulty_2022; orvieto2023resurrecting; zucchet2024recurrent), and we briefly recap them here for completeness. Consider a discrete-time linear SSM with time-invariant dynamics:

$$\bm{z}_t = \bm{A}\bm{z}_{t-1} + \bm{v}_t, \tag{17}$$

where $\bm{z}_t\in\mathbb{R}^M$ is the latent state, $\bm{A}\in\mathbb{R}^{M\times M}$ is the state transition matrix, and $\bm{v}_t\in\mathbb{R}^M$ is an extrinsic input. Differentiating the recurrence yields a constant Jacobian $\bm{J}_i = \partial\bm{z}_i/\partial\bm{z}_{i-1} = \bm{A}$, so the BPTT chain-rule product collapses to

$$\frac{\partial\bm{z}_t}{\partial\bm{z}_r} = \prod_{i=r+1}^{t}\bm{J}_i = \bm{A}^{t-r}, \qquad \left\lVert\frac{\partial\bm{z}_t}{\partial\bm{z}_r}\right\rVert \leq \lVert\bm{A}\rVert^{t-r}. \tag{18}$$

Gradient propagation is therefore independent of the state trajectory $\{\bm{z}_i\}$ and inputs $\{\bm{v}_i\}$, and is fully determined by the spectrum of $\bm{A}$. As a consequence, linear SSMs cannot exhibit chaos-induced gradient explosion: constraining the eigenvalues of $\bm{A}$ (typically near the unit disk) is sufficient to control gradient magnitudes.
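Eq. (18) is easy to verify numerically: the Jacobian of the unrolled linear recurrence with respect to the initial state equals $\bm{A}^{T}$ regardless of the inputs. A small illustrative check (names are hypothetical):

```python
import jax
import jax.numpy as jnp

def check_linear_ssm_gradient(A, v):
    """Numerical check of Eq. (18): for z_t = A z_{t-1} + v_t, the Jacobian of z_T
    with respect to z_0 equals A^T, independent of the inputs v_t."""
    T = v.shape[0]

    def rollout(z0):
        z = z0
        for t in range(T):
            z = A @ z + v[t]
        return z

    J = jax.jacfwd(rollout)(jnp.zeros(A.shape[0]))      # d z_T / d z_0
    return jnp.allclose(J, jnp.linalg.matrix_power(A, T), atol=1e-5)
```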

## Appendix B Proof of Proposition [1](https://arxiv.org/html/2605.12683#Thmproposition1): GTF-DEER convergence

###### Proof.

Since $M\leq N$ and the observation matrix $\bm{B}$ has full column rank, $\bm{B}^{+}\bm{B}=\bm{I}_M$, and by Eq. ([8](https://arxiv.org/html/2605.12683#S3.E8)), $\bm{P}_{\alpha}=(1-\alpha)\,\bm{I}_M$. The Jacobian decomposition simplifies to Eq. ([7](https://arxiv.org/html/2605.12683#S3.E7)):

$$\tilde{\bm{J}}_t := \tilde{\bm{J}}(\bm{z}_{t-1}) := \bm{J}_{F\circ\delta_{\alpha}}(\bm{z}_{t-1}) = (1-\alpha)\,\bm{J}_F(\tilde{\bm{z}}_{t-1}). \tag{19}$$

Taking spectral norms leads to:

$$(1-\alpha)\left\lVert\bm{J}_F(\tilde{\bm{z}}_{t-1})\right\rVert_2 \leq (1-\alpha)\,\tilde{\sigma}_{\max} = \rho. \tag{20}$$

Since $\alpha>\alpha^{*}=1-1/\tilde{\sigma}_{\max}$, we have $(1-\alpha)<1/\tilde{\sigma}_{\max}$ and hence $\rho<1$.

For any product of $k$ consecutive forced Jacobians:

$$\left\lVert\tilde{\bm{J}}_t\tilde{\bm{J}}_{t-1}\cdots\tilde{\bm{J}}_{t-k+1}\right\rVert_2 \leq \prod_{i=0}^{k-1}\left\lVert\tilde{\bm{J}}_{t-i}\right\rVert_2 \leq \rho^k. \tag{21}$$

Then by Def. [1](https://arxiv.org/html/2605.12683#Thmdefinition1), we have

$$\lambda(\bm{z}) = \lim_{T\to\infty}\frac{1}{T}\log\left\lVert\prod_{i=0}^{T-1}\tilde{\bm{J}}_{T-i}\right\rVert_2 \leq \lim_{T\to\infty}\frac{1}{T}\log\rho^T = \log\rho < 0. \tag{22}$$

Thus the forced system is *contracting*. ∎

## Appendix C Scheduled sampling breaks the linearity of LSSMs

Conventional teacher forcing (Goodfellow-et-al-2016) feeds the ground-truth observation $\bm{x}_{t-1}$ into the latent recurrence

$$\bm{z}_t = \bm{A}\bm{z}_{t-1} + \bm{U}\bm{x}_{t-1} + \bm{C}\bm{s}_t + \bm{h} \tag{23}$$

at every step. While the following holds for any forcing strategy that re-introduces a feedback from model-predicted observations $\hat{\bm{x}}$ into the latent recurrence, we will exemplify the problem using the well-established method of scheduled sampling (bengio2015scheduled). Scheduled sampling switches between teacher forcing and free-running generation during training by replacing $\bm{x}_{t-1}$ with the model's own prediction $\hat{\bm{x}}_{t-1}$ with a scheduled probability $\epsilon_t\in[0,1]$,

$$\tilde{\bm{x}}_{t-1} = \begin{cases}\bm{x}_{t-1} & \text{with probability } 1-\epsilon_t,\\ \hat{\bm{x}}_{t-1} & \text{with probability } \epsilon_t,\end{cases} \tag{24}$$

so that the latent recurrence used during training reads

$$\bm{z}_t = \bm{A}\bm{z}_{t-1} + \bm{U}\tilde{\bm{x}}_{t-1} + \bm{C}\bm{s}_t + \bm{h}. \tag{25}$$

Whenever the prediction branch is taken, substituting $\hat{\bm{x}}_{t-1} = \bm{B}\phi(\bm{V}\bm{z}_{t-1}+\bm{b})$ from Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3)) into Eq. ([25](https://arxiv.org/html/2605.12683#A3.E25)) yields

$$\bm{z}_t = \bm{A}\bm{z}_{t-1} + \overline{\bm{W}}\,\phi(\bm{V}\bm{z}_{t-1} + \bm{b}) + \bm{C}\bm{s}_t + \bm{h}, \tag{26}$$
with $\overline{\bm{W}} = \bm{U}\bm{B}$, i.e. exactly the test-time recurrence of Eq. ([4](https://arxiv.org/html/2605.12683#S3.E4)). The transition $\bm{z}_{t-1} \mapsto \bm{z}_t$ is therefore no longer linear (affine) in $\bm{z}_{t-1}$ but contains the nonlinearity $\phi$.

##### Incompatibility with parallel scan

Evaluation of Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3)) via parallel scan [blelloch1990prefix, martin2018parallelizing, smith_simplified_2023] hinges on each transition being representable as an affine map $\bm{z}_t = \bm{A}_t\bm{z}_{t-1} + \bm{b}_t$, identified with the pair $(\bm{A}_t, \bm{b}_t)$. Composition of two such maps,

$$(\bm{A}_2, \bm{b}_2)\circ(\bm{A}_1, \bm{b}_1) = \bigl(\bm{A}_2\bm{A}_1,\;\bm{A}_2\bm{b}_1 + \bm{b}_2\bigr), \tag{27}$$
is associative and again affine, such that partial compositions over disjoint chunks of the training sequence can be combined in any order, enabling an $\mathcal{O}(\log T)$-depth reduction. In the standard LSSM recurrence the input term $\bm{U}\bm{x}_{t-1}$ depends only on observed data and contributes solely to $\bm{b}_t$, leaving the map affine in the latent state.
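To make the associative structure concrete, here is a minimal JAX sketch of the composition rule in Eq. (27) applied with `jax.lax.associative_scan`; the function and array names are illustrative and not taken from the paper's code.

```python
import jax
import jax.numpy as jnp

def compose(elem_first, elem_second):
    # Eq. (27): (A2, b2) o (A1, b1) = (A2 A1, A2 b1 + b2).
    # associative_scan passes the earlier element first, so elem_second plays (A2, b2).
    A1, b1 = elem_first
    A2, b2 = elem_second
    A = jnp.einsum('...ij,...jk->...ik', A2, A1)
    b = jnp.einsum('...ij,...j->...i', A2, b1) + b2
    return A, b

def affine_recurrence_scan(As, bs, z0):
    """All states z_t = A_t z_{t-1} + b_t, t = 1..T, in O(log T) depth.

    As: (T, M, M) per-step transition matrices
    bs: (T, M)    per-step offsets (e.g. U x_{t-1} + C s_t + h for the LSSM)
    z0: (M,)      initial latent state
    """
    # Prefix-compose the affine maps; the prefix at index t maps z_0 to z_t.
    A_pref, b_pref = jax.lax.associative_scan(compose, (As, bs))
    return jnp.einsum('tij,j->ti', A_pref, z0) + b_pref
```

For a diagonal recurrence matrix, as in the LSSM parameterization of Appx. E, the matrix products reduce to elementwise multiplications and the scan becomes correspondingly cheaper.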

Once $\bm{x}_{t-1}$ is replaced by $\hat{\bm{x}}_{t-1}$, the input becomes a nonlinear function of $\bm{z}_{t-1}$ and the transition is no longer of this affine form. The composition of two such nonlinear transitions cannot in general be represented by a fixed, finite-parameter operator that is *independent* of the latent state, such that the recurrence cannot be reduced to an associative scan. Training must instead unroll Eq. ([26](https://arxiv.org/html/2605.12683#A3.E26)) sequentially, recovering the $\mathcal{O}(T)$ bottleneck.

## Appendix D GTF-DEER gradients

We define the per-timestep MSE loss function $\ell: \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$,

$$\ell(\bm{x}, \hat{\bm{x}}) = \frac{1}{N}\bigl\lVert\bm{x} - \hat{\bm{x}}\bigr\rVert_2^2, \tag{28}$$
and the total loss, $\mathcal{L}: (\mathbb{R}^N)^T \times (\mathbb{R}^N)^T \to \mathbb{R}$, for the full time series,

$$\mathcal{L}(\bm{x}_{1:T}, \hat{\bm{x}}_{1:T}) = \frac{1}{T}\sum_{t=1}^{T}\ell(\bm{x}_t, \hat{\bm{x}}_t) = \frac{1}{NT}\sum_{t=1}^{T}\bigl\lVert\bm{x}_t - \hat{\bm{x}}_t\bigr\rVert_2^2, \tag{29}$$
where in the model ([1](https://arxiv.org/html/2605.12683#S3.E1)), $\hat{\bm{x}}_t = \bm{G}_{\psi}(\bm{z}_t)$ are the observations associated with the latent states; we consider $\bm{x}_{1:T} \in \mathbb{R}^{NT} \cong (\mathbb{R}^N)^T$ as a stacked vector; and in practice we truncate a warm-up period $0 < T_w < T$ and only supply vectors $\bm{x}, \hat{\bm{x}}$ of size $N\tilde{T}$, $\tilde{T} \coloneqq T - T_w$, cf. Sect. [4.2](https://arxiv.org/html/2605.12683#S4.SS2) and Eq. ([60](https://arxiv.org/html/2605.12683#A5.E60)).

To optimize the parameters $\bm{\theta}$ of the RNN $\tilde{F}_{\bm{\theta},\alpha} = F_{\bm{\theta}} \circ \delta_{\alpha,\bm{B}}$, we need to compute the gradient of the loss function,

$$\frac{\partial\mathcal{L}}{\partial\theta} = \frac{\partial\mathcal{L}}{\partial\bm{z}}\frac{\partial\bm{z}}{\partial\theta} \tag{30}$$
for all elements $\theta$ (for notational simplicity, we do not write any index) of the abstract parameter vector $\bm{\theta}$, which comprises all matrix weights.

In the DEER forward pass, we use the residual

$$\bm{r}_{\bm{\theta},\alpha}(\bm{z}_{t-1}, \bm{z}_t) = \bm{z}_t - \tilde{F}_{\bm{\theta},\alpha}(\bm{z}_{t-1}, \bm{x}_{t-1}) \in \mathbb{R}^M, \tag{31}$$
$$\bm{R}_{\bm{\theta},\alpha}(\bm{z}_{1:T}) = \bigl(\bm{r}_{\alpha}(\bm{z}_0, \bm{z}_1), \ldots, \bm{r}_{\alpha}(\bm{z}_{T-1}, \bm{z}_T)\bigr) \in \mathbb{R}^{MT}, \tag{32}$$
where $\bm{z}_0$ is a fixed initial state (which is not predicted), to obtain the predicted time series $\bm{z}_{1:T} \in \mathbb{R}^{MT} \cong (\mathbb{R}^M)^T$. At the end of each forward pass,

$$\bm{R}_{\alpha}(\bm{z}_{1:T}, \bm{\theta}) \coloneqq \bm{R}_{\bm{\theta},\alpha}(\bm{z}_{1:T}) = \bm{0} \in \mathbb{R}^{MT}, \tag{33}$$
so by the implicit function theorem,

$$\frac{\partial\bm{R}_{\alpha}}{\partial\bm{z}_{1:T}}\frac{\partial\bm{z}_{1:T}}{\partial\theta} + \frac{\partial\bm{R}_{\alpha}}{\partial\theta} = \bm{0} \in \mathbb{R}^{MT}. \tag{34}$$
Rearranging this equation gives us

$$\frac{\partial\bm{z}_{1:T}}{\partial\theta} = -\left(\frac{\partial\bm{R}_{\alpha}}{\partial\bm{z}_{1:T}}\right)^{-1}\frac{\partial\bm{R}_{\alpha}}{\partial\theta}. \tag{35}$$
We insert this into the gradient of the loss:

$$\frac{\partial\mathcal{L}}{\partial\theta} = \underbrace{-\frac{\partial\mathcal{L}}{\partial\bm{z}_{1:T}}\left(\frac{\partial\bm{R}_{\alpha}}{\partial\bm{z}_{1:T}}\right)^{-1}}_{\eqqcolon\,\bm{V}^{\top}\in\mathbb{R}^{1\times MT}}\frac{\partial\bm{R}_{\alpha}}{\partial\theta}, \tag{36}$$
where we write $\bm{V} = (\bm{v}_1, \ldots, \bm{v}_T) \in \mathbb{R}^{MT} \cong (\mathbb{R}^M)^T$. Rather than explicitly inverting the huge Jacobian matrix $\frac{\partial\bm{R}_{\alpha}}{\partial\bm{z}_{1:T}}$, we instead solve the adjoint problem

$$\bm{V}^{\top}\frac{\partial\bm{R}_{\alpha}}{\partial\bm{z}_{1:T}} = -\frac{\partial\mathcal{L}}{\partial\bm{z}_{1:T}}. \tag{37}$$
To make this more explicit, we use

$$\frac{\partial\bm{r}_{\alpha,\bm{\theta}}(\bm{z}_{t-1}, \bm{z}_t)}{\partial\bm{z}_{t-1}} = -\frac{\partial\tilde{F}_{\bm{\theta},\alpha}}{\partial\bm{z}}(\bm{z}_{t-1}) \eqqcolon -\bm{J}_{\tilde{F}}(\bm{z}_{t-1}) \eqqcolon -\tilde{\bm{J}}_t \in \mathbb{R}^{M\times M}, \tag{38}$$
$$\frac{\partial\bm{r}_{\alpha,\bm{\theta}}(\bm{z}_{t-1}, \bm{z}_t)}{\partial\bm{z}_t} = \bm{I}_M, \tag{39}$$
and write

$$\frac{\partial\bm{R}_{\alpha}}{\partial\bm{z}_{1:T}} = \begin{pmatrix}\bm{I}_M & 0 & \cdots & \cdots & 0\\ -\tilde{\bm{J}}_2 & \bm{I}_M & 0 & \cdots & 0\\ 0 & -\tilde{\bm{J}}_3 & \bm{I}_M & \ddots & 0\\ \vdots & \vdots & \ddots & \ddots & 0\\ 0 & 0 & 0 & -\tilde{\bm{J}}_T & \bm{I}_M\end{pmatrix} \tag{40}$$
and

$$\frac{\partial\bm{R}_{\alpha}}{\partial\theta} = \frac{\partial\bm{R}_{\alpha}}{\partial\tilde{F}_{\bm{\theta},\alpha}}\frac{\partial\tilde{F}_{\bm{\theta},\alpha}}{\partial\theta} = -\frac{\partial\tilde{F}_{\bm{\theta},\alpha}}{\partial\theta}. \tag{41}$$
The block-bidiagonal structure of the matrix turns ([37](https://arxiv.org/html/2605.12683#A4.E37)) into a (backward) recursion in $t$:

$$\bm{v}_T^{\top} = -\frac{\partial\mathcal{L}}{\partial\bm{z}_T}, \qquad \bm{v}_{t-1}^{\top} = \bm{v}_t^{\top}\bm{J}_{\tilde{F}}(\bm{z}_{t-1}) - \frac{\partial\mathcal{L}}{\partial\bm{z}_{t-1}}, \tag{42}$$
where, substituting in the observation model $\hat{\bm{x}} = G_{\psi}(\bm{z})$,

$$\frac{\partial\mathcal{L}}{\partial\bm{z}_{t-1}} = \frac{\partial\mathcal{L}}{\partial\hat{\bm{x}}_{t-1}}\frac{\partial G_{\psi}}{\partial\bm{z}_{t-1}} = \frac{2}{NT}(\hat{\bm{x}}_{t-1} - \bm{x}_{t-1})^{\top}\frac{\partial G_{\psi}}{\partial\bm{z}_{t-1}}. \tag{43}$$
In the same way as in the forward pass, this recursion can be solved in parallel via an associative scan. Note that for quasi-DEER, the Jacobians $\bm{J}_{\tilde{F}}$ in ([42](https://arxiv.org/html/2605.12683#A4.E42)) are replaced by a diagonal approximation and the equation becomes

$$\bm{v}_T^{\top} = -\frac{\partial\mathcal{L}}{\partial\bm{z}_T}, \qquad \bm{v}_{t-1}^{\top} = \bigl(\bm{v}_t \odot \mathrm{diag}\bigl(\bm{J}_{\tilde{F}}(\bm{z}_{t-1})\bigr)\bigr)^{\top} - \frac{\partial\mathcal{L}}{\partial\bm{z}_{t-1}}, \tag{44}$$
with the element-wise product $\odot$.

In summary, the gradient of the loss is therefore obtained simply by solving the recursion ([42](https://arxiv.org/html/2605.12683#A4.E42)) using the final latent trajectory and observations computed via the forward pass, and inserting the result into

$$\frac{\partial\mathcal{L}}{\partial\theta} = -\bm{V}^{\top}\frac{\partial\tilde{F}_{\bm{\theta},\alpha}}{\partial\theta}. \tag{45}$$
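As a reference, here is a minimal sequential sketch of the backward recursion in Eq. (42); in practice the recursion is solved in parallel via an associative scan, and the array layout and names below are assumptions made for illustration.

```python
import jax.numpy as jnp

def adjoint_backward(J_at_z, dL_dz):
    """Sequential reference for the backward recursion of Eq. (42).

    J_at_z: (T, M, M), J_at_z[i] = J_{F~}(z_{i+1}), i.e. the Jacobian evaluated
            at the i-th latent state of the trajectory z_{1:T}
    dL_dz:  (T, M), per-step loss gradients dL/dz_{i+1}
    Returns V = (v_1, ..., v_T) as a (T, M) array.
    """
    T, M = dL_dz.shape
    v = jnp.zeros((T, M)).at[-1].set(-dL_dz[-1])               # v_T = -dL/dz_T
    for i in range(T - 1, 0, -1):
        # v_{t-1}^T = v_t^T J_{F~}(z_{t-1}) - dL/dz_{t-1}, with t = i + 1
        v = v.at[i - 1].set(v[i] @ J_at_z[i - 1] - dL_dz[i - 1])
    return v
```

Since each update is affine in $\bm{v}_t$, the same composition rule as in Eq. (27) applies in reverse, and the loop can be replaced by a reversed associative scan over the pairs $(\bm{J}_{\tilde{F}}(\bm{z}_{t-1})^{\top}, -\partial\mathcal{L}/\partial\bm{z}_{t-1})$.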

## Appendix E Model setup, loss and regularizations

##### Parameterizations

For both the RNN and LSSM we parameterize $\bm{A} = \operatorname{diag}(\tanh(\bar{\bm{A}}))$, where $\bar{\bm{A}} \in \mathbb{R}^M$. The nonlinearity keeps the diagonal entries inside $(-1,1)$ and thus avoids the instabilities in the $\bm{A}\bm{z}_{t-1}$ term that would arise for $\lVert\bm{A}\rVert_2 \geq 1$.

##### Initialization

To facilitate the capture of long time scales in the data, we initialize the RNN near the identity, so that it exhibits long time scales from the start:

$$\begin{aligned}
\bar{\bm{A}} &= \operatorname{artanh}(\kappa)\,\mathbf{I}_M\\
W_{ij} &\sim \mathcal{U}\bigl(-(1-\kappa)L^{-1/2},\ (1-\kappa)L^{-1/2}\bigr)\\
V_{ij} &\sim \mathcal{U}\bigl(-(1-\kappa)M^{-1/2},\ (1-\kappa)M^{-1/2}\bigr)\\
C_{ij} &\sim \mathcal{U}\bigl(-(1-\kappa)K^{-1/2},\ (1-\kappa)K^{-1/2}\bigr)\\
\bm{b} &= \bm{0}\\
\bm{h} &= \bm{0},
\end{aligned} \tag{46}$$
where $\mathcal{U}$ is the uniform distribution. In practice we use $\kappa = 0.9995$. For $M \geq N$, the observation matrix $\bm{B}$ is initialized to perform an identity read-out of the first $N$ units of the RNN, i.e. $\bm{B} = \begin{bmatrix}\bm{I}_N & \bm{0}_{N\times(M-N)}\end{bmatrix}$.
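A minimal sketch of this near-identity initialization in code, assuming the weight shapes $\bm{W}\in\mathbb{R}^{M\times L}$, $\bm{V}\in\mathbb{R}^{L\times M}$, $\bm{C}\in\mathbb{R}^{M\times K}$ implied by Eq. (5); the function name and parameter container are illustrative.

```python
import jax
import jax.numpy as jnp

def init_shplrnn_params(key, M, L, K, N, kappa=0.9995):
    """Near-identity initialization following Eq. (46) (names are illustrative)."""
    kW, kV, kC = jax.random.split(key, 3)
    s = 1.0 - kappa
    return dict(
        A_bar=jnp.arctanh(kappa) * jnp.ones(M),      # A = diag(tanh(A_bar)) ~ kappa * I
        W=jax.random.uniform(kW, (M, L), minval=-s / L**0.5, maxval=s / L**0.5),
        V=jax.random.uniform(kV, (L, M), minval=-s / M**0.5, maxval=s / M**0.5),
        C=jax.random.uniform(kC, (M, K), minval=-s / K**0.5, maxval=s / K**0.5),
        b=jnp.zeros(L),
        h=jnp.zeros(M),
        # Identity read-out of the first N latent units (requires M >= N).
        B=jnp.concatenate([jnp.eye(N), jnp.zeros((N, M - N))], axis=1),
    )
```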

For the LSSM, we follow a similar strategy:

$$\begin{aligned}
\bar{\bm{A}} &= \operatorname{artanh}(\kappa)\,\mathbf{I}_M\\
U_{ij} &\sim \mathcal{U}\bigl(-(1-\kappa)N^{-1/2},\ (1-\kappa)N^{-1/2}\bigr)\\
C_{ij} &\sim \mathcal{U}\bigl(-(1-\kappa)K^{-1/2},\ (1-\kappa)K^{-1/2}\bigr)\\
\bm{h} &= \bm{0},
\end{aligned} \tag{47}$$
and the observation model is initialized as

$$\begin{aligned}
B_{ij} &\sim \mathcal{U}\bigl(-(1-\kappa)L^{-1/2},\ (1-\kappa)L^{-1/2}\bigr)\\
V_{ij} &\sim \mathcal{U}\bigl(-(1-\kappa)M^{-1/2},\ (1-\kappa)M^{-1/2}\bigr)\\
\bm{b} &= \bm{0}.
\end{aligned} \tag{48}$$
For all experiments, we use $\phi(\cdot) = \mathrm{ReLU}(\cdot) = \max(0, \cdot)$.

##### Manifold attractor regularization (MAR)

For the nonlinear RNN introduced in Eq. ([5](https://arxiv.org/html/2605.12683#S3.E5)), we can encourage a slow manifold in the last $M_r$ units by adding the term [schmidt_identifying_2021]

$$\mathcal{L}_{\mathrm{MAR}}(\bm{\theta}_{\mathrm{RNN}}) = \frac{\lambda_{\mathrm{MAR}}}{M_r}\sum_{i=M-M_r+1}^{M}\left[\lvert 1-A_{ii}\rvert^p + \frac{1}{L}\sum_{j=1}^{L}\bigl(\lvert W_{ij}\rvert^p + \lvert V_{ji}\rvert^p\bigr) + \lvert h_i\rvert^p\right] \tag{49}$$
to the loss function, where $\lambda_{\mathrm{MAR}}$ is a regularization parameter and $p\in\{1,2\}$ determines the type of penalty. In practice, we found it beneficial to scale the regularization terms of the connectivity matrices $\bm{W}$ and $\bm{V}$ by their variance at initialization, i.e.

$$\mathcal{L}_{\mathrm{MAR}}(\bm{\theta}_{\mathrm{RNN}}) = \frac{\lambda_{\mathrm{MAR}}}{M_r}\sum_{i=M-M_r+1}^{M}\left[\lvert 1-A_{ii}\rvert^p + \frac{1}{L}\sum_{j=1}^{L}\bigl(\gamma_{\bm{W}}\lvert W_{ij}\rvert^p + \gamma_{\bm{V}}\lvert V_{ji}\rvert^p\bigr) + \lvert h_i\rvert^p\right], \tag{50}$$
where $\gamma_{\bm{W}} = \frac{1}{3L}$, $\gamma_{\bm{V}} = \frac{1}{3M}$. For $\lambda_{\mathrm{MAR}}\to\infty$, this regularization leads to stable error propagation by imposing a block-diagonal identity structure on the model Jacobian $\bm{J}_F$, as we demonstrate below.
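A minimal sketch of the scaled MAR penalty of Eq. (50); the argument names are illustrative, and the diagonal of $\bm{A}$ is passed directly.

```python
import jax.numpy as jnp

def mar_penalty(A_diag, W, V, h, M_r, lam_mar, p=2):
    """Scaled manifold-attractor regularization of Eq. (50).

    A_diag: (M,) diagonal of A,  W: (M, L),  V: (L, M),  h: (M,).
    """
    M, L = W.shape
    gamma_W, gamma_V = 1.0 / (3.0 * L), 1.0 / (3.0 * M)
    reg = slice(M - M_r, M)                                    # last M_r ("slow") units
    term_A = jnp.abs(1.0 - A_diag[reg]) ** p                   # pull a_ii -> 1
    term_W = gamma_W / L * jnp.sum(jnp.abs(W[reg, :]) ** p, axis=1)
    term_V = gamma_V / L * jnp.sum(jnp.abs(V[:, reg]) ** p, axis=0)
    term_h = jnp.abs(h[reg]) ** p                              # pull bias -> 0
    return lam_mar / M_r * jnp.sum(term_A + term_W + term_V + term_h)
```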

Recall the nonlinear RNN from Eq. ([5](https://arxiv.org/html/2605.12683#S3.E5)) with $\bm{A} = \mathrm{diag}(a_1,\ldots,a_M)$:

$$\bm{z}_t = \bm{A}\bm{z}_{t-1} + \bm{W}\phi(\bm{V}\bm{z}_{t-1}+\bm{b}) + \bm{C}\bm{s}_t + \bm{h}. \tag{51}$$
Its Jacobian is

$$\bm{J}_F(\bm{z}_{t-1}) = \bm{A} + \bm{W}\,\mathrm{diag}\bigl(\phi'(\bm{V}\bm{z}_{t-1}+\bm{b})\bigr)\bm{V}. \tag{52}$$
Let $M_s := M - M_r$ and partition $\bm{z} = (\bm{z}^{(s)}, \bm{z}^{(r)})^{\top}$ into "fast" (unregularized) and "slow" (regularized) units. The diagonal $\bm{A}$ and the nonlinear coupling term decompose as

$$\bm{A} = \begin{pmatrix}\bm{A}_s & \bm{0}\\ \bm{0} & \bm{A}_r\end{pmatrix}, \qquad \bm{W} = \begin{pmatrix}\bm{W}_s\\ \bm{W}_r\end{pmatrix}, \qquad \bm{V} = \begin{pmatrix}\bm{V}_s & \bm{V}_r\end{pmatrix}. \tag{53}$$
As $\lambda\to\infty$, the MAR penalty, Eq. ([49](https://arxiv.org/html/2605.12683#A5.E49)), drives

$$\bm{A}_r \to \bm{I}_{M_r}, \qquad \bm{W}_r \to \bm{0}, \qquad \bm{V}_r \to \bm{0}. \tag{54}$$
Writing $\bm{D}_s(\bm{z}_{t-1}) := \mathrm{diag}\bigl(\phi'(\bm{V}_s\bm{z}^{(s)}_{t-1}+\bm{b})\bigr)$, the limiting Jacobian is block-diagonal:

$$\bm{J}_F(\bm{z}_{t-1}) \;\xrightarrow{\lambda\to\infty}\; \begin{pmatrix}\bm{A}_s + \bm{W}_s\bm{D}_s(\bm{z}_{t-1})\bm{V}_s & \bm{0}\\ \bm{0} & \bm{I}_{M_r}\end{pmatrix}. \tag{55}$$
The latent space decouples into a nonlinear subsystem acting on $\bm{z}^{(s)}$ and a manifold-attractor subsystem on $\bm{z}^{(r)}$ with identity dynamics. Consequently, the Jacobian product for a length-$t$ sequence and $\lambda\to\infty$ satisfies

$$\prod_{k=0}^{t-1}\bm{J}_F(\bm{z}_{t-1-k}) = \begin{pmatrix}\prod_{k=0}^{t-1}\bigl[\bm{A}_s + \bm{W}_s\bm{D}_s(\bm{z}_{t-1-k})\bm{V}_s\bigr] & \bm{0}\\ \bm{0} & \bm{I}_{M_r}\end{pmatrix}, \tag{56}$$
where error signals propagate without decay or amplification along the $M_r$ directions, enabling stable capture of long-term dependencies.

##### Observation model regularization

In addition to MAR, we reduce the direct contribution of the $M_r$ MAR units to the read-out, and hence discourage them from being forced through Eq. ([8](https://arxiv.org/html/2605.12683#S3.E8)), by

$$\mathcal{L}_1(\bm{B}) = \frac{\lambda_1}{N\cdot M_r}\sum_{i=1}^{N}\sum_{j=M-M_r}^{M}\lvert B_{ij}\rvert^p, \tag{57}$$
where $\bm{B}$ is the readout matrix (cf. Eq. ([5](https://arxiv.org/html/2605.12683#S3.E5))). We also follow [hess_generalized_2023] and regularize $\bm{B}$ to stay well-conditioned by pulling its singular values towards $1$:

$$\mathcal{L}_2(\bm{B}) = \frac{\lambda_2}{r}\sum_{i=1}^{r}\bigl(\sigma_i(\bm{B})-1\bigr)^2, \tag{58}$$
where $r = \operatorname{rank}(\bm{B})$ and $\sigma_i(\bm{B})$ denotes the $i$-th singular value of $\bm{B}$.
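A minimal sketch of the two read-out penalties, Eqs. (57)-(58); for simplicity all $\min(N,M)$ singular values are penalized, which coincides with Eq. (58) whenever $\bm{B}$ has full rank.

```python
import jax.numpy as jnp

def readout_penalties(B, M_r, lam1, lam2, p=2):
    """Read-out regularizers, Eqs. (57)-(58), on the observation matrix B (N x M)."""
    N, M = B.shape
    # Eq. (57): discourage read-out from the last M_r (MAR) latent units.
    L1 = lam1 / (N * M_r) * jnp.sum(jnp.abs(B[:, M - M_r:]) ** p)
    # Eq. (58): pull singular values towards 1 (using all min(N, M) values,
    # which equals the rank-restricted sum when B has full rank).
    sv = jnp.linalg.svd(B, compute_uv=False)
    L2 = lam2 / sv.shape[0] * jnp.sum((sv - 1.0) ** 2)
    return L1 + L2
```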

##### Total loss

The overall loss function used to train the nonlinear RNN defined in Eq. ([5](https://arxiv.org/html/2605.12683#S3.E5)) is given by

$$\mathcal{L}(\bm{x}_{T_w+1:T}, \hat{\bm{x}}_{T_w+1:T};\ \bm{\theta}_{\mathrm{RNN}}, \bm{B}) = \mathcal{L}_{\mathrm{MSE}}(\bm{x}_{T_w+1:T}, \hat{\bm{x}}_{T_w+1:T}) + \mathcal{L}_{\mathrm{MAR}}(\bm{\theta}_{\mathrm{RNN}}) + \mathcal{L}_1(\bm{B}) + \mathcal{L}_2(\bm{B}), \tag{59}$$
with

$$\mathcal{L}_{\mathrm{MSE}}(\bm{x}_{T_w+1:T}, \hat{\bm{x}}_{T_w+1:T}) = \frac{1}{N(T-T_w)}\sum_{t=T_w+1}^{T}\bigl\lVert\bm{x}_t - \hat{\bm{x}}_t\bigr\rVert_2^2. \tag{60}$$

##### MAR for the LSSM

Application of MAR to LSSMs (Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3))) is straightforward with

$$\mathcal{L}_{\mathrm{MAR}}(\bm{\theta}_{\mathrm{LSSM}}) = \frac{\lambda_{\mathrm{MAR}}}{M_r}\sum_{i=M-M_r+1}^{M}\left[\lvert 1-A_{ii}\rvert^p + \frac{1}{N}\sum_{j=1}^{N}\lvert U_{ij}\rvert^p + \lvert h_i\rvert^p\right]. \tag{61}$$

## Appendix F Datasets

##### Lorenz-63

Our first benchmark system is the three-dimensional Lorenz-63 system [lorenz_deterministic_1963]. Its dynamics are described by the differential equations

$$\begin{aligned}
\frac{dx}{dt} &= \sigma(y-x),\\
\frac{dy}{dt} &= x(\rho-z) - y,\\
\frac{dz}{dt} &= xy - \beta z.
\end{aligned} \tag{62}$$
We chose the classical parameters $\sigma=10$, $\rho=28$, $\beta=8/3$, which put the system into a chaotic regime. We used a Runge-Kutta 4/5 scheme with $\Delta t = 0.01$ and integrated a trajectory of length $100{,}000$ time steps from an initial point $\bm{x}_0$. The trajectory was standardized per dynamical variable after generation.
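A minimal sketch of the Lorenz-63 data generation, assuming an adaptive RK45 integrator (SciPy's `solve_ivp`) sampled every $\Delta t = 0.01$ and an arbitrary initial point; the exact integrator settings used in the paper may differ.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz63(t, x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Vector field of Eq. (62).
    return [sigma * (x[1] - x[0]),
            x[0] * (rho - x[2]) - x[1],
            x[0] * x[1] - beta * x[2]]

def generate_lorenz63(T=100_000, dt=0.01, x0=(1.0, 1.0, 1.0)):
    """Integrate the system, sample every dt, and standardize per variable."""
    t_eval = np.arange(1, T + 1) * dt
    sol = solve_ivp(lorenz63, (0.0, t_eval[-1]), np.asarray(x0, dtype=float),
                    method="RK45", t_eval=t_eval, rtol=1e-9, atol=1e-9)
    traj = sol.y.T                                    # shape (T, 3)
    return (traj - traj.mean(0)) / traj.std(0)        # standardize per variable
```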

##### Forced Lorenz-96

As a benchmark for long time scales, we equipped the vanilla Lorenz-96 equation [lorenz_predictability_1996] with a sinusoidal forcing term:

$$\frac{dx_i}{dt} = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F_0 + A\sin(\omega t), \qquad i = 1, \ldots, N, \tag{63}$$
where $F_0$ is a constant forcing offset, $A$ is the amplitude and $\omega$ the frequency of the sinusoidal forcing. We set $F_0 = 14$, $A = 12$, $\omega = 2\pi/75$ and $N = 6$. The autonomous Lorenz-96 system (i.e. $A=0$) is chaotic for $F \geq 8$, such that the sinusoidal forcing leads to a periodic switching between cyclic and chaotic dynamics (see Fig. [A1](https://arxiv.org/html/2605.12683#A9.F1)). For integration we used a Runge-Kutta 4/5 scheme with $\Delta t = 5\cdot 10^{-3}$ and integrated two trajectories of length $500$ internal time units from different initial $t$ and $\bm{x}_0$, leading to two trajectories of length $100{,}000$ time steps. One trajectory is contaminated with $5\%$ Gaussian observation noise and used for training, while the other trajectory is kept clean and used for testing. Both trajectories are standardized per dynamical variable after generation.
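A minimal sketch of the forced Lorenz-96 right-hand side of Eq. (63), assuming the usual cyclic boundary conditions; it can be integrated with the same `solve_ivp` pattern as above.

```python
import numpy as np

def forced_lorenz96(t, x, F0=14.0, A=12.0, omega=2.0 * np.pi / 75.0):
    """Right-hand side of Eq. (63) with cyclic indexing (here N = len(x) = 6)."""
    N = len(x)
    dx = np.empty(N)
    forcing = F0 + A * np.sin(omega * t)
    for i in range(N):
        dx[i] = (x[(i + 1) % N] - x[(i - 2) % N]) * x[(i - 1) % N] - x[i] + forcing
    return dx
```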

##### Bursting neuron model

We used a three-dimensional simplified Hodgkin-Huxley-type neuron model; its dynamical variables are the membrane potential $V$ as well as two gating variables $n$ and $h$, which control the opening of fast and slow potassium channels, respectively [durstewitz_implications_2009, schmidt_identifying_2021]:

$$\begin{aligned}
\dot{V} &= \frac{1}{C}\Bigl[I - g_L(V-E_L) - g_{\mathrm{Na}}\,m_{\infty}(V)\,(V-E_{\mathrm{Na}}) - g_K\,n\,(V-E_K)\\
&\hphantom{=\frac{1}{C}\Bigl[}\; - g_M\,h\,(V-E_K) - g_{\mathrm{NMDA}}\,s_{\infty}(V)\,(V-E_{\mathrm{NMDA}})\Bigr],\\
\dot{n} &= \frac{n_{\infty}(V)-n}{\tau_n},\\
\dot{h} &= \frac{h_{\infty}(V)-h}{\tau_h},
\end{aligned} \tag{64}$$
with

$$\begin{aligned}
m_{\infty}(V) &= \frac{1}{1+\exp\bigl((V_{h,\mathrm{Na}}-V)/k_{\mathrm{Na}}\bigr)}, &\qquad h_{\infty}(V) &= \frac{1}{1+\exp\bigl((V_{h,M}-V)/k_M\bigr)},\\
n_{\infty}(V) &= \frac{1}{1+\exp\bigl((V_{h,K}-V)/k_K\bigr)}, &\qquad s_{\infty}(V) &= \frac{1}{1+0.33\,\exp\bigl(-0.0625\,V\bigr)}.
\end{aligned} \tag{65}$$
The model parameters we used for our experiments are reported in Table [A1](https://arxiv.org/html/2605.12683#A6.T1). Similar to the Lorenz-96 system, we used a Runge-Kutta 4/5 scheme with $\Delta t = 2.5\cdot 10^{-2}$ and integrated two trajectories of length $5000$ internal time units from different initial conditions, from which we cut $1000$ time units as transients, leading to two trajectories of length $160{,}000$ discrete time steps. One trajectory is contaminated with $5\%$ Gaussian observation noise and used for training, while the other trajectory is kept clean and used for testing; both trajectories are again standardized per dynamical variable after generation. Moreover, to make reconstruction more challenging, we discard the $h$ variable after generation, such that the system is partially observed with $N=2$. For an example reconstruction, see Fig. [3](https://arxiv.org/html/2605.12683#S5.F3).

Table A1: Neuron model parameter settings

## Appendix G Evaluation measures

##### State space divergence $D_{\mathrm{stsp}}$

The state space divergence $D_{\mathrm{stsp}}$ measures the geometrical disagreement between the state distribution of the data, $p(\bm{x})$, and that of model-generated trajectories, $q(\bm{x})$, by evaluating the Kullback-Leibler (KL) divergence:

$$D_{\mathrm{stsp}} := D_{\mathrm{KL}}\bigl(p(\bm{x})\,\|\,q(\bm{x})\bigr) = \int_{\bm{x}\in\mathbb{R}^N} p(\bm{x})\log\frac{p(\bm{x})}{q(\bm{x})}\,d\bm{x}. \tag{66}$$
In practice, we estimate $p(\bm{x})$ and $q(\bm{x})$ by placing Gaussian Mixture Models (GMMs) along orbits [koppe_identifying_2019, brenner_tractable_2022], i.e.

$$\begin{aligned}
\hat{p}(\bm{x}) &= \frac{1}{T_1}\sum_{t=1}^{T_1}\mathcal{N}(\bm{x};\ \bm{x}_t, \bm{\Sigma}),\\
\hat{q}(\bm{x}) &= \frac{1}{T_2}\sum_{t=1}^{T_2}\mathcal{N}(\bm{x};\ \hat{\bm{x}}_t, \bm{\Sigma}),
\end{aligned} \tag{67}$$
where $\bm{x}_{1:T_1}$ is the ground-truth data of length $T_1$, $\hat{\bm{x}}_{1:T_2}$ is a model-generated trajectory of length $T_2$, and $\mathcal{N}(\bm{x};\ \bm{x}_t, \bm{\Sigma})$ is a multivariate Gaussian with mean vector $\bm{x}_t$ and covariance matrix $\bm{\Sigma} = \mathrm{diag}([\sigma_1^2, \ldots, \sigma_N^2])$. The KL divergence between two GMMs can be approximated using a Monte Carlo approach [Hershey2007ApproximatingTK]:

$$D_{\mathrm{stsp}} = D_{\mathrm{KL}}\bigl(\hat{p}(\bm{x})\,\|\,\hat{q}(\bm{x})\bigr) \approx \frac{1}{n}\sum_{i=1}^{n}\log\frac{\hat{p}(\bm{x}^{(i)})}{\hat{q}(\bm{x}^{(i)})}, \tag{68}$$
with $n$ Monte Carlo samples $\bm{x}^{(i)}$ randomly drawn from the GMM fitted to the observed orbits, $\hat{p}$. When the system is only partially observed and the true dynamics unfold in $d > N$ dimensions, the attractor first has to be unfolded to apply $D_{\mathrm{stsp}}$ correctly [botvinick2025invariant]. To this end we first apply a delay embedding (DE) [takens_detecting_1981, sauer_embedology_1991] of the observed and model-generated orbits,

$$\bm{\xi}_t = \bigl[\bm{x}_t, \bm{x}_{t-\tau}, \bm{x}_{t-2\tau}, \ldots, \bm{x}_{t-(m-1)\tau}\bigr]^{\top} \in \mathbb{R}^{mN}, \tag{69}$$
with delay $\tau$ and embedding dimension $m$, where the same construction is performed for $\hat{\bm{\xi}}_t$. We then fit GMMs and calculate $D_{\mathrm{stsp}}$ in embedding space, i.e.

$$D^{\mathrm{DE}}_{\mathrm{stsp}} = D_{\mathrm{KL}}\bigl(\hat{p}(\bm{\xi})\,\|\,\hat{q}(\bm{\xi})\bigr) \approx \frac{1}{n}\sum_{i=1}^{n}\log\frac{\hat{p}(\bm{\xi}^{(i)})}{\hat{q}(\bm{\xi}^{(i)})}. \tag{70}$$
For all experiments, we use $n = 10^6$ and set the bandwidth $\bm{\Sigma}$ of the Gaussian components via $\bm{\Sigma} = f_{bw}\cdot\mathrm{diag}([\hat{\sigma}_1^2, \ldots, \hat{\sigma}_d^2])$, where $f_{bw} = \left(\frac{T_1(d+2)}{4}\right)^{-1/(d+4)}$ is a bandwidth factor chosen according to Silverman's rule of thumb [silverman2018density] and $\hat{\sigma}_i$ is the empirical standard deviation of $x_{i,1:T_1}$. Furthermore, to probe long-term consistency beyond the available data, we generate model orbits of length $T_2 = 3\cdot T_1$. This avoids low $D_{\mathrm{stsp}}$ values when the model only *transiently* generates the correct long-term behavior.
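A minimal sketch of the Monte-Carlo $D_{\mathrm{stsp}}$ estimator of Eq. (68) with the Silverman-type bandwidth described above; names are illustrative, the default sample count is reduced (the paper uses $n=10^6$), and for long orbits the GMM evaluation is chunked to limit memory.

```python
import numpy as np

def d_stsp(x_true, x_gen, n_samples=10_000, rng=None):
    """Monte-Carlo estimate of Eq. (68) between two diagonal-bandwidth GMMs.

    x_true: (T1, d) observed (or delay-embedded) orbit; x_gen: (T2, d) generated orbit.
    """
    rng = np.random.default_rng() if rng is None else rng
    T1, d = x_true.shape
    f_bw = (T1 * (d + 2) / 4.0) ** (-1.0 / (d + 4))     # Silverman's rule of thumb
    var = f_bw * x_true.var(0)                          # diagonal bandwidth Sigma

    def log_gmm(q, centers, chunk=64):
        # log of (1/T) sum_t N(q; center_t, diag(var)), evaluated chunk-wise
        out = np.empty(q.shape[0])
        log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * var)) - np.log(centers.shape[0])
        for s in range(0, q.shape[0], chunk):
            diff = q[s:s + chunk, None, :] - centers[None, :, :]
            log_comp = -0.5 * np.sum(diff ** 2 / var, axis=-1)
            m = log_comp.max(axis=-1)
            out[s:s + chunk] = m + np.log(np.exp(log_comp - m[:, None]).sum(axis=-1))
        return out + log_norm

    # Draw samples from the GMM fitted to the observed orbit (p-hat in Eq. (67)).
    idx = rng.integers(0, T1, size=n_samples)
    samples = x_true[idx] + rng.normal(size=(n_samples, d)) * np.sqrt(var)
    return np.mean(log_gmm(samples, x_true) - log_gmm(samples, x_gen))
```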

Since the forced Lorenz-96 (Appx. [F](https://arxiv.org/html/2605.12683#A6)) and bursting neuron data sets are only partially observed, we used the settings $m=3$, $\tau=4{,}096$ for the Lorenz-96 and $m=7$, $\tau=1{,}024$ for the bursting neuron. We explicitly used large delays to resolve the long time scales present in the data. For the partially observed Lorenz-63 (see Sect. [5.2](https://arxiv.org/html/2605.12683#S5.SS2)), we used $m=3$, $\tau=10$.

##### Short-term RMSE

The $n$-step-ahead RMSE assesses short- to medium-length forecasts and is defined as

$$\mathrm{RMSE}(n) = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\bigl\lVert\bm{x}_k - G_{\bm{\psi}}\bigl(F_{\bm{\theta}}^{\circ k}(\bm{z}_0)\bigr)\bigr\rVert^2}, \tag{71}$$
where $\bm{x}_{1:n} := \bm{x}_{t_i:t_i+n-1}$ is a window starting at index $t_i$ of the total ground-truth data $\bm{x}_{1:T_{\mathrm{tot}}}$. The initial condition $\bm{z}_0$ is obtained by performing a warm-up using a history of length $T_w$, $\bm{x}_{t_i-T_w:t_i-1}$ (see Sect. [4.2](https://arxiv.org/html/2605.12683#S4.SS2) for details on the warm-up). In practice we averaged the RMSE over $100$ windows from different initial conditions.
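A minimal sketch of the $n$-step-ahead RMSE of Eq. (71) for a single window; `step_fn` and `readout_fn` stand in for $F_{\bm{\theta}}$ and $G_{\bm{\psi}}$, and averaging over the 100 windows is left to the caller.

```python
import numpy as np

def n_step_rmse(x_window, z0, step_fn, readout_fn, n=128):
    """n-step-ahead RMSE of Eq. (71) for one window.

    x_window: (n, N) ground-truth window starting at t_i,
    z0: warm-started latent state, step_fn: z -> F_theta(z), readout_fn: z -> G_psi(z).
    """
    se, z = 0.0, z0
    for k in range(n):
        z = step_fn(z)                                # k-fold application of F_theta
        se += np.sum((x_window[k] - readout_fn(z)) ** 2)
    return np.sqrt(se / n)
```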

## Appendix H Additional experimental details

This appendix lists specific details of the experiments performed in the main paper. All experiments were performed on a compute server with 2x 96-core AMD EPYC 9655 CPUs, 768 GB RAM, and 6x NVIDIA RTX Pro 6000 Blackwell (96 GB) GPUs; however, individual trainings were always performed on a single GPU at a time.

### H.1 Figure [1](https://arxiv.org/html/2605.12683#S5.F1)

##### Fig. 1A

We trained shPLRNNs (Eq. ([5](https://arxiv.org/html/2605.12683#S3.E5)), $L=50$) of varying latent dimension $M \in \{4, 16, 64, 128\}$ on the Lorenz-63 system using GTF-DEER. The batch size was fixed at $B=1$, while the sequence length was varied in powers of 2, i.e. $T \in \{128, 256, \ldots, 32{,}768\}$, and we used no warm-up for this experiment ($T_w = 0$). We measured the time of combined forward+backward passes throughout training. Training was performed for $10{,}000$ updates (samples). We chose to explicitly train a model to account for the fact that GTF-DEER's convergence depends on the dynamics of the model, giving us an estimate of the spread of runtimes throughout training. Indeed, for the chosen $\alpha = 0.15$, GTF-DEER's convergence stayed consistent throughout training (see also Fig. [2](https://arxiv.org/html/2605.12683#S5.F2)B). The sequential baseline runtime is independent of the specific dynamics, and hence we only measured the time of trained models using $N_s = 500$ samples. In code, the only difference between the sequential and parallel implementation is the call to the forward-pass solver, which is a straightforward `jax.lax.scan` for the sequential case and the `seq1d` (DEER solver) function for the parallel one, respectively. Both methods parallelize over the batch size using `jax.vmap`. All data were gathered on an NVIDIA RTX 6000 Blackwell (96 GB) GPU.

##### Fig. 1B

For this figure we used a shPLRNN (varying $M$, fixed $L=50$) that was trained on the full Lorenz-63 system ($N=3$). We performed forward passes for different values of the forcing strength $\alpha \in [0,1]$. The maximum number of Newton iterations was capped at $n_{\mathrm{iter}} = 500$: in practice, GTF-DEER loses its competitive performance over sequential evaluation even for moderate numbers of Newton iterations, such that $500$ is already a generous upper bound for what can be considered a useful regime. The different curves correspond to different initial-guess strategies: 'Zero' initializes $\bm{z}^{(0)}_{1:T} = \bm{0}_{T\times M}$; '$\mathcal{N}(0,1)$' draws entries from a standard normal distribution, $z^{(0)}_{i,t} \sim \mathcal{N}(0,1)$; and 'P-inv' uses $\bm{z}^{(0)}_{1:T} = \bm{B}^{+}\bm{x}_{1:T}$ to initialize the Newton iterations.

### H.2 Figure [2](https://arxiv.org/html/2605.12683#S5.F2)

For this experiment we trained shPLRNNs ($M=5$, $L=50$) on the Lorenz-63 system under different conditions. We trained on the full Lorenz-63 system ($N=3$) as well as on the partially observed one ($N=1$, $x$-component). Furthermore, we switched between GTF-DEER with full Jacobians (Eq. ([13](https://arxiv.org/html/2605.12683#S4.E13))) and diagonalized ones (quasi-DEER, $\bm{J}_F \rightarrow \operatorname{diag}(\bm{J}_F)$). We trained $20$ models per setting for $20{,}000$ parameter updates. The forcing strength was $\alpha = 0.15$.

### H.3 Figure [3](https://arxiv.org/html/2605.12683#S5.F3)A

For Fig. [3](https://arxiv.org/html/2605.12683#S5.F3)A we trained shPLRNNs with $M=10$ and $L=128$ on the forced Lorenz-96 system, and with $M=6$ for the bursting neuron model (see Appx. [F](https://arxiv.org/html/2605.12683#A6)). To probe the effect of increasing sequence length, we trained the models on sequence lengths $T \in \{256, 512, 1{,}024, 2{,}048, 4{,}096, 8{,}192, 16{,}384, 32{,}768\}$, where for each setting we used $T_w = T/2$. To keep the amount of data parsed by the model per parameter update the same, we adjusted the batch size accordingly, such that the product $B\cdot T = 2^{15} = 32{,}768$ stayed fixed. This resulted in the settings $(B,T) = (256, 128)$ and $(B,T) = (1, 32{,}768)$ for the lowest and highest sequence lengths used, respectively. For each data point in the figure, we trained $10$ independent models. For the Lorenz-96 we used a forcing strength of $\alpha = 0.08$, for the bursting neuron model $\alpha = 0.4$.

### H.4 Figure [3](https://arxiv.org/html/2605.12683#S5.F3)B

#### H.4.1 Mamba-2

As a state-of-the-art SSM baseline, we compare the LSSM (Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3))) and the GTF-DEER-trained shPLRNN (Eq. ([5](https://arxiv.org/html/2605.12683#S3.E5))) to a model in which the linear recurrence of the LSSM is replaced by a single Mamba-2 block [dao2024transformers], while keeping the rest of the architecture, in particular the one-hidden-layer MLP read-out from Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3)), identical. We refer to this configuration as Mamba-2-$n_p$, where $n_p$ is the number of parameters. For the non-recurrence machinery of the Mamba-2 block (input projections, depth-wise causal convolution, gating branch, multi-head setup, output projection) we refer the reader to the original work [dao2024transformers]. In the following, we clarify what we consider the most important distinction from the LSSM introduced in Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3)).

##### Selective SSM recurrence

The main *functional* difference of the Mamba recurrence is the introduction of a *selection mechanism* [gu2024mamba, dao2024transformers], which renders the model non-autonomous even in the absence of external inputs $\bm{s}_t$. Dropping $\bm{s}_t$ for brevity and using the LSSM (Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3))) as a template, the core recurrence inside the Mamba-2 block can be written as

$$\bm{z}_t = \bar{\bm{A}}_t(\bm{x}_{t-1})\,\bm{z}_{t-1} + \bar{\bm{U}}_t(\bm{x}_{t-1})\,\bm{x}_{t-1}, \tag{72}$$
i.e., it is *structurally* identical to the LSSM but with $\bar{\bm{A}}_t$ and $\bar{\bm{U}}_t$ now depending on the input $\bm{x}_{t-1}$. Specifically, the parameterizations of $\bar{\bm{A}}_t$ and $\bar{\bm{U}}_t$ are given by

$$\begin{aligned}
\Delta_t &= \mathrm{softplus}(\bm{W}_{\Delta}\bm{x}_{t-1} + \bm{b}_{\Delta}) \in \mathbb{R},\\
\bm{U}_t &= \bm{W}_U\bm{x}_{t-1},\\
\bar{\bm{A}}_t &= \exp(\Delta_t\,\bm{A}),\\
\bar{\bm{U}}_t &= \Delta_t\,\bm{U}_t,
\end{aligned} \tag{73}$$
where $\bm{A}$ is a *fixed*, learnable parameter which for Mamba-2 is restricted to $\bm{A} = a\,\bm{I}$ for a single learned scalar $a < 0$ per head. The recurrence parameters thus become functions of the input through the data-dependent scalar $\Delta_t$ and the data-dependent input vector $\bm{U}_t$, while $\bm{A}$ itself remains time-invariant. Setting $\Delta_t = \Delta$ and $\bm{U}_t = \bm{U}$ recovers a fixed-parameter linear recurrence equivalent to the LSSM (Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3))).

For clarity, Eqs. ([72](https://arxiv.org/html/2605.12683#A8.E72)) & ([73](https://arxiv.org/html/2605.12683#A8.E73)) describe the Mamba-2 recurrence at the level of abstraction relevant for comparison with the LSSM. In practice, a Mamba-2 block does not operate directly on $\bm{x}_{t-1}$: the input is first lifted to an expanded representation of dimension $D = E\cdot N$ via a linear projection, where $E$ is the expansion factor, then split into $H$ heads, and each head runs an independent SSM with its own data-dependent parameters $(\Delta_t, \bm{U}_t)$ shared across the channels within that head, while $\bm{A}$ is a learned scalar per head. Channel mixing across the observation dimensions therefore occurs in the surrounding input/output projections rather than inside the recurrence itself. We refer the reader to [dao2024transformers] for the full block specification.
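A minimal, per-head sketch of one step of the selective recurrence in Eqs. (72)-(73); the surrounding projections, convolution, gating and multi-head machinery of the full Mamba-2 block are omitted, and the input pathway is collapsed into a single data-dependent vector for brevity, so this is an abstraction rather than the exact Mamba-2 parameterization.

```python
import jax
import jax.numpy as jnp

def selective_ssm_step(z_prev, x_prev, a, W_delta, b_delta, W_U):
    """One step of the selective recurrence, in the spirit of Eqs. (72)-(73).

    z_prev: (M,) latent state of one head, x_prev: (N,) input at t-1,
    a: fixed learned scalar (a < 0), W_delta: (N,), b_delta: scalar, W_U: (M, N).
    Only Delta_t and the input pathway depend on the data; A = a*I is time-invariant.
    """
    delta = jax.nn.softplus(W_delta @ x_prev + b_delta)   # data-dependent step size Delta_t
    A_bar = jnp.exp(delta * a)                            # bar{A}_t = exp(Delta_t A), scalar since A = a I
    u_bar = delta * (W_U @ x_prev)                        # data-dependent input contribution
    return A_bar * z_prev + u_bar
```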

##### Relation to the LSSM

Crucially, the recurrence in Eq. ([72](https://arxiv.org/html/2605.12683#A8.E72)) is still *linear in the latent state* $\bm{z}_{t-1}$ for any fixed value of $\bm{x}_{t-1}$, which is what permits parallelization via parallel scan during training [dao2024transformers]. At the same time, the selection mechanism circumvents the $\mathrm{rank} \leq \min(N, M, L)$ bottleneck on the effective test-time recurrence (Eq. ([4](https://arxiv.org/html/2605.12683#S3.E4))) discussed in Sect. [3.1](https://arxiv.org/html/2605.12683#S3.SS1): because $\bm{U}_t$ varies with $t$, the model is no longer constrained to a single fixed product $\overline{\bm{W}} = \bm{U}\bm{B}$ of fixed rank, but can in principle realize a different effective input pathway at every step.

#### H.4.2 Hyperparameters

##### Optimization settings

For the experiments in Sect. [5.4](https://arxiv.org/html/2605.12683#S5.SS4), all models were trained with the same settings: batch size $B=1$, sequence length $T = 81{,}920$ with warm-up $T_w = 16{,}384$, and a learning-rate decay from $\eta_s = 5\cdot 10^{-5}$ to $10^{-6}$ using a cosine decay schedule [loshchilov2017sgdr]. Training was performed for $150{,}000$ parameter updates.

##### Mamba-2

For the Mamba-2 models, we evaluated two parameter budgets, denoted Mamba-2-7k with $7{,}038$ and Mamba-2-17k with $16{,}830$ parameters, respectively. The former is matched as closely as possible to the shPLRNN/LSSM budget ($\approx 2{,}200$–$2{,}900$ parameters) for direct comparison at fixed capacity without overly limiting Mamba-2's expressiveness; the latter probes whether additional capacity closes the gap. Both use a single Mamba-2 block with hidden dimension $D=16$ (`d_model`) for Mamba-2-7k and $D=32$ for Mamba-2-17k. All other parameters stay the same: state size $N=16$ (`d_state`), `d_conv`$=4$ and `expand`$=2$. Following the Mamba-2 block is the same MLP read-out as used in the LSSM (Eq. ([3](https://arxiv.org/html/2605.12683#S3.E3))) with $L=128$. We used the Mamba-2 implementation from the original code repository [https://github.com/state-spaces/mamba](https://github.com/state-spaces/mamba), licensed under the Apache License, Version 2.0.

##### RNN

The RNN used $M=10$, $L=128$ and $M_r=4$. We used $\lambda_{\mathrm{MAR}}=1$, $\lambda_1=1$ and $\lambda_2=10^{-4}$; however, we found that the regularization of the singular values of $\bm{B}$ did not have a large impact on training convergence or reconstruction performance. The forcing strength was $\alpha = 0.08$.

##### LSSM

For the LSSM we kept the core settings equal to those used for the RNN, i.e. $M=10$, $L=128$. Additionally, we applied MAR to the LSSM (see Appx. [E](https://arxiv.org/html/2605.12683#A5)), where we scanned $\lambda_{\mathrm{MAR}} \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$ but found that, while $\lambda_{\mathrm{MAR}}=10$ degraded performance, all other settings produced similar results. We hence report $\lambda_{\mathrm{MAR}}=1$ to make clear that MAR does not improve reconstruction for the LSSM. For this setting we also used $M_r=4$, mirroring the RNN.

## Appendix I Additional figures

![Figure A1](https://arxiv.org/html/2605.12683v1/x4.png)

Figure A1: Generated dynamics of a shPLRNN ($M=10$, $L=128$) trained on the forced Lorenz-96 system. A: Access to long sequences during training enables the model to capture the latent long-period sinusoidal forcing $F(t)$ in the MAR-regularized units, and hence to produce accurate long-term rollouts. B: A model trained on sequences that are short relative to the intrinsic time scales of the data fails to capture and reproduce the latent forcing during autoregressive roll-outs.

![Figure A2](https://arxiv.org/html/2605.12683v1/figures/FigA2.png)

Figure A2: Generated dynamics of $3\times$ the length of the available ground-truth data from models evaluated in Sect. [5.4](https://arxiv.org/html/2605.12683#S5.SS4). LSSMs and shPLRNNs with their rank limited to the number of observed variables, $r=N=6$, fail to learn the latent sinusoidal forcing. Increasing the rank to $7$ in the shPLRNN enables the model to learn the underlying forcing and to produce convincing limiting behavior; however, the model still struggles to infer the correct phase from the context. Even though the Mamba-2-based model can circumvent the low-rank constraint due to its data-dependent recurrence weights, it does not match the performance of the shPLRNNs trained with GTF-DEER. Each generated trace was produced using the best model (out of $10$) per configuration.

![Figure A3](https://arxiv.org/html/2605.12683v1/figures/FigA3.png)

Figure A3: $\mathrm{RMSE}(128)$ as a function of sequence length for shPLRNNs trained on the forced Lorenz-96 and bursting neuron systems, complementary to the $D_{\mathrm{stsp}}$ plot in Fig. [3](https://arxiv.org/html/2605.12683#S5.F3). Median $\pm$ MAD, $10$ runs per setting.
## Appendix J Training algorithms

**Algorithm 1** Linear SSM training (using parallel associative scan)

1. **Input:** Data $\bm{X}\in\mathbb{R}^{T_{\mathrm{obs}}\times N}$, optional external inputs $\bm{S}\in\mathbb{R}^{T_{\mathrm{obs}}\times K}$, linear recurrence $F_{\bm{\theta}}(\bm{z}_{t-1},\bm{x}_{t-1},\bm{s}_t)$, decoder $G_{\bm{\psi}}(\bm{z}_t)$, batch size $B$, sequence length $T$, warm-up length $T_w$
2. **Output:** Trained parameters $\bm{\theta},\bm{\psi}$
3. **repeat**
4. Sample sequences $\bm{x}_{0:T}\in\mathbb{R}^{B\times(T+1)\times N}$ from $\bm{X}$, $\bm{s}_{1:T}\in\mathbb{R}^{B\times T\times K}$ from $\bm{S}$
5. $\triangleright$ Implicit vectorized batch processing below, e.g. via `jax.vmap` $\triangleleft$
6. Initialize $\bm{z}_0=\bm{0}$
7. $\bm{z}_{1:T}\leftarrow\mathrm{associative\_scan}(F_{\bm{\theta}},\bm{z}_0,\bm{x}_{0:T-1},\bm{s}_{1:T})$ $\quad\triangleright$ Solve in parallel on GPU
8. $\hat{\bm{x}}_{1:T}=G_{\bm{\psi}}(\bm{z}_{1:T})$
9. $\mathcal{L}\leftarrow\mathrm{MSE}(\bm{x}_{T_w+1:T},\hat{\bm{x}}_{T_w+1:T})$ $\quad\triangleright$ Exclude warm-up steps from loss
10. $(\bm{g}_{\bm{\theta}},\bm{g}_{\bm{\psi}})\leftarrow\mathrm{grad}(\mathcal{L};\ \bm{\theta},\bm{\psi})$
11. $(\bm{\theta},\bm{\psi})\leftarrow\mathrm{optimize}(\bm{\theta},\bm{\psi},\bm{g}_{\bm{\theta}},\bm{g}_{\bm{\psi}})$
12. **until** convergence

**Algorithm 2** Initial value / trajectory matching training with generalized teacher forcing

1. **Input:** Data $\bm{X}\in\mathbb{R}^{T_{\mathrm{obs}}\times N}$, optional external inputs $\bm{S}\in\mathbb{R}^{T_{\mathrm{obs}}\times K}$, recurrent model $F_{\bm{\theta}}$, observation matrix $\bm{B}\in\mathbb{R}^{N\times M}$, forcing strength $\alpha\in[0,1]$, batch size $B$, sequence length $T$
2. **Output:** Trained parameters $\bm{\theta},\bm{B}$
3. **repeat**
4. Compute $\bm{B}^{+}\leftarrow\mathrm{pinv}(\bm{B})$, $\bm{P}\leftarrow\bm{I}_M-\alpha\bm{B}^{+}\bm{B}$
5. Sample sequences $\bm{x}_{0:T}\in\mathbb{R}^{B\times(T+1)\times N}$ from $\bm{X}$, $\bm{s}_{1:T}\in\mathbb{R}^{B\times T\times K}$ from $\bm{S}$
6. $\triangleright$ Implicit vectorized batch processing below, e.g. via `jax.vmap` $\triangleleft$
7. Compute teacher signals $\bar{\bm{z}}_{0:T}=\bm{B}^{+}\bm{x}_{0:T}$
8. Initialize $\bm{z}_0\leftarrow\bar{\bm{z}}_0$
9. **for** $t=1,\dots,T$ **do**
10. $\quad\tilde{\bm{z}}_{t-1}\leftarrow\bm{P}\,\bm{z}_{t-1}+\alpha\,\bar{\bm{z}}_{t-1}$ $\quad\triangleright$ Build weighted forced state
11. $\quad\bm{z}_t\leftarrow F_{\bm{\theta}}(\tilde{\bm{z}}_{t-1},\bm{s}_t)$
12. $\quad\hat{\bm{x}}_t\leftarrow\bm{B}\bm{z}_t$
13. $\mathcal{L}\leftarrow\mathrm{MSE}(\bm{x}_{1:T},\hat{\bm{x}}_{1:T})$
14. $(\bm{g}_{\bm{\theta}},\bm{g}_{\bm{B}})\leftarrow\mathrm{grad}(\mathcal{L};\ \bm{\theta},\bm{B})$ $\quad\triangleright$ Backpropagation through time
15. $(\bm{\theta},\bm{B})\leftarrow\mathrm{optimize}(\bm{\theta},\bm{B},\bm{g}_{\bm{\theta}},\bm{g}_{\bm{B}})$
16. **until** convergence

**Algorithm 3** GTF-DEER: Generalized teacher forcing with DEER parallel solver

1. **Input:** Data $\bm{X}\in\mathbb{R}^{T_{\mathrm{obs}}\times N}$, optional external inputs $\bm{S}\in\mathbb{R}^{T_{\mathrm{obs}}\times K}$, recurrent model $F_{\bm{\theta}}$, observation matrix $\bm{B}\in\mathbb{R}^{N\times M}$, forcing strength $\alpha\in[0,1]$, batch size $B$, sequence length $T$, tolerance $\varepsilon$
2. **Output:** Trained parameters $\bm{\theta}$, $\bm{B}$
3. **repeat**
4. Compute $\bm{B}^{+}\leftarrow\mathrm{pinv}(\bm{B})$, $\bm{P}_{\alpha}\leftarrow\bm{I}_M-\alpha\bm{B}^{+}\bm{B}$, $\bm{P}_1\leftarrow\bm{I}_M-\bm{B}^{+}\bm{B}$
5. Sample sequences $\bm{x}_{0:T}\in\mathbb{R}^{B\times(T+1)\times N}$ from $\bm{X}$, $\bm{s}_{1:T}\in\mathbb{R}^{B\times T\times K}$ from $\bm{S}$
6. $\triangleright$ Implicit vectorized batch processing below, e.g. via `jax.vmap` $\triangleleft$
7. Compute forcing targets $\bar{\bm{z}}_t\leftarrow\bm{B}^{+}\bm{x}_t$ for $t=0,\ldots,T$
8. Initialize $\bm{z}_0\leftarrow\bar{\bm{z}}_0$ $\quad\triangleright$ Below, fix $\bm{z}^{(k)}_0=\bm{z}_0\ \forall k$
9. Initial guess $\bm{z}^{(0)}_{1:T}\leftarrow\bar{\bm{z}}_{1:T}$ $\quad\triangleright$ Warm start from data
10. **for** $k=0,1,\ldots$ until $\lVert\Delta\bm{z}^{(k+1)}_{1:T}\rVert_{\infty}<\varepsilon$ **do** $\quad\triangleright$ DEER iteration
11. $\quad\tilde{\bm{z}}^{(k)}_t\leftarrow\bm{P}_1\bm{z}^{(k)}_t+\bar{\bm{z}}_t$ for $t=1,\ldots,T_w$ $\quad\triangleright$ Warm-up
12. $\quad\tilde{\bm{z}}^{(k)}_t\leftarrow\bm{P}_{\alpha}\bm{z}^{(k)}_t+\alpha\bar{\bm{z}}_t$ for $t=T_w+1,\ldots,T$ $\quad\triangleright$ Forced inputs
13. $\quad\bm{J}^{(k)}_t\leftarrow\frac{\partial F_{\bm{\theta}}}{\partial\bm{z}}\big|_{\tilde{\bm{z}}^{(k)}_t}\bm{P}_1$ for $t=1,\ldots,T_w$ $\quad\triangleright$ Warm-up Jacobians
14. $\quad\bm{J}^{(k)}_t\leftarrow\frac{\partial F_{\bm{\theta}}}{\partial\bm{z}}\big|_{\tilde{\bm{z}}^{(k)}_t}\bm{P}_{\alpha}$ for $t=T_w+1,\ldots,T$ $\quad\triangleright$ Forced Jacobians
15. $\quad\bm{r}^{(k)}_t\leftarrow\bm{z}^{(k)}_t-F_{\bm{\theta}}(\tilde{\bm{z}}^{(k)}_{t-1})$ for $t=1,\ldots,T$ $\quad\triangleright$ Residuals
16. $\quad$Solve $\Delta\bm{z}^{(k+1)}_t=\bm{J}^{(k)}_{t-1}\Delta\bm{z}^{(k+1)}_{t-1}-\bm{r}^{(k)}_t$ for $t=1,\ldots,T$ via associative scan
17. $\quad\bm{z}^{(k+1)}_{1:T}\leftarrow\bm{z}^{(k)}_{1:T}+\Delta\bm{z}^{(k+1)}_{1:T}$
18. $\hat{\bm{x}}_{T_w+1:T}\leftarrow\bm{B}\bm{z}^{(\mathrm{final})}_{T_w+1:T}$
19. $\mathcal{L}\leftarrow\mathrm{MSE}(\bm{x}_{T_w+1:T},\hat{\bm{x}}_{T_w+1:T})$ $\quad\triangleright$ Loss excluding warm-up states
20. $(\bm{g}_{\bm{\theta}},\bm{g}_{\bm{B}})\leftarrow\mathrm{grad}(\mathcal{L};\ \bm{\theta},\bm{B})$ $\quad\triangleright$ Single associative scan via IFT
21. $(\bm{\theta},\bm{B})\leftarrow\mathrm{optimize}(\bm{\theta},\bm{B},\bm{g}_{\bm{\theta}},\bm{g}_{\bm{B}})$
22. **until** convergence
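A minimal sketch of the inner DEER iteration of Algorithm 3 (lines 9-17) in the quasi-DEER (diagonal Jacobian) case, where the linear solve in line 16 reduces to an elementwise first-order recurrence handled by `jax.lax.associative_scan`. Warm-up handling, batching, and the tolerance-based stopping rule are omitted, and $\bm{P}_{\alpha} = (1-\alpha)\bm{I}$ is assumed (full-column-rank $\bm{B}$, $M=N$, cf. Appx. B); all names are illustrative.

```python
import jax
import jax.numpy as jnp

def combine(e1, e2):
    # Associative composition for the first-order linear recurrence
    # dz_t = a_t * dz_{t-1} + b_t (diagonal case of Eq. (27)).
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2

def quasi_deer_forward(step_fn, jac_diag_fn, z_bar, z0, alpha, n_iter=20):
    """Sketch of Algorithm 3, lines 9-17, with diagonal Jacobians (quasi-DEER).

    step_fn(z_tilde) -> F_theta(z_tilde); jac_diag_fn(z_tilde) -> diagonal of dF/dz;
    z_bar: (T+1, M) forcing targets B^+ x_{0:T}; z0: (M,) fixed initial state.
    A fixed number of Newton iterations replaces the tolerance check of line 10.
    """
    z = z_bar[1:]                                             # warm start from data (line 9)
    for _ in range(n_iter):                                   # DEER / Newton iterations
        z_prev = jnp.concatenate([z0[None], z[:-1]], axis=0)  # z_0, ..., z_{T-1}
        z_tilde = (1.0 - alpha) * z_prev + alpha * z_bar[:-1]     # forced states (line 12)
        r = z - jax.vmap(step_fn)(z_tilde)                        # residuals (line 15)
        j = (1.0 - alpha) * jax.vmap(jac_diag_fn)(z_tilde)        # forced Jacobians (line 14)
        _, dz = jax.lax.associative_scan(combine, (j, -r))        # solve line 16 in parallel
        z = z + dz                                                # Newton update (line 17)
    return z
```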
