Mamba-Assisted Non-Markovian Closure for Reduced-Order Modeling
Summary
Proposes the Mamba-Assisted Closure (MAC) framework, a Mamba-based sequence model for non-Markovian closure in reduced-order modeling of high-dimensional dynamical systems, outperforming GRU-based and Markovian methods on Burgers' equation and Lorenz '96 systems.
View Cached Full Text
Cached at: 06/05/26, 08:10 AM
# Mamba-Assisted Non-Markovian Closure for Reduced-Order ModelingCurrent version: \DTMenglishmonthname 3, 2026. Typeset with and XELaTeX. Source: [https://arxiv.org/html/2606.05371](https://arxiv.org/html/2606.05371) \[ BoldFont = lmroman10\-bold\.otf, ItalicFont = lmroman10\-italic\.otf, BoldItalicFont = lmroman10\-bolditalic\.otf, SmallCapsFont = lmromancaps10\-regular\.otf, SmallCapsFeatures = ItalicFont = lmromancaps10\-oblique\.otf \]\\setCJKfamilyfontzhfsFandolFang\-Regular\.otf\\NewBibliographyStringkeyword,keywords\\NewBibliographyStringartno Zhi\-Feng WeiEmail:[zfwei@pnnl\.gov](https://arxiv.org/html/2606.05371v1/mailto:[email protected])\.Advanced Computing, Mathematics, and Data Division,Pacific Northwest National Laboratory, Richland, WA 99354, USASaad QadeerEmail:[saad\.qadeer@pnnl\.gov](https://arxiv.org/html/2606.05371v1/mailto:[email protected])\.Advanced Computing, Mathematics, and Data Division,Pacific Northwest National Laboratory, Richland, WA 99354, USAPanos StinisEmail:[panagiotis\.stinis@pnnl\.gov](https://arxiv.org/html/2606.05371v1/mailto:[email protected])\.Advanced Computing, Mathematics, and Data Division,Pacific Northwest National Laboratory, Richland, WA 99354, USADepartment of Applied Mathematics, University of Washington, Seattle, WA 98195, USADivision of Applied Mathematics, Brown University, Providence, RI 02912, USA ###### Abstract Reduced\-order modeling of high\-dimensional dynamical systems is often hindered by the non\-Markovian closure term that represents the effect of unresolved variables on the resolved dynamics\. Inspired by the Mori–Zwanzig formalism, in which the closure takes the form of a memory functional of the resolved trajectory, we recast closure modeling as a sequence modeling problem and propose the Mamba\-Assisted Closure \(MAC\) framework: a Mamba\-based sequence model, trained to predict the closure from the resolved trajectory, is coupled with the reduced\-order governing equations through a numerical integrator to advance the resolved variables in time\. A key feature of the framework is its exploitation of the dual representation of state\-space models — the model is trained in a sequence\-to\-sequence fashion via the convolutional form, and deployed for step\-by\-step autoregressive rollout via the recurrent form, yielding both efficient long\-trajectory training and constant per\-step inference cost\. On the viscous Burgers’ equation and the chaotic two\-scale Lorenz ’96 system, the MAC model substantially outperforms the Markovian reduced\-order model, the GRU\-based sequence model, and the Wilks method in predictive accuracy and long\-time rollout stability\. Keywords:reduced\-order modeling, non\-Markovian closure, Mori–Zwanzig formalism, selective state\-space models \(Mamba\), sequence\-to\-sequence learning ###### Contents 1. [1Introduction](https://arxiv.org/html/2606.05371#S1) 2. [2Technical Approach](https://arxiv.org/html/2606.05371#S2)1. [2\.1Reduced\-Order Modeling and Memory Effects](https://arxiv.org/html/2606.05371#S2.SS1) 2. [2\.2Mamba for Non\-Markovian Closure Modeling](https://arxiv.org/html/2606.05371#S2.SS2) 3. [2\.3Training and Inference Framework](https://arxiv.org/html/2606.05371#S2.SS3) 3. [3Numerical Results](https://arxiv.org/html/2606.05371#S3)1. [3\.1Viscous Burgers’ Equation](https://arxiv.org/html/2606.05371#S3.SS1) 2. [3\.2Two\-Scale Lorenz ’96 System](https://arxiv.org/html/2606.05371#S3.SS2) 4. [4Conclusion](https://arxiv.org/html/2606.05371#S4) 5. [Code and Data Availability](https://arxiv.org/html/2606.05371#Sx1) 6. [Acknowledgments](https://arxiv.org/html/2606.05371#Sx2) 7. [AAdditional Closure\-Term Results](https://arxiv.org/html/2606.05371#A1)1. [A\.1Burgers’ Equation](https://arxiv.org/html/2606.05371#A1.SS1) 2. [A\.2Lorenz ’96 System](https://arxiv.org/html/2606.05371#A1.SS2) 8. [BImplementation Details](https://arxiv.org/html/2606.05371#A2)1. [B\.1Data Generation](https://arxiv.org/html/2606.05371#A2.SS1) 2. [B\.2Neural Network Architectures](https://arxiv.org/html/2606.05371#A2.SS2) 3. [B\.3Model and Training Hyperparameters](https://arxiv.org/html/2606.05371#A2.SS3) 4. [B\.4Data Normalization](https://arxiv.org/html/2606.05371#A2.SS4) 5. [B\.5Noise Injection](https://arxiv.org/html/2606.05371#A2.SS5) 6. [B\.6Training Loss Choices](https://arxiv.org/html/2606.05371#A2.SS6) 7. [B\.7Warm\-up in Inference](https://arxiv.org/html/2606.05371#A2.SS7) 8. [B\.8RK4 Time Stepping with Zero\-Order Holding](https://arxiv.org/html/2606.05371#A2.SS8) 9. [CSymbols & Notation](https://arxiv.org/html/2606.05371#A3) 10. [References](https://arxiv.org/html/2606.05371#Sx1a) 11. [References](https://arxiv.org/html/2606.05371#bib) ## 1Introduction Many physical systems of scientific and engineering interest, such as turbulent flows, climate dynamics, computational biology, and materials science, are governed by high\-dimensional nonlinear dynamics that span a wide range of spatial and temporal scales\[[undef](https://arxiv.org/html/2606.05371#bib.bibx1),[undeff](https://arxiv.org/html/2606.05371#bib.bibx7),[undefj](https://arxiv.org/html/2606.05371#bib.bibx11),[undefz](https://arxiv.org/html/2606.05371#bib.bibx27),[undefab](https://arxiv.org/html/2606.05371#bib.bibx29)\]\. Direct numerical simulation of such systems at full resolution remains prohibitively expensive, motivating the development of reduced\-order models \(ROMs\) that evolve only a subset of resolved variables at reduced cost\[[undefy](https://arxiv.org/html/2606.05371#bib.bibx26)\]\. However, the evolution of the resolved variables is in general not closed: the effects of the unresolved scales feed back onto the resolved dynamics through a closure term, whose accurate modeling is a central challenge in reduced\-order modeling\[[undefaf](https://arxiv.org/html/2606.05371#bib.bibx33),[undefg](https://arxiv.org/html/2606.05371#bib.bibx8),[undefh](https://arxiv.org/html/2606.05371#bib.bibx9)\]\. The Mori–Zwanzig \(MZ\) formalism provides a principled characterization of this closure term\[[undefu](https://arxiv.org/html/2606.05371#bib.bibx22),[undefai](https://arxiv.org/html/2606.05371#bib.bibx36),[undefe](https://arxiv.org/html/2606.05371#bib.bibx6)\]\. By projecting the full dynamics onto the resolved subspace, it shows that the closure is not determined by the instantaneous resolved state alone, but is instead a functional of its entire temporal history — a defining feature of non\-Markovian dynamics\[[undefv](https://arxiv.org/html/2606.05371#bib.bibx23)\]\. Physically, information transferred from the resolved to the unresolved scales is not instantaneously lost, but is partially retained and fed back into the resolved dynamics at later times\. The full Mori–Zwanzig decomposition expresses the closure as a sum of a memory term, which captures this history dependence, and an orthogonal fluctuation term arising from dynamics intrinsic to the unresolved subspace\. Of these two contributions, the memory term is determined entirely by the history of the resolved variables and is thus the natural target for modeling within the reduced\-order framework\. In the linear regime, the memory term reduces to a convolution between the resolved trajectory and a memory kernel\. While conceptually transparent, the Mori–Zwanzig representation is rarely computationally tractable in practice: explicitly estimating the memory kernel and evaluating the associated convolution integral at every step is computationally prohibitive for high\-dimensional systems and long\-time rollouts\[[undefah](https://arxiv.org/html/2606.05371#bib.bibx35)\]\. This motivates a data\-driven alternative: learning the closure’s non\-Markovian dependence on the resolved history directly from trajectory data, thereby recasting closure modeling as a sequence modeling problem\[[undefn](https://arxiv.org/html/2606.05371#bib.bibx15),[undefs](https://arxiv.org/html/2606.05371#bib.bibx20)\]\. Within this sequence\-modeling perspective, existing data\-driven approaches differ primarily in how — and whether — they represent the closure’s dependence on the temporal history\. One line of work couples a neural network with a numerical solver and trains the coupled system by unrolling it over multiple steps, so that the network is optimized against its own rollout trajectory\[[undefaa](https://arxiv.org/html/2606.05371#bib.bibx28),[undefq](https://arxiv.org/html/2606.05371#bib.bibx18)\]\. Here, temporal consistency is promoted through the training procedure rather than through an explicit memory of past resolved states\. To capture non\-Markovian memory effects explicitly, a second line of work incorporates the temporal history of the resolved variables into the closure\. The simplest such approach uses a fixed\-length window of past resolved states as input\[[undefc](https://arxiv.org/html/2606.05371#bib.bibx4)\], which turns the reduced dynamics into a delay system but requires the memory depth to be prescribed a priori\. A more flexible alternative employs recurrent neural networks such as LSTMs, which encode the history in an evolving hidden state and have been used to represent closure and memory terms in reduced\-order models\[[undefad](https://arxiv.org/html/2606.05371#bib.bibx31),[undeft](https://arxiv.org/html/2606.05371#bib.bibx21)\]\. Most recently, an LSTM\-based memory model has been coupled with a differentiable physics solver to learn non\-Markovian closures for coarse\-grained fluid transport\[[undefag](https://arxiv.org/html/2606.05371#bib.bibx34)\]\. Recurrent architectures, however, are known to struggle with long\-range temporal dependencies owing to vanishing gradients\[[undefb](https://arxiv.org/html/2606.05371#bib.bibx3),[undefw](https://arxiv.org/html/2606.05371#bib.bibx24)\], which limits the effective memory they can reliably capture\. Across these approaches, capturing non\-Markovian memory — while inferring the relevant memory depth from data rather than prescribing it, and retaining both efficient training over long trajectories and low\-cost inference — remains an open challenge\. State\-space models offer a natural architectural candidate for addressing this challenge\. The output of a structured SSM is equivalent to a discrete convolution of its input with a learnable kernel, establishing a direct structural correspondence with the Mori–Zwanzig memory integral\[[undefm](https://arxiv.org/html/2606.05371#bib.bibx14),[undefl](https://arxiv.org/html/2606.05371#bib.bibx13)\]\. Mamba extends structured SSMs with input\-dependent state\-space matrices, whose selective mechanism allows the model to dynamically determine what temporal information to retain or discard based on the current input\[[undefk](https://arxiv.org/html/2606.05371#bib.bibx12)\]\. A further advantage lies in the dual representation of state\-space models: SSMs admit both a convolutional form that enables parallel computation over long sequences, and a recurrent form that advances step by step at constant per\-step cost — in contrast to Transformer\-based architectures, whose computational cost scales quadratically with sequence length\[[undefac](https://arxiv.org/html/2606.05371#bib.bibx30)\]\. SSMs have already shown strong performance in scientific machine learning settings — including PDE operator learning\[[undefo](https://arxiv.org/html/2606.05371#bib.bibx16)\]and dynamical systems modeling\[[undefp](https://arxiv.org/html/2606.05371#bib.bibx17)\]— providing additional motivation for their application to the closure modeling problem\. In this work, we propose the Mamba\-Assisted Closure \(MAC\) framework, which recasts non\-Markovian closure modeling as a sequence modeling problem\. Motivated by the structural correspondence between SSM convolutions and the Mori–Zwanzig memory integral, we model the history\-dependent closure with a Mamba\-based sequence model, trained to predict the closure from the resolved trajectory and coupled with the reduced\-order governing equations through a numerical integrator to advance the resolved variables in time\. The selective mechanism of Mamba allows the effective memory depth to be inferred from the resolved trajectory itself, without prescribing a fixed history window\. The framework exploits the dual representation of state\-space models: the closure model is trained in a sequence\-to\-sequence fashion via the convolutional form, enabling parallel computation over long training trajectories with linear\-time scaling, and deployed step by step via the recurrent form at constant per\-step inference cost during autoregressive rollout\. We validate the MAC framework on two benchmark systems with complementary characteristics — the viscous Burgers’ equation in Fourier space and the chaotic two\-scale Lorenz ’96 system — where it substantially outperforms the Markovian reduced\-order model, the GRU\-based sequence model, and the Wilks method in predictive accuracy and long\-time rollout stability\. The remainder of this paper is organized as follows\.[Section2](https://arxiv.org/html/2606.05371#S2)presents the reduced\-order modeling formulation, the associated memory effects, and the MAC framework\.[Section3](https://arxiv.org/html/2606.05371#S3)reports numerical experiments on the viscous Burgers’ equation and the two\-scale Lorenz ’96 system\.[Section4](https://arxiv.org/html/2606.05371#S4)concludes the paper and discusses directions for future work\. Additional closure\-term results are reported in[AppendixA](https://arxiv.org/html/2606.05371#A1), implementation details in[AppendixB](https://arxiv.org/html/2606.05371#A2), and a summary of notation in the final appendix\. ## 2Technical Approach In this section, we describe the technical framework used for closure modeling in reduced\-order models\. We first introduce the reduced\-order modeling formulation and the associated memory effects arising from unresolved variables\. We also discuss why the Mamba\-based architecture is well suited for closure modeling with memory effects\. We then present the training and inference framework in which sequence models are used to model the closure term and are coupled with the reduced\-order governing equations to advance the resolved variables in time and predict their temporal evolution\. ### 2\.1Reduced\-Order Modeling and Memory Effects Many high\-dimensional dynamical systems admit effective low\-dimensional descriptions, with their essential dynamics evolving on lower\-dimensional subspaces despite the large number of degrees of freedom in the full system\. This observation motivates the construction of reduced\-order models \(ROMs\) that evolve only a subset of the full state variables while aiming to faithfully reproduce the dynamics of interest\[[undefy](https://arxiv.org/html/2606.05371#bib.bibx26)\]\. To formalize this idea, consider a general dynamical system dφdt=R\(φ\),\\frac\{\\mathrm\{d\}\{\\varphi\}\}\{\\mathrm\{d\}\{t\}\}=R\(\\varphi\),\(1\)whereφ\\varphidenotes the vector of full state variables\. In reduced\-order modeling, we decomposeφ\\varphiinto resolved and unresolved components, φ=\(φ^,φ~\),\\varphi=\(\\widehat\{\\varphi\},\\widetilde\{\\varphi\}\),whereφ^\\widehat\{\\varphi\}collects the variables retained in the ROM andφ~\\widetilde\{\\varphi\}collects the degrees of freedom excluded from the reduced\-order representation\. A naive truncation that retains only the interactions among resolved variables yields the approximate dynamicsdφ^/dt≈R^\(φ^\)\\,\\mathrm\{d\}\\widehat\{\\varphi\}/\\,\\mathrm\{d\}t\\approx\\widehat\{R\}\(\\widehat\{\\varphi\}\)\. However, this approximation is generally inadequate: the unresolved variables continue to influence the resolved dynamics through their coupling in the original system, and neglecting this influence introduces systematic errors\. The exact reduced\-order dynamics therefore takes the form dφ^dt=R^\(φ^\)\+𝒞,\\frac\{\\mathrm\{d\}\{\\widehat\{\\varphi\}\}\}\{\\mathrm\{d\}\{t\}\}=\\widehat\{R\}\(\\widehat\{\\varphi\}\)\+\\mathcal\{C\},\(2\)whereR^\(φ^\)\\widehat\{R\}\(\\widehat\{\\varphi\}\)represents the retained interactions among the resolved variables and𝒞\\mathcal\{C\}— the closure term — represents the net effect of the unresolved variables on the resolved dynamics\. A central question is how to characterize the closure term𝒞\\mathcal\{C\}\. The Mori–Zwanzig formalism provides a principled answer by making the role of memory explicit\[[undefu](https://arxiv.org/html/2606.05371#bib.bibx22),[undefai](https://arxiv.org/html/2606.05371#bib.bibx36),[undefe](https://arxiv.org/html/2606.05371#bib.bibx6)\]\. In the full system[Eq\.1](https://arxiv.org/html/2606.05371#S2.E1), the resolved and unresolved variables continuously exchange information: information transferred fromφ^\\widehat\{\\varphi\}toφ~\\widetilde\{\\varphi\}is generally not instantaneously lost, but is partially retained by the unresolved variables and subsequently fed back into the resolved dynamics over time\. Althoughφ~\\widetilde\{\\varphi\}is not explicitly evolved in the reduced\-order model, its influence is therefore implicitly carried forward through memory effects\. Consequently, the closure term generally depends not only on the instantaneous state of the resolved variables, but also on their temporal history — a defining feature of non\-Markovian dynamics\. Formally, the Mori–Zwanzig formalism expresses the closure as a functional of the history of the resolved variables, 𝒞\(t\)=ℱ\(φ^\(s\):0⩽s⩽t\),\\mathcal\{C\}\(t\)=\\mathcal\{F\}\\bigl\(\\widehat\{\\varphi\}\(s\):0\\leqslant s\\leqslant t\\bigr\),which, in the linear case, reduces to the convolution form 𝒞\(t\)=∫0tK\(t−s\)φ^\(s\)ds,\\mathcal\{C\}\(t\)=\\int\_\{0\}^\{t\}K\(t\-s\)\\,\\widehat\{\\varphi\}\(s\)\\,\\mathrm\{d\}s,withKKa memory kernel quantifying the influence of past resolved states on the present closure\. We note that the full Mori–Zwanzig decomposition also includes an orthogonal error term arising from dynamics intrinsic to the unresolved subspace; in this work, we focus on the memory contribution, which captures the component of the closure that is determined by the history of the resolved variables and is therefore amenable to data\-driven modeling\. While conceptually transparent, these representations are rarely tractable in practice: explicitly estimating the memory kernel and evaluating the resulting convolution integral at every time step is computationally prohibitive for long\-time rollouts and high\-dimensional systems\. These difficulties motivate a data\-driven alternative\. Rather than estimating the memory kernel explicitly, one may directly learn the non\-Markovian dependence of the closure term on the history of the resolved variables from trajectory data, thereby recasting closure modeling as a sequence modeling problem\. This perspective motivates the sequence\-modeling framework developed in the next subsection\. ### 2\.2Mamba for Non\-Markovian Closure Modeling As discussed in[Section2\.1](https://arxiv.org/html/2606.05371#S2.SS1), the closure term generally depends on the temporal history of the resolved variables due to non\-Markovian memory effects\. The central modeling question is therefore how to approximate this history\-dependent closure in a way that is both expressive and computationally tractable\. In this subsection, we review several candidate approaches and explain why a Mamba\-based architecture is particularly well suited to this task\. The simplest baseline is to neglect the closure term entirely and evolve only the resolved dynamics\. This Markovian approximation ignores the memory effects induced by the unresolved variables and therefore cannot capture the delayed feedback that drives much of the reduced\-order error\. A more expressive alternative is to use recurrent neural networks, such as gated recurrent units \(GRUs\), which encode temporal information in evolving hidden states and can in principle represent non\-Markovian dependencies\[[undefd](https://arxiv.org/html/2606.05371#bib.bibx5)\]\. In practice, however, recurrent architectures are known to struggle with long\-range temporal dependencies due to vanishing\-gradient and limited\-capacity issues in their hidden\-state propagation, which can limit their ability to capture extended memory effects in closure modeling\[[undefb](https://arxiv.org/html/2606.05371#bib.bibx3),[undefw](https://arxiv.org/html/2606.05371#bib.bibx24)\]\. State\-space models \(SSMs\) offer an alternative sequence\-modeling framework that is particularly well aligned with the closure modeling problem, since they describe sequences through an underlying latent dynamical system — much like the resolved variables themselves evolve under a \(history\-dependent\) dynamical system\[[undefm](https://arxiv.org/html/2606.05371#bib.bibx14)\]\. A continuous SSM takes the form dh\(t\)dt=Ah\(t\)\+Bx\(t\),y\(t\)=Ch\(t\)\.\\frac\{\\mathrm\{d\}\{h\}\(t\)\}\{\\mathrm\{d\}\{t\}\}=Ah\(t\)\+Bx\(t\),\\qquad y\(t\)=Ch\(t\)\.After temporal discretization, this becomes ht=A¯ht−1\+B¯xt,yt=C¯ht,h\_\{t\}=\\overline\{A\}h\_\{t\-1\}\+\\overline\{B\}x\_\{t\},\\qquad y\_\{t\}=\\overline\{C\}h\_\{t\},wherextx\_\{t\}denotes the input at discrete time steptt,hth\_\{t\}is the hidden state,yty\_\{t\}is the corresponding output, andA¯\\overline\{A\},B¯\\overline\{B\},C¯\\overline\{C\}are the discretized state\-space matrices\. This recurrent structure naturally enables temporal information from earlier inputs and states to influence future predictions over long time horizons, making SSMs well suited for reduced\-order models with long memory effects\. In structured state\-space models such as S4, the matricesA¯\\overline\{A\},B¯\\overline\{B\},C¯\\overline\{C\}are fixed across all time steps\[[undefl](https://arxiv.org/html/2606.05371#bib.bibx13)\]\. Under this assumption, the recurrence may be written out explicitly\. Starting fromh0=B¯x0h\_\{0\}=\\overline\{B\}x\_\{0\}, successive hidden states satisfy h1=A¯B¯x0\+B¯x1,h2=A¯2B¯x0\+A¯B¯x1\+B¯x2,…,h\_\{1\}=\\overline\{A\}\\overline\{B\}x\_\{0\}\+\\overline\{B\}x\_\{1\},\\quad h\_\{2\}=\\overline\{A\}^\{2\}\\overline\{B\}x\_\{0\}\+\\overline\{A\}\\overline\{B\}x\_\{1\}\+\\overline\{B\}x\_\{2\},\\quad\\ldots,with corresponding outputs y0=C¯B¯x0,y1=C¯A¯B¯x0\+C¯B¯x1,y2=C¯A¯2B¯x0\+C¯A¯B¯x1\+C¯B¯x2,…y\_\{0\}=\\overline\{C\}\\overline\{B\}x\_\{0\},\\quad y\_\{1\}=\\overline\{C\}\\overline\{A\}\\overline\{B\}x\_\{0\}\+\\overline\{C\}\\overline\{B\}x\_\{1\},\\quad y\_\{2\}=\\overline\{C\}\\overline\{A\}^\{2\}\\overline\{B\}x\_\{0\}\+\\overline\{C\}\\overline\{A\}\\overline\{B\}x\_\{1\}\+\\overline\{C\}\\overline\{B\}x\_\{2\},\\quad\\ldotsMore generally, yk=C¯A¯kB¯x0\+C¯A¯k−1B¯x1\+⋯\+C¯A¯B¯xk−1\+C¯B¯xk\.y\_\{k\}=\\overline\{C\}\\overline\{A\}^\{k\}\\overline\{B\}x\_\{0\}\+\\overline\{C\}\\overline\{A\}^\{k\-1\}\\overline\{B\}x\_\{1\}\+\\cdots\+\\overline\{C\}\\overline\{A\}\\overline\{B\}x\_\{k\-1\}\+\\overline\{C\}\\overline\{B\}x\_\{k\}\.Equivalently, the SSM admits the convolutional representation with input sequencex=\(x0,x1,…,xL\)x=\(x\_\{0\},x\_\{1\},\\ldots,x\_\{L\}\), output sequencey=\(y0,y1,…,yL\)y=\(y\_\{0\},y\_\{1\},\\ldots,y\_\{L\}\), and convolution kernel K=\(C¯B¯,C¯A¯B¯,…,C¯A¯LB¯\)\.K=\\bigl\(\\overline\{C\}\\overline\{B\},\\ \\overline\{C\}\\overline\{A\}\\overline\{B\},\\ \\ldots,\\ \\overline\{C\}\\overline\{A\}^\{L\}\\overline\{B\}\\bigr\)\.This convolutional representation closely mirrors the memory\-integral structure that arises in the Mori–Zwanzig formalism, in which the closure term depends on the temporal history of the resolved variables through a convolution\-like memory operator\. The analogy is direct: the SSM kernelKKplays the role of the memory kernel, while the input sequencexxcorresponds to the history of the resolved variables\. This structural parallel provides a principled motivation for using SSM\-based sequence models for reduced\-order closure modeling, beyond their empirical success on long\-sequence tasks\. However, in conventional SSMs such as S4, the matricesA¯,B¯,C¯\\overline\{A\},\\overline\{B\},\\overline\{C\}remain fixed after training and are shared across all input sequences\. Consequently, the same state\-space dynamics — and hence the same convolution kernel — are applied regardless of the current input, limiting the model’s ability to adaptively retain or discard temporal information according to the evolving sequence\. To overcome this rigidity, we employ the Mamba architecture for reduced\-order closure modeling in this work\[[undefk](https://arxiv.org/html/2606.05371#bib.bibx12)\]\. Mamba preserves the SSM framework while introducing input\-dependent selective state transitions, allowing the model to dynamically retain, propagate, or discard temporal information according to the current input\. Beyond the selective mechanism, Mamba offers two further properties well suited to closure modeling\. First, it does not prescribe a fixed memory window: the relevant memory depth is instead inferred from the input itself, which is desirable when the effective memory length of the closure is not known a priori\. Second, Mamba scales linearly in sequence length, enabling efficient processing of the long temporal trajectories that arise in closure modeling problems with extended memory effects\. Motivated by these properties, we employ Mamba\-based sequence models to learn the non\-Markovian closure term from the temporal history of the resolved variables\. ### 2\.3Training and Inference Framework Having motivated the use of Mamba\-based sequence models for closure modeling, we now describe how such models are trained and subsequently coupled with the reduced\-order system during inference to evolve the resolved variables\. The framework presented below is applied consistently to all sequence model architectures considered in this work, including both Mamba\-based and GRU\-based sequence models; their detailed architectures are deferred to[SectionB\.2](https://arxiv.org/html/2606.05371#A2.SS2)\. Recall from[Section2\.1](https://arxiv.org/html/2606.05371#S2.SS1)that the reduced\-order system takes the form dφ^dt=R^\(φ^\)\+𝒞,\\frac\{\\mathrm\{d\}\\widehat\{\\varphi\}\}\{\\mathrm\{d\}t\}=\\widehat\{R\}\(\\widehat\{\\varphi\}\)\+\\mathcal\{C\},\(3\)whereφ^\\widehat\{\\varphi\}denotes the resolved variables,R^\\widehat\{R\}represents the retained interactions among them and is typically known from the full dynamical system[Eq\.1](https://arxiv.org/html/2606.05371#S2.E1), and𝒞\\mathcal\{C\}denotes the unknown closure term associated with the unresolved variables\. The role of the sequence model is to learn𝒞\\mathcal\{C\}as a function of the history ofφ^\\widehat\{\\varphi\}, so that the resulting closed system can be advanced in time to predict the evolution of the resolved variables\. During training, the sequence model is supervised using resolved trajectories together with the corresponding closure trajectories extracted from the training data\. Specifically, given a resolved trajectory \{φ^0,φ^1,…,φ^T\},\\\{\\widehat\{\\varphi\}\_\{0\},\\widehat\{\\varphi\}\_\{1\},\\dots,\\widehat\{\\varphi\}\_\{T\}\\\},the sequence model produces the associated trajectory of closure terms \{𝒞0,𝒞1,…,𝒞T\}\.\\\{\\mathcal\{C\}\_\{0\},\\mathcal\{C\}\_\{1\},\\dots,\\mathcal\{C\}\_\{T\}\\\}\.Training is carried out under teacher forcing: the ground\-truth resolved trajectory is supplied as the input sequence at every step, so that the model learns the mapping from resolved histories to closure terms in a stable, non\-autoregressive fashion\. During inference, in contrast, the resolved and closure trajectories are advanced autoregressively in tandem with the reduced\-order dynamics\. Starting from an initial resolved stateφ^0\\widehat\{\\varphi\}\_\{0\}, the sequence model first predicts the corresponding closure term𝒞0\\mathcal\{C\}\_\{0\}\. The reduced\-order system[Eq\.3](https://arxiv.org/html/2606.05371#S2.E3)is then advanced by one numerical integrator step that takes bothφ^0\\widehat\{\\varphi\}\_\{0\}and𝒞0\\mathcal\{C\}\_\{0\}as inputs, yielding the next resolved stateφ^1\\widehat\{\\varphi\}\_\{1\}\. This updated state is fed back into the sequence model to predict𝒞1\\mathcal\{C\}\_\{1\}, which together withφ^1\\widehat\{\\varphi\}\_\{1\}drives the next integrator step to produceφ^2\\widehat\{\\varphi\}\_\{2\}, and so on\. The procedure is illustrated in[Figure1](https://arxiv.org/html/2606.05371#S2.F1)\. Iterating in this manner rolls out the full resolved trajectory, with both the predicted resolved states and the predicted closure terms recorded for subsequent evaluation\. In our implementation, the numerical integrator is a fourth\-order Runge–Kutta scheme with a zero\-order holding treatment of the predicted closure; details are provided in[SectionB\.8](https://arxiv.org/html/2606.05371#A2.SS8)\. φ^0\\widehat\{\\varphi\}\_\{0\}φ^1\\widehat\{\\varphi\}\_\{1\}φ^2\\widehat\{\\varphi\}\_\{2\}⋯\\cdots𝒞0\\mathcal\{C\}\_\{0\}𝒞1\\mathcal\{C\}\_\{1\}𝒞2\\mathcal\{C\}\_\{2\}⋯\\cdotsRK4RK4RK4Sequence modelFigure 1:Schematic illustration of the autoregressive rollout procedure for reduced\-order model evolution with predicted closure terms\. At each step, the sequence model predicts the closure term𝒞t\\mathcal\{C\}\_\{t\}from the current resolved stateφ^t\\widehat\{\\varphi\}\_\{t\}\(red arrows\)\. The resolved state is then advanced by one integrator step \(RK4\) that takes bothφ^t\\widehat\{\\varphi\}\_\{t\}and𝒞t\\mathcal\{C\}\_\{t\}as inputs to produceφ^t\+1\\widehat\{\\varphi\}\_\{t\+1\}\(gray arrows\)\. See[SectionB\.8](https://arxiv.org/html/2606.05371#A2.SS8)for details of the time\-stepping procedure\.A practical advantage of this autoregressive procedure is that, at each step, only the current resolved stateφ^t\\widehat\{\\varphi\}\_\{t\}needs to be supplied to the sequence model: the temporal history of the resolved variables is implicitly carried in the hidden state\. Recurrent sequence models such as Mamba and GRU therefore enable efficient autoregressive rollout without reprocessing the entire trajectory at every step\. In contrast, Transformer\-based architectures must apply attention over an ever\-growing input sequence during rollout, incurring substantially higher computational cost for long trajectories\[[undefac](https://arxiv.org/html/2606.05371#bib.bibx30)\]\. Notably, this training–inference design aligns naturally with the two equivalent representations of state\-space models discussed in[Section2\.2](https://arxiv.org/html/2606.05371#S2.SS2)\. During training, the sequence model is applied in a sequence\-to\-sequence fashion: an entire resolved trajectory is mapped to the corresponding closure trajectory in parallel, exploiting the convolutional representation of the SSM for efficient computation over long sequences\. During inference, by contrast, the model is advanced step by step in tandem with the reduced\-order dynamics, naturally leveraging the recurrent representation of the SSM, which provides constant per\-step cost and avoids reprocessing the temporal history\. This duality — parallel sequence\-to\-sequence training via the convolutional form, and sequential step\-by\-step inference via the recurrent form — is a defining feature of SSM\-based architectures and is well matched to the requirements of reduced\-order closure modeling\. A small but practically important consequence of this training–inference duality is that, at the very beginning of an autoregressive rollout, the recurrent hidden state of the SSM has not yet accumulated any temporal context, whereas during training every output is produced with access to its full preceding sequence\. To reduce this initial\-state mismatch, we apply a short cache warm\-up at the start of each inference rollout: the initial resolved stateφ^0\\widehat\{\\varphi\}\_\{0\}is repeatedly passed through the Mamba\-based sequence model for a small number of padding steps to populate the internal SSM cache, after which the autoregressive rollout described above begins fromφ^0\\widehat\{\\varphi\}\_\{0\}\. The closure predictions produced during these padding steps are discarded and do not advance the reduced\-order dynamics\. Implementation details are provided in[SectionB\.7](https://arxiv.org/html/2606.05371#A2.SS7) We refer to the resulting framework — in which a Mamba\-based sequence model is coupled with the reduced\-order dynamics to perform autoregressive closure prediction and trajectory evolution — as the Mamba\-Assisted Closure \(MAC\) framework, and use*MAC model*interchangeably to denote the resulting closed reduced\-order system\. The MAC framework serves as the basis for all numerical experiments in the remainder of this paper\. ## 3Numerical Results In this section, we report our experimental results on Burgers’ equation and the Lorenz ’96 system\. These two systems represent complementary settings for closure modeling: deterministic PDE dynamics and chaotic multiscale systems\. For Burgers’ equation, we compare the performance of the MAC model with that of the corresponding Markovian reduced\-order model without closure correction\. For the Lorenz ’96 system, we compare the performance of the MAC model with those of the GRU\-based sequence model and the classical Wilks method\[[undefd](https://arxiv.org/html/2606.05371#bib.bibx5),[undefae](https://arxiv.org/html/2606.05371#bib.bibx32)\]\. Throughout this section, we focus primarily on the resolved variables; additional results and plots for the closure terms are reported in[AppendixA](https://arxiv.org/html/2606.05371#A1)\. Implementation details are provided in[AppendixB](https://arxiv.org/html/2606.05371#A2)\. ### 3\.1Viscous Burgers’ Equation Consider the viscous Burgers’ equation with periodic boundary conditions ut=∂x\(−12u2\)\+νuxx,u\(0,x\)=u0\(x\),x∈\[0,2π\)\.u\_\{t\}=\\partial\_\{x\}\\Bigl\(\-\\frac\{1\}\{2\}u^\{2\}\\Bigr\)\+\\nu u\_\{xx\},\\qquad u\(0,x\)=u\_\{0\}\(x\),\\qquad x\\in\[0,2\\pi\)\.Throughout our experiments, the viscosity coefficient is set toν=0\.1\\nu=0\.1\. Expanding the solutionu\(t,x\)u\(t,x\)in terms of Fourier modes u\(t,x\)=∑k∈ℤθk\(t\)eikx,u\(0,x\)=u0\(x\),u\(t,x\)=\\sum\_\{k\\in\\mathbb\{Z\}\}\\theta\_\{k\}\(t\)e^\{ikx\},\\qquad u\(0,x\)=u\_\{0\}\(x\),we can rewrite the viscous Burgers’ equation as a system of ODEs θk′\(t\)=−ik2∑p,q∈ℤp\+q=kθp\(t\)θq\(t\)−νk2θk\(t\)≕Rk\(θ\(t\)\),fork∈ℤ\.\\theta^\{\\prime\}\_\{k\}\(t\)=\-\\frac\{ik\}\{2\}\\sum\_\{\\begin\{subarray\}\{c\}p,q\\in\\mathbb\{Z\}\\\\ p\+q=k\\end\{subarray\}\}\\theta\_\{p\}\(t\)\\theta\_\{q\}\(t\)\-\\nu k^\{2\}\\theta\_\{k\}\(t\)\\eqcolon R\_\{k\}\\bigl\(\\theta\(t\)\\bigr\),\\quad\\text\{for\\quad\}k\\in\\mathbb\{Z\}\.\(4\)Here θk\(0\)=\(u0^\)k=12π∫02πe−ikxu0\(x\)dx,\\theta\_\{k\}\(0\)=\\bigl\(\\widehat\{u\_\{0\}\}\\bigr\)\_\{k\}=\\frac\{1\}\{2\\pi\}\\int\_\{0\}^\{2\\pi\}e^\{\-ikx\}u\_\{0\}\(x\)\\,\\mathrm\{d\}x,and the reality ofuuimpliesθ−k\(t\)=θk\(t\)¯\\theta\_\{\-k\}\(t\)=\\overline\{\\theta\_\{k\}\(t\)\}fork∈ℤk\\in\\mathbb\{Z\}, so that the negatively indexed Fourier modes are completely determined by the positively indexed ones\. Sinceθ0′\(t\)=0\\theta^\{\\prime\}\_\{0\}\(t\)=0, the zero Fourier mode remains constant in time\. Therefore, without loss of generality, we restrict attention to problems with a vanishing zero Fourier mode, i\.e\.,θ0\(0\)=0\\theta\_\{0\}\(0\)=0, so thatθ0\(t\)=0\\theta\_\{0\}\(t\)=0for all nonnegativett\. To construct a reduced\-order system, we retain the firstMMpositively indexed Fourier modes as resolved modes and treat the remaining positively indexed modes as unresolved\. The corresponding negatively indexed modes are determined by the reality constraint\. For notational convenience, we also represent the complex Fourier modes in real coordinates and write θk\(t\)=ϕk\(t\)\+iψk\(t\),fork∈ℤ\.\\theta\_\{k\}\(t\)=\\phi\_\{k\}\(t\)\+i\\psi\_\{k\}\(t\),\\quad\\text\{for\\quad\}k\\in\\mathbb\{Z\}\.In our experiments, we retainM=3M=3positively indexed complex Fourier modes as resolved variables, corresponding to66real variables\. The evolution of the resolved modes alone is not closed, since the unresolved Fourier modes continue to influence the low\-frequency dynamics through nonlinear interactions\. Following our discussion in[Section2](https://arxiv.org/html/2606.05371#S2), this unresolved contribution is represented by a closure term\. For each resolved modek=1,2,…,Mk=1,2,\\ldots,M, we define ℳk\(t\)=Rk\(θ\(t\)\)−Rk\(θ~\(t\)\),\\mathcal\{M\}\_\{k\}\(t\)=R\_\{k\}\\bigl\(\\theta\(t\)\\bigr\)\-R\_\{k\}\\bigl\(\\widetilde\{\\theta\}\(t\)\\bigr\),whereθ~\(t\)\\widetilde\{\\theta\}\(t\)denotes the truncated Fourier state obtained by retaining only the resolved positively indexed modes and enforcing the corresponding reality constraint\. The first termRk\(θ\(t\)\)R\_\{k\}\(\\theta\(t\)\)represents the exact evolution of the resolved mode computed from the full Fourier state, while the second termRk\(θ~\(t\)\)R\_\{k\}\(\\widetilde\{\\theta\}\(t\)\)corresponds to the reduced dynamics obtained after truncation in the reduced\-order model\. We further decompose the closure term into real and imaginary parts, ℳk\(t\)=mkϕ\(t\)\+imkψ\(t\)\.\\mathcal\{M\}\_\{k\}\(t\)=m\_\{k\}^\{\\phi\}\(t\)\+im\_\{k\}^\{\\psi\}\(t\)\. In our numerical experiments, the sequence models are trained to learn \{mkϕ,mkψ\}k=1M\\\{m\_\{k\}^\{\\phi\},m\_\{k\}^\{\\psi\}\\\}\_\{k=1\}^\{M\}from the trajectory of the real representation \{ϕk,ψk\}k=1M\\\{\\phi\_\{k\},\\psi\_\{k\}\\\}\_\{k=1\}^\{M\}of the resolved Fourier modes\. During inference, as described in[Section2\.3](https://arxiv.org/html/2606.05371#S2.SS3), the trained sequence model is coupled with the reduced\-order dynamics through an autoregressive rollout procedure to evolve the resolved modes from the initial condition\. In data preparation, the initial conditions are sampled in the resolved Fourier modes by drawing bothϕk\(0\)\\phi\_\{k\}\(0\)andψk\(0\)\\psi\_\{k\}\(0\)uniformly from the interval\(−e−k,e−k\)\(\-\\mathrm\{e\}^\{\-k\},\\mathrm\{e\}^\{\-k\}\)fork=1,2,…,Mk=1,2,\\ldots,M, and setting the remaining modes to zero\. This produces random low\-frequency initial conditions with exponentially decaying amplitudes\. To prepare training data, Burgers’ equation is first solved on the time interval\[0,1\]\[0,1\]using a pseudospectral method and a fourth\-order Runge–Kutta integrator with time stepΔt=10−4\\Delta t=10^\{\-4\}\. In training the MAC model, the neural network directly predicts the closure term, and the training loss is computed by comparing the predicted closure term with the reference value\. We now present the experimental results obtained with the MAC model and compare them against those of the corresponding Markovian reduced\-order model without closure correction\. We first examine the prediction accuracy within the training time interval\[0,1\]\[0,1\], corresponding to a temporal interpolation regime\.[Figure2](https://arxiv.org/html/2606.05371#S3.F2)compares the predicted trajectories of the resolved Fourier modes against the true solutions for one representative test initial condition\. Figure 2:Comparison of resolved\-mode predictions from the MAC model and the Markovian reduced\-order model on one representative test initial condition over the temporal interpolation regime\[0,1\]\[0,1\]\. For each resolved Fourier mode, the relativeL2L^\{2\}error over the time interval\[0,1\]\[0,1\]is also reported\.From[Figure2](https://arxiv.org/html/2606.05371#S3.F2), we observe that the MAC model matches the true solution more closely than the Markovian model, especially for higher\-frequency modes\. The mean relativeL2L^\{2\}error for each Fourier mode across all initial conditions in the test dataset, reported in[Table1](https://arxiv.org/html/2606.05371#S3.T1), further confirms this observation\. Both the mean relativeL2L^\{2\}error and the corresponding standard deviation are roughly one order of magnitude smaller for the MAC model than for the Markovian model, indicating substantially improved prediction accuracy\. Table 1:RelativeL2L^\{2\}error statistics for each resolved Fourier mode over the temporal interpolation regime\[0,1\]\[0,1\]\. The mean and standard deviation are computed across all initial conditions in the test dataset and over multiple random seeds for both the MAC model and the Markovian reduced\-order model\.We then examine the prediction accuracy over the longer time interval\[0,2\]\[0,2\]\. Recall that the training dataset was generated only on the time interval\[0,1\]\[0,1\]; therefore, this test corresponds to a temporal extrapolation regime\.[Figure3](https://arxiv.org/html/2606.05371#S3.F3)compares the predicted trajectories of the resolved Fourier modes against the true solutions on the same representative test initial condition as in[Figure2](https://arxiv.org/html/2606.05371#S3.F2)\. Figure 3:Comparison of resolved\-mode predictions from the MAC model and the Markovian reduced\-order model on the same representative test initial condition over the temporal extrapolation regime\[0,2\]\[0,2\]\. For each resolved Fourier mode, the relativeL2L^\{2\}errors over the intervals\[0,1\]\[0,1\],\[1,2\]\[1,2\], and\[0,2\]\[0,2\]are also reported\.From[Figure3](https://arxiv.org/html/2606.05371#S3.F3), we observe that the MAC model continues to match the true solution closely even on the temporal extrapolation interval\[1,2\]\[1,2\], while that of the Markovian reduced\-order model rapidly deteriorates and deviates significantly from the true dynamics\. The mean relativeL2L^\{2\}error for each Fourier mode across all initial conditions in the test dataset, reported in[Table2](https://arxiv.org/html/2606.05371#S3.T2), further confirms this observation\. Compared with the temporal interpolation regime, both models exhibit increased prediction error on the extrapolation interval\[1,2\]\[1,2\], reflecting the accumulation of rollout error outside the training horizon\. Nevertheless, the MAC model maintains stable long\-time prediction accuracy across all resolved modes, whereas the Markovian reduced\-order model suffers substantial error growth during extrapolation\. The performance gap becomes increasingly pronounced for higher\-frequency resolved modes\. In particular, for the third resolved Fourier mode, the mean relativeL2L^\{2\}error of the Markovian model reaches approximately20%20\\%to40%40\\%, while the corresponding error produced by the MAC model remains around1%1\\%\. This observation suggests that unresolved nonlinear interactions become increasingly important for higher\-frequency resolved dynamics, and that the memory\-aware MAC model is significantly more effective at capturing these long\-time non\-Markovian memory effects\. Table 2:RelativeL2L^\{2\}error statistics for each resolved Fourier mode over the temporal extrapolation regime\. Results are reported on both the full rollout interval\[0,2\]\[0,2\]and the pure extrapolation interval\[1,2\]\[1,2\]\. The mean and standard deviation are computed across all initial conditions in the test dataset and over multiple random seeds for both the MAC model and the Markovian reduced\-order model\.We further compare the evolution of the absoluteL2L^\{2\}error in physical space over the entire extrapolation interval\[0,2\]\[0,2\]\. At each time instant, theL2L^\{2\}error on the physical\-space intervalx∈\[0,2π\)x\\in\[0,2\\pi\)is computed between the reconstructed physical\-space field obtained from the resolved Fourier modes and the corresponding true resolved field extracted from the full\-order simulation\. As shown in[Figure4](https://arxiv.org/html/2606.05371#S3.F4), the error produced by the MAC model remains consistently small throughout the rollout, whereas the error produced by the Markovian reduced\-order model grows rapidly over time\. This further demonstrates the effectiveness of the MAC model in capturing memory effects in reduced\-order dynamics\. Figure 4:Evolution of the physical\-spaceL2L^\{2\}error over the temporal extrapolation interval\[0,2\]\[0,2\]for the same representative test initial condition, comparing the MAC model and the Markovian reduced\-order model\.Recall that both the training and test datasets discussed above are generated from random low\-frequency Fourier initial conditions, where the resolved coefficientsϕk\(0\)\\phi\_\{k\}\(0\)andψk\(0\)\\psi\_\{k\}\(0\)are independently sampled from the interval\(−e−k,e−k\)\(\-\\mathrm\{e\}^\{\-k\},\\mathrm\{e\}^\{\-k\}\)fork=1,2,…,Mk=1,2,\\ldots,M\. Although the temporal extrapolation results above already demonstrate strong long\-time stability of the MAC closure model, the corresponding initial conditions still lie within the same distributional setting as the training data\. To further investigate out\-of\-distribution generalization, we further consider several structured analytic initial conditions that lie outside the training distribution: u0\(x\)=sinx,u0\(x\)=esinx,u0\(x\)=cos\(2sinx\)\.u\_\{0\}\(x\)=\\sin x,\\qquad u\_\{0\}\(x\)=\\mathrm\{e\}^\{\\sin x\},\\qquad u\_\{0\}\(x\)=\\cos\(2\\sin x\)\.These three initial conditions lie substantially outside the training support in both amplitude and Fourier structure\. In particular, the dominant low\-frequency Fourier coefficients exceed the corresponding training bounds by factors ranging from approximately22to55, as summarized in[Table3](https://arxiv.org/html/2606.05371#S3.T3)\. Moreover, the functioncos\(2sinx\)\\cos\(2\\sin x\)exhibits a highly structured even\-mode Fourier pattern that is entirely absent from the training data, providing a strong structural distribution shift in addition to the substantial amplitude extrapolation\. Table 3:Comparison between the dominant Fourier coefficients of the out\-of\-distribution test initial conditions and the corresponding coefficient bounds of the training initial condition distribution\.For these out\-of\-distribution initial conditions, we perform long\-time rollout on the time intervalt∈\[0,20\]t\\in\[0,20\], whereas the training dataset was generated only on the time interval\[0,1\]\[0,1\]\. We compare the MAC model against both the Markovian reduced\-order model and the linear and cubic memory models proposed in\[[undefx](https://arxiv.org/html/2606.05371#bib.bibx25)\]\. At each time instant, the absolute physical\-spaceL2L^\{2\}error is computed between the reconstructed resolved physical\-space field and the corresponding true resolved field extracted from the full\-order simulation\. The resulting error evolutions for the three out\-of\-distribution initial conditions are shown in[Figures5](https://arxiv.org/html/2606.05371#S3.F5),[6](https://arxiv.org/html/2606.05371#S3.F6)and[7](https://arxiv.org/html/2606.05371#S3.F7)\. Figure 5:Evolution of the physical\-spaceL2L^\{2\}error for the out\-of\-distribution initial conditionu0\(x\)=sinxu\_\{0\}\(x\)=\\sin xover the long\-time rollout interval\[0,20\]\[0,20\], comparing the MAC model, the Markovian reduced\-order model, and the linear and cubic memory models in\[[undefx](https://arxiv.org/html/2606.05371#bib.bibx25)\]\.Figure 6:Evolution of the physical\-spaceL2L^\{2\}error for the out\-of\-distribution initial conditionu0\(x\)=esinxu\_\{0\}\(x\)=\\mathrm\{e\}^\{\\sin x\}over the long\-time rollout interval\[0,20\]\[0,20\], comparing the MAC model, the Markovian reduced\-order model, and the linear and cubic memory models in\[[undefx](https://arxiv.org/html/2606.05371#bib.bibx25)\]\.Figure 7:Evolution of the physical\-spaceL2L^\{2\}error for the out\-of\-distribution initial conditionu0\(x\)=cos\(2sinx\)u\_\{0\}\(x\)=\\cos\(2\\sin x\)over the long\-time rollout interval\[0,20\]\[0,20\], comparing the MAC model, the Markovian reduced\-order model, and the linear and cubic memory models in\[[undefx](https://arxiv.org/html/2606.05371#bib.bibx25)\]\.As shown in[Figures5](https://arxiv.org/html/2606.05371#S3.F5),[6](https://arxiv.org/html/2606.05371#S3.F6)and[7](https://arxiv.org/html/2606.05371#S3.F7), the MAC model maintains stable long\-time prediction accuracy across all three out\-of\-distribution initial conditions\. For the first two test cases, the prediction accuracy of the MAC model is comparable to that of linear memory model in\[[undefx](https://arxiv.org/html/2606.05371#bib.bibx25)\], even though these initial conditions lie significantly outside the support of the training initial condition distribution\. For the third and most challenging structured initial condition, the MAC model still substantially outperforms the Markovian reduced\-order model and remains comparable to the two explicit memory\-based approaches in\[[undefx](https://arxiv.org/html/2606.05371#bib.bibx25)\]\. We note that both the linear and cubic memory models of in\[[undefx](https://arxiv.org/html/2606.05371#bib.bibx25)\]require the evaluation of convolution integrals to estimate the memory during rollout\. In contrast, the MAC model learns the effective non\-Markovian dynamics directly from data and performs inference through a lightweight autoregressive sequence model\. These comparisons suggest that the proposed MAC model achieves strong long\-time predictive accuracy while maintaining favorable computational efficiency for long rollout horizons\. We further investigate the physical consistency of the learned reduced\-order dynamics through the evolution of the resolved energy\. For the resolved Fourier modes, we define the resolved energy by Eres\(t\)=∑k=1M\|θk\(t\)\|2=∑k=1M\(\|ϕk\(t\)\|2\+\|ψk\(t\)\|2\)\.E\_\{\\mathrm\{res\}\}\(t\)=\\sum\_\{k=1\}^\{M\}\\big\\lvert\\theta\_\{k\}\(t\)\\big\\rvert^\{2\}=\\sum\_\{k=1\}^\{M\}\\big\(\\big\\lvert\\phi\_\{k\}\(t\)\\big\\rvert^\{2\}\+\\big\\lvert\\psi\_\{k\}\(t\)\\big\\rvert^\{2\}\\big\)\. Figure 8:Evolution of the resolved energy over the temporal extrapolation interval\[0,2\]\[0,2\]for the same representative test initial condition, comparing the MAC model and the Markovian reduced\-order model\.[Figure8](https://arxiv.org/html/2606.05371#S3.F8)compares the evolution of the resolved energy for the same representative extrapolation test case considered above\. As expected for the viscous Burgers’ equation, the resolved energy decays over time due to viscous dissipation and nonlinear energy transfer to the unresolved modes\. We observe that the MAC model accurately reproduces the evolution of the resolved energy throughout the extrapolation interval, whereas the Markovian reduced\-order model exhibits noticeable deviation from the true energy trajectory\. This indicates that the MAC model not only improves trajectory\-level prediction accuracy, but also better captures the physically consistent long\-time energy behavior induced by unresolved nonlinear interactions and memory effects\. Finally, we investigate the computational scalability of the proposed MAC model in both training and inference\. We note that the autoregressive rollouts reported in the previous experiments are performed in parallel over the entire test batch, which contains5050initial conditions for the in\-distribution test set and33initial conditions for the out\-of\-distribution test cases\. In the timing experiments below, the batch size is fixed throughout\. During training, we measure the average wall\-clock time per training iteration for different input sequence lengths fed into the model\. The results are reported in[Table4](https://arxiv.org/html/2606.05371#S3.T4)\. As shown in[Table4](https://arxiv.org/html/2606.05371#S3.T4), the training cost per iteration grows approximately linearly with the input sequence length\. Table 4:Training cost of the MAC model for different input sequence lengths\. The batch size is fixed at9696throughout the timing experiments\.During autoregressive inference, we measure the computational cost of advancing one rollout step for the entire batch\. The results for different rollout horizons are reported in[Table5](https://arxiv.org/html/2606.05371#S3.T5)\. As shown in[Table5](https://arxiv.org/html/2606.05371#S3.T5), the inference cost per autoregressive rollout step remains essentially constant even for long rollout horizons, consistent with the autoregressive inference framework described in[Section2\.3](https://arxiv.org/html/2606.05371#S2.SS3), where temporal information is propagated through the hidden state without repeatedly processing the entire temporal history\. This behavior is particularly advantageous for long\-time reduced\-order simulation, where the dynamics must be advanced sequentially over many time steps\. Table 5:Inference cost per autoregressive rollout step for different rollout horizons\. The batch size is fixed at5050throughout the timing experiments\.Compared with transformer\-based sequence models, whose training cost typically grows rapidly with sequence length, the approximately linear scaling observed in[Table4](https://arxiv.org/html/2606.05371#S3.T4)highlights the favorable scalability of the Mamba\-based architecture for long training trajectories\[[undefac](https://arxiv.org/html/2606.05371#bib.bibx30)\]\. Similarly, during inference, the MAC model maintains essentially constant per\-step rollout cost through a lightweight state\-space update, in contrast to explicit memory\-based approaches that require repeated evaluation of memory integrals over the trajectory history\. Moreover, compared with traditional recurrent architectures such as GRU/LSTM, which are known to become increasingly difficult to train on long sequences due to gradient propagation issues, the Mamba architecture remains effective for the very long training trajectories arising from the viscous Burgers’ equation\. ### 3\.2Two\-Scale Lorenz ’96 System The two\-scale Lorenz ’96 system is a standard benchmark for studying multiscale chaotic dynamics and reduced\-order closure modeling\[[undefr](https://arxiv.org/html/2606.05371#bib.bibx19),[undefc](https://arxiv.org/html/2606.05371#bib.bibx4)\]\. Originally introduced as a simplified atmospheric model, it has since become a canonical test problem in weather prediction, data assimilation, and multiscale modeling\. The system consists of a set of slow variablesXk\(t\)X\_\{k\}\(t\)coupled with a larger collection of fast variablesYj\(t\)Y\_\{j\}\(t\): dXkdt=−Xk−1\(Xk−2−Xk\+1\)−Xk\+F−hcb∑j=J\(k−1\)\+1kJYj,k=1,2,…,N,\\frac\{\\mathrm\{d\}X\_\{k\}\}\{\\mathrm\{d\}t\}=\-X\_\{k\-1\}\(X\_\{k\-2\}\-X\_\{k\+1\}\)\-X\_\{k\}\+F\-\\frac\{hc\}\{b\}\\sum\_\{j=J\(k\-1\)\+1\}^\{kJ\}Y\_\{j\},\\qquad k=1,2,\\ldots,N,and dYjdt=−cbYj\+1\(Yj\+2−Yj−1\)−cYj\+hcbX⌊\(j−1\)/J⌋\+1,j=1,2,…,JN\.\\frac\{\\mathrm\{d\}Y\_\{j\}\}\{\\mathrm\{d\}t\}=\-cbY\_\{j\+1\}\(Y\_\{j\+2\}\-Y\_\{j\-1\}\)\-cY\_\{j\}\+\\frac\{hc\}\{b\}X\_\{\\lfloor\(j\-1\)/J\\rfloor\+1\},\\qquad j=1,2,\\ldots,JN\.Here the variables\{Xk\}k=1N\\\{X\_\{k\}\\\}\_\{k=1\}^\{N\}represent the slow\-varying large\-scale dynamics, while\{Yj\}j=1JN\\\{Y\_\{j\}\\\}\_\{j=1\}^\{JN\}represent the fast\-varying small\-scale variables\. The interaction between the fast and slow variables enters the slow dynamics through the coupling term Uk\(t\)≔−hcb∑j=J\(k−1\)\+1kJYj,k=1,2,…,N\.U\_\{k\}\(t\)\\coloneq\-\\frac\{hc\}\{b\}\\sum\_\{j=J\(k\-1\)\+1\}^\{kJ\}Y\_\{j\},\\qquad k=1,2,\\ldots,N\.In our experiments, we setN=8N=8,J=32J=32,F=10F=10,h=1h=1,c=10c=10, andb=10b=10, corresponding to a chaotic multiscale regime commonly used in the Lorenz ’96 literature\[[undefi](https://arxiv.org/html/2606.05371#bib.bibx10)\]\. To verify numerically that the system is chaotic under our specific configuration, we estimated the maximum Lyapunov exponent via the Benettin renormalization method\[[undefa](https://arxiv.org/html/2606.05371#bib.bibx2)\]applied to the full state vector\(X,Y\)∈ℝN\(1\+J\)\(X,Y\)\\in\\mathbb\{R\}^\{N\(1\+J\)\}\. The running estimate converged overt=250t=250to λmax≈6\.426\>0,\\lambda\_\{\\max\}\\approx 6\.426\>0,confirming sensitive dependence on initial conditions and providing numerical evidence that the system operates in a chaotic regime\. In the reduced\-order setting, the slow variables\{Xk\}k=1N\\\{X\_\{k\}\\\}\_\{k=1\}^\{N\}are treated as resolved variables, while the fast variables\{Yj\}j=1JN\\\{Y\_\{j\}\\\}\_\{j=1\}^\{JN\}are regarded as unresolved degrees of freedom\. The resolved dynamics may therefore be written as dXkdt=−Xk−1\(Xk−2−Xk\+1\)−Xk\+F\+Uk,k=1,2,…,N\.\\frac\{\\mathrm\{d\}X\_\{k\}\}\{\\mathrm\{d\}t\}=\-X\_\{k\-1\}\(X\_\{k\-2\}\-X\_\{k\+1\}\)\-X\_\{k\}\+F\+U\_\{k\},\\qquad k=1,2,\\ldots,N\.The closure modeling problem then consists of learning an effective approximation of the coupling term\{Uk\}k=1N\\\{U\_\{k\}\\\}\_\{k=1\}^\{N\}from the temporal history of the resolved variables\. In our experiments, the MAC model is compared against the GRU\-based sequence model and the statistical method proposed by Wilks, which models the coupling term using polynomial regression together with autoregressive stochastic noise\[[undefd](https://arxiv.org/html/2606.05371#bib.bibx5),[undefae](https://arxiv.org/html/2606.05371#bib.bibx32)\]\. In training data preparation, the full two\-scale Lorenz ’96 system is first simulated using the fine time stepΔtfine=0\.005\\Delta t\_\{\\mathrm\{fine\}\}=0\.005\. The resulting trajectories are then downsampled by a factor of two in time, yielding coarse observations with time stepΔt=0\.01\\Delta t=0\.01\. A single long coarse trajectory of length1400114001over the time interval\[0,140\]\[0,140\]is generated after an initial warmup stage that removes transient effects\. The first1000110001coarse time steps over the time interval\[0,100\]\[0,100\]are used to construct the training dataset\. More specifically, sliding windows of length6464with stride11are extracted from the long trajectory and used as training sequences for the sequence models considered in this study\. To investigate temporal interpolation and extrapolation performance separately, we additionally construct two deterministic test segments from the same long simulation trajectory\. The first20012001coarse time steps over the interval\[0,20\]\[0,20\]are used for temporal interpolation evaluation, corresponding to prediction within the time range covered by the training data\. A separate segment consisting of coarse time steps1200012000–1400014000over the interval\[120,140\]\[120,140\]is used for temporal extrapolation evaluation, corresponding to prediction beyond the training time range\. Unlike the Burgers’ experiments, where the sequence models are trained directly against the closure term, the sequence model for the Lorenz ’96 system is trained using a multi\-step resolved\-variable rollout loss\. More specifically, the sequence model first predicts the closure term from the temporal history of the resolved variables\. The predicted closure is then coupled with the reduced slow dynamics, which are subsequently advanced in time using fourth\-order Runge–Kutta method \(see[SectionB\.8](https://arxiv.org/html/2606.05371#A2.SS8)\)\. The training loss is computed by comparing the resulting multi\-step rollout of the resolved variables against the corresponding true resolved trajectory\. In our experiments, we use a five\-step rollout loss\. See[SectionB\.6](https://arxiv.org/html/2606.05371#A2.SS6)for details\. We first examine the temporal interpolation performance of the MAC model and the baseline methods\. Recall that the training trajectories are generated over the time interval\[0,100\]\[0,100\]\. The following experiments evaluate autoregressive prediction accuracy over the time interval\[0,20\]\[0,20\], which lies entirely within the training time range\. Unless otherwise stated, the trajectory\-level visualizations shown below correspond to a single representative random seed\. [Figure9](https://arxiv.org/html/2606.05371#S3.F9)compares the predicted trajectories of the resolved slow variables against the true resolved trajectories extracted from the full Lorenz ’96 simulation on the time interval\[0,20\]\[0,20\]\. As shown in[Figure9](https://arxiv.org/html/2606.05371#S3.F9), the MAC model accurately reproduces the oscillatory and chaotic behavior of the resolved slow variables throughout the rollout time interval\. In contrast, both the GRU\-based model and the Wilks method gradually deviate from the true dynamics over time\. In long\-time autoregressive rollout, phase errors may gradually accumulate due to the chaotic nature of the dynamics\. Nevertheless, the MAC model continues to deliver substantially improved trajectory accuracy and maintains the correct large\-scale oscillatory behavior over long time horizons\. Figure 9:Comparison of resolved slow\-variable predictions from the MAC model, the GRU\-based model, and the Wilks method over the temporal interpolation regime\[0,20\]\[0,20\]\. For each resolved slow variable, the relativeL2L^\{2\}error is also reported\.To quantify the prediction accuracy, we examine the evolution of the running cumulative relativeL2L^\{2\}error\. For a rollout trajectory, the cumulative relativeL2L^\{2\}error at timetit\_\{i\}is defined by \(∑j=0i∥X\(tj\)−X~\(tj\)∥22∑j=0T−1∥X\(tj\)∥22\)1/2,\\Biggl\(\\frac\{\\sum\_\{j=0\}^\{i\}\\lVert X\(t\_\{j\}\)\-\\widetilde\{X\}\(t\_\{j\}\)\\rVert\_\{2\}^\{2\}\}\{\\sum\_\{j=0\}^\{T\-1\}\\lVert X\(t\_\{j\}\)\\rVert\_\{2\}^\{2\}\}\\Biggr\)^\{1/2\},whereT=2001T=2001is the full sequence length, andXXandX~\\widetilde\{X\}denote the true and predicted vectors of resolved variables, respectively\. As shown in[Figure10](https://arxiv.org/html/2606.05371#S3.F10), the cumulative relativeL2L^\{2\}error produced by the MAC model grows substantially more slowly than those of the GRU\-based model and the Wilks method\. This indicates that the MAC model maintains stable prediction accuracy during long\-time autoregressive rollout\. Figure 10:Evolution of the running cumulative relativeL2L^\{2\}error over the temporal interpolation regime\[0,20\]\[0,20\]for the Lorenz ’96 system, comparing the MAC model, the GRU\-based model, and the Wilks method\.We next examine the running cumulative correlation coefficient between the predicted and true resolved trajectories\. The running cumulative correlation coefficient at timetit\_\{i\}is defined by Corr\(ti\)=∑j=0i⟨X\(tj\),X~\(tj\)⟩\(∑j=0i∥X\(tj\)∥22\)1/2\(∑j=0i∥X~\(tj\)∥22\)1/2\.\\mathrm\{Corr\}\(t\_\{i\}\)=\\frac\{\\sum\_\{j=0\}^\{i\}\\bigl\\langle X\(t\_\{j\}\),\\widetilde\{X\}\(t\_\{j\}\)\\bigr\\rangle\}\{\\Bigl\(\\sum\_\{j=0\}^\{i\}\\lVert X\(t\_\{j\}\)\\rVert\_\{2\}^\{2\}\\Bigr\)^\{1/2\}\\Bigl\(\\sum\_\{j=0\}^\{i\}\\lVert\\widetilde\{X\}\(t\_\{j\}\)\\rVert\_\{2\}^\{2\}\\Bigr\)^\{1/2\}\}\. Figure 11:Evolution of the running cumulative correlation coefficient between the predicted and true resolved variables over the temporal interpolation regime\[0,20\]\[0,20\]for the Lorenz ’96 system, comparing the MAC model, the GRU\-based model, and the Wilks method\.From[Figure11](https://arxiv.org/html/2606.05371#S3.F11), we observe that the MAC model maintains consistently higher correlation with the true dynamics throughout the rollout interval\. In contrast, the GRU\-based model and the Wilks method exhibit more rapid degradation of correlation over the rollout interval\. Finally, beyond pointwise trajectory accuracy, we examine whether the generated trajectories preserve the correct temporal statistical structure of the underlying chaotic dynamics\. To this end, we compare the temporal autocorrelation function \(ACF\) of the predicted closure trajectories\. Note that although the sequence models for the Lorenz ’96 system are trained using multi\-step rollout loss on the resolved variables, the predicted closure terms are still recorded during autoregressive inference rollout \(see[SectionA\.2](https://arxiv.org/html/2606.05371#A1.SS2)\)\. For a given time lagτ\\tau, we define the uncentered temporal autocorrelation function by ACF\(τ\)=∑t=0T−τ∑k=1NUk\(t\)Uk\(t\+τ\)∑t=0T∑k=1NUk\(t\)2,\\mathrm\{ACF\}\(\\tau\)=\\frac\{\\sum\_\{t=0\}^\{T\-\\tau\}\\sum\_\{k=1\}^\{N\}U\_\{k\}\(t\)U\_\{k\}\(t\+\\tau\)\}\{\\sum\_\{t=0\}^\{T\}\\sum\_\{k=1\}^\{N\}U\_\{k\}\(t\)^\{2\}\},whereT=2001T=2001is the full sequence length\. Figure 12:Temporal autocorrelation function of the generated closure trajectories over the temporal interpolation regime\[0,20\]\[0,20\]for the Lorenz ’96 system, comparing the MAC model, the GRU\-based model, the Wilks method, and the true dynamics\.As shown in[Figure12](https://arxiv.org/html/2606.05371#S3.F12), the autocorrelation structures of the closure terms generated by all three models are qualitatively similar, indicating that all of them are able to reproduce, at least to some extent, the temporal statistical structure of the closure dynamics from the history of the resolved variables\. Meanwhile, the true dynamics exhibit slightly faster decorrelation due to the strongly chaotic nature of the underlying Lorenz ’96 system\. However, despite these similar autocorrelation behaviors, the MAC model achieves substantially better trajectory prediction accuracy and long\-time rollout stability, as demonstrated in[Figures9](https://arxiv.org/html/2606.05371#S3.F9),[10](https://arxiv.org/html/2606.05371#S3.F10)and[11](https://arxiv.org/html/2606.05371#S3.F11)\. This improvement is likely attributable to the selective mechanism in the Mamba architecture, which enables the model to adaptively retain or discard temporal information according to the evolving input dynamics during autoregressive rollout\. We next examine the temporal extrapolation performance of the MAC model and the baseline methods\. In this experiment, the models are evaluated on a trajectory segment over the physical time interval\[120,140\]\[120,140\], which lies entirely outside the training time interval\[0,100\]\[0,100\]\. For convenience, the rollout time within this extrapolation window is shifted and displayed as the interval\[0,20\]\[0,20\]in the following figures\.[Figures13](https://arxiv.org/html/2606.05371#S3.F13),[14](https://arxiv.org/html/2606.05371#S3.F14)and[15](https://arxiv.org/html/2606.05371#S3.F15)compare the extrapolation rollout performance of the three methods through, respectively, trajectory visualization, the running cumulative relativeL2L^\{2\}error, and the running cumulative correlation coefficient\. As shown in[Figures13](https://arxiv.org/html/2606.05371#S3.F13),[14](https://arxiv.org/html/2606.05371#S3.F14)and[15](https://arxiv.org/html/2606.05371#S3.F15), the MAC model continues to accurately capture the chaotic oscillatory behavior of the resolved variables even outside the temporal regime used for training\. The cumulative relativeL2L^\{2\}error of the MAC model grows substantially more slowly than those of the GRU\-based model and the Wilks method, while its correlation coefficient remains consistently higher throughout the extrapolation rollout interval\. In contrast, the GRU\-based model rapidly loses phase accuracy and exhibits substantial amplitude distortion during long\-time rollout, while the Wilks method, though more stable than the GRU\-based model, also remains consistently less accurate than the MAC model\. Figure 13:Comparison of resolved slow\-variable predictions from the MAC model, the GRU\-based model, and the Wilks method over the temporal extrapolation regime\[120,140\]\[120,140\]\(displayed as\[0,20\]\[0,20\]\)\. For each resolved slow variable, the relativeL2L^\{2\}error is also reported\.Figure 14:Evolution of the running cumulative relativeL2L^\{2\}error over the temporal extrapolation regime\[120,140\]\[120,140\]\(displayed as\[0,20\]\[0,20\]\) for the Lorenz ’96 system, comparing the MAC model, the GRU\-based model, and the Wilks method\.Figure 15:Evolution of the running cumulative correlation coefficient between the predicted and true resolved variables over the temporal extrapolation regime\[120,140\]\[120,140\]\(displayed as\[0,20\]\[0,20\]\) for the Lorenz ’96 system, comparing the MAC model, the GRU\-based model, and the Wilks method\.Figure 16:Temporal autocorrelation function of the generated closure trajectories over the temporal extrapolation regime\[120,140\]\[120,140\]\(displayed as\[0,20\]\[0,20\]\) for the Lorenz ’96 system, comparing the MAC model, the GRU\-based model, the Wilks method, and the true dynamics\.Finally,[Figure16](https://arxiv.org/html/2606.05371#S3.F16)compares the temporal autocorrelation structures of the generated closure trajectories during temporal extrapolation rollout\. Similar to the temporal interpolation experiment, all three models reproduce qualitatively similar autocorrelation behavior, whereas the true dynamics exhibit slightly faster decorrelation due to the strongly chaotic nature of the Lorenz ’96 system\. Despite these similar statistical behaviors, the MAC model again demonstrates substantially superior trajectory prediction accuracy and rollout stability, as shown in[Figures13](https://arxiv.org/html/2606.05371#S3.F13),[14](https://arxiv.org/html/2606.05371#S3.F14)and[15](https://arxiv.org/html/2606.05371#S3.F15)\. To evaluate generalization to unseen initial conditions, we generate100100independent test trajectories using random initial conditions different from those used in the training trajectory\. Each test trajectory contains20012001coarse time steps over the time interval\[0,20\]\[0,20\]and is evolved independently using the same numerical solver and parameter settings\. We first present one representative unseen\-initial\-condition trajectory to illustrate the qualitative rollout behavior of the different methods\. As shown in[Figure17](https://arxiv.org/html/2606.05371#S3.F17), the MAC model continues to accurately reproduce the chaotic oscillatory behavior of the resolved variables even for initial conditions that are not observed during training\. In contrast, the GRU\-based model again exhibits substantial phase drift and amplitude distortion during autoregressive rollout, while the Wilks method, though more stable than the GRU\-based model, also remains consistently less accurate than the MAC model\. Figure 17:Comparison of resolved slow\-variable predictions from the MAC model, the GRU\-based model, and the Wilks method for one representative unseen initial condition over the time interval\[0,20\]\[0,20\]\. For each resolved slow variable, the relativeL2L^\{2\}error is also reported\.Figure 18:Evolution of the running cumulative relativeL2L^\{2\}error averaged over100100unseen\-initial\-condition test trajectories over the time interval\[0,20\]\[0,20\]for the Lorenz ’96 system, comparing the MAC model, the GRU\-based model, and the Wilks method\.Figure 19:Evolution of the running cumulative correlation coefficient between the predicted and true resolved variables, averaged over100100unseen\-initial\-condition test trajectories over the time interval\[0,20\]\[0,20\]for the Lorenz ’96 system, comparing the MAC model, the GRU\-based model, and the Wilks method\.As shown in[Figures18](https://arxiv.org/html/2606.05371#S3.F18)and[19](https://arxiv.org/html/2606.05371#S3.F19), the MAC model maintains substantially smaller cumulative relativeL2L^\{2\}error growth and consistently higher correlation with the true dynamics compared with both the GRU\-based model and the Wilks method\. These results indicate that the MAC closure model generalizes robustly to unseen initial conditions during long\-time autoregressive rollout\. Figure 20:Temporal autocorrelation function of the generated closure trajectories, averaged over100100unseen\-initial\-condition test trajectories over the time interval\[0,20\]\[0,20\]for the Lorenz ’96 system, comparing the MAC model, the GRU\-based model, the Wilks method, and the true dynamics\.Finally,[Figure20](https://arxiv.org/html/2606.05371#S3.F20)compares the temporal autocorrelation structures of the generated closure trajectories, averaged over the100100unseen\-initial\-condition test trajectories\. Similar to the temporal interpolation and extrapolation experiments, all three models reproduce qualitatively similar autocorrelation behavior, whereas the MAC model again demonstrates substantially superior trajectory prediction accuracy and rollout stability, as shown in[Figures17](https://arxiv.org/html/2606.05371#S3.F17),[18](https://arxiv.org/html/2606.05371#S3.F18)and[19](https://arxiv.org/html/2606.05371#S3.F19)\. As in the temporal interpolation experiment, this is likely attributable to the selective mechanism in the Mamba architecture, which adaptively retains the most relevant temporal information during autoregressive rollout\. To reduce potential bias associated with the choice of random seeds, we additionally perform experiments using multiple random seeds\. The final cumulative relativeL2L^\{2\}error, final correlation coefficient, and temporal autocorrelation function at lagτ=10\\tau=10are summarized in[Table6](https://arxiv.org/html/2606.05371#S3.T6)\. Overall, these results demonstrate that the MAC model provides substantially improved long\-time predictive accuracy and robustness for the resolved dynamics compared with both the GRU\-based model and the Wilks method\. Table 6:Final statistical diagnostics for the temporal interpolation, temporal extrapolation, and unseen\-initial\-condition experiments in the Lorenz ’96 system\. The first column reports the running cumulative relativeL2L^\{2\}error at the final timet=Tt=T, the second column reports the running cumulative correlation coefficient at the final timet=Tt=T, and the third column reports the value of the temporal autocorrelation functionACF\(τ\)\\mathrm\{ACF\}\(\\tau\)at lagτ=10\\tau=10\(corresponding to a physical time lag of0\.10\.1\)\. For the ACF column, the value of the true dynamics is also reported as a reference\. All values are averaged over multiple random seeds\. Note that ACF closeness to the true value alone does not necessarily reflect trajectory prediction accuracy; see the discussion in[Section3\.2](https://arxiv.org/html/2606.05371#S3.SS2)\. ## 4Conclusion In this work, we proposed the Mamba\-Assisted Closure \(MAC\) framework, which recasts non\-Markovian closure modeling for reduced\-order dynamical systems as a sequence modeling problem\. Inspired by the Mori–Zwanzig formalism, in which the closure term takes the form of a memory functional of the resolved trajectory, we model this functional with a Mamba\-based sequence model and learn it directly from data\. The use of Mamba is central to the framework: its input\-dependent selective mechanism allows the relevant memory depth to be inferred from the resolved trajectory itself, without prescribing a fixed history window or explicitly estimating a memory kernel, while its linear\-time scaling makes long training and rollout horizons computationally tractable\. The dual representation of state\-space models—a parallel convolutional form during training and a recurrent form during inference—further aligns naturally with the requirements of reduced\-order closure modeling\. We validated the MAC framework on two benchmark systems with complementary characteristics\. For the viscous Burgers’ equation in Fourier space, the MAC model reduces the relativeL2L^\{2\}error of the resolved Fourier modes by approximately one order of magnitude compared with the Markovian reduced\-order model in both the temporal interpolation and extrapolation regimes, preserves the physically expected dissipative decay of the resolved energy, and remains stable under structured out\-of\-distribution initial conditions over long rollout horizons\. For the chaotic two\-scale Lorenz ’96 system, the MAC model substantially outperforms both a GRU\-based sequence model and the classical Wilks method across temporal interpolation, temporal extrapolation, and unseen\-initial\-condition experiments, achieving correlation coefficients with the true dynamics that remain consistently above0\.990\.99across all test scenarios\. In addition, our scalability experiments confirm the architectural advantages of Mamba in this setting: training cost grows approximately linearly with sequence length, while per\-step inference cost remains essentially constant regardless of rollout horizon\. Taken together, these results indicate that treating closure modeling as a sequence modeling problem and exploiting the structural correspondence between Mori–Zwanzig memory functionals and state\-space models leads to substantial improvements in long\-time predictive accuracy, physical consistency, and computational efficiency\. The MAC framework is general and can in principle be applied to other reduced\-order modeling settings where non\-Markovian memory effects play an important role\. Several directions naturally extend the present work, including integration with stochastic closure modeling to capture inherently random unresolved dynamics, and a more systematic study of how the Mamba selective mechanism encodes effective memory depth in different physical regimes\. ## Code and Data Availability To support reproducibility and facilitate future research, the code and all accompanying data will be made publicly available upon publication acceptance\. ## Acknowledgments The work of Zhi\-Feng Wei was supported by the U\.S\. Department of Energy \(DOE\) Office of Advanced Scientific Computing Research \(ASCR\) through the ASCR Distinguished Computational Mathematics Postdoctoral Fellowship \(Project Nos\. 71268 and 83358\)\. The work of Saad Qadeer was supported by the U\.S\. Department of Energy, Office of Science, Scientific Discovery through Advanced Computing \(SciDAC\) program, via a partnership in Earth System Model Development between the ASCR and Biological and Environmental Research \(BER\) programs, as part of the project titled “Physical, Accurate, and Efficient atmosphere and surface coupling across SCALes” \(PAESCAL; proposal No\. 0000267817\)\. The work of Panos Stinis was partially supported by ASCR under the “Resolution\-invariant deep learning for accelerated propagation of epistemic and aleatory uncertainty in multi\-scale energy storage systems, and beyond” project \(Project No\. 81824\)\. The computational work was performed using the Pacific Northwest National Laboratory \(PNNL\) Research Computing facilities\. PNNL is a multi\-program national laboratory operated for the U\.S\. Department of Energy by Battelle Memorial Institute under Contract No\. DE\-AC05\-76RL01830\. ## Appendix AAdditional Closure\-Term Results The main text focuses on the resolved\-variable predictions\. Here we provide additional results for the learned closure terms, including predictions for representative initial conditions and related summary statistics\. These results complement the resolved\-variable comparisons reported in the main text\. ### A\.1Burgers’ Equation For Burgers’ equation, the main text reports autoregressive rollout results for the resolved Fourier modes in[Section3\.1](https://arxiv.org/html/2606.05371#S3.SS1)\. Here we show the corresponding closure\-term predictions for the same representative test initial condition as in[Figure2](https://arxiv.org/html/2606.05371#S3.F2)\. These plots directly compare the predicted closure terms produced by the MAC model with the true closure data\. Figure A1:Comparison between the closure\-term predictions from the MAC model and the true closure terms on the same representative test initial condition as in[Figure2](https://arxiv.org/html/2606.05371#S3.F2)over the temporal interpolation regime\[0,1\]\[0,1\]\. For each closure term, the relativeL2L^\{2\}error over the time interval\[0,1\]\[0,1\]is also reported\.Figure A2:Comparison between the closure\-term predictions from the MAC model and the true closure terms on the same representative test initial condition as in[Figure2](https://arxiv.org/html/2606.05371#S3.F2)over the temporal extrapolation regime\[0,2\]\[0,2\]\. For each closure term, the relativeL2L^\{2\}errors over the intervals\[0,1\]\[0,1\],\[1,2\]\[1,2\], and\[0,2\]\[0,2\]are also reported\.To complement the closure\-term predictions for the representative test initial condition, we also report summary statistics for the Burgers closure prediction errors\.[TableA1](https://arxiv.org/html/2606.05371#A1.T1)lists the mean relativeL2L^\{2\}errors of the MAC\-predicted closure terms on the interpolation interval\[0,1\]\[0,1\], the extrapolation interval\[1,2\]\[1,2\], and the full interval\[0,2\]\[0,2\]\. These values are computed over5050test cases and averaged over multiple random seeds\. Although the closure\-term errors are larger than the resolved\-variable errors reported in the main text, the predicted closure trajectories still capture the main trends of the true closure data\. In fact, the relativeL2L^\{2\}errors of the predicted closure terms can appear large because the closure terms themselves have small magnitudes, so that even a small absolute discrepancy may lead to a relatively large relative error\. The closure prediction errors remain comparable between the interpolation and extrapolation intervals, suggesting that the learned closure does not substantially degrade over the longer rollout\. Table A1:Mean relativeL2L^\{2\}errors of the Burgers closure terms predicted by the MAC model over the interpolation interval\[0,1\]\[0,1\], the extrapolation interval\[1,2\]\[1,2\], and the full interval\[0,2\]\[0,2\]\. The values are averaged over5050test cases and multiple random seeds\.We further include a comparison on an additional representative test initial condition for which the closure relative errors are comparatively large\. For this test case, we show both the resolved\-variable rollout and the corresponding closure\-term prediction\. This provides a more complete view of the relation between closure\-term accuracy and resolved\-variable prediction accuracy: the resolved\-variable prediction remains accurate even when the relativeL2L^\{2\}error of the closure prediction appears large\. Figure A3:Comparison of resolved\-mode predictions from the MAC model and the Markovian reduced\-order model on an additional representative test initial condition over the temporal interpolation regime\[0,1\]\[0,1\]\. For each resolved Fourier mode, the relativeL2L^\{2\}error over the time interval\[0,1\]\[0,1\]is also reported\.Figure A4:Comparison between the closure\-term predictions from the MAC model and the true closure terms on the same representative test initial condition as in[FigureA3](https://arxiv.org/html/2606.05371#A1.F3)over the temporal interpolation regime\[0,1\]\[0,1\]\. For each closure term, the relativeL2L^\{2\}error over the time interval\[0,1\]\[0,1\]is also reported\.Figure A5:Comparison of resolved\-mode predictions from the MAC model and the Markovian reduced\-order model on the same representative test initial condition as in[FigureA3](https://arxiv.org/html/2606.05371#A1.F3)over the temporal extrapolation regime\[0,2\]\[0,2\]\. For each resolved Fourier mode, the relativeL2L^\{2\}errors over the intervals\[0,1\]\[0,1\],\[1,2\]\[1,2\], and\[0,2\]\[0,2\]are also reported\.Figure A6:Comparison between the closure\-term predictions from the MAC model and the true closure terms on the same representative test initial condition as in[FigureA3](https://arxiv.org/html/2606.05371#A1.F3)over the temporal extrapolation regime\[0,2\]\[0,2\]\. For each resolved Fourier mode, the relativeL2L^\{2\}errors over the intervals\[0,1\]\[0,1\],\[1,2\]\[1,2\], and\[0,2\]\[0,2\]are also reported\. ### A\.2Lorenz ’96 System For the Lorenz ’96 system, the main text reports resolved\-variable rollout results for temporal interpolation, temporal extrapolation, and unseen initial conditions in[Section3\.2](https://arxiv.org/html/2606.05371#S3.SS2)\. Here we show the corresponding closure\-term predictions, comparing the true closure terms with the predictions from the MAC model, the GRU\-based model, and the Wilks method\. Figure A7:Comparison of closure\-term predictions from the MAC model, the GRU\-based model, and the Wilks method over the temporal interpolation regime\[0,20\]\[0,20\]for the Lorenz ’96 system\. For each closure term, the relativeL2L^\{2\}error is also reported\.Figure A8:Comparison of closure\-term predictions from the MAC model, the GRU\-based model, and the Wilks method over the temporal extrapolation regime\[120,140\]\[120,140\]\(displayed as\[0,20\]\[0,20\]\) for the Lorenz ’96 system\. For each closure term, the relativeL2L^\{2\}error is also reported\.Figure A9:Comparison of closure\-term predictions from the MAC model, the GRU\-based model, and the Wilks method for the same representative unseen initial condition as in[Figure17](https://arxiv.org/html/2606.05371#S3.F17)over the time interval\[0,20\]\[0,20\]\. For each closure term, the relativeL2L^\{2\}error is also reported\. ## Appendix BImplementation Details This appendix provides additional implementation details for the numerical experiments reported in[Section3](https://arxiv.org/html/2606.05371#S3)\. While the main text focuses on the modeling framework and empirical findings, this appendix collects the practical choices needed to reproduce the experiments, including data generation, neural network architectures, model training hyperparameters, data normalization, noise injection schedules, and training loss design\. Unless otherwise stated, the same preprocessing, training protocol, and evaluation procedure are used across all sequence models considered in the comparison\. In particular, the MAC model and the GRU\-based sequence model are trained under comparable settings, with the number of trainable parameters controlled to be of the same order whenever possible\. This setup ensures a fair comparison by avoiding differences in model capacity or optimization budget\. ### B\.1Data Generation #### B\.1\.1Viscous Burgers’ equation For the viscous Burgers’ equation, the data generation follows the Fourier\-space reduced\-order construction described in[Section3\.1](https://arxiv.org/html/2606.05371#S3.SS1)\. The trajectories of the full\-order system are generated by solving the Fourier ODE system for the coefficients\{θk\(t\)\}k∈ℤ\\\{\\theta\_\{k\}\(t\)\\\}\_\{k\\in\\mathbb\{Z\}\}using a pseudospectral method\. The spatial interval\[0,2π\)\[0,2\\pi\)is discretized usingNx=210N\_\{x\}=2^\{10\}equispaced grid points, and the full Fourier system is advanced using a fourth\-order Runge–Kutta method with time stepΔt=10−4\\Delta t=10^\{\-4\}\. The viscosity coefficient used in our experiments isν=0\.1\\nu=0\.1\. For trajectories generated from random initial conditions, the resolved Fourier modes are initialized as described in the main text: bothϕk\(0\)\\phi\_\{k\}\(0\)andψk\(0\)\\psi\_\{k\}\(0\)are sampled uniformly from\(−e−k,e−k\)\(\-\\mathrm\{e\}^\{\-k\},\\mathrm\{e\}^\{\-k\}\)fork=1,2,…,Mk=1,2,\\ldots,M, while all other positive modes are set to zero\. The corresponding negatively indexed modes are determined by the reality constraint onuu, and the zero Fourier mode is taken to vanish\. For these random initial conditions, the full Fourier system is solved on the time interval\[0,5\]\[0,5\], and the resulting resolved trajectories are saved together with the corresponding closure data for the subsequent experiments\. In total, we generate500500trajectories from random initial conditions, of which400400are used for training,5050for validation, and5050for testing\. For the out\-of\-distribution tests, we use three prescribed initial conditions: u0\(x\)=sinx,u0\(x\)=esinx,u0\(x\)=cos\(2sinx\)\.u\_\{0\}\(x\)=\\sin x,\\qquad u\_\{0\}\(x\)=\\mathrm\{e\}^\{\\sin x\},\\qquad u\_\{0\}\(x\)=\\cos\(2\\sin x\)\.For all three out\-of\-distribution initial conditions, only the first three positively indexed Fourier modes and their corresponding negatively indexed modes are retained, with the zero Fourier mode taken to vanish as described in the main text\. Specifically, to test the rollout performance over a long\-time interval for these out\-of\-distribution initial conditions, we solve the full Fourier system on the longer time interval\[0,20\]\[0,20\]and save the resulting trajectories of the resolved modes together with the corresponding closure trajectories for the subsequent experiments\. #### B\.1\.2Two\-scale Lorenz ’96 system For the two\-scale Lorenz ’96 system studied in[Section3\.2](https://arxiv.org/html/2606.05371#S3.SS2), the full slow–fast dynamics are simulated directly using a fourth\-order Runge–Kutta method\. The slow variables\{Xk\}k=1N\\\{X\_\{k\}\\\}\_\{k=1\}^\{N\}are treated as resolved variables, while the fast variables\{Yj\}j=1JN\\\{Y\_\{j\}\\\}\_\{j=1\}^\{JN\}are regarded as unresolved degrees of freedom\. The parameters used in the experiments are N=8,J=32,F=10,h=1,c=10,b=10\.N=8,\\qquad J=32,\\qquad F=10,\\qquad h=1,\\qquad c=10,\\qquad b=10\.The interaction between the fast and slow variables enters the slow dynamics through the coupling term Uk\(t\)≔−hcb∑j=J\(k−1\)\+1kJYj\(t\),k=1,2,…,N\.U\_\{k\}\(t\)\\coloneq\-\\frac\{hc\}\{b\}\\sum\_\{j=J\(k\-1\)\+1\}^\{kJ\}Y\_\{j\}\(t\),\\qquad k=1,2,\\ldots,N\. The full two\-scale Lorenz ’96 system is first simulated using the fine time step Δtfine=0\.005\.\\Delta t\_\{\\mathrm\{fine\}\}=0\.005\.The recorded fine trajectory is then downsampled by a factor of22, giving the coarse time step A single long coarse trajectory of length1400114001over the time interval\[0,140\]\[0,140\]is generated\. This long trajectory is initialized randomly, with the initial slow variables sampled aroundF/10F/10and the fast variables initialized with small random values\. Before recording data, the system is advanced for50005000fine time steps as a warm\-up period\. After this warm\-up, the first1000110001coarse time steps over the time interval\[0,100\]\[0,100\]are used to construct the training dataset\. More specifically, sliding windows of length6464with stride11are extracted from the training segment and used as training sequences for the sequence models considered in this study\. The segment over\[100,120\]\[100,120\]is used for validation during training, while the segment over\[120,140\]\[120,140\]is reserved for temporal extrapolation tests\. In addition to the single long trajectory,100100independent test trajectories are generated from unseen random initial conditions\. Similarly, for each unseen initial condition, the trajectory is initialized randomly, with the initial slow variables sampled aroundF/10F/10and the fast variables initialized with small random values\. Each test trajectory contains20012001coarse time points over the time interval\[0,20\]\[0,20\]\. The saved Lorenz ’96 datasets contain the resolved slow variablesXXand the corresponding coupling termsUU, which are used in the subsequent autoregressive rollout experiments\. ### B\.2Neural Network Architectures In the Mamba\-Assisted Closure \(MAC\) model, the Mamba\-based sequence model predicts closure terms from the resolved variables\. These predicted closure terms are then incorporated into the reduced\-order dynamics, which are advanced in time to generate autoregressive resolved trajectories\. When the MAC model is applied to Burgers’ equation and the Lorenz ’96 system, the reduced\-order dynamics differ between the two benchmark problems, but the same general neural\-network architecture is used for the Mamba\-based sequence model\. For the two benchmark problems, we use different input/output dimensions and different choices of depth and width for the Mamba\-based sequence model, which are reported in[SectionB\.3](https://arxiv.org/html/2606.05371#A2.SS3)\. For Burgers’ equation, the Mamba\-based sequence model input consists of the resolved Fourier modes and has dimension2M2M, corresponding to the real and imaginary parts of the resolved modes, whereM=3M=3in our study\. For the Lorenz ’96 system, the Mamba\-based sequence model input consists of the resolved slow variables and has dimensionN=8N=8\. In both cases, the Mamba\-based sequence model maps an input sequence of resolved variables to an output sequence of the same dimension, representing the corresponding closure term\. In this architecture, the Mamba\-based sequence model first applies a linear input projection from the input dimension to a hidden dimensiondmodeld\_\{\\mathrm\{model\}\}, followed by aSiLU\\operatorname\{SiLU\}activation\. The resulting hidden sequence is then passed through a stack of residual Mamba blocks\. Each residual Mamba block consists of a normalization layer, a Mamba layer, and a residual connection: h↦h\+Mamba\(Norm\(h\)\)\.h\\mapsto h\+\\operatorname\{Mamba\}\\bigl\(\\operatorname\{Norm\}\(h\)\\bigr\)\.Here, the Mamba layer is the module implemented using the Python packagemamba\-ssm, available from[PyPI](https://pypi.org/project/mamba-ssm/)\. After the stacked residual Mamba blocks, a final normalization layer is applied, followed by a two\-layer output projection: h↦W2σ\(W1h\),h\\mapsto W\_\{2\}\\,\\sigma\(W\_\{1\}h\),whereσ\\sigmadenotes theSiLU\\operatorname\{SiLU\}activation\. The output dimension is chosen to match the input dimension\. For comparison, the GRU\-based sequence model uses the same input and output projection structure\. The main difference is that the residual Mamba blocks are replaced by residual GRU blocks\. Each residual GRU block consists of a normalization layer, a GRU layer, and a residual connection: h↦h\+GRU\(Norm\(h\)\)\.h\\mapsto h\+\\operatorname\{GRU\}\\bigl\(\\operatorname\{Norm\}\(h\)\\bigr\)\.Here, the GRU layer is the standard module available in PyTorch\. Thus, the GRU\-based sequence model preserves the same overall sequence\-to\-sequence structure as the Mamba\-based sequence model\. In both architectures, the normalization layer is applied inside each residual block before the Mamba or GRU layer, and another normalization layer is applied before the final output projection\. The specific choices of depth, model hidden dimension, Mamba state dimension, convolution width, and GRU width are given in[SectionB\.3](https://arxiv.org/html/2606.05371#A2.SS3)\. These choices are selected so that the Mamba\-based sequence model and the GRU\-based sequence model have comparable numbers of trainable parameters whenever possible\. ### B\.3Model and Training Hyperparameters #### B\.3\.1Burgers’ Equation For the Burgers experiments, the Mamba\-based sequence model takes the resolved Fourier modes as input\. SinceM=3M=3complex resolved positively indexed Fourier modes are retained, the input and output dimensions are both2M=62M=6, corresponding to the real and imaginary parts of the resolved modes\. The model hyperparameters used in the full training runs are summarized in[TableB1](https://arxiv.org/html/2606.05371#A2.T1)\. The main training hyperparameters are summarized in[TableB2](https://arxiv.org/html/2606.05371#A2.T2)\. The noise injection schedule and loss design are described separately in[SectionB\.5](https://arxiv.org/html/2606.05371#A2.SS5)and[SectionB\.6](https://arxiv.org/html/2606.05371#A2.SS6)\. Table B1:Model hyperparameters for the Burgers experiments\.Table B2:Training hyperparameters for the Burgers experiments\. #### B\.3\.2Lorenz ’96 System For the Lorenz ’96 experiments, the sequence model takes the resolved slow variables as input\. SinceN=8N=8slow variables are treated as resolved variables, the input and output dimensions are bothN=8N=8\. The model hyperparameters used in the full training runs are summarized in[TableB3](https://arxiv.org/html/2606.05371#A2.T3)\. The width of the GRU\-based sequence model is chosen to make its number of trainable parameters comparable to that of the Mamba\-based sequence model\. Table B3:Model hyperparameters for the Lorenz ’96 experiments\.HyperparameterMamba\-based modelGRU\-based modelInput dimensiondind\_\{\\mathrm\{in\}\}N=8N=8N=8N=8Output dimensionN=8N=8N=8N=8Number of residual blocks5555Hidden dimensiondmodeld\_\{\\mathrm\{model\}\}64648888Mamba expansion factor22–Mamba convolution width44–Mamba state dimension4848–GRU layers per residual block–11Linear bias in Mamba layerno–Convolution bias in Mamba layeryes–Normalization typeLayerNormLayerNormActivation functionSiLU\\operatorname\{SiLU\}SiLU\\operatorname\{SiLU\}Number of trainable parameters230,644230,644245,352245,352The main training hyperparameters are summarized in[TableB4](https://arxiv.org/html/2606.05371#A2.T4)\. The noise injection schedule and loss design are described separately in[SectionB\.5](https://arxiv.org/html/2606.05371#A2.SS5)and[SectionB\.6](https://arxiv.org/html/2606.05371#A2.SS6)\. Table B4:Training hyperparameters for the Lorenz ’96 experiments\. ### B\.4Data Normalization Different normalization strategies are used for the Burgers and Lorenz ’96 experiments, following the scale and structure of the corresponding data\. In all cases, the normalization statistics are computed only from the training data and then reused for validation, testing, and autoregressive rollout experiments\. For Burgers’ equation, we normalize the resolved Fourier modes\{ϕk,ψk\}k=1M\\\{\\phi\_\{k\},\\psi\_\{k\}\\\}\_\{k=1\}^\{M\}and the closure data\{mkϕ,mkψ\}k=1M\\\{m\_\{k\}^\{\\phi\},m\_\{k\}^\{\\psi\}\\\}\_\{k=1\}^\{M\}separately using component\-wise max\-absolute scaling\. More specifically, for each resolved modek=1,…,Mk=1,\\ldots,M, the scaling factors are computed over all samples and all time points in the training dataset as sϕ,k=maxtraining data\|ϕk\(t\)\|,sψ,k=maxtraining data\|ψk\(t\)\|,s\_\{\\phi,k\}=\\max\_\{\\text\{training data\}\}\\lvert\\phi\_\{k\}\(t\)\\rvert,\\qquad s\_\{\\psi,k\}=\\max\_\{\\text\{training data\}\}\\lvert\\psi\_\{k\}\(t\)\\rvert,and sϕ,k\(m\)=maxtraining data\|mkϕ\(t\)\|,sψ,k\(m\)=maxtraining data\|mkψ\(t\)\|\.s\_\{\\phi,k\}^\{\(m\)\}=\\max\_\{\\text\{training data\}\}\\lvert m\_\{k\}^\{\\phi\}\(t\)\\rvert,\\qquad s\_\{\\psi,k\}^\{\(m\)\}=\\max\_\{\\text\{training data\}\}\\lvert m\_\{k\}^\{\\psi\}\(t\)\\rvert\.The normalized quantities used by the sequence model at timettare then \{ϕk\(t\)sϕ,k,ψk\(t\)sψ,k\}k=1M,\\biggl\\\{\\frac\{\\phi\_\{k\}\(t\)\}\{s\_\{\\phi,k\}\},\\frac\{\\psi\_\{k\}\(t\)\}\{s\_\{\\psi,k\}\}\\biggr\\\}\_\{k=1\}^\{M\},and \{mkϕ\(t\)sϕ,k\(m\),mkψ\(t\)sψ,k\(m\)\}k=1M\.\\biggl\\\{\\frac\{m\_\{k\}^\{\\phi\}\(t\)\}\{s^\{\(m\)\}\_\{\\phi,k\}\},\\frac\{m\_\{k\}^\{\\psi\}\(t\)\}\{s^\{\(m\)\}\_\{\\psi,k\}\}\\biggr\\\}\_\{k=1\}^\{M\}\. For the Lorenz ’96 system, we normalize the resolved slow variables\{Xk\}k=1N\\\{X\_\{k\}\\\}\_\{k=1\}^\{N\}and the closure terms\{Uk\}k=1N\\\{U\_\{k\}\\\}\_\{k=1\}^\{N\}component\-wise\. For eachk=1,…,Nk=1,\\ldots,N, the meansμX,k\\mu\_\{X,k\}andμU,k\\mu\_\{U,k\}, as well as the standard deviationsσX,k\\sigma\_\{X,k\}andσU,k\\sigma\_\{U,k\}, are computed over all samples and all time points in the training dataset\. The normalized quantities used by the sequence model at timettare then \{Xk\(t\)−μX,kσX,k\}k=1N,\\biggl\\\{\\frac\{X\_\{k\}\(t\)\-\\mu\_\{X,k\}\}\{\\sigma\_\{X,k\}\}\\biggr\\\}\_\{k=1\}^\{N\},and \{Uk\(t\)−μU,kσU,k\}k=1N\.\\biggl\\\{\\frac\{U\_\{k\}\(t\)\-\\mu\_\{U,k\}\}\{\\sigma\_\{U,k\}\}\\biggr\\\}\_\{k=1\}^\{N\}\. ### B\.5Noise Injection During training, we inject small additive noise into the normalized input sequences to improve the robustness of the trained sequence model during autoregressive rollout\. The same noise\-injection schedule is used for both the Burgers and Lorenz ’96 experiments\. Letξ\\xidenote the normalized resolved input sequence used by the sequence model\. When noise injection is active, the model input is replaced by ξnoisy=ξ\+ε,ε∼𝒩\(0,σnoise2I\),\\xi\_\{\\mathrm\{noisy\}\}=\\xi\+\\varepsilon,\\qquad\\varepsilon\\sim\\mathcal\{N\}\(0,\\sigma\_\{\\mathrm\{noise\}\}^\{2\}I\),where the noise is sampled independently with the same shape asξ\\xi\. In the experiments, we use σnoise=5×10−4\.\\sigma\_\{\\mathrm\{noise\}\}=5\\times 10^\{\-4\}\. The noise level is scheduled over the training process in three phases\. LetSSbe the total number of training steps\. For the first10%10\\%of training steps, no noise is injected\. For the middle phase, from0\.1S0\.1Sto0\.8S0\.8S, the above Gaussian noise is added to the normalized input sequence\. For the final20%20\\%of training steps, the noise is turned off again\. Thus the schedule is σnoise\(s\)=\{0,s⩽0\.1S,5×10−4,0\.1S<s⩽0\.8S,0,0\.8S<s⩽S\.\\sigma\_\{\\mathrm\{noise\}\}\(s\)=\\begin\{cases\}0,&s\\leqslant 0\.1S,\\\\ 5\\times 10^\{\-4\},&0\.1S<s\\leqslant 0\.8S,\\\\ 0,&0\.8S<s\\leqslant S\.\\end\{cases\} This schedule is used to separate the early fitting stage, the robustness\-enhancing stage, and the final refinement stage\. The noise is applied only to the model input during training; the training targets are not modified\. ### B\.6Training Loss Choices When training the Mamba\-based sequence model within the MAC framework, we use different loss functions for Burgers’ equation and the Lorenz ’96 system\. For Burgers’ equation, the training loss is the mean squared error between the true closure terms and the closure terms predicted by the Mamba\-based sequence model\. For the Lorenz ’96 system, we instead use a multi\-step loss on the resolved variables\. We now detail the loss\-function design for these two benchmark problems\. #### B\.6\.1Burgers’ Equation When training the Mamba\-based sequence model for Burgers’ equation, the input sequence consists of the normalized resolved Fourier modes\. The sequence model outputs the \(normalized\) predicted closure terms at the corresponding time points\. These predicted closure terms are compared with the true \(normalized\) closure terms using a mean squared error loss, which is used as the training objective\. This direct closure loss is sufficient to obtain accurate predictions for both the resolved Fourier modes and the closure terms, as shown in[Sections3\.1](https://arxiv.org/html/2606.05371#S3.SS1)and[A\.1](https://arxiv.org/html/2606.05371#A1.SS1)\. In our exploration, we also tested an alternative loss design for Burgers’ equation\. In this alternative approach, the closure term predicted by the Mamba\-based sequence model is inserted into the resolved dynamics, and the reduced\-order system is advanced by one time step\. The predicted resolved Fourier modes at the next time step are then compared with the true resolved modes using a mean squared error loss\. This single\-step resolved\-mode loss is therefore applied to train the Mamba\-based sequence model through the MAC framework\. However, this alternative loss design gives worse prediction accuracy in our experiments\. The results for the predicted resolved modes and closure terms with this alternative loss design, on the same representative case shown in the main text, are presented in[FiguresB1](https://arxiv.org/html/2606.05371#A2.F1)and[B2](https://arxiv.org/html/2606.05371#A2.F2)\. We can see that, under this alternative loss design, the prediction accuracy on the resolved Fourier modes becomes much worse, and the prediction accuracy on the closure terms also deteriorates\. Figure B1:Comparison of resolved\-mode predictions from the MAC model and the Markovian reduced\-order model on the same representative test initial condition as in[Figure2](https://arxiv.org/html/2606.05371#S3.F2)over the temporal interpolation regime\[0,1\]\[0,1\]\. For each resolved Fourier mode, the relativeL2L^\{2\}error over the time interval\[0,1\]\[0,1\]is also reported\. Here the MAC model is trained using the alternative single\-step resolved\-mode loss\.Figure B2:Comparison between the closure\-term predictions from the MAC model and the true closure terms on the same representative test initial condition as in[FigureB1](https://arxiv.org/html/2606.05371#A2.F1)over the temporal interpolation regime\[0,1\]\[0,1\]\. For each closure term, the relativeL2L^\{2\}error over the time interval\[0,1\]\[0,1\]is also reported\. Here the MAC model is trained using the alternative single\-step resolved\-mode loss\.One possible reason is that the single\-step resolved\-mode loss provides only a weak and indirect training signal for the closure term\. For Burgers’ equation, the time step is very small,Δt=10−4\\Delta t=10^\{\-4\}, so the contribution of the closure term to a one\-step resolved\-mode update is also small\. As a result, different closure predictions may lead to similar one\-step resolved\-mode errors, and the single\-step resolved\-mode loss alone may not strongly constrain the closure term\. Another possible reason is the potential non\-uniqueness of the single\-step resolved\-mode loss objective for Burgers’ equation\. Since the closure term enters the resolved dynamics through nonlinear spectral interactions and only a subset of Fourier modes is retained, multiple closure configurations may produce similar resolved trajectories\. Consequently, the single\-step resolved\-mode loss may not uniquely identify the true closure term, and the model may instead converge to an inaccurate closure solution that fails to capture the true non\-Markovian contributions\. Both of these factors are consistent with our empirical observation that training with the single\-step resolved\-mode loss yields substantially degraded prediction accuracy, as shown in[FiguresB1](https://arxiv.org/html/2606.05371#A2.F1)and[B2](https://arxiv.org/html/2606.05371#A2.F2)\. In contrast, the direct closure loss provides a stronger supervised signal for learning the closure term than the alternative single\-step resolved\-mode loss\. #### B\.6\.2Lorenz ’96 System For the Lorenz ’96 system, we use a five\-step resolved\-variable rollout loss to train the Mamba\-based sequence model\. Unlike the Burgers case, the loss is not computed directly on the predicted closure terms\. Instead, the predicted closure terms are inserted into the resolved dynamics, and the resulting MAC framework is advanced autoregressively for multiple time steps\. The loss is then computed on the predicted resolved variables\. More specifically, each training input is a normalized resolved sequence of length6464, X0,X1,…,X63\.X\_\{0\},X\_\{1\},\\ldots,X\_\{63\}\.Since the rollout loss uses five steps, the first64−5=5964\-5=59states, X0,X1,…,X58,X\_\{0\},X\_\{1\},\\ldots,X\_\{58\},are used as starting states for the rollout\. Starting from these states, the MAC framework is advanced for five steps\. At each step, the Mamba\-based sequence model predicts the closure terms, these predicted closure terms are inserted into the resolved Lorenz ’96 dynamics, and the resolved dynamics are advanced by one time step\. The loss is computed only on the advanced resolved variables\. The one\-step predictions are compared with X1,X2,…,X59,X\_\{1\},X\_\{2\},\\ldots,X\_\{59\},the two\-step predictions are compared with X2,X3,…,X60,X\_\{2\},X\_\{3\},\\ldots,X\_\{60\},and this continues up to the five\-step predictions, which are compared with X5,X6,…,X63\.X\_\{5\},X\_\{6\},\\ldots,X\_\{63\}\.The resulting training objective is the mean squared error over these five sets of predicted resolved variables\. This loss design is motivated by the chaotic nature of the Lorenz ’96 system, in which small errors in the closure term can be rapidly amplified through the resolved dynamics, leading to trajectory divergence during autoregressive rollout\. A direct closure loss penalizes only the pointwise error in the predicted closure terms, without accounting for how these errors propagate and accumulate in the subsequent resolved\-variable evolution\. By contrast, the multi\-step resolved\-variable rollout loss trains the Mamba\-based sequence model through the MAC framework so that the predicted closure terms are optimized according to their effect on the resolved\-variable trajectory\. This makes the training objective more closely aligned with the autoregressive rollout task and helps suppress error accumulation in the chaotic regime\. To compare this choice with a direct closure loss, we also trained the Mamba\-based sequence model by minimizing the mean squared error between the predicted closure terms and the true closure terms\. The resulting temporal interpolation results are shown in[FiguresB3](https://arxiv.org/html/2606.05371#A2.F3)and[B4](https://arxiv.org/html/2606.05371#A2.F4)\. In this setting, training with the direct closure loss leads to substantially worse predictions for both the resolved variables and the closure terms than training with the multi\-step resolved\-variable rollout loss\. This comparison supports the use of the multi\-step resolved\-variable rollout loss for the Lorenz ’96 system\. Figure B3:Comparison of resolved slow\-variable predictions from the MAC model, the GRU\-based model, and the Wilks method over the temporal interpolation regime\[0,20\]\[0,20\]for the Lorenz ’96 system\. For each resolved slow variable, the relativeL2L^\{2\}error is also reported\. Here the MAC model and the GRU\-based model are trained using the alternative direct closure loss\.Figure B4:Comparison of closure\-term predictions from the MAC model, the GRU\-based model, and the Wilks method over the temporal interpolation regime\[0,20\]\[0,20\]for the Lorenz ’96 system\. For each closure term, the relativeL2L^\{2\}error is also reported\. Here the MAC model and the GRU\-based model are trained using the alternative direct closure loss\. ### B\.7Warm\-up in Inference During autoregressive inference with the MAC model, we use a short cache warm\-up procedure before starting the rollout\. This procedure is used to initialize the internal inference cache of the Mamba layer and to reduce the mismatch between sequence\-based training and step\-by\-step autoregressive inference\. In our implementation, the normalized initial resolved state is repeatedly passed through the Mamba\-based sequence model for padding steps\. These padding steps only advance the internal Mamba inference state; the predicted closure terms from this stage are not used to update the reduced\-order variables\. After this warm\-up, the autoregressive rollout starts from the same initial resolved state\. This cache warm\-up is used only for the Mamba\-based sequence model\. The GRU\-based sequence model starts the rollout directly from the initial resolved state\. ### B\.8RK4 Time Stepping with Zero\-Order Holding We briefly describe the time\-stepping procedure used in the autoregressive rollout experiments\. Consider an abstract reduced\-order system with a learned closure term, dydt=f\(y\)\+g^\(y\),\\frac\{\\mathrm\{d\}y\}\{\\mathrm\{d\}t\}=f\(y\)\+\\widehat\{g\}\(y\),whereffdenotes the known reduced\-order dynamics andg^\\widehat\{g\}denotes the closure predicted by the sequence model\. A standard fourth\-order Runge–Kutta discretization of this full learned right\-hand side would evaluate bothffandg^\\widehat\{g\}at each intermediate stage: k1\\displaystyle k\_\{1\}=f\(yn\)\+g^\(yn\),\\displaystyle=f\(y\_\{n\}\)\+\\widehat\{g\}\(y\_\{n\}\),k2\\displaystyle k\_\{2\}=f\(yn\+Δt2k1\)\+g^\(yn\+Δt2k1\),\\displaystyle=f\\bigl\(y\_\{n\}\+\\tfrac\{\\Delta t\}\{2\}k\_\{1\}\\bigr\)\+\\widehat\{g\}\\bigl\(y\_\{n\}\+\\tfrac\{\\Delta t\}\{2\}k\_\{1\}\\bigr\),k3\\displaystyle k\_\{3\}=f\(yn\+Δt2k2\)\+g^\(yn\+Δt2k2\),\\displaystyle=f\\bigl\(y\_\{n\}\+\\tfrac\{\\Delta t\}\{2\}k\_\{2\}\\bigr\)\+\\widehat\{g\}\\bigl\(y\_\{n\}\+\\tfrac\{\\Delta t\}\{2\}k\_\{2\}\\bigr\),k4\\displaystyle k\_\{4\}=f\(yn\+Δtk3\)\+g^\(yn\+Δtk3\),\\displaystyle=f\(y\_\{n\}\+\\Delta tk\_\{3\}\)\+\\widehat\{g\}\(y\_\{n\}\+\\Delta tk\_\{3\}\),followed by yn\+1=yn\+Δt6\(k1\+2k2\+2k3\+k4\)\.y\_\{n\+1\}=y\_\{n\}\+\\frac\{\\Delta t\}\{6\}\\bigl\(k\_\{1\}\+2k\_\{2\}\+2k\_\{3\}\+k\_\{4\}\\bigr\)\.This would require evaluating the learned closure at the RK4 half\-step stage states, and also at the final stage state\. In our implementation, we instead use a zero\-order holding treatment for the learned closure\. At the beginning of each time step, the sequence model predicts the closure once, g^n=g^\(yn\),\\widehat\{g\}\_\{n\}=\\widehat\{g\}\(y\_\{n\}\),and this predicted closure is held fixed throughout the RK4 stages: k1\\displaystyle k\_\{1\}=f\(yn\)\+g^n,\\displaystyle=f\(y\_\{n\}\)\+\\widehat\{g\}\_\{n\},k2\\displaystyle k\_\{2\}=f\(yn\+Δt2k1\)\+g^n,\\displaystyle=f\\bigl\(y\_\{n\}\+\\tfrac\{\\Delta t\}\{2\}k\_\{1\}\\bigr\)\+\\widehat\{g\}\_\{n\},k3\\displaystyle k\_\{3\}=f\(yn\+Δt2k2\)\+g^n,\\displaystyle=f\\bigl\(y\_\{n\}\+\\tfrac\{\\Delta t\}\{2\}k\_\{2\}\\bigr\)\+\\widehat\{g\}\_\{n\},k4\\displaystyle k\_\{4\}=f\(yn\+Δtk3\)\+g^n\.\\displaystyle=f\(y\_\{n\}\+\\Delta tk\_\{3\}\)\+\\widehat\{g\}\_\{n\}\.The update toyn\+1y\_\{n\+1\}is then computed using the same RK4 weighted average formula\. This zero\-order holding approximation is used for three practical reasons\. First, the sequence model is trained and used primarily on resolved input sequences sampled at discrete time points, so evaluating it at intermediate RK4 stage states would introduce an additional design choice\. Second, in both benchmark problems, the rollout time step is small:Δt=10−4\\Delta t=10^\{\-4\}for Burgers’ equation andΔt=0\.01\\Delta t=0\.01for the Lorenz ’96 system\. Together with the smooth temporal behavior of the closure terms along the resolved trajectories, as illustrated by the closure\-variable plots in[AppendixA](https://arxiv.org/html/2606.05371#A1), this makes the zero\-order holding treatment a reasonable practical approximation over a single time step\. Third, evaluating the sequence model once per time step reduces the computational cost in both training and inference\. A full stage\-wise RK4 treatment would require additional sequence\-model evaluations at the intermediate RK4 stages\. With this treatment, the known reduced\-order partffis still evaluated at the RK4 stage states, while the learned closure is used as a piecewise constant term over each time step\. In our experiments, this simplified RK4 procedure is used for all predictive results reported in[Section3](https://arxiv.org/html/2606.05371#S3)and leads to stable autoregressive rollouts\. ## Appendix CSymbols & Notation [TableC1](https://arxiv.org/html/2606.05371#A3.T1)summarizes the main symbols and notation used in this work\. Table C1:Summary of the main symbols and notation used in this work\. ## References ## References - \[undef\]Shady E\. Ahmed et al\.“On Closures for Reduced Order Models—a Spectrum of First\-Principle to Machine\-Learned Avenues”In*Physics of Fluids*33\.9, 2021DOI:[10\.1063/5\.0061577](https://dx.doi.org/10.1063/5.0061577) - \[undefa\]Giancarlo Benettin, Luigi Galgani, Antonio Giorgilli and Jean\-Marie Strelcyn“Lyapunov Characteristic Exponents for Smooth Dynamical Systems and for Hamiltonian Systems; a Method for Computing All of Them\. Part 1: Theory”In*Meccanica*15\.1, 1980, pp\. 9–20DOI:[10\.1007/BF02128236](https://dx.doi.org/10.1007/BF02128236) - \[undefb\]Yoshua Bengio, Patrice Simard and Paolo Frasconi“Learning Long\-Term Dependencies with Gradient Descent Is Difficult”In*IEEE Transactions on Neural Networks*5\.2, 1994, pp\. 157–166DOI:[10\.1109/72\.279181](https://dx.doi.org/10.1109/72.279181) - \[undefc\]Mohamed Aziz Bhouri and Pierre Gentine“Memory\-Based Parameterization with Differentiable Solver: Application to LOrenz ’96”In*Chaos: An Interdisciplinary Journal of Nonlinear Science*33\.7, 2023DOI:[10\.1063/5\.0131929](https://dx.doi.org/10.1063/5.0131929) - \[undefd\]Kyunghyun Cho et al\.“Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation”In*Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*Association for Computational Linguistics, 2014, pp\. 1724–1734DOI:[10\.3115/v1/D14\-1179](https://dx.doi.org/10.3115/v1/D14-1179) - \[undefe\]Alexandre J\. Chorin, Ole H\. Hald and Raz Kupferman“Optimal Prediction and the MOri–ZWanzig Representation of Irreversible Processes”In*Proceedings of the National Academy of Sciences*97\.7, 2000, pp\. 2968–2973DOI:[10\.1073/pnas\.97\.7\.2968](https://dx.doi.org/10.1073/pnas.97.7.2968) - \[undeff\]Hannah Christensen and Laure Zanna“Parametrization in Weather and Climate Models”In*Oxford Research Encyclopedia of Climate Science*Oxford University Press, 2022DOI:[10\.1093/acrefore/9780190228620\.013\.826](https://dx.doi.org/10.1093/acrefore/9780190228620.013.826) - \[undefg\]Jonathan Demaeyer and Stéphane Vannitsem“Stochastic Parameterization of Subgrid\-Scale Processes: A Review of Recent Physically Based Approaches”In*Advances in Nonlinear Geosciences*Cham: Springer International Publishing, 2017, pp\. 55–85DOI:[10\.1007/978\-3\-319\-58895\-7˙3](https://dx.doi.org/10.1007/978-3-319-58895-7_3) - \[undefh\]Paul A\. Durbin“Some Recent Developments in Turbulence Closure Modeling”In*Annual Review of Fluid Mechanics*50\.1, 2018, pp\. 77–103DOI:[10\.1146/annurev\-fluid\-122316\-045020](https://dx.doi.org/10.1146/annurev-fluid-122316-045020) - \[undefi\]Ibrahim Fatkullin and Eric Vanden\-Eijnden“A Computational Strategy for Multiscale Systems with Applications to LOrenz 96 Model”In*Journal of Computational Physics*200\.2, 2004, pp\. 605–638DOI:[10\.1016/j\.jcp\.2004\.04\.013](https://dx.doi.org/10.1016/j.jcp.2004.04.013) - \[undefj\]Hugo Frezat, Ronan Fablet, Guillaume Balarac and Julien Le Sommer“Gradient\-Free Online Learning of Subgrid\-Scale Dynamics with Neural Emulators”, 2023DOI:[10\.48550/arXiv\.2310\.19385](https://dx.doi.org/10.48550/arXiv.2310.19385) - \[undefk\]Albert Gu and Tri Dao“Mamba: Linear\-Time Sequence Modeling with Selective State Spaces”, 2023DOI:[10\.48550/arXiv\.2312\.00752](https://dx.doi.org/10.48550/arXiv.2312.00752) - \[undefl\]Albert Gu, Karan Goel and Christopher Ré“Efficiently Modeling Long Sequences with Structured State Spaces”, 2021DOI:[10\.48550/arXiv\.2111\.00396](https://dx.doi.org/10.48550/arXiv.2111.00396) - \[undefm\]Albert Gu et al\.“Combining Recurrent, Convolutional, and Continuous\-Time Models with Linear State\-Space Layers”In*Proceedings of the 35th International Conference on Neural Information Processing Systems*Curran Associates Inc\., 2021, pp\. 572–585URL:[https://dl\.acm\.org/doi/10\.5555/3540261\.3540305](https://dl.acm.org/doi/10.5555/3540261.3540305) - \[undefn\]Abhinav Gupta and Pierre F\.\. Lermusiaux“Neural Closure Models for Dynamical Systems”In*Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences*477\.2252, 2021DOI:[10\.1098/rspa\.2020\.1004](https://dx.doi.org/10.1098/rspa.2020.1004) - \[undefo\]Zheyuan Hu, Qianying Cao, Kenji Kawaguchi and George Em Karniadakis“Deepomamba: State\-Space Model for Spatio\-Temporal PDE Neural Operator Learning”In*Journal of Computational Physics*540, 2025DOI:[https://doi\.org/10\.1016/j\.jcp\.2025\.114272](https://dx.doi.org/https://doi.org/10.1016/j.jcp.2025.114272) - \[undefp\]Zheyuan Hu et al\.“State\-Space Models Are Accurate and Efficient Neural Operators for Dynamical Systems”In*Neural Networks*197, 2026DOI:[https://doi\.org/10\.1016/j\.neunet\.2025\.108496](https://dx.doi.org/https://doi.org/10.1016/j.neunet.2025.108496) - \[undefq\]Dmitrii Kochkov et al\.“Machine Learning–Accelerated Computational Fluid Dynamics”In*Proceedings of the National Academy of Sciences*118\.21, 2021DOI:[10\.1073/pnas\.2101784118](https://dx.doi.org/10.1073/pnas.2101784118) - \[undefr\]Edward N Lorenz“Predictability: A Problem Partly Solved”In*Proc\. Seminar on predictability*Reading, 1996, pp\. 1–18URL:[https://www\.ecmwf\.int/en/elibrary/75462\-predictability\-problem\-partly\-solved](https://www.ecmwf.int/en/elibrary/75462-predictability-problem-partly-solved) - \[undefs\]Chao Ma, Jianchun Wang null and Weinan E“Model Reduction with Memory and the Machine Learning of Dynamical Systems”In*Communications in Computational Physics*25\.4, 2019, pp\. 947–962DOI:[10\.4208/cicp\.OA\-2018\-0269](https://dx.doi.org/10.4208/cicp.OA-2018-0269) - \[undeft\]Romit Maulik et al\.“Time\-Series Learning of Latent\-Space Dynamics for Reduced\-Order Model Closure”In*Physica D: Nonlinear Phenomena*405, 2020DOI:[https://doi\.org/10\.1016/j\.physd\.2020\.132368](https://dx.doi.org/https://doi.org/10.1016/j.physd.2020.132368) - \[undefu\]Hazime Mori“Transport, Collective Motion, and BRownian Motion”In*Progress of Theoretical Physics*33\.3, 1965, pp\. 423–455DOI:[10\.1143/ptp\.33\.423](https://dx.doi.org/10.1143/ptp.33.423) - \[undefv\]Eric J\. Parish and Karthik Duraisamy“Non\-Markovian Closure Models for Large Eddy Simulations Using the MOri\-ZWanzig Formalism”In*Physical Review Fluids*2\.1, 2017DOI:[10\.1103/PhysRevFluids\.2\.014604](https://dx.doi.org/10.1103/PhysRevFluids.2.014604) - \[undefw\]Razvan Pascanu, Tomas Mikolov and Yoshua Bengio“On the Difficulty of Training Recurrent Neural Networks”In*Proceedings of the 30th International Conference on Machine Learning*JMLR\.org, 2013, pp\. III\-1310–III\-1318URL:[https://dl\.acm\.org/doi/10\.5555/3042817\.3043083](https://dl.acm.org/doi/10.5555/3042817.3043083) - \[undefx\]Saad Qadeer, Panos Stinis and Hui Wan“Stabilizing PDE–ML Coupled Systems”, 2025DOI:[10\.48550/arXiv\.2506\.19274](https://dx.doi.org/10.48550/arXiv.2506.19274) - \[undefy\]Benjamin Sanderse, Panos Stinis, Romit Maulik and Shady E\. Ahmed“Scientific Machine Learning for Closure Models in Multiscale Problems: A Review”In*Foundations of Data Science*7\.1, 2025, pp\. 298–337DOI:[10\.3934/fods\.2024043](https://dx.doi.org/10.3934/fods.2024043) - \[undefz\]Marissa G\. Saunders and Gregory A\. Voth“Coarse\-Graining Methods for Computational Biology”In*Annual Review Biophysics*42, 2013, pp\. 73–93DOI:[10\.1146/annurev\-biophys\-083012\-130348](https://dx.doi.org/10.1146/annurev-biophys-083012-130348) - \[undefaa\]Kiwon Um et al\.“Solver\-in\-the\-Loop: Learning from Differentiable Physics to Interact with Iterative PDE\-Solvers”In*Proceedings of the 34th International Conference on Neural Information Processing Systems*Curran Associates Inc\., 2020, pp\. 6111–6122URL:[https://dl\.acm\.org/doi/10\.5555/3495724\.3496237](https://dl.acm.org/doi/10.5555/3495724.3496237) - \[undefab\]Erik van der Giessen et al\.“Roadmap on Multiscale Materials Modeling”In*Modelling and Simulation in Materials Science and Engineering*28\.4, 2020DOI:[10\.1088/1361\-651X/ab7150](https://dx.doi.org/10.1088/1361-651X/ab7150) - \[undefac\]Ashish Vaswani et al\.“Attention Is All You Need”In*Proceedings of the 31st International Conference on Neural Information Processing Systems*, NIPS’17Curran Associates, Inc\., 2017, pp\. 6000–6010URL:[https://dl\.acm\.org/doi/10\.5555/3295222\.3295349](https://dl.acm.org/doi/10.5555/3295222.3295349) - \[undefad\]Qian Wang, Nicolò Ripamonti and Jan S\. Hesthaven“Recurrent Neural Network Closure of Parametric Pod\-GAlerkin Reduced\-Order Models Based on the MOri\-ZWanzig Formalism”In*Journal of Computational Physics*410, 2020DOI:[https://doi\.org/10\.1016/j\.jcp\.2020\.109402](https://dx.doi.org/https://doi.org/10.1016/j.jcp.2020.109402) - \[undefae\]Daniel S\. Wilks“Effects of Stochastic Parametrizations in the LOrenz ’96 System”In*Quarterly Journal of the Royal Meteorological Society*131\.606, 2005, pp\. 389–407DOI:[10\.1256/qj\.04\.03](https://dx.doi.org/10.1256/qj.04.03) - \[undefaf\]Paul D Williams“Modelling Climate Change: The Role of Unresolved Processes”In*Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences*363\.1837, 2005, pp\. 2931–2946DOI:[10\.1098/rsta\.2005\.1676](https://dx.doi.org/10.1098/rsta.2005.1676) - \[undefag\]Tingkai Xue et al\.“Differentiable Physics\-Neural Models Enable Learning of Non\-MArkovian Closures for Accelerated Coarse\-Grained Physics Simulations”, 2025DOI:[10\.48550/arXiv\.2511\.21369](https://dx.doi.org/10.48550/arXiv.2511.21369) - \[undefah\]Yuanran Zhu, Jason M\. Dominy and Daniele Venturi“On the Estimation of the MOri\-ZWanzig Memory Integral”In*Journal of Mathematical Physics*59\.10, 2018DOI:[10\.1063/1\.5003467](https://dx.doi.org/10.1063/1.5003467) - \[undefai\]Robert Zwanzig“Nonlinear Generalized LAngevin Equations”In*Journal of Statistical Physics*9\.3, 1973, pp\. 215–220DOI:[10\.1007/bf01008729](https://dx.doi.org/10.1007/bf01008729)
Similar Articles
Graph Mamba Survival Analysis Based on Topology-Aware ordering
This paper proposes TopoMamSurv, a Graph Mamba framework for whole-slide image survival analysis that uses topology-aware ordering to address Mamba's sensitivity to input order, and incorporates bidirectional Mamba and GCN for spatial context modeling.
Learning Manifold and It\^o Dynamics with Branched Neural Rough Differential Equations
This paper introduces Branched Neural Rough Differential Equations, a method for learning manifold and Itô dynamics by combining rough path theory with neural networks, enabling the modeling of complex stochastic and geometric structures.
ReTAMamba: Reliability-Aware Temporal Aggregation with Mamba for Irregular Clinical Time Series Prediction
Proposes ReTAMamba, a method using reliability-aware temporal aggregation with Mamba for irregular clinical time series prediction, achieving significant AUPRC gains on MIMIC-IV, eICU, and PhysioNet 2012.
Capturing non-Markovian dynamics in non-equilibrium stochastic systems using flow matching
This paper develops a generative flow matching method to capture non-Markovian dynamics in non-equilibrium stochastic systems, demonstrating improved predictions for the Kramers first passage time problem compared to Markovian baselines.
Towards Scalable One-Step Generative Modeling for Autoregressive Dynamical System Forecasting
This paper introduces MeLISA, a latent-free autoregressive generative surrogate for forecasting high-dimensional physical dynamics that uses pixel-space MeanFlow to achieve efficient one-step generation. It demonstrates superior long-horizon statistical accuracy and inference speed compared to neural operators on turbulent flow benchmarks.